24% increase with 8 threads.
11% decrease with 4 threads.
33% increase when looking at the home-field advantage case (8 threads on 8 logical cores, 4 on 4). Matching the thread count to the core count is what programmers developing multithreaded software typically make the default behavior.
I think the point that is easy to miss is that in the scenario where performance suffered, Windows can be safely blamed for allocating the threads improperly (e.g. it had two threads running on one physical core). Regardless of the underlying reason it was slower, that processor was also ~50% idle, whereas the other two instances were 0% idle (presumably). The fact that this one task took a small hit hides the fact that the processor was still ready and able to do other, unrelated work at the same time.
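For what it's worth, here's a minimal sketch of what I mean by keeping threads off HT siblings, assuming logical processors 0/1, 2/3, etc. are sibling pairs on the same physical core (that pairing is an assumption; the real layout would have to be queried with GetLogicalProcessorInformation). The workload loop is just a placeholder:

```c
/* Sketch: pin each worker to an even-numbered logical processor so no two
 * workers share a physical core. Assumes HT siblings are adjacent pairs
 * (0/1, 2/3, ...), which is common but should be verified on a real system. */
#include <windows.h>
#include <stdio.h>

#define WORKERS 4

static DWORD WINAPI worker(LPVOID arg)
{
    int id = (int)(INT_PTR)arg;
    /* Pin this thread to logical processor 2*id (one per physical core). */
    DWORD_PTR mask = (DWORD_PTR)1 << (2 * id);
    if (SetThreadAffinityMask(GetCurrentThread(), mask) == 0)
        fprintf(stderr, "affinity failed for worker %d\n", id);

    volatile unsigned long long x = 0;  /* placeholder workload */
    for (unsigned long long i = 0; i < 100000000ULL; i++)
        x += i;
    return 0;
}

int main(void)
{
    HANDLE threads[WORKERS];
    for (int i = 0; i < WORKERS; i++)
        threads[i] = CreateThread(NULL, 0, worker, (LPVOID)(INT_PTR)i, 0, NULL);
    WaitForMultipleObjects(WORKERS, threads, TRUE, INFINITE);
    for (int i = 0; i < WORKERS; i++)
        CloseHandle(threads[i]);
    return 0;
}
```

If the scheduler really was doubling threads up on one core, forcing the layout like this should make the 4-thread case on 4c/8t look like the 4c/4t case.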
Hyper-threading is typically responsible for anywhere from about a -5% to +50% performance change depending on the workload.
The fact that the processor handled 200% thread load slightly better than 100% speaks volumes about both Windows and the processor's scheduling.
I would agree with you if the improvement from 100% to 200% weren't something along the lines of a 2% gain going from 8 threads to 16 threads (the 4c/8t case) and ~6% going from 4 threads to 8 threads (the 2c/4t case). 2% is within the margin of error, so I would consider that unchanged from the previous value. The "percent increase" values are the improvement in performance over the previous thread count, so the improvement for 8 threads is measured against 4 threads. The part that makes me think it's the scheduler is not the improvement past the max number of threads, but rather the difference in improvement for 4c/8t @ 8 threads versus 2c/4t @ 4 threads, where the instance with no HT excelled. The other big thing that points at the scheduler is that 4c/4t running 4 threads was faster than the 4c/8t configuration running 4 threads.
Also, the table here is confusing; I should redo the percent increases so they show performance per thread relative to 1 thread, rather than the improvement from 4 threads to 8 threads, since per-thread scaling is really what I want here.
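Something like this is the difference between the two metrics I'm talking about; the scores in the array are made-up placeholders, not the actual benchmark numbers:

```c
/* Sketch: the two ways of expressing "percent increase" being discussed.
 * The scores[] values are hypothetical, not the real benchmark data. */
#include <stdio.h>

int main(void)
{
    int threads[]   = {1, 2, 4, 8, 16};
    double scores[] = {100.0, 195.0, 370.0, 460.0, 470.0};  /* hypothetical */
    int n = sizeof(threads) / sizeof(threads[0]);

    for (int i = 1; i < n; i++) {
        /* Improvement over the previous thread count (what the table shows now). */
        double over_prev  = (scores[i] / scores[i - 1] - 1.0) * 100.0;
        /* Scaling per thread relative to the 1-thread baseline (what I want). */
        double per_thread = (scores[i] / threads[i]) / scores[0] * 100.0;
        printf("%2d threads: +%5.1f%% over previous, %5.1f%% of 1-thread perf per thread\n",
               threads[i], over_prev, per_thread);
    }
    return 0;
}
```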
Hyperthreading isn't for parallel workloads in which most of the threads are processing similar instructions and competing for the same execution units in each core.
Actually it is, because no workload purely uses one part of the CPU. All HT does is take advantage of the fact that superscalar architectures gave Intel the opportunity to cram more work into the pipeline while one thread waits on something. It sees an opening, so it puts something in; instructions always enter the pipeline at the front, so it's not really grabbing an unused part so much as waiting for unused parts to open up.

Also, saying that HT isn't for parallel workloads is really funny, since the only purpose of more than one core or thread is parallel, thread-like workloads, not GPGPU-like tasks where all you're really doing is transforming matrices. Anything happening at the same time is "parallel processing", whether the tasks are the same or different. Stuff running in parallel doesn't imply the threads run the same instructions, or that parts of the CPU can't be shared. A lot of the reason for more threads is to run a task alone and asynchronously, which is exactly what benefits most from HT. Such threads might share less of the CPU if they use similar kinds of instructions, but certainly not none of it.

HT suffers when there's a lot of context switching, or when a pipeline stall occurs, because the entire pipeline has to be wiped out to recover if the stall was due to a branch misprediction. HT suffers even more when there's a ton of locking going on, which can slow everything else down depending on how much locking there is.
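As a rough illustration of the locking point, here's a hedged sketch (pthreads, made-up workload and iteration counts) of how one contended lock serializes threads so that extra hardware threads, HT or otherwise, stop helping:

```c
/* Sketch: one contended mutex vs. per-thread counters. With the shared lock,
 * the threads mostly take turns, so extra hardware threads buy almost
 * nothing; with independent counters they scale. Numbers are illustrative. */
#include <pthread.h>
#include <stdio.h>
#include <time.h>

#define THREADS 8
#define ITERS   5000000L

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static long shared_counter = 0;

static double now_sec(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec / 1e9;
}

static void *locked_worker(void *arg)
{
    (void)arg;
    for (long i = 0; i < ITERS; i++) {
        pthread_mutex_lock(&lock);      /* every increment fights for the lock */
        shared_counter++;
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

static void *independent_worker(void *arg)
{
    long *my_counter = arg;             /* no sharing, no contention */
    for (long i = 0; i < ITERS; i++)
        (*my_counter)++;
    return NULL;
}

int main(void)
{
    pthread_t t[THREADS];
    long local[THREADS] = {0};

    double start = now_sec();
    for (int i = 0; i < THREADS; i++)
        pthread_create(&t[i], NULL, locked_worker, NULL);
    for (int i = 0; i < THREADS; i++)
        pthread_join(t[i], NULL);
    printf("contended lock:   %.2fs (counter %ld)\n", now_sec() - start, shared_counter);

    start = now_sec();
    for (int i = 0; i < THREADS; i++)
        pthread_create(&t[i], NULL, independent_worker, &local[i]);
    for (int i = 0; i < THREADS; i++)
        pthread_join(t[i], NULL);
    printf("no contention:    %.2fs (counter %ld)\n", now_sec() - start, local[0]);
    return 0;
}
```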
Also, keep in mind what you just said:
The purpose of hyperthreading is to minimize idle execution units when processing different threads. Like, for example, playing Call of Duty while recording it with Fraps and also listening to music in the background. Or watching some Netflix while compressing a backup with 7-Zip running in the background.
A lot of applications (games in particular) have a really hard time making code run in tandem (in the sense of two threads working on parts of the same task), versus, say, a game logic thread, a network communications thread, threads that do network access asynchronously alongside the network comm thread, and a thread that does GPU dispatch for rendering (see the sketch below). So one application can have a lot of different threads, and saying that HT isn't designed for parallel workloads is funny because you're lumping "parallel workloads" in with the kind of tasks you would give a GPGPU device, which CPUs suck at in the first place.
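To make that task-parallel split concrete, here's a minimal sketch (pthreads, with every loop body stubbed out as a placeholder) of the kind of thread layout I mean, where each thread does a different job rather than splitting one computation:

```c
/* Sketch: task parallelism where each thread has a different role, as opposed
 * to data parallelism where every thread runs the same computation.
 * The "work" in each loop is a stubbed-out placeholder. */
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>
#include <unistd.h>

static atomic_bool running = true;

static void *game_logic(void *arg)
{
    (void)arg;
    while (atomic_load(&running)) {
        /* advance the simulation one tick (placeholder) */
        usleep(16000);                  /* ~60 ticks per second */
    }
    return NULL;
}

static void *network_io(void *arg)
{
    (void)arg;
    while (atomic_load(&running)) {
        /* poll sockets, queue incoming messages (placeholder) */
        usleep(5000);
    }
    return NULL;
}

static void *render_dispatch(void *arg)
{
    (void)arg;
    while (atomic_load(&running)) {
        /* build and submit GPU command buffers (placeholder) */
        usleep(16000);
    }
    return NULL;
}

int main(void)
{
    pthread_t logic, net, render;
    pthread_create(&logic, NULL, game_logic, NULL);
    pthread_create(&net, NULL, network_io, NULL);
    pthread_create(&render, NULL, render_dispatch, NULL);

    sleep(2);                           /* pretend the game ran for a while */
    atomic_store(&running, false);

    pthread_join(logic, NULL);
    pthread_join(net, NULL);
    pthread_join(render, NULL);
    puts("all task threads shut down");
    return 0;
}
```

Threads like these spend most of their time doing unrelated work on different data, which is exactly the situation HT is meant to help with.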