Sunday, July 12th 2020

Linus Torvalds Finds AVX-512 an Intel Gimmick to Invent and Win at Benchmarks

"I hope AVX512 dies a painful death, and that Intel starts fixing real problems instead of trying to create magic instructions to then create benchmarks that they can look good on." These were the words of Linux and Git creator Linus Torvalds on a mailing list, responding to reports that "Alder Lake" lacks AVX-512. Torvalds also cautioned against placing too much weight on floating-point performance benchmarks, particularly those that take advantage of exotic new instruction sets whose implementation is fragmented and varies across product lines.

"I've said this before, and I'll say it again: in the heyday of x86, when Intel was laughing all the way to the bank and killing all their competition, absolutely everybody else did better than Intel on FP loads. Intel's FP performance sucked (relatively speaking), and it mattered not one iota. Because absolutely nobody cares outside of benchmarks." Torvalds believes AVX2 is "more than enough" thanks to its proliferation, but advocates that processor manufacturers design better FPUs for their core designs so they don't have to rely on instruction set-level optimization to eke out performance.
"Yes, yes, I'm biased. I absolutely detest FP benchmarks, and I realize other people care deeply. I just think AVX512 is exactly the wrong thing to do. It's a pet peeve of mine. It's a prime example of something Intel has done wrong, partly by just increasing the fragmentation of the market. Stop with the special-case garbage, and make all the core common stuff that everybody cares about run as well as you humanly can. Then do a FPU that is barely good enough on the side, and people will be happy. AVX2 is much more than enough," he added. Torvalds recently upgraded to an AMD Ryzen Threadripper for his main work machine.
Source: Phoronix

42 Comments on Linus Torvalds Finds AVX-512 an Intel Gimmick to Invent and Win at Benchmarks

#26
efikkan
dragontamer5788 said: That's actually what makes me most excited about AVX512. All of these new AVX512 features allow auto-vectorization to happen far more easily. The details are complicated, but... let's just say that NVidia CUDA and AMD OpenCL have been doing this stuff for over a decade on GPUs. Intel is finally giving CPU compilers the ability to do what GPU compilers have been doing all along. It requires some additional support from the CPU instruction set to ease auto-vectorization and provide more SIMD-based branching controls. But once provided, the theory is already well studied from 1980s SIMD computers and is well known.
Yes, and the interesting thing is that this would solve most of the scaling problems with code, which as you probably know are branching and cache misses. Most branching inside algorithms doesn't actually affect the bigger control flow of the code, but put just 3-4 of these together and you are pretty much guaranteed one or more stalls. I often call these "false branching", and sometimes do clever things to try to eliminate them, like bitwise operations, conditional moves etc. But AVX can resolve a lot of this; it really comes down to being able to write clean, readable code which translates into optimal AVX instructions. I still find it a daunting task to write anything but smaller pieces using intrinsics though.
dragontamer5788 said: Honestly, Linus Torvalds is very clearly out of his depth in this subject matter. I'm no expert, but I can confidently say that I know more than Linus on this subject based on what he's saying here.
I have tremendous respect for Mr Torvalds and am a big fan of his two software creations, and I know he is a very smart man. But this doesn't make every outburst from him gold, and most of what he said here is not accurate.

The only part I could agree about is some of the more application specific instructions (like "AI" stuff). I believe a standard ISA should be generic compute and logic, not application specific. So in my opinion, throw out all the AES, zip, jpeg(!) etc. acceleration instructions, and give us four 512-bit FMA-sets instead.
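As a rough illustration of the "false branching" point above, here is a minimal C++ sketch of replacing a data-dependent "if" with a bitwise mask. The function names (clamp_to_zero_branchy, clamp_to_zero_branchless, select_u32) are made up for this example, and whether a given compiler emits branch-free machine code for either form depends on the compiler, flags and target.

```cpp
#include <cstdint>

// Branchy form: the result depends on a data-dependent comparison.
inline std::int32_t clamp_to_zero_branchy(std::int32_t x) {
    if (x < 0) {
        return 0;
    }
    return x;
}

// Branchless form: turn the comparison into an all-ones or all-zeros mask
// and apply it with a bitwise AND.
inline std::int32_t clamp_to_zero_branchless(std::int32_t x) {
    // (x >= 0) is 1 or 0; subtracting it from 0 as unsigned gives
    // 0xFFFFFFFF (keep x) or 0x00000000 (force zero).
    const std::uint32_t mask = 0u - static_cast<std::uint32_t>(x >= 0);
    return static_cast<std::int32_t>(static_cast<std::uint32_t>(x) & mask);
}

// Generic two-way select built the same way: returns a if cond, else b.
inline std::uint32_t select_u32(bool cond, std::uint32_t a, std::uint32_t b) {
    const std::uint32_t mask = 0u - static_cast<std::uint32_t>(cond);
    return (a & mask) | (b & ~mask);
}
```

The same mask-and-blend pattern is what per-lane SIMD selection does, just applied to every lane of a register at once.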
#27
dragontamer5788
efikkan said: I often call these "false branching", and sometimes do clever things to try to eliminate them, like bitwise operations, conditional moves etc.
My favorite is "max", "min", and similar operations.

Consider your typical "comparison" for a sorting problem. You'd think you need an "if" statement, but in reality... you can make do with:

higher = max(a, b);
lower = min(a, b);

The max/min version of the code is branchless at the lowest level, thanks to instructions like vpmaxud. And all of a sudden, your for-loop starts to look far more auto-vectorizable and branchless.
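Spelled out as a sketch in C++, assuming 32-bit unsigned keys: compare_exchange and sort4 are hypothetical names for this example, and whether the compiler really emits min/max instructions rather than a branch depends on the compiler, flags and target.

```cpp
#include <algorithm>
#include <cstdint>

// Compare-exchange without an explicit "if": after the call, a <= b.
inline void compare_exchange(std::uint32_t& a, std::uint32_t& b) {
    const std::uint32_t lower  = std::min(a, b);
    const std::uint32_t higher = std::max(a, b);
    a = lower;
    b = higher;
}

// A fixed sequence of five compare-exchanges sorts any four values.
// The sequence never depends on the data, so there is no data-dependent
// control flow to mispredict.
inline void sort4(std::uint32_t v[4]) {
    compare_exchange(v[0], v[1]);
    compare_exchange(v[2], v[3]);
    compare_exchange(v[0], v[2]);
    compare_exchange(v[1], v[3]);
    compare_exchange(v[1], v[2]);
}
```

Because every step is the same min/max pair, several independent groups of four can be processed side by side, one group per SIMD lane.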
#28
Kanan
Tech Enthusiast & Gamer
dragontamer5788 said: Honestly, Linus Torvalds is very clearly out of his depth in this subject matter. I'm no expert, but I can confidently say that I know more than Linus on this subject based on what he's saying here.
I'm pretty sure it was one of his usual rants; he does that sometimes. I too agree that AVX512 is definitely far from useless, BUT both its availability and the feature set itself are far too fragmented. Linus's point still holds: Intel made a mess of it.
#29
efikkan
dragontamer5788 said: My favorite is "max", "min", and similar operations.

Consider your typical "comparison" for a sorting problem. You'd think you need an "if" statement, but in reality... you can make do with:

higher = max(a, b);
lower = min(a, b);

The max/min version of the code is branchless at the lowest level, thanks to instructions like vpmaxud. And all of a sudden, your for-loop starts to look far more auto-vectorizable and branchless.
Yeah, that's the kind of stuff I've been doing, like mostly creating simple inline functions with vector and matrix maths, but not whole algorithms yet. But SIMD is very suited for algorithms designed in a data oriented approach, I imagine for things like line intersections, collisions, etc. I'm sure some software architects' heads will explode though :D
#30
mtcn77
Cheeseball said: But for AI and machine learning this is advantageous

Quadros can handle FP64 fine. What's lacking is FP16
What about tensors? I think vectors count as rank 1 tensors, so we should be able to compare the two.
#31
dragontamer5788
Kanan said: I'm pretty sure it was one of his usual rants; he does that sometimes. I too agree that AVX512 is definitely far from useless, BUT both its availability and the feature set itself are far too fragmented. Linus's point still holds: Intel made a mess of it.
Yeah, Linus definitely has a habit of ranting online and leaving his field of expertise. And to be fair: so do I. We're only human after all. It just means that you gotta be on guard and always critically read what Linus is saying. He's clearly a smart guy (probably smarter than me in most aspects of programming). But don't ever grow complacent.

AVX512's main issues are business related. It's locked out of mainstream Skylake chips (typical i7s), so it's not really a common compilation target. It was originally a Knights Landing feature (aka: Xeon Phi), which is a dead end.
efikkan said: Yeah, that's the kind of stuff I've been doing, like mostly creating simple inline functions with vector and matrix maths, but not whole algorithms yet. But SIMD is very suited for algorithms designed in a data oriented approach, I imagine for things like line intersections, collisions, etc. I'm sure some software architects' heads will explode though :D
I suggest reading through this dissertation by the way: www.cs.cmu.edu/~guyb/papers/Ble90.pdf

Blelloch's dissertation from 1990 would seem out-of-date at first glance. But in reality, modern SIMD machines (both AVX512 and GPUs) are heavily based on the CM5 machine he used as the basis of his dissertation. As such, his dissertation reads amazingly close to modern machines.

Dr. Blelloch's more recent papers map more closely to modern machines: www.cs.cmu.edu/~guyb/

Just some food for thought. I wouldn't try to do the "flattened nested parallelism" from the top-down in every algorithm. It's unlikely to be fast on all modern architectures. But what's interesting is that Dr. Blelloch has proven an equivalence between recursive definitions and the prefix scan-operations. As such, we have a "universal gadget" to try to convert recursive forms of algorithms into prefix-sum, prefix-max, and similar operations.

Not that the gadget is always efficient on a modern SIMD machine. It's absolutely not... but maybe restating the problem in a prefix-sum style provides insight and gives you ideas for a more efficient algorithm.

---------

You don't have to go very far to be amazed. As early as Chapter 1, Dr. Blelloch converts recursive quicksort (yes, quicksort) into prefix sum operations.
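To give a flavour of the idea, here is a small C++ sketch of a single quicksort partition step written with prefix sums instead of element-by-element swapping. It is only an illustration under my own assumptions (the names prefix_sum_exclusive and scan_partition are invented, the scan is a plain serial loop, and the split is two-way), not a reproduction of the dissertation's algorithm.

```cpp
#include <cstddef>
#include <vector>

// Exclusive prefix sum: out[i] = in[0] + ... + in[i-1], with out[0] = 0.
std::vector<std::size_t> prefix_sum_exclusive(const std::vector<std::size_t>& in) {
    std::vector<std::size_t> out(in.size());
    std::size_t running = 0;
    for (std::size_t i = 0; i < in.size(); ++i) {
        out[i] = running;
        running += in[i];
    }
    return out;
}

// One partition step: elements below the pivot first, everything else after,
// each group keeping its original relative order.
std::vector<int> scan_partition(const std::vector<int>& data, int pivot) {
    const std::size_t n = data.size();

    // Step 1: per-element flags, computed independently for every element.
    std::vector<std::size_t> is_less(n), is_rest(n);
    for (std::size_t i = 0; i < n; ++i) {
        is_less[i] = data[i] < pivot ? 1 : 0;
        is_rest[i] = 1 - is_less[i];
    }

    // Step 2: scans turn the flags into destination indices.
    const std::vector<std::size_t> less_pos = prefix_sum_exclusive(is_less);
    const std::vector<std::size_t> rest_pos = prefix_sum_exclusive(is_rest);
    const std::size_t less_count = n == 0 ? 0 : less_pos[n - 1] + is_less[n - 1];

    // Step 3: scatter every element to its computed slot.
    std::vector<int> out(n);
    for (std::size_t i = 0; i < n; ++i) {
        const std::size_t dst = is_less[i] ? less_pos[i] : less_count + rest_pos[i];
        out[dst] = data[i];
    }
    return out;
}
```

Steps 1 and 3 treat each element independently, which is what lets them map onto SIMD lanes; on a parallel machine the serial scan in step 2 would be replaced by a parallel prefix sum.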
#32
efikkan
dragontamer5788 said: AVX512's main issues are business related. It's locked out of mainstream Skylake chips (typical i7s), so it's not really a common compilation target. It was originally a Knights Landing feature (aka: Xeon Phi), which is a dead end.
It's important to remember that Intel's intention was to release Skylake-SP/X and Ice Lake (client) pretty close together. Coffee Lake(s) and Comet Lake were emergency backup plans. So if anything, their business failure is in failing to have a backported Sunny Cove etc. just in case 10nm failed. This AVX-512 inconsistency was never their intention, but still ultimately their "fault".
dragontamer5788 said: I suggest reading through this dissertation by the way: www.cs.cmu.edu/~guyb/papers/Ble90.pdf
Thanks.
Some good corona-times reading :)
#33
trparky
efikkan said: No it does not. Unless the CPU reaches a thermal or power limit, it will not throttle the whole CPU, it does not slow down all cores. Loads of applications use AVX to some extent in the background, including compression, web browsers and pretty much anything which deals with video.
Then tell me why there is an AVX Offset in UEFI?

If I understand the concept of the AVX Offset correctly, it's a setting that if you set it at 5 the processor will down-clock from the highest speed by that setting. In the case of a setting of 5 the processor will down-clock by 500 MHz when executing AVX instructions.
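In other words, the arithmetic as I understand it looks like the sketch below. It assumes the offset counts multiplier steps on a 100 MHz base clock, which is what the "5 means 500 MHz" reading implies; effective_avx_clock_mhz is a made-up helper, not anything from a real firmware interface.

```cpp
// Effective clock while the AVX offset applies, assuming a multiplier-based
// clock: frequency = ratio * base clock, and the offset lowers the ratio.
// The 100 MHz default base clock is an assumption for this example.
constexpr int effective_avx_clock_mhz(int core_ratio, int avx_offset, int bclk_mhz = 100) {
    return (core_ratio - avx_offset) * bclk_mhz;
}
```

With a core ratio of 50 and an offset of 5, this gives 4500 MHz instead of 5000 MHz.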
#34
windwhirl
trparky said: Then tell me why there is an AVX Offset in UEFI?

If I understand the concept of the AVX Offset correctly, it's a setting that if you set it at 5 the processor will down-clock from the highest speed by that setting. In the case of a setting of 5 the processor will down-clock by 500 MHz when executing AVX instructions.
Don't know about that AVX Offset thing in UEFI (I don't do overclocks, after all), but you may be referring to this:
stackoverflow.com/questions/56852812/simd-instructions-lowering-cpu-frequency/56861355#56861355

en.wikichip.org/wiki/intel/frequency_behavior

It's documented behavior that Intel processors have different frequency sets according to whatever is running on it.
windwhirl said: TLDR, it seems to affect only Turbo frequencies, in the first place, and how much it will downclock will depend on the type and number of instructions executed. AVX512 does trigger this throttling a bit more, while AVX and AVX2 do it less or don't even do so at all.
#35
R-T-B
Also, because hyperthreading lets at most two threads run on a core, you'll rarely "slow down" an integer thread on the same core as an AVX instruction. Most of the time, it will rapidly downclock for AVX, execute that instruction with reduced clocks (and still better performance than if it hadn't), and then switch back and do whatever integer thing it was doing at full speed. No penalty. The only situation where there would be a penalty is if it literally executed some kind of AVX and had TIME LEFT OVER (unlikely) to then execute an integer instruction, which would be forced to execute at the lower clock. This is exceedingly rare in practice, I'd picture.
#36
efikkan
trparky said: Then tell me why there is an AVX Offset in UEFI?

If I understand the concept of the AVX Offset correctly, it's a setting that if you set it at 5 the processor will down-clock from the highest speed by that setting. In the case of a setting of 5 the processor will down-clock by 500 MHz when executing AVX instructions.
The claim was that any AVX code would impact any other code running on the CPU, and that's simply not the case. A single core can throttle with a lot of AVX, but the CPU runs AVX all the time without any problem.

The purpose of the AVX offset is for overclockers to push non-AVX workloads to a higher clock speed.
R-T-B said: Also, because hyperthreading lets at most two threads run on a core, you'll rarely "slow down" an integer thread on the same core as an AVX instruction. Most of the time, it will rapidly downclock for AVX, execute that instruction with reduced clocks (and still better performance than if it hadn't), and then switch back and do whatever integer thing it was doing at full speed. No penalty. The only situation where there would be a penalty is if it literally executed some kind of AVX and had TIME LEFT OVER (unlikely) to then execute an integer instruction, which would be forced to execute at the lower clock. This is exceedingly rare in practice, I'd picture.
The CPUs are superscalar, so technically they can execute both integer instructions and vector instructions at the same time, and they often do. E.g. you have a loop with dense math; the math is AVX, but the loop is not. But it's not a problem, as the alternative would be to execute much more code, so even if a few instructions technically run slower, the overall workload is still a lot faster.

Running the same calculations as AVX greatly reduces the instruction count and the clock cycles needed. It also effectively unrolls the loops further, which again reduces the loop code and the branching associated with it. And denser code also helps data caches, instruction caches, data dependencies and branch prediction.
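A small C++ sketch of that kind of loop, with hypothetical names (saxpy_scalar, saxpy_blocked, kLanes). The blocked version only models in plain code what a vectorizing compiler does with the scalar one; nothing here guarantees that either function is compiled to AVX instructions.

```cpp
#include <cstddef>

// Eight 32-bit floats fit in one 256-bit AVX register.
constexpr std::size_t kLanes = 8;

// Scalar form: one multiply-add plus one increment/compare/branch per element.
void saxpy_scalar(float a, const float* x, float* y, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        y[i] = a * x[i] + y[i];
    }
}

// Blocked form: the fixed-length inner loop is the part a vectorizing
// compiler can replace with a single 8-lane operation, so the integer
// loop control (the non-AVX part) runs once per block instead of once
// per element.
void saxpy_blocked(float a, const float* x, float* y, std::size_t n) {
    std::size_t i = 0;
    for (; i + kLanes <= n; i += kLanes) {
        for (std::size_t lane = 0; lane < kLanes; ++lane) {
            y[i + lane] = a * x[i + lane] + y[i + lane];
        }
    }
    // Scalar tail for the elements that do not fill a whole block.
    for (; i < n; ++i) {
        y[i] = a * x[i] + y[i];
    }
}
```

For n elements, the loop counter and branch of the blocked form execute roughly n/8 times rather than n times, which is the reduction in "loop code" described above.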
#37
Kanan
Tech Enthusiast & Gamer
dragontamer5788 said: Yeah, Linus definitely has a habit of ranting online and leaving his field of expertise. And to be fair: so do I. We're only human after all. It just means that you gotta be on guard and always critically read what Linus is saying. He's clearly a smart guy (probably smarter than me in most aspects of programming). But don't ever grow complacent
Linus Torvalds is well appreciated by me anyway. I respect people who are publicly bold, direct and honest; it is a rare trait. His most famous moment was when he gave the middle finger to Nvidia at a conference, which was well deserved. Big companies must always be tested and questioned; they should not get a free pass, or they will always abuse it in the name of capitalism and their shareholders.
#38
trparky
efikkan said: The claim was that any AVX code would impact any other code running on the CPU, and that's simply not the case. A single core can throttle with a lot of AVX, but the CPU runs AVX all the time without any problem.

The purpose of the AVX offset is for overclockers to push non-AVX workloads to a higher clock speed.
So, in other words, nothing to be alarmed about. It's there but it's not going to cause too many slowdowns unless your cooling setup is really that shitty.
#39
dragontamer5788
Hmmm... I recall some very, very, very smart people discussing AVX512 downclocking / slowdown issues. I don't recall what they said about it however.

My perspective is that these microarchitectural issues (ie: downclocking or whatnot) will absolutely change by the next major "tick-tock" architecture from Intel. Intel's first implementation of any SIMD has always been crappy.

When AVX was first released, it was executed 128-bits at a time (Sandy Bridge). It was missing integer instructions: that's right, you could do 53-bit double-precision multiplies but you couldn't do 32-bit integer multiplies. All sorts of terrible. Eventually, Haswell + AVX2 came out and fixed the issues, finally making the AVX transition mostly worthwhile over SSE instructions. But all of the flamewars from the early 2010s about "is AVX worth it" look hopelessly outdated in today's environment.

I guess my point is... don't judge the AVX512 instruction set based on its current implementation (ie: Skylake-X). Skylake-X is clearly a "bad" implementation of AVX512. We should instead judge AVX512 based on its future viability. Focusing too much on Skylake-X's performance quirks will make our comments obsolete quicker.

-------------

Case in point: the CNS AVX512 chip (yeah, Via-chips. Surprise!!) can support AVX512 at full clock speeds. It does this by implementing all AVX512 instructions as 256-bit instructions executed over 2x clock ticks. No downclocking involved at all. Maybe this 2x256-bit methodology will be superior in the future, and Intel will copy it. Or maybe Intel figures out the 512-bit power issues and removes the need for downclocking.

Even as a 2x256-bit implementation, AVX512 has enough bonuses (auto-vectorization instructions, opcode masks, scatter instructions, extended register sets) that it's worthwhile to use.
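For anyone unfamiliar with opcode masks, the semantics can be modelled in plain scalar C++ as below. This is a sketch, not intrinsics code: Vec16, compare_less and masked_add are invented names, and the loops stand in for what the hardware does across all sixteen 32-bit lanes of a 512-bit register at once.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Sixteen 32-bit lanes model one 512-bit register.
using Vec16 = std::array<std::int32_t, 16>;

// A compare produces one mask bit per lane instead of a branch.
std::uint16_t compare_less(const Vec16& a, const Vec16& b) {
    std::uint16_t mask = 0;
    for (std::size_t lane = 0; lane < 16; ++lane) {
        if (a[lane] < b[lane]) {
            mask = static_cast<std::uint16_t>(mask | (1u << lane));
        }
    }
    return mask;
}

// Merge-masked add: lanes whose mask bit is set receive a + b,
// all other lanes keep the value from src.
Vec16 masked_add(const Vec16& src, std::uint16_t mask, const Vec16& a, const Vec16& b) {
    Vec16 out = src;
    for (std::size_t lane = 0; lane < 16; ++lane) {
        if ((mask >> lane) & 1u) {
            out[lane] = a[lane] + b[lane];
        }
    }
    return out;
}
```

A per-element "if (a < b) dst = a + b" inside a loop becomes a compare that fills the mask followed by one masked operation, which is part of why masks make such loops easier to auto-vectorize.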
#40
windwhirl
dragontamer5788 said: I guess my point is... don't judge the AVX512 instruction set based on its current implementation (ie: Skylake-X). Skylake-X is clearly a "bad" implementation of AVX512. We should instead judge AVX512 based on its future viability. Focusing too much on Skylake-X's performance quirks will make our comments obsolete quicker.
That's what I'm looking forward to with AVX-512: seeing how Intel implements it in their next products and what improvements they make.

And if that chart is correct, a larger subset available on more mainstream CPUs (not just top-of-the-line Extreme Edition CPUs or Xeons) could make it worthwhile for devs and programmers of all kinds of work to use it.
#41
efikkan
dragontamer5788 said: My perspective is that these microarchitectural issues (ie: downclocking or whatnot) will absolutely change by the next major "tick-tock" architecture from Intel. Intel's first implementation of any SIMD has always been crappy.
<snip>
Case in point: the CNS AVX512 chip (yeah, Via-chips. Surprise!!) can support AVX512 at full clock speeds. It does this by implementing all AVX512 instructions as 256-bit instructions executed over 2x clock ticks. No downclocking involved at all. Maybe this 2x256-bit methodology will be superior in the future, and Intel will copy it. Or maybe Intel figures out the 512-bit power issues and removes the need for downclocking.
Intel's power issues are probably related to the node. The AVX-512 units are pretty large, and need to be in sync. I assume at 10nm and 7nm the voltage needed will be less, and the power much more under control.

Via's decision to do it over two cycles probably has to do with saving die space. Zen(1) did something similar with AVX2.
windwhirl said: That's what I'm looking forward to with AVX-512: seeing how Intel implements it in their next products and what improvements they make.

And if that chart is correct, a larger subset available on more mainstream CPUs (not just top-of-the-line Extreme Edition CPUs or Xeons) could make it worthwhile for devs and programmers of all kinds of work to use it.
While those charts might look a bit intimidating, most of the common features are covered by the F and CD sets, and these also require the most die space.
BTW; you can see the massive list of instructions in the F set here.
#42
R-T-B
efikkan said: The CPUs are superscalar, so technically they can execute both integer instructions and vector instructions at the same time, and they often do. E.g. you have a loop with dense math; the math is AVX, but the loop is not. But it's not a problem, as the alternative would be to execute much more code, so even if a few instructions technically run slower, the overall workload is still a lot faster.
Ah yes, you are correct, even if the conclusion is technically the same.