Sunday, July 12th 2020

Linus Torvalds Finds AVX-512 an Intel Gimmick to Invent and Win at Benchmarks

"I hope AVX512 dies a painful death, and that Intel starts fixing real problems instead of trying to create magic instructions to then create benchmarks that they can look good on." These were the words of Linux and Git creator Linus Torvalds in a mailing list, reacting to news of "Alder Lake" lacking AVX-512. Torvalds also cautioned against placing too much weight on floating-point performance benchmarks, particularly those that take advantage of exotic new instruction sets whose implementation is fragmented and varies across product lines.

"I've said this before, and I'll say it again: in the heyday of x86, when Intel was laughing all the way to the bank and killing all their competition, absolutely everybody else did better than Intel on FP loads. Intel's FP performance sucked (relatively speaking), and it matter not one iota. Because absolutely nobody cares outside of benchmarks." Torvalds believes AVX2 is "more than enough" thanks to its proliferation, and argues that processor manufacturers should design better FPUs for their cores so they don't have to rely on instruction-set-level optimization to eke out performance.
"Yes, yes, I'm biased. I absolutely detest FP benchmarks, and I realize other people care deeply. I just think AVX512 is exactly the wrong thing to do. It's a pet peeve of mine. It's a prime example of something Intel has done wrong, partly by just increasing the fragmentation of the market. Stop with the special-case garbage, and make all the core common stuff that everybody cares about run as well as you humanly can. Then do a FPU that is barely good enough on the side, and people will be happy. AVX2 is much more than enough," he added. Torvalds recently upgraded to an AMD Ryzen Threadripper for his main work machine.
Source: Phoronix

42 Comments on Linus Torvalds Finds AVX-512 an Intel Gimmick to Invent and Win at Benchmarks

#2
windwhirl
Kudos to Intel for the massive effort of designing instruction sets just to win benchmarks, though :laugh:

Truth be told, though, I kinda agree with Torvalds? I mean, AVX has a history of generating more heat and of introducing a performance penalty in mixed workloads (triggered either by a single instruction or by more than a certain number of them, depending on the specific instruction). On top of that, AVX-512 has a multitude of instructions that are not necessarily all available together on a given CPU, probably due to Intel's habit of aggressively cutting off features for market segmentation.
#3
Minus Infinity
When we were doing electromagnetic simulations back in the day, Intel were junk, DEC Alpha was the only game in town and killed Intel. The Itanic was a total failure and it really sucked when HP eventually took over DEC. Hopefully AMD’s reported 50% uplift in FPU performance doesn’t just come from adding AVX-512 or such things. Also he should call out Nvidia for their crap FP64 performance.
#4
InVasMani
Would rather see Intel return to a 16KB L1 and just label it L0, while retaining the other cache levels along with their sizes and structure.
#5
R-T-B
The issue here isn't so much that the instruction set doesn't work, it's that Intel has fragmented it to the point it will never be used.

That's the core of his rant. He also is upset that they don't just make a better FPU.
#6
InVasMani
Minus Infinity: Also he should call out Nvidia for their crap FP64 performance.
Apparently AMD's newest Radeon Pros with that Infinity Fabric bridge are quite the FP64 beasts, if their benchmark diagrams are to be trusted and aren't cherry-picked scenarios. That said, FP64 isn't that useful for actual gaming tasks from what I hear, though it's wonderful for compute, and I'm pretty sure there is more money to be had in compute between the two.
#7
Midland Dog
Minus Infinity: When we were doing electromagnetic simulations back in the day, Intel were junk, DEC Alpha was the only game in town and killed Intel. The Itanic was a total failure and it really sucked when HP eventually took over DEC. Hopefully AMD’s reported 50% uplift in FPU performance doesn’t just come from adding AVX-512 or such things. Also he should call out Nvidia for their crap FP64 performance.
Except GA100 has more FP64 potential than AMD does FP32 potential.
#8
steen
Midland Dog: except GA100 has more FP64 potential than AMD does FP32 potential
except not everything is amenable to matrices
#9
efikkan
Mr Torvalds is certainly entitled to his own opinions, but that doesn't make every single one of them gold.
The background for this topic is the addition of Alder Lake support in GCC, which lacked AVX-512. It remains to be seen if this means the core itself lacks the feature, or if only parts of it do. I assume that following all this noise Intel will make some sort of statement.

I'm also disappointed with the adoption rate of AVX-512, but that doesn't make it a gimmick. It holds incredible performance potential and increased flexibility over AVX2. But what annoys me much more is Intel's complete lack of support of any AVX in their Pentium/Celeron processors, which is unnecessary fragmentation and holds back mainstream software from embracing modern features.
Minus Infinity: Also he should call out Nvidia for their crap FP64 performance.
Why do you need FP64 on GPUs? Please elaborate.
#10
tygrus
It shouldn't be this complicated and problematic.

It used to be the choice of instruction had isolated impact and predictable results. They neither slowed down any other code around it nor impacted code running on other cores. These were almost free to use and a benefit when used correctly.

The problem is when mixing code and mixing running tasks, AVX512 et al. reduce the clockspeed, which impacts the integer code running in the same thread AND ALL OTHER running threads on the same processor. It slows down all integer & non-AVX FP code running in ALL cores. Compilers cannot know during compiling what the potential performance impacts will be for users at runtime. The OS cannot know the potential performance impacts that occur at runtime when scheduling a mixture of threads. Fairness and predictable performance go out the window. The best choice for fairness and predictable performance is to IGNORE occasional use of AVX. It may be nice for computers/servers dedicated to a single task that benefits from these instructions, but the typical general user is hurt more than helped by them. Cloud and VM users are hurt by them. Arbitrary and occasional use of them impacts all running code, so the OS should avoid using them.

It would be OK if the processor could maintain clock speed while using exotic instructions. They would have to be engineered to increase the stages/cycles required to complete the more complex work but maintain clockspeed at all costs. I would much rather have more FP units that are simpler for greater throughput and flexibility. Good if you can pipeline the Multiply into the Add and get the result slightly later than AVX512, as long as it doesn't slow down the rest of the code. Just because you can use an AVX__ instruction, doesn't mean you should.

CPU support for AVX is a mixture of yes and no. The clockspeed impact also varies according to the CPU model and many other variables.
I agree with Linus, it shouldn't be this complicated and problematic.
#11
Cheeseball
Not a Potato
steen: except not everything is amenable to matrices
But for AI and machine learning this is advantageous
Minus Infinity: When we were doing electromagnetic simulations back in the day, Intel were junk, DEC Alpha was the only game in town and killed Intel. The Itanic was a total failure and it really sucked when HP eventually took over DEC. Hopefully AMD’s reported 50% uplift in FPU performance doesn’t just come from adding AVX-512 or such things. Also he should call out Nvidia for their crap FP64 performance.
Quadros can handle FP64 fine. What's lacking is FP16.
#12
trparky
I'd have to agree with @tygrus here: running AVX code if you don't have good cooling (like a majority of OEM pre-builds) is going to result in lower clock speed due to Intel's own AVX offset. I know that there are those of us who have tweaked our motherboard UEFIs to force the processor to run at the same speed even while using AVX code by setting the AVX offset to 0, but that's not possible on OEM stripped-down UEFIs. And even then, for those of us who have removed the limitation (because we can), you better have a damn good cooler.
#13
efikkan
tygrus: The problem is when mixing code and mixing running tasks, AVX512 et al. reduce the clockspeed, which impacts the integer code running in the same thread AND ALL OTHER running threads on the same processor. It slows down all integer & non-AVX FP code running in ALL cores.
No it does not. Unless the CPU reaches a thermal or power limit, it will not throttle the whole CPU, it does not slow down all cores. Loads of applications use AVX to some extent in the background, including compression, web browsers and pretty much anything which deals with video.
tygrus: Compilers cannot know during compiling what the potential performance impacts will be for users at runtime. The OS cannot know the potential performance impacts that occur at runtime when scheduling a mixture of threads. Fairness and predictable performance go out the window. The best choice for fairness and predictable performance is to IGNORE occasional use of AVX.
I'm going to give you a chance to rephrase that, since it makes no sense.
AVX code is if anything much more predictable, since the throughput is more consistent, cache lines are more effectively used and there is less branching.
tygrus: It would be OK if the processor could maintain clock speed while using exotic instructions. They would have to be engineered to increase the stages/cycles required to complete the more complex work but maintain clockspeed at all costs. I would much rather have more FP units that are simpler for greater throughput and flexibility. Good if you can pipeline the Multiply into the Add and get the result slightly later than AVX512, as long as it doesn't slow down the rest of the code. Just because you can use an AVX__ instruction, doesn't mean you should.
Firstly, single FP operations, SSE and AVX are all fed into the same vector units; the only difference is how filled the vector registers are. Intel has two full FMA sets for AVX-512. To compete with that in FP32 throughput using single FPUs you would need 32 of them, and you would also need the circuitry to handle these writing back to the same cache lines without adding pipeline steps. Then the instructions would be at least 16x larger, meaning you would have to increase the instruction cache >10x and probably L2 a bit as well, then the instruction window would have to increase ~10x, and the prefetcher, branch predictor etc. would need to work much more efficiently. And even if you manage all this, you better pray that the compiler has unrolled all loops aggressively, because otherwise there is no way you are going to feed your 32 hungry FPUs. :rolleyes:
If you have a rough understanding of how CPUs work, you have probably understood by now that your suggestion was short-sighted.
#14
TurboFEM
I see a lot of questions on why does one even need FP performance.

Probably many things, but one I know of quite well is - engineering simulations.

Thousands and thousands of engineers are relying on Xeons every day to run their finite-element and finite-difference type analyses (mechanical FE, CFD, electromagnetics etc.).
For FE, specifically, you spec a machine like this -> As many AVX2/512 cores as you can get away with and nCores * ~8GB ECC RAM. Turn off hyperthreading and go have fun.

It's a big market for Intel, and increasingly nVidia (new codes start to introduce GPU FP64 slowly, but typically require CUDA, so no luck for AMD).
#15
efikkan
@btarunr
As I alluded to in #10, there would probably be some kind of response.
Videocardz (if we can trust them), have some clarifications: link
So it appears that the big cores offer more ISA features.
#16
ThrashZone
Hi,
Can't say I've ever run into avx-512 so far ?
Set it to 5 and clocks have never dropped that far.
#17
John Naylor
Ya mean ... just like "more cores" ?. While more cores can be useful, having more than you actually need for your applications doesn't do anything for you.
#18
windwhirl
ThrashZone: Hi,
Can't say I've ever run into avx-512 so far ?
Set it to 5 and clocks have never dropped that far.
It's a relatively recent instruction set. It's rare to see it in use outside of scientific applications or others that get a real benefit out of using it.

Besides, AVX-512 is found only in high-end desktop processors (Core i7 or i9) or Xeons, and for whatever reason, on some specific mobile chips.

On top of that, while there is a subset that is sort of available on every Intel CPU that "supports" AVX-512, there are some instructions that are only found on specific CPUs. Tiger Lake has not even launched yet, if I remember correctly.
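Because the available subsets differ from CPU to CPU, software that wants to use AVX-512 generally has to check at runtime before taking a vector code path. The following is a minimal sketch of such a check, assuming GCC or Clang on x86 and their `__builtin_cpu_supports` builtin; the function names `widest_simd_level` and `print_avx512_subsets` are made up for illustration, and on other compilers or architectures the sketch simply reports "scalar".

```c
#include <stdio.h>

/* Assumption: the CPU-feature builtins are only used with GCC/Clang on x86. */
#if (defined(__GNUC__) || defined(__clang__)) && \
    (defined(__x86_64__) || defined(__i386__))
#define HAVE_X86_CPU_BUILTINS 1
#else
#define HAVE_X86_CPU_BUILTINS 0
#endif

/* Hypothetical helper: returns a label for the widest SIMD level the
   running CPU reports, checked from widest to narrowest. */
const char *widest_simd_level(void)
{
#if HAVE_X86_CPU_BUILTINS
    __builtin_cpu_init();
    if (__builtin_cpu_supports("avx512f")) return "avx512f";
    if (__builtin_cpu_supports("avx2"))    return "avx2";
    if (__builtin_cpu_supports("avx"))     return "avx";
    if (__builtin_cpu_supports("sse2"))    return "sse2";
#endif
    return "scalar";
}

/* Hypothetical helper: prints which of a few AVX-512 subsets are present.
   The builtin needs a string literal, so each subset is queried explicitly. */
void print_avx512_subsets(void)
{
#if HAVE_X86_CPU_BUILTINS
    __builtin_cpu_init();
    printf("avx512f:  %s\n", __builtin_cpu_supports("avx512f")  ? "yes" : "no");
    printf("avx512vl: %s\n", __builtin_cpu_supports("avx512vl") ? "yes" : "no");
    printf("avx512bw: %s\n", __builtin_cpu_supports("avx512bw") ? "yes" : "no");
    printf("avx512dq: %s\n", __builtin_cpu_supports("avx512dq") ? "yes" : "no");
    printf("avx512cd: %s\n", __builtin_cpu_supports("avx512cd") ? "yes" : "no");
#else
    printf("not an x86 GCC/Clang build; no AVX-512 subsets to report\n");
#endif
}
```

A program would typically call something like `widest_simd_level()` once at startup and pick a code path from the result. Every extra subset is another branch in that dispatch, which is the practical cost of the fragmentation described above.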

efikkan: No it does not. Unless the CPU reaches a thermal or power limit, it will not throttle the whole CPU, it does not slow down all cores. Loads of applications use AVX to some extent in the background, including compression, web browsers and pretty much anything which deals with video.
AVX impact is relative, apparently, according to this
stackoverflow.com/questions/56852812/simd-instructions-lowering-cpu-frequency/56861355#56861355

TLDR, it seems to affect only Turbo frequencies, in the first place, and how much it will downclock will depend on the type and number of instructions executed. AVX512 does trigger this throttling a bit more, while AVX and AVX2 do it less or don't even do so at all.
#19
ThrashZone
Hi,
Yep my prior x299/ 7900x had it and so does my current 9940x
z490/ 10900k does not nor does x99/ 5930k.
#20
efikkan
windwhirl: Besides, AVX-512 is found only in high-end desktop processors (Core i7 or i9) or Xeons, and for whatever reason, on some specific mobile chips.
If Ice Lake-S/-H hadn't been cancelled, the whole lineup* would have offered AVX-512 already, so this strangeness is not intentional segmentation.
Once client applications start to utilize it, it will offer significant performance and efficiency gains, even for low-power laptops.

*) Except Atom, Pentium and Celeron of course.
#21
quadibloc
AMD's Ryzen processors lagged behind Intel's chips significantly in earlier generations when they only supported 128-bit SIMD while Intel already had 256-bit AVX. So when 10nm finally gives Intel the thermals to put AVX-512 on the desktop, I've been expecting that Intel will once again take over the lead from AMD (although, as with the earlier generations of Ryzen, AMD won't be that far behind). So I was pretty shocked to hear Linus' comments!
After all, faster integer performance will... let your computer send E-mail faster? Gaming uses floating-point too, so improving the power of chips for HPC applications will make them more powerful for everyone.
But maybe Linus Torvalds is at least partly right. Maybe it's time to split the processor line-up, to offer a choice between chips that have high floating-point performance, and other chips that tilt more towards integer performance, so that one can buy a processor appropriate to one's workload.
#22
mtcn77
efikkan: *) Except Atom, Pentium and Celeron of course.
Nobody will need more than 8000 MIPS.
#23
dragontamer5788
R-T-B: That's the core of his rant. He also is upset that they don't just make a better FPU.
Intel Skylake (non-X) 256-bit AVX already supports 2x 256-bit multiply-and-adds, 2x 256-bit loads from L1 cache and 1x 256-bit store to L1 cache... per clock tick with like 5-cycle latency.

Outside of going to 512-bits, how exactly do you expect Intel to improve upon that? AVX512 simply changes that to 2x 512-bit multiply-and-adds, 2x 512-bit loads and 1x 512-bit stores. It's the most obvious way to improve the SIMD / FPU unit.

EDIT: Apparently 2x multiply-and-adds supported per clock on Skylake, according to software.intel.com/sites/landingpage/IntrinsicsGuide/#expand=3508,3922,2581&techs=FMA&text=fmadd. Still, that's 16 flops per cycle. Hard to imagine how to make this 2x better aside from the "obvious" extend to 512 bits.
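The "16 flops per cycle" figure can be reproduced with simple arithmetic: lanes per register, times two operations per fused multiply-add, times the number of FMA units. The sketch below assumes two FMA units per core (the figure quoted above) and counts an FMA as two floating-point operations; the function names are illustrative only, and real sustained throughput depends on much more than this peak number.

```c
#include <stdio.h>

/* Peak floating-point operations per clock for one core.
   vector_bits: SIMD register width (256 for AVX2, 512 for AVX-512)
   elem_bits:   element size (32 for float, 64 for double)
   fma_units:   FMA-capable vector units per core (assumed, not measured) */
unsigned peak_flops_per_cycle(unsigned vector_bits, unsigned elem_bits,
                              unsigned fma_units)
{
    unsigned lanes = vector_bits / elem_bits;  /* elements per register */
    return fma_units * lanes * 2u;             /* FMA = multiply + add */
}

/* Prints the four combinations discussed in the thread. */
void print_peak_table(void)
{
    printf("256-bit, double: %u flops/cycle\n", peak_flops_per_cycle(256, 64, 2));
    printf("256-bit, float:  %u flops/cycle\n", peak_flops_per_cycle(256, 32, 2));
    printf("512-bit, double: %u flops/cycle\n", peak_flops_per_cycle(512, 64, 2));
    printf("512-bit, float:  %u flops/cycle\n", peak_flops_per_cycle(512, 32, 2));
}
```

With these assumptions the 256-bit double-precision case gives 2 x 4 x 2 = 16, matching the number above, and widening to 512 bits doubles each row without adding units.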

------

SIMD FPU-multiply is higher performance than 64-bit integer-multiply, lol. (to be fair: SIMD FPU-multiply is easier at only 53-bits (Double precision), but still...)
steen: except not everything is amenable to matrices
But virtually everything has a "memset(blah, 0, ...)" somewhere. And this memset code is almost always compiled into SIMD in my experience (be it 128-bit SSE, 256-bit AVX, or 512-bit AVX512 code)

GCC and Clang have surprisingly good auto-vectorizers that can change many simple for-loops into SIMD accelerated versions. AVX512 has literally double the performance with memset, memcmp, memcpy, strcmp, strcpy, etc. compared to 256-bit AVX2. (Note: AVX does NOT support 256-bit integer operations. You need AVX2 in your compile flags, as well as an AVX2 CPU).
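As a rough illustration of the kind of loop those auto-vectorizers handle, here is a minimal sketch with a zero-fill loop and an element-wise integer add. The function names are made up for this example. Building with something like `gcc -O3 -mavx2` or `gcc -O3 -march=skylake-avx512 -mprefer-vector-width=512` is an assumption about a typical setup; whether a given compiler version actually vectorizes a loop can be checked with GCC's `-fopt-info-vec` or Clang's `-Rpass=loop-vectorize`.

```c
#include <stddef.h>
#include <stdint.h>

/* A hand-written zero fill. Optimizing compilers commonly recognise this
   pattern and replace it with a memset call or wide SIMD stores. */
void zero_fill(uint8_t *buf, size_t n)
{
    for (size_t i = 0; i < n; i++)
        buf[i] = 0;
}

/* Element-wise integer add. `restrict` promises the arrays do not overlap,
   which removes one obstacle to vectorization. Unsigned arithmetic is used
   so that overflow wraps instead of being undefined. */
void add_u32(uint32_t *restrict dst, const uint32_t *restrict a,
             const uint32_t *restrict b, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = a[i] + b[i];
}
```

The source stays the same for every target. Only the compile flags decide whether the integer add becomes 128-bit SSE, 256-bit AVX2, or 512-bit AVX-512 code.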

The 512-bit thick data-path extends all the way to L2 cache... meaning memcmp / memcpy / etc. etc. bonus applies to a huge amount of C code automatically.
#24
efikkan
dragontamer5788: Outside of going to 512-bits, how exactly do you expect Intel to improve upon that? AVX512 simply changes that to 2x 512-bit multiply-and-adds, 2x 512-bit loads and 1x 512-bit stores. It's the most obvious way to improve the SIMD / FPU unit.
There is also the option of adding more execution ports and vector units.
This does however require the front-end to be able to decode and issue micro-ops faster, a larger instruction window, etc., and even then it runs the risk of underutilization. I do expect that we will eventually move to 3 or even 4 FMA sets in desktop CPUs, but the architectures will need to evolve a lot to facilitate that.

One interesting bit is the rumor about Zen 3 offering 50% higher FPU performance. If true, I do wonder if they added more units, or if they improved them somehow.
dragontamer5788: GCC and Clang have surprisingly good auto-vectorizers that can change many simple for-loops into SIMD accelerated versions…
They do, and software can get a good portion of free performance simply by enabling these instructions.
Still, the huge performance gains require tailored code using intrinsics, which is unfortunately a bit too difficult for most programmers. But I do hope we get to a point where the compilers are able to convert somewhat more complex calculations into pretty optimal AVX, provided your code is cache-optimized etc.

One of the interesting things about AVX is the vast feature set which extends far beyond just arithmetic. It also supports things like comparisons with masks, which essentially enables you to do conditionals without branching logic, and the feature set of AVX-512 is almost like a new instruction set. The potential here is huge, but it's still "inaccessible" to most programmers. If we get to a point where writing clean C code can be compiled into decent AVX instructions, even with more complex calculations and some basic conditionals, that would be huge for the adoption of AVX.
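To make "comparisons with masks" concrete, here is a portable scalar model of the idea, not actual AVX-512 code: a compare produces one mask bit per lane, and a blend then uses that mask to select between two inputs, so the data-dependent decision never becomes a jump. The names `compare_gt` and `blend` and the 16-lane width (16 floats = 512 bits) are assumptions for this sketch. On real hardware, intrinsics such as `_mm512_cmp_ps_mask` and `_mm512_mask_blend_ps` play a similar role, handling all lanes at once.

```c
#include <stdint.h>

#define LANES 16  /* 16 x 32-bit floats = one 512-bit register */

/* Step 1: compare every lane and pack the results into a bit mask.
   Bit i is set when a[i] > b[i]. */
uint16_t compare_gt(const float *a, const float *b)
{
    uint16_t mask = 0;
    for (int i = 0; i < LANES; i++)
        mask |= (uint16_t)((a[i] > b[i]) << i);
    return mask;
}

/* Step 2: per lane, take a[i] where the mask bit is set, otherwise b[i].
   The same operation is applied to every lane; only the mask differs. */
void blend(float *dst, uint16_t mask, const float *a, const float *b)
{
    for (int i = 0; i < LANES; i++)
        dst[i] = ((mask >> i) & 1u) ? a[i] : b[i];
}
```

Calling `blend(dst, compare_gt(a, b), a, b)` computes an element-wise maximum. An `if (a[i] > b[i])` in the source has turned into a mask plus a select, which is the shape a vectorizing compiler needs in order to handle simple conditionals.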
dragontamer5788: The 512-bit thick data-path extends all the way to L2 cache... meaning memcmp / memcpy / etc. etc. bonus applies to a huge amount of C code automatically.
One thing that comes to mind is the 512-bit vector size fits very well with the cache line size.
#25
dragontamer5788
efikkan: One of the interesting things about AVX is the vast feature set which extends far beyond just arithmetic. It also supports things like comparisons with masks, which essentially enables you to do conditionals without branching logic, and the feature set of AVX-512 is almost like a new instruction set. The potential here is huge, but it's still "inaccessible" to most programmers. If we get to a point where writing clean C code can be compiled into decent AVX instructions, even with more complex calculations and some basic conditionals, that would be huge for the adoption of AVX.
That's actually what makes me most excited about AVX512. All of these new AVX512 features allow auto-vectorization to happen far more easily. The details are complicated, but... let's just say that NVidia CUDA and AMD OpenCL have been doing this stuff for over a decade on GPUs. Intel is finally giving CPU compilers the ability to do what GPU compilers have been doing all along. It requires some additional support from the CPU instruction set to ease auto-vectorization and provide more SIMD-based branching controls. But once provided, the theory is already well studied from 1980s SIMD computers and is well known.

Honestly, Linus Torvalds is very clearly out of his depth in this subject matter. I'm no expert, but I can confidently say that I know more than Linus on this subject based on what he's saying here.

AVX and AVX2 are over a decade behind GPU-SIMD computers. AVX512 finally brings CPU auto-vectorizers to parity with what GPUs have been doing since 2006. AVX512 is actually a really well designed instruction set... but Intel is certainly messing up the business side of things IMO.