
Linus Torvalds Finds AVX-512 an Intel Gimmick to Invent and Win at Benchmarks

Joined
Apr 24, 2020
Messages
2,568 (1.74/day)
One of the interesting things about AVX is its vast feature set, which extends far beyond just arithmetic. It also supports things like comparisons with masks, which essentially let you do conditionals without branching logic, and the feature set of AVX-512 is almost like a new instruction set. The potential here is huge, but it's still "inaccessible" to most programmers. If we get to a point where clean C code can be compiled into decent AVX instructions, even with more complex calculations and some basic conditionals, that would be huge for the adoption of AVX.

That's actually what makes me most excited about AVX512. All of these new AVX512 features allow auto-vectorization to happen far more easily. The details are complicated, but... let's just say that NVidia CUDA and AMD OpenCL have been doing this stuff for over a decade on GPUs. Intel is finally giving CPU compilers the ability to do what GPU compilers have been doing all along. It requires some additional support from the CPU instruction set to ease auto-vectorization and provide more SIMD-based branching controls. But once that is provided, the theory is already well studied from 1980s SIMD computers and is well known.

Honestly, Linus Torvalds is very clearly out of his depth in this subject matter. I'm no expert, but I can confidently say that I know more than Linus on this subject based on what he's saying here.

AVX and AVX2 are over a decade behind GPU-SIMD computers. AVX512 finally brings CPU auto-vectorizers up to parity with what GPUs have been doing since 2006. AVX512 is actually a really well designed instruction set... but Intel is certainly messing up the business side of things IMO.
 
Joined
Jun 10, 2014
Messages
2,904 (0.80/day)
Processor AMD Ryzen 9 5900X ||| Intel Core i7-3930K
Motherboard ASUS ProArt B550-CREATOR ||| Asus P9X79 WS
Cooling Noctua NH-U14S ||| Be Quiet Pure Rock
Memory Crucial 2 x 16 GB 3200 MHz ||| Corsair 8 x 8 GB 1333 MHz
Video Card(s) MSI GTX 1060 3GB ||| MSI GTX 680 4GB
Storage Samsung 970 PRO 512 GB + 1 TB ||| Intel 545s 512 GB + 256 GB
Display(s) Asus ROG Swift PG278QR 27" ||| Eizo EV2416W 24"
Case Fractal Design Define 7 XL x 2
Audio Device(s) Cambridge Audio DacMagic Plus
Power Supply Seasonic Focus PX-850 x 2
Mouse Razer Abyssus
Keyboard CM Storm QuickFire XT
Software Ubuntu
That's actually what makes me most excited about AVX512. All of these new AVX512 features allow auto-vectorization to happen far more easily. The details are complicated, but... let's just say that NVidia CUDA and AMD OpenCL have been doing this stuff for over a decade on GPUs. Intel is finally giving CPU compilers the ability to do what GPU compilers have been doing all along. It requires some additional support from the CPU instruction set to ease auto-vectorization and provide more SIMD-based branching controls. But once that is provided, the theory is already well studied from 1980s SIMD computers and is well known.
Yes, and the interesting thing is that this would solve most of the scaling problems with code, which as you probably know are branching and cache misses. Most branching inside algorithms doesn't actually affect the bigger control flow of the code, but put just 3-4 of these together and you are pretty much guaranteed one or more stalls. I often call these "false branching", and sometimes do clever things to try to eliminate them, like bitwise operations, conditional moves etc. But AVX can resolve a lot of this; it really comes down to being able to write clean, readable code which translates into optimal AVX instructions. I still find it a daunting task to write anything but smaller pieces using intrinsics though.
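
As a rough illustration of removing one of these "false branches" with bitwise operations, here is a minimal sketch in plain C. The function name clamp_negative_to_zero is made up for this example. It turns a comparison into an all-ones or all-zeros mask and applies it with AND, which is roughly what a masked SIMD comparison does per lane. Whether a particular compiler vectorizes the loop depends on the compiler and its flags.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical example: clamp negative values to zero without an "if".
   mask is all ones when x[i] >= 0 and all zeros otherwise, so the AND
   either keeps the value or clears it. */
void clamp_negative_to_zero(int32_t *x, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        uint32_t mask = (uint32_t)0 - (uint32_t)(x[i] >= 0);
        x[i] = (int32_t)((uint32_t)x[i] & mask);
    }
}
```

The loop body has no data-dependent jump, so the only branch left is the loop condition itself.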

Honestly, Linus Torvalds is very clearly out of his depth in this subject matter. I'm no expert, but I can confidently say that I know more than Linus on this subject based on what he's saying here.
I have tremendous respect for Mr Torvalds and am a big fan of his two software creations, and I know he is a very smart man. But this doesn't make every outburst from him gold, and most of what he said here is not accurate.

The only part I could agree with concerns some of the more application-specific instructions (like the "AI" stuff). I believe a standard ISA should be generic compute and logic, not application-specific. So in my opinion, throw out all the AES, zip, jpeg(!) etc. acceleration instructions, and give us four 512-bit FMA-sets instead.
 
Joined
Apr 24, 2020
Messages
2,568 (1.74/day)
I often call these "false branching", and sometimes do clever things to try to eliminate them, like bitwise operations, conditional moves etc

My favorite is "max", "min", and similar operations.

Consider your typical "comparison" for a sorting problem. You'd think you need an "if" statement, but in reality... you can make do with:

Code:
higher = max(a, b);
lower = min(a, b);

The max/min version of the code is branchless at the lowest level, thanks to instructions like vpmaxud. And all of a sudden, your for-loop starts to look far more auto-vectorizable and branchless.
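
To make that concrete, here is a minimal sketch of the max/min idea applied across two arrays, assuming unsigned 32-bit elements. The names min_u32, max_u32 and compare_exchange are made up for this example. With optimization and a suitable target enabled, compilers will often turn a loop like this into packed min/max instructions, but that depends on the compiler and flags.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helpers: compilers typically lower these ternaries to
   min/max or conditional-move instructions rather than jumps. */
static inline uint32_t min_u32(uint32_t a, uint32_t b) { return a < b ? a : b; }
static inline uint32_t max_u32(uint32_t a, uint32_t b) { return a > b ? a : b; }

/* Compare-exchange across two arrays: afterwards lo[i] <= hi[i] for
   every i. This is the basic step of a sorting network. */
void compare_exchange(uint32_t *restrict lo, uint32_t *restrict hi, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        uint32_t a = lo[i];
        uint32_t b = hi[i];
        lo[i] = min_u32(a, b);
        hi[i] = max_u32(a, b);
    }
}
```

Every iteration does the same work regardless of the data, which is what makes the loop friendly to auto-vectorization.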
 

Kanan

Tech Enthusiast & Gamer
Joined
Aug 22, 2015
Messages
3,517 (1.10/day)
Location
Europe
System Name eazen corp | Xentronon 7.2
Processor AMD Ryzen 7 3700X // PBO max.
Motherboard Asus TUF Gaming X570-Plus
Cooling Noctua NH-D14 SE2011 w/ AM4 kit // 3x Corsair AF140L case fans (2 in, 1 out)
Memory G.Skill Trident Z RGB 2x16 GB DDR4 3600 @ 3800, CL16-19-19-39-58-1T, 1.4 V
Video Card(s) Asus ROG Strix GeForce RTX 2080 Ti modded to MATRIX // 2000-2100 MHz Core / 1938 MHz G6
Storage Silicon Power P34A80 1TB NVME/Samsung SSD 830 128GB&850 Evo 500GB&F3 1TB 7200RPM/Seagate 2TB 5900RPM
Display(s) Samsung 27" Curved FS2 HDR QLED 1440p/144Hz&27" iiyama TN LED 1080p/120Hz / Samsung 40" IPS 1080p TV
Case Corsair Carbide 600C
Audio Device(s) HyperX Cloud Orbit S / Creative SB X AE-5 @ Logitech Z906 / Sony HD AVR @PC & TV @ Teufel Theater 80
Power Supply EVGA 650 GQ
Mouse Logitech G700 @ Steelseries DeX // Xbox 360 Wireless Controller
Keyboard Corsair K70 LUX RGB /w Cherry MX Brown switches
VR HMD Still nope
Software Win 10 Pro
Benchmark Scores 15 095 Time Spy | P29 079 Firestrike | P35 628 3DM11 | X67 508 3DM Vantage Extreme
Honestly, Linus Torvalds is very clearly out of his depth in this subject matter. I'm no expert, but I can confidently say that I know more than Linus on this subject based on what he's saying here.
I'm pretty sure it was one of his usual rants; he does that sometimes. I agree that AVX512 is definitely far from useless, BUT both its availability and the feature set itself are far too fragmented. Linus's point still holds: Intel made a mess out of it.
 
Joined
Jun 10, 2014
Messages
2,904 (0.80/day)
My favorite is "max", "min", and similar operations.

Consider your typical "comparison" for a sorting problem. You'd think you need an "if" statement, but in reality... you can make do with:

Code:
higher = max(a, b);
lower = min(a, b);

The max/min version of the code is branchless at the lowest level, thanks to instructions like vpmaxud. And all of a sudden, your for-loop starts to look far more auto-vectorizable and branchless.
Yeah, that's the kind of stuff I've been doing, like mostly creating simple inline functions with vector and matrix maths, but not whole algorithms yet. But SIMD is very suited for algorithms designed in a data oriented approach, I imagine for things like line intersections, collisions, etc. I'm sure some software architects' heads will explode though :D
 
Joined
Jun 3, 2010
Messages
2,540 (0.50/day)
But for AI and machine learning this is advantageous



Quadros can handle FP64 fine. What's lacking is FP16
What about tensors? I think vectors count as rank 1 tensors, so we should be able to compare the two.
 
Joined
Apr 24, 2020
Messages
2,568 (1.74/day)
I'm pretty sure it was one of his usual rants, he does that sometimes. I too agree that AVX512 is definitely far from being useless, BUT, the availability as well as in the feature set per se, is far too fragmented, the point of Linus still holds, that Intel made a mess out of it.

Yeah, Linus definitely has a habit of ranting online and leaving his field of expertise. And to be fair: so do I. We're only human after all. It just means that you gotta be on guard and always critically read what Linus is saying. He's clearly a smart guy (probably smarter than me in most aspects of programming). But don't ever grow complacent.

AVX512's main issues are business related. It's locked out of mainstream Skylake chips (typical i7s), so it's not really a common compilation target. It was originally a Knights Landing feature (aka: Xeon Phi), which is a dead-end.

Yeah, that's the kind of stuff I've been doing, like mostly creating simple inline functions with vector and matrix maths, but not whole algorithms yet. But SIMD is very suited for algorithms designed in a data oriented approach, I imagine for things like line intersections, collisions, etc. I'm sure some software architects' heads will explode though :D

I suggest reading through this dissertation by the way: https://www.cs.cmu.edu/~guyb/papers/Ble90.pdf

Blelloch's dissertation from 1990 would seem out-of-date at first glance. But in reality, modern SIMD machines (both AVX512 and GPUs) are heavily based on the Connection Machine (CM-2) he used as the basis of his dissertation. As such, his dissertation reads amazingly close to modern machines.

Dr. Blelloch's more recent papers map more closely to modern machines: https://www.cs.cmu.edu/~guyb/

Just some food for thought. I wouldn't try to do the "flattened nested parallelism" from the top-down in every algorithm. It's unlikely to be fast on all modern architectures. But what's interesting is that Dr. Blelloch has proven an equivalence between recursive definitions and the prefix scan-operations. As such, we have a "universal gadget" to try to convert recursive forms of algorithms into prefix-sum, prefix-max, and similar operations.

Not that the gadget is always efficient on a modern SIMD machine. It's absolutely not... but maybe restating the problem in a prefix-sum style provides insight and gives you ideas for a more efficient algorithm.

---------

You don't have to go very far to be amazed. As early as Chapter 1, Dr. Blelloch converts recursive quicksort (yes, quicksort) into prefix-sum operations.
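
To give a flavor of the idea, here is a minimal sketch of a single quicksort-style partition step expressed with a prefix sum. This is not the algorithm from the dissertation (which uses segmented scans across all partitions at once); it is only the simplest instance of the "scan instead of data-dependent swaps" pattern. The names exclusive_scan and scan_partition are made up for this example, and the scan is written as a serial loop here, whereas a SIMD machine would run it in parallel.

```c
#include <stddef.h>
#include <stdint.h>

/* Exclusive prefix sum: out[i] = in[0] + ... + in[i-1]. Returns the total. */
size_t exclusive_scan(const size_t *in, size_t *out, size_t n)
{
    size_t sum = 0;
    for (size_t i = 0; i < n; i++) {
        out[i] = sum;
        sum += in[i];
    }
    return sum;
}

/* Stable partition of src around pivot into dst. flag and pos are
   caller-provided scratch arrays of length n. Returns the number of
   elements less than pivot. */
size_t scan_partition(const int32_t *src, int32_t *dst, int32_t pivot,
                      size_t *flag, size_t *pos, size_t n)
{
    /* Step 1: flag[i] is 1 when the element belongs to the "less" side. */
    for (size_t i = 0; i < n; i++)
        flag[i] = (size_t)(src[i] < pivot);

    /* Step 2: pos[i] counts the earlier elements that are < pivot. */
    size_t n_less = exclusive_scan(flag, pos, n);

    /* Step 3: every element computes its own destination independently.
       i - pos[i] counts the earlier elements that are >= pivot. */
    for (size_t i = 0; i < n; i++) {
        size_t idx_less = pos[i];
        size_t idx_rest = n_less + (i - pos[i]);
        dst[flag[i] ? idx_less : idx_rest] = src[i];
    }
    return n_less;
}
```

Steps 1 and 3 have no dependency between iterations, so all the sequential work is concentrated in the scan, which is exactly the primitive these machines are built to run fast.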
 
Joined
Jun 10, 2014
Messages
2,904 (0.80/day)
AVX512's main issues are business related. It's locked out of mainstream Skylake chips (typical i7s), so it's not really a common compilation target. It was originally a Knights Landing feature (aka: Xeon Phi), which is a dead-end.
It's important to remember that Intel's intention was to release Skylake-SP/X and Ice Lake (client) pretty close together. Coffee Lake(s) and Comet Lake were emergency backup plans. So if anything, their business failure is in failing to have a backported Sunny Cove etc. just in case 10nm failed. This AVX-512 inconsistency was never their intention, but still ultimately their "fault".

I suggest reading through this dissertation by the way: https://www.cs.cmu.edu/~guyb/papers/Ble90.pdf
…
Thanks.
Some good corona-times reading :)
 
Joined
Mar 6, 2017
Messages
3,212 (1.22/day)
Location
North East Ohio, USA
System Name My Ryzen 7 7700X Super Computer
Processor AMD Ryzen 7 7700X
Motherboard Gigabyte B650 Aorus Elite AX
Cooling DeepCool AK620 with Arctic Silver 5
Memory 2x16GB G.Skill Trident Z5 NEO DDR5 EXPO (CL30)
Video Card(s) XFX AMD Radeon RX 7900 GRE
Storage Samsung 980 EVO 1 TB NVMe SSD (System Drive), Samsung 970 EVO 500 GB NVMe SSD (Game Drive)
Display(s) Acer Nitro XV272U (DisplayPort) and Acer Nitro XV270U (DisplayPort)
Case Lian Li LANCOOL II MESH C
Audio Device(s) On-Board Sound / Sony WH-XB910N Bluetooth Headphones
Power Supply MSI A850GF
Mouse Logitech M705
Keyboard Steelseries
Software Windows 11 Pro 64-bit
Benchmark Scores https://valid.x86.fr/liwjs3
No it does not. Unless the CPU reaches a thermal or power limit, it will not throttle the whole CPU, it does not slow down all cores. Loads of applications use AVX to some extent in the background, including compression, web browsers and pretty much anything which deals with video.
Then tell me why there is an AVX Offset in UEFI?

If I understand the concept of the AVX Offset correctly, it's a setting that makes the processor down-clock from its highest speed by the configured amount while executing AVX instructions. With a setting of 5, the processor will down-clock by 500 MHz.
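
As a rough sketch of that arithmetic, assuming the offset is counted in multiplier steps and the base clock is 100 MHz (both are assumptions here, and the function name avx_clock_mhz is made up for this example):

```c
/* Hypothetical helper: effective clock in MHz while AVX code runs.
   Assumes the offset is subtracted from the core multiplier and that
   each multiplier step is worth bclk_mhz (typically 100 MHz). */
unsigned avx_clock_mhz(unsigned core_ratio, unsigned avx_offset, unsigned bclk_mhz)
{
    /* Guard against an offset larger than the ratio. */
    unsigned ratio = core_ratio > avx_offset ? core_ratio - avx_offset : 0;
    return ratio * bclk_mhz;
}
```

With a 50x multiplier and an offset of 5, this gives 45 x 100 MHz = 4500 MHz instead of 5000 MHz, i.e. the 500 MHz drop described above.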
 
Joined
Dec 16, 2017
Messages
2,731 (1.17/day)
Location
Buenos Aires, Argentina
System Name System V
Processor AMD Ryzen 5 3600
Motherboard Asus Prime X570-P
Cooling Cooler Master Hyper 212 // a bunch of 120 mm Xigmatek 1500 RPM fans (2 ins, 3 outs)
Memory 2x8GB Ballistix Sport LT 3200 MHz (BLS8G4D32AESCK.M8FE) (CL16-18-18-36)
Video Card(s) Gigabyte AORUS Radeon RX 580 8 GB
Storage SHFS37A240G / DT01ACA200 / WD20EZRX / MKNSSDTR256GB-3DL / LG BH16NS40 / ST10000VN0008
Display(s) LG 22MP55 IPS Display
Case NZXT Source 210
Audio Device(s) Logitech G430 Headset
Power Supply Corsair CX650M
Mouse Microsoft Trackball Optical 1.0
Keyboard HP Vectra VE keyboard (Part # D4950-63004)
Software Whatever build of Windows 11 is being served in Dev channel at the time.
Benchmark Scores Corona 1.3: 3120620 r/s Cinebench R20: 3355 FireStrike: 12490 TimeSpy: 4624
Then tell me why there is an AVX Offset in UEFI?

If I understand the concept of the AVX Offset correctly, it's a setting that if you set it at 5 the processor will down-clock from the highest speed by that setting. In the case of a setting of 5 the processor will down-clock by 500 MHz when executing AVX instructions.

Don't know about that AVX Offset thing in UEFI (I don't do overclocks, after all), but you may be referring to this:


It's documented behavior that Intel processors have different frequency sets according to whatever is running on it.

TLDR, it seems to affect only Turbo frequencies, in the first place, and how much it will downclock will depend on the type and number of instructions executed. AVX512 does trigger this throttling a bit more, while AVX and AVX2 do it less or don't even do so at all.
 
Joined
Aug 20, 2007
Messages
20,817 (3.41/day)
System Name Pioneer
Processor Ryzen R9 7950X
Motherboard GIGABYTE Aorus Elite X670 AX
Cooling Noctua NH-D15 + A whole lotta Sunon and Corsair Maglev blower fans...
Memory 64GB (4x 16GB) G.Skill Flare X5 @ DDR5-6000 CL30
Video Card(s) XFX RX 7900 XTX Speedster Merc 310
Storage 2x Crucial P5 Plus 2TB PCIe 4.0 NVMe SSDs
Display(s) 55" LG 55" B9 OLED 4K Display
Case Thermaltake Core X31
Audio Device(s) TOSLINK->Schiit Modi MB->Asgard 2 DAC Amp->AKG Pro K712 Headphones or HDMI->B9 OLED
Power Supply FSP Hydro Ti Pro 850W
Mouse Logitech G305 Lightspeed Wireless
Keyboard WASD Code v3 with Cherry Green keyswitches + PBT DS keycaps
Software Gentoo Linux x64 / Windows 11
Also, due to how hyperthreading only lets two threads run on a core tops, you'll rarely "slow down" an integer thread on the same core as an AVX instruction. Most of the time, it will rapidly downclock for AVX, execute that instruction with reduced clocks (and still better performance than if it hadn't), and then switch back and do whatever integer thing it was doing at full speed. No penalty. The only situation where there would be a penalty would be if it literally executed some kind of AVX and had TIME LEFT OVER (unlikely) to then execute an integer instruction, which would be forced to execute at the lower clock. This is exceedingly rare in practice, I'd picture.
 
Joined
Jun 10, 2014
Messages
2,904 (0.80/day)
Then tell me why there is an AVX Offset in UEFI?

If I understand the concept of the AVX Offset correctly, it's a setting that if you set it at 5 the processor will down-clock from the highest speed by that setting. In the case of a setting of 5 the processor will down-clock by 500 MHz when executing AVX instructions.
The claim was that any AVX code would impact any other code running on the CPU, and that's simply not the case. A single core can throttle with a lot of AVX, but the CPU runs AVX all the time without any problem.

The purpose of the AVX offset is for overclockers to push non-AVX workloads to a higher clock speed.

Also, due to how hyperthreading only lets two threads run on a core tops, you'll rarely "slow down" an integer thread on the same core as an AVX instruction. Most of the time, it will rapidly downclock for AVX, execute that instruction with reduced clocks (and still better performance than if it hadn't), and then switch back and do whatever integer thing it was doing at full speed. No penalty. The only situation where there would be a penalty would be if it literally executed some kind of AVX and had TIME LEFT OVER (unlikely) to then execute an integer instruction, which would be forced to execute at the lower clock. This is exceedingly rare in practice, I'd picture.
The CPUs are superscalar, so technically they can execute both integer instructions and vector instructions at the same time, and they often do. E.g. you have a loop with dense math: the math is AVX, but the loop is not. But it's not a problem, as the alternative would be to execute much more code, so even if a few instructions technically run slower, the overall workload is still a lot faster.

Running the same calculations as AVX greatly reduces the instruction count and the clock cycles needed. It also effectively unrolls the loops further, which again reduces the loop code and the branching associated with it. And denser code also helps data caches, instruction caches, data dependencies and branch prediction, as the logic is more dense.
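
A minimal sketch of such a loop, using the classic "a times x plus y" kernel as a stand-in for the dense math (the function name saxpy is just the conventional name for this kernel, and whether it is actually vectorized depends on the compiler and flags):

```c
#include <stddef.h>

/* y[i] = a * x[i] + y[i]. The index i and the loop branch are scalar
   integer work; the multiply-add on the arrays is the part a
   vectorizing compiler can turn into SIMD instructions. */
void saxpy(float a, const float *restrict x, float *restrict y, size_t n)
{
    for (size_t i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}
```

With 16 floats per 512-bit register, a vectorized version runs the loop counter and branch once per 16 elements instead of once per element, which is where the reduction in loop code comes from.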
 

Kanan

Tech Enthusiast & Gamer
Joined
Aug 22, 2015
Messages
3,517 (1.10/day)
Yeah, Linus definitely has a habit of ranting online and leaving his field of expertise. And to be fair: so do I. We're only human after all. It just means that you gotta be on guard and always critically read what Linus is saying. He's clearly a smart guy (probably smarter than me in most aspects of programming). But don't ever grow complacent
Linus Torvalds is well appreciated by me anyway. I respect people who are publicly bold, direct and honest; it is a rare trait. His most famous moment was when he gave Nvidia the middle finger at a conference, which was well deserved. Big companies must always be tested and questioned; they should not get a free pass, or they will always abuse it in the name of capitalism and their shareholders.
 
Joined
Mar 6, 2017
Messages
3,212 (1.22/day)
The claim was that any AVX code would impact any other code running on the CPU, and that's simply not the case. A single core can throttle with a lot of AVX, but the CPU runs AVX all the time without any problem.

The purpose of the AVX offset is for overclockers to push non-AVX workloads to a higher clock speed.
So, in other words, nothing to be alarmed about. It's there but it's not going to cause too many slowdowns unless your cooling setup is really that shitty.
 
Joined
Apr 24, 2020
Messages
2,568 (1.74/day)
Hmmm... I recall some very, very, very smart people discussing AVX512 downclocking / slowdown issues. I don't recall what they said about it however.

My perspective is that these microarchitectural issues (ie: downclocking or whatnot) will absolutely change by the next major "tick-tock" architecture from Intel. Intel's first implementation of any SIMD has always been crappy.

When AVX was first released, it was executed 128-bits at a time (Sandy Bridge). It was missing integer instructions: that's right, you could do 53-bit double-precision multiplies but you couldn't do 32-bit integer multiplies. All sorts of terrible. Eventually, Haswell + AVX2 came out and fixed the issues, finally making the AVX transition mostly worthwhile over SSE instructions. But all of the flamewars from the early 2010s about "is AVX worth it" look hopelessly outdated in today's environment.

I guess my point is... don't judge the AVX512 instruction set based on its current implementation (ie: Skylake-X). Skylake-X is clearly a "bad" implementation of AVX512. We should instead judge AVX512 based on its future viability. Focusing too much on Skylake-X's performance quirks will make our comments obsolete quicker.

-------------

Case in point: the CNS AVX512 chip (yeah, Via-chips. Surprise!!) can support AVX512 at full clock speeds. It does this by implementing all AVX512 instructions as 256-bit instructions executed over two clock ticks. No downclocking involved at all. Maybe this 2x256-bit methodology will be superior in the future, and Intel will copy it. Or maybe Intel figures out the 512-bit power issues and removes the need for downclocking.
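
Purely as a software model of that 2x256-bit idea (this says nothing about how the actual hardware is built, and the names add_256 and add_512_as_2x256 are made up for this example), a 16-lane add can be expressed as two passes over 8 lanes each:

```c
#include <stdint.h>

enum { LANES_512 = 16, LANES_256 = 8 };  /* counts of 32-bit lanes */

/* One 256-bit-wide pass: add eight 32-bit lanes. */
static void add_256(const uint32_t *a, const uint32_t *b, uint32_t *out)
{
    for (int i = 0; i < LANES_256; i++)
        out[i] = a[i] + b[i];
}

/* A 512-bit add modelled as two 256-bit passes, low half then high half. */
void add_512_as_2x256(const uint32_t a[LANES_512],
                      const uint32_t b[LANES_512],
                      uint32_t out[LANES_512])
{
    add_256(a, b, out);
    add_256(a + LANES_256, b + LANES_256, out + LANES_256);
}
```

The program still sees one 512-bit operation; only the execution width underneath changes, which is why the instruction set can be judged separately from any one implementation.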

Even as a 2x256-bit implementation, AVX512 has enough bonuses (auto-vectorization instructions, opcode masks, scatter instructions, extended register sets) that it's worthwhile to use.
 
Joined
Dec 16, 2017
Messages
2,731 (1.17/day)
I guess my point is... don't judge the AVX512 instruction set based on its current implementation (ie: Skylake-X). Skylake-X is clearly a "bad" implementation of AVX512. We should instead judge AVX512 based on its future viability. Focusing too much on Skylake-X's performance quirks will make our comments obsolete quicker.

That's what I'm looking forward to with AVX-512: seeing how Intel implements it in their next products and what improvements they make.
[Attached image: 20200713-001443.png]

And if that chart is correct, a larger subset available on more mainstream CPUs (not just top-of-the-line Extreme Edition CPUs or Xeons) could make it worthwhile for devs and programmers of all kinds of work to use it.
 
Joined
Jun 10, 2014
Messages
2,904 (0.80/day)
My perspective is that these microarchitectural issues (ie: downclocking or whatnot) will absolutely change by the next major "tick-tock" architecture from Intel. Intel's first implementation of any SIMD has always been crappy.
<snip>
Case in point: the CNS AVX512 chip (yeah, Via-chips. Surprise!!) can support AVX512 at full clock speeds. It does this by implementing all AVX512 instructions as 256-bit instructions executed over 2x clock ticks. No downclocking involved at all. Maybe this 2x256-bit methodology will be superior in the future, and Intel will copy it. Or maybe Intel figures out the 512-bit power issues and removes the need of downclocking.
Intel's power issues are probably related to the node. The AVX-512 units are pretty large, and need to be in sync. I assume at 10nm and 7nm the voltage needed will be less, and the power much more under control.

Via's decision to do it over two cycles probably has to do with saving die space. Zen(1) did something similar with AVX2.
That's what I'm looking forward about AVX-512. Seeing how Intel implements it in their next products and see what improvements they make.
View attachment 162210
And if that chart is correct, a larger subset available on more mainstream CPUs (not just top-of-the-line Extreme Edition CPUs or Xeons) could make it worthwhile for devs and programmers of all kinds of work to use it.
While those charts might look a bit intimidating, most of the common features are covered by the F and CD sets, and these also require the most die space.
BTW, you can see the massive list of instructions in the F set here.
 
Joined
Aug 20, 2007
Messages
20,817 (3.41/day)
The CPUs are superscalar, so technically they can execute both integer instructions and vector instructions at the same time, and they often do. E.g. you have a loop with dense math: the math is AVX, but the loop is not. But it's not a problem, as the alternative would be to execute much more code, so even if a few instructions technically run slower, the overall workload is still a lot faster.

Ah yes, you are correct, even if the conclusion is technically the same.
 