
AMD "Zen" Does Support FMA4, Just Not Exposed

btarunr

Editor & Senior Moderator
Staff member
Joined
Oct 9, 2007
Messages
46,343 (7.68/day)
Location
Hyderabad, India
System Name RBMK-1000
Processor AMD Ryzen 7 5700G
Motherboard ASUS ROG Strix B450-E Gaming
Cooling DeepCool Gammax L240 V2
Memory 2x 8GB G.Skill Sniper X
Video Card(s) Palit GeForce RTX 2080 SUPER GameRock
Storage Western Digital Black NVMe 512GB
Display(s) BenQ 1440p 60 Hz 27-inch
Case Corsair Carbide 100R
Audio Device(s) ASUS SupremeFX S1220A
Power Supply Cooler Master MWE Gold 650W
Mouse ASUS ROG Strix Impact
Keyboard Gamdias Hermes E2
Software Windows 11 Pro
With its "Zen" CPU microarchitecture, AMD removed support for the FMA4 instruction set on paper, while retaining FMA3. Level1Techs discovered that "Zen" CPUs do support FMA4 instructions, even though the instruction set is not exposed to the operating system. FMA, or fused multiply-add, computes a*b+c in a single instruction with a single rounding step, which makes it an efficient building block for linear algebra. FMA3 and FMA4 are not generations of the instruction set (unlike SSE3 and SSE4); rather, the digit denotes the number of operands per instruction. AMD introduced FMA4 in 2011 with its "Bulldozer" FX-series processors and added FMA3 in 2012 with "Piledriver," while Intel added FMA3 support in 2013 with "Haswell."
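
To make "fused" concrete, here is a minimal Python sketch of why one rounding differs from two. It does not execute a hardware FMA instruction; the fused result is emulated with exact rational arithmetic, and the function names and input values are chosen purely for illustration.

```python
from fractions import Fraction

def unfused_multiply_add(a, b, c):
    """Separate multiply and add: the product is rounded, then the sum is rounded."""
    return a * b + c

def fused_multiply_add(a, b, c):
    """Emulated FMA: compute a*b + c exactly, then round once to a double."""
    exact = Fraction(a) * Fraction(b) + Fraction(c)
    return float(exact)

# Inputs picked so the exact product (1 - 2**-60) does not fit in a double.
a = 1.0 + 2.0**-30
b = 1.0 - 2.0**-30
c = -1.0

print(unfused_multiply_add(a, b, c))  # product rounds to 1.0 first, so the tiny term is lost
print(fused_multiply_add(a, b, c))    # single rounding keeps the -2**-60 term
```

Both FMA3 and FMA4 give the single-rounding behaviour; they differ only in how operands are encoded.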

The exact reasons why AMD deprecated FMA4 with "Zen" are unknown, but some developers speculate it's because AMD's implementation of FMA4 is buggy, even though it's more efficient (33% more throughput). Intel's adoption of FMA3 made it more popular, and hence more stable over the years. Level1Techs used an OpenBLAS FMA4 test program to confirm that feeding "Zen" processors FMA4 instructions doesn't return an "illegal instruction" error; instead, the processor goes ahead and completes the operation. This is interesting because FMA4 isn't exposed as a CPUID bit, and the operating system has no idea the processor even supports the instruction. For linear algebra, FMA4 has proven more efficient than AVX in both single- and double-precision.
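
For anyone who wants to see what their own system advertises, here is a rough sketch. It assumes a Linux machine, where the kernel lists CPUID-derived feature flags in /proc/cpuinfo and names them "fma" (FMA3) and "fma4". The helper names are made up for this example. Note that it only reports what is advertised; as the article describes, a missing flag does not by itself prove the instruction will fault.

```python
def cpu_flags(cpuinfo_text):
    """Return the feature flags from the first 'flags' line of /proc/cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()

def fma_support(cpuinfo_text):
    """Report whether FMA3 ('fma') and FMA4 ('fma4') are advertised."""
    flags = cpu_flags(cpuinfo_text)
    return {"fma3": "fma" in flags, "fma4": "fma4" in flags}

if __name__ == "__main__":
    try:
        with open("/proc/cpuinfo") as f:
            print(fma_support(f.read()))
    except OSError:
        # Not Linux, or /proc is unavailable.
        print("no /proc/cpuinfo on this system")
```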



 

FordGT90Concept

"I go fast!1!11!1!"
Joined
Oct 13, 2008
Messages
26,259 (4.64/day)
Location
IA, USA
System Name BY-2021
Processor AMD Ryzen 7 5800X (65w eco profile)
Motherboard MSI B550 Gaming Plus
Cooling Scythe Mugen (rev 5)
Memory 2 x Kingston HyperX DDR4-3200 32 GiB
Video Card(s) AMD Radeon RX 7900 XT
Storage Samsung 980 Pro, Seagate Exos X20 TB 7200 RPM
Display(s) Nixeus NX-EDG274K (3840x2160@144 DP) + Samsung SyncMaster 906BW (1440x900@60 HDMI-DVI)
Case Coolermaster HAF 932 w/ USB 3.0 5.25" bay + USB 3.2 (A+C) 3.5" bay
Audio Device(s) Realtek ALC1150, Micca OriGen+
Power Supply Enermax Platimax 850w
Mouse Nixeus REVEL-X
Keyboard Tesoro Excalibur
Software Windows 10 Home 64-bit
Benchmark Scores Faster than the tortoise; slower than the hare.
Maybe it has something to do with mitigating Spectre/Meltdown? FMA operations have to be cache-happy instructions.
 
Joined
Sep 9, 2013
Messages
527 (0.14/day)
System Name Can I run it
Processor delidded i9-10900KF @ AI OC 3x5.3 10x5.1+Supercool direct die waterblock
Motherboard ASUS Maximus XII Apex 2801 BIOS
Cooling Main = GTS 360 GTX 240, EK PE 360,XSPC EX 360,2x EK-XRES 100 Revo D5 PWM, 12x T30, AC High Flow Next
Memory 2x16GB TridentZ 3600@4600 16-16-16-36@1.61V+EK Monarch, Separate loop with GTS 120&Freezemod DDC
Video Card(s) Gigabyte RTX 3080 Ti Gaming OC @ 0.85V 1950Mhz core 21Gbps mem + Barrow full cover waterblock
Storage Transcend PCIE 220S 1TB (main), WD Blue 3D NAND 250GB for OC testing, Seagate Barracuda 4TB
Display(s) Samsung Odyssey OLED G9 5120x1440 240Hz calibrated by X-Rite i1 Display Pro Plus
Case Thermaltake View 71
Audio Device(s) Q Acoustics M20 HD with Q Acoustics QB12
Power Supply Silverstone ST-1200 PTS 1200W 80+ Platinum
Mouse Logitech G Pro Wireless
Keyboard Logitech G913 (GL Linear)
Software Windows 11
Actually, Intel added FMA3 with Haswell, not Ivy Bridge.
 
Joined
Oct 26, 2008
Messages
2,243 (0.40/day)
System Name Budget AMD System
Processor Threadripper 1900X @ 4.1Ghz (100x41 @ 1.3250V)
Motherboard Gigabyte X399 Aorus Gaming 7
Cooling EKWB X399 Monoblock
Memory 4x8GB GSkill TridentZ RGB 14-14-14-32 CR1 @ 3266
Video Card(s) XFX Radeon RX Vega₆⁴ Liquid @ 1,800Mhz Core, 1025Mhz HBM2
Storage 1x ADATA SX8200 NVMe, 1x Segate 2.5" FireCuda 2TB SATA, 1x 500GB HGST SATA
Display(s) Vizio 22" 1080p 60hz TV (Samsung Panel)
Case Corsair 570X
Audio Device(s) Onboard
Power Supply Seasonic X Series 850W KM3
Software Windows 10 Pro x64
Yep. Known this for a long time. Over a year now. lol. It's why we approved the move from our Bulldozer-based arch (servers) to Zen last year. lol
 

silentbogo

Moderator
Staff member
Joined
Nov 20, 2013
Messages
5,473 (1.44/day)
Location
Kyiv, Ukraine
System Name WS#1337
Processor Ryzen 7 3800X
Motherboard ASUS X570-PLUS TUF Gaming
Cooling Xigmatek Scylla 240mm AIO
Memory 4x8GB Samsung DDR4 ECC UDIMM
Video Card(s) Inno3D RTX 3070 Ti iChill
Storage ADATA Legend 2TB + ADATA SX8200 Pro 1TB
Display(s) Samsung U24E590D (4K/UHD)
Case ghetto CM Cosmos RC-1000
Audio Device(s) ALC1220
Power Supply SeaSonic SSR-550FX (80+ GOLD)
Mouse Logitech G603
Keyboard Modecom Volcano Blade (Kailh choc LP)
VR HMD Google dreamview headset(aka fancy cardboard)
Software Windows 11, Ubuntu 20.04 LTS
Maybe it has something to do with mitigating Spectre/Meltdown? FMA operations have to be cache-happy instructions.
Nope. It was disabled right from the start, long before the Spectre/Meltdown conundrum, and even before the FMA3 issue was discovered and patched.
 

qubit

Overclocked quantum bit
Joined
Dec 6, 2007
Messages
17,865 (2.99/day)
Location
Quantum Well UK
System Name Quantumville™
Processor Intel Core i7-2700K @ 4GHz
Motherboard Asus P8Z68-V PRO/GEN3
Cooling Noctua NH-D14
Memory 16GB (2 x 8GB Corsair Vengeance Black DDR3 PC3-12800 C9 1600MHz)
Video Card(s) MSI RTX 2080 SUPER Gaming X Trio
Storage Samsung 850 Pro 256GB | WD Black 4TB | WD Blue 6TB
Display(s) ASUS ROG Strix XG27UQR (4K, 144Hz, G-SYNC compatible) | Asus MG28UQ (4K, 60Hz, FreeSync compatible)
Case Cooler Master HAF 922
Audio Device(s) Creative Sound Blaster X-Fi Fatal1ty PCIe
Power Supply Corsair AX1600i
Mouse Microsoft Intellimouse Pro - Black Shadow
Keyboard Yes
Software Windows 10 Pro 64-bit
but some developers speculate it's because AMD's implementation of FMA4 is buggy, even though it's more efficient (33% more throughput).
33% faster is a massive difference, too much to be just due to better design. Therefore, I bet that if they fix the bug, throughput will be about the same as Intel's. All it takes is a missed flag or something small like that somewhere to affect it.
 

btarunr

33% faster is a massive difference, too much to be just due to better design. Therefore, I bet that if they fix the bug, throughput will be about the same as Intel's. All it takes is a missed flag or something small like that somewhere to affect it.

33% higher throughput for the simple reason that you can pack 4 operands per instruction versus 3.
 
Joined
Apr 12, 2013
Messages
6,740 (1.68/day)
33% faster is a massive difference, too much to be just due to better design. Therefore, I bet that if they fix the bug, throughput will be about the same as Intel's. All it takes is a missed flag or something small like that somewhere to affect it.
I don't remember it being buggy; it worked fine on PD, IIRC. It's just that Intel didn't include FMA4 in their chips, even though they had been planning to ~
The incompatibility between Intel's FMA3 and AMD's FMA4 is due to both companies changing plans without coordinating coding details with each other. AMD changed their plans from FMA3 to FMA4 while Intel changed their plans from FMA4 to FMA3 almost at the same time.
https://en.wikipedia.org/wiki/FMA_instruction_set
 

HTC

Joined
Apr 1, 2008
Messages
4,604 (0.79/day)
Location
Portugal
System Name HTC's System
Processor Ryzen 5 2600X
Motherboard Asrock Taichi X370
Cooling NH-C14, with the AM4 mounting kit
Memory G.Skill Kit 16GB DDR4 F4 - 3200 C16D - 16 GTZB
Video Card(s) Sapphire Nitro+ Radeon RX 480 OC 4 GB
Storage 1 Samsung NVMe 960 EVO 250 GB + 1 3.5" Seagate IronWolf Pro 6TB 7200RPM 256MB SATA III
Display(s) LG 27UD58
Case Fractal Design Define R6 USB-C
Audio Device(s) Onboard
Power Supply Corsair TX 850M 80+ Gold
Mouse Razer Deathadder Elite
Software Ubuntu 19.04 LTS
btarunr said:
The exact reasons why AMD deprecated FMA4 with "Zen" are unknown, but some developers speculate it's because AMD's implementation of FMA4 is buggy, even though it's more efficient (33% more throughput). Intel's adoption of FMA3 made it more popular, and hence more stable over the years. Level1Techs used an OpenBLAS FMA4 test program to confirm that feeding "Zen" processors FMA4 instructions doesn't return an "illegal instruction" error; instead, the processor goes ahead and completes the operation. This is interesting because FMA4 isn't exposed as a CPUID bit, and the operating system has no idea the processor even supports the instruction. For linear algebra, FMA4 has proven more efficient than AVX in both single- and double-precision.

If this is true and they manage to fix this ...

33% faster is a massive difference, too much to be just due to better design. Therefore, I bet that if they fix the bug, throughput will be about the same as Intel's. All it takes is a missed flag or something small like that somewhere to affect it.

If it turns out this is the reason, a triple facepalm won't be enough ...
 

qubit

33% higher throughput for the simple reason that you can pack 4 operands per instruction versus 3.
Oh duh! This is what happens when I multitask TPU with work. :laugh:

I took it as being 33% faster than Intel's version. My bad.

If it turns out this is the reason, a triple facepalm won't be enough ...
Sorry bud, I had a comprehension error 101 lol.
 
Joined
Jun 12, 2017
Messages
136 (0.05/day)
33% higher throughput for the simple reason that you can pack 4 operands per instruction versus 3.
OH NO MY GOD, btarunr, STOP THIS NONSENSE PLEASE.
There's a wiki article out there that explains the difference between 4-operand and 3-operand FMA. FMA4 and FMA3 both do just one job: compute 'd=a*b+c'. The only difference is that FMA4 stores the result 'd' in a new register specified in the instruction, while FMA3 stores it by overwriting one of the three input registers. THEY DO THE SAME THING, just with different ways of handling the result.

FMA4 has the advantage of programming flexibility, meaning there's more room for optimization, since the output and inputs do not interfere. But that room will never be anywhere near 33%. If you write x86 assembly code, you will understand. However, FMA3 uses fewer transistors and is easier to implement (meaning you can design it with less latency on silicon), so Intel jumped ship from FMA4 and chose FMA3.
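
A toy model of the point above, for those who don't write assembly. This is only a sketch in Python, not real x86 encoding: the register names are hypothetical, and the three-operand helper is loosely modelled on the form where the destination doubles as the addend. It shows that both forms compute the same value, and that FMA3 only costs an extra register copy when the overwritten input is still needed afterwards.

```python
def fma4(regs, dst, a, b, c):
    """Four-operand form: dst = a*b + c, all three inputs left untouched."""
    regs[dst] = regs[a] * regs[b] + regs[c]

def fma3(regs, dst_and_c, a, b):
    """Three-operand form: dst = a*b + dst, so the addend is overwritten."""
    regs[dst_and_c] = regs[a] * regs[b] + regs[dst_and_c]

def run_fma4():
    regs = {"r0": 2.0, "r1": 3.0, "r2": 4.0, "r3": 0.0}
    fma4(regs, "r3", "r0", "r1", "r2")
    return regs, 1  # one instruction issued

def run_fma3_keeping_inputs():
    regs = {"r0": 2.0, "r1": 3.0, "r2": 4.0, "r3": 0.0}
    regs["r3"] = regs["r2"]  # extra move, needed only because r2 must survive
    fma3(regs, "r3", "r0", "r1")
    return regs, 2  # move + FMA

def run_fma3_accumulate():
    # Typical dot-product style loop body: the accumulator is meant to be overwritten.
    regs = {"r0": 2.0, "r1": 3.0, "r2": 4.0}
    fma3(regs, "r2", "r0", "r1")
    return regs, 1  # no extra move needed
```

In accumulation-heavy code like matrix multiply, the last case is the common one, which is why the practical gap is small.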

I don't write BLAS code, to be honest. But I do think well-optimized FMA3 isn't at much of a disadvantage, because if the flexibility is not well utilized, FMA4 processors will be held back by their chunkier and slower units.

Edit:
I just came up with a great analogy for this. We can call x86's ADD 'ADD2' and ARM's ADD 'ADD3'. If you write 'ADD A,B' in x86, it stores the result in A, meaning A=A+B. If you write 'ADD A,B,C' in ARM, it is good old 'A=B+C'.
Sure ADD3 is more flexible, but I don't think ARM has 50% more throughput than x86.
 