
AMD "Zen" Does Support FMA4, Just Not Exposed

btarunr

Editor & Senior Moderator
Staff member
Joined
Oct 9, 2007
Messages
41,281 (8.25/day)
Location
Hyderabad, India
Processor AMD Ryzen 7 2700X
Motherboard ASUS ROG Strix B450-E Gaming
Cooling AMD Wraith Prism
Memory 2x 16GB Corsair Vengeance LPX DDR4-3000
Video Card(s) Palit GeForce RTX 2080 SUPER GameRock
Storage Western Digital Black NVMe 512GB
Display(s) BenQ 1440p 60 Hz 27-inch
Case Corsair Carbide 100R
Audio Device(s) Creative Sound Blaster Recon3D PCIe
Power Supply Cooler Master MWE Gold 650W
Mouse ASUS ROG Strix Impact
Keyboard Microsoft Sidewinder X4
Software Windows 10 Pro
With its "Zen" CPU microarchitecture, AMD removed support for the FMA4 instruction set on paper, while retaining FMA3. Level1Techs discovered that "Zen" CPUs do support FMA4 instructions, even though the instruction set is not exposed to the operating system. FMA, or fused multiply-add, computes a × b + c in a single instruction, which is the core operation of linear algebra routines such as dot products and matrix multiplication. FMA3 and FMA4 are not generations of the instruction set (unlike SSE3 and SSE4); rather, the digit denotes the number of operands per instruction. AMD introduced FMA4 in 2011 with its "Bulldozer" FX-series processors and added FMA3 in 2012 with "Piledriver," while Intel added FMA3 support in 2013 with "Haswell."
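
To make "fused" concrete, here is a minimal Python sketch, not anything from Level1Techs or OpenBLAS. The helper names (`fma_reference`, `mul_then_add`, `dot_fma`) are made up for illustration, and exact fractions stand in for the hardware unit: the product and the sum are kept exact and rounded once at the end, which is the behaviour an FMA instruction is specified to have. It only handles finite inputs.

```python
from fractions import Fraction

def fma_reference(a: float, b: float, c: float) -> float:
    """a*b + c with a single rounding at the end (models a fused unit)."""
    # Fraction keeps the product and the sum exact; float() rounds once.
    return float(Fraction(a) * Fraction(b) + Fraction(c))

def mul_then_add(a: float, b: float, c: float) -> float:
    """Separate multiply and add: the product is rounded before the add."""
    return a * b + c

def dot_fma(xs, ys):
    """Dot product as a chain of fused steps, one per element pair."""
    acc = 0.0
    for x, y in zip(xs, ys):
        acc = fma_reference(x, y, acc)
    return acc

# Inputs chosen so the exact product (1 - 2**-60) does not fit in a double.
a = 1.0 + 2.0**-30
b = 1.0 - 2.0**-30
c = -1.0

fused = fma_reference(a, b, c)    # keeps the tiny -2**-60 term
unfused = mul_then_add(a, b, c)   # product rounds to 1.0 first, so 0.0
```

The same sketch applies to FMA3 and FMA4 equally, since the digit only describes how operands are encoded, not what arithmetic is done.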

The exact reasons why AMD deprecated FMA4 with "Zen" are unknown, but some developers speculate it's because AMD's implementation of FMA4 is buggy, even though it's more efficient (33% more throughput). Intel's adoption of FMA3 made it more popular, and hence more stable over the years. Level1Techs used an OpenBLAS FMA4 test program to confirm that feeding "Zen" processors FMA4 instructions doesn't return an "illegal instruction" error; instead, the processor goes ahead and completes the operation. This is interesting because FMA4 isn't advertised through a CPUID bit, so the operating system has no idea the processor even supports the instruction. For linear algebra, FMA4 has proven more efficient than AVX in both single- and double-precision.
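
For readers wondering what "not exposed as a CPUID bit" looks like in practice, here is a rough sketch in Python. The bit positions are the ones documented in the vendor manuals (FMA3 is ECX bit 12 of CPUID leaf 0x00000001, FMA4 is ECX bit 16 of extended leaf 0x80000001). The function name and the register values below are hypothetical and are not dumps from real hardware; reading the actual registers needs native code or an OS facility.

```python
FMA3_BIT = 12  # CPUID leaf 0x00000001, ECX bit 12 ("FMA")
FMA4_BIT = 16  # CPUID leaf 0x80000001, ECX bit 16 ("FMA4")

def reported_fma_support(ecx_leaf_1: int, ecx_leaf_80000001: int) -> dict:
    """Decode what the CPU *advertises*, given raw ECX values from CPUID."""
    return {
        "fma3": bool((ecx_leaf_1 >> FMA3_BIT) & 1),
        "fma4": bool((ecx_leaf_80000001 >> FMA4_BIT) & 1),
    }

# Hypothetical values: a chip that advertises FMA3 but leaves FMA4 clear,
# which is the situation the article describes for "Zen".
zen_like = reported_fma_support(ecx_leaf_1=1 << FMA3_BIT, ecx_leaf_80000001=0)

# Hypothetical values: a chip that advertises both.
both = reported_fma_support(ecx_leaf_1=1 << FMA3_BIT,
                            ecx_leaf_80000001=1 << FMA4_BIT)
```

Software that dispatches on these flags would never pick an FMA4 code path on the first kind of chip, whether or not the instructions happen to execute.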



 

FordGT90Concept

"I go fast!1!11!1!"
Joined
Oct 13, 2008
Messages
26,147 (5.64/day)
Location
IA, USA
System Name BY-2015
Processor Intel Core i7-6700K (4 x 4.00 GHz) w/ HT and Turbo on
Motherboard MSI Z170A GAMING M7
Cooling Scythe Kotetsu
Memory 2 x Kingston HyperX DDR4-2133 8 GiB
Video Card(s) Sapphire Radeon RX 5500 XT Pulse 8 GiB
Storage Crucial MX300 275 GB, Seagate Exos X12 TB 7200 RPM
Display(s) Nixeus NX-EDG274K (3840x2160@144 DP) + Samsung SyncMaster 906BW (1440x900@60 HDMI-DVI)
Case Coolermaster HAF 932 w/ USB 3.0 5.25" bay
Audio Device(s) Realtek ALC1150, Micca OriGen+
Power Supply Enermax Platimax 850w
Mouse SteelSeries Sensei RAW
Keyboard Tesoro Excalibur
Software Windows 10 Pro 64-bit
Benchmark Scores Faster than the tortoise; slower than the hare.
Maybe it has something to do with mitigating Spectre/Meltdown? FMA operations have to be cache-happy instructions.
 
Joined
Sep 9, 2013
Messages
520 (0.18/day)
System Name Can I run it
Processor delidded i9-9900K (P0) @ 5.0Ghz + NB @ 4.7Ghz ~1.2V (VR VOUT) 1.024V IO and SA + Koolance CPU-380i
Motherboard Asrock Z370 Taichi P4.20
Cooling HWlabs Nemesis GTS 360 and GTX 240 / 3x Gentle Typhoon AP-15 1850 RPM / 3x EK Vardar / SC600 pump
Memory 2x8GB 2133Mhz OEM SK Hynix RAM (dual ranks AFR) @ 3500Mhz 16-18-18-36-360-2T 1.408V with 60mm fan
Video Card(s) EVGA GTX 1080 Ti FTW3 + universal waterblock underclock @ 1721Mhz 0.8V 12488 mem(140W during gaming)
Storage Transcend PCIE 220S 1TB , Seagate Barracuda 4TB
Display(s) Acer XR341CK 3440x1440 75Hz G-Sync compatible calibrated by X-Rite i1 Display Pro Plus
Case NZXT H440 White modded front and top panel for better airflow.
Power Supply Corsair HX 750W 80+ Silver
Mouse Logitech G Pro Wireless
Keyboard Logitech G913 (GL clicky) for gaming , Ducky Shine 7 (Cherry MX red) for working
Software Windows 10 Enterprise 2016 LTSB (1607 inside)
Benchmark Scores 1838 FIDE rating (inactive since 2010). ~2150 Lichess.org blitz rating.
Actually, Intel added FMA3 with Haswell, not Ivy Bridge.
 
Joined
Oct 26, 2008
Messages
2,160 (0.47/day)
System Name Budget AMD System
Processor Threadripper 1900X @ 4.025Ghz (100x40.25 @ 1.325V)
Motherboard Gigabyte X399 Aorus Gaming 7
Cooling EKWB X399 Monoblock
Memory 4x8GB GSkill TridentZ RGB 12-12-13-32 CR1 @ 3200
Video Card(s) XFX Radeon RX Vega₆⁴ Liquid @ 1,800Mhz Core, 1100 HBM2
Storage 1x ADATA SX8200 NVMe, 1x Segate 2.5" FireCuda 2TB SATA, 1x 500GB HGST SATA
Display(s) Vizio 22" 1080p 60hz TV (Samsung Panel)
Case Corsair 570X
Audio Device(s) Onboard
Power Supply Seasonic X Series 850W KM3
Software Windows 10 Pro x64
Yep. Known this for a long time. Over a year now. lol. It's why we approved the move from our Bulldozer-based arch (servers) to Zen last year. lol
 
Joined
Nov 20, 2013
Messages
4,888 (1.76/day)
Location
Kiev, Ukraine
System Name WS#1337
Processor Ryzen 7 3800X
Motherboard ASUS X570-PLUS TUF Gaming
Cooling Xigmatek Scylla 240mm AIO
Memory 4x8GB G.Skill Ares OEM DDR4-3200 (B-die)
Video Card(s) GTX 1070 Ti
Storage Adata SX8200 Pro 1TB
Display(s) Samsung U24E590D (4K/UHD)
Case ghetto CM Cosmos RC-1000
Audio Device(s) ALC1220
Power Supply SeaSonic SSR-550FX (80+ GOLD)
Mouse Logitech G603
Keyboard Modecom Volcano Blade (Kailh choc LP)
Software Windows 10, Ubuntu 20.04 LTS
FordGT90Concept said:
Maybe it has something to do with mitigating Spectre/Meltdown? FMA operations have to be cache-happy instructions.

Nope. It was disabled right from the start, long before the Spectre/Meltdown conundrum, and even before the FMA3 issue was discovered and patched.
 

qubit

Overclocked quantum bit
Joined
Dec 6, 2007
Messages
16,563 (3.35/day)
Location
Quantum Well UK
System Name Quantumville™
Processor Intel Core i7-2700K @ 4GHz
Motherboard Asus P8Z68-V PRO/GEN3
Cooling Noctua NH-D14
Memory 16GB (2 x 8GB Corsair Vengeance Black DDR3 PC3-12800 C9 1600MHz)
Video Card(s) MSI RTX 2080 SUPER Gaming X Trio
Storage Samsung 850 Pro 256GB | WD Black 4TB | WD Blue 6TB
Display(s) BenQ XL2720Z (144Hz, 3D Vision 2, 1080p) | Asus MG28UQ (4K, 60Hz, FreeSync compatible)
Case Cooler Master HAF 922
Audio Device(s) Creative Sound Blaster X-Fi Fatal1ty PCIe
Power Supply Corsair HX 850W v1
Mouse Microsoft Intellimouse Pro - Black Shadow
Keyboard Yes
Software Windows 10 Pro 64-bit
btarunr said:
but some developers speculate it's because AMD's implementation of FMA4 is buggy, even though it's more efficient (33% more throughput).

33% faster is a massive difference, too much to be just due to better design. Therefore, I bet that if they fix the bug, throughput will be about the same as Intel's. All it takes is a missed flag or something small like that somewhere to affect it.
 

btarunr

qubit said:
33% faster is a massive difference, too much to be just due to better design. Therefore, I bet that if they fix the bug, throughput will be about the same as Intel's. All it takes is a missed flag or something small like that somewhere to affect it.

33% higher throughput for the simple reason that you can pack 4 operands per instruction versus 3.
 
Joined
Apr 12, 2013
Messages
4,204 (1.40/day)
qubit said:
33% faster is a massive difference, too much to be just due to better design. Therefore, I bet that if they fix the bug, throughput will be about the same as Intel's. All it takes is a missed flag or something small like that somewhere to affect it.

I don't remember it being buggy; it worked fine on PD (Piledriver), IIRC. It's just that Intel didn't include FMA4 in its chips after planning to do so:
The incompatibility between Intel's FMA3 and AMD's FMA4 is due to both companies changing plans without coordinating coding details with each other. AMD changed their plans from FMA3 to FMA4 while Intel changed their plans from FMA4 to FMA3 almost at the same time.
https://en.wikipedia.org/wiki/FMA_instruction_set
 

HTC

Joined
Apr 1, 2008
Messages
4,323 (0.90/day)
Location
Portugal
System Name HTC's System
Processor Ryzen 5 2600X
Motherboard Asrock Taichi X370
Cooling NH-C14, with the AM4 mounting kit
Memory G.Skill Kit 16GB DDR4 F4 - 3200 C16D - 16 GTZB
Video Card(s) Sapphire Nitro+ Radeon RX 480 OC 4 GB
Storage 1 Samsung NVMe 960 EVO 250 GB + 1 3.5" Seagate IronWolf Pro 6TB 7200RPM 256MB SATA III
Display(s) LG 27UD58
Case Fractal Design Define R6 USB-C
Audio Device(s) Onboard
Power Supply Corsair TX 850M 80+ Gold
Mouse Razer Deathadder Elite
Software Ubuntu 19.04 LTS
btarunr said:
The exact reasons why AMD deprecated FMA4 with "Zen" are unknown, but some developers speculate it's because AMD's implementation of FMA4 is buggy, even though it's more efficient (33% more throughput). Intel's adoption of FMA3 made it more popular, and hence more stable over the years. Level1Techs used an OpenBLAS FMA4 test program to confirm that feeding "Zen" processors FMA4 instructions doesn't return an "illegal instruction" error; instead, the processor goes ahead and completes the operation. This is interesting because FMA4 isn't advertised through a CPUID bit, so the operating system has no idea the processor even supports the instruction. For linear algebra, FMA4 has proven more efficient than AVX in both single- and double-precision.

If this is true and they manage to fix this ...

qubit said:
33% faster is a massive difference, too much to be just due to better design. Therefore, I bet that if they fix the bug, throughput will be about the same as Intel's. All it takes is a missed flag or something small like that somewhere to affect it.

If it turns out this is the reason, a triple facepalm won't be enough ...
 

qubit

btarunr said:
33% higher throughput for the simple reason that you can pack 4 operands per instruction versus 3.
Oh duh! This is what happens when I multitask TPU with work. :laugh:

I took it as being 33% faster than Intel's version. My bad.

HTC said:
If it turns out this is the reason, a triple facepalm won't be enough ...
Sorry bud, I had a comprehension error 101 lol.
 
Joined
Jun 12, 2017
Messages
91 (0.06/day)
btarunr said:
33% higher throughput for the simple reason that you can pack 4 operands per instruction versus 3.

OH NO MY GOD, btarunr, STOP THIS NONSENSE PLEASE.
There's a wiki article out there that explains the difference between 4-operand and 3-operand. FMA4 and FMA3 both do just one job: compute 'd=a*b+c'. The only difference is that FMA4 stores the result 'd' in a separate register specified in the instruction, while FMA3 stores it by overwriting one of the three input registers. THEY DO THE SAME THING, just with different ways of handling the result.

FMA4 has the advantage of programming flexibility, meaning there's more room for optimization, since the output and inputs do not interfere. But that room will never be anywhere near 33%. If you write x86 assembly code, you will understand. However, FMA3 uses fewer transistors and is easier to implement (meaning you can design it with less latency on silicon), so Intel jumped ship from FMA4 and chose FMA3.

I don't write BLAS code, to be honest. But I do think well-optimized FMA3 isn't at much of a disadvantage, because if the flexibility is not well utilized, FMA4 processors will be held back by their chunkier and slower units.

Edit:
I just came up with a great analogy for this. We can call x86's ADD 'ADD2' and ARM's ADD 'ADD3'. If you write 'ADD A,B' in x86, it stores the result in A, meaning A=A+B. If you write 'ADD A,B,C' in ARM, it is good old 'A=B+C'.
Sure ADD3 is more flexible, but I don't think ARM has 50% more throughput than x86.
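
The point in this post can be sketched with a toy register model in Python. Everything here is hypothetical: the `fma3` and `fma4` functions and the `r0`..`r3` names are illustrative, `fma3` models only one of the real FMA3 operand orderings, and plain Python arithmetic is used with small values so rounding does not matter. It is a sketch of the operand semantics, not of any real encoding or timing.

```python
FLOPS_PER_FMA = 2  # one multiply + one add per element, FMA3 and FMA4 alike

def fma3(regs, dst_src, b, c):
    """3-operand form: dst_src is an input AND the destination (overwritten)."""
    regs[dst_src] = regs[dst_src] * regs[b] + regs[c]

def fma4(regs, dst, a, b, c):
    """4-operand form: the destination is separate, all inputs survive."""
    regs[dst] = regs[a] * regs[b] + regs[c]

# FMA3: same arithmetic, but the original r0 is gone afterwards.
regs3 = {"r0": 2.0, "r1": 3.0, "r2": 4.0}
fma3(regs3, "r0", "r1", "r2")         # r0 = 2*3 + 4

# FMA4: result lands in r3, r0 is still available.
regs4 = {"r0": 2.0, "r1": 3.0, "r2": 4.0, "r3": 0.0}
fma4(regs4, "r3", "r0", "r1", "r2")   # r3 = 2*3 + 4

# FMA3 when the input must be kept: one extra register copy first.
regs3_keep = {"r0": 2.0, "r1": 3.0, "r2": 4.0, "r3": 0.0}
regs3_keep["r3"] = regs3_keep["r0"]   # the extra move FMA4 avoids
fma3(regs3_keep, "r3", "r1", "r2")
```

Both forms do two floating-point operations per element per instruction; in this model the only cost of the 3-operand form is the occasional extra copy when an input has to be preserved.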
 