Sunday, October 14th 2018

AMD "Zen" Does Support FMA4, Just Not Exposed

With its "Zen" CPU microarchitecture, AMD removed support for the FMA4 instruction set, on paper, while retaining FMA3. Level1Techs discovered that "Zen" CPUs do support FMA4 instructions, even though the instruction set is not exposed to the operating system. FMA, or fused multiply-add, computes a multiplication and an addition (a × b + c) in a single instruction, which makes it an efficient building block for linear algebra. FMA3 and FMA4 are not generations of the instruction set (unlike SSE3 and SSE4); rather, the digit denotes the number of operands per instruction. Support for both was introduced by AMD in 2012 with its FX-series processors, while Intel added FMA3 support in 2013 with "Haswell."
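To make the linear-algebra connection concrete, here is a minimal Python sketch of a dot product written as a chain of multiply-add steps. It is an emulation, not hardware FMA: the helper names `fma` and `dot` are made up for this illustration, and the sketch assumes the usual IEEE 754 meaning of "fused" (the product and sum are computed exactly and rounded once), which the article itself does not spell out.

```python
from fractions import Fraction


def fma(a, b, c):
    """Emulate a fused multiply-add for finite floats.

    The product and sum are computed exactly with rationals and
    rounded to a float only once at the end (assumed IEEE 754 behavior).
    """
    return float(Fraction(a) * Fraction(b) + Fraction(c))


def dot(xs, ys):
    """Dot product as a chain of multiply-add steps: acc = x * y + acc."""
    acc = 0.0
    for x, y in zip(xs, ys):
        acc = fma(x, y, acc)  # one multiply-add per element pair
    return acc


# Inputs chosen so that the separate multiply rounds away the low bits.
A = 1.0 + 2.0 ** -30
B = 1.0 - 2.0 ** -30

if __name__ == "__main__":
    print(dot([1.0, 2.0, 3.0], [4.0, 5.0, 6.0]))
    print("fused:   ", fma(A, B, -1.0))
    print("unfused: ", A * B - 1.0)
```

Each loop iteration maps onto one multiply-add, which is why matrix and vector kernels such as those in OpenBLAS lean so heavily on these instructions. The last two lines show the numerical side effect of fusing: the separate multiply and add lose the small term that the single-rounding version keeps.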

The exact reasons why AMD deprecated FMA4 with "Zen" are unknown, but some developers speculate it's because AMD's implementation of FMA4 is buggy, even though it's more efficient (33% more throughput). Intel's adoption of FMA3 made it more popular, and hence more stable over the years. Level1Techs used an OpenBLAS FMA4 test program to confirm that feeding "Zen" processors FMA4 instructions does not return an "illegal instruction" error; instead, the processor goes ahead and completes the operation. This is interesting because FMA4 isn't exposed as a CPUID bit, and the operating system has no idea the processor even supports the instruction. For linear algebra, FMA4 has proven more efficient than AVX in both single- and double-precision.
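For readers who want to see what "not exposed" looks like from software, the following Python sketch parses `/proc/cpuinfo`-style text and reports which FMA flags the CPU advertises. It assumes a Linux system, where the kernel lists FMA3 as `fma` and FMA4 as `fma4`; the function names and the two sample flag lines are illustrative and abbreviated, not captured from real hardware.

```python
def parse_cpu_flags(cpuinfo_text):
    """Return the feature flags from the first 'flags' line of cpuinfo-style text."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()


def advertised_fma_support(cpuinfo_text):
    """Report which FMA variants are advertised (not which ones actually execute)."""
    flags = parse_cpu_flags(cpuinfo_text)
    return {"fma3": "fma" in flags, "fma4": "fma4" in flags}


# Hypothetical, abbreviated flag lines for illustration only.
SAMPLE_ZEN_LIKE = "processor\t: 0\nflags\t\t: fpu sse4_2 avx avx2 fma\n"
SAMPLE_FX_LIKE = "processor\t: 0\nflags\t\t: fpu sse4_2 avx fma fma4 xop\n"

if __name__ == "__main__":
    print("Zen-like:", advertised_fma_support(SAMPLE_ZEN_LIKE))
    print("FX-like: ", advertised_fma_support(SAMPLE_FX_LIKE))
    # On a Linux machine, the real file could be read instead:
    # with open("/proc/cpuinfo") as f:
    #     print(advertised_fma_support(f.read()))
```

A check like this only reflects what the processor advertises. Software that selects code paths this way would never try FMA4 on "Zen," which is why the behavior went unnoticed until someone executed the instructions directly.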
Sources: Level1Techs (YouTube), Agner's CPU Blog

10 Comments on AMD "Zen" Does Support FMA4, Just Not Exposed

#1
FordGT90Concept
"I go fast!1!11!1!"
Maybe it has something to do with mitigating Spectre/Meltdown? FMA operations have to be cache-happy instructions.
#2
Enterprise24
Actually, Intel added FMA3 with Haswell, not Ivy Bridge.
#3
Flyordie
Yep. Known this for a long time. Over a year now. lol. It's why we approved the move from our Bulldozer-based arch (servers) to Zen last year. lol
#4
silentbogo
FordGT90Concept said:
Maybe it has something to do with mitigating Spectre/Meltdown? FMA operations have to be cache-happy instructions.
Nope. It was disabled right from the start, long before the Spectre/Meltdown conundrum, and even before the FMA3 issue was discovered and patched.
#5
qubit
Overclocked quantum bit
btarunr said:
but some developers speculate it's because AMD's implementation of FMA4 is buggy, even though it's more efficient (33% more throughput).
33% faster is a massive difference, too much to be just due to better design. Therefore, I bet that if they fix the bug, throughput will be about the same as Intel's. All it takes is a missed flag or something small like that somewhere to affect it.
#6
btarunr
Editor & Senior Moderator
qubit said:
33% faster is a massive difference, too much to be just due to better design. Therefore, I bet that if they fix the bug, throughput will be about the same as Intel's. All it takes is a missed flag or something small like that somewhere to affect it.
33% higher throughput for the simple reason that you can pack 4 operands per instruction versus 3.
#7
R0H1T
qubit said:
33% faster is a massive difference, too much to be just due to better design. Therefore, I bet that if they fix the bug, throughput will be about the same as Intel's. All it takes is a missed flag or something small like that somewhere to affect it.
I don't remember it being buggy, it worked fine on PD IIRC. It's just that Intel didn't include FMA4 in their chips when they were planning to do it ~
The incompatibility between Intel's FMA3 and AMD's FMA4 is due to both companies changing plans without coordinating coding details with each other. AMD changed their plans from FMA3 to FMA4 while Intel changed their plans from FMA4 to FMA3 almost at the same time.
https://en.wikipedia.org/wiki/FMA_instruction_set
#8
HTC
btarunr said:
The exact reasons why AMD deprecated FMA4 with "Zen" are unknown, but some developers speculate it's because AMD's implementation of FMA4 is buggy, even though it's more efficient (33% more throughput). Intel's adoption of FMA3 made it more popular, and hence more stable over the years. Level1Techs used an OpenBLAS FMA4 test program to confirm that feeding "Zen" processors FMA4 instructions does not return an "illegal instruction" error; instead, the processor goes ahead and completes the operation. This is interesting because FMA4 isn't exposed as a CPUID bit, and the operating system has no idea the processor even supports the instruction. For linear algebra, FMA4 has proven more efficient than AVX in both single- and double-precision.
If this is true and they manage to fix this ...

qubit said:
33% faster is a massive difference, too much to be just due to better design. Therefore, I bet that if they fix the bug, throughput will be about the same as Intel's. All it takes is a missed flag or something small like that somewhere to affect it.
If it turns out this is the reason, a triple facepalm won't be enough ...
#9
qubit
Overclocked quantum bit
btarunr said:
33% higher throughput for the simple reason that you can pack 4 operands per instruction versus 3.
Oh duh! This is what happens when I multitask TPU with work. :laugh:

I took it as being 33% faster than Intel's version. My bad.

HTC said:
If it turns out this is the reason, a triple facepalm won't be enough ...
Sorry bud, I had a comprehension error 101 lol.
#10
First Strike
btarunr said:
33% higher throughput for the simple reason that you can pack 4 operands per instruction versus 3.
OH NO MY GOD, btarunr, STOP THIS NONSENSE PLEASE.
There's a wiki out there that explains the difference between 4-operand and 3-operand. FMA4 and FMA3 both do just one job: compute 'd=a*b+c'. The only difference is that FMA4 stores the result 'd' in a new register which is specified in the instruction, while FMA3 stores it by overwriting one of the three input registers. THEY DO THE SAME THING, just with different ways of handling the result.

FMA4 has the advantage of programming flexibility, meaning there's more room for optimization, since the output and inputs do not interfere. But the gain will never be anywhere near 33%. If you write x86 assembly code, you will understand. However, FMA3 uses fewer transistors and is easier to implement (meaning you can design it with less latency on silicon), so Intel jumped ship from FMA4 and chose FMA3.

I don't write BLAS code, to be honest. But I do think well-optimized FMA3 doesn't have much of a disadvantage, because if the flexibility is not well utilized, then FMA4 processors will be held back by their chunkier and slower units.

Edit:
I just came up with a great analogy for this. We can call x86's ADD 'ADD2' and ARM's ADD 'ADD3'. If you write 'ADD A,B' in x86, then it stores the result in A, meaning A=A+B. If you write 'ADD A,B,C' in ARM, then it is good old 'A=B+C'.
Sure ADD3 is more flexible, but I don't think ARM has 50% more throughput than x86.
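
The distinction drawn in this comment can be sketched with a toy register-file model in Python. This is an illustration under stated assumptions, not a model of real hardware: the functions `fma3`, `fma4`, `mov`, and `run` and the register names are invented here, and `fma3` models only one operand ordering (the destination doubles as the addend), whereas real FMA3 offers several orderings.

```python
def fma4(regs, dst, a, b, c):
    """Four-operand form: dst = a * b + c, with a separate destination."""
    regs[dst] = regs[a] * regs[b] + regs[c]


def fma3(regs, dst_and_addend, a, b):
    """Three-operand form (one assumed ordering): dst = a * b + dst."""
    regs[dst_and_addend] = regs[a] * regs[b] + regs[dst_and_addend]


def mov(regs, dst, src):
    """Plain register copy."""
    regs[dst] = regs[src]


def run(program, regs):
    """Execute a list of (op, *args) tuples on a copy of regs.

    Returns the final register state and the instruction count.
    """
    regs = dict(regs)
    for op, *args in program:
        op(regs, *args)
    return regs, len(program)


START = {"r0": 2.0, "r1": 3.0, "r2": 4.0, "r3": 0.0}

# Goal 1: r3 = r0 * r1 + r2 while keeping r0, r1 and r2 intact.
KEEP_INPUTS_FMA4 = [(fma4, "r3", "r0", "r1", "r2")]
KEEP_INPUTS_FMA3 = [(mov, "r3", "r2"), (fma3, "r3", "r0", "r1")]

# Goal 2: accumulate in place, r2 = r0 * r1 + r2 (the dot-product pattern).
ACCUMULATE_FMA4 = [(fma4, "r2", "r0", "r1", "r2")]
ACCUMULATE_FMA3 = [(fma3, "r2", "r0", "r1")]

if __name__ == "__main__":
    for name, prog in [("keep inputs, FMA4", KEEP_INPUTS_FMA4),
                       ("keep inputs, FMA3", KEEP_INPUTS_FMA3),
                       ("accumulate, FMA4", ACCUMULATE_FMA4),
                       ("accumulate, FMA3", ACCUMULATE_FMA3)]:
        final, count = run(prog, START)
        print(name, final, "instructions:", count)
```

In this model both forms perform exactly one multiply-add per FMA instruction. The three-operand form needs an extra copy only when every input must survive, and in the accumulate pattern that dominates BLAS-style loops the two forms need the same number of instructions.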