I swear it was in the original presentation of the cards; when I find it I'll post it. It might have been in a follow-up interview about why not RDNA3. It could also just be the slide that claims 798 TOPS of INT8 performance on the FSR4 slide. If that's the claim, it's clearly wrong, since then the 9060 XTs wouldn't be powerful enough.
The source I linked is basically a TL;DR of the presentation + interview; the major reason mentioned there is FP8 support.
And yeah, you raise a good point: the raw FLOP number alone is meaningless, otherwise the 9060 wouldn't make the cut.
All AI numbers are fluff... Nvidia started this a long time ago, boasting a new lower-precision + sparsity number every generation, apples to oranges each time... it's quite frustrating.
But I can get close to Nvidia's theoretical numbers: running mamf-finder I reach 80~90% of Nvidia's quoted performance on my 3090, and with mmpeak I can even surpass it a bit (likely due to my higher boost clocks).
I asked a friend to do a mamf run on their 9070 XT; they only reached ~110 TFLOPS out of the ~195 TFLOPS theoretical number, so roughly 60% of what AMD claims. That's on par with the results other folks have gotten with Instinct cards, whereas with Nvidia hitting 80~90% is the norm, as I've experienced myself.
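For context, that kind of efficiency number comes from timing large matmuls and dividing by the spec-sheet figure. A minimal sketch of the idea in PyTorch (not mamf-finder itself; the matrix size, iteration count and theoretical figure are placeholders):

```python
import time
import torch

THEORETICAL_TFLOPS = 195.0  # vendor's dense FP16/BF16 figure for the card under test

def measure_matmul_tflops(n: int = 8192, iters: int = 50, dtype=torch.bfloat16) -> float:
    # "cuda" also maps to ROCm/HIP on AMD builds of PyTorch
    a = torch.randn(n, n, device="cuda", dtype=dtype)
    b = torch.randn(n, n, device="cuda", dtype=dtype)
    for _ in range(5):            # warm-up so clocks and caches settle
        a @ b
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        a @ b
    torch.cuda.synchronize()
    elapsed = time.perf_counter() - start
    flops = 2 * n**3 * iters      # an NxN matmul costs ~2*N^3 FLOPs
    return flops / elapsed / 1e12

achieved = measure_matmul_tflops()
print(f"{achieved:.1f} TFLOPS ({100 * achieved / THEORETICAL_TFLOPS:.0f}% of theoretical)")
```

(mamf-finder itself sweeps a range of shapes rather than one square matmul, so its peak is a better ceiling estimate.)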
AMD's INT4 numbers rely on 4:2 structured sparsity to hit that 1500 TOPS figure. 390 TFLOPS of dense FP8 is not bad at all; it puts it at MI250X performance level... with a fraction of the annoying complexity those things have.
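Back-of-the-envelope on how those headline figures stack, assuming the ~195 TFLOPS dense FP16 number mentioned above as the base (my reading, not AMD's official breakdown):

```python
fp16_dense  = 195              # TFLOPS, dense FP16 (assumed base figure)
fp8_dense   = fp16_dense * 2   # halve the precision       -> ~390 TFLOPS FP8 dense
int4_dense  = fp8_dense * 2    # halve the precision again -> ~780 TOPS INT4 dense
int4_sparse = int4_dense * 2   # structured sparsity       -> ~1560 TOPS INT4 headline
print(fp8_dense, int4_dense, int4_sparse)   # 390 780 1560
```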
My point is that you won't get anywhere close to those numbers because of the software stack; that's widely known.
It's like how Nvidia claimed they doubled the CUDA cores because they could all do INT and float, even though they can't do both at the same time...
They kinda can; that's the whole "dual issue" thing. When doing pure FP math the throughput is indeed doubled compared to Turing. But yeah, still kinda misleading, since INT+float is still at the older rate.
But that's another point where AMD is failing: wave64 is outright disabled throughout their stack, and their compiler has a really hard time emitting dual-issue VOPD instructions.
It is a physical core, but it's part of the regular compute units, not a separate matrix/tensor core. Shared scheduler, everything. There is a physical aspect to it that goes beyond just an instruction set.
It is a physical unit, but not a very capable one.
It also has shared ports, which limits throughput.
Same same but different: it no longer shares with the FP block, but it's still part of the CU.
Part of the CU, yes, but this still allows the scheduler to issue instructions to the FP block independently, which improves things in the end.
The unit itself is more capable as well.
That is fair, I guess; there are front-end arguments to be made here, but idc. The RDNA2/3/4 AI units, while growing more separate, are still part of the main compute cores, versus Nvidia's implementation or AMD's CDNA.
We could nerd out about it, but what matters is the final throughput you manage to reach.
If AMD can achieve good numbers in practice with those built-in units, then great! But so far that hasn't been the case.
Yeah, so the hacked-together ROCm for Strix Halo was measured at just shy of 60 TFLOPS of BF16.
No, that's the theoretical performance for SH.
With ROCm they achieved 5.1 TFLOPS, but with a custom Docker image they managed 36.9 TFLOPS (64.4% efficiency, actually a bit above the average for AMD products).
It does manage to make better use of memory bandwidth tho, at 70~73% efficiency, which is not bad at all:
For the latest Strix Halo / AMD Ryzen AI Max+ 395 with Radeon 8060S (gfx1151) support, check out: https://github.com/ROCm/TheRock/discussions/244 (via llm-tracker.info)
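For anyone curious how that bandwidth-efficiency figure is usually obtained: time a big device-to-device copy and divide by the rated bandwidth. A minimal sketch; the theoretical value is my assumption for Strix Halo's LPDDR5X, and the buffer size is arbitrary:

```python
import time
import torch

THEORETICAL_GBPS = 256.0  # assumed spec figure for Strix Halo's LPDDR5X

def measure_copy_bandwidth(n_bytes: int = 1 << 30, iters: int = 20) -> float:
    # "cuda" maps to ROCm/HIP on AMD builds of PyTorch as well
    src = torch.empty(n_bytes, dtype=torch.uint8, device="cuda")
    dst = torch.empty_like(src)
    for _ in range(3):            # warm-up
        dst.copy_(src)
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        dst.copy_(src)
    torch.cuda.synchronize()
    elapsed = time.perf_counter() - start
    return 2 * n_bytes * iters / elapsed / 1e9  # each copy reads + writes n_bytes

gbps = measure_copy_bandwidth()
print(f"{gbps:.0f} GB/s ({100 * gbps / THEORETICAL_GBPS:.0f}% of theoretical)")
```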
AMD has some tuned versions of LM Studio. I don't know what they're using; I'm assuming it's DirectML, but it performs much better than DirectML is known to.
It should be Vulkan; DirectML would be shit, as you mentioned.
Same tbh.
I'm more hardware than software, though I have been through the ISAs; I'm usually digging for specific things.
I'm curious where your work is focused.
I used to work with embedded systems in the auto industry, but jumped into data because of the $ and working from home.

But I'm a bit all over the place professionally: my master's is in computer vision with MTL models (which should also be the focus of my PhD next year), and I work as a data/ML engineer for startups, so I go from building backends and large-scale distributed pipelines to deploying models for "cheap" within k8s.
So as I understand from what you're saying, RDNA3 and maybe RDNA2 can do tensor-specific operations, but they are NOT done fast enough; having specialized hardware similar to Nvidia's tensor cores or AMD's server-grade equivalents greatly speeds up these tasks. So RDNA2 and 3 could run FSR4, ML-based ray reconstruction and denoising, but they might eat up too much GPU performance doing so, and a software solution like FSR3.1 is more suitable.
So that's why they won't do FSR4 on RDNA3: it could run, but the cost would be too high.
That'd be a good TL;DR all things considered, yes. For RDNA3 there would also be the cost of "porting" the model to work within its capabilities.
with the AI cores only being there as an aid.
Minor nit: the AI cores ARE the shader cores. Otherwise, correct.