
AMD Updates ROCm to Support Ryzen AI Max and Radeon RX 9000 Series

btarunr

Editor & Senior Moderator
AMD announced that its Radeon Open Compute (ROCm) platform now has hardware acceleration support for the Ryzen AI Max 300 "Strix Halo" client processors and the Radeon RX 9000 series gaming GPUs. For the Ryzen AI Max 300 "Strix Halo," this unlocks the compute power of its 40 RDNA 3.5 compute units, with their 80 AI accelerators and 2,560 stream processors, alongside the AI-specific ISA of its up to 16 "Zen 5" CPU cores, including their full 512-bit FPU for executing AVX-512 instructions. For the Radeon RX 9000 series, this means putting its up to 64 RDNA 4 compute units, with up to 128 AI accelerators and up to 4,096 stream processors, to use.

AMD also announced that it has updated the ROCm product stack with support for the major Linux distributions, including openSUSE (available now), Ubuntu, and Red Hat EPEL, with the latter two getting ROCm support in the second half of 2025. Lastly, ROCm gets full Windows support, including PyTorch and the ONNX execution provider (ONNX-EP). A preview of the PyTorch support can be expected in Q3 2025, while a preview of ONNX-EP could arrive in July 2025.
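For anyone wanting to sanity-check the PyTorch-on-ROCm support once it lands, here is a minimal sketch, assuming a ROCm-enabled PyTorch build (which exposes AMD GPUs through the familiar torch.cuda entry points):

```python
# Minimal sketch: confirm a ROCm build of PyTorch can see an AMD GPU.
# Assumes a ROCm-enabled PyTorch wheel is installed; on ROCm builds the
# usual torch.cuda.* entry points are backed by HIP.
import torch

print("HIP runtime:", torch.version.hip)          # None on CUDA-only builds
print("GPU visible:", torch.cuda.is_available())

if torch.cuda.is_available():
    print("Device:", torch.cuda.get_device_name(0))
    # Tiny matmul on the GPU as a smoke test.
    a = torch.randn(1024, 1024, device="cuda")
    b = torch.randn(1024, 1024, device="cuda")
    print("Matmul OK, result norm:", (a @ b).norm().item())
```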



 
I thought it wasn't lacking. I've been crunching with BOINC on my 9070 XT without problems (only in Windows, though, unfortunately). Or have I missed something?
 
It's about time. No ROCm at launch was pretty irritating, and the radio silence from AMD until now on the matter didn't help.
As I read in the tech press, they are very quiet about "tensor cores" in RDNA4; I believe they are tiptoeing around Nvidia so that no patent infringement lawsuit starts.
There is no doubt about what AMD does: copy-paste Nvidia tech. At some point they will get sued.
 
As I read in the tech press, they are very quiet about "tensor cores" in RDNA4; I believe they are tiptoeing around Nvidia so that no patent infringement lawsuit starts.
There is no doubt about what AMD does: copy-paste Nvidia tech. At some point they will get sued.
That is just silly. Google made a TPU before Nvidia made the V100 with tensor cores. AMD initially called its "tensor cores" Matrix cores, as matrix math is the primary thing tensor cores do... and both tensor and matrix are math terms. I am pretty sure they are not trademarkable, nor is the ability to solve matrix math patentable.
 
That is just silly. Google made a TPU before Nvidia made the V100 with tensor cores. AMD initially called its "tensor cores" Matrix cores, as matrix math is the primary thing tensor cores do... and both tensor and matrix are math terms. I am pretty sure they are not trademarkable, nor is the ability to solve matrix math patentable.
I know Nvidia didn't invent ray tracing, tensor cores, or machine learning; they just bet everything on them, and they are far ahead of everyone.
I don't know much about how this works, but I do know AMD would want CUDA workloads and software to be easily translated to ROCm with little to no performance penalty; if they are similar enough, they could just do this.
I also know FSR4 is a copy-paste job, and the recently announced FSR Redstone is also a copy-paste: they call it machine learning ray regeneration, Nvidia calls it ray reconstruction. AMD will introduce neural radiance caching; Nvidia did that in 2021 for path tracing.
Do you see a pattern here? RDNA4 has everything, it just needs the ML models and support.
 
I know Nvidia didn't invent ray tracing, tensor cores, or machine learning; they just bet everything on them, and they are far ahead of everyone.
I don't know much about how this works, but I do know AMD would want CUDA workloads and software to be easily translated to ROCm with little to no performance penalty; if they are similar enough, they could just do this.
I also know FSR4 is a copy-paste job, and the recently announced FSR Redstone is also a copy-paste: they call it machine learning ray regeneration, Nvidia calls it ray reconstruction. AMD will introduce neural radiance caching; Nvidia did that in 2021 for path tracing.
Do you see a pattern here? RDNA4 has everything, it just needs the ML models and support.
No, they are alternatives with the same result via different routes; if they were copy-and-paste, AMD wouldn't have taken a year to implement them.
FSR4, for example, uses both CNNs and transformers. In the end, both DLSS and FSR4 are based on image reconstruction techniques that have been on the market for a long time.


RDNA 4 still doesn't use Matrix cores as beefy as CDNA 3's (which are much stronger than Nvidia's equivalents); UDNA is where that changes.
 
No, they are alternatives with the same result via different routes; if they were copy-and-paste, AMD wouldn't have taken a year to implement them.
FSR4, for example, uses both CNNs and transformers. In the end, both DLSS and FSR4 are based on image reconstruction techniques that have been on the market for a long time.


RDNA 4 still doesn't use Matrix cores as beefy as CDNA 3's (which are much stronger than Nvidia's equivalents); UDNA is where that changes.
Exactly. AMD's machine learning has been around longer than RDNA 4. It's not a copy-paste job.

By the above logic, programmable shaders are a copy-paste job too, and Nvidia and AMD should have sued each other over them long ago, but alas, they didn't.
 
Well, this is beyond my understanding, but still: if they had machine learning capability, why didn't they do FSR4 for RDNA3 or 2? It's probably limited in what it can do, but I'll leave the explanation to people who understand this better.
 
Well, this is beyond my understanding, but still: if they had machine learning capability, why didn't they do FSR4 for RDNA3 or 2? It's probably limited in what it can do, but I'll leave the explanation to people who understand this better.
It's because their workstation lineup (CDNA) is totally different, and is developed by a different team than their gaming one (RDNA). It's an entirely different division within the company, if you will.
This will change next gen with UDNA (unified-DNA).
 
I thought it wasn't lacking. I've been crunching with BOINC on my 9070 XT without problems (only in Windows, though, unfortunately). Or have I missed something?
But are you using ROCm for that to begin with?
BOINC uses OpenCL, and (afaik) the OpenCL driver on Windows has nothing to do with ROCm or anything of the sort.
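A quick way to see the distinction is to list the OpenCL platforms a BOINC-style app would enumerate. A minimal sketch, assuming the third-party pyopencl package is installed; this path works whether or not ROCm is present:

```python
# Sketch: list the OpenCL platforms/devices a BOINC-style app would see.
# Assumes pyopencl is installed (pip install pyopencl). On Windows the AMD
# OpenCL driver ships with the Adrenalin package, independently of ROCm.
import pyopencl as cl

for platform in cl.get_platforms():
    print(f"Platform: {platform.name} ({platform.vendor})")
    for device in platform.get_devices():
        print(f"  Device: {device.name}, CUs: {device.max_compute_units}")
```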

Well, this is beyond my understanding, but still: if they had machine learning capability, why didn't they do FSR4 for RDNA3 or 2? It's probably limited in what it can do, but I'll leave the explanation to people who understand this better.
Just because FSR4 uses machine learning doesn't mean that anything that can run "machine learning" can suddenly run FSR4. FSR4 requires some specific operations to run, likely ones that are exclusive to RDNA4, with no equivalent made for previous generations, nor validated for them.

The OP also has nothing to do with FSR4 whatsoever; ROCm is a stack with drivers and runtimes to train and run models in a generic manner, but from a developer point of view, not an end-user one.
As said above, CDNA is a totally different thing from the consumer offerings, and ROCm has had way better support for those products than for the consumer ones.
 
But are you using ROCm for that to begin with?
BOINC uses OpenCL, and (afaik) the OpenCL driver on Windows has nothing to do with ROCm or anything of the sort.
Ah I see. I'm not a big compute guy, so I wouldn't know. I only crunch WCG/BOINC. :oops: Thanks for the clarification.
 
But are you using ROCm for that to begin with?
BOINC uses OpenCL, and (afaik) the OpenCL driver on Windows has nothing to do with ROCm or anything of the sort.


Just because FSR4 uses machine learning doesn't mean that anything that can run "machine learning" can suddenly run FSR4. FSR4 requires some specific operations to run, likely ones that are exclusive to RDNA4, with no equivalent made for previous generations, nor validated for them.

The OP also has nothing to do with FSR4 whatsoever; ROCm is a stack with drivers and runtimes to train and run models in a generic manner, but from a developer point of view, not an end-user one.
As said above, CDNA is a totally different thing from the consumer offerings, and ROCm has had way better support for those products than for the consumer ones.

No, it's because the RDNA3 7900 XTX caps out at 122.8 FP16 TFLOPS, and they need about 900 TOPS for FSR4. They hope to bring the inferencing cost lower as they continue to train the model, but it's not to the point where it would actually increase performance on RDNA3, allegedly. RDNA4 adds FP8 and FP4 support, getting the 9070 XT to 1,500 TOPS of FP4.
Yes, the RDNA3 and 4 cards have matrix cores; they were just more limited than the CDNA cards', and the RDNA4 cores are beefier. AMD called them AI cores, and you access them via the WMMA instructions
(Wave Matrix Multiply Accumulate), so... same same but different. It's like saying the V100 doesn't have tensor cores because they are insanely limited compared to modern tensor cores.

9070 XTs have been usable on Windows via LM Studio; AMD has a small subset of HIP applications from ROCm on Windows, as well as supporting accelerated code paths through MS's APIs. That said, they are bringing full ROCm to Windows in Q3? I expect delays. It finally gets official ROCm support in Linux now (as of yesterday), as does Strix Halo.
 
No, it's because the RDNA3 7900 XTX caps out at 122.8 FP16 TFLOPS, and they need about 900 TOPS for FSR4. They hope to bring the inferencing cost lower as they continue to train the model, but it's not to the point where it would actually increase performance on RDNA3, allegedly.
I don't think the issue is just the raw FLOPS number, but rather that, allegedly, FSR4 uses a model with FP8 weights (and it seems like INT8 wasn't good enough) (source), which RDNA3 does not support.
IDK where you got the idea that FSR4 requires "900 TOPS"; that seems like nonsense, especially given that a 9070 XT won't reach anywhere near that in FP8.
RDNA4 adds FP8 and FP4 support, getting the 9070 XT to 1,500 TOPS of FP4.
That's the number with sparsity. Without sparsity, the 9070 XT does ~390 TFLOPS FP8 / TOPS INT8. Double that for INT4, halve it for FP16 (so ~195 TFLOPS FP16 vs ~120 for the 7900 XTX).
Btw, those numbers from AMD are really misleading since they assume one is able to do dual-issue VOPD or pack stuff into a Wave64, which is often not the case. HIP even has Wave64 disabled for any RDNA GPU (source), so you can pretty much halve all of those compute numbers for any RDNA product.
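To put the precision/sparsity multipliers behind those headline figures in one place, here is a rough sketch of the arithmetic using the ballpark numbers above (these are the thread's estimates, not official AMD specs, and they ignore the dual-issue/Wave64 caveat just mentioned):

```python
# Rough sketch of how the headline TOPS figures scale from a dense FP16
# baseline. Numbers are the ballpark figures discussed in this thread,
# not official AMD specs.
fp16_dense_tflops = 195                   # ~9070 XT dense FP16 estimate from above

fp8_dense   = fp16_dense_tflops * 2       # FP8/INT8 run at twice the FP16 rate
int4_dense  = fp8_dense * 2               # INT4 doubles again
int4_sparse = int4_dense * 2              # 4:2 structured sparsity doubles the headline number

print(f"FP8/INT8 dense : ~{fp8_dense:.0f} TOPS")     # ~390
print(f"INT4 dense     : ~{int4_dense:.0f} TOPS")    # ~780
print(f"INT4 w/sparsity: ~{int4_sparse:.0f} TOPS")   # ~1560, the '1,500 TOPS' class figure
```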
Yes, the RDNA3 and 4 cards have matrix cores; they were just more limited than the CDNA cards', and the RDNA4 cores are beefier. AMD called them AI cores, and you access them via the WMMA instructions
(Wave Matrix Multiply Accumulate), so... same same but different. It's like saying the V100 doesn't have tensor cores because they are insanely limited compared to modern tensor cores.
RDNA3 does not have any kind of tensor/matrix cores. The WMMA instructions are executed through the regular ALUs, not any separate unit, as is stated clearly in the programming guide:
These instructions work over multiple cycles to compute the result matrix and internally use the DOT instructions
RDNA4's ISA guide has no such note, but I could not find clear info on whether RDNA4 has proper new units for that or if they're just beefier, so I'll leave it at that.

V100 does have proper tensor cores and those are different units than the other regular scalar/vector ALUs, meaning that you can dispatch independent instructions to the tensor cores and the math units in tandem.
9070 XTs have been usable on Windows via LM Studio
With either Vulkan or OpenCL, afaik (or even DirectML if you want worse-than-CPU perf). There's no support for ROCm on Windows for RDNA4 for either LM Studio or llama.cpp/ollama.
Linux support has been added recently.
as well as supporting accelerated code paths through MS's APIs.
Are you talking about DirectML? If so, that's pretty much useless given the low perf.
as does Strix Halo.
Sad part is that the performance for Strix Halo is really underwhelming due to the lackluster software.
 
I don't think the issue is just the raw FLOPS number, but rather that, allegedly, FSR4 uses a model with FP8 weights (and it seems like INT8 wasn't good enough) (source), which RDNA3 does not support.
IDK where you got the idea that FSR4 requires "900 TOPS"; that seems like nonsense, especially given that a 9070 XT won't reach anywhere near that in FP8.
I swear it was in the original presentation of the cards; when I find it I'll post it. It might have been in a follow-up interview about why not RDNA3. It also could just be the 798 TOPS INT8 performance figure on the FSR4 slide. If it is there or in the presentation, it's clearly wrong, as then the 9060 XTs wouldn't be powerful enough.

That's the number with sparsity. Without sparsity, the 9070 XT does ~390 TFLOPS FP8 / TOPS INT8. Double that for INT4, halve it for FP16 (so ~195 TFLOPS FP16 vs ~120 for the 7900 XTX).
Btw, those numbers from AMD are really misleading since they assume one is able to do dual-issue VOPD or pack stuff into a Wave64, which is often not the case. HIP even has Wave64 disabled for any RDNA GPU (source), so you can pretty much halve all of those compute numbers for any RDNA product.
All AI numbers are fluff... Nvidia started this a long time ago, boasting a new lower-precision + sparsity number every generation, apples to oranges each time... it's quite frustrating.
AMD's INT4 numbers use 4:2 structured sparsity to hit that 1,500 TOPS number. 390 TFLOPS FP8 dense is not bad at all; it puts it on MI250X performance level... with a fraction of the annoying complexity those things have.


RDNA3 does not have any kind of tensor/matrix cores. The WMMA instructions are executed through the regular ALUs, not any separate unit, as is stated clearly in the programming guide:
Yes, but no, but yes, but no. It's like how Nvidia claimed they doubled the CUDA cores because they could all do INT and float, even though they can't do both at the same time...
It is a physical core, but it is part of the regular compute units, not separate matrix/tensor cores. Shared scheduler and everything. There is a physical aspect to it that goes beyond an instruction set.

RDNA4's ISA guide has no such note, but I could not find clear info on whether RDNA4 has proper new units for that or if they're just beefier, so I'll leave it at that.
Same same but different: it no longer shares with the FP block but is still part of the CU.

V100 does have proper tensor cores and those are different units than the other regular scalar/vector ALUs, meaning that you can dispatch independent instructions to the tensor cores and the math units in tandem.
That is fair, I guess; there are front-end arguments to be made here, but IDC. The RDNA2/3/4 cores, while growing more separate, are still part of the main compute cores, vs. Nvidia's implementation or AMD's CDNA.
With either Vulkan or OpenCL, afaik (or even DirectML if you want worse-than-CPU perf). There's no support for ROCm on Windows for RDNA4 for either LM Studio or llama.cpp/ollama.
Linux support has been added recently.
Are you talking about DirectML? If so, that's pretty much useless given the low perf.
Sad part is that the performance for Strix Halo is really underwhelming due to the lackluster software.
Yeah, so the hacked-together ROCm for Strix Halo was measured at just shy of 60 TFLOPS of BF16.
AMD has some tuned versions of LM Studio; I do not know what they are using. I am assuming it is DirectML, but it performs much better than DirectML is known to.
AMD's Windows support has been ROCm through WSL and very, very specific targeted application support.
I stay on Linux.

I am more hardware than software, though I have been through the ISAs; I am usually digging for specific things.
I am curious where your work is focused.
 
So, as I understand from what you're saying, RDNA3 and maybe 2 can do tensor-specific operations, but they are NOT done fast enough; having specialized hardware similar to Nvidia's or AMD's server-grade tensor cores greatly speeds up these tasks. So RDNA2 and 3 can do FSR4, can do ML-based ray reconstruction and denoising, but these might eat up too much GPU performance, and a software solution like FSR3.1 is more suitable.
So that's why they won't do FSR4 on RDNA3: it can run, but the cost would be too high.
 
So, as I understand from what you're saying, RDNA3 and maybe 2 can do tensor-specific operations, but they are NOT done fast enough; having specialized hardware similar to Nvidia's or AMD's server-grade tensor cores greatly speeds up these tasks. So RDNA2 and 3 can do FSR4, can do ML-based ray reconstruction and denoising, but these might eat up too much GPU performance, and a software solution like FSR3.1 is more suitable.
So that's why they won't do FSR4 on RDNA3: it can run, but the cost would be too high.
RDNA 2 doesn't have Matrix (AI) cores, so it can't do any such task.
RDNA 3 has Matrix (AI) cores, but such operations are mostly done on the shader cores, with the AI cores only being there as an aid. They also lack specific instructions needed for FSR 4.
RDNA 4 has Matrix (AI) cores that are a sort of separate entity from the shader cores now; they are also a lot beefier than the ones on RDNA 3 and have the instructions needed for FSR 4.

This is my understanding of it.
 
I swear it was in the original presentation of the cards; when I find it I'll post it. It might have been in a follow-up interview about why not RDNA3. It also could just be the 798 TOPS INT8 performance figure on the FSR4 slide. If it is there or in the presentation, it's clearly wrong, as then the 9060 XTs wouldn't be powerful enough.
The source I linked is pretty much a TL;DR of the presentation + interview; the major reason mentioned there is pretty much FP8 support.
And yeah, you bring up a good point: the raw FLOPS number is meaningless, otherwise the 9060 wouldn't make the cut.

All AI numbers are fluff... Nvidia started this a long time ago, boasting a new lower-precision + sparsity number every generation, apples to oranges each time... it's quite frustrating.
But I can reach Nvidia's theoretical numbers. Doing a mamf-finder run I can reach 80~90% of Nvidia's quoted performance for my 3090, and with mmpeak I can even surpass it a bit (likely due to my higher boost clocks).
I asked a friend of mine to do a mamf run on their 9070 XT; they only reached ~110 TFLOPS out of the ~195 theoretical number for it, so ~60% of what AMD claims, which is really on par with the results other folks have achieved with Instinct cards, whereas with Nvidia it's the norm to reach 80~90%, as I experienced myself.
AMD's INT4 numbers use 4:2 structured sparsity to hit that 1,500 TOPS number. 390 TFLOPS FP8 dense is not bad at all; it puts it on MI250X performance level... with a fraction of the annoying complexity those things have.
My point is that you won't be achieving anywhere close to those numbers due to the software stack, that's widely known.
It's like how Nvidia claimed they doubled the CUDA cores because they could all do INT and float, even though they can't do both at the same time...
They kinda can; that's the whole "dual issue" thing: when doing sheer FP math, the throughput is indeed doubled compared to Turing. But yeah, still kinda misleading since INT+float is still at the older rate.
But that's another point AMD is failing on: their Wave64 is being outright disabled throughout their stack, and their compiler has a really hard time emitting dual-issue VOPDs.
It is a physical core, but it is part of the regular compute units, not separate matrix/tensor cores. Shared scheduler and everything. There is a physical aspect to it that goes beyond an instruction set.
It is a physical unit, but a not so capable one.
It also has shared ports, which limits throughput.
Same same but different: it no longer shares with the FP block but is still part of the CU.
Part of the CU, yes, but this still allows the scheduler to issue instructions to the FP block in an independent manner, which improves things in the end.
The unit itself is more capable as well.
That is fair, I guess; there are front-end arguments to be made here, but IDC. The RDNA2/3/4 cores, while growing more separate, are still part of the main compute cores, vs. Nvidia's implementation or AMD's CDNA.
We could nerd out about it, but what matters is the final throughput you manage to reach.
If AMD can achieve good numbers in practice with those built-in units, then great! But so far that hasn't been the case.
Yeah, so the hacked-together ROCm for Strix Halo was measured at just shy of 60 TFLOPS of BF16.
No, that's the theoretical performance for SH.
With ROCm they achieved 5.1 TFLOPS, but with a custom Docker image they did manage 36.9 TFLOPS (64.4% efficiency, even a bit above the average when it comes to AMD products).
It does manage to make better use of memory bandwidth though, at 70~73% efficiency, which is not bad at all.
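For reference, those efficiency percentages are just achieved/theoretical ratios; a small sketch of the arithmetic, assuming the ~59 TFLOPS BF16 theoretical peak mentioned earlier and the commonly cited ~256 GB/s memory bandwidth for Strix Halo (both ballpark assumptions, not measurements from this thread; small differences in the assumed compute peak explain the 62~64% spread):

```python
# Sketch of the efficiency arithmetic above. Theoretical peaks are ballpark
# assumptions: ~59 TFLOPS BF16 (mentioned earlier in this thread) and the
# commonly cited ~256 GB/s bandwidth for Strix Halo (256-bit LPDDR5X-8000).
PEAK_BF16_TFLOPS = 59.0
PEAK_BANDWIDTH_GBS = 256.0

achieved_tflops = 36.9  # best result quoted above (custom Docker image)
print(f"Compute efficiency: {100 * achieved_tflops / PEAK_BF16_TFLOPS:.1f}%")  # ~62%

# The quoted 70~73% bandwidth efficiency corresponds to roughly this much traffic:
for eff in (0.70, 0.73):
    print(f"{eff:.0%} of peak bandwidth ≈ {eff * PEAK_BANDWIDTH_GBS:.0f} GB/s")
```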

AMD has some tuned versions of LM Studio; I do not know what they are using. I am assuming it is DirectML, but it performs much better than DirectML is known to.
It should be Vulkan; DirectML would be shit, as you mentioned.
I stay on Linux.
Same tbh.
I am more hardware than software, though I have been through the ISAs; I am usually digging for specific things.
I am curious where your work is focused.
I used to work with embedded systems in the auto industry but jumped into data because $ and work from home :laugh:
But I'm a bit all over the place professionally: my master's is in computer vision with MTL models (which should also be the focus of my PhD next year), and I work as a data/ML engineer for startups, so I go from building backends and large-scale distributed pipelines to deploying models for "cheap" within k8s.


So, as I understand from what you're saying, RDNA3 and maybe 2 can do tensor-specific operations, but they are NOT done fast enough; having specialized hardware similar to Nvidia's or AMD's server-grade tensor cores greatly speeds up these tasks. So RDNA2 and 3 can do FSR4, can do ML-based ray reconstruction and denoising, but these might eat up too much GPU performance, and a software solution like FSR3.1 is more suitable.
So that's why they won't do FSR4 on RDNA3: it can run, but the cost would be too high.
That'd be a good TL;DR all things considered, yes. For RDNA3 there would also be the cost of "porting" the model to work within its capabilities.
with the AI cores only being there as an aid.
Minor nit: the AI cores ARE the shader cores. Otherwise, correct.
 
But I can reach Nvidia's theoretical numbers. Doing a mamf-finder run I can reach 80~90% of Nvidia's quoted performance for my 3090, and with mmpeak I can even surpass it a bit (likely due to my higher boost clocks).
I asked a friend of mine to do a mamf run on their 9070 XT; they only reached ~110 TFLOPS out of the ~195 theoretical number for it, so ~60% of what AMD claims, which is really on par with the results other folks have achieved with Instinct cards, whereas with Nvidia it's the norm to reach 80~90%, as I experienced myself.

My point is that you won't be achieving anywhere close to those numbers due to the software stack, that's widely known.

60% does seem quite common. I skimmed that earlier this week searching for actual Strix Halo performance, as AMD has been cagey when talking about it, leading me to believe it's quite shite.
I thought the 59 TFLOPS was the mamf result, not theoretical, mb.
With hipBLAS it hit 60%; I get the feeling there is an architectural bottleneck as much as a software one. rocBLAS is, yeah, bad, kinda known to be.

Would you mind sending me what you sent your friend? Curious to see if the official support changes anything on the 9070 XT, or if my MI100 hive does better with the IF linkage.
AMD has lots of open spots they are trying to fill... I keep getting hit up for stuff like... (This is a position within the AI GPU Software Group (AGS) responsible for AMD's ML SDK initiatives, with a focus on development within the ROCm Profiling Tools.) I keep replying, please find someone, lol.
I have been using ROCm since V64, and it's been broken for a lot of that time. It makes massive strides forward, but not in a universal sense, rather in a focused sense.
If you are doing particular things, there are competitive solutions to be had with Instincts; if you are not... SOL.
They are working on training right now, but it needs an architectural switch, and UALink really; the IF links aren't stout enough for traditional training.

I went from top-3 server vendor pre-sales engineering for HPC-AI gear to an edge-AI startup; the embedded world is fun, but I missed HPC, and my homelab was strongly HPC, which got me back into that world.
 
60% does seem quite common. I skimmed that earlier this week searching for actual Strix Halo performance, as AMD has been cagey when talking about it, leading me to believe it's quite shite.
With hipBLAS it hit 60%; I get the feeling there is an architectural bottleneck as much as a software one. rocBLAS is deprecated.
A colleague of mine had managed to achieve really nice numbers with some Instinct products by writing custom kernels, and the folks at tinygrad also managed some nice numbers without using ROCm. Issue is that those are really specific scenarios and not portable, so you'll have a hard time achieving this in other use cases, sadly.
I think the issue with Strix Halo is the same as with all their other GPU products: a bad software stack.

Would you mind sending me what you sent your friend, curious to see if the official support changes anything on the 9070xt or if my mi100hive does better with the IF linkage.
Sure. FWIW, they used the latest ROCm on Linux with the official changes, since the benchmark relies on PyTorch.

Here's the code we used:
More info on the benchmark itself and other results if you're interested:
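The actual script and links aren't preserved in this thread, but for context, a minimal sketch of a mamf-finder-style matmul throughput probe looks roughly like this (the real tool sweeps many matrix shapes to find the peak; this fixes one size, and assumes a ROCm or CUDA build of PyTorch):

```python
# Minimal sketch of a mamf-finder-style matmul throughput probe; NOT the
# exact script exchanged above (that attachment/link isn't preserved here).
import time
import torch

def measure_tflops(n=8192, iters=50, dtype=torch.bfloat16):
    a = torch.randn(n, n, device="cuda", dtype=dtype)
    b = torch.randn(n, n, device="cuda", dtype=dtype)
    for _ in range(5):                 # warm-up to stabilize clocks/caches
        a @ b
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        a @ b
    torch.cuda.synchronize()
    elapsed = time.perf_counter() - start
    flops = 2 * n**3 * iters           # 2*N^3 FLOPs per N x N matmul
    return flops / elapsed / 1e12

print(f"Achieved: {measure_tflops():.1f} TFLOPS")
```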

AMD has lots of open spots they are trying to fill... I keep getting hit up for stuff like... (This is a position within the AI GPU Software Group (AGS) responsible for AMD's ML SDK initiatives, with a focus on development within the ROCm Profiling Tools.) I keep replying, please find someone, lol.
Too bad most of those are not remote, worldwide positions :p
I have been using ROCm since V64, and it's been broken for a lot of that time. It makes massive strides forward, but not in a universal sense, rather in a focused sense.
If you are doing particular things, there are competitive solutions to be had with Instincts; if you are not... SOL.
If you have the expertise and enough throughput to get your hands dirty, you can make great use of AMD products.
But if you are more focused on getting an overall project to work in a fast manner, yeah, you're SOL and better off with any other offering.

Fun fact: AMD will be moving most stuff to SPIR-V so we won't have that mess of multi-gigabyte stacks with compiled stuff for each µarch. Downside is that's likely to break tons of things in the near future.
They are working on training right now, but it needs an architectural switch, and UALink really; the IF links aren't stout enough for traditional training.
Yeah, multi-node scaling is not great on their end, and I don't think UALink will be ready in a timely manner.
Even for inference the hardware is really underutilized ATM due to all those software limitations, which is sad given that the hardware itself is capable of much more.
 
A colleague of mine had managed to achieve really nice numbers with some Instinct products by writing custom kernels, and the folks at tinygrad also managed some nice numbers without using ROCm. Issue is that those are really specific scenarios and not portable, so you'll have a hard time achieving this in other use cases, sadly.
I think the issue with Strix Halo is the same as with all their other GPU products: a bad software stack.

They continue to improve but are clearly behind, yeah...
Sure. FWIW, they used the latest ROCm on Linux with the official changes, since the benchmark relies on PyTorch.
PyTorch is very sensitive to tuning; it's likely there are performance gains to be had there.
Too bad most of those are not remote, worldwide positions :p
Soo true... though that last one was full remote. I just have no desire to create profiling tools.
 
They continue to improve but are clearly behind, yeah...
They have done another article since then highlighting that:

It is improving, and at a really nice pace, but it's still mostly focused on their Instinct lineup (so RDNA gets scraps), and Nvidia is not slowing down by any means, so the bar is ever increasing.
PyTorch is very sensitive to tuning; it's likely there are performance gains to be had there.
Indeed, but it's up to AMD to provide good kernels within their HIP backend. They take too long to provide good kernels, and given all of Nvidia's moat, any third-party kernel will be CUDA-first, especially given how easy it is for one to get a GeForce GPU, mess around with some stuff, and then trivially spin up an H100/H200/B200 instance in any random cloud to port that kernel over (think stuff like ThunderKittens).
 