I swear it was in the original presentation of the cards; when I find it I'll post it. It might have been in a follow-up interview about why not RDNA3. It could also just be the slide that claims 798 TOPS of INT8 performance on the FSR4 slide. If that's the claim, it's clearly wrong, since then the 9060 XTs wouldn't be powerful enough.
The source I linked is basically a TL;DR of the presentation + interview; the major reason mentioned there is FP8 support.
And yeah, you raise a good point: the raw FLOP number alone is meaningless, otherwise the 9060 wouldn't make the cut.
All AI numbers are fluff... Nvidia started this a long time ago, boasting a new lower-precision + sparsity number every generation, apples to oranges each time... it's quite frustrating.
But I can get close to Nvidia's theoretical numbers: running mamf-finder I reach 80~90% of Nvidia's quoted performance on my 3090, and with mmpeak I can even surpass it a bit (likely due to my higher boost clocks).
I asked a friend to do a mamf run on their 9070 XT; they only reached ~110 TFLOPS out of the ~195 TFLOPS theoretical number, so roughly 60% of what AMD claims. That's on par with the results other folks have gotten with Instinct cards, whereas with Nvidia hitting 80~90% is the norm, as I've experienced myself.
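For context, that kind of efficiency number comes from timing large matmuls and dividing by the spec-sheet figure. A minimal sketch of the idea in PyTorch (not mamf-finder itself; the matrix size, iteration count and theoretical figure are placeholders):

```python
import time
import torch

THEORETICAL_TFLOPS = 195.0  # vendor's dense FP16/BF16 figure for the card under test

def measure_matmul_tflops(n: int = 8192, iters: int = 50, dtype=torch.bfloat16) -> float:
    # "cuda" also maps to ROCm/HIP on AMD builds of PyTorch
    a = torch.randn(n, n, device="cuda", dtype=dtype)
    b = torch.randn(n, n, device="cuda", dtype=dtype)
    for _ in range(5):            # warm-up so clocks and caches settle
        a @ b
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        a @ b
    torch.cuda.synchronize()
    elapsed = time.perf_counter() - start
    flops = 2 * n**3 * iters      # an NxN matmul costs ~2*N^3 FLOPs
    return flops / elapsed / 1e12

achieved = measure_matmul_tflops()
print(f"{achieved:.1f} TFLOPS ({100 * achieved / THEORETICAL_TFLOPS:.0f}% of theoretical)")
```

(mamf-finder itself sweeps a range of shapes rather than one square matmul, so its peak is a better ceiling estimate.)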
AMD's INT4 numbers rely on 4:2 structured sparsity to hit that 1500 TOPS figure. 390 TFLOPS of dense FP8 is not bad at all; it puts it at MI250X performance level... with a fraction of the annoying complexity those things have.
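Back-of-the-envelope on how those headline figures stack, assuming the ~195 TFLOPS dense FP16 number mentioned above as the base (my reading, not AMD's official breakdown):

```python
fp16_dense  = 195              # TFLOPS, dense FP16 (assumed base figure)
fp8_dense   = fp16_dense * 2   # halve the precision       -> ~390 TFLOPS FP8 dense
int4_dense  = fp8_dense * 2    # halve the precision again -> ~780 TOPS INT4 dense
int4_sparse = int4_dense * 2   # structured sparsity       -> ~1560 TOPS INT4 headline
print(fp8_dense, int4_dense, int4_sparse)   # 390 780 1560
```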
My point is that you won't get anywhere close to those numbers because of the software stack; that's widely known.
It's like how Nvidia claimed they doubled the CUDA cores because they could all do INT and float, even though they can't do both at the same time...
They kinda can; that's the whole "dual issue" thing. When doing pure FP math the throughput is indeed doubled compared to Turing. But yeah, still kinda misleading, since INT+float is still at the older rate.
But that's another point where AMD is failing: wave64 is outright disabled throughout their stack, and their compiler has a really hard time emitting dual-issue VOPD instructions.
It is a physical core, but it's part of the regular compute units, not a separate matrix/tensor core. Shared scheduler, everything. There is a physical aspect to it that goes beyond just an instruction set.
It is a physical unit, but not a very capable one.
It also has shared ports, which limits throughput.
Same same but different: it no longer shares with the FP block, but it's still part of the CU.
Part of the CU, yes, but this still allows the scheduler to issue instructions to the FP block independently, which improves things in the end.
The unit itself is more capable as well.
That is fair, I guess; there are front-end arguments to be made here, but idc. The RDNA2/3/4 AI units, while growing more separate, are still part of the main compute cores, versus Nvidia's implementation or AMD's CDNA.
We could nerd out about it, but what matters is the final throughput you manage to reach.
If AMD can achieve good numbers in practice with those built-in units, then great! But so far that hasn't been the case.
Yeah, so the hacked-together ROCm for Strix Halo was measured at just shy of 60 TFLOPS of BF16.
No, that's the theoretical performance for SH.
With ROCm they achieved 5.1 TFLOPS, but with a custom Docker image they managed 36.9 TFLOPS (64.4% efficiency, actually a bit above the average for AMD products).
It does manage to make better use of memory bandwidth tho, at 70~73% efficiency, which is not bad at all:
For the latest Strix Halo / AMD Ryzen AI Max+ 395 with Radeon 8060S (gfx1151) support, check out: https://github.com/ROCm/TheRock/discussions/244 (via llm-tracker.info)
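For anyone curious how that bandwidth-efficiency figure is usually obtained: time a big device-to-device copy and divide by the rated bandwidth. A minimal sketch; the theoretical value is my assumption for Strix Halo's LPDDR5X, and the buffer size is arbitrary:

```python
import time
import torch

THEORETICAL_GBPS = 256.0  # assumed spec figure for Strix Halo's LPDDR5X

def measure_copy_bandwidth(n_bytes: int = 1 << 30, iters: int = 20) -> float:
    # "cuda" maps to ROCm/HIP on AMD builds of PyTorch as well
    src = torch.empty(n_bytes, dtype=torch.uint8, device="cuda")
    dst = torch.empty_like(src)
    for _ in range(3):            # warm-up
        dst.copy_(src)
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        dst.copy_(src)
    torch.cuda.synchronize()
    elapsed = time.perf_counter() - start
    return 2 * n_bytes * iters / elapsed / 1e9  # each copy reads + writes n_bytes

gbps = measure_copy_bandwidth()
print(f"{gbps:.0f} GB/s ({100 * gbps / THEORETICAL_GBPS:.0f}% of theoretical)")
```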
AMD has some tuned versions of LM Studio. I don't know what they're using; I'm assuming it's DirectML, but it performs much better than DirectML is known to.
It should be Vulkan; DirectML would be shit, as you mentioned.
Same tbh.
I'm more hardware than software, though I have been through the ISAs; I'm usually digging for specific things.
I'm curious where your work is focused.
I used to work with embedded systems in the auto industry, but jumped into data because of the $ and working from home.

But I'm a bit all over the place professionally: my master's is in computer vision with MTL models (which should also be the focus of my PhD next year), and I work as a data/ML engineer for startups, so I go from building backends and large-scale distributed pipelines to deploying models for "cheap" within k8s.
So as I understand from what you're saying, RDNA3 and maybe RDNA2 can do tensor-specific operations, but they are NOT done fast enough; having specialized hardware similar to Nvidia's tensor cores or AMD's server-grade equivalents greatly speeds up these tasks. So RDNA2 and 3 could run FSR4, ML-based ray reconstruction and denoising, but they might eat up too much GPU performance doing so, and a software solution like FSR3.1 is more suitable.
So that's why they won't do FSR4 on RDNA3: it could run, but the cost would be too high.
That'd be a good TL;DR all things considered, yes. For RDNA3 there would also be the cost of "porting" the model to work within its capabilities.
with the AI cores only being there as an aid.
Minor nit: the AI cores ARE the shader cores. Otherwise, correct.