Monday, April 14th 2025

AMD Launches ROCm 6.4 with Technical Upgrades, Still No Support for RDNA 4

AMD officially released ROCm 6.4, its latest open‑source GPU compute stack, bringing several under‑the‑hood improvements while still lacking official RDNA 4 support. The update improves compatibility between ROCm's user‑space libraries and the AMDKFD kernel driver, making it easier to run across a wider range of Linux kernels. AMD has also expanded its internal testing to cover more combinations of user and kernel versions, which should reduce integration headaches for HPC and AI workloads. On the framework side, ROCm 6.4 now supports PyTorch 2.5 and 2.6 out of the box, so developers can use the latest deep‑learning features without building from source. The Megatron‑LM integration adds three new fused kernels (Attention QKV, Layer Norm, and RoPE) to speed up transformer model training by combining multiple operations into single GPU passes. Video decoding gets a boost, too, with VP9 support in both rocDecode and rocPyDecode, plus a new bitstream reader module to streamline media pipelines.
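
For developers who want to sanity-check the PyTorch claim, here is a minimal sketch of confirming that a ROCm build of PyTorch 2.5/2.6 actually sees the GPU; the version-string details and the matmul example are illustrative, not taken from AMD's release notes:

```python
# Minimal sketch (not from the article): checking that a ROCm build of PyTorch
# sees the GPU. On AMD builds the torch.cuda.* API is backed by HIP/ROCm, so
# the familiar CUDA-style calls work unchanged.
import torch

print(torch.__version__)          # ROCm wheels typically carry a "+rocm" suffix
print(torch.version.hip)          # HIP runtime version string; None on non-ROCm builds
print(torch.cuda.is_available())  # True if the ROCm runtime found a usable GPU

if torch.cuda.is_available():
    x = torch.randn(1024, 1024, device="cuda", dtype=torch.float16)
    y = x @ x                     # matmul dispatched through rocBLAS/hipBLAS on ROCm
    print(y.device, y.dtype)
```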

Oracle Linux 9 is now officially supported, and the Radeon PRO W7800 48 GB workstation card has been validated under ROCm. AMD also enabled CPX mode with NPS4 memory configurations, catering to advanced memory bandwidth scenarios on Instinct MI-series accelerators. Despite these updates, ROCm 6.4 still does not officially support RDNA 4 GPUs, such as the RX 9070 series. While community members report that the new release can run on those cards unofficially, the lack of formal enablement means RDNA 4's doubled FP16 throughput, eight-times INT4 sparsity acceleration, and FP8 capabilities remain largely untapped in ROCm workflows. On Linux, consumer Radeon support is limited to just a few models, even though Windows coverage for the RDNA 2 and 3 families has expanded since 2022. With AMD's "Advancing AI" event coming in June, many developers are hoping for an announcement about RDNA 4 integration. Until then, those who need guaranteed, day‑one GPU support may continue to look at alternative ecosystems.
Sources: Tom's Hardware, AMD

17 Comments on AMD Launches ROCm 6.4 with Technical Upgrades, Still No Support for RDNA 4

#1
R-T-B
The entirety of the ROCm stack really needs better attention if they want to seriously compete with CUDA. Love them or hate 'em, stuff like this does not happen over in Nvidia land. Their compute frameworks just work.
Posted on Reply
#2
evernessince
R-T-B said: The entirety of the ROCm stack really needs better attention if they want to seriously compete with CUDA. Love them or hate 'em, stuff like this does not happen over in Nvidia land. Their compute frameworks just work.
Oh, it most certainly happens on the Nvidia end as well. Blackwell still has many issues even running AI tools that work completely fine on 4000-series and older cards. Although, as everyone is aware, Blackwell has been a disaster of a launch.

AMD does need to continue improving ROCm, though. I'm hoping they improve things enough that they can compete on the AI front, so I don't have to buy a tire fire of an Nvidia card with 12V-2x6, no hotspot sensor, and whatever other anti-consumer nonsense Nvidia decides to layer on by the time I'm ready for my next GPU upgrade.
Posted on Reply
#3
Bomby569
RDNA 4 is just RDNA 3.1, old news. Their focus is on the next step.
Posted on Reply
#4
csendesmark
"still lacking official RDNA 4 support"
That's what Deadpool would call a "total dick move" :D :slap:
Posted on Reply
#5
R-T-B
evernessince said: Blackwell still has many issues even running AI tools that work completely fine on 4000-series and older cards.
Oh, interesting. That being said, that wasn't at all the same grade of issue as outright not having support.
Bomby569 said: RDNA 4 is just RDNA 3.1, old news. Their focus is on the next step.
RDNA 3 is supported, though. :laugh:
Posted on Reply
#6
Rightness_1
AMD, always closing the barn door after the horses have bolted!
Posted on Reply
#7
Lianna
AleksandarK said: Despite these updates, ROCm 6.4 still does not officially support RDNA 4 GPUs, such as the RX 9070 series.
Unfortunately, even RDNA "3.5" is not supported, neither in the newer AI MAX+ version nor in the older 7840/8840/370 versions.
Support for slow but popular and common machines would widen the audience for ROCm.
AFAIK, NPU support - both older 10/16 TOPS and newer ~50 TOPS versions - is also missing.
But this is especially bad for AI MAX+ products, which were advertised with the possibility of a "96 GB VRAM allocation" and with both the NPU and GPU on the I/O die.
How is that working for their GAIA initiative?
Posted on Reply
#8
Makaveli
Not surprised; we probably won't see support for a few more months.

Looks like LM Studio just added support for Meta Llama 4.

Posted on Reply
#9
Patriot
Lianna said: Unfortunately, even RDNA "3.5" is not supported, neither in the newer AI MAX+ version nor in the older 7840/8840/370 versions.
Support for slow but popular and common machines would widen the audience for ROCm.
AFAIK, NPU support - both older 10/16 TOPS and newer ~50 TOPS versions - is also missing.
But this is especially bad for AI MAX+ products, which were advertised with the possibility of a "96 GB VRAM allocation" and with both the NPU and GPU on the I/O die.
How is that working for their GAIA initiative?
rocm.docs.amd.com/projects/install-on-windows/en/latest/reference/system-requirements.html
Shows RDNA 3.5 as supported there for LM Studio's Windows HIP-based builds; the Linux-only ROCm listings aren't showing updated support yet.
Posted on Reply
#10
Lianna
Patriot said: rocm.docs.amd.com/projects/install-on-windows/en/latest/reference/system-requirements.html
Shows RDNA 3.5 as supported there for LM Studio's Windows HIP-based builds; the Linux-only ROCm listings aren't showing updated support yet.
Thanks for the correction. So AI MAX+ is supported on Windows but not on Linux (hopefully "yet"); my previous statement was based on the Linux support status as reported by Phoronix.
Support for older RDNA 3.5 parts, such as the 7840/8840 series or the 370 series, is still missing.
NPU support seems missing as well, at least on Linux; I'm not sure about NPU support in ROCm, but apparently GAIA can use both the GPU and the NPU, so...
Posted on Reply
#11
igormp
Lianna said: NPU support seems missing as well, at least on Linux
It has been merged already:
www.phoronix.com/news/AMD-NPU-Firmware-Upstream

FWIW, the NPU is not meant to be used within ROCm; it's a different stack entirely, at least as far as I know. The NPU is also only meant for inference of certain quantized models (INT4 or INT8).
Posted on Reply
#12
evernessince
R-T-B said: Oh, interesting. That being said, that wasn't at all the same grade of issue as outright not having support.
Agreed; hopefully they bring broad support to their consumer cards and capitalize on Nvidia's missteps. RDNA 4, with its compute capabilities, has potential for prosumers and professionals on a budget.
Posted on Reply
#13
Patriot
I know AMD demoed the 9070 XT running LLMs, but I don't know why the documentation doesn't list support on the HIP side of things.
gpuopen.com/learn/accelerating_generative_ai_on_amd_radeon_gpus/ - might have to try it and see.
I also know there have been people who have found some mixed support just by adding the 9070 XT to the ROCm whitelist, so it's not stable yet on that front.
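
For context on that "whitelist" trick: the knob most often cited for unsupported Radeon cards is the HSA_OVERRIDE_GFX_VERSION environment variable. Whether that is what's meant here, and whether any value of it helps on RDNA 4 at all, is unverified; the sketch below only shows the general idea.

```python
# Hedged sketch, not an official recipe: HSA_OVERRIDE_GFX_VERSION can coax the
# ROCm runtime into treating an unlisted Radeon card as a supported gfx target.
# The value below is purely illustrative; the right target for RDNA 4 (if any
# works) is not documented.
import os

os.environ.setdefault("HSA_OVERRIDE_GFX_VERSION", "11.0.0")  # example value only

import torch  # import after setting the override so the ROCm runtime picks it up

if torch.cuda.is_available():
    print("GPU visible:", torch.cuda.get_device_name(0))
else:
    print("No usable ROCm device found")
```
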
Posted on Reply
#14
igormp
Patriot said: I know AMD demoed the 9070 XT running LLMs, but I don't know why the documentation doesn't list support on the HIP side of things.
gpuopen.com/learn/accelerating_generative_ai_on_amd_radeon_gpus/ - might have to try it and see.
I also know there have been people who have found some mixed support just by adding the 9070 XT to the ROCm whitelist, so it's not stable yet on that front.
From that link it seems like they're comparing stuff using that "Amuse" platform, which seems to run on the ONNX Runtime with the DirectML backend, so no ROCm is involved whatsoever.
It also implies that the quality of such models is degraded (given the quantization), and that the performance is subpar given the runtime and backend used.
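
To make the backend distinction concrete, here is a small sketch of how an ONNX Runtime session gets pointed at DirectML versus a ROCm-backed execution provider; "model.onnx" is a placeholder path, and provider availability depends on which onnxruntime package variant is installed:

```python
# Sketch of the backend distinction: the same ONNX model can be run through
# DirectML (no ROCm involved) or through a ROCm-backed provider, depending on
# the onnxruntime build installed. "model.onnx" is a placeholder.
import onnxruntime as ort

print(ort.get_available_providers())  # e.g. ['DmlExecutionProvider', 'CPUExecutionProvider']

# DirectML path (what a Windows ONNX/DirectML pipeline like the one described uses):
dml_session = ort.InferenceSession(
    "model.onnx", providers=["DmlExecutionProvider", "CPUExecutionProvider"])

# A ROCm-based pipeline would instead request a ROCm-backed provider:
rocm_session = ort.InferenceSession(
    "model.onnx", providers=["ROCMExecutionProvider", "CPUExecutionProvider"])
```
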
Posted on Reply
#15
Patriot
igormp said: From that link it seems like they're comparing stuff using that "Amuse" platform, which seems to run on the ONNX Runtime with the DirectML backend, so no ROCm is involved whatsoever.
It also implies that the quality of such models is degraded (given the quantization), and that the performance is subpar given the runtime and backend used.
Degraded is perhaps a bit harsh... as that has been Nvidia's game for the past ~5 generations (depends how you count, as Volta and Hopper are server-only). The name of the game is lowering quant while retaining as much accuracy as possible.
The RTX 5xxx gen is all about FP4. Whenever there is a "doubling" of AI performance, it is a new, lower-quant mode of operation.
Accelerating workflows is all about finding where precision is needed and where it isn't. This is one of the big changes DeepSeek brought: it stores the values in FP16 and trains in FP8 off of them, but then running them, inferring off of them, can be done lower, in FP4.

Yes, you lose data when lowering quant, and you lose data when you process for sparsity; these are not lossless optimizations but tradeoffs, and the tradeoff for DeepSeek is a faster iteration rate and newer models sooner, and it costs... VRAM, lots of VRAM. Also keep in mind that with how the data is processed it's not a 1:1 tradeoff; it's not half quality for double speed, as these are optimized formats for halving the data processed by removing zero values. Sparsity also won't double the performance unless it's a test model designed for it; you will probably get 80% faster at the same precision.
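
A toy illustration of that precision trade-off, assuming simple symmetric per-tensor quantization (real stacks use per-channel or block-wise schemes, so the numbers are only directional):

```python
# Quantize a random FP16 weight tensor to INT8 and INT4 (symmetric, per-tensor)
# and measure the round-trip error, as a rough stand-in for the quality/size
# trade-off discussed above.
import numpy as np

rng = np.random.default_rng(0)
weights = rng.standard_normal(1 << 16).astype(np.float16)

def quant_roundtrip(x, bits):
    qmax = 2 ** (bits - 1) - 1                      # 127 for INT8, 7 for INT4
    scale = float(np.abs(x).max()) / qmax
    q = np.clip(np.round(x / scale), -qmax - 1, qmax)
    return (q * scale).astype(np.float16)

for bits in (8, 4):
    err = float(np.abs(weights - quant_roundtrip(weights, bits)).mean())
    print(f"INT{bits}: mean abs error {err:.4f}, storage vs FP16 {bits / 16:.0%}")
```
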
Posted on Reply
#16
igormp
Patriot said: Degraded is perhaps a bit harsh... as that has been Nvidia's game for the past ~5 generations (depends how you count, as Volta and Hopper are server-only). The name of the game is lowering quant while retaining as much accuracy as possible.
I mean, it is lower quality for sure, but how much worse it is depends on how it was quantized (quantization-aware training makes it almost negligible). If you can recommend a better wording for it, I'm all ears haha
But quantizing models is one thing; actually training such models at lower precision is another. The latter usually has no downsides and many benefits.
Patriot said: Yes, you lose data when lowering quant, and you lose data when you process for sparsity; these are not lossless optimizations but tradeoffs, and the tradeoff for DeepSeek is a faster iteration rate and newer models sooner, and it costs... VRAM, lots of VRAM. Also keep in mind that with how the data is processed it's not a 1:1 tradeoff; it's not half quality for double speed, as these are optimized formats for halving the data processed by removing zero values. Sparsity also won't double the performance unless it's a test model designed for it; you will probably get 80% faster at the same precision.
I agree with all you said, but grabbing an FP32/FP16 model and just turning it into INT4/INT8 (which are the formats suitable for NPUs) does decrease quality in a noticeable manner. Remember that your Q4/5/6 quants are better (and more compute-intensive) than "simple" INT4/8 quants that are meant to be used within NPUs, and such Q quants can't be used with NPUs either.
And the above is totally different from when you have a model natively trained in FP8; quantizing an FP8 model into other, smaller formats has the upside of less quality loss compared to the jump from an FP32/16 model.
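
A rough numeric sketch of that difference between a single per-tensor INT4 scale and block-wise "Q"-style scaling, on synthetic data rather than any particular model:

```python
# The same INT4 budget gives noticeably lower error when each small block gets
# its own scale (loosely in the spirit of llama.cpp-style Q4 formats, not a
# reimplementation of them).
import numpy as np

rng = np.random.default_rng(1)
weights = rng.standard_normal(1 << 16).astype(np.float32)

def int4_per_tensor(x):
    scale = np.abs(x).max() / 7.0
    return np.clip(np.round(x / scale), -8, 7) * scale

def int4_blockwise(x, block=32):
    out = np.empty_like(x)
    for i in range(0, x.size, block):
        out[i:i + block] = int4_per_tensor(x[i:i + block])
    return out

for name, fn in (("per-tensor INT4", int4_per_tensor), ("block-wise INT4", int4_blockwise)):
    print(f"{name}: mean abs error {np.abs(weights - fn(weights)).mean():.4f}")
```
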
Posted on Reply
#17
Makaveli
igormp said: From that link it seems like they're comparing stuff using that "Amuse" platform, which seems to run on the ONNX Runtime with the DirectML backend, so no ROCm is involved whatsoever.
It also implies that the quality of such models is degraded (given the quantization), and that the performance is subpar given the runtime and backend used.
Correct. I've had Amuse installed for a few months now; it doesn't use ROCm.
Posted on Reply