Monday, April 14th 2025

AMD Launches ROCm 6.4 with Technical Upgrades, Still no Support for RDNA 4
AMD officially released ROCm 6.4, its latest open‑source GPU compute stack, bringing several under‑the‑hood improvements while still lacking official RDNA 4 support. The update improves compatibility between ROCm's user‑space libraries and the AMDKFD kernel driver, making it easier to run across a wider range of Linux kernels. AMD has also expanded its internal testing to cover more combinations of user‑space and kernel versions, which should reduce integration headaches for HPC and AI workloads. On the framework side, ROCm 6.4 now supports PyTorch 2.5 and 2.6 out of the box, so developers can use the latest deep‑learning features without building from source. The Megatron‑LM integration adds three new fused kernels (Attention QKV, Layer Norm, and RoPE) that speed up transformer training by combining multiple operations into single GPU passes. Video decoding gets a boost, too, with VP9 support in both rocDecode and rocPyDecode, plus a new bitstream reader module to streamline media pipelines.
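For developers, the practical upshot is that an official ROCm build of PyTorch 2.5/2.6 can be used directly. A minimal sanity-check sketch, assuming a ROCm wheel of PyTorch is installed (the version strings printed are illustrative, not from the release notes):

```python
# Quick check that the installed PyTorch is a ROCm (HIP) build and can see the GPU.
# On ROCm builds, torch.version.hip is a version string; on CUDA/CPU builds it is None.
import torch

print("PyTorch:", torch.__version__)
print("HIP runtime:", torch.version.hip)
print("GPU visible:", torch.cuda.is_available())

if torch.cuda.is_available():
    # ROCm devices are exposed through the familiar torch.cuda API
    print("Device:", torch.cuda.get_device_name(0))
    x = torch.randn(1024, 1024, device="cuda", dtype=torch.float16)
    y = x @ x  # dispatched to ROCm math libraries under the hood
    print("Matmul OK:", tuple(y.shape))
```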
Oracle Linux 9 is now officially supported, and the Radeon PRO W7800 48 GB workstation card has been validated under ROCm. AMD also enabled CPX mode with NPS4 memory configurations, catering to advanced memory‑bandwidth scenarios on Instinct MI‑series accelerators. Despite these updates, ROCm 6.4 still does not officially support RDNA 4 GPUs, such as the RX 9070 series. While community members report that the new release can run on those cards unofficially, the lack of formal enablement means RDNA 4's doubled FP16 throughput, 8× INT4 sparsity acceleration, and FP8 capabilities remain largely untapped in ROCm workflows. On Linux, consumer Radeon support is limited to just a few models, even though Windows coverage for the RDNA 2 and RDNA 3 families has expanded since 2022. With AMD's "Advancing AI" event coming in June, many developers are hoping for an announcement about RDNA 4 integration. Until then, those who need guaranteed, day‑one GPU support may continue to look at alternative ecosystems.
Sources:
Tom's Hardware, AMD
17 Comments on AMD Launches ROCm 6.4 with Technical Upgrades, Still no Support for RDNA 4
AMD does need to continue improving ROCm though. I'm hoping they improve things enough that they can compete on the AI front, so I don't have to buy a tire fire of an Nvidia card with 12V-2x6, no hotspot sensor, and whatever other anti-consumer nonsense Nvidia decides to layer on by the time I'm ready for my next GPU upgrade.
That's what Deadpool would call a "total dick move" :D :slap:
Support for slow but popular and common machines would widen the audience for ROCm.
AFAIK, NPU support - both older 10/16 TOPS and newer ~50 TOPS versions - is also missing.
But this is especially bad for the AI MAX+ products, which were advertised with the possibility of allocating up to 96 GB as VRAM and have both the NPU and GPU on the I/O die.
How is that working for their GAIA initiative?
Looks like LM Studio just added support for Meta Llama 4.
It shows as supported here for RDNA 3.5 in LM Studio's HIP-based Windows builds; the Linux ROCm side isn't showing updated support listings yet.
Support for older RDNA 3.5 parts, such as the 7840/8840 series or the 370 series, is still missing.
NPU support seems missing as well, at least on Linux; not sure about NPU support on ROCm, but apparently GAIA could use both GPU and NPU so...
www.phoronix.com/news/AMD-NPU-Firmware-Upstream
FWIW, NPU is not meant to be used within ROCm, it's a different stack entirely, at least as far as I know. The NPU is also only meant for inference of certain quantized models (INT4 or INT8).
gpuopen.com/learn/accelerating_generative_ai_on_amd_radeon_gpus/ might have to try it and see.
I also know there have been people who found mixed support just by adding the 9070 XT to the ROCm whitelist, so it's not stable yet on that front.
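For reference, the usual unofficial "whitelist" trick people mean is spoofing the gfx target the ROCm runtime reports via the HSA_OVERRIDE_GFX_VERSION environment variable. A rough sketch below; the 11.0.0 value is the common override for unsupported RDNA 3 parts, and whether any value actually works on an RX 9070 XT (RDNA 4) is unverified, which is exactly the instability described above.

```python
# Illustrative only: spoof the reported gfx target so ROCm libraries load kernels built
# for a supported architecture. Must be set before torch initializes the HIP runtime.
# 11.0.0 (gfx1100 / RDNA 3) is the commonly used value for unsupported RDNA 3 cards;
# treat any RDNA 4 use of this as experimental and unsupported.
import os
os.environ.setdefault("HSA_OVERRIDE_GFX_VERSION", "11.0.0")

import torch
if torch.cuda.is_available():
    print("Detected:", torch.cuda.get_device_name(0))
else:
    print("No ROCm-visible GPU with this override")
```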
It also implies that the quality of such models is degraded (given the quantization), and that performance is subpar given the runtime and backend used.
The RTX 50-series generation is all about FP4. Whenever there is a "doubling" of AI performance, it comes from a new, lower-precision quant mode of operation.
Accelerating workflows is all about finding where precision is needed and where it isn't. This is one of the big changes DeepSeek brought: it stores the master values in FP16 and trains in FP8 off of them, but then running them, i.e. inference, can be done even lower, in FP4.
Yes, you lose data when lowering the quant, and you lose data when you process for sparsity; these are not lossless optimizations but tradeoffs. For DeepSeek the tradeoff is a faster iteration rate and newer models sooner, and it costs... VRAM, lots of VRAM. Also keep in mind that, with how the data is processed, it's not a 1:1 tradeoff: it's not half the quality for double the speed, since these are optimized formats that halve the data processed by removing zero values. Sparsity also won't double performance unless the model is a test model designed for it; in practice you'll probably get around 80% faster at the same precision.
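To make the "not lossless" point concrete, here's a toy round-trip of synthetic weights through 8-bit and 4-bit integer grids; the numbers are purely illustrative, not a benchmark of any real model.

```python
# Toy illustration: quantizing weights to a coarser integer grid and dequantizing back
# always leaves a reconstruction error, and the error grows as the bit width shrinks.
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(0, 0.02, size=4096).astype(np.float32)   # stand-in for a layer's weights

def quant_roundtrip(x, bits):
    qmax = 2 ** (bits - 1) - 1                 # symmetric signed grid, e.g. +/-7 for 4-bit
    scale = np.abs(x).max() / qmax             # one global scale for the whole tensor
    q = np.clip(np.round(x / scale), -qmax - 1, qmax)
    return q * scale                           # dequantize back to float

for bits in (8, 4):
    err = np.abs(w - quant_roundtrip(w, bits)).mean()
    print(f"INT{bits}: mean abs reconstruction error {err:.6f}")
```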
But one thing is just quantizing models; another is actually training such models at lower precision. The latter usually has no downsides and many benefits. I agree with all you said, but grabbing an FP32/FP16 model and just turning it into INT4/INT8 (the formats suitable for NPUs) does decrease quality in a noticeable way. Remember that your Q4/5/6 quants are better (and more compute-intensive) than the "simple" INT4/8 quants meant to be used on NPUs, and such Q quants can't be used with NPUs either.
And the above is totally different from having a model natively trained in FP8; quantizing an FP8 model into smaller formats also has the upside of less quality loss compared to the jump down from an FP32/16 model.
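For what it's worth, the gap between "simple" INT4 and the block-wise Q4-style quants can be sketched in a few lines: one scale per small block copes with outliers far better than a single global scale, at the cost of extra per-block bookkeeping. The block size of 32 here is just an assumption for illustration, not any specific GGUF/Q4 format.

```python
# Rough sketch: block-wise quantization (one scale per block) vs a single global scale.
# Real weight tensors have outliers, which a global scale handles poorly.
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(0, 0.02, size=4096).astype(np.float32)
w[::257] *= 20                                  # inject a few outliers

def int4_global(x):
    scale = np.abs(x).max() / 7
    return np.clip(np.round(x / scale), -8, 7) * scale

def int4_blockwise(x, block=32):
    out = np.empty_like(x)
    for i in range(0, x.size, block):
        out[i:i + block] = int4_global(x[i:i + block])   # independent scale per block
    return out

print("global INT4   mean abs err:", np.abs(w - int4_global(w)).mean())
print("blockwise Q4  mean abs err:", np.abs(w - int4_blockwise(w)).mean())
```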