
ComfyUI-Zluda Experience (AMD GPUs)

Looking for fellow users of ComfyUI-Zluda as I have some questions, especially regarding custom nodes. I have a 7900XTX and am really happy with the image-generation performance under ComfyUI, but many things don't yet work as expected.

First, has anyone managed to get crystools working properly via Zluda? You can install it, but it reports being broken (I forget the exact wording). I found this about deleting the GPU check, which I tried with no success (likely something has changed since 2024 - the Python version or ComfyUI itself).
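For anyone else poking at it, the workaround I understand targets the GPU/NVML check; here is a rough sketch of that kind of guard (the actual code inside crystools will look different, this only illustrates the idea):

```python
# Sketch of the kind of GPU check that gets patched in monitoring nodes like crystools.
# Real file/function names inside crystools may differ - the point is to degrade
# gracefully when NVIDIA's NVML is missing (as it is under ZLUDA).
import torch

try:
    import pynvml
    pynvml.nvmlInit()
    NVML_AVAILABLE = True
except Exception:
    # ZLUDA presents a "CUDA" device to torch, but NVML is NVIDIA-only,
    # so hardware monitoring should be disabled instead of erroring out.
    NVML_AVAILABLE = False

def gpu_stats():
    """Return minimal stats, skipping NVML-based readings when unavailable."""
    stats = {"torch_cuda_available": torch.cuda.is_available()}
    if NVML_AVAILABLE:
        handle = pynvml.nvmlDeviceGetHandleByIndex(0)
        mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
        stats["vram_used_mb"] = mem.used // (1024 * 1024)
    return stats

print(gpu_stats())
```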

Second, ComfyUI Manager. It installs and works as an interface for finding things, but it doesn't seem to actually install items (custom nodes, for example) - it reports installing, yet nothing happens, and I can see this in cmd. Is this just my experience, or is anyone else using ComfyUI Manager with ComfyUI-Zluda?
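In the meantime, the usual fallback when the Manager claims success but nothing appears is to install the node manually into custom_nodes and install its requirements yourself - a rough sketch, with the install path and repo URL as placeholders:

```python
# Manual custom-node install fallback for when ComfyUI Manager reports success
# but nothing shows up. The install path and repo URL below are placeholders.
import subprocess
from pathlib import Path

COMFY_DIR = Path(r"C:\ComfyUI-Zluda")                        # adjust to your checkout
NODE_REPO = "https://github.com/<author>/<custom-node>.git"  # placeholder URL

node_dir = COMFY_DIR / "custom_nodes" / NODE_REPO.split("/")[-1].removesuffix(".git")

# 1) Clone the node into custom_nodes
subprocess.run(["git", "clone", NODE_REPO, str(node_dir)], check=True)

# 2) Install its requirements with the same Python/venv that ComfyUI uses
req = node_dir / "requirements.txt"
if req.exists():
    venv_python = COMFY_DIR / "venv" / "Scripts" / "python.exe"  # assumed venv layout
    subprocess.run([str(venv_python), "-m", "pip", "install", "-r", str(req)], check=True)

# 3) Restart ComfyUI so the new node gets loaded
```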

Third, I'm unsure if this is related to the ready-to-use workflows/models inside the interface, but right-clicking on an empty area in ComfyUI is not possible / does nothing. It does work when dragging connection arrows or when right-clicking on existing nodes.

I also can't seem to get the live preview and the save-node image display to work. In ComfyUI Manager I switched between Auto and TAESD preview, downloading and installing the respective TAESD decoders and encoders, but I still can't see the images/progress. They just stay blank while the images are being generated into the ComfyUI output folder. What am I doing wrong?
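For completeness, previews can also be forced from the launch command instead of the Manager setting; a sketch assuming a typical install layout (if your install starts through a .bat wrapper, as ComfyUI-Zluda setups often do, the flag goes in that script instead):

```python
# Forcing previews from the launch command instead of the Manager setting.
# --preview-method is a standard ComfyUI argument (none / auto / latent2rgb / taesd);
# the TAESD decoder files are expected under models/vae_approx.
# The paths below are assumptions for a typical ComfyUI-Zluda checkout.
import subprocess
from pathlib import Path

COMFY_DIR = Path(r"C:\ComfyUI-Zluda")
venv_python = COMFY_DIR / "venv" / "Scripts" / "python.exe"

subprocess.run([str(venv_python), str(COMFY_DIR / "main.py"),
                "--preview-method", "taesd"],
               cwd=COMFY_DIR, check=True)
```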

On a positive note, image generation works very well: 1024x1024 images with 20 steps take less than 20 seconds to produce, and 512x512 typically takes less than 9 seconds with ROCm 6.2. My card is also undervolted and overclocked (incl. VRAM), but it would still be fast enough without that from what I can see...
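In terms of raw iteration rate, that works out roughly as follows (simple arithmetic, assuming all of the time is spent on sampling):

```python
# Rough iteration rates implied by those timings (assumes all of the wall time
# goes to the 20 sampling steps, so the real it/s is slightly higher).
steps = 20
print(f"1024x1024: ~{steps / 20:.1f} it/s or better")   # <20 s total
print(f"512x512:   ~{steps / 9:.1f} it/s or better")    # <9 s total
```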

It would be really great to share our experiences with ComfyUI-Zluda, as information online is very fragmented, there are not many sources from what I've seen so far, and GitHub developers tend to ignore AMD users.
 
Post got flagged by the spam filter. Doesn't look like spam to me. OP isn't using a VPN or proxy, UK .edu IP
 
Hi, this is definitely not spam nor abusive in any way. I only wanted to share my experiences with ComfyUI-Zluda on an AMD GPU and get feedback from other users, as the knowledge base for CUDA-designed apps running on AMD is extremely limited. There are workarounds and many people do it their own way (many of them completely user-unfriendly), but hey, this is what forums are for.

I managed to fix the live preview during processing and the save-node display by going through my browser and extension settings - it looks like one of the extensions was blocking it. I can now confirm that ComfyUI works better with some web browsers than with others: apart from requiring JavaScript, there are other things running in the background that you are never prompted to adjust.
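For anyone debugging the same thing: the previews are pushed over ComfyUI's WebSocket, so a quick way to rule the browser in or out is to connect to it from a script. A minimal sketch using the websocket-client package, assuming the default server address:

```python
# Quick check that ComfyUI's preview WebSocket is reachable outside the browser.
# If this connects and prints messages while a job is running, the server side is
# fine and blank previews point at a browser/extension problem.
import uuid
import websocket  # pip install websocket-client

server = "127.0.0.1:8188"          # ComfyUI's default address, adjust if needed
client_id = str(uuid.uuid4())

ws = websocket.WebSocket()
ws.connect(f"ws://{server}/ws?clientId={client_id}")
print("Connected - queue a prompt in the UI and watch for messages...")
for _ in range(5):
    msg = ws.recv()  # JSON status text, or binary frames for previews
    print(type(msg).__name__, str(msg)[:80])
ws.close()
```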

Unless someone else confirms it working, it also looks like crystools only supports CUDA cards on Windows; AMD monitoring is available on Linux, but not on Windows.

Are there any TPU users running Stable Diffusion (models, custom nodes, extensions, settings) on AMD GPUs or at all?
 
AFAIK the Zluda fork is quite far behind upstream ComfyUI, so it's missing lots of fixes and new features.
 
I've been using Zluda for a month or so and the setup was dreadful: it involved a lot of CLI/permissions/OS tinkering and nothing was straightforward, i.e. install and use. Plus error after error that had to be addressed individually. But once ComfyUI was finally working, performance on a 7900XTX with the latest ROCm 6.2 is so much better than I ever expected. 512x512 images get generated almost instantly, while 1024x1024 takes up to 20 seconds. How long does it take on a 4090/5090?

I also run a local 14B-parameter LLM and easily generate 50+ tok/sec. I even increased the context window to 7152 tokens from the 2048 default and it still runs fast, stable and at low temperatures. Whisper on the GPU also works wonders, even transcribing songs (for fun), with a 4-minute track taking a few seconds to fully transcribe. Curious to see how long it takes on a 4090/5090 and the 9070 XT.
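For anyone wanting to reproduce the context-window change, this is roughly how it can be done per request through Ollama's REST API (the model tag is only an example):

```python
# Raising the context window per request through Ollama's REST API.
# num_ctx defaults to 2048; the model tag below is only an example.
import requests

resp = requests.post("http://localhost:11434/api/generate", json={
    "model": "qwen2.5:14b",           # any 14B-class model you have pulled
    "prompt": "Summarise the ZLUDA project in two sentences.",
    "stream": False,
    "options": {"num_ctx": 7152},     # up from the 2048 default
})
print(resp.json()["response"])
```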

Zluda has been rewritten from scratch over the past year or so, after AMD took down the original project, and it is now fully open source; with ROCm 6.1 and 6.2 performance skyrocketed. With 7.0 due in a few months and - hopefully - open-source community support for older GPUs from before the 7000 series, things will only get better.
 
How long does it take on a 4090/5090?
You can get an idea of how other GPUs perform by looking here:

I've never used ComfyUI for diffusion stuff, so I can't really compare.

I also run a local 14B-parameter LLM and easily generate 50+ tok/sec. I even increased the context window to 7152 tokens from the 2048 default and it still runs fast, stable and at low temperatures. Whisper on the GPU also works wonders, even transcribing songs (for fun), with a 4-minute track taking a few seconds to fully transcribe. Curious to see how long it takes on a 4090/5090 and the 9070 XT.
What Whisper backend are you using? With WhisperX I can get 70~100x RTF on large-v3 with my 3090.
For LLMs I posted some benchmarks here:
I've tried a bunch of models, and keep switching back and forth between those.
Those are the ones currently downloaded into my MBP:
[attachment 385419: screenshot of the locally downloaded models]

I usually use Ollama as the backend, with either the Python API for some software I run, or Open WebUI when I want a chat-like thingy.
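For the Python API route, a minimal example with the official ollama package (the model tag is one from the table below):

```python
# Minimal example of driving Ollama from Python instead of the CLI
# (uses the official "ollama" package; the model tag is one from the table below).
import ollama

reply = ollama.chat(
    model="llama3.1:8b-instruct-q4_K_M",
    messages=[{"role": "user", "content": "What's the meaning of life?"}],
)
print(reply["message"]["content"])
```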

Some performance numbers from my setups on a "what's the meaning of life?" prompt, without using flash attention (FA) or any other software speedups:

Model                                  | 2x3090 (tok/s) | M3 Max (tok/s)
phi4:14b-q4_K_M                        | 62             | 25.5
phi4:14b-q8_0                          | 47.5           | 16.9
deepseek-r1:32b-qwen-distill-q4_K_M    | 27.8           | 12.6
deepseek-r1:7b-qwen-distill-q4_K_M     | 116            | 50.4
gemma:7b-instruct-v1.1-q4_0            | 113.7          | 46.3
llama3.1:8b-instruct-q4_K_M            | 110.1          | 46.08
llama3.1:8b-instruct-fp16              | 49.8           | 17.5
deepseek-r1:32b-qwen-distill-q8_0      | 21             | -
deepseek-r1:70b-llama-distill-q4_K_M   | 16.6           | -
llama3.3:70b-instruct-q4_K_M           | 16.7           | -

With the models above that need 2 GPUs, both GPUs average about 50% utilization. I did not run those on my MBP since I was out of memory for them, given the other crap I had open.
Both my 3090s are also set to a 275W power limit.
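If anyone wants to reproduce numbers like these, the tok/s can be read straight from Ollama's response metadata; a sketch, with the model tag taken from the table above:

```python
# Reproducing tok/s figures like the table above from Ollama's own response metadata.
# eval_count and eval_duration (in nanoseconds) are standard fields in the response;
# the model tag is just an example from the table.
import requests

r = requests.post("http://localhost:11434/api/generate", json={
    "model": "phi4:14b-q4_K_M",
    "prompt": "What's the meaning of life?",
    "stream": False,
}).json()

tok_per_s = r["eval_count"] / r["eval_duration"] * 1e9
print(f"{tok_per_s:.1f} tok/s")
```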
 
Thank you, this is all very helpful.

What Whisper backend are you using? With WhisperX I can get 70~100x RTF on large-v3 with my 3090.
For LLMs I posted some benchmarks here:

I tried this whisper.cpp implementation first. Unfortunately it doesn't work with the large-v3 model (including the turbo variant), but tiny and medium work great. The large model does, however, work with the Subtitle Edit front end (GPU only). I still need to check the specific performance numbers, since I've been trying to find the best ratio of front-end functionality to performance, but it's fast and relatively accurate. Surprisingly, for certain tasks the large model performs worse than medium (unsure why; tiny is only good for very basic ones).
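For reference, the basic invocation looks roughly like this, wrapped in Python for convenience (the binary name and all paths are assumptions that depend on your build):

```python
# Rough sketch of driving a whisper.cpp build from Python via subprocess.
# The binary name ("whisper-cli") and all paths are assumptions that depend on the
# build; whisper.cpp expects 16 kHz mono WAV input, so convert the track first (ffmpeg).
import subprocess

subprocess.run([
    r"C:\whisper.cpp\whisper-cli.exe",                   # assumed binary location
    "-m", r"C:\whisper.cpp\models\ggml-medium.bin",      # medium works fine here
    "-f", r"C:\audio\track_16k.wav",                     # pre-converted to 16 kHz WAV
    "-otxt",                                             # also write a .txt transcript
], check=True)
```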

I also run Ollama with the AnythingLLM front end (it simply provides so many more options than plain cmd) on a single 7900XTX. The GPU will sometimes hit 14GB+ of VRAM used, but on this rig I only have 32GB of RAM, which matters more since, like you, I run other things simultaneously. From what I've seen while exploring, really large models are not worth it in a home/small-business environment due to diminishing returns, including on output quality, where you will get better results from specific smaller, easily manageable models.

The GitHub GPU benchmark was nice, but most users omit either the number of steps or the image resolution, and there is no data on other adjustments or the mix of nodes that may slow down performance. I will keep testing for myself; I believe Zluda will only get better, and ROCm 6.2 has already improved performance significantly compared to 03/2024. The 3090 is still a very capable card, and two of them even more so. The 5090 compared to the 4090 is a joke; when I got my 7900XTX it was almost half the price of a good 4090 (and I've got the top-end Sapphire). Absolutely no regrets - I'll possibly go for the next top-end AMD while keeping this one in another rig.
 
I tried this whisper.cpp implementation first. Unfortunately it doesn't work with the large-v3 model (including the turbo variant), but tiny and medium work great. The large model does, however, work with the Subtitle Edit front end (GPU only). I still need to check the specific performance numbers, since I've been trying to find the best ratio of front-end functionality to performance, but it's fast and relatively accurate. Surprisingly, for certain tasks the large model performs worse than medium (unsure why; tiny is only good for very basic ones).
You may be able to get WhisperX running with a fork of CTranslate2 for ROCm mentioned in this comment:
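For reference, standard WhisperX usage looks like this (straight from its README); whether the CUDA device selection carries over to the ROCm CTranslate2 fork is the open question:

```python
# Standard WhisperX usage as per its README; whether device="cuda" carries over
# cleanly to the ROCm CTranslate2 fork linked above is the part that needs testing
# (compute_type may also have to drop to "int8" or "float32" depending on the build).
import whisperx

device = "cuda"  # may need adjusting for a ROCm build
model = whisperx.load_model("large-v3", device, compute_type="float16")

audio = whisperx.load_audio("track.wav")      # any ffmpeg-readable file
result = model.transcribe(audio, batch_size=16)

for seg in result["segments"]:
    print(f"[{seg['start']:.1f}-{seg['end']:.1f}] {seg['text']}")
```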
 