
ComfyUI-Zluda Experience (AMD GPUs)

Looking for fellow users of ComfyUI-Zluda as I have some questions, especially regarding custom nodes. I have a 7900XTX and am really happy with the image-generation performance under ComfyUI, but many things don't yet work as expected.

First, has anyone managed to get crystools working properly via Zluda? You can install it, but it reports being broken (I forget the exact wording). I found this about deleting the GPU check, which I tried with no success (likely something has changed since 2024 - the Python version or ComfyUI itself).
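For anyone else poking at it, the workaround I understand targets the GPU/NVML check; here is a rough sketch of that kind of guard (the actual code inside crystools will look different, this only illustrates the idea):

```python
# Sketch of the kind of GPU check that gets patched in monitoring nodes like crystools.
# Real file/function names inside crystools may differ - the point is to degrade
# gracefully when NVIDIA's NVML is missing (as it is under ZLUDA).
import torch

try:
    import pynvml
    pynvml.nvmlInit()
    NVML_AVAILABLE = True
except Exception:
    # ZLUDA presents a "CUDA" device to torch, but NVML is NVIDIA-only,
    # so hardware monitoring should be disabled instead of erroring out.
    NVML_AVAILABLE = False

def gpu_stats():
    """Return minimal stats, skipping NVML-based readings when unavailable."""
    stats = {"torch_cuda_available": torch.cuda.is_available()}
    if NVML_AVAILABLE:
        handle = pynvml.nvmlDeviceGetHandleByIndex(0)
        mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
        stats["vram_used_mb"] = mem.used // (1024 * 1024)
    return stats

print(gpu_stats())
```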

Second, ComfyUI Manager. It installs and works as an interface for finding things, but it doesn't seem to actually install items (custom nodes, for example) - it reports installing, yet nothing happens, and I can see this in cmd. Is this just my experience, or is anyone else using ComfyUI Manager with ComfyUI-Zluda?
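In the meantime, the usual fallback when the Manager claims success but nothing appears is to install the node manually into custom_nodes and install its requirements yourself - a rough sketch, with the install path and repo URL as placeholders:

```python
# Manual custom-node install fallback for when ComfyUI Manager reports success
# but nothing shows up. The install path and repo URL below are placeholders.
import subprocess
from pathlib import Path

COMFY_DIR = Path(r"C:\ComfyUI-Zluda")                        # adjust to your checkout
NODE_REPO = "https://github.com/<author>/<custom-node>.git"  # placeholder URL

node_dir = COMFY_DIR / "custom_nodes" / NODE_REPO.split("/")[-1].removesuffix(".git")

# 1) Clone the node into custom_nodes
subprocess.run(["git", "clone", NODE_REPO, str(node_dir)], check=True)

# 2) Install its requirements with the same Python/venv that ComfyUI uses
req = node_dir / "requirements.txt"
if req.exists():
    venv_python = COMFY_DIR / "venv" / "Scripts" / "python.exe"  # assumed venv layout
    subprocess.run([str(venv_python), "-m", "pip", "install", "-r", str(req)], check=True)

# 3) Restart ComfyUI so the new node gets loaded
```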

Third, I'm unsure if this is related to the ready-to-use workflows/models inside the interface, but right-clicking on an empty area in ComfyUI is not possible / does nothing. It does work when dragging connection arrows or when right-clicking on existing nodes.

I also can't seem to get the live preview and the save-node image display to work. In ComfyUI Manager I switched between Auto and TAESD preview, downloading and installing the respective TAESD decoders and encoders, but I still can't see the images/progress. They just stay blank while the images are being generated into the ComfyUI output folder. What am I doing wrong?
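For completeness, previews can also be forced from the launch command instead of the Manager setting; a sketch assuming a typical install layout (if your install starts through a .bat wrapper, as ComfyUI-Zluda setups often do, the flag goes in that script instead):

```python
# Forcing previews from the launch command instead of the Manager setting.
# --preview-method is a standard ComfyUI argument (none / auto / latent2rgb / taesd);
# the TAESD decoder files are expected under models/vae_approx.
# The paths below are assumptions for a typical ComfyUI-Zluda checkout.
import subprocess
from pathlib import Path

COMFY_DIR = Path(r"C:\ComfyUI-Zluda")
venv_python = COMFY_DIR / "venv" / "Scripts" / "python.exe"

subprocess.run([str(venv_python), str(COMFY_DIR / "main.py"),
                "--preview-method", "taesd"],
               cwd=COMFY_DIR, check=True)
```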

On a positive note, image generation works very well: 1024x1024 images with 20 steps take less than 20 seconds to produce, and 512x512 typically takes less than 9 seconds with ROCm 6.2. My card is also undervolted and overclocked (incl. VRAM), but it would still be fast enough without that from what I can see...
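In terms of raw iteration rate, that works out roughly as follows (simple arithmetic, assuming all of the time is spent on sampling):

```python
# Rough iteration rates implied by those timings (assumes all of the wall time
# goes to the 20 sampling steps, so the real it/s is slightly higher).
steps = 20
print(f"1024x1024: ~{steps / 20:.1f} it/s or better")   # <20 s total
print(f"512x512:   ~{steps / 9:.1f} it/s or better")    # <9 s total
```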

It would be really great to share our experiences with ComfyUI-Zluda, as information online is very fragmented, there are not many sources from what I've seen so far, and GitHub developers tend to ignore AMD users.
 
Post got flagged by the spam filter. Doesn't look like spam to me. OP isn't using a VPN or proxy, UK .edu IP
 
Hi, this is definitely not spam nor abusive in any way. I only wanted to share my experiences with ComfyUI-Zluda on an AMD GPU and get feedback from other users, as the knowledge base for CUDA-designed apps running on AMD is extremely limited. There are workarounds and many people do it their own way (many of them completely user-unfriendly), but hey, this is what forums are for.

I managed to fix the live preview during processing and the save-node display by going through my browser and extension settings - it looks like one of the extensions was blocking it. I can now confirm that ComfyUI works better with some web browsers than with others: apart from requiring JavaScript, there are other things running in the background that you are never prompted to adjust.
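For anyone debugging the same thing: the previews are pushed over ComfyUI's WebSocket, so a quick way to rule the browser in or out is to connect to it from a script. A minimal sketch using the websocket-client package, assuming the default server address:

```python
# Quick check that ComfyUI's preview WebSocket is reachable outside the browser.
# If this connects and prints messages while a job is running, the server side is
# fine and blank previews point at a browser/extension problem.
import uuid
import websocket  # pip install websocket-client

server = "127.0.0.1:8188"          # ComfyUI's default address, adjust if needed
client_id = str(uuid.uuid4())

ws = websocket.WebSocket()
ws.connect(f"ws://{server}/ws?clientId={client_id}")
print("Connected - queue a prompt in the UI and watch for messages...")
for _ in range(5):
    msg = ws.recv()  # JSON status text, or binary frames for previews
    print(type(msg).__name__, str(msg)[:80])
ws.close()
```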

Unless someone else confirms it working, it also looks like crystools only supports CUDA cards on Windows; AMD monitoring is available on Linux, but not on Windows.

Are there any TPU users running Stable Diffusion (models, custom nodes, extensions, settings) on AMD GPUs or at all?
 
AFAIK the Zluda fork is quite far behind upstream ComfyUI, so it's missing lots of fixes and new features.
 
I've been using Zluda for a month or so and the setup was dreadful: it involved a lot of CLI/permissions/OS tinkering and nothing was straightforward, i.e. install and use. Plus error after error that had to be addressed individually. But once ComfyUI was finally working, performance on a 7900XTX with the latest ROCm 6.2 is so much better than I ever expected. 512x512 images get generated almost instantly, while 1024x1024 takes up to 20 seconds. How long does it take on a 4090/5090?

I also run a local 14B-parameter LLM and easily generate 50+ tok/sec. I even increased the context window to 7152 tokens from the 2048 default and it still runs fast, stable and at low temperatures. Whisper on the GPU also works wonders, even transcribing songs (for fun), with a 4-minute track taking a few seconds to fully transcribe. Curious to see how long it takes on a 4090/5090 and the 9070 XT.
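For anyone wanting to reproduce the context-window change, this is roughly how it can be done per request through Ollama's REST API (the model tag is only an example):

```python
# Raising the context window per request through Ollama's REST API.
# num_ctx defaults to 2048; the model tag below is only an example.
import requests

resp = requests.post("http://localhost:11434/api/generate", json={
    "model": "qwen2.5:14b",           # any 14B-class model you have pulled
    "prompt": "Summarise the ZLUDA project in two sentences.",
    "stream": False,
    "options": {"num_ctx": 7152},     # up from the 2048 default
})
print(resp.json()["response"])
```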

Zluda has been rewritten from scratch over the past year or so, after AMD took down the original project, and it is now fully open source; with ROCm 6.1 and 6.2 performance skyrocketed. With 7.0 due in a few months and - hopefully - open-source community support for older GPUs from before the 7000 series, things will only get better.
 
How long does it take on a 4090/5090?
You can get an idea of how other GPUs perform by looking here:

I've never used ComfyUI for diffusion stuff, so I can't really compare.

I also run a local 14B-parameter LLM and easily generate 50+ tok/sec. I even increased the context window to 7152 tokens from the 2048 default and it still runs fast, stable and at low temperatures. Whisper on the GPU also works wonders, even transcribing songs (for fun), with a 4-minute track taking a few seconds to fully transcribe. Curious to see how long it takes on a 4090/5090 and the 9070 XT.
What Whisper backend are you using? With WhisperX I can get 70~100x RTF on large-v3 with my 3090.
For LLMs I posted some benchmarks here:
I've tried a bunch of models, and keep switching back and forth between those.
Those are the ones currently downloaded into my MBP:
[attachment 385419: screenshot of the locally downloaded models]

I usually use Ollama as the backend, with either the Python API for some software I run, or Open WebUI when I want a chat-like thingy.
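For the Python API route, a minimal example with the official ollama package (the model tag is one from the table below):

```python
# Minimal example of driving Ollama from Python instead of the CLI
# (uses the official "ollama" package; the model tag is one from the table below).
import ollama

reply = ollama.chat(
    model="llama3.1:8b-instruct-q4_K_M",
    messages=[{"role": "user", "content": "What's the meaning of life?"}],
)
print(reply["message"]["content"])
```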

Some performance numbers from my setups on a "what's the meaning of life?" prompt, without using flash attention (FA) or any other software speedups:

Model                                  | 2x3090 (tok/s) | M3 Max (tok/s)
phi4:14b-q4_K_M                        | 62             | 25.5
phi4:14b-q8_0                          | 47.5           | 16.9
deepseek-r1:32b-qwen-distill-q4_K_M    | 27.8           | 12.6
deepseek-r1:7b-qwen-distill-q4_K_M     | 116            | 50.4
gemma:7b-instruct-v1.1-q4_0            | 113.7          | 46.3
llama3.1:8b-instruct-q4_K_M            | 110.1          | 46.08
llama3.1:8b-instruct-fp16              | 49.8           | 17.5
deepseek-r1:32b-qwen-distill-q8_0      | 21             | -
deepseek-r1:70b-llama-distill-q4_K_M   | 16.6           | -
llama3.3:70b-instruct-q4_K_M           | 16.7           | -

With the models above that need 2 GPUs, both GPUs average about 50% utilization. I did not run those on my MBP since I was out of memory for them, given the other crap I had open.
Both my 3090s are also set to a 275W power limit.
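If anyone wants to reproduce numbers like these, the tok/s can be read straight from Ollama's response metadata; a sketch, with the model tag taken from the table above:

```python
# Reproducing tok/s figures like the table above from Ollama's own response metadata.
# eval_count and eval_duration (in nanoseconds) are standard fields in the response;
# the model tag is just an example from the table.
import requests

r = requests.post("http://localhost:11434/api/generate", json={
    "model": "phi4:14b-q4_K_M",
    "prompt": "What's the meaning of life?",
    "stream": False,
}).json()

tok_per_s = r["eval_count"] / r["eval_duration"] * 1e9
print(f"{tok_per_s:.1f} tok/s")
```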
 
Thank you, this is all very helpful.

What Whisper backend are you using? With WhisperX I can get 70~100x RTF on large-v3 with my 3090.
For LLMs I posted some benchmarks here:

I tried this whisper.cpp implementation first. Unfortunately it doesn't work with the large-v3 model (including the turbo variant), but tiny and medium work great. The large model does, however, work with the Subtitle Edit front end (GPU only). I still need to check the specific performance numbers, since I've been trying to find the best ratio of front-end functionality to performance, but it's fast and relatively accurate. Surprisingly, for certain tasks the large model performs worse than medium (unsure why; tiny is only good for very basic ones).
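For reference, the basic invocation looks roughly like this, wrapped in Python for convenience (the binary name and all paths are assumptions that depend on your build):

```python
# Rough sketch of driving a whisper.cpp build from Python via subprocess.
# The binary name ("whisper-cli") and all paths are assumptions that depend on the
# build; whisper.cpp expects 16 kHz mono WAV input, so convert the track first (ffmpeg).
import subprocess

subprocess.run([
    r"C:\whisper.cpp\whisper-cli.exe",                   # assumed binary location
    "-m", r"C:\whisper.cpp\models\ggml-medium.bin",      # medium works fine here
    "-f", r"C:\audio\track_16k.wav",                     # pre-converted to 16 kHz WAV
    "-otxt",                                             # also write a .txt transcript
], check=True)
```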

I also run Ollama with the AnythingLLM front end (it simply provides so many more options than plain cmd) on a single 7900XTX. The GPU will sometimes hit 14GB+ of VRAM used, but on this rig I only have 32GB of RAM, which matters more since, like you, I run other things simultaneously. From what I've seen while exploring, really large models are not worth it in a home/small-business environment due to diminishing returns, including on output quality, where you will get better results from specific smaller, easily manageable models.

The GitHub GPU benchmark was nice, but most users omit either the number of steps or the image resolution, and there is no data on other adjustments or the mix of nodes that may slow down performance. I will keep testing for myself; I believe Zluda will only get better, and ROCm 6.2 has already improved performance significantly compared to 03/2024. The 3090 is still a very capable card, and two of them even more so. The 5090 compared to the 4090 is a joke; when I got my 7900XTX it was almost half the price of a good 4090 (and I've got the top-end Sapphire). Absolutely no regrets - I'll possibly go for the next top-end AMD while keeping this one in another rig.
 
I tried this whisper.cpp implementation first. Unfortunately it doesn't work with the large-v3 model (including the turbo variant), but tiny and medium work great. The large model does, however, work with the Subtitle Edit front end (GPU only). I still need to check the specific performance numbers, since I've been trying to find the best ratio of front-end functionality to performance, but it's fast and relatively accurate. Surprisingly, for certain tasks the large model performs worse than medium (unsure why; tiny is only good for very basic ones).
You may be able to get WhisperX running with a fork of CTranslate2 for ROCm mentioned in this comment:
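For reference, standard WhisperX usage looks like this (straight from its README); whether the CUDA device selection carries over to the ROCm CTranslate2 fork is the open question:

```python
# Standard WhisperX usage as per its README; whether device="cuda" carries over
# cleanly to the ROCm CTranslate2 fork linked above is the part that needs testing
# (compute_type may also have to drop to "int8" or "float32" depending on the build).
import whisperx

device = "cuda"  # may need adjusting for a ROCm build
model = whisperx.load_model("large-v3", device, compute_type="float16")

audio = whisperx.load_audio("track.wav")      # any ffmpeg-readable file
result = model.transcribe(audio, batch_size=16)

for seg in result["segments"]:
    print(f"[{seg['start']:.1f}-{seg['end']:.1f}] {seg['text']}")
```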
 