
What local LLMs do you use?

You should look into other forks that are way faster, such as WhisperX or faster-whisper:
https://github.com/m-bain/whisperX (uses faster-whisper underneath)

I run the first one as a public service off my GPU.
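If anyone wants to try the faster-whisper route, a minimal sketch looks like this (assumes a CUDA GPU and a local audio.mp3; switch device/compute_type to "cpu"/"int8" if you don't have one):

Code:
from faster_whisper import WhisperModel

# Load a Whisper model via CTranslate2 (much faster than the original openai/whisper)
model = WhisperModel("large-v3", device="cuda", compute_type="float16")

# Transcribe a local file; segments is a generator of timestamped chunks
segments, info = model.transcribe("audio.mp3", beam_size=5)
print(f"Detected language: {info.language} (p={info.language_probability:.2f})")
for seg in segments:
    print(f"[{seg.start:.2f}s -> {seg.end:.2f}s] {seg.text}")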
Very nice! Will take a look. One thing that I really like about llama.cpp and whisper.cpp is that there's no Python: much easier to get working and keep working. When I tried other Python-based LLM engines in the past, installing one often ended up breaking something else. Also, both llama.cpp and whisper.cpp have nice built-in web servers.
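That's part of the appeal: the llama.cpp server exposes an OpenAI-compatible HTTP API, so a script can talk to it with nothing but an HTTP library. A minimal sketch, assuming llama-server is already running locally on its default port 8080 with whatever GGUF you loaded:

Code:
import requests

# llama-server's OpenAI-compatible chat endpoint (default: http://127.0.0.1:8080)
resp = requests.post(
    "http://127.0.0.1:8080/v1/chat/completions",
    json={
        "model": "local",  # llama-server serves whatever model it was started with
        "messages": [{"role": "user", "content": "Explain in one sentence what llama.cpp is."}],
        "max_tokens": 128,
    },
    timeout=120,
)
print(resp.json()["choices"][0]["message"]["content"])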
 
I'm using the Q4_K_M quant of Qwen3 30B A3B with the following settings (tried the same as yours):
[screenshot: settings.png]


This gives me around 19.38 tok/sec, but the model crashes after returning the output with the following error. Unsure if it's related to the quant, CPU, context, etc.

Code:
Failed to regenerate message
The model has crashed without additional information. (Exit code: 18446744072635812000)

[screenshot: tokens.png]


Looks like it's a known issue - https://github.com/lmstudio-ai/lmstudio-bug-tracker/issues/297
Using QWEN3 30B Q6_K 41/48
 
Hello, please tell me which models are best suited for writing short stories of about 10,000 words? For example, so that I can enter the prompt "write a story about how the hero defeated the dragon, took the princess from the tower, and his donkey married the dragon" and get, at minimum, something like Shrek in response. I have 64 GB of RAM, and I can allocate no more than 32 GB to Vulkan.
 
Went from a 7800 XT to a 7900 XTX. Running Ollama, Phi4 scaled almost perfectly with VRAM bandwidth (624 GB/s vs 960 GB/s), going from 42 to 62 tps. Gemma3 would not fit into the 7800 XT's VRAM and was partially swapped to RAM, so that saw a relatively bigger increase, 9 -> 35 tps, making it nicely usable now.
Lowered the power limit by 10% (I wish AMD would let you go lower), capped the GPU clock at 2.2 GHz (useless here from an LLM perspective anyway), raised the VRAM to 2.7 GHz (+fast timings), and lowered the voltage from 1150 mV to 1100 mV. That got me from 339 W to 226 W when running Gemma3:27B at almost identical performance, 34 tps.
Hey @AMD, it's about time to include out-of-the-box profiles for running LLMs on high-end GPUs!
 
Nice, those numbers are really similar to the ones from my 3090 @ 275W with ollama as well.
Have you tried other inference engines? I managed to get a bit higher perf with engines like vLLM and SGLang, I believe both of those should support AMD as well (not sure how easy it is to setup tho).
 
Nice, those numbers are really similar to the ones from my 3090 @ 275W with ollama as well.
No surprise here, both have similar VRAM bandwidth and the same 24 GB of VRAM. The 7900 XTX is built on newer TSMC 5/6 nm silicon that should give it an edge in power efficiency over Samsung 8 nm, despite its chiplet-based architecture.
Have you tried other inference engines? I managed to get a bit higher perf with engines like vLLM and SGLang, I believe both of those should support AMD as well (not sure how easy it is to setup tho).
After a bad experience with LM Studio (ROCm and Vulkan there still give only 12-13 tps with Gemma3:27B, about a third of what I get in Ollama), I haven't looked past Ollama.
Besides my home lab, I am now also using Ollama on servers at work.
I see that vLLM is open source and has Open WebUI integration; I might give it a go later. SGLang also seems interesting.

Edit: my PC reached another milestone: I was able to run my first >100B-parameter model, llama4:scout. It takes all of my 24 GB of VRAM plus 57 GB of RAM, pushing a puny 5-6 tps.
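For anyone comparing tps numbers across setups: Ollama's API reports its own token counts and timings, so you can compute tokens/s instead of eyeballing the console. A small sketch, assuming a local Ollama on the default port 11434 with the model already pulled:

Code:
import requests

r = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "gemma3:27b", "prompt": "Explain VRAM bandwidth in two sentences.", "stream": False},
    timeout=600,
)
data = r.json()
# eval_count = generated tokens, eval_duration = generation time in nanoseconds
tps = data["eval_count"] / (data["eval_duration"] / 1e9)
print(f"{data['eval_count']} tokens at {tps:.1f} tok/s")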
 
No surprise here, both have similar VRAM bandwidth and the same 24 GB of VRAM. The 7900 XTX is built on newer TSMC 5/6 nm silicon that should give it an edge in power efficiency over Samsung 8 nm, despite its chiplet-based architecture.
But still, both are compute-limited with those models, given that for something like Phi4 (~9 GB in size) we should be getting around 900 GB/s / 9 GB ≈ 100 tok/s instead of the 60-65 we currently see.
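To make the arithmetic explicit, here it is as a rough sketch (bandwidth and model-size figures are the approximate ones quoted in this thread):

Code:
# Back-of-the-envelope decode ceiling: each generated token streams the whole
# quantized model through the memory bus once, so
#   max tok/s ~= memory bandwidth / model size in bytes
def max_tps(bandwidth_gb_s: float, model_gb: float) -> float:
    return bandwidth_gb_s / model_gb

for gpu, bw in [("7900 XTX", 960), ("RTX 3090", 936)]:
    print(f"{gpu}: ~{max_tps(bw, 9):.0f} tok/s ceiling for a ~9 GB Phi4 quant")
# Both come out above 100 tok/s, versus the 60-65 tok/s actually measured.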
I haven't looked past Ollama.
What backend are you using for ollama? The ROCm one?

Besides my home lab, I am now also using Ollama on servers at work.
If you're dealing with multiple users, I really recommend moving away from Ollama: it's not meant for high concurrency and doesn't handle batched scenarios properly. vLLM and SGLang do way better in this regard.
At work we started with just regular Python + transformers, then moved to Ollama, which was fine for occasional requests, but when we started dealing with higher concurrency we ended up moving to vLLM.
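For reference, the offline batched path in vLLM is only a few lines, which is where it pulls ahead of Ollama under concurrency. A minimal sketch (model name and prompts are just placeholders; pick something that fits your VRAM):

Code:
from vllm import LLM, SamplingParams

prompts = [
    "Summarize ticket 1 ...",  # placeholder prompts
    "Summarize ticket 2 ...",
    "Summarize ticket 3 ...",
]
params = SamplingParams(temperature=0.7, max_tokens=256)

llm = LLM(model="microsoft/phi-4")       # example model id, not a recommendation
outputs = llm.generate(prompts, params)  # all prompts batched in one call
for out in outputs:
    print(out.outputs[0].text)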
 
But still, both are compute-limited with those models, given that for something like Phi4 (~9 GB in size) we should be getting around 900 GB/s / 9 GB ≈ 100 tok/s instead of the 60-65 we currently see.
During inference, all that extra compute does is wait on I/O. I just did a quick test: Phi4 did 57 tps (with a bunch of other stuff running in the background); when I dropped the VRAM speed from 2700 to 2500 MHz I got 52 tps. That's linear bandwidth scaling: tps dropped 1:1 with the available bandwidth. Now, when I bump compute by raising the max GPU clock 10% (75% -> 85%, or 2213 -> 2508 MHz), guess what happens? No tps increase at all => bandwidth starved.
Note to gamers: yes, I know the 7900 XTX can run its GPU core at 3-3.1 GHz, but in the LLM inference use case that compute is wasted energy.
What backend are you using for ollama? The ROCm one?
Yes, ROCm
If you're dealing with multiple users, I really recommend moving away from Ollama: it's not meant for high concurrency and doesn't handle batched scenarios properly. vLLM and SGLang do way better in this regard.
Yeah, I already noticed Ollama is not for multi-user scenarios. Fortunately that is not our use case (for now): we have a single stream of tickets that gets handled by one LLM. When testing things out with Ollama, I simply provide multiple Ollama containers, each dedicated to a team/use case.
 
Hmm, I get 35.89 tps for that same prompt. That is quite a big 31% performance drop compared to the 17% memory bandwidth difference between the 7900 XT and the 7900 XTX.
@csendesmark I am guessing the model won't fit 100% into the 7900 XT's VRAM? The "ollama ps" command tells me gemma3:27b-it-qat takes 22 GB of VRAM and vanilla gemma3:27b takes 21 GB, both above the 20 GB the 7900 XT has available. That could explain the performance difference.
 
Hmm, I get 35.89 tps for that same prompt. That is quite a big 31% performance drop compared to the 17% memory bandwidth difference between the 7900 XT and the 7900 XTX.
@csendesmark I am guessing the model won't fit 100% into the 7900 XT's VRAM? The "ollama ps" command tells me gemma3:27b-it-qat takes 22 GB of VRAM and vanilla gemma3:27b takes 21 GB, both above the 20 GB the 7900 XT has available. That could explain the performance difference.
It should fit into the 20 GB of VRAM.
Will check a different Gemma 27B model tomorrow.
-----------
Later:
[screenshot: 1748334373570.png]

Somewhat better performance with this one: Q4_K_M, 16.5 GB
 