
What local LLMs do you use?

You should look into other forks that are way faster, such as WhisperX or faster-whisper:
https://github.com/m-bain/whisperX (uses faster-whisper underneath)

I run the first one as a public service off my GPU.
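If anyone wants to try the faster-whisper route, a minimal sketch looks like this (assumes a CUDA GPU and a local audio.mp3; switch device/compute_type to "cpu"/"int8" if you don't have one):

Code:
from faster_whisper import WhisperModel

# Load a Whisper model via CTranslate2 (much faster than the original openai/whisper)
model = WhisperModel("large-v3", device="cuda", compute_type="float16")

# Transcribe a local file; segments is a generator of timestamped chunks
segments, info = model.transcribe("audio.mp3", beam_size=5)
print(f"Detected language: {info.language} (p={info.language_probability:.2f})")
for seg in segments:
    print(f"[{seg.start:.2f}s -> {seg.end:.2f}s] {seg.text}")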
Very nice! Will take a look. One thing that I really like about llama.cpp and whisper.cpp is that there's no Python: much easier to get working and keep working. When I tried other Python-based LLM engines in the past, installing one often ended up breaking something else. Also, both llama.cpp and whisper.cpp have nice built-in web servers.
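That's part of the appeal: the llama.cpp server exposes an OpenAI-compatible HTTP API, so a script can talk to it with nothing but an HTTP library. A minimal sketch, assuming llama-server is already running locally on its default port 8080 with whatever GGUF you loaded:

Code:
import requests

# llama-server's OpenAI-compatible chat endpoint (default: http://127.0.0.1:8080)
resp = requests.post(
    "http://127.0.0.1:8080/v1/chat/completions",
    json={
        "model": "local",  # llama-server serves whatever model it was started with
        "messages": [{"role": "user", "content": "Explain in one sentence what llama.cpp is."}],
        "max_tokens": 128,
    },
    timeout=120,
)
print(resp.json()["choices"][0]["message"]["content"])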
 
I'm using the Q4_K_M quant of Qwen3 30B A3B with the following settings (tried the same as yours):
[screenshot: settings.png]


This gives me around 19.38 tok/sec, but the model crashes after returning the output with the following error. Unsure if it's related to the quant, CPU, context, etc.

Code:
Failed to regenerate message
The model has crashed without additional information. (Exit code: 18446744072635812000)

[screenshot: tokens.png]


Looks like it's a known issue - https://github.com/lmstudio-ai/lmstudio-bug-tracker/issues/297
Using QWEN3 30B Q6_K 41/48
 
Hello, please tell me which models are best suited for writing short stories of about 10,000 words? For example, so that I can enter the prompt "write a story about how the hero defeated the dragon, took the princess from the tower, and his donkey married the dragon" and get, at minimum, something like Shrek in response. I have 64 GB of RAM, and I can allocate no more than 32 GB to Vulkan.
 
Went from a 7800 XT to a 7900 XTX. Running Ollama, Phi4 scaled almost perfectly with VRAM bandwidth (624 GB/s vs 960 GB/s), going from 42 to 62 tps. Gemma3 would not fit into the 7800 XT's VRAM and was partially swapped to RAM, so that saw a relatively bigger increase, 9 -> 35 tps, making it nicely usable now.
Lowered the power limit by 10% (I wish AMD would let you go lower), capped the GPU clock at 2.2 GHz (useless here from an LLM perspective anyway), raised the VRAM to 2.7 GHz (+fast timings), and lowered the voltage from 1150 mV to 1100 mV. That got me from 339 W to 226 W when running Gemma3:27B at almost identical performance, 34 tps.
Hey @AMD, it's about time to include out-of-the-box profiles for running LLMs on high-end GPUs!
 
Nice, those numbers are really similar to the ones from my 3090 @ 275W with ollama as well.
Have you tried other inference engines? I managed to get a bit higher perf with engines like vLLM and SGLang, I believe both of those should support AMD as well (not sure how easy it is to setup tho).
 
Nice, those numbers are really similar to the ones from my 3090 @ 275W with ollama as well.
No surprise here, both have similar VRAM bandwidth and the same 24 GB of VRAM. The 7900 XTX is built on newer TSMC 5/6 nm silicon that should give it an edge in power efficiency over Samsung 8 nm, despite its chiplet-based architecture.
Have you tried other inference engines? I managed to get a bit higher perf with engines like vLLM and SGLang, I believe both of those should support AMD as well (not sure how easy it is to setup tho).
After a bad experience with LM Studio (ROCm and Vulkan there still give only 12-13 tps with Gemma3:27B, about a third of what I get in Ollama), I haven't looked past Ollama.
Besides my home lab, I am now also using Ollama on servers at work.
I see that vLLM is open source and has Open WebUI integration; I might give it a go later. SGLang also seems interesting.

Edit: my PC reached another milestone: I was able to run my first >100B-parameter model, llama4:scout. It takes all of my 24 GB of VRAM plus 57 GB of RAM, pushing a puny 5-6 tps.
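For anyone comparing tps numbers across setups: Ollama's API reports its own token counts and timings, so you can compute tokens/s instead of eyeballing the console. A small sketch, assuming a local Ollama on the default port 11434 with the model already pulled:

Code:
import requests

r = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "gemma3:27b", "prompt": "Explain VRAM bandwidth in two sentences.", "stream": False},
    timeout=600,
)
data = r.json()
# eval_count = generated tokens, eval_duration = generation time in nanoseconds
tps = data["eval_count"] / (data["eval_duration"] / 1e9)
print(f"{data['eval_count']} tokens at {tps:.1f} tok/s")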
 
No surprise here, both have similar VRAM bandwidth and the same 24 GB of VRAM. The 7900 XTX is built on newer TSMC 5/6 nm silicon that should give it an edge in power efficiency over Samsung 8 nm, despite its chiplet-based architecture.
But still, both are compute-limited with those models, given that for something like Phi4 (~9 GB in size) we should be getting around 900 GB/s / 9 GB ≈ 100 tok/s instead of the 60-65 we currently see.
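To make the arithmetic explicit, here it is as a rough sketch (bandwidth and model-size figures are the approximate ones quoted in this thread):

Code:
# Back-of-the-envelope decode ceiling: each generated token streams the whole
# quantized model through the memory bus once, so
#   max tok/s ~= memory bandwidth / model size in bytes
def max_tps(bandwidth_gb_s: float, model_gb: float) -> float:
    return bandwidth_gb_s / model_gb

for gpu, bw in [("7900 XTX", 960), ("RTX 3090", 936)]:
    print(f"{gpu}: ~{max_tps(bw, 9):.0f} tok/s ceiling for a ~9 GB Phi4 quant")
# Both come out above 100 tok/s, versus the 60-65 tok/s actually measured.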
I haven't looked past Ollama.
What backend are you using for ollama? The ROCm one?

Besides my home lab, I am now also using Ollama on servers at work.
If you're dealing with multiple users, I really recommend moving away from Ollama: it's not meant for high concurrency and doesn't handle batched scenarios properly. vLLM and SGLang do way better in this regard.
At work we started with just regular Python + transformers, then moved to Ollama, which was fine for occasional requests, but when we started dealing with higher concurrency we ended up moving to vLLM.
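For reference, the offline batched path in vLLM is only a few lines, which is where it pulls ahead of Ollama under concurrency. A minimal sketch (model name and prompts are just placeholders; pick something that fits your VRAM):

Code:
from vllm import LLM, SamplingParams

prompts = [
    "Summarize ticket 1 ...",  # placeholder prompts
    "Summarize ticket 2 ...",
    "Summarize ticket 3 ...",
]
params = SamplingParams(temperature=0.7, max_tokens=256)

llm = LLM(model="microsoft/phi-4")       # example model id, not a recommendation
outputs = llm.generate(prompts, params)  # all prompts batched in one call
for out in outputs:
    print(out.outputs[0].text)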
 
But still, both are compute-limited with those models, given that for something like Phi4 (~9 GB in size) we should be getting around 900 GB/s / 9 GB ≈ 100 tok/s instead of the 60-65 we currently see.
During inference, all that extra compute does is wait on I/O. I just did a quick test: Phi4 did 57 tps (with a bunch of other stuff running in the background); when I dropped the VRAM speed from 2700 to 2500 MHz I got 52 tps. That's linear bandwidth scaling: tps dropped 1:1 with the available bandwidth. Now, when I bump compute by raising the max GPU clock 10% (75% -> 85%, or 2213 -> 2508 MHz), guess what happens? No tps increase at all => bandwidth starved.
Note to gamers: yes, I know the 7900 XTX can run its GPU core at 3-3.1 GHz, but in the LLM inference use case that compute is wasted energy.
What backend are you using for ollama? The ROCm one?
Yes, ROCm
If you're dealing with multiple users, I really recommend moving away from Ollama: it's not meant for high concurrency and doesn't handle batched scenarios properly. vLLM and SGLang do way better in this regard.
Yeah, I already noticed Ollama is not for multi-user scenarios. Fortunately that is not our use case (for now): we have a single stream of tickets that gets handled by one LLM. When testing things out with Ollama, I simply provide multiple Ollama containers, each dedicated to a team/use case.
 
Hmm, I get 35.89 tps for that same prompt. That is quite a big 31% performance drop compared to the 17% memory bandwidth difference between the 7900 XT and the 7900 XTX.
@csendesmark I am guessing the model won't fit 100% into the 7900 XT's VRAM? The "ollama ps" command tells me gemma3:27b-it-qat takes 22 GB of VRAM and vanilla gemma3:27b takes 21 GB, both above the 20 GB the 7900 XT has available. That could explain the performance difference.
 
Hmm, I get 35.89 tps for that same prompt. That is quite a big 31% performance drop compared to the 17% memory bandwidth difference between the 7900 XT and the 7900 XTX.
@csendesmark I am guessing the model won't fit 100% into the 7900 XT's VRAM? The "ollama ps" command tells me gemma3:27b-it-qat takes 22 GB of VRAM and vanilla gemma3:27b takes 21 GB, both above the 20 GB the 7900 XT has available. That could explain the performance difference.
It should fit into the 20 GB of VRAM.
Will check a different Gemma 27B model tomorrow.
-----------
Later:
[screenshot: 1748334373570.png]

Somewhat better performance with this one: Q4_K_M, 16.5 GB
 