
What local LLMs do you use?

A Q4 17B should fit pretty easily in even 16GB of VRAM; it shouldn't be a problem. Processing/generation time should outweigh load/unload time.
If you manage to get it working properly with something like ktransformers, then maybe. You'd still need quite a lot of RAM, but it should be feasible on consumer platforms nonetheless.
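Rough math behind the "Q4 17B in 16GB" point, as a back-of-the-envelope Python sketch (the 0.5 bytes/weight figure and the ~20% overhead allowance for KV cache and buffers are assumptions, not measured numbers):

# Back-of-the-envelope VRAM estimate for a Q4-quantized 17B model.
params = 17e9
bytes_per_weight = 0.5                      # typical effective size for Q4-ish quants (assumed)
weights_gb = params * bytes_per_weight / 1e9
overhead_gb = weights_gb * 0.2              # KV cache + runtime buffers, rough allowance (assumed)
print(f"weights ~{weights_gb:.1f} GB, total ~{weights_gb + overhead_gb:.1f} GB")
# ~8.5 GB of weights, ~10 GB total, so a 16 GB card has headroom left for context.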
When it comes to inferencing current models, the bottleneck is VRAM bandwidth. GPU or CPU compute is almost irrelevant. As observed with the ollama ps command, tps drops dramatically when even a few % of the model is forced to run from system RAM, which is roughly 10x slower than GPU memory.
LLM models have billions of parameters, and each inference pass requires loading most or all of them. It's like billions of neurons firing and communicating with each other when thinking. This creates a huge storm of data traffic, and memory bandwidth is the key performance metric holding back tps. Yes, at some point when memory becomes fast enough, compute needs to catch up, but due to historical design (calculating frames is more compute-intensive than bandwidth-intensive), GPUs today are bandwidth-starved when doing LLM inference runs.
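To put the bandwidth argument in numbers: if generation is truly bandwidth-bound, the ceiling on tps is roughly memory bandwidth divided by the bytes streamed per token, since each token has to pull (most of) the quantized weights through memory once. A minimal sketch with assumed bandwidth figures (actual numbers vary by card and DDR configuration):

# Bandwidth-bound upper limit on tokens/s.
def max_tps(bandwidth_gb_s, model_size_gb):
    # Every generated token streams roughly the whole quantized model once.
    return bandwidth_gb_s / model_size_gb

model_gb = 10.0  # e.g. a ~20B model at Q4 (assumed size)
print(f"{max_tps(800, model_gb):.0f} tok/s ceiling at ~800 GB/s GDDR6 (assumed)")
print(f"{max_tps(80, model_gb):.0f} tok/s ceiling at ~80 GB/s dual-channel DDR5 (assumed)")
# A memory pool that is 10x slower caps tps 10x lower, which is why spilling
# even a few layers to system RAM hurts so much.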
But now you're talking about DRAM offloading and making use of CPU for some of the layers, which is totally different from a mGPU setup that was the previous discussion point.
The PCIe bottleneck is really not much of an issue for a small number of GPUs for inference.
 
Wonder if we're getting a smaller-sized Llama 4 for consumer GPUs.
 
Neither Alphabet nor Meta have any motivation to let Joe Consumer run their LLMs locally on their own hardware.

Both companies make the lion's share of their revenue selling their users' Internet usage data. They want people to upload their AI chatbot queries to the cloud. YOU are their product. This should be a surprise to no one here at TPU.

While many people online looovvve to hate on Apple, at least Apple prioritizes privacy and data security. That's why they have taken pains to run at least some of their AI operations locally on the user's hardware (Apple Silicon Macs, Apple Silicon iPads, iPhone 15 Pro and the iPhone 16 family), with only some operations being done on their Private Cloud Compute servers. It's probably also why Apple is slow to roll out AI features, since they need to worry about privacy and security as well.

Look at Microsoft Recall. When it was first announced, Microsoft was ridiculed for crushingly inadequate data security and privacy. They listened and postponed deployment. It's almost a year later and there are finally some whispers that it's coming Real Soon Now™. Clearly Microsoft rewrote almost everything from scratch with some sort of attempt to reduce privacy and data security vulnerabilities.
 
I use the following on my little machine.
1. DeepSeek-R1-Distill-Qwen-14B
2. Llama 3 8B Instruct
3. https://huggingface.co/RichardErkhov/failspy_-_Meta-Llama-3-8B-Instruct-abliterated-v3-gguf

I wish to try out some 27B (Gemma 3) and 32B models in the future. Worth mentioning: the above models were run at Q4 quantization.

DeepCoder-14B-Preview: this one looks promising after some short testing!
LM Studio + 7900XT?
I'd be interested in seeing your results even if they are just rough runs or whatever.

Edit: Looks like you've shared some numbers on the previous pages.
 
LM Studio + 7900XT?
I'd be interested in seeing your results even if they are just rough runs or whatever.

Edit: Looks like you've shared some numbers on the previous pages.
What models are you interested in?
Currently I have these:
[screenshot: list of currently installed models]
 
- Mistral Small (24B)
- Any Gemma 27B that fits completely in your VRAM.

If things go as planned, I might have the same GPU as you.

Thank you!
Mistral Small (24B) 15 token/s
Gemma 27B - I have only the Q6 and Q8 - Q6 does 8.41 token/s
Gemma 12B Q8 does 44 token/s

I would advise you to get the XTX version with the 24GB - you can thank me later :D
Even that "smol" extra of 4GB will be super handy at some point!
 
Mistral Small (24B) 15 token/s
Gemma 27B - I have only the Q6 and Q8 - Q6 does 8.41 token/s
Gemma 12B Q8 does 44 token/s

I would advise you to get the XTX version with the 24GB - you can thank me later :D
Even that "smol" extra of 4GB will be super handy at some point!
Thanks for the results.

As much as I'd like to get the 7900XTX for the extra memory, the prices are too high.

I paid $730 USD (equivalent) for the XT, the cheapest XTX is $1000. Not worth it for me although it would've been quite nice to have the extra 4GB.

Nvidia options in this range only have 12GB; in fact, that's the only reason I even considered the 7900XT (well, I also can't find the Nvidia cards in stock).
 
Don't know if anyone has noticed this or not, but I seem to get up to 20% better performance under Linux...
 
Neither Alphabet nor Meta have any motivation to let Joe Consumer run their LLMs locally on their own hardware.

I pretty much agree that big tech has no real interest in consumers running LLMs locally, at least not without being able to sell it to them or generate revenue from it in some way, like via adverts, much like YouTube interrupting you every 3 or 4 minutes to watch another ad.
 
Just installed DeepSeek R1 (1QM). It's massive, 167GB lol
 
Just installed DeepSeek R1 (1QM). It's massive, 167GB lol
What monster rig do you have?
And also, what speeds can you get?
I could get one more of the kit I have to run it, but it would still be like 0.7 token/s or maybe even less :D
 
What monster rig do you have?
Just the test computer 285K / RTX 4090 and 4x64GB.

And also, what speeds can you get?
I could get one more of the kit I have to run it, but it would still be like 0.7 token/s or maybe even less :D
It is "slow" for response, but thats okay because the answers are much better vs Distilled 8B and some instances 70B. I don't know how to check the token rate in LLM Studio. Any ideas?
 
Neither Alphabet nor Meta have any motivation to let Joe Consumer run their LLMs locally on their own hardware.
Alphabet actually does have an incentive to let consumers run LLMs on their own hardware: their models are made to run on Android devices.
 
Qwen3 seems nice.
It seems a tad buggy and slow to load on KoboldCPP right now; it tends to "hang" on the second token for a good few seconds before continuing into gibberish after a few sentences. After a few minutes it reformats the message into something normal, but it's still very weird behaviour. It doesn't even load properly in LM Studio. This is an early Q6_K GGUF of Qwen3-32B, though, on software not yet updated to support it, and the author admits there are issues with the quants below Q6_K, so I won't condemn it just yet.
 
New, supposedly super-efficient Qwen 3 is here
https://huggingface.co/Qwen/Qwen3-8B — 63 token/s with Q8
https://huggingface.co/Qwen/Qwen3-30B-A3B — 18 tokens/s with Q6_K
https://huggingface.co/Qwen/Qwen3-235B-A22B — n/a :D
Mind sharing your settings for GPU offload layers and all that for the 30B-A3B? A screenshot would be great (click the arrow for advanced options when selecting the model to load).

I'm only getting 15 tokens/s with 4096 context on the 7900XT.
 
Mind sharing your settings for GPU offload layers and all that for the 30B-A3B? A screenshot would be great (click the arrow for advanced options when selecting the model to load).

I'm only getting 15 tokens/s with 4096 context on the 7900XT.
Using Qwen3 30B Q6_K with 41/48 layers offloaded to the GPU.
[screenshot: LM Studio model load settings]

Getting 14.2 tok/s while watching YouTube.
With Vivaldi closed:
[screenshot: token rate with the browser closed]

When I asked for the definition of a "point":
[screenshot: response and token rate]
 
Llama 4 variants arrived in ollama.

I'm not running the 804 GB model anytime soon, but the Scout model looks reasonable.
 
I just tried out whisper.cpp - it works great, and even the largest model is only 3GB. Runs well on both CPU and GPU.
 
I just tried out whisper.cpp - it works great, and even the largest model is only 3GB. Runs well on both CPU and GPU.
You should look into other forks that are way faster, such as WhisperX or faster-whisper:
https://github.com/m-bain/whisperX (uses faster-whisper underneath)

I run the first one as a public service off my GPU.
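For reference, a minimal faster-whisper sketch (the model size, device, and audio path are placeholders; drop to device="cpu", compute_type="int8" if VRAM is tight):

# pip install faster-whisper
from faster_whisper import WhisperModel

# Load Whisper large-v3 on the GPU with fp16 weights.
model = WhisperModel("large-v3", device="cuda", compute_type="float16")

# Transcribe a local audio file; segments is a generator of timestamped chunks.
segments, info = model.transcribe("audio.wav", beam_size=5)
print(f"Detected language: {info.language} (p={info.language_probability:.2f})")
for seg in segments:
    print(f"[{seg.start:.2f}s -> {seg.end:.2f}s] {seg.text}")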
 