
What local LLMs do you use?

A Q4 17B should fit pretty easily in even 16GB of VRAM; it shouldn't be a problem. Processing/generation time should outweigh load/unload time.
If you manage to get it working properly with something like ktransformers, then maybe. You'd still need quite a lot of RAM, but it should be feasible on consumer platforms nonetheless.
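Rough math behind the "Q4 17B in 16GB" point, as a back-of-the-envelope Python sketch (the 0.5 bytes/weight figure and the ~20% overhead allowance for KV cache and buffers are assumptions, not measured numbers):

# Back-of-the-envelope VRAM estimate for a Q4-quantized 17B model.
params = 17e9
bytes_per_weight = 0.5                      # typical effective size for Q4-ish quants (assumed)
weights_gb = params * bytes_per_weight / 1e9
overhead_gb = weights_gb * 0.2              # KV cache + runtime buffers, rough allowance (assumed)
print(f"weights ~{weights_gb:.1f} GB, total ~{weights_gb + overhead_gb:.1f} GB")
# ~8.5 GB of weights, ~10 GB total, so a 16 GB card has headroom left for context.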
When it comes to inferencing current models, the bottleneck is VRAM bandwidth. GPU or CPU compute is almost irrelevant. As observed with the ollama ps command, tps drops dramatically when even a few % of the model is forced to run from system RAM, which is roughly 10x slower than GPU memory.
LLM models have billions of parameters, and each inference pass requires loading most or all of them. It's like billions of neurons firing and communicating with each other when thinking. This creates a huge storm of data traffic, and memory bandwidth is the key performance metric holding back tps. Yes, at some point when memory becomes fast enough, compute needs to catch up, but due to historical design (calculating frames is more compute-intensive than bandwidth-intensive), GPUs today are bandwidth-starved when doing LLM inference runs.
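To put the bandwidth argument in numbers: if generation is truly bandwidth-bound, the ceiling on tps is roughly memory bandwidth divided by the bytes streamed per token, since each token has to pull (most of) the quantized weights through memory once. A minimal sketch with assumed bandwidth figures (actual numbers vary by card and DDR configuration):

# Bandwidth-bound upper limit on tokens/s.
def max_tps(bandwidth_gb_s, model_size_gb):
    # Every generated token streams roughly the whole quantized model once.
    return bandwidth_gb_s / model_size_gb

model_gb = 10.0  # e.g. a ~20B model at Q4 (assumed size)
print(f"{max_tps(800, model_gb):.0f} tok/s ceiling at ~800 GB/s GDDR6 (assumed)")
print(f"{max_tps(80, model_gb):.0f} tok/s ceiling at ~80 GB/s dual-channel DDR5 (assumed)")
# A memory pool that is 10x slower caps tps 10x lower, which is why spilling
# even a few layers to system RAM hurts so much.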
But now you're talking about DRAM offloading and making use of CPU for some of the layers, which is totally different from a mGPU setup that was the previous discussion point.
The PCIe bottleneck is really not much of an issue for a small number of GPUs for inference.
 
Wonder if we're getting a smaller-sized Llama 4 for consumer GPUs.
 
Neither Alphabet nor Meta have any motivation to let Joe Consumer run their LLMs locally on their own hardware.

Both companies make the lion's share of their revenue selling their users' Internet usage data. They want people to upload their AI chatbot queries to the cloud. YOU are their product. This should be a surprise to no one here at TPU.

While many people online looovvve to hate on Apple, at least Apple prioritizes privacy and data security. That's why they have taken pains to run at least some of their AI operations locally on the user's hardware (Apple Silicon Macs, Apple Silicon iPads, iPhone 15 Pro and the iPhone 16 family), with only some operations being done on their Private Cloud Compute servers. It's probably also why Apple is slow to roll out AI features, since they need to worry about privacy and security as well.

Look at Microsoft Recall. When it was first announced, Microsoft was ridiculed for crushingly inadequate data security and privacy. They listened and postponed deployment. It's almost a year later and there are finally some whispers that it's coming Real Soon Now™. Clearly Microsoft rewrote almost everything from scratch with some sort of attempt to reduce privacy and data security vulnerabilities.
 
I use the following on my little machine.
1. DeepSeek-R1-Distill-Qwen-14B
2. Llama 3 8B Instruct
3. https://huggingface.co/RichardErkhov/failspy_-_Meta-Llama-3-8B-Instruct-abliterated-v3-gguf

I wish to try out some 27B (Gemma 3) and 32B models in the future. Worth mentioning: the above models were run at Q4 quantization.

DeepCoder-14B-Preview: this one looks promising after some short testing!
LM Studio + 7900XT?
I'd be interested in seeing your results even if they are just rough runs or whatever.

Edit: Looks like you've shared some numbers on the previous pages.
 
LM Studio + 7900XT?
I'd be interested in seeing your results even if they are just rough runs or whatever.

Edit: Looks like you've shared some numbers on the previous pages.
What models are you interested in?
Currently I have these:
[screenshot: list of currently installed models]
 
- Mistral Small (24B)
- Any Gemma 27B that fits completely in your VRAM.

If things go as planned, I might have the same GPU as you.

Thank you!
Mistral Small (24B) 15 token/s
Gemma 27B - I have only the Q6 and Q8 - Q6 does 8.41 token/s
Gemma 12B Q8 does 44 token/s

I would advise you to get the XTX version with the 24GB - you can thank me later :D
Even that "smol" extra of 4GB will be super handy at some point!
 
Mistral Small (24B) 15 token/s
Gemma 27B - I have only the Q6 and Q8 - Q6 does 8.41 token/s
Gemma 12B Q8 does 44 token/s

I would advise you to get the XTX version with the 24GB - you can thank me later :D
Even that "smol" extra of 4GB will be super handy at some point!
Thanks for the results.

As much as I'd like to get the 7900XTX for the extra memory, the prices are too high.

I paid $730 USD (equivalent) for the XT, the cheapest XTX is $1000. Not worth it for me although it would've been quite nice to have the extra 4GB.

Nvidia options in this range only have 12GB; in fact, that's the only reason I even considered the 7900XT (well, I also can't find the Nvidia cards in stock).
 
Don't know if anyone has noticed this or not, but I seem to get up to 20% better performance under Linux...
 
Neither Alphabet nor Meta have any motivation to let Joe Consumer run their LLMs locally on their own hardware.

I pretty much agree that big tech has no real interest in consumers running LLMs locally, at least not without being able to sell it to them or generate revenue from it in some way, like via adverts, much like YouTube interrupting you every 3 or 4 minutes to watch another ad.
 
Just installed DeepSeek R1 (1QM). It's massive, 167GB lol
 
Just installed DeepSeek R1 (1QM). It's massive, 167GB lol
What monster rig do you have?
And also, what speeds can you get?
I could get one more of the kit I have to run it, but it would still be like 0.7 token/s or maybe even less :D
 
What monster rig do you have?
Just the test computer 285K / RTX 4090 and 4x64GB.

And also, what speeds can you get?
I could get one more of the kit I have to run it, but it would still be like 0.7 token/s or maybe even less :D
It is "slow" for response, but thats okay because the answers are much better vs Distilled 8B and some instances 70B. I don't know how to check the token rate in LLM Studio. Any ideas?
 
Neither Alphabet nor Meta have any motivation to let Joe Consumer run their LLMs locally on their own hardware.
Alphabet actually does have an incentive to let consumers run LLMs on their own hardware: their models are made to run on Android devices.
 
Qwen3 seems nice.
It seems a tad buggy and slow to load on KoboldCPP right now; it tends to "hang" on the second token for a good few seconds before continuing into gibberish after a few sentences. After a few minutes it reformats the message into something normal, but it's still very weird behaviour. It doesn't even load properly in LM Studio. This is an early Q6_K GGUF of Qwen3-32B, though, on software not yet updated to support it, and the author admits there are issues with the quants below Q6_K, so I won't condemn it just yet.
 
New, supposedly super-efficient Qwen 3 is here
https://huggingface.co/Qwen/Qwen3-8B — 63 token/s with Q8
https://huggingface.co/Qwen/Qwen3-30B-A3B — 18 tokens/s with Q6_K
https://huggingface.co/Qwen/Qwen3-235B-A22B — n/a :D
Mind sharing your settings for GPU offload layers and all that for the 30B-A3B? A screenshot would be great (click the arrow for advanced options when selecting the model to load).

I'm only getting 15 tokens/s with 4096 context on the 7900XT.
 
Mind sharing your settings for GPU offload layers and all that for the 30B-A3B? A screenshot would be great (click the arrow for advanced options when selecting the model to load).

I'm only getting 15 tokens/s with 4096 context on the 7900XT.
Using Qwen3 30B Q6_K with 41/48 layers offloaded to the GPU.
[screenshot: LM Studio model load settings]

Getting 14.2 tok/s while watching YouTube.
With Vivaldi closed:
[screenshot: token rate with the browser closed]

When I asked for the definition of a "point":
[screenshot: response and token rate]
 
Llama 4 variants arrived in ollama.

I'm not running the 804 GB model anytime soon, but the Scout model looks reasonable.
 
I just tried out whisper.cpp - it works great, and even the largest model is only 3GB. Runs well on both CPU and GPU.
 
I just tried out whisper.cpp - it works great, and even the largest model is only 3GB. Runs well on both CPU and GPU.
You should look into other forks that are way faster, such as WhisperX or faster-whisper:
https://github.com/m-bain/whisperX (uses faster-whisper underneath)

I run the first one as a public service off my GPU.
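For reference, a minimal faster-whisper sketch (the model size, device, and audio path are placeholders; drop to device="cpu", compute_type="int8" if VRAM is tight):

# pip install faster-whisper
from faster_whisper import WhisperModel

# Load Whisper large-v3 on the GPU with fp16 weights.
model = WhisperModel("large-v3", device="cuda", compute_type="float16")

# Transcribe a local audio file; segments is a generator of timestamped chunks.
segments, info = model.transcribe("audio.wav", beam_size=5)
print(f"Detected language: {info.language} (p={info.language_probability:.2f})")
for seg in segments:
    print(f"[{seg.start:.2f}s -> {seg.end:.2f}s] {seg.text}")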
 