
What local LLMs do you use?

Are you talking about just single user inference with larger-ish models solely in VRAM across GPUs in different nodes?
No. Single user inference with two workers doing partial offloading so I don't have to buy more RAM for R1.
If so, llama.cpp with RPC should do the job already. Not much to look into IMO.
Yes, I know. I'm looking into it.
 
In this case I can really recommend looking into ik's fork with the quants I shared; it can give a nice boost for cases where you offload to the CPU, and it also lets you mmap the weights.
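For anyone following along, the basic llama.cpp RPC setup for this looks roughly like the sketch below (the IPs, port, model path and -ngl value are placeholders, and the binary names assume a recent build compiled with -DGGML_RPC=ON):

    # on each worker node: expose its GPU(s) over RPC
    rpc-server -p 50052

    # on the main node: spread offloaded layers across local + remote GPUs;
    # whatever doesn't fit stays in system RAM (mmap'd by default)
    llama-cli -m DeepSeek-R1-IQ1_S.gguf \
        --rpc 192.168.1.10:50052,192.168.1.11:50052 \
        -ngl 30 --ctx-size 8192 -p "Hello"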
 
I've moved on from this model; I found Gemma 3 27B QAT much better.

Other than that I've used Devstral.

And afaik, the crashes still persist.


If you have another GPU lying around then use that.

VRAM >>> RAM

Oooh, nice. Will definitely try that one too, thanks! :)
 
Hello, please tell me which models are best suited for writing short stories of about 10,000 words. For example, I'd like to enter the prompt "write a story about how the hero defeated the dragon, took the princess from the tower, and his donkey married the dragon" and, at a minimum, get Shrek in response. I have 64 GB of RAM, and no more than 32 GB of it can be allocated to Vulkan.
I guess you can use the EQ-Bench benchmark for evaluation; I haven't tried LLMs for story writing or roleplaying personally.


Try this model via LM Studio - https://huggingface.co/mistralai/Mistral-Small-3.2-24B-Instruct-2506
 
You can't run R1 with just an extra card. For MoEs you should have enough RAM+VRAM to store the whole model, and then enough VRAM for the active params.
Certainly, I just meant in general. Although MoEs run ok on RAM as well.
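As a rough back-of-the-envelope example (assuming roughly 2.7 bits per weight for a Q2-ish quant; real file sizes vary with the quant mix):

    235B total params  x ~2.7 bits / 8 ≈ 80 GB  -> has to fit in RAM+VRAM, or be mmap'd off disk
     22B active params x ~2.7 bits / 8 ≈ 7.5 GB -> the per-token working set you'd ideally keep in VRAM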
 
Gemma 3 Abliterated :D

Though I admit that mostly I use the free offerings from DeepSeek and Microsoft, because most of what I ask for is either translations or PowerShell code, and even the free-tier offerings are better than anything my 4090 will run locally with any kind of performance.

Looks like I somehow missed the Qwen 3 release - I'll have to take a look.

I think the Q2 quant of the full model might just barely run on 24GB VRAM and 96GB RAM? I tried out Llama 4 Scout Q4_K_M and I was surprised by how well it performed, but that's MoE for you.

Is anyone aware of any benchmarks of how the 235B Q2 performs vs. 30B Q4? In knowledge, I mean, not speed.


edit: Running Qwen3-235B-A22B-Q2_K_L puts me at 90% memory usage, but it runs. And it's just below 5 tokens/s, so it's not unusably slow.
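For reference, one way to make a model like that fit on 24GB VRAM + 96GB RAM is to offload everything to the GPU except the MoE expert tensors, which get pinned to system RAM. The sketch below is only illustrative; the model path and the -ot pattern are placeholders, and --override-tensor / -ot needs a reasonably recent llama.cpp build:

    # keep attention and shared weights on the GPU, push the expert tensors to system RAM
    llama-cli -m Qwen3-235B-A22B-Q2_K_L-00001-of-0000N.gguf \
        -ngl 99 \
        -ot ".ffn_.*_exps.=CPU" \
        --ctx-size 8192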

What really impressed me, though, was how Qwen 3 30B combined that speed with the quality of its output on a rather tricky PowerShell code-generation request. One of my favourite questions to ask any LLM I try out:
As a professional PowerShell developer, please write a script to assist with migrating mailboxes from an old three-server DAG to a new three-server DAG in Exchange. The script should automatically create the required number of new mailbox databases, ensuring that no database exceeds 500GB in size. Additionally, it should evenly distribute mailboxes across the new databases for optimal load balancing. The script should handle all necessary steps, including the creation of the databases, mailbox moves, and ensuring an even distribution of mailbox data across the new DAG.

I initially wanted to try Qwen3-235B-A22B.i1-IQ2_M, but the files use some strange partitioning that causes an error if I try to recombine them with llama.cpp, so I ended up using the somewhat larger Qwen3-235B-A22B-Q2_K_L instead. Has anyone dealt with a 'does not contain split.count metadata' error before? I'm guessing a different tool was used for the splitting.
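In case it helps anyone hitting the same thing: shards produced by llama-gguf-split carry split.count metadata, so llama.cpp can load them by pointing at the first shard (or merge them), while plain byte-split uploads (e.g. *.part1of2 files) just need to be concatenated. Which of the two applies depends on how that particular upload was packaged, and the file names below are placeholders:

    # proper GGUF shards: load the first shard directly, or merge them first
    llama-gguf-split --merge Qwen3-235B-A22B.i1-IQ2_M-00001-of-0000N.gguf merged.gguf

    # plain byte-split parts: concatenate them in order
    cat model.gguf.part1of2 model.gguf.part2of2 > model.gguf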
 
Moonshot's Kimi is looking really nice. There are already quants going on for their 1T model:

And they have some other smaller dense models that appear to be distills as well:
 
1T?!?! I ain't NEVER running this shit dawg :cry: :cry: :cry:

By the way DeepSeek R1 doesn't work with RPC for me, but Qwen 235B does.
 
Gonna use gork in mah car.
 
1T?!?! I ain't NEVER running this shit dawg :cry: :cry: :cry:
If you have enough storage, you can run it by mmap'ing it off disk.
It should actually not be that bad given that it "only" has 32B parameters active, so you could actually get reasonable performance with 64GB of RAM, 16GB of VRAM for PP, and a fast enough NVMe.
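Nothing special is needed for that, by the way; llama.cpp mmaps the weights by default (as long as you don't pass --no-mmap), so the OS just pages the experts in from NVMe as they get hit. The model name and -ngl value below are placeholders:

    # weights stay on NVMe and are paged in on demand; only the KV cache and
    # whatever -ngl offloads has to actually fit in RAM/VRAM
    llama-cli -m Kimi-1T-IQ1_S.gguf -ngl 20 --ctx-size 4096 -p "Hello"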

Actually, I have a question for the nerds in here. I wanted to set up a local LLM to break down images of local weather maps (only once an hour or so), but my light googling (and lighter knowledge of LLMs) seems to point to me not having a GPU with the VRAM to pull something like this off. I'm confident in figuring out the backend programming/automation part of this, just not sure what LLM specifically I should be looking at to do this.
What do you mean by "break down images"? How exactly would an LLM be useful for that? IMO this seems like a better fit for a vision model than for a plain LLM, but of course I don't have a proper understanding of what you're trying to achieve.
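If it does turn out that a vision-capable model is the right tool, one low-effort way to prototype is to serve it behind an OpenAI-compatible endpoint (llama-server, LM Studio, etc., assuming the server and model support image input) and post the hourly map to it from a script. Everything below, from the port to the model name and prompt, is just an illustrative placeholder:

    # send the latest weather map to a local vision model and ask for a summary
    curl http://localhost:8080/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{
            "model": "local-vlm",
            "messages": [{
              "role": "user",
              "content": [
                {"type": "text", "text": "Summarise the precipitation and warnings shown on this weather map."},
                {"type": "image_url", "image_url": {"url": "data:image/png;base64,'"$(base64 -w0 weather_map.png)"'"}}
              ]
            }]
          }'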
 
New models with audio support on the block:

Basically they added ASR support on top of their existing 3B and 24B models, and managed really nice results with those. Having a built-in LLM is awesome as well, depending on your application.

I'm personally more interested in the STT part, not really the LLM itself, so I'll be giving the 3B model a run this weekend and comparing it to my current WhisperX setup. If someone comes up with nice quants for the 24B version, I may end up giving it a go as well.
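For anyone wanting to reproduce that kind of comparison, a plain WhisperX run looks roughly like this (the file name, model size and flags are placeholders; pick whatever your VRAM allows):

    # transcribe with word-level alignment; int8 keeps VRAM usage modest
    whisperx recording.wav --model large-v3 --language en --compute_type int8 --output_format srt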
 
For a moment I thought you were talking about a DAW / audio-processing helper.
 
By the way DeepSeek R1 doesn't work with RPC for me, but Qwen 235B does.
Tried it out some more: DS works with RPC, but only with CPUs plus newer NVIDIA hardware, and only with f16 context, in my experience. I'm getting around 2 t/s, or 2.45 t/s with -fmoe, on the ik_llama.cpp fork with DeepSeek-R1-UD-IQ1_S; all other options are: --ctx-size 16384 -mla 2 -fa -amb 512 -rtr
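Putting those flags together, the whole invocation looks roughly like the sketch below. The IP, port and model path are placeholders, and the binary names assume an ik_llama.cpp build compiled with RPC support:

    # worker node
    rpc-server -p 50052

    # main node: MLA mode 2, flash attention, runtime repacking and fused MoE, as above
    llama-server -m DeepSeek-R1-UD-IQ1_S-00001-of-0000N.gguf \
        --rpc 192.168.1.20:50052 \
        --ctx-size 16384 -mla 2 -fa -amb 512 -rtr -fmoe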
 