
What local LLMs do you use?

Are you talking about just single user inference with larger-ish models solely in VRAM across GPUs in different nodes?
No. Single user inference with two workers doing partial offloading so I don't have to buy more RAM for R1.
If so, llama.cpp with RPC should do the job already. Not much to look into IMO.
Yes, I know. I'm looking into it.
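For anyone following along, the RPC setup is roughly this (a sketch only; the hostnames, port and model filename are placeholders, the binary names can vary between builds, and llama.cpp has to be compiled with the RPC backend enabled):

  # on each worker node (llama.cpp built with -DGGML_RPC=ON)
  ./rpc-server -p 50052

  # on the main node, pointing at the workers
  ./llama-cli -m DeepSeek-R1-Q4_K_M.gguf --rpc 192.168.0.11:50052,192.168.0.12:50052 -ngl 99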
 
In this case I can really recommend looking into ik's fork with the quants I shared. It can give a nice boost in cases where you offload to the CPU, and it also lets you mmap the weights.
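Something along these lines, as a rough example (flag spellings are from recent ik_llama.cpp / llama.cpp builds, and the tensor-name pattern is just the common shorthand, so adjust to your model):

  # keep attention and shared weights on the GPU, push the MoE expert tensors to CPU RAM
  ./llama-server -m model.gguf -ngl 99 -ot exps=CPU -c 8192
  # weights are mmap'd by default, so only the pages actually touched get pulled off disk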
 
I've moved on from this model; I found Gemma 3 27B QAT much better.

Other than that I've used Devstral.

And afaik, the crashes still persist.


If you have another GPU lying around then use that.

VRAM >>> RAM

Oooh, nice. Will definitely try that one too, thanks! :)
 
Hello, please tell me which models are best suited for writing short stories of about 10,000 words. For example, I'd like to enter the prompt "write a story about how the hero defeated the dragon, took the princess from the tower, and his donkey married the dragon" and get, at minimum, Shrek in response. I have 64 GB of RAM, and no more than 32 GB of it can be allocated to Vulkan.
I guess you can use EQ-Bench for evaluation; I've not tried LLMs for story writing or roleplaying personally.


Try this model via LM Studio - https://huggingface.co/mistralai/Mistral-Small-3.2-24B-Instruct-2506
 
You can't run R1 with just an extra card. For MoEs you need enough RAM+VRAM to store the whole model, and then enough VRAM for the active params.
Certainly, I just meant in general. Although MoEs run ok on RAM as well.
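Very rough numbers for R1 to illustrate (assuming ~4 bits per weight, i.e. ~0.5 bytes per parameter): the full 671B parameters come to roughly 671e9 × 0.5 B ≈ 335 GB that has to fit somewhere in RAM+VRAM, while the ~37B parameters active per token only need about 37e9 × 0.5 B ≈ 19 GB of fast memory, which is why offloading the experts to system RAM can still give usable speeds.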
 
Gemma 3 Abliterated :D

Though I admit that mostly I use the free offerings from DeepSeek and Microsoft, because most of what I ask for is either translation or PowerShell code, and even the free-tier offerings are better than anything my 4090 will run locally with any kind of performance.

Looks like I somehow missed the Qwen 3 release - I'll have to take a look.

I think the Q2 quant of the full model might just barely run on 24GB VRAM and 96GB RAM? I tried out Llama 4 Scout Q4_K_M and I was surprised by how well it performed, but that's MoE for you.
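Back-of-envelope, assuming Q2_K_L averages somewhere around 2.5-3 bits per weight: 235B parameters × ~0.35 bytes ≈ 80-90 GB of weights, which does squeeze into 24 GB VRAM + 96 GB RAM with a little headroom left for the KV cache and the OS.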

Is anyone aware of any benchmarks of how the 235B Q2 performs vs. 30B Q4? In knowledge, I mean, not speed.


edit: Running Qwen3-235B-A22B-Q2_K_L puts me at 90% memory usage, but it runs. And it's just below 5 tokens/s, so it's not unusably slow.

What really impressed me, though, was the speed of Qwen 3 30B combined with the quality of its output for a rather tricky PowerShell code-generation request. One of my favourite questions to ask any LLM I try out:
As a professional PowerShell developer, please write a script to assist with migrating mailboxes from an old three-server DAG to a new three-server DAG in Exchange. The script should automatically create the required number of new mailbox databases, ensuring that no database exceeds 500GB in size. Additionally, it should evenly distribute mailboxes across the new databases for optimal load balancing. The script should handle all necessary steps, including the creation of the databases, mailbox moves, and ensuring an even distribution of mailbox data across the new DAG.

I initially wanted to try Qwen3-235B-A22B.i1-IQ2_M, but the files use some strange partitioning that causes an error when I try to recombine them with llama.cpp, so I ended up using the somewhat larger Qwen3-235B-A22B-Q2_K_L instead. Has anyone dealt with a 'does not contain split.count metadata' error before? I'm guessing a different tool was used for the splitting.
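For what it's worth, my understanding (not verified on these particular files) is that shards written by llama.cpp's own gguf-split tool carry the split.count metadata and can be loaded by just pointing at the first shard, while plain byte-split parts were made with something like split and only need to be concatenated. Filenames below are placeholders, and the binary may be named gguf-split in older builds:

  # gguf-split shards (model-00001-of-00003.gguf): load the first shard directly, or merge explicitly
  ./llama-gguf-split --merge model-00001-of-00003.gguf model-merged.gguf

  # plain byte-split parts (model.gguf.part1of3 and similar): concatenate them back together
  cat model.gguf.part* > model.gguf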
 