If you manage to get it properly working with something like ktransformers, then maybe. You'd still need quite a bit of RAM, but it should be feasible on consumer platforms nonetheless.

A Q4 17B model should fit pretty easily in even 16GB of VRAM, so that shouldn't be a problem. Processing/generation time should outweigh load/unload time.
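Quick back-of-the-envelope on the VRAM side (the ~4.5 bits/weight figure for a Q4_K-style quant and the overhead allowance are assumed ballpark numbers, not measurements):

```python
# Rough VRAM estimate for a Q4-quantized 17B model.
# Back-of-the-envelope only; bits/weight and overhead are assumed.

params = 17e9                 # 17B parameters
bits_per_weight = 4.5         # assumed average for a Q4_K-style quant (4-bit weights + scales)
overhead_gb = 2.0             # assumed allowance for KV cache, runtime context, buffers

weights_gb = params * bits_per_weight / 8 / 1e9
total_gb = weights_gb + overhead_gb

print(f"weights: ~{weights_gb:.1f} GB, with overhead: ~{total_gb:.1f} GB")
# weights: ~9.6 GB, with overhead: ~11.6 GB -> well inside 16 GB of VRAM
```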
But now you're talking about DRAM offloading and running some of the layers on the CPU, which is totally different from the multi-GPU setup that was the previous discussion point.

When it comes to inferencing current models, the bottleneck is VRAM bandwidth; GPU or CPU compute is almost irrelevant. As can be observed with the ollama ps command, tps drops dramatically when even a few % of the model is forced to run from system RAM, which is roughly 10x slower than VRAM.
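A rough sketch of why even a small offloaded fraction hurts so much, treating each generated token as one streaming pass over the weights (the ~936 GB/s VRAM and ~50 GB/s DRAM bandwidth figures are assumed round numbers, not benchmarks):

```python
# Bandwidth-bound tokens/s ceiling vs. fraction of the model kept in system RAM.
# The slow DRAM portion quickly dominates the per-token time.

model_gb = 10.0     # e.g. a ~17B model at Q4
bw_vram  = 936.0    # GB/s, RTX 3090 spec-sheet figure
bw_dram  = 50.0     # GB/s, assumed dual-channel DDR4 ballpark

def tps_upper_bound(frac_on_cpu: float) -> float:
    """Upper bound on tokens/s if each token streams the weights once per device."""
    t_gpu = model_gb * (1 - frac_on_cpu) / bw_vram
    t_cpu = model_gb * frac_on_cpu / bw_dram
    return 1.0 / (t_gpu + t_cpu)

for frac in (0.0, 0.05, 0.10, 0.25):
    print(f"{frac:>4.0%} offloaded -> ~{tps_upper_bound(frac):5.1f} tok/s ceiling")
# 0% ~93.6, 5% ~49.6, 10% ~33.8, 25% ~17.2  (illustrative only)
```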
LLM models have billions of parameters, and each inference pass requires loading most or all of them. It's like billions of neurons firing and communicating with each other while thinking. This creates a huge storm of data traffic, and memory bandwidth is the key performance metric holding back tps. Yes, at some point, when memory becomes fast enough, compute will need to catch up, but due to their historical design (calculating frames is more compute- than bandwidth-intensive), GPUs today are bandwidth-starved when running LLM inference.
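A roofline-style sanity check along the same lines, with assumed ballpark specs for a 3090 (the FP16 tensor throughput figure in particular is just an assumption for illustration):

```python
# For single-stream decoding, is the GPU limited by compute or by memory bandwidth?

params      = 17e9           # model parameters
bytes_per_w = 0.56           # ~4.5 bits/weight at Q4
flops_tok   = 2 * params     # roughly 2 FLOPs per parameter per generated token

bw_vram     = 936e9          # bytes/s, RTX 3090 memory bandwidth
fp16_flops  = 71e12          # FLOP/s, assumed dense FP16 tensor throughput

t_mem     = params * bytes_per_w / bw_vram   # time to stream the weights once
t_compute = flops_tok / fp16_flops           # time to do the matrix math

print(f"memory-bound time:  {t_mem*1e3:.2f} ms/token")    # ~10.2 ms
print(f"compute-bound time: {t_compute*1e3:.2f} ms/token") # ~0.5 ms
# Streaming the weights takes ~10 ms/token while the math takes well under 1 ms:
# batch-1 decoding is bandwidth bound and the compute units mostly sit idle.
```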
The PCIe bottleneck really isn't much of an issue for inference with a small number of GPUs.
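For a layer-split (pipeline) arrangement like llama.cpp's default, only the activations at the split point cross the bus for each token; a rough estimate with an assumed hidden size and link speed:

```python
# Activation traffic over PCIe per token with a 2-GPU layer split.
# Hidden size, dtype and link speed are assumptions for illustration.

hidden_size   = 8192            # assumed hidden dimension of a large model
bytes_per_act = 2               # fp16 activations
splits        = 1               # one boundary for a 2-GPU split

bytes_per_token = hidden_size * bytes_per_act * splits   # ~16 KiB
pcie_bytes_per_s = 16e9                                   # ~PCIe 4.0 x8 (assumed)

tok_limit = pcie_bytes_per_s / bytes_per_token
print(f"{bytes_per_token/1024:.0f} KiB/token over PCIe "
      f"-> the link could sustain ~{tok_limit:,.0f} tok/s of activation traffic")
# ~16 KiB/token; even a modest link handles far more tokens/s than the GPUs
# can actually generate, so the bus is nowhere near the bottleneck.
```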