
What local LLMs do you use?

Are you talking about just single user inference with larger-ish models solely in VRAM across GPUs in different nodes?
No. Single user inference with two workers doing partial offloading so I don't have to buy more RAM for R1.
If so, llama.cpp with RPC should do the job already. Not much to look into IMO.
Yes, I know. I'm looking into it.
 
In this case I can really recommend looking into ik's fork with the quants I shared; it can give a nice boost for cases where you offload to the CPU, and it also lets you mmap the weights.
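For anyone following along, the basic llama.cpp RPC setup for this looks roughly like the sketch below (the IPs, port, model path and -ngl value are placeholders, and the binary names assume a recent build compiled with -DGGML_RPC=ON):

    # on each worker node: expose its GPU(s) over RPC
    rpc-server -p 50052

    # on the main node: spread offloaded layers across local + remote GPUs;
    # whatever doesn't fit stays in system RAM (mmap'd by default)
    llama-cli -m DeepSeek-R1-IQ1_S.gguf \
        --rpc 192.168.1.10:50052,192.168.1.11:50052 \
        -ngl 30 --ctx-size 8192 -p "Hello"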
 
I've moved on from this model; I found Gemma 3 27B QAT much better.

Other than that I've used Devstral.

And afaik, the crashes still persist.


If you have another GPU lying around then use that.

VRAM >>> RAM

Oooh, nice. Will definitely try that one too, thanks! :)
 
Hello, please tell me which models are best suited for writing short stories of about 10,000 words. For example, I'd like to enter the prompt "write a story about how the hero defeated the dragon, took the princess from the tower, and his donkey married the dragon" and, at a minimum, get Shrek in response. I have 64 GB of RAM, and no more than 32 GB of it can be allocated to Vulkan.
I guess you can use the EQ-Bench benchmark for evaluation; I haven't tried LLMs for story writing or roleplaying personally.


Try this model via LM Studio - https://huggingface.co/mistralai/Mistral-Small-3.2-24B-Instruct-2506
 
You can't run R1 with just an extra card. For MoEs you should have enough RAM+VRAM to store the whole model, and then enough VRAM for the active params.
Certainly, I just meant in general. Although MoEs run ok on RAM as well.
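As a rough back-of-the-envelope example (assuming roughly 2.7 bits per weight for a Q2-ish quant; real file sizes vary with the quant mix):

    235B total params  x ~2.7 bits / 8 ≈ 80 GB  -> has to fit in RAM+VRAM, or be mmap'd off disk
     22B active params x ~2.7 bits / 8 ≈ 7.5 GB -> the per-token working set you'd ideally keep in VRAM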
 
Gemma 3 Abliterated :D

Though I admit that mostly I use the free offerings from DeepSeek and Microsoft, because most of what I ask for is either translations or PowerShell code, and even the free-tier offerings are better than anything my 4090 will run locally with any kind of performance.

Looks like I somehow missed the Qwen 3 release - I'll have to take a look.

I think the Q2 quant of the full model might just barely run on 24GB VRAM and 96GB RAM? I tried out Llama 4 Scout Q4_K_M and I was surprised by how well it performed, but that's MoE for you.

Is anyone aware of any benchmarks of how the 235B Q2 performs vs. 30B Q4? In knowledge, I mean, not speed.


edit: Running Qwen3-235B-A22B-Q2_K_L puts me at 90% memory usage, but it runs. And it's just below 5 tokens/s, so it's not unusably slow.
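For reference, one way to make a model like that fit on 24GB VRAM + 96GB RAM is to offload everything to the GPU except the MoE expert tensors, which get pinned to system RAM. The sketch below is only illustrative; the model path and the -ot pattern are placeholders, and --override-tensor / -ot needs a reasonably recent llama.cpp build:

    # keep attention and shared weights on the GPU, push the expert tensors to system RAM
    llama-cli -m Qwen3-235B-A22B-Q2_K_L-00001-of-0000N.gguf \
        -ngl 99 \
        -ot ".ffn_.*_exps.=CPU" \
        --ctx-size 8192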

What really impressed me, though, was how Qwen 3 30B combined that speed with the quality of its output on a rather tricky PowerShell code-generation request. One of my favourite questions to ask any LLM I try out:
As a professional PowerShell developer, please write a script to assist with migrating mailboxes from an old three-server DAG to a new three-server DAG in Exchange. The script should automatically create the required number of new mailbox databases, ensuring that no database exceeds 500GB in size. Additionally, it should evenly distribute mailboxes across the new databases for optimal load balancing. The script should handle all necessary steps, including the creation of the databases, mailbox moves, and ensuring an even distribution of mailbox data across the new DAG.

I initially wanted to try Qwen3-235B-A22B.i1-IQ2_M, but the files use some strange partitioning that causes an error if I try to recombine them with llama.cpp, so I ended up using the somewhat larger Qwen3-235B-A22B-Q2_K_L instead. Has anyone dealt with a 'does not contain split.count metadata' error before? I'm guessing a different tool was used for the splitting.
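In case it helps anyone hitting the same thing: shards produced by llama-gguf-split carry split.count metadata, so llama.cpp can load them by pointing at the first shard (or merge them), while plain byte-split uploads (e.g. *.part1of2 files) just need to be concatenated. Which of the two applies depends on how that particular upload was packaged, and the file names below are placeholders:

    # proper GGUF shards: load the first shard directly, or merge them first
    llama-gguf-split --merge Qwen3-235B-A22B.i1-IQ2_M-00001-of-0000N.gguf merged.gguf

    # plain byte-split parts: concatenate them in order
    cat model.gguf.part1of2 model.gguf.part2of2 > model.gguf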
 
Moonshot's Kimi is looking really nice. There are already quants going on for their 1T model:

And they have some other smaller dense models that appear to be distills as well:
 
1T?!?! I ain't NEVER running this shit dawg :cry: :cry: :cry:

By the way DeepSeek R1 doesn't work with RPC for me, but Qwen 235B does.
 
Gonna use gork in mah car.
 
1T?!?! I ain't NEVER running this shit dawg :cry: :cry: :cry:
If you have enough storage, you can run it by mmap'ing it off disk.
It should actually not be that bad given that it "only" has 32B parameters active, so you could actually get reasonable performance with 64GB of RAM, 16GB of VRAM for PP, and a fast enough NVMe.
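Nothing special is needed for that, by the way; llama.cpp mmaps the weights by default (as long as you don't pass --no-mmap), so the OS just pages the experts in from NVMe as they get hit. The model name and -ngl value below are placeholders:

    # weights stay on NVMe and are paged in on demand; only the KV cache and
    # whatever -ngl offloads has to actually fit in RAM/VRAM
    llama-cli -m Kimi-1T-IQ1_S.gguf -ngl 20 --ctx-size 4096 -p "Hello"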

Actually, I have a question for the nerds in here. I wanted to set up a local LLM to break down images of local weather maps (only once an hour or so), but my light googling (and lighter knowledge of LLMs) seems to point to me not having a GPU with the VRAM to pull something like this off. I'm confident in figuring out the backend programming/automation part of this, just not sure what LLM specifically I should be looking at to do this.
What do you mean by "break down images"? How exactly would an LLM be useful for that? IMO this seems like a better fit for a vision model than for a plain LLM, but of course I don't have a proper understanding of what you're trying to achieve.
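If it does turn out that a vision-capable model is the right tool, one low-effort way to prototype is to serve it behind an OpenAI-compatible endpoint (llama-server, LM Studio, etc., assuming the server and model support image input) and post the hourly map to it from a script. Everything below, from the port to the model name and prompt, is just an illustrative placeholder:

    # send the latest weather map to a local vision model and ask for a summary
    curl http://localhost:8080/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{
            "model": "local-vlm",
            "messages": [{
              "role": "user",
              "content": [
                {"type": "text", "text": "Summarise the precipitation and warnings shown on this weather map."},
                {"type": "image_url", "image_url": {"url": "data:image/png;base64,'"$(base64 -w0 weather_map.png)"'"}}
              ]
            }]
          }'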
 
New models with audio support on the block:

Basically they added ASR support on top of their existing 3B and 24B models, and managed really nice results with those. Having a built-in LLM is awesome as well, depending on your application.

I'm personally more interested in the STT part, not really the LLM itself, so I'll be giving the 3B model a run this weekend and comparing it to my current WhisperX setup. If someone comes up with nice quants for the 24B version, I may end up giving it a go as well.
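For anyone wanting to reproduce that kind of comparison, a plain WhisperX run looks roughly like this (the file name, model size and flags are placeholders; pick whatever your VRAM allows):

    # transcribe with word-level alignment; int8 keeps VRAM usage modest
    whisperx recording.wav --model large-v3 --language en --compute_type int8 --output_format srt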
 
For a moment I thought you were talking about a DAW / audio-processing helper.
 
By the way DeepSeek R1 doesn't work with RPC for me, but Qwen 235B does.
Tried it out some more: DS works with RPC, but only with CPUs plus newer NVIDIA hardware, and only with f16 context, in my experience. I'm getting around 2 t/s, or 2.45 t/s with -fmoe, on the ik_llama.cpp fork with DeepSeek-R1-UD-IQ1_S; all other options are: --ctx-size 16384 -mla 2 -fa -amb 512 -rtr
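Putting those flags together, the whole invocation looks roughly like the sketch below. The IP, port and model path are placeholders, and the binary names assume an ik_llama.cpp build compiled with RPC support:

    # worker node
    rpc-server -p 50052

    # main node: MLA mode 2, flash attention, runtime repacking and fused MoE, as above
    llama-server -m DeepSeek-R1-UD-IQ1_S-00001-of-0000N.gguf \
        --rpc 192.168.1.20:50052 \
        --ctx-size 16384 -mla 2 -fa -amb 512 -rtr -fmoe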
 