
What local LLMs do you use?

I actually have that model, but would like to go up a bit, maybe Q8? I also see Llama 70B, but I don't see any download links...
I have to find models that will fit in 64 GB of RAM.
 
Get your own data center cards and leave my gaming GPUs alone!
Never fear, friend, my card is primarily for gaming. The AI stuff is just for experimenting from time to time :)

Anyway, I decided to download Llama 3.3, but unfortunately I don't have the VRAM to run it. It maxed out my VRAM and any responses were INCREDIBLY slow, so I suspect I'll need to stick to smaller models.
 
It is all there:
DeepSeek-R1-Distill-Qwen-32B-Q8_0
Llama-3.3-70B-Instruct-GGUF

Alternatively you can download it with LM Studio like this:
[screenshot: downloading the model through LM Studio]

Super convenient :toast:
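If you'd rather script the download than click through LM Studio, here's a minimal sketch using huggingface_hub; the repo and file names are just examples of how the GGUF quants are usually laid out, so double-check them on the actual model page:

```python
# Hypothetical example: fetch a single GGUF quant from Hugging Face.
# Verify the repo id and exact filename on the model page before running.
from huggingface_hub import hf_hub_download

path = hf_hub_download(
    repo_id="bartowski/DeepSeek-R1-Distill-Qwen-32B-GGUF",  # example repo
    filename="DeepSeek-R1-Distill-Qwen-32B-Q8_0.gguf",      # example quant file
    local_dir="models",                                     # where to put it
)
print("Saved to:", path)
```

Q8_0 of the 32B model is roughly a 35 GB download, so make sure the disk can take it.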
 
Thanks, yeah, I finally found more downloads. Right now I have to use KoboldCpp, and it doesn't have the download feature. LM Studio was failing on me, so I switched.
After some time the model messes up, but in Kobold I just start a new session and it clears up.
Yep, now I have DeepSeek-R1-Distill-Qwen-32B-Q8_0 running just fine. Not bad for an ancient computer!
Oh, and Q8 is using around 35 GB of RAM.

It's a bit slow... not really using my GPU as much as I'd like:
[screenshot: GPU usage while the model runs]
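That ~35 GB figure lines up with some napkin math, assuming Q8_0 stores about 8.5 bits per weight (8-bit values plus a small per-block scale); the parameter count below is a rough figure:

```python
# Rough estimate of resident size for a ~32B model quantized to Q8_0.
params = 32.8e9              # approximate parameter count of the 32B model
bits_per_weight = 8.5        # Q8_0: 8-bit weights plus a per-block scale (assumption)
weights_gb = params * bits_per_weight / 8 / 1e9
print(f"Weights alone: ~{weights_gb:.1f} GB")  # ~35 GB, before KV cache and context
```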
 
The more layers you can put in VRAM, the faster it'll perform. Use the Q4 quants, or check how much of your VRAM is actually being used.
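For anyone driving this from Python instead of the Kobold UI, a rough sketch of the same layer-offload idea with llama-cpp-python (the model path and layer count are placeholders to tune for your VRAM):

```python
# Sketch: the more transformer layers that fit in VRAM, the faster generation gets.
from llama_cpp import Llama

llm = Llama(
    model_path="models/DeepSeek-R1-Distill-Qwen-32B-Q8_0.gguf",  # example path
    n_gpu_layers=20,   # raise until VRAM is full; -1 tries to offload every layer
    n_ctx=4096,
)
out = llm("Q: What is 2+2? A:", max_tokens=16)
print(out["choices"][0]["text"])
```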
 
Get your own data center cards and leave my gaming GPUs alone!

Why not both? These guys likely game too given the audience here. You are being mad at the wrong group.
 
The more layers you can put in VRAM, the faster it'll perform. Use the Q4 quants, or check how much of your VRAM is actually being used.
Or, the other way around:
It gets more and more painful the more layers you push into system RAM :roll:
Why not both? These guys likely game too given the audience here. You are being mad at the wrong group.
Yeah!
A PC is a general-purpose computer, it can do it all,
You can load and unload those programs on demand! :toast:

Wanted to try it out yesterday but did not have the... smirki/UIGEN-T1-Qwen-7b is doing a good job with math problems; with language, not that great.
And it is quite fast, at 74 tokens/s for me.
 
Well, I can't seem to run any Llama models, not sure why. Fortunately the deepseek-r1-distill models all work fine. I would like to figure out how to offload more to my GPU, though.
Looks like it only assigns about 3.5 GB of VRAM.
 
More info?
Not running? You mean not running at all, or just not on the GPU?
What daemon do you use to run the models?
 
The LLM I use locally is Qwen2.5-32B-Instruct-Q6_K.gguf. It has replaced all the smaller ones for me. Frontend: text-generation-webui (this was what worked first when I tried an LLM GUI on Linux, and I stuck with it). Speed: ~2.6 tokens/second with 23 layers offloaded to my 4070 (CPU threads set to 4, DDR5-4800 dual channel). For bigger LLMs I use Chatbot Arena's Direct Chat.
I benchmarked RAM vs VRAM offloading:
[chart: tokens/s vs. number of layers offloaded to the GPU]
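For reference, a loop like this (llama-cpp-python, not what I actually ran in text-generation-webui; paths and layer counts are only examples) is one way to reproduce that layers-vs-tokens/s curve:

```python
# Rough benchmark: tokens/s at a few different GPU layer counts.
import time
from llama_cpp import Llama

for layers in (0, 8, 16, 23):                      # example layer counts
    llm = Llama(model_path="models/Qwen2.5-32B-Instruct-Q6_K.gguf",  # example path
                n_gpu_layers=layers, n_ctx=2048, verbose=False)
    start = time.time()
    out = llm("Write one sentence about GPUs.", max_tokens=64)
    tokens = out["usage"]["completion_tokens"]
    print(f"{layers:>2} layers: {tokens / (time.time() - start):.2f} tokens/s")
    del llm                                        # drop the model before loading the next one
```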
 
I'm having to use the KoboldCpp daemon; LM Studio doesn't seem to want to use my GPU at all. Just tested Kobold-nocuda and am able to run Llama, but it's horribly slow. Didn't realize my old GPU helped that much. Guess I'm stuck with DeepSeek until I get a 3090 or something...
Oh, and I just found out I can load it with CLBlast. It heats up my GPU quite a bit, but it's almost as slow as no GPU at all. CuBLAS is by far the fastest, but I can't run Llama models with it.
Vulkan works, but on my old computer it's terribly slow. Damn, I need a new computer!
 
I have no idea what's happening, but it shouldn't be like that. Turn on flash attention, set the layers manually, or use --lowvram. You might be getting OOM errors when trying to run Llama.

Just to make sure it's an issue with llama.cpp and nothing else, try to offload one layer only. If that works, it's OOMing when layers are set to auto; if it doesn't, you should try recreating your venv and reinstalling packages.
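If llama-cpp-python happens to be installed, the one-layer test can also be run outside Kobold with something like this (model path is an example; note a hard CUDA OOM can still kill the process instead of raising, so treat it as a rough check):

```python
# Rough check: does the model load at all when only one layer goes to the GPU?
from llama_cpp import Llama

try:
    llm = Llama(model_path="models/Llama-3.3-70B-Instruct-Q4_K_M.gguf",  # example path
                n_gpu_layers=1, n_ctx=2048)
    print("Loads with 1 GPU layer -> auto offload was probably running out of VRAM.")
except Exception as err:
    print("Fails even with 1 layer -> suspect the CUDA build / environment:", err)
```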
 
Sorry mate, I don't know KoboldCpp at all.
But when you do get that 3090, make sure it's the 24GB version! :D
 
I believe these are known as nVidia L40 or L40s
The L40 has a higher bin of the AD102 compared to the 4090, but the 4090 has the faster GDDR6X which gives it more memory bandwidth.
The 4090 in raw perf should be faster than the L40 for LLMs, and that 48GB model at a third/quarter of the price of the L40 makes it really interesting.
 
Ok I checked the Chiphell links for 48GB edition. Indeed using GDDR6X, is it double-sided GDDR6X ? That blower looks nasty and not surprised by that 50dB measured noise/roar. For desktop use, should come with noise cancelling helmet.
Getting an existing GPU to support double or triple the amount of RAM without problems is not a trivial task. The BIOS needs to be correctly modified, memory power and cooling requirements met, etc.
While these GPU based solutions seem cool now, I think unified memory solutions like nVidia project DIGITS are the way to go here. Why limit yourself to GPU memory, when whole system memory can be fast ?
 
"For desktop use, should come with noise cancelling helmet." :roll:
Oh well, sadly these are not for home use, but for datacenters.
Those are not a place where you want to be - for longer periods, anyway.
 
Ok I checked the Chiphell links for 48GB edition. Indeed using GDDR6X, is it double-sided GDDR6X ?
Seems like it's a custom PCB based on the one from the 3090, so yeah, clamshell design with 24x 16Gb GDDR6X modules.
That blower looks nasty and not surprised by that 50dB measured noise/roar. For desktop use, should come with noise cancelling helmet.
Apparently this can be made way better by reducing the power limit. That blower format is also perfect for using multiple GPUs in a single setup.
While these GPU based solutions seem cool now, I think unified memory solutions like nVidia project DIGITS are the way to go here. Why limit yourself to GPU memory, when whole system memory can be fast ?
While I do agree that those devices are cool and will fill a really nice niche, the issue is that so far those unified memory systems are not as fast as a dGPU.
We've yet to see the specs on DIGITS, but something like Strix Halo only has 256GB/s, which is on the level of a 6600 XT. A 4090 does 4x that, and a 5090 does 1.8TB/s.
And that's only talking about memory bandwidth; the dGPUs also have way more raw compute power compared to those iGPUs.

The M1/M2 Ultra do have nice memory bandwidth (~800GB/s), but their actual iGPU is slower than the likes of a 3090/4090, and they're not cheap at all. A theoretical M4 Ultra should achieve 1TB/s, similar to a 4090, but we've yet to see how fast it'll be, and pricing should be on the higher end as well.
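Back-of-the-envelope: for a dense model, every generated token has to stream all the weights from memory once, so bandwidth alone caps tokens/s at roughly bandwidth divided by model size. A quick sketch with the numbers mentioned above (35 GB standing in for a Q8 32B model):

```python
# Upper-bound tokens/s = memory bandwidth / bytes of weights read per token (dense model).
model_gb = 35  # e.g. a 32B model at Q8_0
for name, bw_gbs in (("Strix Halo", 256), ("RTX 4090", 1008), ("RTX 5090", 1792), ("M2 Ultra", 800)):
    print(f"{name:>10}: <= ~{bw_gbs / model_gb:.0f} tokens/s")  # ceiling; real numbers come in lower
```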
 
I agree that Strix Halo's 256GB/s is puny, but then again that's the first-gen unified memory PC platform from AMD. It has potential to get much better over time.
DIGITS should get 900GB/s and a GPU to match that bandwidth. With 128GB of RAM, this will best many datacenter class GPUs that cost a lot more money.
I've seen various benchmarks of Apple hardware on LLMs and, considering the hardware, they do not fare well; I suspect software optimizations are partly to blame here.
Ultimately, merging system and GPU memory over a high-speed link offers superior bang for buck compared to just cranking up GDDRN on a GPU each gen.
I've roamed around various datacenters for decades, did so even last week. Low-oxygen ones are "fun". Most high-end compute in DCs is nowadays water-cooled (DLC), especially GPUs, but we are going off topic here.
 
DIGITS should get 900GB/s and a GPU to match that bandwidth. With 128GB of RAM, this will best many datacenter class GPUs that cost a lot more money.
That's one possibility, but I think something in the 450~512GB/s mark is more realistic.
Agreed with all your other points, though; those devices have lots of potential in the future and may cover most use cases.
 
Didn't AMD do that years ago with the memory controller on Vega that could address system RAM and even use NVMe storage as GPU "RAM"?
 
No, that was just a dumb PCIe switch/mux, no different from having a regular NVMe drive on your motherboard and using PCIe P2P to access data between devices.

That has nothing to do with unified memory.
 