
What local LLMs do you use?

IIRC, you cannot 'spread' LLM workloads across separate VRAM segments.
You're mistaken. Layers are not swapped between GPUs by default. Each card will do its own thing.
I know. I've given up on looking towards AM5 until a new Promontory chipset is released. The standardized PCIe fanout on AM5 is unironically, literally, and un-amusingly braindead.
It's so fucking dumb. A top slot, 3-4 slots of fucking nothing, then a 5.0 "AI" slot at the very bottom of the board.

Everybody who worked on X870 should get fired.
 
It's so fucking dumb. A top slot, 3-4 slots of fucking nothing, then a 5.0 "AI" slot at the very bottom of the board.
The Taichi is the major offender when it comes to that.
The ProArt models still have a reasonable layout.
 
5800X3D+7900XTX+9070XT,
Choose your PSU for the max wattage of the planned rig plus 20~25%.
So for that build, maybe consider something larger than 1000 W.
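Rough napkin math for that combo, using board-power numbers from memory (so double-check them): ~355 W for the 7900 XTX + ~304 W for the 9070 XT + ~140 W PPT for the 5800X3D + ~100 W for board, RAM, drives and fans lands around 900 W, and adding 20~25% on top puts you at roughly 1080-1125 W. A quality 1200 W unit wouldn't be overkill for that build.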

The ProArt X670E or X870E is great if you wanna use multiple GPUs.
But dividing LLMs between multiple GPUs is not that easy.
 
Well that's nice, Koboldcpp now supports Gemma 3. Running gemma-3-4b-it-bf16 and it's screaming fast. Bigger models still beat up on my system, but that's expected.
 
But dividing LLMs between multiple GPUs is not that easy.
What do you mean? Something like ollama does this by default as soon as you load any model that's bigger than the free VRAM of a single GPU. I even posted some benchmarking numbers some pages back with my 2x 3090s and some larger models.
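If you're on plain llama.cpp instead of ollama, the split is manual but the idea is the same. Roughly something like this (flags from memory, so check --help for the exact names):

llama-server -m model.gguf -ngl 99 --split-mode layer --tensor-split 1,1

-ngl 99 offloads all layers, and --tensor-split 1,1 spreads them evenly across two cards. Only --split-mode row (tensor parallelism) makes PCIe bandwidth matter much; plain layer splitting just passes activations between cards at layer boundaries.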
 
LACT is a better OC/monitoring tool for AMD IMO. Also supports Intel cards, oddly enough.
I have a 10GB B570 that I irresponsibly pressed buy-it-now on late last month :laugh: (I also still have my RX 7900 GRE Nitro+, which would fit better as a secondary but be worse than the XTX).
How are B570s for these LLM workloads? Or are they 'too new, lacking support' (a la SR-IOV, etc.)?
 
The Taichi is the major offender when it comes to that.
The ProArt models still have a reasonable layout.
Yeah, but it's also an ASUS board. These people are crazy for thinking I'm paying nearly $500 for a mobo.
I have a 10GB B570 that I irresponsibly pressed buy-it-now on late last month :laugh: (I also still have my RX 7900 GRE Nitro+, which would fit better as a secondary but be worse than the XTX).
How are B570s for these LLM workloads? Or are they 'too new, lacking support' (a la SR-IOV, etc.)?
They're trash for LLMs. Prompt processing takes forever with Vulkan. SD performance should be respectable though, probably faster than any RDNA 2 or older card.
 
SD performance should be respectable though, probably faster than any RDNA 2 or older card.
Relevant. I recently bought a Navi 24 6500 XT 8GB for my collection, potentially as a (secondary) display/scaling card. Guess the B570 might get used instead of sitting.

Yeah, but it's also an ASUS board. These people are crazy for thinking I'm paying nearly $500 for a mobo.
Looking at the 'retail-available' AI/ML-leaning cards these last few gens (and actual market pricing)...
They kinda expect that people who NEED the expansion will pay for it. IMHO, yet again we got generational feature regression, industry-wide.
 
Yeah, but it's also an ASUS board. These people are crazy for thinking I'm paying nearly $500 for a mobo.
I paid way more than that due to bad exchange rates in my country, but OTOH it was the cheapest AM5 motherboard that could do x8/x8 (bar the B650 model).
 
Relevant. I recently bought a Navi 24 6500 XT 8GB for my collection, potentially as a (secondary) display/scaling card. Guess the B570 might get used instead of sitting.
Yep. Make sure it's the only card in your system when rolling up SD or Comfy; I've had problems where --use-ipex just gets straight-up ignored during install, and installing the packages beforehand in the venv doesn't help.
I paid way more than that due to bad exchange rates in my country, but OTOH it was the cheapest AM5 motherboard that could do x8/x8 (bar the B650 model).
Ehh. Don't really care about that personally, but I'm sure there are some X670E boards that could do it.
 
I paid way more than that due to bad exchange rates in my country, but OTOH it was the cheapest AM5 motherboard that could do x8/x8 (bar the B650 model).
A lil off topic, but I am curious how bad it is vs. Australia/New Zealand?
I have genuine empathy for those poor enthusiasts and PC gamers. :(
-merely because I'm aware of the severity there.

Yep. Make sure it's the only card in your system when rolling up SD or Comfy; I've had problems where --use-ipex just gets straight-up ignored during install, and installing the packages beforehand in the venv doesn't help.
Crud. Was hoping to slot it in as a secondary.
Could that be worked around if I can figure out how to directly (and exclusively) pass the card through to a VM running within Linux? (In Windows, it'd be a secondary display adapter and scaler card.)
edit: Just needs to be the only card *when installed initially*. -got it.
Ehh. Don't really care about that personally, but I'm sure there are some X670E boards that could do it.
Pedantically browsing the oft-circulated AM5 board-list Google Doc, one can find more than a couple of decently expandable mobos.
More often than not though, they're out of stock or no longer retail-available.
 
Crud. Was hoping to slot it in as a secondary.
Could that be worked around if I can figure out how to directly (and exclusively) pass the card through to a VM running within Linux? (In Windows, it'd be a secondary display adapter and scaler card.)
It only needs to be the only card during install. After install, you're fine running it as a secondary with --use-ipex.
 
Ehh. Don't really care about that personally, but I'm sure there are some X670E boards that could do it.
There aren't many options for AM5 to begin with, sadly.
The other options that do exist are both more expensive in my region and worse in PCIe layout.
A lil off topic, but I am curious how bad it is vs. Australia/New Zealand?
I have genuine empathy for those poor enthusiasts and PC gamers. :(
-merely because I'm aware of the severity there.
I guess Australia has it worse, from what I've seen in the 5000 posts from some Aussies. The major problem is that the minimum wage here is way lower.
To put it into perspective, I paid ~$750 for my X670E ProArt, and a 5090 goes for ~$4k currently.
 
Does anyone know the best text/vision model combo to use on a lower end computer? Currently trying gemma-3-12b-it-q6_k_l with gemma-12b-mmproj, but it's stupidly slow.
 
Does anyone know the best text/vision model combo to use on a lower end computer? Currently trying gemma-3-12b-it-q6_k_l with gemma-12b-mmproj, but it's stupidly slow.
If it's slow, it's because you're using too big of a quant or you're not offloading properly.

Turn on flash attention, set the GPU layers manually, use --lowvram, drop to a lower quant, and repeat until your speed improves.

I seriously hope you're using the bf16 of the mmproj.
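For Koboldcpp that boils down to a launch line roughly like this (flags from memory, and the filenames are just placeholders for whatever quant and mmproj you actually downloaded, so check koboldcpp.py --help):

python koboldcpp.py --model gemma-3-12b-it-Q4_K_M.gguf --mmproj gemma-3-12b-mmproj-bf16.gguf --usecublas lowvram --gpulayers 24 --flashattention --contextsize 4096

Raise --gpulayers until it stops fitting in VRAM, then back off a step.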
 
Turns out it's really disappointing for its price: 256-bit at 8533 MT/s, so ~273 GB/s.

Bummer
Pay $3k to get 273 GB/s... I've lost my interest in the DGX Spark.

Does anyone know the best text/vision model combo to use on a lower end computer? Currently trying gemma-3-12b-it-q6_k_l with gemma-12b-mmproj, but it's stupidly slow.
I'd personally try it with ollama & gemma3:12B (Q4_K) and analyze an image file from the command line:
ollama run gemma3:12B --verbose
>>>describe this image C:\path\to\myimage.jpeg
(or /home/myimage.png etc)
If performance is satisfactory, slap a GUI on top of ollama: Open WebUI for container folks, or for example Msty for those who like to install stuff. That way you can analyze files directly from the UI; Open WebUI even has a small ChromaDB under the hood to speed up file analysis. Super useful IMHO.
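For the container route, Open WebUI against a local ollama is basically a one-liner, roughly this (grab the current command from their README, it changes now and then):

docker run -d -p 3000:8080 --add-host=host.docker.internal:host-gateway -v open-webui:/app/backend/data --name open-webui ghcr.io/open-webui/open-webui:main

Then it's sitting at http://localhost:3000 and should pick up the models from your local ollama.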

Edit: now I got curious about gemma3:12b image-analysis performance with my 7800 XT. Attaching the image it analyzed.

>>> what is in this image C:\temp\SC2-UltronJunk.png
Added image 'C:\temp\SC2-UltronJunk.png'
Here's a breakdown of what's in the image:

* **Character:** A man in a futuristic uniform is standing with his arms crossed. He appears to be a character
from a video game, likely a space-themed one.
* **Background:** The background shows a view through a window, revealing a planet or celestial body with a
reddish hue.
* **Text:** There's text overlayed on the image, which reads: "SUMMARY: In the future, please don't waste our
valuable time on this so-called 'Ultron'."
* **Interface Elements:** There are interface elements visible, including a "Starbase" label and other icons,
suggesting this is a screenshot from a video game.

The overall impression is that this is a screenshot from a space-themed video game, possibly a strategy or
simulation game.

total duration: 11.226313s
load duration: 3.681836s
prompt eval count: 667 token(s)
prompt eval duration: 2.6451341s
prompt eval rate: 252.16 tokens/s
eval count: 179 token(s)
eval duration: 4.8552628s
eval rate: 36.87 tokens/s
 

Attachment: SC2-UltronJunk.png (376.7 KB)
Pay $3k to get 273 GB/s... I've lost my interest in the DGX Spark.
It gets worse: Nvidia's version with 4 TB of storage goes for $4k. Asus does have a 1 TB model that goes for $3k.
 
Yep, using minicpm with the matching minicpm-mmproj-f16 model is many times faster for images. Quite the learning curve...
Now running the minicpm ggml-model-f16 with mmproj-model-f16; still really fast, but smarter. It can also do handwriting recognition.
Need to test that a bit more.
 
Llama 4 is out in 3 MoE flavors.
"Fitting in a single NVIDIA H100 GPU": that's only a $32,000-$40,000 GPU. Hope there will be a distill for us regular mortals too.
 
Llama 4 is out in 3 MoE flavors.
"Fitting in a single NVIDIA H100 GPU": that's only a $32,000-$40,000 GPU. Hope there will be a distill for us regular mortals too.
Just use the GGUF when it comes out, MoE models run well on limited VRAM. H100 is just 80GB, pretty easy to hit with a couple of 3090s. I'm hoping even more of the experts can be removed, a 17Bx4E might be really usable.
 
Llama 4 is out in 3 MoE flavors.
"Fitting in a single NVIDIA H100 GPU": that's only a $32,000-$40,000 GPU. Hope there will be a distill for us regular mortals too.
Even a distill would be too big: at 109B params it would still require over 50 GB of VRAM at Q4. If you move down to Q2 or something like that, it becomes doable with just a couple of 24 GB GPUs, but at that point one has to wonder if the quality will be any good.
Just use the GGUF when it comes out, MoE models run well on limited VRAM. H100 is just 80GB, pretty easy to hit with a couple of 3090s. I'm hoping even more of the experts can be removed, a 17Bx4E might be really usable.
Yeah, the problem is just getting the selected expert into VRAM, which takes some time if you can't fit the entire model in VRAM at once.
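Napkin math, assuming roughly 4.5 bits per weight for a Q4_K-style quant: 109B x 4.5 / 8 is about 61 GB just for the weights, before KV cache and context, so 'over 50GB' is if anything on the optimistic side.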
 
I am pretty sure there will be some sort of reduced-experts distill of Llama 4 for LLM enthusiasts with gaming GPUs.

I have one relativistic-speed spaceship exercise I give LLMs to solve. Most cloud-based >100B models get it right, including the 405B-parameter Llama 3.1. I expected Llama 4 Maverick to solve this easily, but it got only half of the exercise right (it thought time flows slower for the observer, not for the crew). Llama 4 Scout was totally off and gave very different answers each time. Even my local 27B Gemma 3 solved the problem more accurately :eek: Sometimes models get (partially) dumber; the same happened with Gemini 1.5 Pro and 2.0 Pro.
1.5 got a flawless answer, while 2.0 was off by 182% on the first calculation and by 598% on the second calculation of that exercise (Edit: Gemini 2.5 Pro Experimental got the answers right again).

H100 is just 80GB, pretty easy to hit with a couple of 3090s
Well, the H100 has 2.04 TB/s of bandwidth across that 80 GB of VRAM, and AMD's MI300X has an even more bonkers 5.3 TB/s over 192 GB. Consumer GPUs can be linked over PCIe 5.0 x16 at most, and that's 64 GB/s. Inferencing speed would be far from ideal as the LLM's parameters try to exchange data across 3 different GPUs, severely bottlenecked by PCIe link speed.
 
Yeah, the problem is just getting the selected expert into VRAM, which takes some time if you can't fit the entire model in VRAM at once.
A Q4 17B should fit pretty easily in even 16 GB of VRAM; it shouldn't be a problem. Processing/generation should outweigh load/unload.
Well, the H100 has 2.04 TB/s of bandwidth across that 80 GB of VRAM, and AMD's MI300X has an even more bonkers 5.3 TB/s over 192 GB. Consumer GPUs can be linked over PCIe 5.0 x16 at most, and that's 64 GB/s. Inferencing speed would be far from ideal as the LLM's parameters try to exchange data across 3 different GPUs, severely bottlenecked by PCIe link speed.
GPUs aren't bottlenecked by PCIe link speed as much as you think. That really only comes into play with tensor parallelism, and even then it's extremely usable.
 
GPUs aren't bottlenecked by PCIe link speed as much as you think. That really only comes into play with tensor parallelism, and even then it's extremely usable.
When it comes to inferencing current models, the bottleneck is VRAM bandwidth; GPU or CPU compute is almost irrelevant. As observed with the ollama ps command, tps drops dramatically when even a few % of the model is forced onto system RAM that is roughly 10x slower than the GPU's VRAM.
LLMs have billions of parameters, and each inference pass requires loading most or all of them. It's like billions of neurons firing up and communicating with each other when thinking. This creates a huge storm of data traffic, and memory bandwidth is the key performance metric holding back tps. Yeah, at some point when memory becomes fast enough, compute needs to catch up, but due to historical design priorities (calculating frames is more compute-intensive than bandwidth-intensive) GPUs today are bandwidth-starved when doing LLM inference runs.
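Quick sanity check against my earlier 7800 XT run, assuming the gemma3:12b Q4 weights are roughly 8 GB: the card's ~624 GB/s of VRAM bandwidth gives a theoretical ceiling of about 624 / 8 = ~78 tokens/s if every generated token has to read every weight once. I measured 36.87 t/s, which is about half of that ceiling once real-world overhead is accounted for; compute never gets close to being the limit.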
 