
What local LLMs do you use?

IIRC, you cannot 'spread' LLM workloads across separate VRAM segments.
You're mistaken. Layers are not swapped between GPUs by default. Each card will do its own thing.
I know. I've given up on looking towards AM5 until a new Promontory chipset is released. The standardized PCIe fanout on AM5 is unironically, literally, and un-amusingly braindead.
It's so fucking dumb. A top slot, 3-4 slots of fucking nothing, then a 5.0 "AI" slot at the very bottom of the board.

Everybody who worked on X870 should get fired.
 
It's so fucking dumb. A top slot, 3-4 slots of fucking nothing, then a 5.0 "AI" slot at the very bottom of the board.
The Taichi is the major offender when it comes to that.
The ProArt models still have a reasonable layout.
 
5800X3D+7900XTX+9070XT,
Choose your PSU for the max wattage of the planned rig plus 20~25%.
So for that build, maybe consider something larger than 1000 W.
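Rough napkin math for that combo, using board-power numbers from memory (so double-check them): ~355 W for the 7900 XTX + ~304 W for the 9070 XT + ~140 W PPT for the 5800X3D + ~100 W for board, RAM, drives and fans lands around 900 W, and adding 20~25% on top puts you at roughly 1080-1125 W. A quality 1200 W unit wouldn't be overkill for that build.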

The ProArt X670E or X870E is great if you wanna use multiple GPUs.
But dividing LLMs between multiple GPUs is not that easy.
 
Well that's nice, Koboldcpp now supports Gemma 3. Running gemma-3-4b-it-bf16 and it's screaming fast. Bigger models still beat up on my system, but that's expected.
 
But dividing LLMs between multiple GPUs is not that easy.
What do you mean? Something like ollama does this by default as soon as you load any model that's bigger than the free VRAM of a single GPU. I even posted some benchmarking numbers some pages back with my 2x 3090s and some larger models.
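If you're on plain llama.cpp instead of ollama, the split is manual but the idea is the same. Roughly something like this (flags from memory, so check --help for the exact names):

llama-server -m model.gguf -ngl 99 --split-mode layer --tensor-split 1,1

-ngl 99 offloads all layers, and --tensor-split 1,1 spreads them evenly across two cards. Only --split-mode row (tensor parallelism) makes PCIe bandwidth matter much; plain layer splitting just passes activations between cards at layer boundaries.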
 
LACT is a better OC/monitoring tool for AMD IMO. Also supports Intel cards, oddly enough.
I have a 10GB B570 that I irresponsibly pressed buy-it-now on late last month :laugh: (I also still have my RX 7900 GRE Nitro+, which would fit better as a secondary but be worse than the XTX).
How are B570s for these LLM workloads? Or are they 'too new, lacking support' (a la SR-IOV, etc.)?
 
The Taichi is the major offender when it comes to that.
The ProArt models still have a reasonable layout.
Yeah, but it's also an ASUS board. These people are crazy for thinking I'm paying nearly $500 for a mobo.
I have a 10GB B570 that I irresponsibly pressed buy-it-now on late last month :laugh: (I also still have my RX 7900 GRE Nitro+, which would fit better as a secondary but be worse than the XTX).
How are B570s for these LLM workloads? Or are they 'too new, lacking support' (a la SR-IOV, etc.)?
They're trash for LLMs. Prompt processing takes forever with Vulkan. SD performance should be respectable though, probably faster than any RDNA 2 or older card.
 
SD performance should be respectable though, probably faster than any RDNA 2 or older card.
Relevant. I recently bought a Navi 24 6500 XT 8GB for my collection, potentially as a (secondary) display/scaling card. Guess the B570 might get used instead of sitting.

Yeah, but it's also an ASUS board. These people are crazy for thinking I'm paying nearly $500 for a mobo.
Looking at the 'retail-available' AI/ML-leaning cards these last few gens (and actual market pricing)...
They kinda expect that people who NEED the expansion will pay for it. IMHO, yet again we got generational feature regression, industry-wide.
 
Yeah, but it's also an ASUS board. These people are crazy for thinking I'm paying nearly $500 for a mobo.
I paid way more than that due to bad exchange rates in my country, but OTOH it was the cheapest AM5 motherboard that could do x8/x8 (bar the B650 model).
 
Relevant. I recently bought a Navi 24 6500 XT 8GB for my collection, potentially as a (secondary) display/scaling card. Guess the B570 might get used instead of sitting.
Yep. Make sure it's the only card in your system when rolling up SD or Comfy; I've had problems where --use-ipex just gets straight-up ignored during install, and installing the packages beforehand in the venv doesn't help.
I paid way more than that due to bad exchange rates in my country, but OTOH it was the cheapest AM5 motherboard that could do x8/x8 (bar the B650 model).
Ehh. Don't really care about that personally, but I'm sure there are some X670E boards that could do it.
 
I paid way more than that due to bad exchange rates in my country, but OTOH it was the cheapest AM5 motherboard that could do x8/x8 (bar the B650 model).
A lil off topic, but I am curious how bad it is vs. Australia/New Zealand?
I have genuine empathy for those poor enthusiasts and PC gamers. :(
-merely because I'm aware of the severity there.

Yep. Make sure it's the only card in your system when rolling up SD or Comfy; I've had problems where --use-ipex just gets straight-up ignored during install, and installing the packages beforehand in the venv doesn't help.
Crud. Was hoping to slot it in as a secondary.
Could that be worked around if I can figure out how to directly (and exclusively) pass the card through to a VM running within Linux? (In Windows, it'd be a secondary display adapter and scaler card.)
edit: Just needs to be the only card *when installed initially*. -got it.
Ehh. Don't really care about that personally, but I'm sure there are some X670E boards that could do it.
Pedantically browsing the oft-circulated AM5 board-list Google Doc, one can find more than a couple of decently expandable mobos.
More often than not though, they're out of stock or no longer retail-available.
 
Crud. Was hoping to slot it in as a secondary.
Could that be worked around if I can figure out how to directly (and exclusively) pass the card through to a VM running within Linux? (In Windows, it'd be a secondary display adapter and scaler card.)
It only needs to be the only card during install. After install, you're fine running it as a secondary with --use-ipex.
 
Ehh. Don't really care about that personally, but I'm sure there are some X670E boards that could do it.
There aren't many options for AM5 to begin with, sadly.
The other options that do exist are both more expensive in my region and worse in PCIe layout.
A lil off topic, but I am curious how bad it is vs. Australia/New Zealand?
I have genuine empathy for those poor enthusiasts and PC gamers. :(
-merely because I'm aware of the severity there.
I guess Australia has it worse, from what I've seen in the 5000 posts from some Aussies. The major problem is that the minimum wage here is way lower.
To put it into perspective, I paid ~$750 for my X670E ProArt, and a 5090 goes for ~$4k currently.
 
Does anyone know the best text/vision model combo to use on a lower end computer? Currently trying gemma-3-12b-it-q6_k_l with gemma-12b-mmproj, but it's stupidly slow.
 
Does anyone know the best text/vision model combo to use on a lower end computer? Currently trying gemma-3-12b-it-q6_k_l with gemma-12b-mmproj, but it's stupidly slow.
If it's slow, it's because you're using too big of a quant or you're not offloading properly.

Turn on flash attention, set the GPU layers manually, use --lowvram, drop to a lower quant, and repeat until your speed improves.

I seriously hope you're using the bf16 of the mmproj.
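For Koboldcpp that boils down to a launch line roughly like this (flags from memory, and the filenames are just placeholders for whatever quant and mmproj you actually downloaded, so check koboldcpp.py --help):

python koboldcpp.py --model gemma-3-12b-it-Q4_K_M.gguf --mmproj gemma-3-12b-mmproj-bf16.gguf --usecublas lowvram --gpulayers 24 --flashattention --contextsize 4096

Raise --gpulayers until it stops fitting in VRAM, then back off a step.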
 
Turns out it's really disappointing for its price: 256-bit at 8533 MT/s, so ~273 GB/s.

Bummer
Pay $3k to get 273 GB/s... I've lost my interest in the DGX Spark.

Does anyone know the best text/vision model combo to use on a lower end computer? Currently trying gemma-3-12b-it-q6_k_l with gemma-12b-mmproj, but it's stupidly slow.
I'd personally try it with ollama & gemma3:12B (Q4_K) and analyze an image file from the command line:
ollama run gemma3:12B --verbose
>>>describe this image C:\path\to\myimage.jpeg
(or /home/myimage.png etc)
If performance is satisfactory, slap a GUI on top of ollama: Open WebUI for container folks, or for example Msty for those who like to install stuff. That way you can analyze files directly from the UI; Open WebUI even has a small ChromaDB under the hood to speed up file analysis. Super useful IMHO.
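For the container route, Open WebUI against a local ollama is basically a one-liner, roughly this (grab the current command from their README, it changes now and then):

docker run -d -p 3000:8080 --add-host=host.docker.internal:host-gateway -v open-webui:/app/backend/data --name open-webui ghcr.io/open-webui/open-webui:main

Then it's sitting at http://localhost:3000 and should pick up the models from your local ollama.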

Edit: now I got curious about gemma3:12b image-analysis performance with my 7800 XT. Attaching the image it analyzed.

>>> what is in this image C:\temp\SC2-UltronJunk.png
Added image 'C:\temp\SC2-UltronJunk.png'
Here's a breakdown of what's in the image:

* **Character:** A man in a futuristic uniform is standing with his arms crossed. He appears to be a character
from a video game, likely a space-themed one.
* **Background:** The background shows a view through a window, revealing a planet or celestial body with a
reddish hue.
* **Text:** There's text overlayed on the image, which reads: "SUMMARY: In the future, please don't waste our
valuable time on this so-called 'Ultron'."
* **Interface Elements:** There are interface elements visible, including a "Starbase" label and other icons,
suggesting this is a screenshot from a video game.

The overall impression is that this is a screenshot from a space-themed video game, possibly a strategy or
simulation game.

total duration: 11.226313s
load duration: 3.681836s
prompt eval count: 667 token(s)
prompt eval duration: 2.6451341s
prompt eval rate: 252.16 tokens/s
eval count: 179 token(s)
eval duration: 4.8552628s
eval rate: 36.87 tokens/s
 

Attachment: SC2-UltronJunk.png (376.7 KB)
Pay $3k to get 273 GB/s... I've lost my interest in the DGX Spark.
It gets worse: Nvidia's version with 4 TB of storage goes for $4k. Asus does have a 1 TB model that goes for $3k.
 
Yep, using minicpm with the matching minicpm-mmproj-f16 model is many times faster for images. Quite the learning curve...
Now running the minicpm ggml-model-f16 with mmproj-model-f16; still really fast, but smarter. It can also do handwriting recognition.
Need to test that a bit more.
 
Llama 4 is out in 3 MoE flavors.
"Fitting in a single NVIDIA H100 GPU": that's only a $32,000-$40,000 GPU. Hope there will be a distill for us regular mortals too.
 
Llama 4 is out in 3 MoE flavors.
"Fitting in a single NVIDIA H100 GPU": that's only a $32,000-$40,000 GPU. Hope there will be a distill for us regular mortals too.
Just use the GGUF when it comes out, MoE models run well on limited VRAM. H100 is just 80GB, pretty easy to hit with a couple of 3090s. I'm hoping even more of the experts can be removed, a 17Bx4E might be really usable.
 
Llama 4 is out in 3 MoE flavors.
"Fitting in a single NVIDIA H100 GPU": that's only a $32,000-$40,000 GPU. Hope there will be a distill for us regular mortals too.
Even a distill would be too big: at 109B params it would still require over 50 GB of VRAM at Q4. If you move down to Q2 or something like that, it becomes doable with just a couple of 24 GB GPUs, but at that point one has to wonder if the quality will be any good.
Just use the GGUF when it comes out, MoE models run well on limited VRAM. H100 is just 80GB, pretty easy to hit with a couple of 3090s. I'm hoping even more of the experts can be removed, a 17Bx4E might be really usable.
Yeah, the problem is just getting the selected expert into VRAM, which takes some time if you can't fit the entire model in VRAM at once.
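Napkin math, assuming roughly 4.5 bits per weight for a Q4_K-style quant: 109B x 4.5 / 8 is about 61 GB just for the weights, before KV cache and context, so 'over 50GB' is if anything on the optimistic side.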
 
I am pretty sure there will be some sort of reduced-experts distill of Llama 4 for LLM enthusiasts with gaming GPUs.

I have one relativistic-speed spaceship exercise I give LLMs to solve. Most cloud-based >100B models get it right, including the 405B-parameter Llama 3.1. I expected Llama 4 Maverick to solve this easily, but it got only half of the exercise right (it thought time flows slower for the observer, not for the crew). Llama 4 Scout was totally off and gave very different answers each time. Even my local 27B Gemma 3 solved the problem more accurately :eek: Sometimes models get (partially) dumber; the same happened with Gemini 1.5 Pro and 2.0 Pro.
1.5 got a flawless answer, while 2.0 was off by 182% on the first calculation and by 598% on the second calculation of that exercise (Edit: Gemini 2.5 Pro Experimental got the answers right again).

H100 is just 80GB, pretty easy to hit with a couple of 3090s
Well, the H100 has 2.04 TB/s of bandwidth across that 80 GB of VRAM, and AMD's MI300X has an even more bonkers 5.3 TB/s over 192 GB. Consumer GPUs can be linked over PCIe 5.0 x16 at most, and that's 64 GB/s. Inferencing speed would be far from ideal as the LLM's parameters try to exchange data across 3 different GPUs, severely bottlenecked by PCIe link speed.
 
Yeah, the problem is just getting the selected expert into VRAM, which takes some time if you can't fit the entire model in VRAM at once.
A Q4 17B should fit pretty easily in even 16 GB of VRAM; it shouldn't be a problem. Processing/generation should outweigh load/unload.
Well, the H100 has 2.04 TB/s of bandwidth across that 80 GB of VRAM, and AMD's MI300X has an even more bonkers 5.3 TB/s over 192 GB. Consumer GPUs can be linked over PCIe 5.0 x16 at most, and that's 64 GB/s. Inferencing speed would be far from ideal as the LLM's parameters try to exchange data across 3 different GPUs, severely bottlenecked by PCIe link speed.
GPUs aren't bottlenecked by PCIe link speed as much as you think. That really only comes into play with tensor parallelism, and even then it's extremely usable.
 
GPUs aren't bottlenecked by PCIe link speed as much as you think. That really only comes into play with tensor parallelism, and even then it's extremely usable.
When it comes to inferencing current models, the bottleneck is VRAM bandwidth; GPU or CPU compute is almost irrelevant. As observed with the ollama ps command, tps drops dramatically when even a few % of the model is forced onto system RAM that is roughly 10x slower than the GPU's VRAM.
LLMs have billions of parameters, and each inference pass requires loading most or all of them. It's like billions of neurons firing up and communicating with each other when thinking. This creates a huge storm of data traffic, and memory bandwidth is the key performance metric holding back tps. Yeah, at some point when memory becomes fast enough, compute needs to catch up, but due to historical design priorities (calculating frames is more compute-intensive than bandwidth-intensive) GPUs today are bandwidth-starved when doing LLM inference runs.
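Quick sanity check against my earlier 7800 XT run, assuming the gemma3:12b Q4 weights are roughly 8 GB: the card's ~624 GB/s of VRAM bandwidth gives a theoretical ceiling of about 624 / 8 = ~78 tokens/s if every generated token has to read every weight once. I measured 36.87 t/s, which is about half of that ceiling once real-world overhead is accounted for; compute never gets close to being the limit.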
 