
What local LLMs do you use?

I actually have that model, but would like to go up a bit, maybe Q8? I also see Llama 70B, but I don't see any download links...
I have to find models that will fit in 64 GB of RAM.
 
Get your own data center cards and leave my gaming GPUs alone!
Never fear, friend, my card is primarily for gaming. The AI stuff is just for experimenting from time to time :)

Anyway, I decided to download Llama 3.3, but unfortunately I don't have the VRAM to run it. It maxed out my VRAM and any responses were INCREDIBLY slow, so I suspect I'll need to stick to smaller models.
 
It is all there:
DeepSeek-R1-Distill-Qwen-32B-Q8_0
Llama-3.3-70B-Instruct-GGUF

Alternatively you can download it with LM Studio like this:
[screenshot: downloading the model through LM Studio]

Super convenient :toast:
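If you'd rather script the download than click through LM Studio, here's a minimal sketch using huggingface_hub; the repo and file names are just examples of how the GGUF quants are usually laid out, so double-check them on the actual model page:

```python
# Hypothetical example: fetch a single GGUF quant from Hugging Face.
# Verify the repo id and exact filename on the model page before running.
from huggingface_hub import hf_hub_download

path = hf_hub_download(
    repo_id="bartowski/DeepSeek-R1-Distill-Qwen-32B-GGUF",  # example repo
    filename="DeepSeek-R1-Distill-Qwen-32B-Q8_0.gguf",      # example quant file
    local_dir="models",                                     # where to put it
)
print("Saved to:", path)
```

Q8_0 of the 32B model is roughly a 35 GB download, so make sure the disk can take it.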
 
Thanks, yeah, I finally found more downloads. Right now I have to use KoboldCpp, and it doesn't have the download feature. LM Studio was failing on me, so I switched.
After some time the model messes up, but in Kobold I just start a new session and it clears up.
Yep, now I have DeepSeek-R1-Distill-Qwen-32B-Q8_0 running just fine. Not bad for an ancient computer!
Oh, and Q8 is using around 35 GB of RAM.

It's a bit slow... not really using my GPU as much as I'd like:
[screenshot: GPU usage while the model runs]
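That ~35 GB figure lines up with some napkin math, assuming Q8_0 stores about 8.5 bits per weight (8-bit values plus a small per-block scale); the parameter count below is a rough figure:

```python
# Rough estimate of resident size for a ~32B model quantized to Q8_0.
params = 32.8e9              # approximate parameter count of the 32B model
bits_per_weight = 8.5        # Q8_0: 8-bit weights plus a per-block scale (assumption)
weights_gb = params * bits_per_weight / 8 / 1e9
print(f"Weights alone: ~{weights_gb:.1f} GB")  # ~35 GB, before KV cache and context
```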
 
The more layers you can put in VRAM, the faster it'll perform. Use the Q4 quants, or check how much of your VRAM is actually being used.
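For anyone driving this from Python instead of the Kobold UI, a rough sketch of the same layer-offload idea with llama-cpp-python (the model path and layer count are placeholders to tune for your VRAM):

```python
# Sketch: the more transformer layers that fit in VRAM, the faster generation gets.
from llama_cpp import Llama

llm = Llama(
    model_path="models/DeepSeek-R1-Distill-Qwen-32B-Q8_0.gguf",  # example path
    n_gpu_layers=20,   # raise until VRAM is full; -1 tries to offload every layer
    n_ctx=4096,
)
out = llm("Q: What is 2+2? A:", max_tokens=16)
print(out["choices"][0]["text"])
```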
 
Get your own data center cards and leave my gaming GPUs alone!

Why not both? These guys likely game too given the audience here. You are being mad at the wrong group.
 
The more layers you can put in VRAM, the faster it'll perform. Use the Q4 quants, or check how much of your VRAM is actually being used.
Or, the other way around:
It gets more and more painful the more layers you push into system RAM :roll:
Why not both? These guys likely game too given the audience here. You are being mad at the wrong group.
Yeah!
A PC is a general-purpose computer, it can do it all,
You can load and unload those programs on demand! :toast:

Wanted to try it out yesterday but did not have the... smirki/UIGEN-T1-Qwen-7b is doing a good job with math problems; with language, not that great.
And it is quite fast, at 74 tokens/s for me.
 
Well, I can't seem to run any Llama models, not sure why. Fortunately the deepseek-r1-distill models all work fine. I would like to figure out how to offload more to my GPU, though.
Looks like it only assigns about 3.5 GB of VRAM.
 
More info?
Not running? You mean not running at all, or just not on the GPU?
What daemon do you use to run the models?
 
The LLM I use locally is Qwen2.5-32B-Instruct-Q6_K.gguf. It has replaced all the smaller ones for me. Frontend: text-generation-webui (this was what worked first when I tried an LLM GUI on Linux, and I stuck with it). Speed: ~2.6 tokens/second with 23 layers offloaded to my 4070 (CPU threads set to 4, DDR5-4800 dual channel). For bigger LLMs I use Chatbot Arena's Direct Chat.
I benchmarked RAM vs VRAM offloading:
[chart: tokens/s vs. number of layers offloaded to the GPU]
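For reference, a loop like this (llama-cpp-python, not what I actually ran in text-generation-webui; paths and layer counts are only examples) is one way to reproduce that layers-vs-tokens/s curve:

```python
# Rough benchmark: tokens/s at a few different GPU layer counts.
import time
from llama_cpp import Llama

for layers in (0, 8, 16, 23):                      # example layer counts
    llm = Llama(model_path="models/Qwen2.5-32B-Instruct-Q6_K.gguf",  # example path
                n_gpu_layers=layers, n_ctx=2048, verbose=False)
    start = time.time()
    out = llm("Write one sentence about GPUs.", max_tokens=64)
    tokens = out["usage"]["completion_tokens"]
    print(f"{layers:>2} layers: {tokens / (time.time() - start):.2f} tokens/s")
    del llm                                        # drop the model before loading the next one
```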
 
I'm having to use the KoboldCpp daemon; LM Studio doesn't seem to want to use my GPU at all. Just tested Kobold-nocuda and am able to run Llama, but it's horribly slow. Didn't realize my old GPU helped that much. Guess I'm stuck with DeepSeek until I get a 3090 or something...
Oh, and I just found out I can load it with CLBlast. It heats up my GPU quite a bit, but it's almost as slow as no GPU at all. CuBLAS is by far the fastest, but I can't run Llama models with it.
Vulkan works, but on my old computer it's terribly slow. Damn, I need a new computer!
 
I have no idea what's happening, but it shouldn't be like that. Turn on flash attention, set the layers manually, or use --lowvram. You might be getting OOM errors when trying to run Llama.

Just to make sure it's an issue with llama.cpp and nothing else, try to offload one layer only. If that works, it's OOMing when layers are set to auto; if it doesn't, you should try recreating your venv and reinstalling packages.
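If llama-cpp-python happens to be installed, the one-layer test can also be run outside Kobold with something like this (model path is an example; note a hard CUDA OOM can still kill the process instead of raising, so treat it as a rough check):

```python
# Rough check: does the model load at all when only one layer goes to the GPU?
from llama_cpp import Llama

try:
    llm = Llama(model_path="models/Llama-3.3-70B-Instruct-Q4_K_M.gguf",  # example path
                n_gpu_layers=1, n_ctx=2048)
    print("Loads with 1 GPU layer -> auto offload was probably running out of VRAM.")
except Exception as err:
    print("Fails even with 1 layer -> suspect the CUDA build / environment:", err)
```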
 
Sorry mate, I don't know KoboldCpp at all.
But when you do get that 3090, make sure it's the 24GB version! :D
 
I believe these are known as nVidia L40 or L40s
The L40 has a higher bin of the AD102 compared to the 4090, but the 4090 has the faster GDDR6X which gives it more memory bandwidth.
The 4090 in raw perf should be faster than the L40 for LLMs, and that 48GB model at a third/quarter of the price of the L40 makes it really interesting.
 
Ok I checked the Chiphell links for 48GB edition. Indeed using GDDR6X, is it double-sided GDDR6X ? That blower looks nasty and not surprised by that 50dB measured noise/roar. For desktop use, should come with noise cancelling helmet.
Getting an existing GPU to support double or triple the amount of RAM without problems is not a trivial task. The BIOS needs to be correctly modified, memory power and cooling requirements met, etc.
While these GPU based solutions seem cool now, I think unified memory solutions like nVidia project DIGITS are the way to go here. Why limit yourself to GPU memory, when whole system memory can be fast ?
 
"For desktop use, should come with noise cancelling helmet." :roll:
Oh well, sadly these are not for home use, but for datacenters.
Those are not a place where you want to be - for longer periods, anyway.
 
Ok I checked the Chiphell links for 48GB edition. Indeed using GDDR6X, is it double-sided GDDR6X ?
Seems like it's a custom PCB based on the one from the 3090, so yeah, clamshell design with 24x 16Gb GDDR6X modules.
That blower looks nasty and not surprised by that 50dB measured noise/roar. For desktop use, should come with noise cancelling helmet.
Apparently this can be made way better by reducing the power limit. That blower format is also perfect for using multiple GPUs in a single setup.
While these GPU based solutions seem cool now, I think unified memory solutions like nVidia project DIGITS are the way to go here. Why limit yourself to GPU memory, when whole system memory can be fast ?
While I do agree that those devices are cool and will fill a really nice niche, the issue is that so far those unified memory systems are not as fast as a dGPU.
We've yet to see the specs on DIGITS, but something like Strix Halo only has 256GB/s, which is on the level of a 6600 XT. A 4090 does 4x that, and a 5090 does 1.8TB/s.
And that's only talking about memory bandwidth; the dGPUs also have way more raw compute power compared to those iGPUs.

The M1/M2 Ultra do have nice memory bandwidth (~800GB/s), but their actual iGPU is slower than the likes of a 3090/4090, and they're not cheap at all. A theoretical M4 Ultra should achieve 1TB/s, similar to a 4090, but we've yet to see how fast it'll be, and pricing should be on the higher end as well.
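Back-of-the-envelope: for a dense model, every generated token has to stream all the weights from memory once, so bandwidth alone caps tokens/s at roughly bandwidth divided by model size. A quick sketch with the numbers mentioned above (35 GB standing in for a Q8 32B model):

```python
# Upper-bound tokens/s = memory bandwidth / bytes of weights read per token (dense model).
model_gb = 35  # e.g. a 32B model at Q8_0
for name, bw_gbs in (("Strix Halo", 256), ("RTX 4090", 1008), ("RTX 5090", 1792), ("M2 Ultra", 800)):
    print(f"{name:>10}: <= ~{bw_gbs / model_gb:.0f} tokens/s")  # ceiling; real numbers come in lower
```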
 
I agree that Strix Halo's 256GB/s is puny, but then again that's the first-gen unified memory PC platform from AMD. It has potential to get much better over time.
DIGITS should get 900GB/s and a GPU to match that bandwidth. With 128GB of RAM, this will best many datacenter class GPUs that cost a lot more money.
I've seen various benchmarks of Apple hardware on LLMs and, considering the hardware, they do not fare well; I suspect software optimizations are partly to blame here.
Ultimately, merging system and GPU memory over a high-speed link offers superior bang for buck compared to just cranking up GDDRN on a GPU each gen.
I've roamed around various datacenters for decades, did so even last week. Low-oxygen ones are "fun". Most high-end compute in DCs is nowadays water-cooled (DLC), especially GPUs, but we are going off topic here.
 
DIGITS should get 900GB/s and a GPU to match that bandwidth. With 128GB of RAM, this will best many datacenter class GPUs that cost a lot more money.
That's one possibility, but I think something in the 450~512GB/s mark is more realistic.
Agreed with all your other points, though; those devices have lots of potential in the future and may cover most use cases.
 
Didn't AMD do that years ago with the memory controller on Vega that could address system RAM and even use NVMe storage as GPU "RAM"?
 
No, that was just a dumb PCIe switch/mux, no different from having a regular NVMe drive on your motherboard and using PCIe P2P to access data between devices.

That has nothing to do with unified memory.
 