
What local LLMs do you use?

@AusWolf
Local LLMs are very important; I really reject the trend of everything getting "cloud"-based, micro$oft even wants your windows account to be online...
But since you have posted 3 times already, you could tell us what LLMs you use, and maybe some performance data too!
 
Maybe I'm a little bit behind on stuff but... Got to ask... What's the point of this for any regular home user?
It's like asking "what's the point of using your brain?". For the first time in recorded history, humanity has a thinking tool. It can supercharge almost any skill you have. Just as the human brain can be used in nearly limitless ways, the same applies to a local LLM. Use it to check your kids' homework, use it to do your own homework, help analyze scientific papers, write code for you, explain why vitamin K is good, count stars in the sky, analyze insurance offerings, etc. I'm not even going to pretend I know even a fraction of the use cases local LLMs will have in the next 10 years, but I know it's going to be wild, on the level of how the internet changed our lives (yeah, some of us grew up without the internet).
 
Maybe I'm a little bit behind on stuff but... Got to ask... What's the point of this for any regular home user?
That sounds interesting. Can you explain? :)
At least the larger, 70B+ models are typically sufficiently knowledgeable that you can ask them some complicated questions and expect a reasonable, not necessarily banal and expected, answer. There are things you do not want to send to commercial services, most typically personal information. Some of the latest advancements made even 70B-scale models competent with illusion-shattering problems previous generations of models had difficulty with, like how many r's are in strawberry, how many boys Mary has when one of them is gross, et cetera.

Now they are usually useful for common math and programming problems when used with care, can explore human philosophy and the human condition quite competently, and can tell stories of some interest with the right prompt. They are also useful for getting familiar with what LLM output looks like. Half of the internet looks LLM-generated these days.

Some open-weight models are capable of API use, such as calling tools provided by the framework they run on, including requesting web services. The usefulness of such capabilities is apparently unremarkable given other limitations, and for that matter, the state of the internet and search engine results these days. It requires support from the framework the model is running on, and is usually the only time the model - note, the model, not the framework - would access the Internet.
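
A rough illustration of what that framework-mediated API use looks like through the ollama python client; a sketch only, assuming a tool-capable model is pulled locally, and the fetch_url tool here is hypothetical:

```python
import ollama

# Hypothetical web-lookup tool. The model can only *request* it; the framework (or your
# own script) decides whether anything is actually executed. Declaring it is enough to
# see the mechanism.
tools = [{
    "type": "function",
    "function": {
        "name": "fetch_url",                      # hypothetical tool name
        "description": "Fetch a web page and return its text",
        "parameters": {
            "type": "object",
            "properties": {"url": {"type": "string"}},
            "required": ["url"],
        },
    },
}]

response = ollama.chat(
    model="llama3.1:70b",                         # example tag; assumes it is pulled
    messages=[{"role": "user", "content": "What is on the TechPowerUp front page today?"}],
    tools=tools,
)

# If the model decides a tool is needed, the reply carries structured tool_calls instead
# of (or alongside) plain text; the calling code would then do the actual web request.
print(response["message"])
```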

They can also provide some silly fun, especially the roleplay-finetuned ones, for uses where hallucination actually provides some emulation of creativity. Think of it as a text-based holodeck. Throw in an image generator and it is text and image. You typically don't want a lot of that going elsewhere either: as with all things requiring an account, everything you put into a networked service would be recorded by the provider and linked to you. Not everyone feels comfortable with the nothing-to-hide mentality even when they really have nothing to hide, and more than a few have objections to their interactions and personal info being used to train future commercial AI models.

Personally, I've sized my setup in early 2024 to be able to run a "future larger model", which would turn out to be mistral-large-2407, 123B, quantized. The best correctness and general task performance is probably currently achieved by the Llama 3 70B distilled version of DeepSeek R1. Anything larger would be costly and impractical for me at the moment. Might as well make them useful while they are there.
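
For a rough idea of the sizing math involved: the usual rule of thumb is parameters times bits-per-weight divided by eight, plus some headroom for the KV cache and runtime buffers. A quick sketch, where the bit-widths and the overhead factor are ballpark assumptions rather than exact figures:

```python
# Ballpark memory footprint of a quantized model: params * bits/weight / 8, plus headroom.
def approx_mem_gb(params_billion: float, bits_per_weight: float, overhead: float = 1.15) -> float:
    return params_billion * bits_per_weight / 8 * overhead

print(f"123B at ~4.5 bpw: {approx_mem_gb(123, 4.5):.0f} GB")  # roughly 80 GB
print(f" 70B at ~6.5 bpw: {approx_mem_gb(70, 6.5):.0f} GB")   # roughly 65 GB
```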
 
70B+ models are typically sufficiently knowledgeable that you can ask them some complicated questions and expect a reasonable, not necessarily banal and expected, answer.
Did you skip DeepSeek?
 
@AusWolf
Local LLMs are very important; I really reject the trend of everything getting "cloud"-based, micro$oft even wants your windows account to be online...
But since you have posted 3 times already, you could tell us what LLMs you use, and maybe some performance data too!
I'm not using anything. I didn't even know that you could run them locally until recently. I'm only trying to learn what use it is, to see whether it's something I'd want to do or not.

At least the larger, 70B+ models are typically sufficiently knowledgeable that you can ask them some complicated questions and expect a reasonable, not necessarily banal and expected, answer. There are things you do not want to send to commercial services, most typically personal information. Some of the latest advancements made even 70B-scale models competent with illusion-shattering problems previous generations of models had difficulty with, like how many r's are in strawberry, how many boys Mary has when one of them is gross, et cetera.

Now they are usually useful for common math and programming problems when used with care, can explore human philosophy and the human condition quite competently, and can tell stories of some interest with the right prompt. They are also useful for getting familiar with what LLM output looks like. Half of the internet looks LLM-generated these days.

Some open-weight models are capable of API use, such as calling tools provided by the framework they run on, including requesting web services. The usefulness of such capabilities is apparently unremarkable given other limitations, and for that matter, the state of the internet and search engine results these days. It requires support from the framework the model is running on, and is usually the only time the model - note, the model, not the framework - would access the Internet.

They can also provide some silly fun, especially the roleplay-finetuned ones, for uses where hallucination actually provides some emulation of creativity. Think of it as a text-based holodeck. Throw in an image generator and it is text and image. You typically don't want a lot of that going elsewhere either: as with all things requiring an account, everything you put into a networked service would be recorded by the provider and linked to you. Not everyone feels comfortable with the nothing-to-hide mentality even when they really have nothing to hide, and more than a few have objections to their interactions and personal info being used to train future commercial AI models.

Personally, I've sized my setup in early 2024 to be able to run a "future larger model", which would turn out to be mistral-large-2407, 123B, quantized. The best correctness and general task performance is probably currently achieved by the Llama 3 70B distilled version of DeepSeek R1. Anything larger would be costly and impractical for me at the moment. Might as well make them useful while they are there.
Text-based holodeck running locally on your PC... Now that caught my attention! :)

I'm just having a hard time imagining it. LLM still lives in my head as a glorified search engine. :ohwell:
 
Did you skip DeepSeek?
The way they do chain-of-thought makes for interesting reading. I think they are the first to do it well enough in an open-weight model, too. Wherever they might be from, I do not have quite sufficient trust to send any hosted service anything confidential or anything that could be used to profile me.

Even "free" services come with implicit permission to use your interactions for any number of further purposes, buried in the user agreement, assuming that is even followed, and God forbid there is a data breach.

FWIW, 70B Q6_K quantized models run at a bit more than ~0.9 tokens/s to almost 1.2 tokens/s on my setup with the official distribution of Ollama 0.5.7. The latest llama.cpp compiled from source gives ~1.2 tokens/s.
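
For anyone who wants to reproduce that kind of number, a rough sketch using the timing stats ollama itself reports (eval_count and eval_duration in the response); the model tag is just an example, substitute whatever you benchmark:

```python
import ollama

# Ask for a completion and derive tokens/s from ollama's own counters.
resp = ollama.generate(
    model="llama3.3:70b-instruct-q6_K",   # example tag
    prompt="What's the meaning of life?",
)
# eval_count = generated tokens, eval_duration = generation time in nanoseconds
print(f"{resp['eval_count'] / resp['eval_duration'] * 1e9:.2f} tok/s")
```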

Text-based holodeck running locally on your PC... Now that caught my attention! :)

I'm just having a hard time imagining it. LLM still lives in my head as a glorified search engine. :ohwell:
To be fair, they are still even worse than that when used for factual stuff without verification. And whatever it is that the various search engines are integrating, they certainly aren't doing it quite right yet.

They do have uses where even current models can play to their strengths though, and even the smaller models have a superhuman passing familiarity with almost everything anyone would - or could - ever have seen in text on a computer display. As long as you don't try something too unusual, they'd often do fine.
 
Maybe I'm a little bit behind on stuff but... Got to ask... What's the point of this for any regular home user?
This video helped me get started with running LLMs locally. It should also answer many of the questions you raised.
 
This video helped me get started with running LLMs locally. It should also answer many of the questions you raised.
OMG, that scare in the first few seconds of the video... :ohwell: :ohwell: :ohwell:
Instant downvote from me for this kind of "content".
There are no worries when you run it locally with a local program.
I would never install DeepSeek's app on my phone tho...
 
I'm still fairly new to the scene, but I have a phi-4 and a deepseek-r1 instance locally. Right now I just use them when I run into coding issues or need some inspiration. I was using a lot of Grok to fill that role, but the local stuff is neat.
 
I've tried a bunch of models, and keep switching back and forth between them.
These are the ones currently downloaded onto my MBP:
[screenshot of the installed models]


I usually use ollama as the backend, and either the python API for some software I run, or Open webUI for when I want a chat-like thingie.
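
For reference, the python-API route is roughly this (a minimal sketch, assuming the `ollama` package is installed via pip and the model tag is already pulled):

```python
import ollama

# Minimal chat round-trip against a locally running ollama server.
reply = ollama.chat(
    model="phi4:14b-q4_K_M",   # example tag from the table below; use any pulled model
    messages=[{"role": "user", "content": "What's the meaning of life?"}],
)
print(reply["message"]["content"])
```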

Some performance numbers from my setups on a "what's the meaning of life?" prompt, without using flash attention (fa) or any other software speedups:

| Model | 2x3090 (tok/s) | M3 Max (tok/s) |
| --- | --- | --- |
| phi4:14b-q4_K_M | 62 | 25.5 |
| phi4:14b-q8_0 | 47.5 | 16.9 |
| deepseek-r1:32b-qwen-distill-q4_K_M | 27.8 | 12.6 |
| deepseek-r1:7b-qwen-distill-q4_K_M | 116 | 50.4 |
| gemma:7b-instruct-v1.1-q4_0 | 113.7 | 46.3 |
| llama3.1:8b-instruct-q4_K_M | 110.1 | 46.08 |
| llama3.1:8b-instruct-fp16 | 49.8 | 17.5 |
| deepseek-r1:32b-qwen-distill-q8_0 | 21 | - |
| deepseek-r1:70b-llama-distill-q4_K_M | 16.6 | - |
| llama3.3:70b-instruct-q4_K_M | 16.7 | - |

For the models above that need both GPUs, each GPU averages around 50% utilization. I did not run those on my MBP since I was out of memory for them given the other crap I had open.
Both my 3090s are also set to a 275W power limit.
 
Which one did you guys find to be the best for code?

I'm also curious how something like the AMD Ryzen AI Max+ 395 would perform compared to 16GB GPUs.
 
I'm not using anything. I didn't even know that you could run them locally until recently. I'm only trying to learn what use it is, to see whether it's something I'd want to do or not.


Text-based holodeck running locally on your PC... Now that caught my attention! :)

I'm just having a hard time imagining it. LLM still lives in my head as a glorified search engine. :ohwell:

It's quite impressive how knowledgeable a local LLM can be without having internet access. I use a couple of models locally for benchmarking my GPU, answering the odd question or just for fun, and there have been a few times where the models have surprised me with their answers. Here's a quick example of llama3.2 translating your post into Spanish. There is no internet access involved here; it's all running on my 7900XTX. And the model is only ~9GB in size.

[screenshot of the llama3.2 Spanish translation]


The obvious downside of local LLMs is that they only have knowledge up to a point, i.e. their training cutoff date. llama3.2, for example, thinks the current president of the United States is Joe Biden and that the 2024 election hasn't happened yet. I imagine if I updated to llama3.3 (which is 43GB, up from 9GB in 3.2!) it would have more up-to-date information. But I'd highly recommend you give it a try; even if it doesn't become a daily-use tool on your machine, it's a good benchmark and a neat gimmick.

To answer the original question, I have the following models installed via ollama:
- codegemma
- deepseek-coder-v2
- gemma2
- llama3.2
- nemotron-mini
- phi4
- qwen2

Note that I mostly use gemma2 as it seems to be the most "accurate". I did experiment with the coding-focused ones to assist me with work from time to time; they are mostly not helpful and prone to hallucination or just outright wrong information :(

EDIT: I would also add that ollama is an excellent starting point to get set up, especially for AMD users with the ROCm version. It is very easy to get going across all major OSes.
 
It's quite impressive how knowledgeable a local LLM can be without having internet access. I use a couple of models locally for benchmarking my GPU, answering the odd question or just for fun, and there have been a few times where the models have surprised me with their answers. Here's a quick example of llama3.2 translating your post into Spanish. There is no internet access involved here; it's all running on my 7900XTX. And the model is only ~9GB in size.

[screenshot of the llama3.2 Spanish translation]

The obvious downside of local LLMs is that they only have knowledge up to a point, i.e. their training cutoff date. llama3.2, for example, thinks the current president of the United States is Joe Biden and that the 2024 election hasn't happened yet. I imagine if I updated to llama3.3 (which is 43GB, up from 9GB in 3.2!) it would have more up-to-date information. But I'd highly recommend you give it a try; even if it doesn't become a daily-use tool on your machine, it's a good benchmark and a neat gimmick.

To answer the original question, I have the following models installed via ollama:
- codegemma
- deepseek-coder-v2
- gemma2
- llama3.2
- nemotron-mini
- phi4
- qwen2

Note that I mostly use gemma2 as it seems to be the most "accurate". I did experiment with the coding-focused ones to assist me with work from time to time; they are mostly not helpful and prone to hallucination or just outright wrong information :(

EDIT: I would also add that ollama is an excellent starting point to get set up, especially for AMD users with the ROCm version. It is very easy to get going across all major OSes.
My entry point was a colleague introducing me to LM Studio; I had not even heard about ollama until last December.
And I was not able to get it working with my GPU, so I stuck with LM Studio, which works really nicely, plus I like the great UI, even though I use the console on a daily basis.
On llama 3.2: why do you keep the obsolete v3.2?
 
Which one did you guys find to be the best for code?

I'm also curious how something like the AMD Ryzen AI Max+ 395 would perform compared to 16GB GPUs.
I'm guessing the new DeepSeek model is gonna be the best one. I'm also waiting for Strix Halo, it should perform at minimum as well as a 16GB RX 580 seeing as they'll have similar bandwidths. The rumors I hear for Medusa are insane though, 30% performance increase over Strix is crazy...
 
My entry point was a colleague introducing me to LM Studio; I had not even heard about ollama until last December.
And I was not able to get it working with my GPU, so I stuck with LM Studio, which works really nicely, plus I like the great UI, even though I use the console on a daily basis.
On llama 3.2: why do you keep the obsolete v3.2?

Because llama3.3 is 45GB and I am on Starlink :) I have to pick my day to download large files or I'll be waiting a long time!
 
I am on Starlink :) I have to pick my day to download large files or I'll be waiting a long time!
I see. Well, you could get the file from a friend on a flash drive maybe, or...
With LM Studio you can also pause/resume downloads! So maybe you could have a look!
:toast:
I did not even consider this kind of limitation, sorry!
Where do you live that you need Starlink?
 
I see. Well, you could get the file from a friend on a flash drive maybe, or...
With LM Studio you can also pause/resume downloads! So maybe you could have a look!
:toast:
I did not even consider this kind of limitation, sorry!
Where do you live that you need Starlink?

I'll get around to downloading it eventually. The thing is I don't use LLMs in my day-to-day, they're essentially a novelty I fire up from time to time. So getting the latest model isn't high on the priority list.

I live in rural Australia, far from any civilization! :)
 
I'm also curious how something like the AMD Ryzen AI Max+ 395 would perform compared to 16GB GPUs.
The RAM speed of 256-bit LPDDR5X-8000 is not that great. It's only 256GB/s, compared to the 624.1GB/s of the 256-bit GDDR6 on a 7800 XT. So all models that fit into 16GB will be much faster on a 7800 XT or similar 16GB GPUs. The only advantage comes when you go with 32GB or more RAM on a Ryzen AI setup; then the bigger models that don't fit into a 16GB GPU will run faster on the Ryzen AI laptop.
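
As a sanity check on those numbers, peak bandwidth is just bus width times effective transfer rate. A quick back-of-the-envelope sketch (the 19.5Gbps figure is the 7800 XT's quoted GDDR6 speed):

```python
# Peak memory bandwidth in GB/s: (bus width in bits / 8 bits per byte) * effective GT/s.
def peak_bw_gb_s(bus_width_bits: int, gigatransfers_per_s: float) -> float:
    return bus_width_bits / 8 * gigatransfers_per_s

print(peak_bw_gb_s(256, 8.0))    # 256-bit LPDDR5X-8000       -> 256.0 GB/s
print(peak_bw_gb_s(256, 19.5))   # 256-bit GDDR6 @ 19.5 Gbps  -> 624.0 GB/s
```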
 
I'll get around to downloading it eventually. The thing is I don't use LLMs in my day-to-day, they're essentially a novelty I fire up from time to time. So getting the latest model isn't high on the priority list.

I live in rural Australia, far from any civilization! :)
Yeah, it has limited use, but it's still good to have!
So you live in the outback, now I get it!
Well, the great thing about local LLMs is that you don't need to be connected, yet they can answer many of the questions you have! :)
 
Anyone got any local LLMs that can generate unskinned 3D models?
 
Finally got it working, but with Koboldcpp-cuda. My vid card heats up real nice when it's thinking! There doesn't seem to be a download feature in it though, so it's difficult to install GGUFs from Hugging Face. The other day I somehow managed to download deepseek-r1-distill-qwen-32b-q5 and it runs just fine. Don't remember how I did that... Glad I have 64GB of RAM; that one alone uses 30GB.
 
Well, now I've got DeepSeek R1 Distill Qwen 32b-q6 running, but only under Windows, argh. I signed up to Hugging Face, but still can't find any download links for GGUF files.
Kobold needs the actual GGUF file to load; there is no way to add one through the interface. How do I download bigger GGUFs to use?
 
Well, now I've got DeepSeek R1 Distill Qwen 32b-q6 running, but only under Windows, argh. I signed up to Hugging Face, but still can't find any download links for GGUF files.
Kobold needs the actual GGUF file to load; there is no way to add one through the interface. How do I download bigger GGUFs to use?

For Q6_K: https://huggingface.co/bartowski/De...b/main/DeepSeek-R1-Distill-Qwen-32B-Q6_K.gguf
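
If hunting for the right link gets tedious, the file can also be fetched programmatically. A minimal sketch with huggingface_hub; the repo id and file name below are placeholders for whichever GGUF repo and quant you actually want (e.g. the bartowski repo linked above):

```python
from huggingface_hub import hf_hub_download

# Download one GGUF file into a local folder that koboldcpp can then load from.
path = hf_hub_download(
    repo_id="someuser/SomeModel-GGUF",   # placeholder repo id
    filename="SomeModel-Q6_K.gguf",      # placeholder file name
    local_dir="models",
)
print(path)  # point koboldcpp at this file path
```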
 
Get your own data center cards and leave my gaming GPUs alone!
 