
What local LLMs do you use?

You should look into other forks that are way faster, such as WhisperX or faster-whisper:
https://github.com/m-bain/whisperX (uses faster-whisper underneath)

I run the first one as a public service off my GPU.
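In case it helps, here's a minimal sketch of what transcription looks like with faster-whisper's Python API (the model name, device, and audio path below are placeholders, not anything from this thread):

Code:
# Minimal faster-whisper sketch; model, device and audio file are illustrative.
from faster_whisper import WhisperModel

model = WhisperModel("large-v3", device="cuda", compute_type="float16")
segments, info = model.transcribe("audio.mp3", beam_size=5)

print(f"Detected language: {info.language}")
for seg in segments:
    print(f"[{seg.start:.2f}s -> {seg.end:.2f}s] {seg.text}")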
Very nice! Will take a look. One thing that I really like about llama.cpp and whisper.cpp is that there's no Python: they're much easier to get working and keep working. I've tried other Python-based LLM engines in the past, and installing one would often end up breaking something else. Both llama.cpp and whisper.cpp also ship with nice web servers.
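On the web servers: llama.cpp's llama-server exposes an OpenAI-compatible HTTP API, so any client can talk to it. A quick sketch, assuming the server is running locally on its default port 8080:

Code:
# Query a local llama-server via its OpenAI-compatible chat endpoint.
# Host and port are assumptions; adjust to your setup.
import json, urllib.request

req = urllib.request.Request(
    "http://localhost:8080/v1/chat/completions",
    data=json.dumps({
        "messages": [{"role": "user", "content": "Hello!"}],
        "max_tokens": 64,
    }).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read())["choices"][0]["message"]["content"])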
 
I'm using the Q4_K_M quant of Qwen3 30B A3B with the following settings (tried the same as yours):
[Attachment: settings.png]


This gives me around 19.38 tok/sec, but the model crashes after returning the output, with the following error. I'm unsure whether it's related to the quant, CPU, context size, etc.

Code:
Failed to regenerate message
The model has crashed without additional information. (Exit code: 18446744072635812000)

[Attachment: tokens.png]


Looks like it's a known issue - https://github.com/lmstudio-ai/lmstudio-bug-tracker/issues/297
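As an aside, that giant exit code can be decoded: it looks like a negative 32-bit Windows exit code printed as an unsigned 64-bit integer, with the trailing digits rounded off by the UI's JavaScript number formatting. A quick check under that assumption:

Code:
# Reinterpret LM Studio's huge "exit code" as a signed 32-bit value.
# Assumption: a negative 32-bit Windows exit code was printed as
# unsigned 64-bit and rounded, so the low digits are not exact.
raw = 18446744072635812000
signed = raw - 2**64                # -> -1073739616 (approximate)
print(hex(signed & 0xFFFFFFFF))     # -> 0xc00008a0
# Anything in the 0xC0000xxx range is a Windows NTSTATUS failure,
# i.e. the process died with a native error rather than exiting cleanly.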
Using Qwen3 30B Q6_K with 41/48 GPU layers.
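Assuming the 41/48 there means 41 of the model's 48 layers offloaded to the GPU, the equivalent partial offload in llama-cpp-python would look roughly like this (the file name is hypothetical):

Code:
# Sketch of partial GPU offload with llama-cpp-python.
# Assumes "41/48" means 41 of 48 layers on the GPU; path is hypothetical.
from llama_cpp import Llama

llm = Llama(
    model_path="Qwen3-30B-A3B-Q6_K.gguf",  # hypothetical local GGUF path
    n_gpu_layers=41,  # offload 41 layers; the remaining 7 run on the CPU
    n_ctx=8192,       # context window; lower it if VRAM/RAM is tight
)
out = llm("Hello!", max_tokens=32)
print(out["choices"][0]["text"])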
 