If you manage to get it properly working with something like ktransformers, then maybe. You'd still need quite a bit of RAM, but it should be feasible on consumer platforms nonetheless.

A Q4 17B model should fit pretty easily in even 16GB of VRAM, so that shouldn't be a problem. Processing/generation time should outweigh load/unload time.
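Quick back-of-the-envelope on the VRAM side (the ~4.5 bits/weight figure for a Q4_K-style quant and the overhead allowance are assumed ballpark numbers, not measurements):

```python
# Rough VRAM estimate for a Q4-quantized 17B model.
# Back-of-the-envelope only; bits/weight and overhead are assumed.

params = 17e9                 # 17B parameters
bits_per_weight = 4.5         # assumed average for a Q4_K-style quant (4-bit weights + scales)
overhead_gb = 2.0             # assumed allowance for KV cache, runtime context, buffers

weights_gb = params * bits_per_weight / 8 / 1e9
total_gb = weights_gb + overhead_gb

print(f"weights: ~{weights_gb:.1f} GB, with overhead: ~{total_gb:.1f} GB")
# weights: ~9.6 GB, with overhead: ~11.6 GB -> well inside 16 GB of VRAM
```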
But now you're talking about DRAM offloading and running some of the layers on the CPU, which is totally different from the multi-GPU setup that was the previous discussion point.

When it comes to inferencing current models, the bottleneck is VRAM bandwidth; GPU or CPU compute is almost irrelevant. As can be observed with the ollama ps command, tps drops dramatically when even a few % of the model is forced to run from system RAM, which is roughly 10x slower than VRAM.
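A rough sketch of why even a small offloaded fraction hurts so much, treating each generated token as one streaming pass over the weights (the ~936 GB/s VRAM and ~50 GB/s DRAM bandwidth figures are assumed round numbers, not benchmarks):

```python
# Bandwidth-bound tokens/s ceiling vs. fraction of the model kept in system RAM.
# The slow DRAM portion quickly dominates the per-token time.

model_gb = 10.0     # e.g. a ~17B model at Q4
bw_vram  = 936.0    # GB/s, RTX 3090 spec-sheet figure
bw_dram  = 50.0     # GB/s, assumed dual-channel DDR4 ballpark

def tps_upper_bound(frac_on_cpu: float) -> float:
    """Upper bound on tokens/s if each token streams the weights once per device."""
    t_gpu = model_gb * (1 - frac_on_cpu) / bw_vram
    t_cpu = model_gb * frac_on_cpu / bw_dram
    return 1.0 / (t_gpu + t_cpu)

for frac in (0.0, 0.05, 0.10, 0.25):
    print(f"{frac:>4.0%} offloaded -> ~{tps_upper_bound(frac):5.1f} tok/s ceiling")
# 0% ~93.6, 5% ~49.6, 10% ~33.8, 25% ~17.2  (illustrative only)
```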
LLM models have billions of parameters, and each inference pass requires loading most or all of them. It's like billions of neurons firing and communicating with each other while thinking. This creates a huge storm of data traffic, and memory bandwidth is the key performance metric holding back tps. Yes, at some point, when memory becomes fast enough, compute will need to catch up, but due to their historical design (calculating frames is more compute- than bandwidth-intensive), GPUs today are bandwidth-starved when running LLM inference.
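A roofline-style sanity check along the same lines, with assumed ballpark specs for a 3090 (the FP16 tensor throughput figure in particular is just an assumption for illustration):

```python
# For single-stream decoding, is the GPU limited by compute or by memory bandwidth?

params      = 17e9           # model parameters
bytes_per_w = 0.56           # ~4.5 bits/weight at Q4
flops_tok   = 2 * params     # roughly 2 FLOPs per parameter per generated token

bw_vram     = 936e9          # bytes/s, RTX 3090 memory bandwidth
fp16_flops  = 71e12          # FLOP/s, assumed dense FP16 tensor throughput

t_mem     = params * bytes_per_w / bw_vram   # time to stream the weights once
t_compute = flops_tok / fp16_flops           # time to do the matrix math

print(f"memory-bound time:  {t_mem*1e3:.2f} ms/token")    # ~10.2 ms
print(f"compute-bound time: {t_compute*1e3:.2f} ms/token") # ~0.5 ms
# Streaming the weights takes ~10 ms/token while the math takes well under 1 ms:
# batch-1 decoding is bandwidth bound and the compute units mostly sit idle.
```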
The PCIe bottleneck really isn't much of an issue for inference with a small number of GPUs.
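For a layer-split (pipeline) arrangement like llama.cpp's default, only the activations at the split point cross the bus for each token; a rough estimate with an assumed hidden size and link speed:

```python
# Activation traffic over PCIe per token with a 2-GPU layer split.
# Hidden size, dtype and link speed are assumptions for illustration.

hidden_size   = 8192            # assumed hidden dimension of a large model
bytes_per_act = 2               # fp16 activations
splits        = 1               # one boundary for a 2-GPU split

bytes_per_token = hidden_size * bytes_per_act * splits   # ~16 KiB
pcie_bytes_per_s = 16e9                                   # ~PCIe 4.0 x8 (assumed)

tok_limit = pcie_bytes_per_s / bytes_per_token
print(f"{bytes_per_token/1024:.0f} KiB/token over PCIe "
      f"-> the link could sustain ~{tok_limit:,.0f} tok/s of activation traffic")
# ~16 KiB/token; even a modest link handles far more tokens/s than the GPUs
# can actually generate, so the bus is nowhere near the bottleneck.
```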