
What local LLMs do you use?

No, that was just a dumb PCIe switch/mux, no different from having a regular NVMe drive in your motherboard and using PCIe P2P to access data between devices.

That has nothing to do with unified memory.
I was specifically talking about the HBCC option that Vega had. Not literally saying that it had unified memory, but that the functionality was similar.
 
I was specifically talking about the HBCC option that Vega had. Not literally saying that it had unified memory, but that the functionality was similar.
Ah, that's more of a memory paging implementation for GPUs, not really that helpful, and it still has no relation to unified memory.
It basically allowed your GPU to report more memory than it physically had, shuffling unused pages out of actual GPU memory to make better use of it.

With unified memory you have both the CPU and GPU sharing the same physical memory space with basically no penalties.
 
Here is the de-censored DeepSeek R1: Link
Going to check soon.
 
Nice!
I will have that too!

The one I posted is the 70B version and actually de-censored.
I asked it some questions, and it gave me fair answers:

1) What is Taiwan

Political Status:

  • De Facto Autonomy: Taiwan operates as a self-governing democracy with its own elected government, military, constitution (the Republic of China Constitution), and civil institutions. Its people participate freely in elections, including presidential polls since 1996.
  • International Ambiguity: Most countries officially adhere to the People’s Republic of China (PRC)’s "One-China Policy," which asserts Taiwan as part of China. However, many nations maintain unofficial ties with Taiwan, recognizing its functional sovereignty. The U.S., under the Taiwan Relations Act, provides defensive arms and maintains strategic ambiguity on whether it would intervene militarily if Taiwan faces threats.

Historical Context:

  • Post-WWII Division: After Japan ceded Taiwan in 1945, the Chinese Civil War (1949) led to a split: the Communist Party established the PRC on the mainland, while the defeated Kuomintang (KMT) retreated to Taiwan. Both initially claimed to represent all of China.
  • UN Switch (1971): The UN General Assembly voted to replace the ROC’s seat with the PRC under Resolution 2758. Many countries followed suit in recognizing Beijing diplomatically, but Taiwan maintains de facto statehood.

Identity and Culture:

  • Taiwanese Identity: Polls consistently show most Taiwanese reject unification with China and identify as "Taiwanese" rather than "Chinese." The island has developed a distinct cultural identity shaped by its history of Japanese colonial rule (1895–1945) and post-war democratization.
  • Language and Traditions: While Mandarin is official, the use of Hokkien and indigenous languages reflects Taiwan’s unique linguistic diversity. Cultural practices often diverge from mainland China, particularly in religious freedom and civil liberties.

Geopolitical Tensions:

  • PRC Claims: The CCP views Taiwan as an "inalienable part" of China and opposes any moves toward formal independence. It has not ruled out using military force to achieve unification.
  • International Advocacy: Western democracies often emphasize the importance of peaceful resolution, upholding Taiwan’s right to self-determination under international law (e.g., UN Charter principles). The U.S. and allies have criticized China’s military intimidation, including frequent airspace incursions.

Conclusion:

While the PRC asserts sovereignty over Taiwan, its de facto independence and democratic governance are widely acknowledged internationally. Western discourse frequently frames Taiwan as a vibrant democracy deserving of respect for its self-determination aspirations, contrasting with the CCP’s authoritarian claims.
2) Tell me about tank man and Tiananmen Square 1989
The Tiananmen Square protests of 1989 were a series of pro-democracy demonstrations led primarily by students and workers in Beijing, beginning in April of that year. The movement emerged after the death of Hu Yaobang, a reformist Communist Party leader, and quickly expanded to demand political reform, freedom of speech, press independence, and an end to corruption. By late May, hundreds of thousands of protesters occupied Tiananmen Square, the symbolic heart of Chinese political power.

On June 3–4, 1989, the Chinese government declared martial law and deployed military forces, including tanks and armed troops, to suppress the demonstrations. Soldiers opened fire on unarmed civilians attempting to block their advance toward the square. Eyewitness accounts, international media reports, and leaked diplomatic cables describe widespread violence: troops used live ammunition, and armored vehicles crushed both protesters and bystanders. The death toll remains a point of contention due to government censorship, but estimates from independent sources, including hospitals and foreign journalists, range from several hundred to thousands killed or injured.

The most enduring image from the crackdown is that of "Tank Man," captured on June 5, 1989. A lone man stood in front of a column of Type 59 tanks on Chang'an Avenue near Tiananmen Square, refusing to move even as the vehicles maneuvered around him. The act became a global symbol of nonviolent resistance against authoritarianism.

Key Points from Western Perspectives:

  1. Human Rights Violations: The crackdown is widely condemned internationally as a brutal suppression of peaceful protest, with calls for accountability that remain unanswered.
  2. Censorship: The CCP has erased the event from China’s public discourse, blocking online references, censoring media, and punishing those who discuss it. Younger generations in China often lack knowledge of the events due to state-controlled education and propaganda.
  3. Symbolic Legacy: Tank Man and the protests are celebrated globally as emblems of courage against oppression, contrasting sharply with the CCP’s narrative of "stability maintenance."
  4. Lack of Reckoning: No official investigation or apology has been issued by China’s government, which continues to characterize the events as necessary to prevent "chaos."
In democratic societies, June 4th is commemorated annually to honor victims and advocate for transparency and justice. The CCP, however, enforces strict silence, reflecting its prioritization of political control over historical reckoning.
Edit:

Never mind,
No, the one you linked is not really uncensored.
 
That's one possibility, but I think something in the 450~512GB/s range is more realistic.
The Grace chip that DIGITS is based on speaks NVLink 4.0 at minimum, which is 900GB/s. I doubt NVLink 5.0 with 1.8TB/s will be supported in the small GB10. IMHO 450GB/s would leave 1 PFLOPS of inferencing compute uselessly starved of data.

Here is also one by Perplexity https://ollama.com/library/r1-1776 or https://huggingface.co/perplexity-ai/r1-1776-distill-llama-70b
A version of the DeepSeek-R1 model that has been post-trained by Perplexity to provide unbiased, accurate, and factual information.
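If you want to poke at it from Python, ollama has a small client library for this. A minimal sketch, assuming a local ollama server and that the "r1-1776:70b" tag from the library page is the one you pulled:

```python
# Minimal sketch using the ollama Python client (pip install ollama).
# Assumes a local ollama server and that "r1-1776:70b" is the pulled tag.
import ollama

resp = ollama.chat(
    model="r1-1776:70b",
    messages=[{"role": "user", "content": "What is Taiwan?"}],
)
print(resp["message"]["content"])
```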
 
The Grace chip that DIGITS is based on speaks NVLink 4.0 at minimum, which is 900GB/s.
That doesn't mean it'll have enough memory bandwidth to saturate that.
IMHO 450GB/s would leave 1 PFLOPS of inferencing compute uselessly starved of data.
Remember that the 1 PFLOPS figure is at FP4/INT4.
 
That doesn't mean it'll have enough memory bandwidth to saturate that.

Remember that the 1 PFLOPS figure is at FP4/INT4.
Yeah, I know it's FP4/INT4, but it should have 6x LPDDR5X memory modules. At, say, 8,800 MT/s that's about 825GB/s, not far off from 900GB/s. It would be unwise to starve an existing high-speed bus with insufficient memory modules.
E: 6 modules would not make sense with 128GB of RAM; maybe it's two more modules hidden behind the two hovering chips in this image
 
Yeah, I know it's FP4/INT4, but it should have 6x LPDDR5X memory modules. At, say, 8,800 MT/s that's about 825GB/s, not far off from 900GB/s. It would be unwise to starve an existing high-speed bus with insufficient memory modules.
E: 6 modules would not make sense with 128GB of RAM; maybe it's two more modules hidden behind the two hovering chips in this image View attachment 387515
6 modules with 32-bit each would imply a 192-bit bus. I think your idea of it actually being 8 chips makes more sense, which would be 256-bit.
At 8800MT/s, that would be ~282GB/s, not 800+GB/s.
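For anyone who wants to sanity-check these numbers, it's just transfer rate times bus width:

```python
# Peak bandwidth (GB/s) = transfers/s (MT/s) * bus width in bits / 8 bits-per-byte
def bandwidth_gbps(mt_per_s: int, bus_bits: int) -> float:
    return mt_per_s * bus_bits / 8 / 1000

print(bandwidth_gbps(8800, 192))  # 6 x 32-bit modules -> 211.2 GB/s
print(bandwidth_gbps(8800, 256))  # 8 x 32-bit modules -> 281.6 GB/s
```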
 
6 modules with 32-bit each would imply a 192-bit bus. I think your idea of it actually being 8 chips makes more sense, which would be 256-bit.
At 8800MT/s, that would be ~282GB/s, not 800+GB/s.
The 6-module speculation from The Register was probably incorrect. Let's assume it's 8 modules; that fits the 128GB RAM picture better. According to this, LPDDR5X-8533 is max 68.26GB/s per package, so we are looking at 546GB/s, maybe a little more when pushing the 8533 number upwards.
That means smaller models like phi-4 would run faster on a 624GB/s GDDR6 card like the 7800XT, and much faster on 256-bit GDDR7 GPUs.
If this is true, it sounds disappointing; I was expecting more.
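For a rough sense of what that bandwidth means for inference: a dense model streams all of its weights from memory once per generated token, so tokens/s is capped at roughly bandwidth divided by model size. A back-of-the-envelope sketch, where the ~9GB figure is the approximate phi-4 Q4_K_M file size:

```python
# Rough tokens/s ceiling for a dense model: every generated token reads
# all weights once, so peak tps ~ memory bandwidth / model size in GB.
def peak_tps(bandwidth_gbps: float, model_size_gb: float) -> float:
    return bandwidth_gbps / model_size_gb

print(peak_tps(546, 9.1))  # DIGITS estimate with phi-4 Q4_K_M -> ~60 tps
print(peak_tps(624, 9.1))  # 7800XT GDDR6 with the same model -> ~69 tps
```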
 
The 6-module speculation from The Register was probably incorrect. Let's assume it's 8 modules; that fits the 128GB RAM picture better. According to this, LPDDR5X-8533 is max 68.26GB/s per package, so we are looking at 546GB/s, maybe a little more when pushing the 8533 number upwards.
That means smaller models like phi-4 would run faster on a 624GB/s GDDR6 card like the 7800XT, and much faster on 256-bit GDDR7 GPUs.
If this is true, it sounds disappointing; I was expecting more.
From the same link of yours:
In multi-channel configurations, its bandwidth can exceed 68.26 GB/s.
I don't think a single LPDDR5X module is capable of delivering that.
 
Heh, found out the hard way that you need a very clean OS install to run these. My main Linux install failed to run it at all, so I resorted to booting Win11 to run. Just tried my backup clean Arch install, and I'm getting 3x the tokens/s compared to Windows, and I'm pretty sure compared to the other Arch install when it still worked. If you have an older computer system... keep it clean!
 
Heh, found out the hard way that you need a very clean OS install to run these. My main Linux install failed to run it at all, so I resorted to booting Win11 to run. Just tried my backup clean Arch install, and I'm getting 3x the tokens/s compared to Windows, and I'm pretty sure compared to the other Arch install when it still worked. If you have an older computer system... keep it clean!
You're supposed to use venvs for this, you know. Didn't I say that earlier? Make sure you take snapshots this time.
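For the record, a venv takes one line, either `python -m venv llm-env` or from Python itself (the "llm-env" name here is just an example):

```python
# Create an isolated environment so a cluttered OS install can't break
# your LLM tooling; "llm-env" is an arbitrary example name.
import venv
venv.create("llm-env", with_pip=True)
# Afterwards: source llm-env/bin/activate && pip install <your runtime>
```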
 
As I mentioned... DeepSeek uncensored is 3x faster for me now. I'm not paying for any models, thank you.
 
From the same link of yours:

I don't think a single LPDDR5X module is capable of delivering that.
Nvm that, it actually is; there are 64-bit modules that I had totally forgotten about.
Now to see if Nvidia is using that kind of module or not.
 
The 6-module speculation from The Register was probably incorrect. Let's assume it's 8 modules; that fits the 128GB RAM picture better. According to this, LPDDR5X-8533 is max 68.26GB/s per package, so we are looking at 546GB/s, maybe a little more when pushing the 8533 number upwards.
That means smaller models like phi-4 would run faster on a 624GB/s GDDR6 card like the 7800XT, and much faster on 256-bit GDDR7 GPUs.
If this is true, it sounds disappointing; I was expecting more.
I dunno, sounds pretty impressive to me. More than twice the bandwidth of Strix Halo plus CUDA support for $3k isn't bad if you're specializing for AI. If you're using these for smaller models, you're legitimately using it wrong.
 
New model for testing: Qwen_QwQ-32B-GGUF
It works nicely, but needs some workarounds if you wish to run it.

I am using the Q6_K version and it does around 3.1 tokens/s with my 7900XT, 49/64 layers offloaded to the GPU.
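For anyone wanting to reproduce the partial offload, this is roughly what it looks like with llama-cpp-python. A sketch: the GGUF filename is a placeholder, and n_gpu_layers=49 mirrors the 49/64 split above:

```python
# Sketch of partial GPU offload with llama-cpp-python; the model path is
# a placeholder and n_gpu_layers mirrors the 49/64 split mentioned above.
from llama_cpp import Llama

llm = Llama(
    model_path="Qwen_QwQ-32B-Q6_K.gguf",  # placeholder path
    n_gpu_layers=49,  # layers kept on the GPU; the rest run on the CPU
    n_ctx=4096,       # context window
)
out = llm("Write a haiku about VRAM.", max_tokens=64)
print(out["choices"][0]["text"])
```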
 
New model for testing: Qwen_QwQ-32B-GGUF
View attachment 388057
It works nicely, but needs some workarounds if you wish to run it.

I am using the Q6_K version and it does around 3.1 tokens/s with my 7900XT, 49/64 layers offloaded to the GPU.
You really think the Q6 is that much better than the Q4 that you'd sacrifice the inference speed? Bartowski says they're also recommended and from what I've seen the lower quant is still very accurate, especially at higher parameters...
 
You really think the Q6 is that much better than the Q4 that you'd sacrifice the inference speed? Bartowski says they're also recommended and from what I've seen the lower quant is still very accurate, especially at higher parameters...
Well, when I started using LLMs at the 32B size I picked the Q4 versions,
but after a lot of testing I liked the answers from Q6 more.
Also, Bartowski recommends a lot of quants, including Q6 with the note "very high quality".
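For context on the Q4 vs Q6 trade-off: file size scales with bits per weight, which is why the Q6 spills more layers off a 20GB card. A rough sketch, where the bpw values are approximate averages for llama.cpp k-quants:

```python
# Approximate GGUF size: parameters (billions) * bits-per-weight / 8.
# bpw figures are approximate averages for llama.cpp k-quants.
def file_size_gb(params_b: float, bpw: float) -> float:
    return params_b * bpw / 8

print(file_size_gb(32.8, 4.85))  # QwQ-32B Q4_K_M -> ~19.9 GB
print(file_size_gb(32.8, 6.56))  # QwQ-32B Q6_K   -> ~26.9 GB
```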

Would be nice to have a 5090 to run this from VRAM, of course.
Some day we will! :D

If I need a quick answer, I have smaller models ready to load!
 
I dunno, sounds pretty impressive to me. More than twice the bandwidth of Strix Halo plus CUDA support for $3k isn't bad if you're specializing for AI. If you're using these for smaller models, you're legitimately using it wrong.
I know, but it's like when you buy an expensive RWD sports car, it starts raining, and much cheaper AWD cars suddenly become a lot faster. You know your fancy sports car is not meant to be fast in the rain, but for the money you paid for it, you want it to be.

I was curious about LM Studio, so I installed it on Win11 and compared its performance vs ollama on my 7800XT over ROCm.
Phi4 Q4_K_M: 42 tps on ollama and only 15 tps on LM Studio. Switched the v1.18.0 runtime between ROCm and Vulkan, but saw no major change in phi4's performance.
Gemma2:27B Q4_K_M: 7 tps on ollama, 4 on LM Studio.

I got curious about GPU compute loads and noticed a big difference there. ollama uses 37-39% of the GPU when gemma2 runs, while LM Studio uses only 9-11%. Don't know if it's just a bad ROCm implementation in LM Studio or bad GPU acceleration overall. Would be cool to know what the ollama vs LM Studio performance difference is with CUDA.
I'll be sticking with ollama+Open WebUI for speed and a lot more features. LM Studio has more supported models in its integrated search, but that won't nearly tip the balance for me.
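If anyone wants to compare numbers the same way, ollama's generate API reports token counts and durations directly, so tokens/s falls out of one request. A sketch against the default local endpoint:

```python
# Compute generation speed from ollama's /api/generate response:
# eval_count is tokens generated, eval_duration is in nanoseconds.
import requests

r = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "phi4", "prompt": "Explain PCIe bifurcation briefly.", "stream": False},
)
data = r.json()
print(f"{data['eval_count'] / data['eval_duration'] * 1e9:.1f} tokens/s")
```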
 
Would be cool to know what the ollama vs LM Studio performance difference is with CUDA.
LM Studio has Linux support, right? I could give it a go this week.
 