
Reports Suggest DeepSeek Running Inference on Huawei Ascend 910C AI GPUs

T0@st

News Editor
Huawei's Ascend 910C AI chip was positioned as one of the better Chinese-developed alternatives to NVIDIA's H100 accelerator—reports from last autumn suggested that samples were being sent to highly important customers. The likes of Alibaba, Baidu, and Tencent have long relied on Team Green enterprise hardware for all manner of AI crunching, but trade sanctions have severely limited the supply and potency of Western-developed AI chips. NVIDIA's region-specific B20 "Blackwell" accelerator is due for release this year, but industry watchdogs reckon that the Ascend 910C AI GPU is a strong rival. The latest online rumblings have pointed to another major Huawei customer—DeepSeek—having Ascend silicon in their back pockets.

DeepSeek's recent unveiling of its R1 open-source large language model has disrupted international AI markets. A lot of press attention has focused on DeepSeek's CEO stating that his team can access up to 50,000 NVIDIA H100 GPUs, but many have not looked into the company's (alleged) pool of domestically made chips. Yesterday, Alexander Doria—an LLM enthusiast—shared an interesting insight: "I feel this should be a much bigger story—DeepSeek has trained on NVIDIA H800, but is running inference on the new home Chinese chips made by Huawei, the 910C." Experts believe that there will be a plentiful supply of Ascend 910C GPUs—estimates from last September posit that 70,000 chips (worth around $2 billion) were in the mass-production pipeline. Additionally, industry whispers suggest that Huawei is already working on a—presumably even more powerful—successor.



 
"Alexander Doria—an LLM enthusiast" ah so literally the most boring person in the world
 
Shouldn't it be NPU instead of GPU, since no graphics are being processed with LLMs?
 
Would their choice of the Nvidia assembly language for R1 be influenced by the hardware? I don't know much about AI training in general. It's interesting when you read that the training was done on Huawei's hardware, however.
 
Training done on Nvidia, inference done on Huawei.
 
That claim is misleading. The picture that served as the source only covers the distilled, smaller models, some of which you can even run on your smartphone.

Shouldn't it be NPU instead of GPU, since no graphics are being processed with LLMs?
NPUs are meant solely for inference on quantized data types (like INT4 and INT8); they're not that useful for training, nor for inference with higher-precision weights.
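To illustrate what that quantization looks like (a minimal NumPy sketch with toy sizes and made-up names, not anything DeepSeek or Huawei actually ship): symmetric INT8 quantization collapses a weight tensor into 8-bit integers plus a single scale factor, which is cheap to run but too coarse for the gradient updates training depends on.

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor quantization: w ~= scale * q."""
    scale = np.abs(w).max() / 127.0              # map the largest weight to +/-127
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.random.randn(1024, 1024).astype(np.float32)   # toy FP32 weight matrix
q, scale = quantize_int8(w)
print("worst-case error:", np.abs(w - dequantize(q, scale)).max())  # small, but nonzero
```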
Would their choice of the Nvidia assembly language for R1 be influenced by the hardware? I don't know much about AI training in general. It's interesting when you read that the training was done on Huawei's hardware, however.
They used PTX only for a small portion of the pipeline. From their own paper:
Specifically, we employ customized PTX (Parallel Thread Execution) instructions and auto-tune the communication chunk size, which significantly reduces the use of the L2 cache and the interference to other SMs.
Most of the code was likely in CUDA.
Also keep in mind that OP is talking about inference, not training.
 
I see. So they're limited to US hardware in some way.
Had to check what inference and training mean in regard to AI. I thought they were the same, or maybe English as a second language plays a role :)
 
Basically, training is creating the AI model (based on an architectural blueprint), and inference is running an AI model that already exists.
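A toy sketch of that difference (PyTorch, with an invented one-layer model and fake data, purely illustrative): training loops over data and keeps adjusting the weights against a loss, while inference is just a forward pass through weights that are already frozen.

```python
import torch
import torch.nn as nn

model = nn.Linear(8, 1)                                  # toy one-layer "model"
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
x, y = torch.randn(64, 8), torch.randn(64, 1)            # fake training data

# Training: repeatedly nudge the weights to shrink the loss.
for _ in range(100):
    loss = nn.functional.mse_loss(model(x), y)
    optimizer.zero_grad()
    loss.backward()          # compute gradients...
    optimizer.step()         # ...and update the weights

# Inference: run the now-fixed weights forward; no gradients involved.
with torch.no_grad():
    prediction = model(torch.randn(1, 8))
```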
 
One reassures oneself however one can.

Training is more intensive but you perform inference a lot more times.

The hardware functions required for neural networks are basic mathematical operations on huge matrices and vectors. This is not rocket science nor a secret sauce.
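For instance (a NumPy toy with invented sizes), the core of one dense layer's forward pass is just a big matrix-vector product plus a cheap element-wise function:

```python
import numpy as np

x = np.random.randn(1024)           # input activations
W = np.random.randn(1024, 1024)     # weight matrix (real LLM layers are far larger)
b = np.random.randn(1024)

y = np.maximum(W @ x + b, 0.0)      # one layer: matmul + bias + ReLU, nothing exotic
```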

The performance difference between a 3/4 nm chip and a 7 nm one (which, as I understand it, the Chinese are able to produce) does not seem that important.

And really, chasing smaller nodes and larger numbers of GPUs is just brute-forcing the problem.

DeepSeek just demonstrated that cleverness and algorithmic optimization provide much more benefit than brute force.

That's why I think the use of NVIDIA chips is related more to the software stack that comes with them (CUDA) than to the hardware itself, as it gives you more flexibility (to easily add, remove, or change things) while developing the model. Inference, once you have the model with its layers and weights, is straightforward, and thus does not need such a flexible development environment.
 
DeepSeek just demonstrated that cleverness and algorithmic optimization provide much more benefit than brute force.
That's the joke: now that the cat is out of the bag, everyone can use it, and brute force matters again.
The market reaction to this is dumb and extremely short-sighted.

The problem with the bubble still remains: how are they going to make money with it? It's not 60 queries a month for $20.
 

You have to pay for brute force, and that may prove difficult if you are competing with a free model you can run at home.

Hence all the media chatter trying to smear it or fearmonger over it.

Also, you can always find better optimizations, and that is not something the US can somehow restrict China from developing.

And the fact that it was China that did it, while the US was stuck on brute force, is quite a problem, I think.
 