
Hygon Prepares 128-Core, 512-Thread x86 CPU with Four-Way SMT and AVX-512 Support

AleksandarK

News Editor
Chinese server CPU maker Hygon, which builds on Zen core IP licensed from AMD through a joint venture, has published a roadmap for the C86-5G, its most powerful server processor to date, featuring up to 128 cores and an astonishing 512 threads. Thanks to a complete microarchitectural redesign, the new chip delivers more than 17 percent higher instructions per cycle (IPC) than its predecessor. It also supports the AVX-512 vector instruction set and four-way simultaneous multithreading (SMT), making it a strong contender for highly parallel workloads. Sixteen channels of DDR5-5600 memory feed data-intensive tasks, while CXL 2.0 interconnect support enables seamless scaling across multiple sockets. Built on an undisclosed semiconductor node, the C86-5G includes advanced power management and a hardened security engine. With 128 lanes of PCIe 5.0, it offers ample bandwidth for accelerators, NVMe storage, and high-speed networking. Hygon positions this flagship CPU as ideal for artificial intelligence training clusters, large-scale analytics platforms, and virtualized enterprise environments.

The C86-5G is the culmination of five years of steady development. The journey began with the C86-1G, an AMD-licensed design that served as a testbed for domestic engineers. It offered up to 32 cores, 64 threads, eight channels of DDR4-2666 memory, and 128 lanes of PCIe 3.0. Its goal was to absorb proven technology and build local know-how. Next came the C86-2G, which kept the same core count but introduced a revamped floating-point unit, 21 custom security instructions, and hardware-accelerated features for memory encryption, virtualization, and trusted computing. This model marked Hygon's first real step into independent research and development. With the C86-3G, Hygon rolled out a fully homegrown CPU core and system-on-chip framework. Memory support increased to DDR4-3200, I/O doubled to PCIe 4.0, and on-die networking included four 10 GbE and eight 1 GbE ports. The C86-4G raised the bar further by doubling compute density to 64 cores and 128 threads, boosting IPC by around 15 percent and adding 12-channel DDR5-4800 memory plus 128 lanes of PCIe 5.0. Socket options expanded to dual and quad configurations. Now, with the C86-5G, Hygon aims to show it can compete head-to-head with global server CPU leaders, underscoring China's growing capabilities in high-performance computing.



View at TechPowerUp Main Site | Source
 
Why 4 threads per core? Two threads keep the core almost fully busy, so more threads gain little to nothing.
 
Why 4 threads per core? Two threads keep the core almost fully busy, so more threads gain little to nothing.
I think it's possible, not certain but possible, that the team of hundreds of highly trained and qualified microprocessor engineers might know something about this? Other high end CPUs have used more than 2 way SMT in the past.

They'll have simulated this every which way to Sunday.
 
Why 4 threads per core? Two threads keep the core almost fully busy, so more threads gain little to nothing.
IIRC IBM POWER has SMT4, for very specific scientific workloads. Think nuclear decomposition.
 
I think it's possible, not certain but possible, that the team of hundreds of highly trained and qualified microprocessor engineers might know something about this? Other high end CPUs have used more than 2 way SMT in the past.

They'll have simulated this every which way to Sunday.
Oh yes, because hundreds of highly trained and qualified microprocessor engineers NEVER push something that doesn't work right *cough cough* 13th-gen Intel *cough* *cough* AMD Bulldozer *cough*.
 
Curious to know what the die size is on this thing.
 
Chinese server CPU maker Hygon, which owns an x86 CPU license from AMD
What is the source for this statement? As far as I know AMD is not capable of sub-licensing x86 without Intel's approval, and that's not what happened with Hygon.
AnandTech's analysis from 2020:

AMD Does Due Diligence

Simply stating ‘AMD sublicensed the IP of one of its x86 designs’ sounds a bit farfetched on most days of the week. If either AMD or Intel believed that the opportunity to let others sell its CPU designs was profitable, how come it took until 2015/2016 to ever come to fruition? Part of this story covers that while there was clearly some money in it for AMD here, it didn’t fall foul of any Intel-AMD licensing agreements. And most importantly, it didn’t contravene any US laws regarding the export of high-performance computing intellectual property.

This last point is important. The US government gives every CPU that comes out of Intel, AMD, and others, a value based on its performance. This is some combination of FLOPs and power, and those that surpass a specific threshold are deemed too powerful to be sold in certain markets. This includes semi-custom processors, where AMD/Intel fiddle with the core count/frequency and provide off-roadmap parts.

AMD at the time made the following statement:

Starting in 2015, AMD diligently and proactively briefed the Department of Defense, the Department of Commerce and multiple other agencies within the U.S. Government before entering into the joint ventures. AMD received no objections whatsoever from any agency to the formation of the joint ventures or to the transfer of technology – technology which was of lower performance than other commercially available processors. In fact, prior to the formation of the joint ventures and the transfer of technology, the Department of Commerce notified AMD that the technology proposed was not restricted or otherwise prohibited from being transferred. Given this clear feedback, AMD moved ahead with the joint ventures.
AMD had contacted the DoD and DoC, as well as all others, and had been given the green light. The new microarchitecture was deemed of low enough performance to not hit any of the export bans. AMD was also given crystal clear confirmation that the ‘technology proposed was not restricted or otherwise prohibited from being transferred’, which is a rather stark statement. At this point it should be clear that AMD may have submitted a modified version of its IP to the relevant US departments, rather than the microarchitecture we saw in the Ryzen 1000-series. This is part of what this review is about.
 
I think it's possible, not certain but possible, that the team of hundreds of highly trained and qualified microprocessor engineers might know something about this? Other high end CPUs have used more than 2 way SMT in the past.

They'll have simulated this every which way to Sunday.
Coming from the country that invented tofu-dreg construction... I doubt it.
 
"21 custom security instructions"
I do wonder what those entail :(

Why 4 threads per core? Two threads keep the core almost fully busy, so more threads gain little to nothing.
SMT is a relic of the past, and stopped making sense for user-interactive workloads after quad cores, but will stick around for a while in the server space, partly due to marketing reasons, but also because there are certain server workloads where it sort-of "makes sense", but that rationale is still shrinking. This is limited to workloads where the core is stalled most of the time thanks to cache misses and mispredictions, each worker thread is async, and the only thing that matters is overall throughput (not latency). Remember, the 4 threads will compete over caches and front-end resources, so the effective throughput for a single thread for the intended workload would have to be pretty miserable in order to justify 4-way SMT (or even 8-way like with PPC).
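To make the "stalled most of the time" case concrete, here's a minimal pointer-chasing sketch in C (sizes and iteration counts are arbitrary illustration, not a tuned benchmark). Every load depends on the previous one and almost always misses cache, so the core idles for most cycles, and that idle time is exactly what a second (or fourth) hardware thread can soak up:

#include <stdio.h>
#include <stdlib.h>

#define N (1u << 24)   /* ~16M nodes, far bigger than any cache */

int main(void) {
    size_t *order = malloc((size_t)N * sizeof *order);
    size_t *next  = malloc((size_t)N * sizeof *next);
    for (size_t i = 0; i < N; i++) order[i] = i;
    /* Fisher-Yates shuffle, then link the shuffled order into one big cycle */
    for (size_t i = N - 1; i > 0; i--) {
        size_t j = (size_t)rand() % (i + 1);
        size_t t = order[i]; order[i] = order[j]; order[j] = t;
    }
    for (size_t i = 0; i < N; i++) next[order[i]] = order[(i + 1) % N];
    /* Each load depends on the previous one, so the core mostly waits on
       memory; running one such chain per SMT thread overlaps those stalls. */
    size_t p = 0;
    for (long k = 0; k < 100000000L; k++) p = next[p];
    printf("%zu\n", p);   /* keep the chain live */
    free(order); free(next);
    return 0;
}

Run one chain per logical CPU and aggregate throughput climbs, even though each individual chain gets slightly slower.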

While modern x86 microarchitectures from Intel and AMD aren't anywhere close to saturating the CPU resources, their continuing advancement has made SMT less and less useful over time. So the fewer idle cycles there are, the less "free performance" can be extracted through SMT, which is probably what you're thinking about.

Meanwhile, Intel's upcoming Diamond Rapids and hopefully Nova Lake will introduce APX, which according to their documentation should bring a significant uplift in throughput.

I think it's possible, not certain but possible, that the team of hundreds of highly trained and qualified microprocessor engineers might know something about this? Other high end CPUs have used more than 2 way SMT in the past.

They'll have simulated this every which way to Sunday.
They probably have extracted the performance they could the easiest way within their time-frame and constraints, and the end result is a CPU with lots of resources on the execution side, but with a very weak front-end to feed it.
It could also be that their SMT implementation works differently from Intel's and AMD's, e.g. executing two of four threads intermixed (where Intel/AMD switch between two threads). If this happens to be the case, the saturation for each thread would be dreadful.

For instance PPC with its 8-way SMT is(was?) popular for certain java workloads, which are so inefficient that they barely execute at all. :p (more like a traffic jam…)
 
Long time since a new CPU came with SMT4 or higher, cool to see.
Let's see how it performs in practice.

Remember, the 4 threads will compete over caches and front-end resources, so the effective throughput for a single thread for the intended workload would have to be pretty miserable in order to justify 4-way SMT (or even 8-way like with PPC).
It could also be that their SMT implementation works differently from Intel's and AMD's, e.g. executing two of four threads intermixed (where Intel/AMD switch between two threads). If this happens to be the case, the saturation for each thread would be dreadful.
Going for SMT makes your front-end way simpler as well, and allows you to do some fancy strategies to increase IPC and maximize EU utilization for some given scenarios.
Zen 5, as an example, has 2x 4-wide decoders, which are pretty bog-standard to implement (compared to Intel's 6-wide and larger implementations). A single thread will end up bottlenecked by it, and is not able to make use of both decoders, but with SMT it's possible to basically double up the IPC.
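In round numbers, and assuming decode were the only limiter (it rarely is, hence "basically"):

\[
\text{1 thread: } \le 4 \text{ decoded ops/cycle}, \qquad \text{2 threads: } \le 2 \times 4 = 8 \text{ decoded ops/cycle.}
\]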

Intel has that fancy 3x3-wide decode cluster that a single thread can use, but those are used in the E-cores, which lack a µop cache.

Given how that Hygon CPU is meant for servers and not your usual desktop use-case, I believe it does make more sense to go with SMT, especially for IO-bound workloads that basically fit the description you gave:
This is limited to workloads where the core is stalled most of the time thanks to cache misses and mispredictions, each worker thread is async, and the only thing that matters is overall throughput (not latency).
Many server-ish workloads can pretty much be summarized like that (especially in web-related stuff), and SMT on Zen CPUs gives a really significant boost to throughput, and even latency in some scenarios.
 
"21 custom security instructions"
I do wonder what those entails :(
It's most likely Chinese crypto instructions as in SM3 and SM4 which are already supported by some RISC-V, ARM cores and Intel Arrow/Lunar Lake.
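If they are indeed SM-family instructions, the software story is already in place, since generic crypto APIs expose these algorithms today. Here's a minimal sketch using OpenSSL's EVP interface (OpenSSL 1.1.1+ provides EVP_sm4_cbc(); the key, IV, and plaintext below are dummy values), which could in principle be backed by such hardware instructions:

/* build: cc sm4demo.c -lcrypto */
#include <openssl/evp.h>
#include <stdio.h>

int main(void) {
    unsigned char key[16] = "0123456789abcdef";   /* dummy 128-bit key */
    unsigned char iv[16]  = "fedcba9876543210";   /* dummy IV */
    unsigned char in[16]  = "attack at dawn!!";   /* one 16-byte block */
    unsigned char out[32];                        /* room for padding */
    int len = 0, total = 0;

    EVP_CIPHER_CTX *ctx = EVP_CIPHER_CTX_new();
    EVP_EncryptInit_ex(ctx, EVP_sm4_cbc(), NULL, key, iv);
    EVP_EncryptUpdate(ctx, out, &len, in, sizeof in);
    total = len;
    EVP_EncryptFinal_ex(ctx, out + total, &len);
    total += len;
    EVP_CIPHER_CTX_free(ctx);

    for (int i = 0; i < total; i++) printf("%02x", out[i]);
    putchar('\n');
    return 0;
}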
SMT is a relic of the past, and stopped making sense for user-interactive workloads after quad cores, but will stick around for a while in the server space, partly due to marketing reasons, but also because there are certain server workloads where it sort-of "makes sense", but that rationale is still shrinking. This is limited to workloads where the core is stalled most of the time thanks to cache misses and mispredictions, each worker thread is async, and the only thing that matters is overall throughput (not latency). Remember, the 4 threads will compete over caches and front-end resources, so the effective throughput for a single thread for the intended workload would have to be pretty miserable in order to justify 4-way SMT (or even 8-way like with PPC).

While modern x86 microarchitectures from Intel and AMD aren't anywhere close to saturating the CPU resources, their continuing advancement has made SMT less and less useful over time. So the fewer idle cycles there are, the less "free performance" can be extracted through SMT, which is probably what you're thinking about.
AMD doesn't agree since they built Zen 5 specifically for SMT. It has dual 4-way decoders with each dedicated to one thread. NVIDIA doesn't agree since their next ARM Vera CPU will feature SMT. Intel disagrees since their workstation and server CPUs based on P-cores will keep including SMT. It's only E-core designs that won't.
It's still easy to saturate a modern x86 core with an integer load, for example the 7-zip benchmark scales to almost 100% on both AMD and Intel SMT. Floating point is a different story, but still possible to extract tangible benefits especially on modern implementations (Zen 4+, Alder Lake+).
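For anyone who wants to measure this themselves, a rough Linux-only sketch (the sibling numbering is an assumption; logical CPUs 0 and 1 share a core on some topologies, 0 and N/2 on others, so check lscpu -e first): pin a dependent integer loop to one logical CPU, time it, then run it on both siblings of the same core and compare aggregate throughput.

/* build: cc -O2 -pthread smt_scaling.c */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>
#include <stdint.h>

static void *work(void *arg) {
    int cpu = *(int *)arg;
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    pthread_setaffinity_np(pthread_self(), sizeof set, &set);
    /* dependent integer chain, no memory traffic */
    uint64_t x = 1;
    for (long i = 0; i < 2000000000L; i++)
        x = x * 6364136223846793005ULL + 1442695040888963407ULL;
    return (void *)(uintptr_t)x;   /* keep the result live */
}

int main(void) {
    int cpus[2] = {0, 1};   /* assumed SMT siblings -- verify with lscpu -e */
    pthread_t t[2];
    for (int i = 0; i < 2; i++) pthread_create(&t[i], NULL, work, &cpus[i]);
    for (int i = 0; i < 2; i++) pthread_join(t[i], NULL);
    puts("done -- compare wall time of the 2-thread run vs. one thread");
    return 0;
}

If the two-thread run on siblings takes barely longer than the one-thread run, SMT is nearly doubling throughput for that instruction mix.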
Meanwhile, Intel's upcoming Diamond Rapids and hopefully Nova Lake will introduce APX, which according to their documentation should bring a significant uplift in throughput.
APX is solving different issues, chiefly register pressure. It's not like APX server P-core CPUs will not feature SMT, or at least I haven't read anything that would suggest it.
They probably have extracted the performance they could the easiest way within their time-frame and constraints, and the end result is a CPU with lots of resources on the execution side, but with a very weak front-end to feed it.
It could also be that their SMT implementation works differently from Intel's and AMD's, e.g. executing two of four threads intermixed (where Intel/AMD switch between two threads). If this happens to be the case, the saturation for each thread would be dreadful.
I'm not sure about "Intel/AMD switch between two threads" when both are executing at the same time inside the core, and in the case of AMD Zen 5 are even being decoded at the same time. Intel also has an 8-wide decoder which supposedly can be split. I haven't seen any confirmation that it happens for SMT, but I suspect it does.
For instance PPC with its 8-way SMT is(was?) popular for certain java workloads, which are so inefficient that they barely execute at all. :p (more like a traffic jam…)
It was more for the Oracle database side where operations were simple, but had to be kept "intact" (without context switching) in order to optimize throughput while maintaining latency. POWER8 cores were also heavily overbuilt compared to modern x86 - basically two full cores in one with shared caches, which allowed higher-order SMT. I'd say the closest x86 design would be the infamous Bulldozer cores, but POWER went further and didn't share the FP unit.
Coincidentally Oracle's own SPARC CPUs also supported up to SMT8.
 
Zen 5, as an example, has 2x 4-wide decoders…
Zen 5 implements two-ahead branch prediction, in an effort to reduce the cost of mispredictions by having the alternative branch ready to be executed. Such improvements are just another example of reducing idle clock cycles, which in return means gains from SMT will be reduced.

I haven't seen any evidence of significant gains from this yet, but with refinement and combined with APX it has some great (theoretical) potential.

Going for SMT makes your front-end way simpler as well, and allows you to do some fancy strategies to increase IPC and maximize EU utilization for some given scenarios.<snip>
A single thread will end up bottlenecked by it, and is not able to make use of both decoders, but with SMT it's possible to basically double up the IPC.
Just to be clear, SMT the way Intel and AMD implement it doesn't improve IPC at all. It just tries to keep the core fed, as if it were one single thread saturating the core.

APX is solving different issues, chiefly register pressure.
Actually not, the benefits from less register shuffling are just an added bonus, and a rather minimal one to be honest.
APX is about maximizing the efficiency of the branch predictor to saturate the CPU which is very clearly explained in the official documentation:

The performance features introduced so far will have a limited impact on workloads that suffer from a large number of conditional branch mispredictions. As out-of-order CPUs continue to become deeper and wider, the cost of mispredictions increasingly dominates the performance of such workloads. Branch predictor improvements can mitigate this only to a limited extent as data-dependent branches are fundamentally hard to predict.

To address this growing performance issue, we significantly expand the conditional instruction set of x86, which was first introduced with the Intel® Pentium® Pro in the form of CMOV/SET instructions. These instructions are used quite extensively by today’s compilers, but they are too limited for the broader use of if-conversion (a compiler optimization that replaces branches with conditional instructions).

Intel APX adds conditional forms of load, store, and compare/test instructions and adds an option for the compiler to suppress the status flag writes of common instructions. These enhancements expand the applicability of if-conversion to much larger code regions, cutting down on the number of branches that may incur misprediction penalties.

So as you can clearly see, this is very much about saturating the CPU.

How successful it will be remains to be seen. This is pretty much in line with what myself and other programmers have requested for many years; if anything, I'm wondering if it's enough.
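For the non-compiler folks, the "if-conversion" being described looks like this in C. gcc and clang at -O2 already turn the second version into a branchless CMOV on x86, and APX extends the same transformation to conditional loads, stores, and compares (an illustrative sketch, not APX-specific code):

#include <stdio.h>
#include <stddef.h>

/* Branchy version: one data-dependent, hard-to-predict branch per element. */
long sum_pos_branchy(const long *a, size_t n) {
    long s = 0;
    for (size_t i = 0; i < n; i++)
        if (a[i] > 0)            /* mispredicts often on random data */
            s += a[i];
    return s;
}

/* If-converted version: the compiler can emit CMOV instead of a branch,
   so there is nothing left to mispredict. */
long sum_pos_branchless(const long *a, size_t n) {
    long s = 0;
    for (size_t i = 0; i < n; i++)
        s += (a[i] > 0) ? a[i] : 0;
    return s;
}

int main(void) {
    long a[] = {3, -1, 4, -1, 5, -9, 2, 6};
    size_t n = sizeof a / sizeof a[0];
    printf("%ld %ld\n", sum_pos_branchy(a, n), sum_pos_branchless(a, n));
    return 0;
}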
 
Why 4 threads per core? Two threads keep the core almost fully busy, so more threads gain little to nothing.
Not true for highly threaded workloads, otherwise SPARC would never have existed.
 
I think it's possible, not certain but possible, that the team of hundreds of highly trained and qualified microprocessor engineers might know something about this? Other high end CPUs have used more than 2 way SMT in the past.

They'll have simulated this every which way to Sunday.
4-way SMT would be very bad for music production, as it requires higher speeds rather than moar coars. VSTs, or virtual instruments, are fully emulated pieces of music hardware done in software that emulate all the aspects of an actual instrument, and they are very CPU-intense and need very high clock speeds and IPC. If the DSP usage maxes out you get cutouts and bad stuttering. Music production requires realtime performance and very low DPC latency as well. I know these 4-way SMT CPUs are going to do badly for music production! Each core will work too hard and there will be quadruple-digit latency!! In some cases people even disable hyperthreading to get better performance while making and performing music!

I have a few systems that are only 4c8t that do better with certain tracks than one of my 8c16t systems, due to higher clock speeds and lower DPC. I have some synth patches I made that will bring any CPU to its knees.
 
Zen 5 implements two-ahead branch prediction, in an effort to reduce the cost of mispredictions by having the alternative branch ready to be executed. Such improvements are just another example of reducing idle clock cycles, which in return means gains from SMT will be reduced.
Yes, but those are not mutually exclusive. There are still considerable gains from SMT within Zen 5 nonetheless.

Just to be clear, SMT the way Intel and AMD implement it doesn't improve IPC at all. It just tries to keep the core fed, as if it were one single thread saturating the core.
It does improve IPC in absolute terms in practice, given that a single thread is not able to effectively saturate the core. As an example, see this micro-benchmark from Chips and Cheese:

You can argue that it's a workaround for a front-end bottleneck (which I'd agree with), but that doesn't change the end results.
 
It does improve IPC in absolute terms in practice, given that a single thread is not able to effectively saturate the core. As an example, see this micro-benchmark from Chips and Cheese:
Not with music production. A VST runs everything on one thread, and when it pushes it hard... see my above post...
 
Zen 5 implements two-ahead branch prediction, in an effort to reduce the cost of mispredictions by having the alternative branch ready to be executed. Such improvements are just another example of reducing idle clock cycles, which in return means gains from SMT will be reduced.

I haven't seen any evidence of significant gains from this yet, but with refinement and combined with APX it has some great (theoretical) potential.
Improvements to the branch prediction affect both SMT threads since both raw decoding and branch prediction+opcache are active at the same time in Zen 5:
Both the fetch+decode and op cache pipelines can be active at the same time, and both feed into the in-order micro-op queue.
(source - AMD via Chips and Cheese)
Just to be clear, SMT the way Intel and AMD implement it doesn't improve IPC at all. It just tries to keep the core fed, as if it were one single thread saturating the core.
No, SMT does increase IPC, and in the case of Zen 5 it doubles it when the op cache runs out, as expected from the decoder design:
https://substack-post-media.s3.amazonaws.com/public/images/2659f108-5039-4dfc-ae47-8e4b8a8f9ba3_1140x530.png

(source - Chips and Cheese)
Even if the Op Cache is disabled:
https://substack-post-media.s3.amazonaws.com/public/images/693afadc-4b79-47d2-a214-b30346371254_2171x1000.png

(source - Chips and Cheese)
Actually not, the benefits from less register shuffling are just an added bonus, and a rather minimal one to be honest.
APX is about maximizing the efficiency of the branch predictor to saturate the CPU which is very clearly explained in the official documentation:

The performance features introduced so far will have a limited impact on workloads that suffer from a large number of conditional branch mispredictions. As out-of-order CPUs continue to become deeper and wider, the cost of mispredictions increasingly dominates the performance of such workloads. Branch predictor improvements can mitigate this only to a limited extent as data-dependent branches are fundamentally hard to predict.

To address this growing performance issue, we significantly expand the conditional instruction set of x86, which was first introduced with the Intel® Pentium® Pro in the form of CMOV/SET instructions. These instructions are used quite extensively by today’s compilers, but they are too limited for the broader use of if-conversion (a compiler optimization that replaces branches with conditional instructions).

Intel APX adds conditional forms of load, store, and compare/test instructions and adds an option for the compiler to suppress the status flag writes of common instructions. These enhancements expand the applicability of if-conversion to much larger code regions, cutting down on the number of branches that may incur misprediction penalties.


So as you can clearly see, this is very much about saturating the CPU.

How successful it will be remains to be seen. This is pretty much in line with what myself and other programmers have requested for many years; if anything, I'm wondering if it's enough.
What you quoted affects one type of workload, and doesn't invalidate SMT in any way. As I wrote before I haven't read anything that makes APX SMT-phobic ;)
 
It does improve IPC in absolute terms in practice, given that a single thread is not able to effectively saturate the core.
Absolutely not. It's a common misconception that IPC means performance per clock; it doesn't, it's the number of instructions the CPU is able to churn through per cycle. Whether there are one, two, or more threads sharing a core's resources, the core's IPC ceiling remains constant. SMT does improve the saturation of the core for some workloads, but the total performance will only converge towards a single thread fully saturating the core, never above that. This should be basic knowledge about CPUs.
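To put that claim in symbols (a simplified model that treats the core's issue width as the hard ceiling; I_t is the instructions retired by thread t over the same C cycles):

\[
\mathrm{IPC}_{\text{core}} = \frac{\sum_{t} I_t}{C} \;\le\; \mathrm{IPC}_{\text{1T, saturated}} \;\le\; W_{\text{issue}}
\]

SMT can push the sum toward the ceiling; it cannot lift the ceiling.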

What you quoted affects one type of workload, and doesn't invalidate SMT in any way. As I wrote before I haven't read anything that makes APX SMT-phobic ;)
I never claimed APX was "SMT-phobic"; the two will probably co-exist for a while. But every microarchitectural improvement that results in a better-saturated core means fewer stalls, and therefore fewer idle "free" clock cycles for SMT to utilize. As you can clearly see in the quote from earlier about APX ("As out-of-order CPUs continue to become deeper and wider, the cost of mispredictions increasingly dominates the performance of such workloads."), it is very clearly about keeping the CPU saturated. The more saturated the core is from one thread, the less there is to gain from SMT; this is basic logical deduction, and it is why we've seen fewer and fewer cases where SMT is significantly beneficial as CPUs advance.

On top of that, given the intricate complexity of implementing SMT in a modern CPU pipeline, the resulting transistor "costs" and design constraints, and all the nasty security implications, it naturally comes to a point where the effort is better spent on creating a more efficient architecture without SMT. This is why Intel's client CPUs have already moved on, and others will eventually follow.
 
"21 custom security instructions"
I do wonder what those entails :(


SMT is a relic of the past, and stopped making sense for user-interactive workloads after quad cores, but will stick around for a while in the server space, partly due to marketing reasons, but also because there are certain server workloads where it sort-of "makes sense", but that rationale is still shrinking. This is limited to workloads where the core is stalled most of the time thanks to cache misses and mispredictions, each worker thread is async, and the only thing that matters is overall throughput (not latency). Remember, the 4 threads will compete over caches and front-end resources, so the effective throughput for a single thread for the intended workload would have to be pretty miserable in order to justify 4-way SMT (or even 8-way like with PPC).

While modern x86 microarchitectures from Intel and AMD aren't anywhere close to saturating the CPU resources, their continuing advancement has made SMT less and less useful over time. So the fewer idle cycles there are, the less "free performance" can be extracted through SMT, which is probably what you're thinking about.

Meanwhile, Intel's upcoming Diamond Rapids and hopefully Nova Lake will introduce APX, which according to their documentation should bring a significant uplift in throughput.


They probably have extracted the performance they could the easiest way within their time-frame and constraints, and the end result is a CPU with lots of resources on the execution side, but with a very weak front-end to feed it.
It could also be that their SMT implementation works differently from Intel's and AMD's, e.g. executing two of four threads intermixed (where Intel/AMD switch between two threads). If this happens to be the case, the saturation for each thread would be dreadful.

For instance PPC with its 8-way SMT is(was?) popular for certain java workloads, which are so inefficient that they barely execute at all. :p (more like a traffic jam…)

I think you explained it well. An analogy would perhaps be a narrow corridor where you allow one person through at a time, back to back; then you decide to allow two side by side. More people get through overall, but it's a less pleasant experience with the cramped space.
 
Absolutely not. It's a common misconception that IPC means performance per clock; it doesn't, it's the number of instructions the CPU is able to churn through per cycle. Whether there are one, two, or more threads sharing a core's resources, the core's IPC ceiling remains constant. SMT does improve the saturation of the core for some workloads, but the total performance will only converge towards a single thread fully saturating the core, never above that. This should be basic knowledge about CPUs.
You are redefining what "IPC" means to suit your argument. I gave you detailed test results which you simply ignore. There's not much more I can do here.
The more saturated the core is from one thread, the less there is to gain from SMT; this is basic logical deduction, and it is why we've seen fewer and fewer cases where SMT is significantly beneficial as CPUs advance.
That's not what we've been seeing. SMT performance and efficiency in x86 has been increasing. Zen 5 is able to achieve more with it than for example Zen 2. Same for Intel P-cores - they scale way better than their early SMT implementations.
On top of that, given the intricate complexity of implementing SMT in a modern CPU pipeline, the resulting transistor "costs" and design constraints, and all the nasty security implications, it naturally comes to a point where the effort is better spent on creating a more efficient architecture without SMT. This is why Intel's client CPUs have already moved on, and others will eventually follow.
Intel is not "moving on" from SMT in general. Their P-cores in server/workstation designs will keep using it. It's just their consumer designs that don't implement it. As I wrote before, even NVIDIA is introducing SMT into their next server ARM Vera CPUs.
From the linked CnC article when they discussed SMT with AMD:
The 2T point gets emphasis here. AMD is well aware that Intel is planning to leave SMT out of their upcoming Lunar Lake mobile processor. Zen 5 takes the opposite approach, maintaining SMT support even in mobile products like Strix Point. AMD found that SMT let them maintain maximum 1T performance while enjoying the higher throughput enabled by running two threads in a core for multithreaded workloads. They also found SMT gave them better power efficiency in those multithreaded loads, drawing a clear contrast with Intel’s strategy.
 