Thursday, May 8th 2025
Hygon Prepares 128-Core, 512-Threaded x86 CPU with Four-Way SMT and AVX-512 Support
Chinese server CPU maker Hygon, which licenses Zen core IP from AMD, has published a roadmap for the C86-5G, its most powerful server processor to date, featuring up to 128 cores and an astonishing 512 threads. Thanks to a complete microarchitectural redesign, the new chip delivers more than 17 percent higher instructions per cycle (IPC) than its predecessor. It also supports the AVX-512 vector instruction set and four-way simultaneous multithreading, making it a strong contender for highly parallel workloads. Sixteen channels of DDR5-5600 memory feed data-intensive tasks, while CXL 2.0 interconnect support enables seamless scaling across multiple sockets. Built on an undisclosed semiconductor node, the C86-5G includes advanced power management and a hardened security engine. With 128 lanes of PCIe 5.0, it offers ample bandwidth for accelerators, NVMe storage, and high-speed networking. Hygon positions this flagship CPU as ideal for artificial intelligence training clusters, large-scale analytics platforms, and virtualized enterprise environments.
The C86-5G is the culmination of five years of steady development. The journey began with the C86-1G, an AMD-licensed design that served as a testbed for domestic engineers. It offered up to 32 cores, 64 threads, eight channels of DDR4-2666 memory, and 128 lanes of PCIe 3.0. Its goal was to absorb proven technology and build local know-how. Next came the C86-2G, which kept the same core count but introduced a revamped floating-point unit, 21 custom security instructions, and hardware-accelerated features for memory encryption, virtualization, and trusted computing. This model marked Hygon's first real step into independent research and development. With the C86-3G, Hygon rolled out a fully homegrown CPU core and system-on-chip framework. Memory support increased to DDR4-3200, I/O doubled to PCIe 4.0, and on-die networking included four 10 GbE and eight 1 GbE ports. The C86-4G raised the bar further by doubling compute density to 64 cores and 128 threads, boosting IPC by around 15 percent and adding 12-channel DDR5-4800 memory plus 128 lanes of PCIe 5.0. Socket options expanded to dual and quad configurations. Now, with the C86-5G, Hygon has shown it can compete head-to-head with global server CPU leaders, putting more faith in China's growing capabilities in high-performance computing.
Source: via HXL on X
29 Comments on Hygon Prepares 128-Core, 512-Threaded x86 CPU with Four-Way SMT and AVX-512 Support
They'll have simulated this every which way to Sunday.
AnandTech's analysis from 2020:
I do wonder what those entail :( SMT is a relic of the past; it stopped making sense for user-interactive workloads after quad cores, but it will stick around for a while in the server space, partly for marketing reasons, but also because there are certain server workloads where it sort-of "makes sense", though that rationale is still shrinking. This is limited to workloads where the core is stalled most of the time thanks to cache misses and mispredictions, each worker thread is async, and the only thing that matters is overall throughput (not latency). Remember, the 4 threads will compete over caches and front-end resources, so the effective throughput of a single thread for the intended workload would have to be pretty miserable in order to justify 4-way SMT (or even 8-way, as with PPC).
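The stall-filling tradeoff described above can be sketched with a toy utilization model. All numbers here are illustrative assumptions, not measurements of any real core, and the model deliberately ignores the cache and front-end contention just mentioned:

```python
def smt_throughput(busy_fraction: float, smt_ways: int) -> float:
    """Toy model: each thread keeps the core busy only `busy_fraction`
    of the time (the rest is stalled on cache misses/mispredictions).
    With n-way SMT, one thread's stall slots can be filled by the
    others, but aggregate throughput can never exceed 1.0 (the core's
    peak). Contention over caches and the front-end, which would lower
    the per-thread busy fraction, is ignored here."""
    return min(smt_ways * busy_fraction, 1.0)

# A thread stalled 80% of the time: 4-way SMT quadruples throughput.
print(smt_throughput(0.2, 1))  # 0.2
print(smt_throughput(0.2, 4))  # 0.8

# A thread that already keeps the core 70% busy: 4-way SMT caps out,
# so most of the extra hardware threads buy nothing.
print(smt_throughput(0.7, 1))  # 0.7
print(smt_throughput(0.7, 4))  # 1.0
```

This is exactly the "miserable single-thread throughput" condition: the lower the busy fraction, the more free headroom wide SMT can claim.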
While modern x86 microarchitectures from Intel and AMD aren't anywhere close to saturating the CPU's resources with a single thread, their continuing advancement has made SMT less and less useful over time. The fewer idle cycles there are, the less "free performance" can be extracted through SMT, which is probably what you're thinking of.
Meanwhile, Intel's upcoming Diamond Rapids and hopefully Nova Lake will introduce APX, which according to Intel's documentation should bring a significant uplift in throughput. Hygon has probably extracted what performance they could the easiest way within their time-frame and constraints, and the end result is a CPU with lots of resources on the execution side, but with a very weak front-end to feed it.
It could also be that their SMT implementation works differently from Intel's and AMD's, e.g. executing two of four threads intermixed (where Intel/AMD switch between two threads). If that happens to be the case, the saturation for each thread would be dreadful.
For instance, PPC with its 8-way SMT is (was?) popular for certain Java workloads, which are so inefficient that they barely execute at all. :p (more like a traffic jam…)
Let's see how it performs in practice. Going for SMT makes your front-end way simpler as well, and allows some fancy strategies to increase IPC and maximize EU utilization in certain scenarios.
Zen 5, as an example, has 2x 4-wide decoders, which are pretty bog-standard to implement (compared to Intel's 6-wide and larger implementations). A single thread ends up bottlenecked by one of them, since it is not able to make use of both decoders, but with SMT it's possible to basically double up the IPC.
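That decode-cluster arithmetic fits in a two-line sketch. The cluster count and width are just the figures from the post above, used illustratively, under the stated assumption that each thread can drive at most one cluster per cycle:

```python
def decode_slots_per_cycle(active_threads: int,
                           clusters: int = 2,
                           cluster_width: int = 4) -> int:
    """Toy model of a clustered decoder where each hardware thread can
    feed at most one cluster per cycle (the Zen 5 behaviour described
    above), so a second SMT thread unlocks the second cluster."""
    return min(active_threads, clusters) * cluster_width

print(decode_slots_per_cycle(1))  # 4  - one thread, one 4-wide cluster
print(decode_slots_per_cycle(2))  # 8  - SMT lights up both clusters
```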
Intel has that fancy 3x 3-wide decode cluster that a single thread can use, but it's only found in the E-cores, which lack a µop cache.
Given that this Hygon CPU is meant for servers and not your usual desktop use case, I believe it does make more sense to go with SMT, especially for I/O-bound workloads that basically fit the description you gave: many server-ish workloads can pretty much be summarized that way (especially web-related stuff), and SMT on Zen CPUs gives a really significant boost to throughput, and even latency in some scenarios.
It's still easy to saturate a modern x86 core with an integer load; for example, the 7-zip benchmark scales to almost 100% with SMT on both AMD and Intel. Floating point is a different story, but it's still possible to extract tangible benefits, especially on modern implementations (Zen 4+, Alder Lake+).

APX is solving different issues, chiefly register pressure. It's not like APX server P-core CPUs will lack SMT, or at least I haven't read anything that would suggest it. I'm not sure about "Intel/AMD switches between two threads" when both are executing at the same time inside the core, and in the case of AMD's Zen 5 are even being decoded at the same time. Intel also has an 8-wide decoder which supposedly can be split; I haven't seen any confirmation that this happens for SMT, but I suspect it does.

It was more for the Oracle database side, where operations were simple but had to be kept "intact" (without context switching) in order to optimize throughput while maintaining latency. POWER8 cores were also heavily overbuilt compared to modern x86: basically two full cores in one with shared caches, which allowed higher-order SMT. I'd say the closest x86 design would be the infamous Bulldozer cores, but POWER went further and didn't share the FP unit.
Coincidentally Oracle's own SPARC CPUs also supported up to SMT8.
I haven't seen any evidence of significant gains from this yet, but with refinement, and combined with APX, it has some great (theoretical) potential. Just to be clear, SMT the way Intel and AMD implement it doesn't improve IPC at all; it just tries to keep the core fed, as if a single thread were saturating it. Actually no, the benefit from less register shuffling is just an added bonus, and a rather minimal one to be honest.
APX is about maximizing the efficiency of the branch predictor to saturate the CPU, which is very clearly explained in the official documentation:
The performance features introduced so far will have a limited impact on workloads that suffer from a large number of conditional branch mispredictions. As out-of-order CPUs continue to become deeper and wider, the cost of mispredictions increasingly dominates the performance of such workloads. Branch predictor improvements can mitigate this only to a limited extent as data-dependent branches are fundamentally hard to predict.
To address this growing performance issue, we significantly expand the conditional instruction set of x86, which was first introduced with the Intel® Pentium® Pro in the form of CMOV/SET instructions. These instructions are used quite extensively by today’s compilers, but they are too limited for the broader use of if-conversion (a compiler optimization that replaces branches with conditional instructions).
Intel APX adds conditional forms of load, store, and compare/test instructions and adds an option for the compiler to suppress the status flag writes of common instructions. These enhancements expand the applicability of if-conversion to much larger code regions, cutting down on the number of branches that may incur misprediction penalties.
So as you can clearly see, this is very much about saturating the CPU.
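A minimal sketch of what the quoted if-conversion means, as a hypothetical example rather than actual APX code: a compiler replaces a data-dependent branch with straight-line conditional-select code (CMOV-style instructions in machine code), shown here as branch-free arithmetic:

```python
def max_branchy(a: int, b: int) -> int:
    # Data-dependent branch: hard to predict when (a > b) is ~random,
    # so a deep out-of-order core pays a heavy misprediction penalty.
    if a > b:
        return a
    return b

def max_branchless(a: int, b: int) -> int:
    # Same result via straight-line arithmetic with no control flow;
    # the machine-code analogue replaces the branch with a conditional
    # move, so there is nothing left to mispredict.
    return (a + b + abs(a - b)) // 2

# Both forms agree on every input.
for a, b in [(3, 7), (7, 3), (-5, -2), (4, 4)]:
    assert max_branchy(a, b) == max_branchless(a, b)
```

APX's conditional loads, stores, and compares extend this transformation to code regions that plain CMOV/SET could not cover.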
How successful it will be remains to be seen. This is pretty much in line with what I and other programmers have requested for many years; if anything, I'm wondering if it's enough.
I have a few systems that are only 4c/8t that do better with certain tracks than one of my 8c/16t systems, due to higher clock speeds and lower DPC latency. I have some synth patches I made that will bring any CPU to its knees.
chipsandcheese.com/p/amds-ryzen-9950x-zen-5-on-desktop
You can argue that it's a workaround for a front-end bottleneck (which I'd agree with), but that doesn't change the end results.
(source - Chips and Cheese)
Even if the Op Cache is disabled:
(source - Chips and Cheese)

What you quoted affects one type of workload and doesn't invalidate SMT in any way. As I wrote before, I haven't read anything that makes APX SMT-phobic ;)
On top of that, given the intricate complexity of implementing SMT in a modern CPU's pipeline, with the resulting transistor costs and design constraints, and all the nasty security implications, it naturally comes to a point where the effort is better spent on creating a more efficient architecture without SMT. This is why Intel's client CPUs have already moved on, and others will eventually follow.
From the linked CnC article when they discussed SMT with AMD:
It's primarily the CPU vendors themselves who are at fault for creating confusion and turning "IPC" into a marketing gimmick. (But big tech YouTubers/websites also commonly misuse technical terms, and while many have been into tech for years, they still lack deep knowledge of CPU architectures, machine code, and software design.) IPC and performance per clock can be very different, especially when the designs have different performance characteristics, or when benchmarking across different feature levels or ISAs altogether. Take for instance one CPU running a test with AVX-512 and one with AVX2: the first will execute fewer instructions per clock yet achieve higher performance than the latter. Or compare Zen 2/3 to the Skylake family: Zen has more execution ports but a weaker front-end, resulting in some workloads performing significantly better on one or the other.
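The AVX-512 vs AVX2 point above can be made with a back-of-envelope model. The IPC figures below are made-up illustrative numbers, not measurements; the point is only that useful work per clock depends on both instruction rate and how much each instruction does:

```python
def elements_per_cycle(ipc: float, simd_lanes: int) -> float:
    """Useful work per clock = (vector instructions per clock) x
    (data elements per instruction). IPC alone says nothing about
    performance when the instructions differ in width."""
    return ipc * simd_lanes

# Hypothetical numbers: the AVX-512 run issues fewer vector
# instructions per clock, yet processes more data per clock.
avx512 = elements_per_cycle(ipc=1.5, simd_lanes=16)  # 24 elements/cycle
avx2   = elements_per_cycle(ipc=2.0, simd_lanes=8)   # 16 elements/cycle
print(avx512 > avx2)  # True: lower "IPC", higher performance
```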
The same is, by all indications, the case for this Hygon CPU too; it's far easier to achieve some performance by adding lots of execution ports first and then optimizing how to feed them later. And to some extent for Zen 5 too: increasing the ALU count from 4 to 6 didn't have the major impact across the board that "leakers" expected, but it will likely lead to gains when the front-end matures in Zen 6 and later revisions.