Monday, December 9th 2019

Centaur Releases In-Depth Analysis from The Linley Group for its NCORE-Equipped x86 Processor

Centaur Technology today revealed in-depth information about its new processor-design technology for integrating high-performance x86 CPUs with a specialized co-processor optimized for artificial intelligence (AI) acceleration. On its website, Centaur provides a new independent report from The Linley Group, the industry's leading authority on microprocessor technology and publishers of Microprocessor Report. The Linley Group reviewed Centaur's detailed design documents and interviewed Centaur's CPU and AI architects to support the analysis of both Centaur's newest x86 microarchitecture and the AI co-processor design.

"Centaur is galloping back into the x86 market with an innovative processor design that combines eight high-performance CPUs with a custom deep-learning accelerator (DLA). The company is the first to announce a server-processor design that integrates a DLA. The new accelerator, called Ncore, delivers better neural-network performance than even the most powerful Xeon, but without the high cost of an external GPU card," stated Linley Gwennap, Editor-in-Chief, Microprocessor Report.
The report can be accessed here (PDF).

The Linley Group referenced certified MLPerf benchmark (Preview) scores to compare Centaur's AI performance to high-end x86 CPU cores from the leading x86 vendor. Based on MLPerf scores, Centaur's AI co-processor inference performance is comparable to that of 23 of Intel's world-class x86 cores that now support 512-bit vector neural network instructions (VNNI). Centaur's AI co-processor uses a single-instruction-multiple-data (SIMD) approach that is architecturally similar to VNNI, but crunches 32,768 bits in a single clock cycle using a 16 MB memory with 20 terabytes/sec of bandwidth. Moreover, offloading inference processing to a specialized co-processor leaves the x86 CPU cores available for other, more general-purpose tasks. Application developers can develop new algorithms that take advantage of the unparalleled inference latency enabled by Centaur's AI performance and tight integration with x86 CPUs.
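To put the quoted figures in perspective, the short sketch below works through the arithmetic. The 32,768-bit and 512-bit widths come from the text above. The 2.5 GHz clock is an assumption (it is not stated in the announcement; a reader comment below mentions that figure), and the idea that two full-width rows are accessed per cycle is a purely hypothetical way to reconcile the result with the quoted 20 terabytes/sec. The function names are illustrative only.

```python
# Back-of-the-envelope arithmetic for the figures quoted in the announcement.
# Widths are taken from the text; the clock speed and the number of memory
# accesses per cycle are ASSUMPTIONS made only for illustration.

NCORE_BITS_PER_CYCLE = 32_768      # from the announcement
VNNI_BITS_PER_INSTRUCTION = 512    # from the announcement
ASSUMED_CLOCK_HZ = 2.5e9           # assumption: 2.5 GHz, not stated in the announcement


def bytes_per_cycle(bits: int) -> int:
    """Convert a SIMD width in bits to bytes."""
    return bits // 8


def width_ratio() -> int:
    """How many 512-bit vectors fit in one 32,768-bit Ncore operation."""
    return NCORE_BITS_PER_CYCLE // VNNI_BITS_PER_INSTRUCTION


def bandwidth_tb_per_s(row_bytes: int, clock_hz: float, accesses_per_cycle: int = 1) -> float:
    """Memory bandwidth in terabytes/sec (1 TB = 1e12 bytes)."""
    return row_bytes * clock_hz * accesses_per_cycle / 1e12


if __name__ == "__main__":
    row = bytes_per_cycle(NCORE_BITS_PER_CYCLE)
    print("bytes per cycle:", row)
    print("width relative to one 512-bit vector:", width_ratio())
    # One full-width access per cycle at the assumed clock
    print("TB/s, one access per cycle:", bandwidth_tb_per_s(row, ASSUMED_CLOCK_HZ))
    # Hypothetical: two full-width accesses per cycle
    print("TB/s, two accesses per cycle:", bandwidth_tb_per_s(row, ASSUMED_CLOCK_HZ, 2))
```

Under these assumptions, one Ncore operation spans 4,096 bytes, or 64 times the width of a single 512-bit vector, and two such accesses per cycle at 2.5 GHz would land close to the quoted 20 terabytes/sec.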

Attendees at the ISC East trade show in NYC saw Centaur's new technology up close for the first time. The demo showcased video analytics using Centaur's reference system with x86-based network-video-recording (NVR) software from Qvis Labs. In addition to conventional, real-time object detection/classification, Centaur was the only vendor at the show to highlight leading-edge applications such as semantic segmentation (pixel-level image classification) and a new technique for human pose estimation ("stick figures"). Centaur is focused on improving the hardware price/performance and software productivity for platforms to support this next wave of research applications and speed deployment into new server-class products.

14 Comments on Centaur Releases In-Depth Analysis from The Linley Group for its NCORE-Equipped x86 Processor

#1
Vya Domus
better neural-network performance than even the most powerful Xeon, but without the high cost of an external GPU card
And also without the wide array of general-purpose computation a GPU brings. It's not that straightforward. I wish all of these companies that make dedicated AI accelerators would stop making these dramatic comparisons.
#2
Cheeseball
Vya Domus
And also without the wide array of general-purpose computation a GPU brings. It's not that straightforward. I wish all of these companies that make dedicated AI accelerators would stop making these dramatic comparisons.
The point of the Ncore is its cost-effectiveness. There is no need to use separate P100s or MI50s when this has an already-capable DLA which properly supports AVX-512 (and apparently VNNI). You would then need to couple those GPUs with a Xeon or Epyc, which would push the costs higher.
#3
btarunr
Editor & Senior Moderator
If CNS core single-thread performance ends up somewhere between Haswell and Skylake (Zen+ level), then it would be a tragedy for Centaur not to attempt a client-desktop product. Just take this 8-core CPU block, lose the DLA component, lose the inter-socket interconnect, slim the memory controller to 2-channel, slim the PCIe to 24 lanes, maybe strike a deal with GloFo for 12LP manufacturing, and get the thing out by Computex 2020. It should prove an interesting Core i3-Core i5 alternative.
#4
Apocalypsee
btarunr
If CNS core single-thread performance ends up somewhere between Haswell and Skylake (Zen+ level), then it would be a tragedy for Centaur not to attempt a client-desktop product. Just take this 8-core CPU block, lose the DLA component, lose the inter-socket interconnect, slim the memory controller to 2-channel, slim the PCIe to 24 lanes, maybe strike a deal with GloFo for 12LP manufacturing, and get the thing out by Computex 2020. It should prove an interesting Core i3-Core i5 alternative.
You took the words right out of my mouth! If Intel is going to enter the GPU market as a third alternative, I would like to see a third contender in the x86 CPU market.
#5
sutyi
btarunr
If CNS core single-thread performance ends up somewhere between Haswell and Skylake (Zen+ level), then it would be a tragedy for Centaur not to attempt a client-desktop product. Just take this 8-core CPU block, lose the DLA component, lose the inter-socket interconnect, slim the memory controller to 2-channel, slim the PCIe to 24 lanes, maybe strike a deal with GloFo for 12LP manufacturing, and get the thing out by Computex 2020. It should prove an interesting Core i3-Core i5 alternative.
Then we would only need S3 Graphics, PowerVR Kyro dGPUs and Transmeta to rematerialize out of thin air to complete the early 2000s infinity gauntlet of IT. :D
#6
jeremyshaw
btarunr
If CNS core single-thread performance ends up somewhere between Haswell and Skylake (Zen+ level), then it would be a tragedy for Centaur not to attempt a client-desktop product. Just take this 8-core CPU block, lose the DLA component, lose the inter-socket interconnect, slim the memory controller to 2-channel, slim the PCIe to 24 lanes, maybe strike a deal with GloFo for 12LP manufacturing, and get the thing out by Computex 2020. It should prove an interesting Core i3-Core i5 alternative.
It's on TSMC 16nm, and it's an 8 core CPU that runs at 2.5GHz with no indication of turbo. Lots of talk about power efficiency without any metrics, comparisons, or numbers, so I am not expecting much there, either.

In the end, I expect this to be a very poor client-desktop product. Its niche is the integrated wide SIMD core.
#7
btarunr
Editor & Senior Moderator
jeremyshaw
In the end, I expect this to be a very poor client-desktop product. Its niche is the integrated wide SIMD core.
I'm not so sure. So far their prototype was shown handling a very specific application (image recognition across multiple video streams), which probably runs fine with this CPU configuration.

As you said there was no comment made from them on power or clock-speed headroom. With the right 10 nm class (12/14/16 FF) node, they might be able to come up with a client-segment product. If they've achieved single-thread parity with Zen+, then all they need is to sustain 3.80-4.00 GHz to torment current Core i5 chips. The only thing stopping this chip from hurting Pentium/Celeron/Core i3 is the lack of an iGPU. I doubt if VIA can pull off a contemporary iGPU today. So their embedded motherboards will have to bundle something like a GeForce MX150.
#8
MrMilli
btarunr
I doubt if VIA can pull off a contemporary iGPU today.
Considering their last Chrome core was very small on a 65nm process and had 32 cores, I would say that just scaling that to 256 cores and DX11 would make them competitive with UHD 630.
#9
ratirt
btarunr
I'm not so sure. So far their prototype was shown handling a very specific application (image recognition across multiple video streams), which probably runs fine with this CPU configuration.

As you said there was no comment made from them on power or clock-speed headroom. With the right 10 nm class (12/14/16 FF) node, they might be able to come up with a client-segment product. If they've achieved single-thread parity with Zen+, then all they need is to sustain 3.80-4.00 GHz to torment current Core i5 chips. The only thing stopping this chip from hurting Pentium/Celeron/Core i3 is the lack of an iGPU. I doubt if VIA can pull off a contemporary iGPU today. So their embedded motherboards will have to bundle something like a GeForce MX150.
I think you are talking about laptops here? If so, the iGPU is not the main concern; power consumption and heat are. They can always use a GF MX150, but lowering power consumption and heat would require them to go lower on clocks and lower performance, which I doubt is that impressive in comparison to an i3, an i5, or 1st-gen Ryzen.
#10
Steevo
Cheeseball
The point of the Ncore is its cost-effectiveness. There is no need to use separate P100s or MI50s when this has an already-capable DLA which properly supports AVX-512 (and apparently VNNI). You would then need to couple those GPUs with a Xeon or Epyc, which would push the costs higher.
I suppose they are going to give them away for the good of humanity, and the companies buying them will make do on their own if they need support.
#11
techguymaxc
btarunr
If CNS core single-thread performance ends up somewhere between Haswell and Skylake (Zen+ level), then it would be a tragedy for Centaur not to attempt a client-desktop product. Just take this 8-core CPU block, lose the DLA component, lose the inter-socket interconnect, slim the memory controller to 2-channel, slim the PCIe to 24 lanes, maybe strike a deal with GloFo for 12LP manufacturing, and get the thing out by Computex 2020. It should prove an interesting Core i3-Core i5 alternative.
I'm skeptical of the single-thread performance this uarch can offer. The various buffers are too small (or non-existent in the case of a uop cache), relative to the dispatch rate. I think Haswell-level performance should be the top end of performance estimates, rather than a starting point. Coupled with a lack of SMT, there's no way the individual cores will be running at capacity in real-world workloads.
#12
Cheeseball
Steevo
I suppose they are going to give them away for the good of humanity, and the companies buying them will make do on their own if they need support.
Not sure what you're getting at.

For the market this is currently aimed at, the only support needed would be basic delivery, initial implementation (API and documentation) and aftersales repair/replacement for any defects.
#13
Steevo
Cheeseball
Not sure what you're getting at.

For the market this is currently aimed at, the only support needed would be basic delivery, initial implementation (API and documentation) and aftersales repair/replacement for any defects.
My point is there is so much more to the ecosystem than just dropping in the latest and greatest CPU. It takes much more to make an efficient and actual "cost effective" system than the simple PR spin here.
#14
Cheeseball
Steevo
My point is there is so much more to the ecosystem than just dropping in the latest and greatest CPU. It takes much more to make an efficient and actual "cost effective" system than the simple PR spin here.
Hmm.. that really does depend on the use-case though. This sounds like it would be more cost effective when establishing (or adding) a new cluster than just adding on more accelerators.

It's The Linley Group adding the PR, not VIA/Centaur themselves. They speculate that if this is priced the same as the Xeon Silver, they would be getting the accelerator for free, in a sense.