Wednesday, March 4th 2020

AMD Scores Another EPYC Win in Exascale Computing With DOE's "El Capitan" Two-Exaflop Supercomputer

AMD has been on a roll in consumer, professional, and exascale computing alike, and it has just snagged itself another hugely important contract. The US Department of Energy (DOE) has announced the winner for its next-gen exascale supercomputer, which aims to be the world's fastest. Dubbed "El Capitan", the new machine will be powered by AMD's next-gen EPYC "Genoa" processors (Zen 4 architecture) and Radeon GPUs. This is the first such exascale contract where AMD is the sole purveyor of both CPUs and GPUs; AMD's other EPYC design win, in the Cray Shasta, pairs its processors with NVIDIA graphics cards.

El Capitan will be a $600 million investment, to be deployed in late 2022 and operational in 2023. Next-gen proposals from AMD, Intel, and NVIDIA were undoubtedly all on the table, with AMD winning the shootout in a big way. While the DOE initially projected El Capitan to provide some 1.5 exaflops of computing power, it has since revised its target to a full 2-exaflop machine. At 2 exaflops (2,000 petaflops), El Capitan will thus be roughly ten times faster than the current leader of the supercomputing world, Summit, which peaks at around 200 petaflops.
AMD's ability to provide an ecosystem with both CPUs and GPUs very likely played a key part in the DOE's choice for the project, and it all but guarantees the agency came away satisfied with AMD's performance projections for both Zen 4 and its future GPU architectures. AMD's EPYC "Genoa" will feature support for next-gen memory, implying DDR5 or later, as well as unspecified next-gen I/O connections. The graphics cards aren't detailed at all; they're simply referred to as part of the Radeon Instinct lineup, featuring a "new compute architecture".

Another wholly important part of this design win is that AMD has redesigned its 3rd Gen Infinity Fabric (which supports a 4:1 ratio of GPUs to CPUs) to provide data coherence between CPU and GPU, effectively reducing the need for data to move back and forth between the two as it is being processed. With relevant data mirrored across both pieces of hardware through coherent, Infinity Fabric-backed memory, computing efficiency can improve significantly, since moving data usually costs more power than the actual computation itself, and that too must have played a key part in the selection.
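To illustrate what coherence buys in code terms, here is a minimal, hypothetical HIP sketch; the kernel and buffer names are invented for this example, and shared/managed memory is used only as a conceptual stand-in for coherent memory, not as a claim about El Capitan's actual programming model. The point is that a coherent allocation lets the CPU and GPU touch the same buffer, so the explicit staging copies (and the power they burn) go away.

#include <hip/hip_runtime.h>
#include <cstdio>

// Hypothetical kernel: scales a buffer in place on the GPU.
__global__ void scale(float* data, float factor, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

int main() {
    const int n = 1 << 20;

    // Without coherence, the pattern is: allocate on both sides,
    // then copy to the device and back around every kernel launch:
    //   hipMalloc(&dev, n * sizeof(float));
    //   hipMemcpy(dev, host, n * sizeof(float), hipMemcpyHostToDevice);
    //   ...kernel...
    //   hipMemcpy(host, dev, n * sizeof(float), hipMemcpyDeviceToHost);

    // With coherent memory: one allocation visible to both CPU and GPU,
    // so the explicit copies disappear.
    float* data;
    hipMallocManaged(&data, n * sizeof(float));
    for (int i = 0; i < n; ++i) data[i] = 1.0f;      // CPU writes

    scale<<<(n + 255) / 256, 256>>>(data, 2.0f, n);  // GPU works on the same buffer
    hipDeviceSynchronize();

    printf("data[0] = %f\n", data[0]);               // CPU reads the result directly
    hipFree(data);
}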
El Capitan will also feature a future version of Cray's proprietary Slingshot network fabric for increased speed and reduced latency. All of this will be tied together with AMD's ROCm open software platform for heterogeneous programming, to maximize performance of the CPUs and GPUs in OpenMP environments. ROCm has also recently gotten a healthy $100 million shot in the arm courtesy of the DOE, which has stood up a Center of Excellence at Lawrence Livermore National Laboratory (part of the DOE) to help develop it. AMD's software arm, then, is flexing its muscles too, at least for this kind of deployment. Software has long been a point of contention against rival NVIDIA, which has typically invested far more in its software stack than AMD, and that is a big part of why NVIDIA has been such a major player in the enterprise and compute segments until now.
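For a concrete sense of what "heterogeneous programming in OpenMP environments" looks like, below is a minimal, hypothetical C++ sketch of OpenMP target offload, the mechanism ROCm's OpenMP support is built around. The array names and sizes are invented, and the compiler invocation in the comment is just one plausible example.

#include <cstdio>
#include <vector>

int main() {
    const int n = 1 << 20;
    std::vector<float> a(n, 1.0f), b(n, 2.0f);
    float* pa = a.data();
    float* pb = b.data();

    // With an offload-capable compiler (e.g. ROCm's amdclang++ with
    // -fopenmp --offload-arch=<gpu>), this loop runs on the GPU;
    // otherwise it falls back to the host unchanged.
    #pragma omp target teams distribute parallel for \
        map(to: pb[0:n]) map(tofrom: pa[0:n])
    for (int i = 0; i < n; ++i)
        pa[i] += pb[i];

    printf("a[0] = %f\n", pa[0]); // expect 3.0
}

That host fallback is the attraction: the same directive-annotated loop remains a valid serial program on machines with no GPU at all.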
As for why NVIDIA was shunned, it likely has nothing to do with its next-gen designs offering less performance than what AMD brought to the table. If anything, I'd take an educated guess that 3rd Gen Infinity Fabric and its memory coherence was the deciding factor in choosing AMD GPUs over NVIDIA's: the green company has nothing comparable to offer, since it doesn't play in the x86 computing space and can't deliver that level of platform interconnectedness. Whatever the reason, this is yet another big win for AMD, which keeps muscling Intel out of very, very lucrative positions.
Source: Tom's Hardware

35 Comments on AMD Scores Another EPYC Win in Exascale Computing With DOE's "El Capitan" Two-Exaflop Supercomputer

#26
Dammeron
They'll have so many of those new EPYCs, surely they won't notice one is missing, right? Cause I need it... ;)
#27
R0H1T
Dammeron: They'll have so many of those new EPYCs, surely they won't notice one is missing, right? Cause I need it... ;)
You'll need to shift jobs to pull that off, even if temporarily :pimp:
#28
Vya Domus
Mark Little: CUDA is more for companies like mine where we have 10 people and make biomedical imaging devices. CUDA helps us speed up the image reconstruction on the GPU versus the CPU. We are too small to make our own APIs. Giant supercomputer projects have custom tailor made software.
Completely agree. Highly specialized software for these large-scale computations is probably optimized down to the lowest available level, like PTX for Nvidia and assembly for AMD. Truth is, not a whole lot of the critical software paths there are actually going to be written in CUDA or OpenCL.
Cheeseball: Why do you keep saying CUDA is a locked-in ecosystem? You can run CUDA code on other hardware (even on x86 and ARM, if you're desperate) using HIP through ROCm, but you need to translate it (not manual conversion) to avoid any NVIDIA extensions. This is currently a lot more efficient than what can be done in OpenCL 2.1.
CUDA really is a locked ecosystem, even for customers of Nvidia hardware. For example, their ISA isn't open to the public, and there are instances where no matter what you write in CUDA or directly in PTX, it will never be as fast as the hardware is capable of. Nvidia reserves the highest level of optimization for itself, so in order to get the most out of the hardware you purchased, you either have to use a library that was hand-optimized by Nvidia, or, if there is none for the sort of thing you need to do, tough luck. If that's not a locked ecosystem, I don't know what is.
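For readers weighing the lock-in debate above, here is a minimal, hypothetical HIP sketch of what the CUDA-to-HIP mapping looks like in practice; the kernel is invented for illustration, and the comments note the CUDA calls each HIP call corresponds to (the hipify tools perform this renaming mechanically).

#include <hip/hip_runtime.h>
#include <cstdio>

// In CUDA source this kernel would be spelled identically:
// __global__, blockIdx, threadIdx are all the same.
__global__ void axpy(float a, const float* x, float* y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}

int main() {
    const int n = 1024;
    float hx[n], hy[n];
    for (int i = 0; i < n; ++i) { hx[i] = 1.0f; hy[i] = 2.0f; }

    float *dx, *dy;
    hipMalloc(&dx, sizeof(hx));                            // CUDA: cudaMalloc
    hipMalloc(&dy, sizeof(hy));
    hipMemcpy(dx, hx, sizeof(hx), hipMemcpyHostToDevice);  // CUDA: cudaMemcpy
    hipMemcpy(dy, hy, sizeof(hy), hipMemcpyHostToDevice);

    axpy<<<(n + 255) / 256, 256>>>(3.0f, dx, dy, n);       // same launch syntax
    hipMemcpy(hy, dy, sizeof(hy), hipMemcpyDeviceToHost);

    printf("y[0] = %f\n", hy[0]);                          // expect 5.0
    hipFree(dx); hipFree(dy);                              // CUDA: cudaFree
}

The source-level similarity is real; Vya Domus's point is that the closed parts sit below this layer, in the ISA and in Nvidia's hand-tuned libraries.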
#29
TheoneandonlyMrK
dicktracy: They picked the cheapest but good enough option, not the absolute highest performing one.
Like they could use Xeons on a DOE supercomputer; nearly double the power use on a DOE system would go down well.
Fujitsu's A64FX chips seem like a contender, but not Intel.
As for the GPU choice, perhaps they see something in the next generation of chips that we haven't yet seen. They aren't comparing chips that are already out; it's chips yet to be made.
#30
bonehead123
Great, now we can figure out how to obliterate everyone on the planet even faster & moar better than before... get ready, 'cause the end times are now upon us!
Beertintedgoggles: $600 million is fairly massive
Not by government spending standards, seeing as how they're spending OUR money, not theirs :(
#31
gamefoo21
Cheeseball: You're right about that. Corporations create these supercomputers with a major goal in mind, so they would need custom APIs to get to that goal efficiently. But what @xkm1948 is getting at is that CUDA can scale from the basic enthusiast all the way to the [big] corporations that don't have the time (or need) to have a custom API developed for them.

If anything, those same corporations would employ researchers from these universities. :laugh:

Why do you keep saying CUDA is a locked-in ecosystem? You can run CUDA code on other hardware (even on x86 and ARM, if you're desperate) using HIP through ROCm, but you need to translate it (not manual conversion) to avoid any NVIDIA extensions. This is currently a lot more efficient than what can be done in OpenCL 2.1.

The investment in ROCm is an advantage for everyone since all compute APIs will use this. Thank AMD for pulling this off.

They still use Apple because of deals (think 60%+ hardware and support discounts) offered by Apple. Also, hardware deployment of Mac minis and Pros depends on department use cases.

Vulkan is aimed at rendering (which is why any GPGPU code using Vulkan goes through the graphics pipeline), and that is why it succeeds OpenGL. OpenCL is meant for GPGPU use.
Oh, I know Apple gives universities crazy prices. It's a great way to keep up demand once students become workers.

I'm so deep in studying Latin and writing papers on Greek and Roman epics that my brain is melting. I really should focus; my posts are suffering because of it.

It's amazing how muddied this stuff can get when you're trying to ram so much different stuff into your head.
#32
R0H1T
bonehead123: 'cause the end times are now upon us!
I thought that was 2012, or 1999, or whatever date Nostradamus came up with?
#35
IceShroom
bonehead123: Great, now we can figure out how to obliterate everyone on the planet even faster & moar better than before... get ready, 'cause the end times are now upon us!
If the new supercomputer were built with 5 GHz Xeons and GTX 480s, the Govt. could have obliterated us just by turning the computer on.