
Intel & HPE Declare Aurora Supercomputer Blade Installation Complete

T0@st

News Editor
What's New: The Aurora supercomputer at Argonne National Laboratory is now fully equipped with all 10,624 compute blades, boasting 63,744 Intel Data Center GPU Max Series and 21,248 Intel Xeon CPU Max Series processors. "Aurora is the first deployment of Intel's Max Series GPU, the biggest Xeon Max CPU-based system, and the largest GPU cluster in the world. We're proud to be part of this historic system and excited for the groundbreaking AI, science and engineering Aurora will enable."—Jeff McVeigh, Intel corporate vice president and general manager of the Super Compute Group

What Aurora Is: A collaboration of Intel, Hewlett Packard Enterprise (HPE) and the Department of Energy (DOE), the Aurora supercomputer is designed to unlock the potential of the three pillars of high performance computing (HPC): simulations, data analytics and artificial intelligence (AI) on an extremely large scale. The system incorporates more than 1,024 storage nodes (using DAOS, Intel's distributed asynchronous object storage), providing 220 terabytes (TB) of capacity at 31TBs of total bandwidth, and leverages the HPE Slingshot high-performance fabric. Later this year, Aurora is expected to be the world's first supercomputer to achieve a theoretical peak performance of more than 2 exaflops (an exaflop is 10^18, or a billion billion, operations per second) when it enters the TOP500 list.

Aurora will harness the full power of the Intel Max Series GPU and CPU product family. Designed to meet the demands of dynamic and emerging HPC and AI workloads, early results with the Max Series GPUs demonstrate leading performance on real-world science and engineering workloads, showcasing up to 2 times the performance of AMD MI250X GPUs on OpenMC, and near linear scaling up to hundreds of nodes. The Intel Xeon Max Series CPU drives a 40% performance advantage over the competition in many real-world HPC workloads, such as earth systems modeling, energy and manufacturing.

Why It Matters: From tackling climate change to finding cures for deadly diseases, researchers face monumental challenges that demand advanced computing technologies at scale. Aurora is poised to address the needs of the HPC and AI communities, providing the necessary tools to push the boundaries of scientific exploration. "While we work toward acceptance testing, we're going to be using Aurora to train some large-scale open source generative AI models for science," said Rick Stevens, Argonne National Laboratory associate laboratory director. "Aurora, with over 60,000 Intel Max GPUs, a very fast I/O system, and an all-solid-state mass storage system, is the perfect environment to train these models."

How It Works: At the heart of this state-of-the-art system are Aurora's sleek rectangular blades, housing processors, memory, networking and cooling technologies. Each blade consists of two Intel Xeon Max Series CPUs and six Intel Max Series GPUs. The Xeon Max Series product family is already demonstrating great early performance on Sunspot (watch the video below), the test bed and development system with the same architecture as Aurora. Developers are utilizing oneAPI and AI tools to accelerate HPC and AI workloads and enhance code portability across multiple architectures.


The installation of these blades has been a delicate operation, with each 70-pound blade requiring specialized machinery to be vertically integrated into Aurora's refrigerator-sized racks. The system's 166 racks accommodate 64 blades each and span eight rows, occupying a space equivalent to two professional basketball courts in the Argonne Leadership Computing Facility (ALCF) data center.
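The published component counts are internally consistent; a quick arithmetic check (plain Python, using only the figures quoted above) ties the rack and per-blade numbers back to the headline totals:

```python
# Sanity-check Aurora's published component counts.
racks = 166
blades_per_rack = 64
cpus_per_blade = 2   # Intel Xeon Max Series CPUs per blade
gpus_per_blade = 6   # Intel Data Center GPU Max Series per blade

blades = racks * blades_per_rack
print(blades)                    # 10624 compute blades
print(blades * cpus_per_blade)   # 21248 Xeon Max CPUs
print(blades * gpus_per_blade)   # 63744 Max Series GPUs
```

All three results match the totals given at the top of the article.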

Researchers from the ALCF's Aurora Early Science Program (ESP) and DOE's Exascale Computing Project will migrate their work from the Sunspot test bed to the fully installed Aurora. This transition will allow them to scale their applications on the full system. Early users will stress test the supercomputer and identify potential bugs that need to be resolved before deployment. This includes efforts to develop generative AI models for science, recently announced at the ISC'23 conference.

View at TechPowerUp Main Site | Source
 
hope it will not catch fire because of the heat LUL!
 
I'm willing to bet that Intel either sold the hardware at cost or even cheaper....can you think of ANY other reason why someone would go with an all Intel Supercomputer? I'm seriously asking...
 
I'm willing to bet that Intel either sold the hardware at cost or even cheaper....can you think of ANY other reason why someone would go with an all Intel Supercomputer? I'm seriously asking...

This was a stipulation set by the Department of Energy that the multiple supercomputers could not all be from the same vendor. This is also just the delivery of the computer cabinets itself and not the actual acceptance testing.
 
A bunch of neatly arranged boxes with neatly arranged piping ... that's fine, but it doesn't look all that impressive. Now show us the cooling system, Intel! With a few humans for scale.
 
I'm willing to bet that Intel either sold the hardware at cost or even cheaper....can you think of ANY other reason why someone would go with an all Intel Supercomputer? I'm seriously asking...
The last time they changed the spec, Intel took a $300M write-off that quarter.
So yes, they're probably not making money on it.

Intel's ~$300M one-time charge in Q4 is almost certainly a penalty for shoddy execution on Aurora
by u/Long_on_AMD in AMD_Stock
Congratulations! 2 Exaflops! It just took ten years.

It technically hasn't been benchmarked yet.
And El Capitan isn't finished being deployed yet.
 
not impressive :D 220 PB total, or 220 TB per storage node, maybe?

The compute side and the storage side are different. The storage side will grow and expand as research requirements need it; the compute side (and its configuration) is the big spend.


Because nobody here knows the true BOM numbers.
For this? No. Probably not. There are plenty of real engineers on these forums who deal with this kind of thing every day, though. You have to speak to your audience. HPC, and tech in general, is easier to make a troll comment about than to actually discuss. It's hardly worth the effort since most users want higher Fortnite frame rates instead of actually learning.
 
Last edited:
The compute side and the storage side are different. The storage side will grow and expand as research requirements need it; the compute side (and its configuration) is the big spend.
https://www.alcf.anl.gov/aurora : storage specs "230 PB, 31 TB/s, 1024 Nodes (DAOS)"
It can't be only 220 TB as the article says (or at least that's how I read the article's sentence).
 
The compute side and the storage side are different. The storage side will grow and expand as research requirements need it
That's hot, fast, write-intensive storage (according to some older presentation, it also contains some Optane). It's physically close to compute nodes, that's why it's decentralised into 1024 nodes. It's probably not destined to grow but can be complemented by colder, larger(?), less exciting and expandable storage, possibly spinning rust.
 
That's hot, fast, write-intensive storage (according to some older presentation, it also contains some Optane). It's physically close to compute nodes, that's why it's decentralised into 1024 nodes. It's probably not destined to grow but can be complemented by colder, larger(?), less exciting and expandable storage, possibly spinning rust.

Most of the time this is infiniband to nvme then bleeds off to a larger array of SSD cached spinning rust.
 
https://www.alcf.anl.gov/aurora : storage specs "230 PB, 31 TB/s, 1024 Nodes (DAOS)"
It can't be only 220 TB as the article says (or at least that's how I read the article's sentence).
Well, someone at Intel didn't properly understand what they were selling. The 220 TB figure can be found on multiple websites that didn't bother to check Intel's press release, along with the "TBs" unit.

Also, total storage capacity divided by total bandwidth amounts to about two hours. If the capacity is fully used for input and/or output data, the system spends at least two hours of precious supercomputer time transferring data to storage before processing, or from storage after processing, or both.
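That two-hour figure checks out against the ALCF-quoted numbers (230 PB at 31 TB/s, from the spec page linked earlier in the thread):

```python
# Fill/drain time for Aurora's DAOS storage at peak aggregate bandwidth.
# Figures per the ALCF spec page: 230 PB capacity, 31 TB/s bandwidth.
capacity_tb = 230_000        # 230 PB expressed in TB
bandwidth_tb_per_s = 31      # aggregate bandwidth, TB/s

seconds = capacity_tb / bandwidth_tb_per_s
hours = seconds / 3600
print(f"{hours:.2f} hours")  # ~2.06 hours
```

So roughly two hours to stream the full capacity once, assuming the peak aggregate bandwidth is actually sustained.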
 