
Japanese Riken gets supercomputer with almost 160,000 nodes with 48 Arm cores


Four hundred racks were delivered to the Japanese research institute Riken for Fugaku, a supercomputer built on Arm-based chips. The system will have nearly 160,000 nodes with 48-core Arm chips.

The installation of Fugaku started in early December. All parts have been delivered this week, Riken reports. The supercomputer is expected to be ready for use in 2021; it will then be used for research workloads, including projects to combat Covid-19.

The entire system will consist of 158,976 nodes with A64FX processors from Fujitsu. These are Armv8.2-A processors with SVE, with 48 cores for computation and 2 or 4 cores for OS activity. The chips run at 2GHz with a boost to 2.2GHz, and are paired with 32GB of HBM2 memory. The complete system should deliver about half an exaflop of 64-bit double-precision floating-point performance.


Fugaku is the nickname of Mount Fuji in Japan.


 
The cool part about this supercomputer is that this is the first major implementation of the ARM-SVE (Scalable Vector Extension) instruction set. SIMD-compute in supercomputers is pretty commonplace these days, but every implementation brings forth new ideas and optimizations.

I hope to hear good things from those who work with the ARM-SVE instruction set. It's allegedly easier to auto-vectorize than AVX2 or AVX512, and SVE is unique in that the ISA itself is independent of the SIMD width. Future ARM-SVE implementations may increase (or decrease) the vector width without any need for recompiling. In contrast, AMD made a big 64-wide to 32-wide change going from GCN (Vega) to RDNA, and NVidia is similarly 32-wide at the ISA level.

I'm kind of betting that the static 32-wide ISAs of GPUs will win out in the long run: surely 32 lanes (1024 bits) is wide enough, and there are multiple instructions (permute / bpermute) that are innately tied to the width of the SIMD processor. Nonetheless, seeing the width-independent SVE instruction set is cool.
 
@dragontamer5788 That is pretty cool, but what happens if you end up with something that is expecting 2048 bit vectors, trying to execute in an implementation that only has 128bit vector HW? Or is it purely down to SW design to accommodate for that? We can see from AMD's AVX implementations that vector width does not automatically scale up and down on the RTL side and the ISA of SVE doesn't seem to do that, either.
 
@dragontamer5788 That is pretty cool, but what happens if you end up with something that is expecting 2048 bit vectors, trying to execute in an implementation that only has 128bit vector HW? Or is it purely down to SW design to accommodate for that? We can see from AMD's AVX implementations that vector width does not automatically scale up and down on the RTL side and the ISA of SVE doesn't seem to do that, either.





Does that answer your question?
 
@dragontamer5788 That is pretty cool, but what happens if you end up with something that is expecting 2048 bit vectors, trying to execute in an implementation that only has 128bit vector HW? Or is it purely down to SW design to accommodate for that? We can see from AMD's AVX implementations that vector width does not automatically scale up and down on the RTL side and the ISA of SVE doesn't seem to do that, either.

It's somewhat similar to how GPUs work; things are broken up into wavefronts that are 128 bits wide, so it doesn't matter what is actually available in hardware.
 
things are broken up into wavefronts that are 128bit wide

Surely you mean 1024 bits wide for NVidia Volta and AMD RDNA? AMD GCN is 2048 bits wide.

NVidia has something called "Cooperative Groups" that helps the programmer split the 32-sized SIMD width into smaller pieces. But outside of cooperative groups, you really need to understand the 32x wide execution path to effectively program high-performance CUDA or high-performance OpenCL.
 
Surely you mean 1024 bits wide for NVidia Volta and AMD RDNA? AMD GCN is 2048 bits wide.

NVidia has something called "Cooperative Groups" that helps the programmer split the 32-sized SIMD width into smaller pieces. But outside of cooperative groups, you really need to understand the 32x wide execution path to effectively program high-performance CUDA or high-performance OpenCL.

No, I was talking about the ARM implementation. It's sort of improper to say those GPU architectures are "1024 bit wide SIMD", they're SIMT architectures.
 
Word on the street is, it costs more than an Arm and a leg.
 
Word on the street is, it costs more than an Arm and a leg.

Since they need to spend a human arm per Arm core, they need 7,680,000 human arms to purchase this thing with all its nodes. Considering the average human only has two arms, that means we've made 3,840,000 armless wonders for this level of compute.

I hope it was worth it.

Ok, I'll stop.
 
No, I was talking about the ARM implementation. It's sort of improper to say those GPU architectures are "1024 bit wide SIMD", they're SIMT architectures.

Leaving SIMT vs SIMD aside for a moment (SIMT being, in my view, a marketing term invented by NVidia), I'm pretty confused by your statements.

SVE could be implemented 128 bits wide. But the implementation for the Riken supercomputer is well documented to be 512 bits wide (similar performance to Intel's Skylake AVX512 implementations). The ARM-SVE ISA is designed to scale to at least 2048 bits wide, so code written for the Riken supercomputer would theoretically be able to run at full speed (that is, 4x faster) on a hypothetical future 2048-bit SVE implementation, without recompiling.

As far as I know, the A64FX is the only chip in the world that currently implements ARM-SVE.

----------

Because ARM-SVE is variable-length, the concept of a "wavefront" doesn't really apply to it. At least, not in any traditional sense.
 
Since they need to spend a human arm per Arm core, they need 7,680,000 human arms to purchase this thing with all its nodes. Considering the average human only has two arms, that means we've made 3,840,000 armless wonders for this level of compute.

I hope it was worth it.

Ok, I'll stop.

At least the thing's got legs now. And it will run headless...


:fear:
 
Since they need to spend a human arm per Arm core, they need 7,680,000 human arms to purchase this thing with all its nodes. Considering the average human only has two arms, that means we've made 3,840,000 armless wonders for this level of compute.

I hope it was worth it.

Ok, I'll stop.

depends on the quality of the arm :P
 