
Japanese Riken gets supercomputer with almost 160,000 nodes with 48 Arm cores

Joined
Jan 5, 2006
Messages
10,402 (1.97/day)
System Name Desktop / Laptop
Processor Intel i7 6700K @ 4.5GHz (1.270 V) / Intel i3 7100U
Motherboard Asus Z170 Pro Gaming / HP 83A3 (U3E1)
Cooling Noctua NH-U12A 2 fans + Thermal Grizzly Kryonaut + 5 case fans / Fan
Memory 16GB DDR4 Corsair Vengeance LPX 3000MHz CL15 / 8GB DDR4 HyperX CL13
Video Card(s) MSI RTX 2070 Super Gaming X Trio / Intel HD620
Storage Samsung 970 Evo 500GB + Samsung 850 Pro 512GB + Samsung 860 Evo 1TB / Samsung 256GB M.2 SSD
Display(s) 23.8" Dell S2417DG 165Hz G-Sync 1440p + 21.5" LG 22MP67VQ IPS 60Hz 1080p / 14" 1080p IPS Glossy
Case Be quiet! Silent Base 600 - Window / HP Pavilion
Audio Device(s) SupremeFX Onboard / Realtek onboard + B&O speaker system
Power Supply Seasonic Focus Plus Gold 750W / Powerbrick
Mouse Logitech MX Anywhere 2 Laser wireless / Logitech M330 wireless
Keyboard RAPOO E9270P Black 5GHz wireless / HP backlit
Software Windows 10 / Windows 10

Four hundred racks have been delivered to the Japanese research institute Riken for Fugaku, a supercomputer with chips based on Arm. The system will have nearly 160,000 nodes, each with a 48-core Arm chip.

Installation of Fugaku started in early December, and Riken reports that all parts were delivered this week. The supercomputer is expected to be ready for use in 2021, after which it will be used for research calculations, including projects to combat Covid-19.

The entire system will consist of 158,976 nodes with A64FX processors from Fujitsu. These are Armv8.2-A processors with SVE, featuring 48 cores for computation plus 2 or 4 cores for OS activity. The chips run at 2GHz, with a boost to 2.2GHz, and are paired with 32GB of HBM2 memory. The complete system should deliver roughly half an exaflop of 64-bit double-precision floating-point performance.
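For a rough sanity check on that half-exaflop figure, here is a back-of-the-envelope peak-performance calculation in C. It assumes each A64FX core has two 512-bit SVE FMA pipelines (8 doubles per pipe, 2 flops per FMA); that per-core detail comes from published A64FX material, not from the article above, so treat it as an assumption.

#include <stdio.h>

int main(void) {
    /* Assumed A64FX per-core figures (not stated in the article above). */
    const double ghz           = 2.0;          /* base clock in GHz                  */
    const double fma_pipes     = 2.0;          /* 512-bit SVE FMA pipelines per core */
    const double fp64_lanes    = 512.0 / 64.0; /* doubles per 512-bit pipe           */
    const double flops_per_fma = 2.0;          /* multiply + add                     */
    const double cores         = 48.0;         /* compute cores per node             */
    const double nodes         = 158976.0;     /* nodes in the full system           */

    double gflops_per_core = ghz * fma_pipes * fp64_lanes * flops_per_fma; /* ~64   */
    double tflops_per_node = gflops_per_core * cores / 1000.0;             /* ~3.07 */
    double pflops_system   = tflops_per_node * nodes / 1000.0;             /* ~488  */

    printf("Peak: %.0f GFLOPS/core, %.2f TFLOPS/node, %.0f PFLOPS system\n",
           gflops_per_core, tflops_per_node, pflops_system);
    return 0;
}

That works out to roughly 488 petaflops at the 2GHz base clock (about 537 petaflops at the 2.2GHz boost), which lines up with the half an exaflop of double-precision performance quoted above.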


Fugaku is the nickname of Mount Fuji in Japan.


 
Joined
Apr 24, 2020
Messages
79 (1.80/day)
The cool part about this supercomputer is that this is the first major implementation of the ARM-SVE (Scalable Vector Extension) instruction set. SIMD-compute in supercomputers is pretty commonplace these days, but every implementation brings forth new ideas and optimizations.

I hope to hear good things from those who work with the ARM-SVE instruction set. It's allegedly easier to auto-vectorize than AVX2 or AVX512, and the SVE instruction set is unique in that the ISA itself is independent of the SIMD width. So future ARM-SVE implementations may increase (or decrease) the vector width without any need for recompiling. In contrast, AMD made a big 64-wide to 32-wide change from GCN/Vega to RDNA, and Nvidia is similarly fixed at 32-wide at the ISA level.
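As a concrete illustration, here is a minimal sketch (mine, not from this thread) of a vector-length-agnostic loop written with the Arm SVE ACLE intrinsics. Nothing in it hard-codes a vector width, so the same binary would use 512-bit vectors on the A64FX and whatever width a future implementation happens to provide.

#include <arm_sve.h>
#include <stdint.h>

/* y[i] += a * x[i], written without ever naming the vector width. */
void daxpy_sve(double a, const double *x, double *y, int64_t n) {
    /* svcntd() reports how many doubles fit in one vector at runtime:
       8 on a 512-bit A64FX, 32 on a hypothetical 2048-bit implementation. */
    for (int64_t i = 0; i < n; i += svcntd()) {
        svbool_t pg = svwhilelt_b64(i, n);   /* predicate masks off the tail */
        svfloat64_t vx = svld1_f64(pg, &x[i]);
        svfloat64_t vy = svld1_f64(pg, &y[i]);
        vy = svmla_n_f64_x(pg, vy, vx, a);   /* vy += vx * a */
        svst1_f64(pg, &y[i], vy);
    }
}

Built with something like gcc -O2 -march=armv8.2-a+sve, that loop runs unmodified at whatever SIMD width the hardware exposes, which is exactly the recompile-free scaling described above.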

I'm kind of betting that the static 32-wide ISAs of GPUs will win out in the long run: surely 32-wide (1024 bits) is wide enough, and there are multiple instructions (permute / bpermute) that are innately tied to the width of the SIMD processor. Nonetheless, seeing the SIMD-width-independent SVE instruction set is cool.
 
Joined
Jan 31, 2011
Messages
100 (0.03/day)
@dragontamer5788 That is pretty cool, but what happens if you end up with something that is expecting 2048-bit vectors trying to execute on an implementation that only has 128-bit vector HW? Or is it purely down to SW design to accommodate that? We can see from AMD's AVX implementations that vector width does not automatically scale up and down on the RTL side, and the ISA of SVE doesn't seem to do that either.
 
Joined
Apr 24, 2020
Messages
79 (1.80/day)
@dragontamer5788 That is pretty cool, but what happens if you end up with something that is expecting 2048-bit vectors trying to execute on an implementation that only has 128-bit vector HW? Or is it purely down to SW design to accommodate that? We can see from AMD's AVX implementations that vector width does not automatically scale up and down on the RTL side, and the ISA of SVE doesn't seem to do that either.

[attached image]



Does that answer your question?
 
Joined
Jan 8, 2017
Messages
5,036 (4.04/day)
System Name Good enough
Processor AMD Ryzen R7 1700X - 4.0 Ghz / 1.350V
Motherboard ASRock B450M Pro4
Cooling Scythe Katana 4 - 3x 120mm case fans
Memory 16GB - Corsair Vengeance LPX
Video Card(s) OEM Dell GTX 1080
Storage 1x Samsung 850 EVO 250GB , 1x Samsung 860 EVO 500GB
Display(s) 4K Samsung TV
Case Zalman R1
Power Supply 500W
@dragontamer5788 That is pretty cool, but what happens if you end up with something that is expecting 2048-bit vectors trying to execute on an implementation that only has 128-bit vector HW? Or is it purely down to SW design to accommodate that? We can see from AMD's AVX implementations that vector width does not automatically scale up and down on the RTL side, and the ISA of SVE doesn't seem to do that either.
It's somewhat similar to how GPUs work: things are broken up into wavefronts that are 128 bits wide, so it doesn't matter what is actually available in hardware.
 
Joined
Apr 24, 2020
Messages
79 (1.80/day)
things are broken up into wavefronts that are 128 bits wide
Surely you mean 1024 bits wide for NVidia Volta and AMD RDNA? AMD GCN is 2048 bits wide.

NVidia has something called "Cooperative Groups" that helps the programmer split the 32-wide SIMD execution into smaller pieces. But outside of cooperative groups, you really need to understand the 32-wide execution path to effectively program high-performance CUDA or high-performance OpenCL.
 
Joined
Jan 8, 2017
Messages
5,036 (4.04/day)
System Name Good enough
Processor AMD Ryzen R7 1700X - 4.0 Ghz / 1.350V
Motherboard ASRock B450M Pro4
Cooling Scythe Katana 4 - 3x 120mm case fans
Memory 16GB - Corsair Vengeance LPX
Video Card(s) OEM Dell GTX 1080
Storage 1x Samsung 850 EVO 250GB , 1x Samsung 860 EVO 500GB
Display(s) 4K Samsung TV
Case Zalman R1
Power Supply 500W
Surely you mean 1024 bits wide for NVidia Volta and AMD RDNA? AMD GCN is 2048 bits wide.

NVidia has something called "Cooperative Groups" that helps the programmer split the 32-wide SIMD execution into smaller pieces. But outside of cooperative groups, you really need to understand the 32-wide execution path to effectively program high-performance CUDA or high-performance OpenCL.
No, I was talking about the ARM implementation. It's sort of improper to say those GPU architectures are "1024-bit wide SIMD"; they're SIMT architectures.
 

FreedomEclipse

~Technological Technocrat~
Joined
Apr 20, 2007
Messages
20,589 (4.29/day)
Location
London,UK
System Name Codename: Icarus Mk.V
Processor Intel 8600k@4.8Ghz
Motherboard Asus ROG Strixx Z370-F
Cooling CPU: BeQuiet! Dark Rock Pro 4 {1xCorsair ML120 Pro|5xML140 Pro}
Memory 16 Corsair Vengeance White LED DDR4 3200Mhz {2x8GB}
Video Card(s) Gigabyte 1080Ti Gaming OC|Accelero Xtreme IV
Storage Samsung 970Evo 512GB SSD (Boot)|WD Blue 1TB SSD|2x 3TB Toshiba DT01ACA300
Display(s) Asus PB278Q 27"
Case Corsair 760T (White)
Audio Device(s) Yamaha RX-V573|Speakers: JBL Control One|Auna 300-CN|Wharfedale Diamond SW150
Power Supply Corsair AX760
Mouse Logitech G900/G502
Keyboard Duckyshine Dead LED(s) III
Software Windows 10 Pro
Benchmark Scores (ノಠ益ಠ)ノ彡┻━┻
Word on the street is, it costs more than an Arm and a leg.
 
Joined
Aug 20, 2007
Messages
12,993 (2.78/day)
System Name Pioneer
Processor Intel i9 9900k
Motherboard ASRock Z390 Taichi
Cooling Noctua NH-D15 + A whole lotta Sunon and Corsair Maglev blower fans...
Memory G.SKILL TridentZ Series 32GB (4 x 8GB) DDR4-3200 @ 14-14-14-34-2T
Video Card(s) AMD RX 5700 XT (XFX THICC Ultra III)
Storage Mushkin Pilot-E 2TB NVMe SSD w/ EKWB M.2 Heatsink
Display(s) 55" LG 55" B9 OLED 4K Display
Case Thermaltake Core X31
Audio Device(s) VGA HDMI->Panasonic SC-HTB20/Schiit Modi MB/Asgard 2 DAC/Amp to AKG Pro K7712 Headphones
Power Supply SeaSonic Prime 750W 80Plus Titanium
Mouse ROCCAT Kone EMP
Keyboard WASD CODE 104-Key w/ Cherry MX Green Keyswitches, Doubleshot Vortex PBT White Transluscent Keycaps
Software Windows 10 Enterprise (yes, it's legit.)
Word on the street is, it costs more than an Arm and a leg.
Since they need to spend a human arm per Arm core, they need 7,680,000 human arms to purchase this thing with all its nodes. Considering the average human only has two arms, that means we've made 3,840,000 armless wonders for this level of compute.

I hope it was worth it.

Ok, I'll stop.
 
Joined
Apr 24, 2020
Messages
79 (1.80/day)
No, I was talking about the ARM implementation. It's sort of improper to say those GPU architectures are "1024-bit wide SIMD"; they're SIMT architectures.
Leaving aside SIMT vs SIMD for a moment ("SIMT" being, in my view, a marketing term invented by NVidia), I'm pretty confused by your statements.

SVE could be implemented 128 bits wide, but the implementation for the Riken supercomputer is well documented to be 512 bits wide (similar per-core throughput to Intel's AVX-512 Skylake implementations). The ARM-SVE ISA is designed to scale up to 2048 bits wide, so code written for the Riken supercomputer would theoretically be able to run at full speed (that is, 4x faster) on a hypothetical future 2048-bit SVE implementation, without recompiling.
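A small sketch of what "no recompiling" looks like in practice (my example, not from this thread): the vector length is a hardware property that a binary queries at runtime, so the same executable simply reports, and uses, a different width on different SVE implementations.

#include <arm_sve.h>
#include <stdio.h>

int main(void) {
    /* svcntb() returns the SVE vector length in bytes for the hardware this
       binary is currently running on: 64 bytes (512 bits) on the A64FX,
       256 bytes (2048 bits) on a hypothetical future maximum-width part. */
    printf("SVE vector length: %u bits\n", (unsigned)(svcntb() * 8));
    return 0;
}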

As far as I know, the A64FX is the only chip in the world that currently implements ARM-SVE.

----------

Because ARM-SVE is variable-length, the concept of a "wavefront" doesn't really apply to it. At least, not in any traditional sense.
 