
Japanese Riken gets supercomputer with almost 160,000 nodes with 48 Arm cores

Joined
Jan 5, 2006
Messages
17,793 (2.66/day)

Four hundred racks have been delivered to the Japanese research institute Riken for Fugaku, a supercomputer built on Arm-based chips. The system will have nearly 160,000 nodes with 48-core Arm chips.

The installation of Fugaku started in early December. All parts have been delivered this week, reports Riken. The supercomputer is due to be ready for use in 2021, after which it will be used for research calculations, including projects to combat Covid-19.

The entire system will consist of 158,976 nodes with A64FX processors from Fujitsu. These are Armv8.2-A SVE based processors with 48 cores for calculations and 2 or 4 cores for OS activity. The chips run at 2GHz, with a boost to 2.2GHz, and are paired with 32GB of HBM2 memory. The entire system should provide half an exaflops of 64-bit double-precision floating-point performance.
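As a rough sanity check on the half-exaflops figure, the sketch below multiplies out the node count, core count and clock speed given above. It assumes 512-bit SVE vectors and two fused multiply-add pipelines per core; neither number is stated in the article itself, and all names are illustrative.

```python
# Back-of-the-envelope peak FP64 throughput for the full system.
NODES = 158_976          # from the article
CORES_PER_NODE = 48      # compute cores only; the OS cores are not counted
VECTOR_BITS = 512        # assumption: SVE vector width of the A64FX
FMA_PIPES_PER_CORE = 2   # assumption: two FMA pipelines per core
FLOPS_PER_FMA = 2        # a fused multiply-add counts as two operations


def peak_flops(clock_hz: float) -> float:
    """Theoretical peak double-precision FLOPS at the given clock."""
    lanes = VECTOR_BITS // 64  # 64-bit doubles per vector
    flops_per_cycle = lanes * FMA_PIPES_PER_CORE * FLOPS_PER_FMA
    return NODES * CORES_PER_NODE * flops_per_cycle * clock_hz


base = peak_flops(2.0e9)
boost = peak_flops(2.2e9)
print(f"{base / 1e18:.3f} EFLOPS at 2.0 GHz")
print(f"{boost / 1e18:.3f} EFLOPS at 2.2 GHz")
```

Under those assumptions the result lands just below half an exaflops at the base clock and slightly above it at boost, which is consistent with the quoted figure. This is a theoretical peak, not a measured benchmark result.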

Capture.PNG

Fugaku is the nickname of Mount Fuji in Japan.


 
Joined
Apr 24, 2020
Messages
2,560 (1.76/day)
The cool part about this supercomputer is that this is the first major implementation of the ARM-SVE (Scalable Vector Extension) instruction set. SIMD-compute in supercomputers is pretty commonplace these days, but every implementation brings forth new ideas and optimizations.

I hope to hear good things from those who work with the ARM-SVE instruction set. It's allegedly easier to auto-vectorize compared to AVX2 or AVX512, and the SVE instruction set is unique in that the ISA itself is independent of the SIMD-width. So future ARM-SVE implementations may increase (or decrease) the vector width without any need for recompiling. In contrast, AMD made a big change from 64-wide to 32-wide going from GCN (Vega) to RDNA, and Nvidia is similarly fixed at 32-wide at the ISA level.
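To make the width-independence idea concrete, here is a toy Python model of a vector-length-agnostic loop. It is only a sketch: real SVE code would use instructions such as WHILELT (build a predicate for the remaining elements) and INCD (advance by the hardware's element count), and the function name `vla_add` and the `vector_bits` parameter are made up for illustration.

```python
def vla_add(a, b, vector_bits):
    """Element-wise add of 64-bit values, written without a fixed vector width."""
    lanes = vector_bits // 64  # the hardware would report this at run time
    n = len(a)
    out = [0.0] * n
    i = 0
    iterations = 0
    while i < n:
        # Predicate: which lanes still fall inside the array (WHILELT-style).
        active = [i + lane < n for lane in range(lanes)]
        for lane, on in enumerate(active):
            if on:
                out[i + lane] = a[i + lane] + b[i + lane]
        i += lanes  # step by whatever width the hardware has (INCD-style)
        iterations += 1
    return out, iterations


a = [float(x) for x in range(100)]
b = [1.0] * 100
for bits in (128, 512, 2048):
    result, iters = vla_add(a, b, bits)
    print(bits, "bits:", iters, "loop iterations")
```

The same function body runs unchanged at every width; only the iteration count differs (50, 13 and 4 for the three widths above). The predicate handles the leftover tail, so no separate scalar cleanup loop is needed.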

I'm kind of betting that the static 32-wide ISAs of GPUs will win out in the long run: surely 32-wide (1024 bits) is wide enough, and there are multiple instructions (permute / bpermute) that are innately tied to the width of the SIMD processor. Nonetheless, seeing the SIMD-width independent SVE instruction set is cool.
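A toy illustration of why a permute is tied to the SIMD width: the index pattern below is baked in for 32 lanes and silently does the wrong thing on a 64-lane machine. The `bpermute` function here is a simplified model, not any vendor's exact semantics, and the wrap-around of out-of-range indices is an assumption of this sketch.

```python
def bpermute(src, index):
    """Backward permute: lane i reads src[index[i]] (indices wrap modulo the width)."""
    width = len(src)
    return [src[index[i] % width] for i in range(width)]


def reverse_hardcoded_32(src):
    # Index pattern written with a 32-lane machine in mind.
    return bpermute(src, [31 - lane for lane in range(len(src))])


def reverse_width_aware(src):
    # Same operation, but the index pattern is derived from the actual width.
    width = len(src)
    return bpermute(src, [width - 1 - lane for lane in range(width)])


lanes32 = list(range(32))
lanes64 = list(range(64))
print(reverse_hardcoded_32(lanes32) == lanes32[::-1])  # correct on 32 lanes
print(reverse_hardcoded_32(lanes64) == lanes64[::-1])  # no longer a full reverse on 64
print(reverse_width_aware(lanes64) == lanes64[::-1])   # correct at any width
```

On 64 lanes the hard-coded version reverses each 32-lane half separately instead of the whole vector, which is the kind of breakage a width change can cause for code written against a fixed width.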
 
Joined
Jan 31, 2011
Messages
238 (0.05/day)
@dragontamer5788 That is pretty cool, but what happens if you end up with something that is expecting 2048 bit vectors, trying to execute in an implementation that only has 128bit vector HW? Or is it purely down to SW design to accommodate for that? We can see from AMD's AVX implementations that vector width does not automatically scale up and down on the RTL side and the ISA of SVE doesn't seem to do that, either.
 
Joined
Apr 24, 2020
Messages
2,560 (1.76/day)
@dragontamer5788 That is pretty cool, but what happens if you end up with something that is expecting 2048 bit vectors, trying to execute in an implementation that only has 128bit vector HW? Or is it purely down to SW design to accommodate for that? We can see from AMD's AVX implementations that vector width does not automatically scale up and down on the RTL side and the ISA of SVE doesn't seem to do that, either.


1590257875182.png



Does that answer your question?
 
Joined
Jan 8, 2017
Messages
8,929 (3.36/day)
@dragontamer5788 That is pretty cool, but what happens if you end up with something that is expecting 2048 bit vectors, trying to execute in an implementation that only has 128bit vector HW? Or is it purely down to SW design to accommodate for that? We can see from AMD's AVX implementations that vector width does not automatically scale up and down on the RTL side and the ISA of SVE doesn't seem to do that, either.

It's somewhat similar to how GPUs work, things are broken up into wavefronts that are 128bit wide, so it doesn't matter what is actually available in hardware.
 
Joined
Apr 24, 2020
Messages
2,560 (1.76/day)
things are broken up into wavefronts that are 128bit wide

Surely you mean 1024-bits wide for NVidia Volta and AMD RDNA? AMD GCN is 2048-bits wide.

NVidia has something called "Cooperative Groups" that helps the programmer split the 32-sized SIMD width into smaller pieces. But outside of cooperative groups, you really need to understand the 32x wide execution path to effectively program high-performance CUDA or high-performance OpenCL.
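For anyone who hasn't seen the idea, here is a small Python simulation of splitting a 32-lane warp into smaller tiles and reducing within each tile. In real CUDA this would be done with `cooperative_groups::tiled_partition` and the tile's `shfl_down`; the function below is only a hypothetical model of that behaviour, not CUDA code.

```python
WARP_SIZE = 32


def tile_reduce(values, tile_size):
    """Sum each tile of a 32-lane warp using a shuffle-down style tree reduction."""
    assert len(values) == WARP_SIZE
    assert tile_size in (1, 2, 4, 8, 16, 32)
    lanes = list(values)
    offset = tile_size // 2
    while offset > 0:
        nxt = list(lanes)
        for lane in range(WARP_SIZE):
            rank = lane % tile_size  # position of this lane inside its tile
            # Only read from a partner lane that is inside the same tile;
            # lanes without a valid partner keep their value in this model.
            if rank + offset < tile_size:
                nxt[lane] = lanes[lane] + lanes[lane + offset]
        lanes = nxt
        offset //= 2
    # Lane 0 of each tile now holds that tile's sum.
    return [lanes[start] for start in range(0, WARP_SIZE, tile_size)]


print(tile_reduce(list(range(32)), 8))   # four independent 8-lane sums
print(tile_reduce(list(range(32)), 32))  # one sum over the whole warp
```

All 32 lanes still step through the loop together; the tile size only changes which lanes exchange data, which is why the underlying 32-wide execution still matters for performance.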
 
Joined
Jan 8, 2017
Messages
8,929 (3.36/day)
Surely you mean 1024-bits wide for NVidia Volta and AMD RDNA? AMD GCN is 2048-bits wide.

NVidia has something called "Cooperative Groups" that helps the programmer split the 32-sized SIMD width into smaller pieces. But outside of cooperative groups, you really need to understand the 32x wide execution path to effectively program high-performance CUDA or high-performance OpenCL.

No, I was talking about the ARM implementation. It's sort of improper to say those GPU architectures are "1024 bit wide SIMD", they're SIMT architectures.
 

FreedomEclipse

~Technological Technocrat~
Joined
Apr 20, 2007
Messages
23,363 (3.76/day)
Location
London,UK
Word on the street is, it costs more than an Arm and a leg.
 
Joined
Aug 20, 2007
Messages
20,773 (3.41/day)
Word on the street is, it costs more than an Arm and a leg.

Since they need to spend a human arm per arm core, they need 7,680,000 human arms to purchase this thing with all its nodes. Considering the average human only has two arms, that means we've made 3,840,000 armless wonders for this level of compute.

I hope it was worth it.

Ok, I'll stop.
 
Joined
Apr 24, 2020
Messages
2,560 (1.76/day)
No, I was talking about the ARM implementation. It's sort of improper to say those GPU architectures are "1024 bit wide SIMD", they're SIMT architectures.

Leaving SIMT vs SIMD aside for a moment (I consider SIMT to be a marketing term invented by NVidia), I'm pretty confused by your statements.

SVE could be implemented 128 bits wide. But the one for the Riken Supercomputer is well-documented to be 512 bits wide (similar performance to Intel AVX512 Skylake implementations). The ARM-SVE ISA is designed to scale up to 2048 bits wide, so the code written for the Riken Supercomputer would theoretically be able to run at full speed (that is, 4x faster) on a hypothetical, future 2048-bit SVE implementation, without the need for recompiling.

As far as I know, the A64FX is the only chip in the world that currently implements ARM-SVE.

----------

Because ARM-SVE is variable-length, the concept of a "wavefront" doesn't really apply to it. At least, not in any traditional sense.
 
Joined
Sep 17, 2014
Messages
20,917 (5.97/day)
Location
The Washing Machine
Since they need to spend a human arm per arm core, they need 7,680,000 human arms to purchase this thing with all its nodes. Considering the average human only has two arms, that means we've made 3,840,000 armless wonders for this level of compute.

I hope it was worth it.

Ok, I'll stop.

At least the thing's got legs now. And it will run headless...


:fear:
 
Joined
Aug 12, 2019
Messages
1,719 (1.00/day)
Location
LV-426
Since they need to spend a human arm per arm core, they need 7,680,000 human arms to purchase this thing with all its nodes. Considering the average human only has two arms, that means we've made 3,840,000 armless wonders for this level of compute.

I hope it was worth it.

Ok, I'll stop.

depends on the quality of the arm :p
 