Japanese Riken gets supercomputer with almost 160,000 nodes with 48 Arm cores

P4-630 · May 15, 2020

Four hundred racks have been delivered to the Japanese research institute Riken for Fugaku, a supercomputer with Arm-based chips. The system gets nearly 160,000 nodes with 48-core Arm chips.

The installation of Fugaku started in early December. All parts were delivered this week, Riken reports. The supercomputer is due to be ready for use in 2021, when it will be used for research calculations, including projects to combat Covid-19.

The entire system will consist of 158,976 nodes with A64FX processors from Fujitsu. These are Armv8.2-A SVE-based processors with 48 cores for calculations and 2 or 4 cores for OS activity. The chips run at 2GHz, with a boost to 2.2GHz, and are paired with 32GB of HBM2 memory. The entire system should provide half an exaflops of 64-bit double-precision floating-point performance.
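For anyone who wants to sanity-check the "half an exaflops" figure, here is a rough back-of-the-envelope sketch. The node count, core count and clocks come from the article. The 512-bit vector width and the two FMA pipelines per core are assumptions based on Fujitsu's public A64FX material, not something the article states, so treat the result as an estimate of theoretical peak only.

```python
# Back-of-the-envelope peak FP64 estimate for Fugaku (theoretical, not measured).
NODES = 158_976        # from the article
CORES_PER_NODE = 48    # compute cores only; the OS/assistant cores are excluded
VECTOR_BITS = 512      # assumption: width of the A64FX SVE implementation
FMA_PIPES = 2          # assumption: two FMA pipelines per core
FLOPS_PER_FMA = 2      # a fused multiply-add counts as two floating-point ops


def peak_fp64_flops(clock_hz):
    """Theoretical peak FP64 FLOPS for the whole machine at a given clock."""
    lanes = VECTOR_BITS // 64  # 64-bit doubles per vector
    flops_per_cycle = lanes * FLOPS_PER_FMA * FMA_PIPES
    return NODES * CORES_PER_NODE * flops_per_cycle * clock_hz


for ghz in (2.0, 2.2):
    pflops = peak_fp64_flops(ghz * 1e9) / 1e15
    print(f"{ghz} GHz: {pflops:.0f} PFLOPS")
```

Under those assumptions the estimate lands at roughly 0.49 exaflops at the base clock and about 0.54 at boost, which is consistent with the "half an exaflops" claim.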

Fugaku is the nickname of Mount Fuji in Japan.

tweakers.net

dragontamer5788 · May 15, 2020

The cool part about this supercomputer is that this is the first major implementation of the ARM-SVE (Scalable Vector Extension) instruction set. SIMD-compute in supercomputers is pretty commonplace these days, but every implementation brings forth new ideas and optimizations.

I hope to hear good things from those who work with the ARM-SVE instruction set. It's allegedly easier to auto-vectorize compared to AVX2 or AVX512, and the SVE instruction set is unique in that the ISA itself is independent of the SIMD-width. So future ARM-SVE implementations may increase (or decrease) the vector width without any need for recompiling. In contrast, AMD has a big 64-wide to 32-wide change from GCN Vega -> RDNA. And Nvidia is similarly 32-wide at the ISA level.
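To make the "independent of the SIMD-width" idea concrete, here is a toy model of a vector-length-agnostic loop. It is plain Python, not real SVE: `lanes` stands in for whatever vector length the hardware reports, and the hypothetical `whilelt` helper is only modelled on the SVE WHILELT predicate instruction. The point is just that the same loop body gives the same answer at any width.

```python
# Toy model of an SVE-style vector-length-agnostic (VLA) loop.
# Not real SVE code: `lanes` is a stand-in for the hardware vector length.

def whilelt(i, n, lanes):
    """Predicate mask: lane k is active while i + k < n."""
    return [i + k < n for k in range(lanes)]


def vla_add(a, b, lanes):
    """Element-wise a + b, processed `lanes` elements per step under a predicate."""
    n = len(a)
    out = [0.0] * n
    i = 0
    while i < n:
        pred = whilelt(i, n, lanes)
        for k, active in enumerate(pred):
            if active:  # inactive lanes in the tail are simply skipped
                out[i + k] = a[i + k] + b[i + k]
        i += lanes  # advance by however many lanes this "hardware" has
    return out


a = [float(x) for x in range(10)]
b = [10.0] * 10

# Same code, three different "hardware" widths (in bits, with 64-bit elements).
results = {bits: vla_add(a, b, bits // 64) for bits in (128, 512, 2048)}
assert results[128] == results[512] == results[2048]
print(results[512])
```

Note there is no separate scalar tail loop: the predicate handles the leftover elements, which is a big part of why this style is supposed to be friendlier to auto-vectorizers.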

I'm kind of betting that the static 32-wide ISAs of GPUs will win out in the long run: surely 32-wide (1024-bits wide) is wide enough, and there are multiple instructions (permute / bpermute) which are innately tied to the width of the SIMD-processor. Nonetheless, seeing the SIMD-width independent SVE instruction set is cool.

jeremyshaw · May 23, 2020

@dragontamer5788 That is pretty cool, but what happens if you end up with something that is expecting 2048 bit vectors, trying to execute in an implementation that only has 128bit vector HW? Or is it purely down to SW design to accommodate for that? We can see from AMD's AVX implementations that vector width does not automatically scale up and down on the RTL side and the ISA of SVE doesn't seem to do that, either.

dragontamer5788 · May 23, 2020

jeremyshaw said:
@dragontamer5788 That is pretty cool, but what happens if you end up with something that is expecting 2048 bit vectors, trying to execute in an implementation that only has 128bit vector HW? Or is it purely down to SW design to accommodate for that? We can see from AMD's AVX implementations that vector width does not automatically scale up and down on the RTL side and the ISA of SVE doesn't seem to do that, either.

http://www.hotchips.org/wp-content/uploads/hc_archives/hc28/HC28.22-Monday-Epub/HC28.22.10-GPU-HPC-Epub/HC28.22.131-ARMv8-vector-Stephens-Yoshida-ARM-v8-23_51-v11.pdf

Does that answer your question?

Vya Domus · May 23, 2020

jeremyshaw said:
@dragontamer5788 That is pretty cool, but what happens if you end up with something that is expecting 2048 bit vectors, trying to execute in an implementation that only has 128bit vector HW? Or is it purely down to SW design to accommodate for that? We can see from AMD's AVX implementations that vector width does not automatically scale up and down on the RTL side and the ISA of SVE doesn't seem to do that, either.

It's somewhat similar to how GPUs work, things are broken up into wavefronts that are 128bit wide, so it doesn't matter what is actually available in hardware.

dragontamer5788 · May 23, 2020

Vya Domus said:
things are broken up into wavefronts that are 128bit wide

Surely you mean 1024-bits wide for NVidia Volta and AMD RDNA? AMD GCN is 2048-bits wide.

NVidia has something called "Cooperative Groups" that helps the programmer split the 32-sized SIMD width into smaller pieces. But outside of cooperative groups, you really need to understand the 32x wide execution path to effectively program high-performance CUDA or high-performance OpenCL.
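For reference, the bit widths quoted above are just lanes times lane size. A quick sketch, assuming one 32-bit register lane per thread (the usual way these numbers are quoted, but still an assumption here):

```python
LANE_BITS = 32  # assumption: one 32-bit register lane per thread


def simd_width_bits(lanes, lane_bits=LANE_BITS):
    """Total width of one warp/wavefront register, in bits."""
    return lanes * lane_bits


widths = {
    "NVIDIA warp (32 threads)": simd_width_bits(32),
    "AMD RDNA wave32": simd_width_bits(32),
    "AMD GCN wave64": simd_width_bits(64),
}
for name, bits in widths.items():
    print(f"{name}: {bits} bits")
```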

Vya Domus · May 23, 2020

dragontamer5788 said:
Surely you mean 1024-bits wide for NVidia Volta and AMD RDNA? AMD GCN is 2048-bits wide.

NVidia has something called "Cooperative Groups" that helps the programmer split the 32-sized SIMD width into smaller pieces. But outside of cooperative groups, you really need to understand the 32x wide execution path to effectively program high-performance CUDA or high-performance OpenCL.

No, I was talking about the ARM implementation. It's sort of improper to say those GPU architectures are "1024 bit wide SIMD", they're SIMT architectures.

FreedomEclipse · May 23, 2020

Word on the street is, it costs more than an Arm and a leg.

R-T-B · May 23, 2020

FreedomEclipse said:
Word on the street is, it costs more than an Arm and a leg.

Since they need to spend a human arm per arm core, they need 7,680,000 human arms to purchase this thing with all its nodes. Considering the average human only has two arms, that means we've made 3,840,000 armless wonders for this level of compute.

I hope it was worth it.

Ok, I'll stop.

dragontamer5788 · May 23, 2020

Vya Domus said:
No, I was talking about the ARM implementation. It's sort of improper to say those GPU architectures are "1024 bit wide SIMD", they're SIMT architectures.

Leaving SIMT vs SIMD for a moment (which I consider to be a marketing term invented by NVidia), I'm pretty confused with your statements.

SVE could be implemented 128-bit wide. But the one for the Riken Supercomputer is well-documented to be 512-bits wide (similar performance to Intel AVX512 Skylake implementations). The ARM-SVE ISA is designed to scale to at least 2048-bit wide, so the code written for Riken Supercomputer would be theoretically able to run full-speed (that is 4x faster) on a hypothetical, future 2048-bit SVE implementation, without the need of recompiling.
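A rough sketch of where the "4x" comes from: with 64-bit elements, a wider vector simply needs fewer loop iterations to cover the same array. The array length below is hypothetical, and iteration count is only a proxy, since real speedups also depend on memory bandwidth, clocks and how many pipelines a chip has.

```python
import math


def vector_iterations(n_elements, vector_bits, element_bits=64):
    """Loop iterations needed to cover n_elements at a given vector width."""
    lanes = vector_bits // element_bits
    return math.ceil(n_elements / lanes)


N = 1_000_000  # hypothetical array length, chosen only for illustration
base = vector_iterations(N, 512)  # the A64FX width serves as the baseline

for bits in (128, 512, 2048):
    iters = vector_iterations(N, bits)
    print(f"{bits:>4}-bit: {iters} iterations, {base / iters:.2f}x vs 512-bit")
```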

As far as I know, the A64FX is the only chip in the world that currently implements ARM-SVE.

----------

Because ARM-SVE is variable-length, the concept of a "wavefront" doesn't really apply to it. At least, not in any traditional sense.

dragontamer5788 · Jun 23, 2020

Reddit was sharing this GitHub repo with the A64FX's official documentation: https://github.com/fujitsu/A64FX/blob/master/doc/A64FX_Microarchitecture_Manual_en_1.1.pdf

As some may know: this supercomputer is now the #1 supercomputer in the world, beating Summit in the Linpack benchmark. So congrats to the team for a job well done!

Vayra86 · Jun 23, 2020

R-T-B said:
Since they need to spend a human arm per arm core, they need 7,680,000 human arms to purchase this thing with all its nodes. Considering the average human only has two arms, that means we've made 3,840,000 armless wonders for this level of compute.

I hope it was worth it.

Ok, I'll stop.

At least the thing's got legs now. And it will run headless...

:fear:

Hyderz · Jun 23, 2020

R-T-B said:
Since they need to spend a human arm per arm core, they need 7,680,000 human arms to purchase this thing with all its nodes. Considering the average human only has two arms, that means we've made 3,840,000 armless wonders for this level of compute.

I hope it was worth it.

Ok, I'll stop.

depends on the quality of the arm
