
Japanese Riken gets supercomputer with almost 160,000 nodes with 48 Arm cores

Joined
Jan 5, 2006
Messages
10,402 (1.97/day)
System Name Desktop / Laptop
Processor Intel i7 6700K @ 4.5GHz (1.270 V) / Intel i3 7100U
Motherboard Asus Z170 Pro Gaming / HP 83A3 (U3E1)
Cooling Noctua NH-U12A 2 fans + Thermal Grizzly Kryonaut + 5 case fans / Fan
Memory 16GB DDR4 Corsair Vengeance LPX 3000MHz CL15 / 8GB DDR4 HyperX CL13
Video Card(s) MSI RTX 2070 Super Gaming X Trio / Intel HD620
Storage Samsung 970 Evo 500GB + Samsung 850 Pro 512GB + Samsung 860 Evo 1TB / Samsung 256GB M.2 SSD
Display(s) 23.8" Dell S2417DG 165Hz G-Sync 1440p + 21.5" LG 22MP67VQ IPS 60Hz 1080p / 14" 1080p IPS Glossy
Case Be quiet! Silent Base 600 - Window / HP Pavilion
Audio Device(s) SupremeFX Onboard / Realtek onboard + B&O speaker system
Power Supply Seasonic Focus Plus Gold 750W / Powerbrick
Mouse Logitech MX Anywhere 2 Laser wireless / Logitech M330 wireless
Keyboard RAPOO E9270P Black 5GHz wireless / HP backlit
Software Windows 10 / Windows 10

Four hundred racks have been delivered to the Japanese research institute Riken for Fugaku, a supercomputer with chips based on Arm. The system will have nearly 160,000 nodes, each with a 48-core Arm chip.

Installation of Fugaku started in early December, and Riken reports that all parts were delivered this week. The supercomputer is expected to be ready for use in 2021, after which it will be used for research calculations, including projects to combat Covid-19.

The entire system will consist of 158,976 nodes with A64FX processors from Fujitsu. These are Armv8.2-A processors with SVE, featuring 48 cores for computation plus 2 or 4 cores for OS activity. The chips run at 2GHz, with a boost to 2.2GHz, and are paired with 32GB of HBM2 memory. The complete system should deliver roughly half an exaflop of 64-bit double-precision floating-point performance.
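For a rough sanity check on that half-exaflop figure, here is a back-of-the-envelope peak-performance calculation in C. It assumes each A64FX core has two 512-bit SVE FMA pipelines (8 doubles per pipe, 2 flops per FMA); that per-core detail comes from published A64FX material, not from the article above, so treat it as an assumption.

#include <stdio.h>

int main(void) {
    /* Assumed A64FX per-core figures (not stated in the article above). */
    const double ghz           = 2.0;          /* base clock in GHz                  */
    const double fma_pipes     = 2.0;          /* 512-bit SVE FMA pipelines per core */
    const double fp64_lanes    = 512.0 / 64.0; /* doubles per 512-bit pipe           */
    const double flops_per_fma = 2.0;          /* multiply + add                     */
    const double cores         = 48.0;         /* compute cores per node             */
    const double nodes         = 158976.0;     /* nodes in the full system           */

    double gflops_per_core = ghz * fma_pipes * fp64_lanes * flops_per_fma; /* ~64   */
    double tflops_per_node = gflops_per_core * cores / 1000.0;             /* ~3.07 */
    double pflops_system   = tflops_per_node * nodes / 1000.0;             /* ~488  */

    printf("Peak: %.0f GFLOPS/core, %.2f TFLOPS/node, %.0f PFLOPS system\n",
           gflops_per_core, tflops_per_node, pflops_system);
    return 0;
}

That works out to roughly 488 petaflops at the 2GHz base clock (about 537 petaflops at the 2.2GHz boost), which lines up with the half an exaflop of double-precision performance quoted above.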


Fugaku is the nickname of Mount Fuji in Japan.


 
Joined
Apr 24, 2020
Messages
79 (1.80/day)
The cool part about this supercomputer is that this is the first major implementation of the ARM-SVE (Scalable Vector Extension) instruction set. SIMD-compute in supercomputers is pretty commonplace these days, but every implementation brings forth new ideas and optimizations.

I hope to hear good things from those who work with the ARM-SVE instruction set. It's allegedly easier to auto-vectorize than AVX2 or AVX512, and the SVE instruction set is unique in that the ISA itself is independent of the SIMD width. So future ARM-SVE implementations may increase (or decrease) the vector width without any need for recompiling. In contrast, AMD made a big 64-wide to 32-wide change from GCN/Vega to RDNA, and Nvidia is similarly fixed at 32-wide at the ISA level.
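As a concrete illustration, here is a minimal sketch (mine, not from this thread) of a vector-length-agnostic loop written with the Arm SVE ACLE intrinsics. Nothing in it hard-codes a vector width, so the same binary would use 512-bit vectors on the A64FX and whatever width a future implementation happens to provide.

#include <arm_sve.h>
#include <stdint.h>

/* y[i] += a * x[i], written without ever naming the vector width. */
void daxpy_sve(double a, const double *x, double *y, int64_t n) {
    /* svcntd() reports how many doubles fit in one vector at runtime:
       8 on a 512-bit A64FX, 32 on a hypothetical 2048-bit implementation. */
    for (int64_t i = 0; i < n; i += svcntd()) {
        svbool_t pg = svwhilelt_b64(i, n);   /* predicate masks off the tail */
        svfloat64_t vx = svld1_f64(pg, &x[i]);
        svfloat64_t vy = svld1_f64(pg, &y[i]);
        vy = svmla_n_f64_x(pg, vy, vx, a);   /* vy += vx * a */
        svst1_f64(pg, &y[i], vy);
    }
}

Built with something like gcc -O2 -march=armv8.2-a+sve, that loop runs unmodified at whatever SIMD width the hardware exposes, which is exactly the recompile-free scaling described above.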

I'm kind of betting that the static 32-wide ISAs of GPUs will win out in the long run: surely 32-wide (1024 bits) is wide enough, and there are multiple instructions (permute / bpermute) that are innately tied to the width of the SIMD processor. Nonetheless, seeing the SIMD-width-independent SVE instruction set is cool.
 
Joined
Jan 31, 2011
Messages
100 (0.03/day)
@dragontamer5788 That is pretty cool, but what happens if you end up with something that is expecting 2048-bit vectors trying to execute on an implementation that only has 128-bit vector HW? Or is it purely down to SW design to accommodate that? We can see from AMD's AVX implementations that vector width does not automatically scale up and down on the RTL side, and the ISA of SVE doesn't seem to do that either.
 
Joined
Apr 24, 2020
Messages
79 (1.80/day)
@dragontamer5788 That is pretty cool, but what happens if you end up with something that is expecting 2048-bit vectors trying to execute on an implementation that only has 128-bit vector HW? Or is it purely down to SW design to accommodate that? We can see from AMD's AVX implementations that vector width does not automatically scale up and down on the RTL side, and the ISA of SVE doesn't seem to do that either.

[attached image]



Does that answer your question?
 
Joined
Jan 8, 2017
Messages
5,036 (4.04/day)
System Name Good enough
Processor AMD Ryzen R7 1700X - 4.0 Ghz / 1.350V
Motherboard ASRock B450M Pro4
Cooling Scythe Katana 4 - 3x 120mm case fans
Memory 16GB - Corsair Vengeance LPX
Video Card(s) OEM Dell GTX 1080
Storage 1x Samsung 850 EVO 250GB , 1x Samsung 860 EVO 500GB
Display(s) 4K Samsung TV
Case Zalman R1
Power Supply 500W
@dragontamer5788 That is pretty cool, but what happens if you end up with something that is expecting 2048-bit vectors trying to execute on an implementation that only has 128-bit vector HW? Or is it purely down to SW design to accommodate that? We can see from AMD's AVX implementations that vector width does not automatically scale up and down on the RTL side, and the ISA of SVE doesn't seem to do that either.
It's somewhat similar to how GPUs work: things are broken up into wavefronts that are 128 bits wide, so it doesn't matter what is actually available in hardware.
 
Joined
Apr 24, 2020
Messages
79 (1.80/day)
things are broken up into wavefronts that are 128 bits wide
Surely you mean 1024 bits wide for NVidia Volta and AMD RDNA? AMD GCN is 2048 bits wide.

NVidia has something called "Cooperative Groups" that helps the programmer split the 32-wide SIMD execution into smaller pieces. But outside of cooperative groups, you really need to understand the 32-wide execution path to effectively program high-performance CUDA or high-performance OpenCL.
 
Joined
Jan 8, 2017
Messages
5,036 (4.04/day)
System Name Good enough
Processor AMD Ryzen R7 1700X - 4.0 Ghz / 1.350V
Motherboard ASRock B450M Pro4
Cooling Scythe Katana 4 - 3x 120mm case fans
Memory 16GB - Corsair Vengeance LPX
Video Card(s) OEM Dell GTX 1080
Storage 1x Samsung 850 EVO 250GB , 1x Samsung 860 EVO 500GB
Display(s) 4K Samsung TV
Case Zalman R1
Power Supply 500W
Surely you mean 1024 bits wide for NVidia Volta and AMD RDNA? AMD GCN is 2048 bits wide.

NVidia has something called "Cooperative Groups" that helps the programmer split the 32-wide SIMD execution into smaller pieces. But outside of cooperative groups, you really need to understand the 32-wide execution path to effectively program high-performance CUDA or high-performance OpenCL.
No, I was talking about the ARM implementation. It's sort of improper to say those GPU architectures are "1024-bit wide SIMD"; they're SIMT architectures.
 

FreedomEclipse

~Technological Technocrat~
Joined
Apr 20, 2007
Messages
20,589 (4.29/day)
Location
London,UK
System Name Codename: Icarus Mk.V
Processor Intel 8600k@4.8Ghz
Motherboard Asus ROG Strixx Z370-F
Cooling CPU: BeQuiet! Dark Rock Pro 4 {1xCorsair ML120 Pro|5xML140 Pro}
Memory 16 Corsair Vengeance White LED DDR4 3200Mhz {2x8GB}
Video Card(s) Gigabyte 1080Ti Gaming OC|Accelero Xtreme IV
Storage Samsung 970Evo 512GB SSD (Boot)|WD Blue 1TB SSD|2x 3TB Toshiba DT01ACA300
Display(s) Asus PB278Q 27"
Case Corsair 760T (White)
Audio Device(s) Yamaha RX-V573|Speakers: JBL Control One|Auna 300-CN|Wharfedale Diamond SW150
Power Supply Corsair AX760
Mouse Logitech G900/G502
Keyboard Duckyshine Dead LED(s) III
Software Windows 10 Pro
Benchmark Scores (ノಠ益ಠ)ノ彡┻━┻
Word on the street is, it costs more than an Arm and a leg.
 
Joined
Aug 20, 2007
Messages
12,993 (2.78/day)
System Name Pioneer
Processor Intel i9 9900k
Motherboard ASRock Z390 Taichi
Cooling Noctua NH-D15 + A whole lotta Sunon and Corsair Maglev blower fans...
Memory G.SKILL TridentZ Series 32GB (4 x 8GB) DDR4-3200 @ 14-14-14-34-2T
Video Card(s) AMD RX 5700 XT (XFX THICC Ultra III)
Storage Mushkin Pilot-E 2TB NVMe SSD w/ EKWB M.2 Heatsink
Display(s) 55" LG 55" B9 OLED 4K Display
Case Thermaltake Core X31
Audio Device(s) VGA HDMI->Panasonic SC-HTB20/Schiit Modi MB/Asgard 2 DAC/Amp to AKG Pro K7712 Headphones
Power Supply SeaSonic Prime 750W 80Plus Titanium
Mouse ROCCAT Kone EMP
Keyboard WASD CODE 104-Key w/ Cherry MX Green Keyswitches, Doubleshot Vortex PBT White Transluscent Keycaps
Software Windows 10 Enterprise (yes, it's legit.)
Word on the street is, it costs more than an Arm and a leg.
Since they need to spend a human arm per Arm core, they need 7,680,000 human arms to purchase this thing with all its nodes. Considering the average human only has two arms, that means we've made 3,840,000 armless wonders for this level of compute.

I hope it was worth it.

Ok, I'll stop.
 
Joined
Apr 24, 2020
Messages
79 (1.80/day)
No, I was talking about the ARM implementation. It's sort of improper to say those GPU architectures are "1024-bit wide SIMD"; they're SIMT architectures.
Leaving aside SIMT vs SIMD for a moment ("SIMT" being, in my view, a marketing term invented by NVidia), I'm pretty confused by your statements.

SVE could be implemented 128 bits wide, but the implementation for the Riken supercomputer is well documented to be 512 bits wide (similar per-core throughput to Intel's AVX-512 Skylake implementations). The ARM-SVE ISA is designed to scale up to 2048 bits wide, so code written for the Riken supercomputer would theoretically be able to run at full speed (that is, 4x faster) on a hypothetical future 2048-bit SVE implementation, without recompiling.
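A small sketch of what "no recompiling" looks like in practice (my example, not from this thread): the vector length is a hardware property that a binary queries at runtime, so the same executable simply reports, and uses, a different width on different SVE implementations.

#include <arm_sve.h>
#include <stdio.h>

int main(void) {
    /* svcntb() returns the SVE vector length in bytes for the hardware this
       binary is currently running on: 64 bytes (512 bits) on the A64FX,
       256 bytes (2048 bits) on a hypothetical future maximum-width part. */
    printf("SVE vector length: %u bits\n", (unsigned)(svcntb() * 8));
    return 0;
}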

As far as I know, the A64FX is the only chip in the world that currently implements ARM-SVE.

----------

Because ARM-SVE is variable-length, the concept of a "wavefront" doesn't really apply to it. At least, not in any traditional sense.
 