Tuesday, August 18th 2020

Raja Koduri Previews "PetaFLOPs Scale" 4-Tile Intel Xe HP GPU

Raja Koduri, Intel's chief architect and senior vice president of Intel's discrete graphics division, today held a talk at Hot Chips 32, the latest online conference of 2020, which showcases the latest architectural advancements in the semiconductor industry. Intel prepared two talks, one about Ice Lake-SP server CPUs and one about its efforts toward the upcoming graphics card launch. So what has Intel been working on the whole time? In his talk, Raja Koduri showed benchmarks of the upcoming GPU and how much raw power it possesses, possibly counting in PetaFLOPs.

When Mr. Koduri got to talk, he pulled the 4-tile Xe HP GPU out of his pocket and showed for the first time how the chip looks. And it is one big chip. Featuring 4 tiles, the GPU represents Intel's fastest and biggest variant of Xe HP GPUs. The benchmark Intel ran was made to show off scaling on the Xe architecture and how increasing the number of tiles yields a near-proportional increase in performance. Running on a single tile, the GPU delivered 10588 GFLOPs or around 10.588 TeraFLOPs. With two tiles, the performance scales almost perfectly to 21161 GFLOPs (21.161 TeraFLOPs), a 1.999X improvement. At four tiles the GPU achieves 3.993 times scaling and scores 41908 GFLOPs, resulting in 41.908 TeraFLOPs, all measured in single-precision FP32.
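
As a rough check on the scaling figures above, the short Python sketch below recomputes the speedup and per-tile efficiency from the quoted GFLOPs numbers. The dictionary and function names are illustrative and do not come from Intel's presentation. Note that dividing the quoted four-tile score by the single-tile score gives roughly 3.96X rather than the quoted 3.993X; the source does not say how the 3.993X figure was derived.

```python
# FP32 throughput quoted in the article, in GFLOPs, keyed by tile count.
measured_gflops = {1: 10588, 2: 21161, 4: 41908}


def scaling_report(results):
    """Return {tiles: (tflops, speedup, efficiency)} relative to one tile."""
    base = results[1]
    report = {}
    for tiles, gflops in sorted(results.items()):
        speedup = gflops / base       # how many times faster than one tile
        efficiency = speedup / tiles  # 1.0 would be perfectly linear scaling
        report[tiles] = (gflops / 1000, speedup, efficiency)
    return report


report = scaling_report(measured_gflops)
for tiles, (tflops, speedup, efficiency) in report.items():
    print(f"{tiles} tile(s): {tflops:.3f} TFLOPs, "
          f"{speedup:.3f}x speedup, {efficiency:.1%} efficiency")
```

An efficiency close to 100% at every tile count is what "almost perfect" scaling means here.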
Mr. Koduri mentioned that the 4-tile chip is capable of "PetaFLOPs performance", which means that the GPU is going to be incredibly fast for tasks like machine learning and AI. Given that the GPU supports tensor cores, if we calculate that it has 2048 execution units (EUs), each capable of performing 128 operations per cycle (128 TOPs), and that there are about 2 FMA (Fused Multiply-Add) units, that equals about 524,288 operations per clock cycle of AI power. This means that the GPU needs to be clocked at roughly 2 GHz (about 1.9 GHz at minimum) to reach the PetaFLOP performance target, or have more than 128 TOPs of computing ability.
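
The back-of-the-envelope estimate above can be reproduced with a few lines of Python. This is a sketch under the article's own assumptions (2048 EUs, 128 operations per EU per cycle, and a factor of 2 for FMA), none of which are confirmed specifications; the constant and function names are made up for illustration.

```python
# Assumed figures taken from the article's estimate, not confirmed specs.
EUS = 2048                  # execution units across all four tiles
OPS_PER_EU_PER_CYCLE = 128  # assumed operations per EU per clock
FMA_FACTOR = 2              # the article's factor of 2 for FMA
PETA = 1e15

# Total operations the whole GPU would complete in one clock cycle.
ops_per_cycle = EUS * OPS_PER_EU_PER_CYCLE * FMA_FACTOR  # 2048 * 128 * 2


def required_clock_ghz(target_ops_per_second, ops_per_cycle):
    """Clock (in GHz) needed to reach a target operations-per-second rate."""
    return target_ops_per_second / ops_per_cycle / 1e9


def peak_peta_ops(clock_ghz, ops_per_cycle):
    """Peak throughput, in units of 10^15 operations per second."""
    return clock_ghz * 1e9 * ops_per_cycle / PETA


print(f"Operations per cycle: {ops_per_cycle:,}")
print(f"Clock needed for 1 peta-op/s: "
      f"{required_clock_ghz(PETA, ops_per_cycle):.3f} GHz")
print(f"Peak at 2.0 GHz: {peak_peta_ops(2.0, ops_per_cycle):.3f} peta-ops/s")
```

Changing `OPS_PER_EU_PER_CYCLE` shows the trade-off the article describes: a higher per-EU rate lowers the clock needed to hit the same target.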
Source: Tom's Hardware

32 Comments on Raja Koduri Previews "PetaFLOPs Scale" 4-Tile Intel Xe HP GPU

#1
laszlo
Didn't know Intel started to make tiles; roof covering is quite a good business now, depending on how long their tiles will last...
#2
Vya Domus
10 TFLOPS per tile is rather unimpressive, you could have gotten that performance from a single GPU 4 years ago. It's kind of a waste for an MCM design, that should be reserved for when the independent GPUs are already as fast as possible.
#3
HD64G
So, GPU compute workloads are parallel, and when many GPU cores are combined, the resulting performance is almost perfectly proportional to their number. Who knew?
#4
Metroid
Vya Domus said: 10 TFLOPS per tile is rather unimpressive, you could have gotten that performance from a single GPU 4 years ago. It's kind of a waste for an MCM design, that should be reserved for when the independent GPUs are already as fast as possible.
en.wikipedia.org/wiki/Ampere_(microarchitecture)

Architecture                  Ampere          Volta           Pascal
FP32 CUDA Cores               6912            5120            3584
Boost Clock                   1410MHz         1530MHz         1480MHz
Memory Clock                  2.4Gbps HBM2    1.75Gbps HBM2   1.4Gbps HBM2
Memory Bus Width              5120-bit        4096-bit        4096-bit
Memory Bandwidth              1555GB/sec      900GB/sec       720GB/sec
VRAM                          40GB            16GB/32GB       16GB
Single Precision              19.5 TFLOPs     15.7 TFLOPs     10.6 TFLOPs
Double Precision              9.7 TFLOPs      7.8 TFLOPs      5.3 TFLOPs
INT8 Tensor                   624 TOPs        N/A             N/A
FP16 Tensor                   312 TFLOPs      125 TFLOPs      N/A
bfloat16 Tensor               312 TFLOPs      N/A
TensorFloat-32 (TF32) Tensor  156 TFLOPs      N/A
FP64 Tensor                   19.5 TFLOPs     N/A
Interconnect                  600GB/sec       300GB/sec
GPU                           GA100           GV100
GPU Die Size                  826mm²          815mm²
Transistor Count              54.2B           21.1B
TDP                           400W            300W/350W
Manufacturing Process         TSMC 7nm N7     TSMC 12nm FFN

"At four tiles the GPU achieves 3.993 times scaling and scores 41908 GFLOPs resulting in 41.908 TeraFLOPS, all measured in single-precision FP32. "

Nvidia ampere does 19.5 on single precision. Intel with 1 tile does 10, 4 tiles does 40, I do find it impressive.
#5
Vayra86
HD64G said: So, GPU compute workloads are parallel, and when many GPU cores are combined, the resulting performance is almost perfectly proportional to their number. Who knew?
But look at it scale man! Never mind the fact power consumption is obviously also x2

Regardless, Intel does seem to have an MCM solution running, so they're doing something.
#6
ZoneDymo
"10588 GFLOPs or around 10.588 TeraFLOPs. "

this is just funny

It manages about 1km or about 1000m or about 100000cm
#7
Vayra86
Metroid said: en.wikipedia.org/wiki/Ampere_(microarchitecture)

"At four tiles the GPU achieves 3.993 times scaling and scores 41908 GFLOPs resulting in 41.908 TeraFLOPS, all measured in single-precision FP32. "

Nvidia ampere does 19.5 on single precision. Intel with 1 tile does 10, 4 tiles does 40, I do find it impressive.
Dunno. Nvidia dominates the space not just because of raw TFLOPs, it does so because it provides software frameworks for their hardware as well. There's a lot more to this than just dropping a number on the floor and saying 'look, it's fast'. Yeah... theoretically. Nvidia has been doing more with less for decades now. In addition, do you know the floor plan for this Xe GPU? It sure as hell is large. 4 tiles = 4 perfect dies.

In addition, between this and their IGPs I'm not seeing how Xe is going to make waves for gaming. This is certainly not the trimmed down gaming die we'll get... but what dó we get? 16 tiles of IGP? :p

All in all, cool test, but pretty pointless and a great way of telling us nothing.
#8
Metroid
Vayra86 said: Dunno. Nvidia dominates the space not just because of raw TFLOPs, it does so because it provides software frameworks for their hardware as well. There's a lot more to this than just dropping a number on the floor and saying 'look, it's fast'. Yeah... theoretically. Nvidia has been doing more with less for decades now. In addition, do you know the floor plan for this Xe GPU? It sure as hell is large. 4 tiles = 4 perfect dies.
The impressive part of my post still stands; transforming that into a gaming machine that will rival AMD and Nvidia is a totally different thing. I do hope it happens. Nvidia wanting to charge $2k for their 3090 is enough; more competition, better prices, we all win.

To complement my post: last I heard, 4 tiles would be around 500 watts. Imagine cooling that thing down, hehe. I mean, cooling down a Threadripper is hard enough, and TR is around 500 watts. I have no idea what they will do. I think they will release 2-tile and 1-tile GPUs or, as a more manufacturing-friendly approach, only a 2-tile GPU priced lower than the competition. We will see. I hope they get it right; we need this, but knowing Intel, they always price their things very high versus the competition.
#9
Vya Domus
Metroid said: Nvidia ampere does 19.5 on single precision. Intel with 1 tile does 10, 4 tiles does 40, I do find it impressive.
In other words what Intel can do with 2 GPUs Nvidia can with 1. That's not impressive no matter how you spin it, Nvidia ships 8 GPUs per board I believe, I can't imagine Intel shipping more than 2 per board so in the end Nvidia is still going to have the performance density advantage.
#10
Metroid
Vya Domus said: In other words what Intel can do with 2 GPUs Nvidia can with 1. That's not impressive no matter how you spin it, Nvidia ships 8 GPUs per board I believe, I can't imagine Intel shipping more than 2 per board so in the end Nvidia is still going to have the performance density advantage.
I have no idea if 1 Intel tile = Ampere, I mean in size; I do think Ampere matches the size of 2 Intel tiles. So in single precision, Intel's 2 tiles and Nvidia's Ampere are pretty much matched in that sense, and I do find it impressive; this is Intel's first attempt.
#11
stimpy88
More smoke and mirrors from Mr Koduri... The usual promises of greatness, yet when launched, will most likely be an expensive turd for niche use cases.
#12
Fluffmeister
Yeah kinda has a whiff of the Polaris launch about it, why buy a single fast efficient GTX 1080 when you can buy TWO RX 480's instead! I mean look how well they do in everyone's favourite game!
#13
ExcuseMeWtf
Metroid said: en.wikipedia.org/wiki/Ampere_(microarchitecture)

"At four tiles the GPU achieves 3.993 times scaling and scores 41908 GFLOPs resulting in 41.908 TeraFLOPS, all measured in single-precision FP32. "

Nvidia ampere does 19.5 on single precision. Intel with 1 tile does 10, 4 tiles does 40, I do find it impressive.
Fair enough.

We will see if latency in whatever interconnect implementation they use won't be an issue, though I presume Intel engineers thought of that and more already and certainly know better than all of us here.
#14
Mescalamba
The beauty of nVidia being able to power all sorts of processing is that CUDA and similar is actually possible to implement and use, sometimes even for a daily noob user. AMD has mostly "theoretical" power, but actually using it is a lot different and sadly, when actually used in a comparable scenario, they're not where they should be based on pure power. I can't say if AMD is overstating its performance or if just the software part isn't up to it.

Over a long time period I always had a "feeling" AMD (ATi) could do a ton better if they actually got their software side together.. Even games, when tuned right, show that sometimes. It's there, just mostly out of reach. :/

Intel seems to be again a bit late to the party, with the same issues their CPUs have. Too big, too hot. And probably too expensive.
#15
T4C Fantasy
CPU & GPU DB Maintainer
I'm excited for all new gpus coming out, architecture is really intriguing no matter how inferior or superior it is.
#16
Vayra86
Mescalamba said: The beauty of nVidia being able to power all sorts of processing is that CUDA and similar is actually possible to implement and use, sometimes even for a daily noob user. AMD has mostly "theoretical" power, but actually using it is a lot different and sadly, when actually used in a comparable scenario, they're not where they should be based on pure power. I can't say if AMD is overstating its performance or if just the software part isn't up to it.

Over a long time period I always had a "feeling" AMD (ATi) could do a ton better if they actually got their software side together.. Even games, when tuned right, show that sometimes. It's there, just mostly out of reach. :/

Intel seems to be again a bit late to the party, with the same issues their CPUs have. Too big, too hot. And probably too expensive.
What strikes me with Intel in all of their new developments is the lack of focus on scalability in terms of yields. Nowhere can we see a straight copy of the idea of chiplets that are as small as possible. They're still trying to make big complicated stuff. Even these tiled GPUs are humongous. They're also differentiating everything all over the place with a myriad of product lines and tweaks... it's like they literally don't WANT to make an efficient, single product stack and derive new products from it - they just build a whole new one for every little segment. The wide variety of core configurations alone... wtf.

Looks like old ideas desperately trying to keep themselves relevant, despite ever increasing foundry challenges. It's like they love to repeat 10nm. Intel seems to be adamant that extreme specialization and tweaking is the way forward... but isn't that a dead end, ultimately, and probably pretty soon?
#17
Caring1
More flops from Intel, I bet no one saw that coming. ;)
#18
DeathtoGnomes
Caring1 said: More flops from Intel, I bet no one saw that coming. ;)
I did. [I see what you did there!]

I mean, it is a low entry-level design, so for a first attempt we can say "it has potential".
#19
Steevo
A demonstration of a simulation of the possible power of what could be?

Sounds about right for this guy.
#20
Metroid
Steevo said: A demonstration of a simulation of the possible power of what could be?
Sounds about right for this guy.
He likes to brag hehe, I expected no less from Mr Raja Koduri hehe

He is under a lot of pressure, I tell you hehe; nothing better to show the people at Intel who do not like him that he is doing his job.
#21
PowerPC
Raja Koduri's eyes look like he should have stayed at AMD.
#22
stimpy88
PowerPC said: Raja Koduri's eyes look like he should have stayed at AMD.
Him leaving AMD was the best thing that has happened to them since Lisa Su and the Zen architecture.
#23
Zareek
At least this is starting to get interesting! Some more competition in graphics would be really nice...
#24
Blueberries
Having linear scalability is WILD, and 10.5TFLOPS on a single chipset is nothing to scoff at.

I'll rehash what I said when Xe was announced: if Intel doesn't provide a competitive product with their initial launch, they absolutely will with their third or fourth generation.
#25
Steevo
Blueberries said: Having linear scalability is WILD, and 10.5TFLOPS on a single chipset is nothing to scoff at.

I'll rehash what I said when Xe was announced: if Intel doesn't provide a competitive product with their initial launch, they absolutely will with their third or fourth generation.
Also a broken clock is right twice a day.
Copyright © 2004-2021 www.techpowerup.com. All rights reserved.
All trademarks used are properties of their respective owners.