Friday, August 20th 2021

Intel Ponte Vecchio Early Silicon Puts Out 45 TFLOPs FP32 at 1.37 GHz, Already Beats NVIDIA A100 and AMD MI100

Intel in its 2021 Architecture Day presentation put out fine technical details of its Xe HPC Ponte Vecchio accelerator, including some [very] preliminary performance claims for its current A0-silicon-based prototype. The prototype operates at 1.37 GHz, but puts out at least 45 TFLOPs of FP32 throughput. We calculated the clock speed based on simple math. Intel obtained the 45 TFLOPs number on a machine running a single Ponte Vecchio OAM (single MCM with two stacks), and a Xeon "Sapphire Rapids" CPU. 45 TFLOPs sees the processor already beat the advertised 19.5 TFLOPs of the NVIDIA "Ampere" A100 Tensor Core 40 GB processor. AMD isn't faring any better, with its production Instinct MI100 processor only offering 23.1 TFLOPs FP32.
"A0 silicon" is the first batch of chips that come back from the foundry after the tapeout. It's a prototype that is likely circulated within Intel internally, and to a very exclusive group of ISVs and industry partners, under very strict NDAs. It is common practice to ship prototypes with significantly lower clock speeds than what the silicon is capable of, at least to the ISVs, so they can test for functionality and begin developing software for the silicon.
Our math for the clock speed is as follows. In the presentation, Intel mentions that each package (OAM) puts out a throughput of 32,768 FP32 ops per clock cycle. It also says that a 2-stack (one package) amounts to 128 Xe-cores, and that the Vector Engines of each Xe HPC core offer a combined 256 FP32 ops per clock cycle. Multiplied together (128 x 256), these come to 32,768 FP32 ops/clock for one package (a 2-stack). From here, we calculate that 45,000 GFLOPs (measured in clpeak by the way), divided by 32,768 FP32 ops/clock, amounts to a clock speed of 1373 MHz. A production stepping will likely have higher clock speeds, and throughput scales linearly with clock speed, but even 1.37 GHz seems like a number Intel could settle on, given the sheer size and "weight" (power draw) of the silicon (rumored to be 600 W for A0). All this power comes with great thermal costs, with Intel requiring liquid cooling for the OAMs. If these numbers can make it into the final product, then Intel may well have broken into the HPC space in a big way.
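The clock-speed estimate above can be reproduced in a few lines. This is only a sketch of the article's arithmetic; the variable and function names are illustrative, and the inputs are the figures quoted in the text rather than independent measurements.

```python
# Figures quoted in the article (names are illustrative, not Intel's).
XE_CORES_PER_PACKAGE = 128          # one OAM package = 2 stacks
FP32_OPS_PER_CLOCK_PER_CORE = 256   # per Xe-core, all vector engines combined
MEASURED_GFLOPS = 45_000            # 45 TFLOPs FP32, reported from clpeak

# Peak FP32 operations the package can issue every clock cycle.
ops_per_clock = XE_CORES_PER_PACKAGE * FP32_OPS_PER_CLOCK_PER_CORE

# Implied clock: operations per second divided by operations per clock.
clock_mhz = MEASURED_GFLOPS * 1e9 / ops_per_clock / 1e6


def throughput_tflops(clock_ghz: float) -> float:
    """Peak FP32 TFLOPs at a given clock, assuming linear scaling."""
    return ops_per_clock * clock_ghz * 1e9 / 1e12


print(ops_per_clock)             # 32768
print(round(clock_mhz))          # 1373
print(throughput_tflops(2.0))    # hypothetical 2 GHz production clock
```

Under the same linear-scaling assumption, a hypothetical 2 GHz part would land at roughly 65.5 TFLOPs. Whether production silicon reaches that clock is not something the source states.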

48 Comments on Intel Ponte Vecchio Early Silicon Puts Out 45 TFLOPs FP32 at 1.37 GHz, Already Beats NVIDIA A100 and AMD MI100

#2
dragontamer5788


This picture is a pretty big deal if Intel is being honest about the architecture.

Xe-link is probably not a crossbar as indicated in the picture (I'm assuming it's closer to a CLOS network or Benes Network). But the idea is that a switch can provide an any-to-any connection at full speed between all nodes. If they're really using a good switching fabric and can provide full throughput between all nodes, then they're going to be a worthy competitor against NVidia's NVLink technology.
#4
dragontamer5788
Muser99: Likely uses InfiniBand? InfiniBand Trade Association - Wikipedia
Unlikely. NVidia is spec'd out for 600GBps per link (that's 4800 Gbit/s). If Intel is seriously trying to compete against NVLink, I'd be expecting at least 50 GBps (400 Gbit) throughput link-to-link, or more.

Coming in at 1/12th the speed of NVidia is fine for a 1st gen product, but they'll have to catch up quickly after proving themselves. The speeds of these links are an order of magnitude more bandwidth than what even InfiniBand offers.
#5
W1zzard
dragontamer5788: Xe-link is probably not a crossbar as indicated in the picture (I'm assuming it's closer to a CLOS network or Benes Network). But the idea is that a switch can provide an any-to-any connection at full speed between all nodes. If they're really using a good switching fabric and can provide full throughput between all nodes, then they're going to be a worthy competitor against NVidia's NVLink technology.

I'd interpret this slide as "crossbar"
#6
AnarchoPrimitiv
At 600 watts? How do they compare per unit of measurement (TFLOPs per watt)?
#7
xkm1948
I hope Intel is preparing a good software development stack to support this in the long run
#8
TumbleGeorge
AnarchoPrimitiv: At 600 watts? How do they compare per unit of measurement (TFLOPs per watt)?
There are calculations out there of what the performance would be if PV ran at 2 GHz... maybe the 600 W target is for a device running at a higher frequency than this early-silicon sample.
#10
P4-630
:D Soo... Can it run Crysis? :D
#11
davideneco
The MI300 announcement is near,
and it will be available before Ponte Vecchio

With 70-75 TFLOPs FP32 ....
#12
dragontamer5788
W1zzard
I'd interpret this slide as "crossbar"
Thanks for the slide.

Unfortunately, it's giving me more questions than answers. The ArchDay21claims site doesn't provide details (edc.intel.com/content/www/us/en/products/performance/benchmarks/architecture-day-2021/). I don't know if that's 90 Gbit/sec or 90 GByte/sec, for example.

8x links gets us to 720 "G" per second; hopefully that's "GBytes", which would be a bit faster than NVSwitch and competitive. But if it's "Gbits", then that's only 90 GByte/sec (which is probably passable, but much slower than NVidia). It's "passable" because 16x PCIe 4 is just 32 GByte/sec, so really, anything "faster than PCIe" is kind of a win. But I'm assuming Intel is aiming at the big boy, the A100 600 GByte/sec fabric.
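To make the Gbit-versus-GByte ambiguity concrete, here is a small sketch of both readings of the slide. The 8 links and "90 G" per link come from the slide as discussed above, and the PCIe and A100 reference figures are the ones quoted in this comment; the function name is hypothetical.

```python
LINKS_PER_NODE = 8
PER_LINK_G = 90             # "90 G" per link; the unit is NOT known
PCIE4_X16_GBYTES = 32       # reference figure quoted in the comment
A100_FABRIC_GBYTES = 600    # reference figure quoted in the comment


def aggregate_gbytes(per_link: float, links: int, unit: str) -> float:
    """Total per-node bandwidth in GByte/s for a given reading of the unit."""
    total = per_link * links
    if unit == "GByte":
        return float(total)
    if unit == "Gbit":
        return total / 8    # 8 bits per byte
    raise ValueError(f"unknown unit: {unit}")


for unit in ("GByte", "Gbit"):
    bw = aggregate_gbytes(PER_LINK_G, LINKS_PER_NODE, unit)
    print(f"{unit:5s}: {bw:6.1f} GByte/s, "
          f"{bw / PCIE4_X16_GBYTES:.2f}x PCIe 4 x16, "
          f"{bw / A100_FABRIC_GBYTES:.2f}x the A100 fabric")
```

Read as GBytes, the total is 720 GByte/s (1.2x the A100 figure). Read as Gbits, it is 90 GByte/s, still about 2.8x a PCIe 4 x16 slot.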

------

Note: most "crossbars" are just nonblocking CLOS networks. :) I think people use the term "crossbar" as shorthand for a "switch that has no restriction on bandwidth" (which a nonblocking CLOS network qualifies as), and not necessarily a "physical crossbar" (which takes up O(n^2) space, while a CLOS network takes O(n*log(n)) space).
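A rough way to see the O(n^2) versus O(n*log(n)) gap is to count crosspoints. The sketch below compares a physical crossbar with a Benes network built from 2x2 switches (a rearrangeably nonblocking design, used here only as a stand-in). Nothing in the thread says Intel uses either layout, and the function names are hypothetical.

```python
def crossbar_crosspoints(n: int) -> int:
    """A physical n x n crossbar needs one crosspoint per input/output pair."""
    return n * n


def benes_crosspoints(n: int) -> int:
    """Benes network on n = 2^k ports.

    It has (2k - 1) stages of n/2 two-by-two switches; each 2x2 switch
    is counted here as 4 crosspoints.
    """
    if n < 2 or n & (n - 1):
        raise ValueError("n must be a power of two, at least 2")
    k = n.bit_length() - 1          # log2(n) for a power of two
    stages = 2 * k - 1
    return stages * (n // 2) * 4


for ports in (8, 64, 1024):
    print(ports, crossbar_crosspoints(ports), benes_crosspoints(ports))
```

At 8 ports the multistage network is actually larger (80 crosspoints versus 64), so the asymptotic advantage only shows up as the port count grows: at 1024 ports it is 38,912 versus 1,048,576.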
#13
TheGuruStud
Sure it does. Also, where's the chiller hiding? LOLtel really earning their name (and using tsmc makes it even better).
#14
persondb
dragontamer5788

This picture is a pretty big deal if Intel is being honest about the architecture.

Xe-link is probably not a crossbar as indicated in the picture (I'm assuming it's closer to a CLOS network or Benes Network). But the idea is that a switch can provide an any-to-any connection at full speed between all nodes. If they're really using a good switching fabric and can provide full throughput between all nodes, then they're going to be a worthy competitor against NVidia's NVLink technology.
That's not a crossbar. That's a fully connected topology, as every node has a link to every other node. You can see in W1zzard's post that each of them has 8 links. It doesn't need any switch at all; there are dedicated links from each node to all other nodes.
#17
btarunr
Editor & Senior Moderator
Minus Infinity: What about FP64?
FP64 is 1:1 FP32. So its FP64 throughput is identical.
P4-630: :D Soo... Can it run Crysis? :D
It has all the ingredients to be a cloud gaming GPU.
#18
phanbuey
davideneco: The MI300 announcement is near,
and it will be available before Ponte Vecchio

With 70-75 TFLOPs FP32 ....
exactly -- this is pure shareholder hype.
#21
nguyen
huh, isn't the RTX3090 already capable of ~36 TFLOPS of FP32 at 350W TGP, what's so special about a MCM solution getting 45 TFLOPS at 600W LMAO.
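For the perf-per-watt comparison, a quick sketch using only the numbers quoted in this thread. Note that the 600 W figure for Ponte Vecchio is a rumor about A0 silicon, and the RTX 3090 figure is an advertised peak, so treat the result as illustrative rather than measured. The function name is hypothetical.

```python
def gflops_per_watt(tflops: float, watts: float) -> float:
    """Convert a TFLOPs figure and a power figure into GFLOPs per watt."""
    return tflops * 1000 / watts


# (FP32 TFLOPs, watts) as quoted in the thread; not independently verified.
cards = {
    "Ponte Vecchio A0 (rumored 600 W)": (45, 600),
    "RTX 3090 (350 W TGP)": (36, 350),
}

for name, (tflops, watts) in cards.items():
    print(f"{name}: {gflops_per_watt(tflops, watts):.1f} GFLOPs/W")
```

On these numbers the prototype comes out at 75 GFLOPs/W against roughly 103 GFLOPs/W for the RTX 3090's advertised peak.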
#22
dragontamer5788
btarunr: FP64 is 1:1 FP32. So its FP64 throughput is identical.
Are you sure? Most of the time, FP64 is 1:2 FP32 (half-speed).

AVX512, A100, MI100, etc. etc. All the same. If you double the bits, you double the RAM bandwidth needed and therefore halve the speed (100 64-bit numbers are 800 bytes; 100 32-bit numbers are just 400 bytes).

Since RAM is moving effectively at half speed, it "just makes sense" for compute to also move at 1/2 speed.
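The byte-count argument can be written out as follows. This is a sketch of the memory-bound case only; the 1000 GByte/s bandwidth is a made-up round number for illustration, and the function names are hypothetical. It says nothing about how many FP64 ALUs a given chip actually has.

```python
import struct

FP32_BYTES = struct.calcsize("<f")   # size of an IEEE 754 single
FP64_BYTES = struct.calcsize("<d")   # size of an IEEE 754 double


def buffer_bytes(count: int, element_bytes: int) -> int:
    """Bytes needed to hold `count` values of the given width."""
    return count * element_bytes


def elements_per_second(bandwidth_gbytes: float, element_bytes: int) -> float:
    """Values streamed per second when memory bandwidth is the only limit."""
    return bandwidth_gbytes * 1e9 / element_bytes


ASSUMED_BANDWIDTH_GBYTES = 1000.0    # hypothetical, for illustration only

print(buffer_bytes(100, FP64_BYTES))   # 800
print(buffer_bytes(100, FP32_BYTES))   # 400
ratio = (elements_per_second(ASSUMED_BANDWIDTH_GBYTES, FP32_BYTES)
         / elements_per_second(ASSUMED_BANDWIDTH_GBYTES, FP64_BYTES))
print(ratio)                           # 2.0
```

The 2:1 ratio holds for any bandwidth value, since it depends only on the element sizes.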
#23
TumbleGeorge
nguyen: huh, isn't the RTX3090 already capable of ~36 TFLOPS of FP32 at 350W TGP, what's so special about a MCM solution getting 45 TFLOPS at 600W LMAO.
Nvidia's lie with this number has been repeated in 100,000 ways. Real teraflops are half of the advertised teraflops.
#24
dragontamer5788
nguyen: huh, isn't the RTX3090 already capable of ~36 TFLOPS of FP32 at 350W TGP, what's so special about a MCM solution getting 45 TFLOPS at 600W LMAO.
NVidia A100 (the $10,000 server card) is only 19.5 FP32 TFlops: www.nvidia.com/content/dam/en-zz/Solutions/Data-Center/a100/pdf/nvidia-a100-datasheet.pdf

And only 9.7 FP64 TFlops.

The Tensor-flops are an elevated number that only deep-learning folk care about (and apparently not all deep learning folk are using those tensor cores). Achieving ~20 FP32 TFlops general-purpose code is basically the best today (MI100 is a little bit faster, but without as much of that NVlink thing going on).

So 45 TFlops of FP32 is pretty huge by today's standards. However, Intel is going to be competing against the next-generation products, not the A100. I'm sure NVidia is going to grow, but 45TFlops per card is probably going to be competitive.
persondb: That's not a crossbar. That's a fully connected topology, as every node has a link to every other node. You can see in W1zzard's post that each of them has 8 links. It doesn't need any switch at all; there are dedicated links from each node to all other nodes.
Fully connected is stupid. It means that of the 720 G (bit? Byte?) available to Node A (90 G x 8 connections in Node A), you only have 90G wired between Node A and Node B. Which means Node A and B can only ever talk at 90G speeds.

What if Node B has all of the data that's important for the calculation? Well, you'd like it if Node A could communicate at 720 G (byte/sec ??) with Node B. You have 8x SerDes after all, it'd be nice to "gang up" those SerDes and have them work together.

Both a crossbar and a CLOS network would allow that. A fully connected topology cannot. This is the difference between Zen1 and Zen2, where Zen2 has a switch (probably a CLOS network, might be a crossbar) efficiently allocating RAM to all 8-nodes. Zen1 was fully connected (Node 1 had a high speed connection to Node 2, Node 3, and Node 4).

That switch is in fact, a big deal, and the key to scalability.
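Here is the pair-bandwidth argument as a toy model. It assumes 8 links of "90 G" per node (unit unknown, as discussed earlier), counts only the direct link in the fully connected case, and treats the switch as ideal and nonblocking. Relaying traffic through third-party nodes is ignored, and the function names are hypothetical.

```python
LINKS_PER_NODE = 8
PER_LINK_G = 90   # "90 G" per link; unit unknown


def pair_bandwidth_fully_connected(per_link: float,
                                   links_between_pair: int = 1) -> float:
    """Two nodes can only use the link(s) wired directly between them."""
    return per_link * links_between_pair


def pair_bandwidth_switched(per_link: float, links_per_node: int) -> float:
    """Behind an ideal nonblocking switch, one node can aim all of its
    links at a single peer, so the pair is limited by a node's total."""
    return per_link * links_per_node


direct = pair_bandwidth_fully_connected(PER_LINK_G)
switched = pair_bandwidth_switched(PER_LINK_G, LINKS_PER_NODE)
print(direct, switched, switched / direct)   # 90 720 8.0
```

In this model the switch gives an 8x advantage when one node holds all the interesting data, and no advantage when traffic is spread evenly across all peers.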
#25
nguyen
dragontamer5788: NVidia A100 (the $10,000 server card) is only 19.5 FP32 TFlops: www.nvidia.com/content/dam/en-zz/Solutions/Data-Center/a100/pdf/nvidia-a100-datasheet.pdf

And only 9.7 FP64 TFlops.

The Tensor-flops are an elevated number that only deep-learning folk care about (and apparently not all deep learning folk are using those tensor cores). Achieving ~20 FP32 TFlops general-purpose code is basically the best today (MI100 is a little bit faster, but without as much of that NVlink thing going on).

So 45 TFlops of FP32 is pretty huge by today's standards. However, Intel is going to be competing against the next-generation products, not the A100. I'm sure NVidia is going to grow, but 45TFlops per card is probably going to be competitive.
Nope, the RTX 3090 has ~36 TFLOPS of FP32; Tensor TFLOPS are something like INT4 or INT8. Obviously the A100 is designed for a different type of workload that doesn't depend on FP32 or FP64 so much. The workstation Ampere A6000 has 40 TFLOPS of FP32. I guess Nvidia stopped caring about FP64 performance after Titan X Maxwell.