
Intel Ponte Vecchio Early Silicon Puts Out 45 TFLOPs FP32 at 1.37 GHz, Already Beats NVIDIA A100 and AMD MI100

btarunr

Editor & Senior Moderator
Staff member
Joined
Oct 9, 2007
Messages
42,903 (8.03/day)
Location
Hyderabad, India
System Name RBMK-1000
Processor AMD Ryzen 7 5700G
Motherboard ASUS ROG Strix B450-E Gaming
Cooling DeepCool Gammax L240 V2
Memory 2x 8GB G.Skill Sniper X
Video Card(s) Palit GeForce RTX 2080 SUPER GameRock
Storage Western Digital Black NVMe 512GB
Display(s) BenQ 1440p 60 Hz 27-inch
Case Corsair Carbide 100R
Audio Device(s) ASUS SupremeFX S1220A
Power Supply Cooler Master MWE Gold 650W
Mouse ASUS ROG Strix Impact
Keyboard Gamdias Hermes E2
Software Windows 11 Pro
Intel, in its 2021 Architecture Day presentation, put out fine technical details of its Xe HPC Ponte Vecchio accelerator, including some [very] preliminary performance claims for its current A0-silicon-based prototype. The prototype operates at just 1.37 GHz, yet puts out at least 45 TFLOPs of FP32 throughput. We calculated the clock speed based on simple math. Intel obtained the 45 TFLOPs number on a machine running a single Ponte Vecchio OAM (single MCM with two stacks) and a Xeon "Sapphire Rapids" CPU. At 45 TFLOPs, the processor already beats the advertised 19.5 TFLOPs of the NVIDIA "Ampere" A100 Tensor Core 40 GB processor. AMD isn't faring any better, with its production Instinct MI100 processor only offering 23.1 TFLOPs FP32.



"A0 silicon" is the first batch of chips that come back from the foundry after the tapeout. It's a prototype that is likely circulated within Intel internally, and to a very exclusive group of ISVs and industry partners, under very strict NDAs. It is common practice to ship prototypes with significantly lower clock speeds than what the silicon is capable of, at least to the ISVs, so they can test for functionality and begin developing software for the silicon.



Our math for the clock speed is as follows. Intel mentions in the presentation that each package (OAM) puts out a throughput of 32,768 FP32 ops per clock cycle. It also says that a 2-stack (one package) amounts to 128 Xe-cores, and that the Vector Engines of each Xe HPC core offer 256 FP32 ops per clock cycle. Multiplied together (128 × 256), these come to 32,768 FP32 ops/clock for one package (a 2-stack). From here, we calculate that 45,000 GFLOPs (measured in clpeak, by the way), divided by 32,768 FP32 ops/clock, amounts to a clock speed of 1,373 MHz. A production stepping will likely have higher clock speeds, and throughput scales linearly with clock, but even 1.37 GHz seems like a number Intel could settle on, given the sheer size and "weight" (power draw) of the silicon (rumored to be 600 W for A0). All this power comes at a great thermal cost, with Intel requiring liquid cooling for the OAMs. If these numbers make it into the final product, then Intel will have broken into the HPC space in a big way.
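The clock-speed estimate above can be reproduced with a few lines of Python. This is only a sketch of the article's arithmetic; the constant and function names are illustrative, and the inputs are the figures quoted in the text, not independent measurements.

```python
# Figures quoted in the article (from Intel's Architecture Day 2021 slides).
XE_CORES_PER_PACKAGE = 128         # one 2-stack OAM
FP32_OPS_PER_CLOCK_PER_CORE = 256  # vector FP32 ops per clock, per Xe-core
MEASURED_GFLOPS = 45_000           # FP32 throughput reported from clpeak


def ops_per_clock(cores: int, ops_per_core: int) -> int:
    """Total FP32 ops issued per clock cycle across the package."""
    return cores * ops_per_core


def implied_clock_mhz(gflops: float, ops_per_clk: int) -> float:
    """GFLOPs divided by ops/clock gives GHz; scale to MHz."""
    return gflops / ops_per_clk * 1000.0


package_ops = ops_per_clock(XE_CORES_PER_PACKAGE, FP32_OPS_PER_CLOCK_PER_CORE)
clock_mhz = implied_clock_mhz(MEASURED_GFLOPS, package_ops)
print(f"{package_ops} ops/clock -> {clock_mhz:.0f} MHz")
```

Note that this treats the measured 45 TFLOPs as if the hardware hit its theoretical peak, so the real clock could be somewhat higher if clpeak did not reach full utilization.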



 
Joined
Jul 3, 2019
Messages
294 (0.28/day)
Location
Bulgaria
Processor 6700K
Motherboard M8G
Cooling D15S
Memory 16GB 3k15
Video Card(s) 2070S
Storage 850 Pro
Display(s) U2410
Case Core X2
Audio Device(s) ALC1150
Power Supply Seasonic
Mouse Razer
Keyboard Logitech
Software 20H2
Joined
Apr 24, 2020
Messages
1,639 (2.15/day)
1629487627436.png


This picture is a pretty big deal if Intel is being honest about the architecture.

Xe-link is probably not a crossbar as indicated in the picture (I'm assuming it's closer to a CLOS network or Benes Network). But the idea is that a switch can provide an any-to-any connection at full speed between all nodes. If they're really using a good switching fabric and can provide full throughput between all nodes, then they're going to be a worthy competitor against NVidia's NVLink technology.
 
Joined
Apr 24, 2020
Messages
1,639 (2.15/day)

Unlikely. NVidia is spec'd out for 600 GBps per GPU (that's 4800 Gbit/s). If Intel is seriously trying to compete against NVLink, I'd be expecting at least 50 GBps (400 Gbit/s) throughput link-to-link, or more.

Coming in at 1/12th the speed of NVidia is fine for a 1st gen product, but they'll have to catch up quickly after proving themselves. These links offer an order of magnitude more bandwidth than even InfiniBand.
 

W1zzard

Administrator
Staff member
Joined
May 14, 2004
Messages
24,084 (3.66/day)
Processor Core i7-8700K
Memory 32 GB
Video Card(s) RTX 3080
Display(s) 30" 2560x1600 + 19" 1280x1024
Software Windows 10 64-bit
Xe-link is probably not a crossbar as indicated in the picture (I'm assuming it's closer to a CLOS network or Benes Network). But the idea is that a switch can provide an any-to-any connection at full speed between all nodes. If they're really using a good switching fabric and can provide full throughput between all nodes, then they're going to be a worthy competitor against NVidia's NVLink technology.


I'd interpret this slide as "crossbar"
 
Joined
Nov 6, 2016
Messages
945 (0.47/day)
Location
NH, USA
System Name Lightbringer
Processor Ryzen 7 2700X
Motherboard Asus ROG Strix X470-F Gaming
Cooling Enermax Liqmax Iii 360mm AIO
Memory G.Skill Trident Z RGB 32GB (8GBx4) 3200Mhz CL 14
Video Card(s) Sapphire RX 5700XT Nitro+
Storage Hp EX950 2TB NVMe M.2, HP EX950 1TB NVMe M.2, Samsung 860 EVO 2TB
Display(s) LG 34BK95U-W 34" 5120 x 2160
Case Lian Li PC-O11 Dynamic (White)
Power Supply BeQuiet Straight Power 11 850w Gold Rated PSU
Mouse Glorious Model O (Matte White)
Keyboard Royal Kludge RK71
Software Windows 10
At 600 watts? How do they compare per unit of measurement (TFLOPs per watt)?
 
Joined
Mar 18, 2008
Messages
5,716 (1.10/day)
System Name Virtual Reality / Bioinformatics
Processor Undead CPU
Motherboard Undead TUF X99
Cooling Noctua NH-D15
Memory GSkill 128GB DDR4-3000
Video Card(s) EVGA RTX 3090 FTW3 Ultra
Storage Samsung 960 Pro 1TB + 860 EVO 2TB + WD Black 5TB
Display(s) 32'' 4K Dell
Case Fractal Design R5
Audio Device(s) BOSE 2.0
Power Supply Seasonic 850watt
Mouse Logitech Master MX
Keyboard Corsair K70 Cherry MX Blue
VR HMD HTC Vive + Oculus Quest 2
Software Windows 10 P
I hope Intel is preparing a good software development stack to support this in the long run.
 
Joined
Sep 1, 2020
Messages
612 (0.97/day)
Location
Bulgaria
At 600 watts? How do they compare per unit of measurement (TFLOPs per watt)?
Other sites have calculated what the performance would be if PV ran at 2 GHz... maybe the 600 watt target is for the device running at a higher frequency than this sample, which is early silicon.
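For reference, a rough projection of the 2 GHz case, assuming throughput scales linearly with clock (as the article states) and that a production part keeps the same 32,768 FP32 ops/clock. Both are assumptions, and the helper name is made up for illustration.

```python
OPS_PER_CLOCK = 32_768  # FP32 ops per clock for one 2-stack package (from the article)


def projected_tflops(clock_ghz: float, ops_per_clock: int = OPS_PER_CLOCK) -> float:
    """Peak FP32 TFLOPs at a given clock, assuming perfectly linear scaling."""
    return ops_per_clock * clock_ghz / 1000.0


tflops_at_2ghz = projected_tflops(2.0)
print(f"{tflops_at_2ghz:.1f} TFLOPs at 2.0 GHz")
```

This says nothing about what power draw a 2 GHz part would need; it only scales the peak number.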
 
Joined
Jan 5, 2006
Messages
12,486 (2.09/day)
System Name AlderLake / Laptop
Processor Intel i7 12700K / Intel i3 7100U
Motherboard Gigabyte Z690 Aorus Master / HP 83A3 (U3E1)
Cooling Noctua NH-U12A 2 fans + Thermal Grizzly Kryonaut Extreme + 5 case fans / Fan
Memory 32GB DDR5 Corsair Dominator Platinum RGB 6000MHz CL36 / 8GB DDR4 HyperX CL13
Video Card(s) MSI RTX 2070 Super Gaming X Trio / Intel HD620
Storage Samsung 980 Pro 1TB + 970 Evo 500GB + 850 Pro 512GB + 860 Evo 1TB x2 / Samsung 256GB M.2 SSD
Display(s) 23.8" Dell S2417DG 165Hz G-Sync 1440p / 14" 1080p IPS Glossy
Case Be quiet! Silent Base 600 - Window / HP Pavilion
Audio Device(s) ALC1220-VB + ESS ES9118 DAC / Realtek onboard + B&O speaker system
Power Supply Seasonic Focus Plus Gold 750W / Powerbrick
Mouse Logitech MX Anywhere 2 Laser wireless / Logitech M330 wireless
Keyboard RAPOO E9270P Black 5GHz wireless / HP backlit
Software Windows 11 / Windows 10
:D Soo... Can it run Crysis? :D
 
Joined
Apr 24, 2020
Messages
1,639 (2.15/day)

I'd interpret this slide as "crossbar"

Thanks for the slide.

Unfortunately, it's giving me more questions than answers. The ArchDay21claims site doesn't provide details (https://edc.intel.com/content/www/us/en/products/performance/benchmarks/architecture-day-2021/). I don't know if that's 90 Gbit/sec or 90 GByte/sec, for example.

8x links gets us to 720 "G" per second. Hopefully that's "GBytes", which would be a bit faster than NVSwitch and competitive. But if it's "Gbits", then that's only 90 GByte/sec in total (which is probably passable, but much slower than NVidia). It's "passable" because 16x PCIe 4 is just 32 GByte/sec, so really, anything "faster than PCIe" is kind of a win. But I'm assuming Intel is aiming at the big boy, the A100's 600 GByte/sec fabric.
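To make the two readings of the slide concrete, here is a small sketch of the arithmetic. The 90 "G" per link and 8 links come from the slide; the PCIe 4.0 x16 and A100 numbers are the ones quoted in this post. Which unit Intel actually means is unknown, so both cases are computed.

```python
LINKS_PER_NODE = 8
PER_LINK_G = 90           # "G" per second per link; unit not stated on the slide
PCIE4_X16_GBYTES = 32     # GByte/s, as quoted in the post
A100_FABRIC_GBYTES = 600  # GByte/s, as quoted in the post

aggregate_g = LINKS_PER_NODE * PER_LINK_G

# Reading 1: the slide means GByte/s.
gbytes_if_bytes = float(aggregate_g)
# Reading 2: the slide means Gbit/s, so divide by 8 bits per byte.
gbytes_if_bits = aggregate_g / 8

for label, gbytes in (("GByte/s", gbytes_if_bytes), ("Gbit/s", gbytes_if_bits)):
    print(
        f"slide in {label}: {gbytes:.0f} GByte/s aggregate, "
        f"{gbytes / PCIE4_X16_GBYTES:.1f}x PCIe 4.0 x16, "
        f"{gbytes / A100_FABRIC_GBYTES:.2f}x A100 fabric"
    )
```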

------

Note: most "crossbars" are just nonblocking CLOS networks. :) I think people use the term "crossbar" as shorthand for a "switch that has no restriction on bandwidth" (which a nonblocking CLOS network qualifies as), and not necessarily a "physical crossbar" (which takes up O(n^2) space, while a CLOS network is O(n*log(n)) space).
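A quick way to see the O(n^2) versus O(n*log(n)) gap is to count crosspoints. The sketch below uses a Benes network (a Clos network built recursively from 2x2 switches) as the stand-in for the "CLOS" case, which is an assumption: a Benes network is rearrangeably nonblocking, and nothing here reflects how Xe Link is actually built.

```python
def crossbar_crosspoints(n: int) -> int:
    """A physical n x n crossbar needs one crosspoint per input/output pair."""
    return n * n


def benes_crosspoints(n: int) -> int:
    """Crosspoints in an n-port Benes network built from 2x2 switches.

    It has 2*log2(n) - 1 stages of n/2 switches, each with 4 crosspoints.
    n must be a power of two and at least 2.
    """
    if n < 2 or n & (n - 1):
        raise ValueError("n must be a power of two >= 2")
    stages = 2 * (n.bit_length() - 1) - 1
    return stages * (n // 2) * 4


for ports in (8, 64, 1024):
    print(ports, crossbar_crosspoints(ports), benes_crosspoints(ports))
```

For very small port counts the crossbar is actually cheaper; the multistage network only wins as the port count grows.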
 
Joined
Sep 15, 2007
Messages
3,926 (0.73/day)
Location
Police/Nanny State of America
System Name More hardware than I use :|
Processor 4.7 8350 - 4.2 4560K - 4.4 4690K
Motherboard Sabertooth R2.0 - Gigabyte Z87X-UD4H-CF - AsRock Z97M KIller
Cooling Mugen 2 rev B push/pull - Hyper 212+ push/pull - Hyper 212+
Memory 16GB Gskill - 8GB Gskill - 16GB Ballistix 1.35v
Video Card(s) Xfire OCed 7950s - Powercolor 290x - Oced Zotac 980Ti AMP! (also have two 7870s)
Storage Crucial 250GB SSD, Kingston 3K 120GB, Sammy 1TB, various WDs, 13TB (actual capactity) NAS with WDs
Display(s) X-star 27" 1440 - Auria 27" 1440 - BenQ 24" 1080 - Acer 23" 1080
Case Lian Li open bench - Fractal Design ARC - Thermaltake Cube (still have HAF 932 and more ARCs)
Audio Device(s) Titanium HD - Onkyo HT-RC360 Receiver - BIC America custom 5.1 set up (and extra Klipsch sub)
Power Supply Corsair 850W V2 - EVGA 1000 G2 - Seasonic 500 and 600W units (dead 750W needs RMA lol)
Mouse Logitech G5 - Sentey Revolution Pro - Sentey Lumenata Pro - multiple wireless logitechs
Keyboard Logitech G11s - Thermaltake Challenger
Software I wish I could kill myself instead of using windows (OSX can suck it too).
Sure it does. Also, where's the chiller hiding? LOLtel really earning their name (and using tsmc makes it even better).
 
Joined
Jun 1, 2021
Messages
97 (0.27/day)
View attachment 213480

This picture is a pretty big deal if Intel is being honest about the architecture.

Xe-link is probably not a crossbar as indicated in the picture (I'm assuming it's closer to a CLOS network or Benes Network). But the idea is that a switch can provide an any-to-any connection at full speed between all nodes. If they're really using a good switching fabric and can provide full throughput between all nodes, then they're going to be a worthy competitor against NVidia's NVLink technology.
That's not a crossbar. That's a fully connected topology, as every node has a link to every other node. You can see in W1zzard's post that each of them has 8 links. It doesn't need any switch at all; there are dedicated links from each node to all other nodes.
 
Joined
Dec 29, 2010
Messages
2,625 (0.63/day)
Processor AMD 5900x
Motherboard Asus x570 Strix-E
Cooling Hardware Labs
Memory G.Skill 4000c17 2x16gb
Video Card(s) RTX 3090
Storage Sabrent
Display(s) Samsung G9
Case Phanteks 719
Audio Device(s) Fiio K5 Pro
Power Supply EVGA 1300 G2
Mouse Logitech G600
Keyboard Corsair K95
Much glue.
 

Joined
Nov 13, 2007
Messages
8,956 (1.69/day)
Location
Austin Texas
System Name Espresso Machine
Processor 12600K @ 5.2Ghz
Motherboard MSI 690-I PRO
Cooling 280MM push pull water Loop
Memory 32 GB DDR5 6200 MHZ 32-35-35-67
Video Card(s) MSI Ventus RTX 3080 UC @1810mhz 775mv
Storage 3x1TB SSDs, 2TB SSD
Display(s) LG CX OLED 48"
Case SLIGER S610
Audio Device(s) Bose Solo
Power Supply Corsair SF750
Mouse Superlight wireless
Keyboard 65% mini hyperspeed wireless
Software Windows 11
The MI300 announcement is near
And availability before Ponte Vecchio

With 70-75 tflops FP32 ....
exactly -- this is pure shareholder hype.
 
Joined
Mar 10, 2010
Messages
10,068 (2.26/day)
Location
Manchester uk
System Name RyzenGtEvo/ Asus strix scar II/Trig
Processor Amd R5 5600G/ Intel 8750H/3800X
Motherboard Crosshair hero8 impact/Asus/crosshair hero 7
Cooling 360EK extreme rad+ 360$EK slim all push, cpu ek suprim Gpu full cover all EK
Memory Corsair Vengeance Rgb pro 3600cas14 16Gb in four sticks./16Gb/16GB
Video Card(s) Sapphire refference Rx vega 64 EK waterblocked/Rtx 2060/GTX 1060
Storage Silicon power 1TB nvme/8Tb external/1Tb samsung Evo nvme 2Tb sata ssd/1Tb nvme
Display(s) Samsung UAE28"850R 4k freesync.dellshiter
Case Lianli p0-11 dynamic/strix scar2/aero cool shiter
Audio Device(s) Xfi creative 7.1 on board ,Yamaha dts av setup, corsair void pro headset
Power Supply corsair 1200Hxi/Asus stock /850 watt ?
Mouse Roccat Kova/ Logitech G wireless
Keyboard Roccat Aimo 120
VR HMD Oculus rift
Software Win 10 Pro
Benchmark Scores 8726 vega 3dmark timespy/ laptop Timespy 6506
Probably good enough to push bundles.
 
Joined
Jul 16, 2014
Messages
6,690 (2.33/day)
Location
SE Michigan
System Name Dumbass
Processor AMD FX-9370
Motherboard ASUS SABERTOOTH 990FX R2.0 +SB950
Cooling Artic Liquid Freezer 2 - 420mm
Memory G.Skill Sniper 16gb DDR3 2400
Video Card(s) GreenTeam 1080 Gaming X 8GB
Storage Samsung EVO 500gb & 1Tb, 2tb HDD
Display(s) 1x Nixeus NX_EDG27, 2x Dell S2440L (16:9)
Case Phanteks Enthoo Primo w/8 140mm SP Fans
Audio Device(s) onboard (realtek?) - SPKRS:Logitech Z623 200w 2.1
Power Supply Corsair HX1000i
Mouse Logitech G604
Keyboard Logitech G910 Orion Spark
Software windows 10
Benchmark Scores https://i.imgur.com/aoz3vWY.jpg?2
At 600 W, redundant PSUs might be needed.
 
Joined
Nov 11, 2016
Messages
2,192 (1.08/day)
System Name The de-ploughminator
Processor I7 9900K @ 5.1Ghz
Motherboard Gigabyte Z370 Gaming 5
Cooling Custom Watercooling
Memory 4x8GB G.Skill Trident Neo 3600mhz 15-15-15-30
Video Card(s) RTX 3090 + Bitspower WB
Storage Plextor 512GB nvme SSD
Display(s) LG OLED CX48"
Case Lian Li 011D Dynamic
Audio Device(s) Creative AE-5
Power Supply Corsair HX850
Mouse Razor Viper Ultimate
Keyboard Corsair K75
Software Win10
huh, isn't the RTX3090 already capable of ~36 TFLOPS of FP32 at 350W TGP, what's so special about a MCM solution getting 45 TFLOPS at 600W LMAO.
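Putting the two figures from this post side by side as TFLOPs per watt is a one-liner. This is a back-of-the-envelope sketch only: the 600 W for Ponte Vecchio is a rumor for A0 silicon, and the RTX 3090 number is the advertised FP32 figure, so neither is a measured efficiency.

```python
def tflops_per_watt(tflops: float, watts: float) -> float:
    """Simple efficiency ratio; inputs are peak/advertised numbers, not measurements."""
    return tflops / watts


ponte_vecchio_a0 = tflops_per_watt(45.0, 600.0)  # 600 W is rumored, per the article
rtx_3090 = tflops_per_watt(36.0, 350.0)          # ~36 TFLOPS at 350 W, per the post

print(f"Ponte Vecchio A0: {ponte_vecchio_a0:.3f} TFLOPs/W")
print(f"RTX 3090:         {rtx_3090:.3f} TFLOPs/W")
```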
 
Joined
Apr 24, 2020
Messages
1,639 (2.15/day)
FP64 is 1:1 FP32. So its FP64 throughput is identical.

Are you sure? Most of the time, FP64 is 1:2 FP32 (half-speed).

AVX512, A100, MI100, etc. etc. All the same. If you double the bits, you double the RAM bandwidth needed and therefore halve the speed (100 64-bit numbers is 800 bytes; 100 32-bit numbers is just 400 bytes).

Since RAM is moving effectively at half speed, it "just makes sense" for compute to also move at 1/2 speed.
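The byte counts behind that argument, as a tiny sketch. The bandwidth figure in the second half is an arbitrary placeholder chosen for illustration, not a spec for any of the parts mentioned.

```python
def bytes_needed(count: int, bits_per_value: int) -> int:
    """Memory traffic to stream `count` values of the given width."""
    return count * bits_per_value // 8


fp64_bytes = bytes_needed(100, 64)
fp32_bytes = bytes_needed(100, 32)
print(fp64_bytes, fp32_bytes)

# At a fixed memory bandwidth, values streamed per second halve when width doubles.
BANDWIDTH_BYTES_PER_SEC = 1_000_000_000  # hypothetical 1 GB/s, for illustration only
fp64_values_per_sec = BANDWIDTH_BYTES_PER_SEC // 8
fp32_values_per_sec = BANDWIDTH_BYTES_PER_SEC // 4
print(fp32_values_per_sec / fp64_values_per_sec)
```

This only models the memory side; whether the ALUs themselves run FP64 at full or half rate is a separate design choice.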
 
Joined
Sep 1, 2020
Messages
612 (0.97/day)
Location
Bulgaria
huh, isn't the RTX3090 already capable of ~36 TFLOPS of FP32 at 350W TGP, what's so special about a MCM solution getting 45 TFLOPS at 600W LMAO.
It has been said 100,000 times that Nvidia lies with this number. The real teraflops are half of the advertised teraflops.
 
Joined
Apr 24, 2020
Messages
1,639 (2.15/day)
huh, isn't the RTX3090 already capable of ~36 TFLOPS of FP32 at 350W TGP, what's so special about a MCM solution getting 45 TFLOPS at 600W LMAO.

NVidia A100 (the $10,000 server card) is only 19.5 FP32 TFlops: https://www.nvidia.com/content/dam/en-zz/Solutions/Data-Center/a100/pdf/nvidia-a100-datasheet.pdf

And only 9.7 FP64 TFlops.

The Tensor-flops are an elevated number that only deep-learning folk care about (and apparently not all deep learning folk are using those tensor cores). Achieving ~20 FP32 TFlops general-purpose code is basically the best today (MI100 is a little bit faster, but without as much of that NVlink thing going on).

So 45 TFlops of FP32 is pretty huge by today's standards. However, Intel is going to be competing against the next-generation products, not the A100. I'm sure NVidia is going to grow, but 45TFlops per card is probably going to be competitive.

That's not a crossbar. That's a fully connected topology, as every node has a link to every other node. You can see in W1zzard's post that each of them has 8 links. It doesn't need any switch at all; there are dedicated links from each node to all other nodes.

Fully connected is stupid. It means that of the 720 G (bit? Byte?) available to Node A (90 G x 8 connections in Node A), you only have 90 G wired between Node A and Node B. Which means Node A and Node B can only ever talk at 90 G speeds.

What if Node B has all of the data that's important for the calculation? Well, you'd like it if Node A could communicate at 720 G (byte/sec ??) with Node B. You have 8x SerDes after all; it'd be nice to "gang up" those SerDes and have them work together.

Both a crossbar and a CLOS network would allow that. A fully connected topology cannot. This is the difference between Zen1 and Zen2, where Zen2 has a switch (probably a CLOS network, might be a crossbar) efficiently allocating RAM to all 8 nodes. Zen1 was fully connected (Node 1 had a high speed connection to Node 2, Node 3, and Node 4).

That switch is in fact, a big deal, and the key to scalability.
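A toy model of that difference, under idealized assumptions: every link runs at the slide's 90 "G" (unit unknown), a fully connected node dedicates exactly one link per peer with no forwarding through third nodes, and the switch is perfectly nonblocking. The function names are made up for illustration.

```python
PER_LINK_G = 90     # "G" per second per link, from the slide; unit unknown
LINKS_PER_NODE = 8


def pair_bandwidth_fully_connected(per_link: int, links: int) -> int:
    """One dedicated link per peer, so any single pair is capped at one link."""
    return per_link


def pair_bandwidth_nonblocking_switch(per_link: int, links: int) -> int:
    """All of a node's links go into the switch and can be ganged toward one peer."""
    return per_link * links


direct = pair_bandwidth_fully_connected(PER_LINK_G, LINKS_PER_NODE)
switched = pair_bandwidth_nonblocking_switch(PER_LINK_G, LINKS_PER_NODE)
print(f"A<->B fully connected: {direct} G, via nonblocking switch: {switched} G")
```

When traffic is spread evenly across all peers the two topologies look the same in this model; the gap only shows up when one pair needs most of the bandwidth.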
 