Tuesday, October 2nd 2018

AMD and Xilinx Announce a New World Record for AI Inference

At today's Xilinx Developer Forum in San Jose, Calif., our CEO, Victor Peng, was joined by AMD CTO Mark Papermaster for a Guinness. But not the kind that comes in a pint - the kind that comes in a record book. The companies revealed that AMD and Xilinx have been jointly working to connect AMD EPYC CPUs and the new Xilinx Alveo line of acceleration cards for high-performance, real-time AI inference processing. To back it up, they revealed a world-record inference throughput of 30,000 images per second!

The impressive system, which will be featured in the Alveo ecosystem zone at XDF today, leverages two AMD EPYC 7551 server CPUs with their industry-leading PCIe connectivity, along with eight of the freshly announced Xilinx Alveo U250 acceleration cards. The inference performance is powered by Xilinx ML Suite, which allows developers to optimize and deploy accelerated inference and supports numerous machine learning frameworks, such as TensorFlow. The benchmark was performed on GoogLeNet, a widely used convolutional neural network.
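For a sense of how a throughput figure like this is computed, here is a minimal sketch. It is hypothetical: `run_inference_batch` is a stand-in for whatever accelerated call the real benchmark harness (e.g. ML Suite) makes, not an actual API.

```python
import time

def measure_throughput(run_inference_batch, num_batches=100, batch_size=32):
    """Return inference throughput in images per second.

    run_inference_batch is a placeholder for the real accelerated call;
    it is assumed to process `batch_size` images per invocation.
    """
    start = time.perf_counter()
    for _ in range(num_batches):
        run_inference_batch(batch_size)
    elapsed = time.perf_counter() - start
    return (num_batches * batch_size) / elapsed
```

The point is simply that "images per second" is total images processed divided by wall-clock time; with eight cards, each card's batches count toward the same total.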
AMD and Xilinx have shared a common vision around the evolution of computing to heterogeneous system architecture and have a long history of technical collaboration. Both companies have optimized drivers and tuned the performance for interoperability between AMD EPYC CPUs with Xilinx FPGAs. We are also collaborating with others in the industry on cache coherent interconnect for accelerators (the CCIX Consortium - pronounced "see-six"), focused on enabling cache coherency and shared memory across multiple processors.

AMD EPYC is the perfect CPU platform for accelerating artificial intelligence and high-performance computing workloads. With 32 cores, 64 threads, 8 memory channels with up to 2 TB of memory per socket, and 128 PCIe lanes coupled with the industry's first hardware-embedded x86 server security solution, EPYC is designed to deliver the memory capacity, bandwidth, and processor cores to efficiently run memory-intensive workloads commonly seen with AI and HPC. With EPYC, customers can collect and analyze larger data sets much faster, helping them significantly accelerate complex problems.

Xilinx and AMD see a bright future in their technology collaboration. There is strong alignment between our roadmaps, pairing high-performance AMD EPYC server and graphics processors with Xilinx acceleration platforms, from the Alveo accelerator cards to the forthcoming Versal portfolio.

So, raise a pint to the future of AI inference and innovation for heterogeneous computing platforms. And don't forget to stop by and see the system in action in the Alveo ecosystem zone at the Fairmont hotel.

24 Comments on AMD and Xilinx Announce a New World Record for AI Inference

#2
Xzibit
Those are the accelerators

Alveo U250 Data Center Accelerator Card

At the heart of the Xilinx Alveo U200 and U250 accelerator cards are custom-built UltraScale+ FPGAs that run only, and optimally, on Alveo.
#3
the54thvoid
I don't think they are Vega-based. There are a lot of FPGA cards out there; my brother (and a team) designed one last year. Stratix (developed by Altera) is the chip inside his company's GPU.
#4
notb
FordGT90Concept said:
So are those Vega-based chips?
Why would it be?
It's FPGA-based, so the chip is designed precisely for inference. It's a few times faster than a GPU would be (4x faster than V100, graph below).

It's important to state that the actual Xilinx product is the Alveo accelerator and it's performing the inference tasks. CPUs are here just to run the platform and push data around. It might as well be using Xeons.
It was most likely AMD's initiative to be mentioned here.
It may have been an easy decision for Xilinx, since their main competitor is Intel as well (Stratix-based accelerator was launched just a few days ago).

As for performance, here's a comparison to some alternatives from a Xilinx whitepaper.

https://www.xilinx.com/support/documentation/white_papers/wp504-accel-dnns.pdf

And now the fun part
Arria-10 is a competing FPGA product... but not the latest one from Altera/Intel. Xilinx says:
"3. Arria-10 numbers taken Intel White Paper, "Accelerating Deep Learning with the OpenCL™ Platform and Intel Stratix 10 FPGAs."
https://builders.intel.com/docs/aibuilders/accelerating-deep-learning-with-the-opencl-platform-and-intel-stratix-10-fpgas.pdf."

But you know what's also in this white paper? Surprise... it's Stratix 10 performance! :-D
#5
breubreubreu
FordGT90Concept said:
Each card has two Quad Small Form-factor Pluggable fiber links at 100 Gb each. Why?
Maybe for InfiniBand? It would be quite useful for a cluster with several of these computers - and also mean that they could break this record again in the future, if the algorithms parallelize and scale well enough.
#6
notb
breubreubreu said:
Maybe for InfiniBand? It would be quite useful for a cluster with several of these computers - and also mean that they could break this record again in the future, if the algorithms parallelize and scale well enough.
Fairly unlikely, since inference is latency-critical, i.e. it's usually important to get the result quickly (low latency, which is not the same as high throughput).
Also, inference is not exactly a natural fit for parallelism.
The multi-thread gain here mostly comes from batching, i.e. performing calculations on many samples at the same time.
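The batching trade-off described above can be shown with a toy model (the numbers below are made up for illustration, not measurements): assume each batch costs a fixed overhead plus a per-sample cost. Throughput improves with batch size, but so does the latency every sample in the batch experiences.

```python
def batch_stats(batch_size, fixed_overhead_ms=2.0, per_sample_ms=0.1):
    """Toy latency/throughput model for batched inference.

    Assumes each batch costs a fixed overhead plus a per-sample cost.
    Returns (latency per batch in ms, throughput in samples/sec).
    """
    latency_ms = fixed_overhead_ms + batch_size * per_sample_ms
    throughput = batch_size / (latency_ms / 1000.0)
    return latency_ms, throughput

for b in (1, 8, 64):
    lat, thr = batch_stats(b)
    print(f"batch={b:3d}  latency={lat:6.1f} ms  throughput={thr:8.0f}/s")
```

In this model, batch=1 minimizes latency while batch=64 maximizes throughput, which is why batch=1 results are the interesting ones for latency-critical inference.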
#7
ArbitraryAffection
notb said:
Why would it be?
It's FPGA-based, so the chip is designed precisely for inference. It's a few times faster than a GPU would be (4x faster than V100, graph below).

It's important to state that the actual Xilinx product is the Alveo accelerator and it's performing the inference tasks. CPUs are here just to run the platform and push data around. It might as well be using Xeons.
It was most likely AMD's initiative to be mentioned here.
It may have been an easy decision for Xilinx, since their main competitor is Intel as well (Stratix-based accelerator was launched just a few days ago).

As for performance, here's a comparison to some alternatives from a Xilinx whitepaper.

https://www.xilinx.com/support/documentation/white_papers/wp504-accel-dnns.pdf

And now the fun part
Arria-10 is a competing FPGA product... but not the latest one from Altera/Intel. Xilinx says:
"3. Arria-10 numbers taken Intel White Paper, "Accelerating Deep Learning with the OpenCL™ Platform and Intel Stratix 10 FPGAs."
https://builders.intel.com/docs/aibuilders/accelerating-deep-learning-with-the-opencl-platform-and-intel-stratix-10-fpgas.pdf."

But you know what's also in this white paper? Surprise... it's Stratix 10 performance! :-D

Xeons have less on-chip I/O for these cards, so EPYC is naturally better suited to these configurations. I apologise if I'm just being overly sensitive, but it always seems like people have to discredit AMD at any given opportunity.
#8
breubreubreu
notb said:
Fairly unlikely, since inference is latency-critical, i.e. it's usually important to get the result quickly (low latency, which is not the same as high throughput).
Also, inference is not exactly a natural fit for parallelism.
The multi-thread gain here mostly comes from batching, i.e. performing calculations on many samples at the same time.
Derp, mixed up the inference with the training.

Maybe these ports can be used to access the cards in a "cluster" if there aren't enough PCIe slots? Even this is a stretch, though.
#9
SRB151
notb said:
Why would it be?
It's FPGA-based, so the chip is designed precisely for inference. It's a few times faster than a GPU would be (4x faster than V100, graph below).

It's important to state that the actual Xilinx product is the Alveo accelerator and it's performing the inference tasks. CPUs are here just to run the platform and push data around. It might as well be using Xeons.
It was most likely AMD's initiative to be mentioned here.
It may have been an easy decision for Xilinx, since their main competitor is Intel as well (Stratix-based accelerator was launched just a few days ago).

As for performance, here's a comparison to some alternatives from a Xilinx whitepaper.

https://www.xilinx.com/support/documentation/white_papers/wp504-accel-dnns.pdf

And now the fun part
Arria-10 is a competing FPGA product... but not the latest one from Altera/Intel. Xilinx says:
"3. Arria-10 numbers taken Intel White Paper, "Accelerating Deep Learning with the OpenCL™ Platform and Intel Stratix 10 FPGAs."
https://builders.intel.com/docs/aibuilders/accelerating-deep-learning-with-the-opencl-platform-and-intel-stratix-10-fpgas.pdf."

But you know what's also in this white paper? Surprise... it's Stratix 10 performance! :-D

It amazes me the lengths people will go to in order to be a fanboy. Stratix 10 isn't released and won't be until sometime next year. That graph is either a projection or based on an engineering sample, to be taken with a large grain of salt at this point, especially with no system or real-world tests.

Ordering Information
Ordering Contact: Engineering Sample (contact an Intel® sales representative)
OEM Partner Server Model: Hewlett Packard Enterprise (HPE), Available 1H 2019
#10
DeathtoGnomes
ArbitraryAffection said:
Xeons have less on-chip I/O for these cards, so EPYC is naturally better suited to these configurations. I apologise if I'm just being overly sensitive, but it always seems like people have to discredit AMD at any given opportunity.
no need to apologize, he is a known shill, a step above fanboi.
#12
londiste
notb said:
As for performance, here's a comparison to some alternatives from a Xilinx whitepaper.

Where is the MI25 in the comparison? It is sold for inferencing, with good reason.
Also, I have to wonder about the FP16/FP32 note on V100. Wasn't V100 capable of INT8 inferencing? :)
#13
jabbadap
Hmh, I'm kind of baffled by this. That Alveo U250 ain't that good on paper: 33.3 TOPS INT8 peak with a 225 W TDP, which is low compared to nvidia. Is that GoogLeNet V1 batch=1 some kind of corner case for nvidia's solutions?

londiste said:
Where is the MI25 in the comparison? It is sold for inferencing, with good reason.
Also, I have to wonder about the FP16/FP32 note on V100. Wasn't V100 capable of INT8 inferencing? :)
It's taken from nvidia's marketing materials; inferencing at INT8 ain't the V100's targeted use. They have smaller Teslas for that: P4, T4. Would be interesting to see if that GoogLeNet can take advantage of the T4's 130 INT8 Tensor TOPS.
#14
HTC
jabbadap said:
Hmh, I'm kind of baffled by this. That Alveo U250 ain't that good on paper: 33.3 TOPS INT8 peak with a 225 W TDP, which is low compared to nvidia. Is that GoogLeNet V1 batch=1 some kind of corner case for nvidia's solutions?
I know about zero concerning these types of cards.

With that disclaimer out of the way, perhaps the application these cards run manages to take full advantage of the card's capabilities while that doesn't happen with nVidia cards?
#15
jabbadap
HTC said:
I know about zero concerning these types of cards.

With that disclaimer out of the way, perhaps the application these cards run manages to take full advantage of the card's capabilities while that doesn't happen with nVidia cards?
Well yeah, I don't know. On paper the Tesla P4 has 22 INT8 TOPS vs the U250's 33.3, but the latter is almost five times faster in that use case.
#16
HTC
jabbadap said:
Well yeah, I don't know. On paper the Tesla P4 has 22 INT8 TOPS vs the U250's 33.3, but the latter is almost five times faster in that use case.
5 times faster in practice vs roughly 50% faster on paper? Definitely something with nVidia's card is hindering performance. That, or the INT8 TOPS performance nVidia claims for this card is far higher than what it actually is, no?
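For what it's worth, the arithmetic behind that "roughly 50%" works out like this (peak figures are the ones quoted in this thread; the ~5x measured number is jabbadap's reading of the whitepaper, not an independent measurement):

```python
p4_tops = 22.0      # Tesla P4 peak INT8 TOPS, as quoted above
u250_tops = 33.3    # Alveo U250 peak INT8 TOPS, as quoted above

paper_speedup = u250_tops / p4_tops   # ratio on paper, ~1.51x
measured_speedup = 5.0                # "almost five times faster"

# How much of the measured gap the paper specs fail to explain:
unexplained = measured_speedup / paper_speedup

print(f"paper: {paper_speedup:.2f}x, measured: ~{measured_speedup:.0f}x, "
      f"unexplained factor: ~{unexplained:.1f}x")
```

So roughly a 3.3x factor has to come from something other than raw TOPS, e.g. utilization at batch=1.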
#17
cdawall
HTC said:
I know about zero concerning these types of cards.

With that disclaimer out of the way, perhaps the application these cards run manages to take full advantage of the card's capabilities while that doesn't happen with nVidia cards?
It's an FPGA; these are getting programmed to do a specific task. You aren't seeing that with an NV/AMD card, which is more general purpose.

These cards are out wrecking shop in the mining world as well, posting huge numbers. I don't know who said the Stratix 10 isn't out; I have actually held one in my hands a couple months back and almost purchased a set, but it requires a much higher level of programming to set up than I was willing to put in.
#18
jabbadap
HTC said:
5 times faster in practice vs roughly 50% faster on paper? Definitely something with nVidia's card is hindering performance. That, or the INT8 TOPS performance nVidia claims for this card is far higher than what it actually is, no?
Maybe it's just that with batch=1 it's not a parallel task, and GPUs are usually built for parallel tasks. There are numbers for the Tesla V100 on nvidia's developer site.
#19
HTC
cdawall said:
It's an FPGA; these are getting programmed to do a specific task. You aren't seeing that with an NV/AMD card, which is more general purpose.

These cards are out wrecking shop in the mining world as well, posting huge numbers. I don't know who said the Stratix 10 isn't out; I have actually held one in my hands a couple months back and almost purchased a set, but it requires a much higher level of programming to set up than I was willing to put in.
Sort of like consoles, right?

Still, since on paper the difference is about 50% VS 500% actual difference ... that's a whole order of magnitude there ... something's not right, right?
#20
cdawall
HTC said:
Sort of like consoles, right?

Still, since on paper the difference is about 50% VS 500% actual difference ... that's a whole order of magnitude there ... something's not right, right?
I guess you could compare to consoles in a way. Specific targeted coding for one specific item allows someone to fully take advantage of a product.
#21
HTC
cdawall said:
I guess you could compare to consoles in a way. Specific targeted coding for one specific item allows someone to fully take advantage of a product.
Still, that difference ... what could be the cause of such a massive difference between advertised and actual?
#22
cdawall
HTC said:
Still, that difference ... what could be the cause of such a massive difference between advertised and actual?
Ask AMD.
#23
SRB151
notb said:
Of course it is. Stratix 10 has been around for a while, in multiple variants.
https://www.tomshardware.co.uk/intel-stratix-10-tx-fpga,news-57969.html

What you're thinking about is the latest Intel-built accelerator.
No, at least not the ones Intel has been plugging lately. They have 3 versions; the HBM version that they've been plugging since last year has yet to be released. None of the charts specify which card they show, but every piece of marketing I've seen has been about the HBM version, which I believe is the MX?

cdawall said:
It's an FPGA these are getting programmed to do a specific task. You aren't seeing that with an nv/amd card,which are more general purpose.

These cards are out wrecking shop in the mining world as well posting huge numbers. I don't know who said the stratix 10 isn't out I have actually like held one in my hands and stuff a couple months back almost purchased a set, but it requires a much higher level of programming to set up than I was willing to put in.
My bad, I didn't specify. The version they've been plugging since last year, the one with HBM, isn't out from everything I've read. Since Intel didn't say which one it benchmarked, I assumed it was the latest and greatest, especially since the chips are identical except for memory and some support for specific applications. They've been crowing about the HBM version for almost a year. If it has been released to customers, I stand corrected.

BTW, not to knock the white paper, but if it actually is night and day faster, how could the record have been broken? From the whitepaper, Stratix should have it with half the cards in anybody's system.
#24
Patriot
ArbitraryAffection said:
Xeons have less on-chip I/O for these cards, so EPYC is naturally better suited to these configurations. I apologise if I'm just being overly sensitive, but it always seems like people have to discredit AMD at any given opportunity.
AMD does have more raw lanes, but without PCIe switches to buffer between CPU and accelerator, they easily get bottlenecked by the x16 GMI links between dies.
It's 130 ns between dies on the same socket and 250 ns between dies on opposing sockets... so card 1 to card 2 is 130 ns, card 1 to card 4 is 250 ns, and card 2 to card 4 is 380 ns. So long as things don't have to talk to each other, and you aren't hanging NVMe off it as well, you can survive without a switch; otherwise you will quickly saturate the internal GMI and xGMI interconnects. Those are idle latencies, btw...
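Those figures add up hop by hop; a tiny sketch of that model (the latencies are the idle numbers from the post, while the hop lists are just my labeling of the topology):

```python
# Idle die-to-die latencies quoted in the post, in nanoseconds.
SAME_SOCKET_NS = 130    # between dies on the same socket (GMI)
CROSS_SOCKET_NS = 250   # between dies on opposing sockets (xGMI)

def path_latency_ns(hops):
    """Sum latency over die-to-die hops, each 'same' or 'cross'."""
    cost = {"same": SAME_SOCKET_NS, "cross": CROSS_SOCKET_NS}
    return sum(cost[h] for h in hops)

print(path_latency_ns(["same"]))           # card 1 -> card 2: 130 ns
print(path_latency_ns(["cross"]))          # card 1 -> card 4: 250 ns
print(path_latency_ns(["same", "cross"]))  # card 2 -> card 4: 380 ns
```

The 380 ns case is simply a same-socket hop plus a cross-socket hop stacked back to back, which is why card placement matters once cards need to talk to each other.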

Rome should solve most of this by straight-up doubling the PCIe lanes, bumping the RAM frequency, and adding a separate interconnect for accelerators.

Epyc is very competitive, but not without weakness.