Friday, July 17th 2020

Linux Performance of AMD Rome vs Intel Cascade Lake, 1 Year On

Michael Larabel over at Phoronix has posted an extremely comprehensive analysis of the performance differential between AMD's Rome-based EPYC and Intel's Cascade Lake Xeons one year after release. The battery of tests, comprising more than 116 benchmark results, pits a Xeon Platinum 8280 2P system against an EPYC 7742 2P one. Both systems were benchmarked under the Ubuntu 19.04 release, chosen as the "one year ago" baseline, and again under the newer Linux software stack (Ubuntu 20.10 daily + GCC 10 + Linux 5.8).

The benchmark conclusions are interesting. For one, Intel gained more ground than AMD over the course of the year: the Xeon platform picked up 6% performance across releases, while AMD's EPYC gained just 4% over the same period. Even so, AMD's system remains an average of 14% faster than the Intel platform across all tests, which speaks to AMD's silicon superiority. Check some benchmark results below, but follow the source link for the full rundown.
Source: Phoronix

33 Comments on Linux Performance of AMD Rome vs Intel Cascade Lake, 1 Year On

#1
InVasMani
Hasn't Intel traditionally had better compiler support on the software side, or at least more widely used compilers? This seems to be what I'd expect, though there's only so much extra leeway they'll be able to gain from a compiler advantage alone.
Posted on Reply
#2
_Flare
Comparing Intel to AMD over the years, in a scenario where AMD gets pitted against Intel running well-optimized code while AMD gets inferior code, yeah, the offset is bigger.
Intel's marketing and Intel-tame software/hardware companies try to fool people who don't have their glasses as clean as they should.
Phoronix's Michael Larabel does a great job every time, keeping his benchmarks as real as possible.
Posted on Reply
#3
xkm1948
For programs that can leverage AVX-512, Intel chips still reign supreme.
Posted on Reply
#4
Makaveli
xkm1948
For programs that can leverage AVX-512, Intel chips still reign supreme.
Like the 10 pieces of software that actually use AVX-512, sure :)
Posted on Reply
#5
ncrs
xkm1948
For programs that can leverage AVX-512, Intel chips still reign supreme.
Which AVX-512? :)
Posted on Reply
#6
GoldenX
I can't wait for AMD to finally standardize AVX-512, it seems Intel needs a decade to do so.
Posted on Reply
#7
biffzinker
GoldenX
I can't wait for AMD to finally standardize AVX-512, it seems Intel needs a decade to do so.
Yeah, Intel is all over the place with their product segmentation strategy across all the CPU categories.
Posted on Reply
#8
Punkenjoy
biffzinker
Yeah, Intel is all over the place with their product segmentation strategy across all the CPU categories.
Intel should actually start by firing the team responsible for product names.
Posted on Reply
#9
tabascosauz
GoldenX
I can't wait for AMD to finally standardize AVX-512, it seems Intel needs a decade to do so.
They won't, because AVX-512 exists for Intel, which wants to push its products in specific areas like AI. Instead of actually standardizing the entire instruction family, they just pull out single instructions under the AVX-512 banner whenever the marketing team needs it, e.g. VNNI when Intel needs to market itself to deep learning.

Take a look at the horrendously fragmented list of products supporting scattered bits and pieces of AVX-512 and you'll see why it's not even remotely worth AMD's time right now.
Posted on Reply
#10
R-T-B
InVasMani
Hasn't Intel traditionally had better compiler support on the software side, or at least more widely used compilers? This seems to be what I'd expect, though there's only so much extra leeway they'll be able to gain from a compiler advantage alone.
Michael Larabel deals in Linux. There is no compiler advantage there. Heck, I don't even think anyone there USES ICC... I know you can't even build the kernel with it, for starters.
Posted on Reply
#11
GoldenX
tabascosauz
They won't, because AVX-512 exists for Intel, which wants to push its products in specific areas like AI. Instead of actually standardizing the entire instruction family, they just pull out single instructions under the AVX-512 banner whenever the marketing team needs it, e.g. VNNI when Intel needs to market itself to deep learning.

Take a look at the horrendously fragmented list of products supporting scattered bits and pieces of AVX-512 and you'll see why it's not even remotely worth AMD's time right now.
Sales numbers can and will change that. They are still in 14nm hell, and it seems like they will be for at least another 6 months. Stupid decisions like these are costing them their credibility.
Posted on Reply
#12
biffzinker
Have any of the Celerons picked up Hyper-Threading, or are they still limited to dual cores without HT? With Comet Lake they should be two cores, four threads.
Posted on Reply
#13
randomUser
The C++ application I am programming takes 35 seconds to compile using 8 threads on a 9900K (stock).
And it takes 18 seconds to compile using 8 threads on a 3700X (stock).

Now that's what I call a productive CPU.

And it takes 25 minutes to compile using 2 threads on an Allwinner A20 ARM CPU, lol.
Posted on Reply
#14
londiste
For these over 100 tests run, the AMD EPYC 7742 2P on the latest Linux software packages yielded 14% better performance over Intel's top-end non-AP Xeon Platinum 8280 dual socket server.
What's kind of weird is that there's only a 14% difference in the geomean of test results. I guess there are just too many tests that don't rely on many threads. The systems have 128 vs. 56 cores, after all.
Posted on Reply
#15
yeeeeman
So, 128 cores AMD vs 56 cores Intel and AMD wins by 14%????
Edit: Now I see it. The tests are a mix of ST and lightly/heavily MT scenarios. In any case, with very well multithreaded software you'll see a bigger difference, but I guess given that these are the current workloads in the server space, Intel is not that far off.
Posted on Reply
#16
efikkan
InVasMani
Hasn't Intel traditionally had better compiler support on the software side, or at least more widely used compilers? This seems to be what I'd expect, though there's only so much extra leeway they'll be able to gain from a compiler advantage alone.
This is only a myth.
Pretty much all software today is compiled with either GCC, LLVM, or MSVC, none of which is biased.
Of all the compiler optimizations that GCC and LLVM offer, most are generic. There are a few exceptions, like if you target Zen 2 vs. Skylake, but those are minimal, and the majority of optimizations are the same.

We can't optimize for the underlying microarchitectures, as the CPUs share a common ISA. The CPUs from Intel and AMD also behave very similarly, so in order to optimize significantly for either one, we would need some significant ISA differences. As of right now, Skylake and Zen 2 are very comparable in ISA features (while Skylake-X and Ice Lake have some new features like AVX-512 and a few other instructions). So when the ISA and general behavior are the same, the possibility of targeted optimizations favoring one of them is pretty much non-existent. Whenever you hear people claim games are "Skylake optimized" etc., that's 100% BS; they have no idea what they're talking about.
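To illustrate the point (a minimal sketch of my own, not from the benchmarks; the dot() function is a hypothetical example): a plain loop like this gets the same generic auto-vectorization from GCC and LLVM whether you build with -march=skylake or -march=znver2, because the two targets share the relevant ISA features (AVX2, FMA).

```c
#include <stddef.h>

/* A plain reduction loop. Both GCC and LLVM auto-vectorize this the
 * same way (with -O2/-O3 -ffast-math) for Skylake and Zen 2 targets,
 * since the vectorization is driven by the shared ISA, not by
 * vendor-specific tuning. */
float dot(const float *a, const float *b, size_t n) {
    float sum = 0.0f;
    for (size_t i = 0; i < n; i++)
        sum += a[i] * b[i];
    return sum;
}
```

Comparing the generated assembly for both -march targets (e.g. in Compiler Explorer) shows essentially the same vectorized inner loop.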
tabascosauz
They won't, because AVX-512 exists for Intel, which wants to push its products in specific areas like AI. Instead of actually standardizing the entire instruction family, they just pull out single instructions under the AVX-512 banner whenever the marketing team needs it, e.g. VNNI when Intel needs to market itself to deep learning.
You are clearly way off base here.
The core functionality of AVX-512 is known as AVX-512F; the others are optional extensions.
The various "AI" features are marketed as AVX-512 because they use the AVX-512 vector units, unlike other single instructions that run through the integer units.

As an additional note:
I'm not a fan of application-specific instructions. They never get widespread use, quickly become obsolete, and software relying on them is no longer forward-compatible.
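As a concrete illustration of the fragmentation (my own sketch, assuming GCC or Clang; the helper names are hypothetical): because no two product lines agree on which subsets they support, portable code has to probe for each one at runtime. The __builtin_cpu_supports() builtin takes the feature name as a literal string, so each check gets its own helper.

```c
/* Runtime probes for a few AVX-512 subsets, using the GCC/Clang
 * builtin. "avx512f" is the foundation that all subsets require;
 * BW and VNNI are optional extensions that only some product lines
 * implement, which is exactly the fragmentation problem. */
int has_avx512f(void)    { return __builtin_cpu_supports("avx512f")    != 0; }
int has_avx512bw(void)   { return __builtin_cpu_supports("avx512bw")   != 0; }
int has_avx512vnni(void) { return __builtin_cpu_supports("avx512vnni") != 0; }
```

On any real CPU the optional subsets imply the foundation: a part reporting BW or VNNI must also report F.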
Posted on Reply
#17
JB_Gamer
efikkan
This is only a myth.
...
So whenever you hear people claim games are "Skylake optimized" etc., that's 100% BS, they have no idea what they're talking about.
Isn't that the usual thing? People (myself included) carry plenty of old and irrelevant information/data with us, mainly because it's almost impossible to keep up to date with it all.
Posted on Reply
#18
Imsochobo
yeeeeman
So, 128 cores AMD vs 56 cores Intel and AMD wins by 14%????
Edit: Now I see it. The tests are a mix of ST and lightly/heavily MT scenarios. In any case, with very well multithreaded software you'll see a bigger difference, but I guess given that these are the current workloads in the server space, Intel is not that far off.
In datacenter loads, MT mostly rules, because you don't run just one application on a server.
You run virtualized, Docker, yeah...
Posted on Reply
#19
Aerpoweron
For modern supercomputers and AI, don't you just use a GPU for highly parallelized stuff like AVX-512?

AMD had the HSA stuff, but it never got adopted with the APUs.
Posted on Reply
#20
efikkan
Aerpoweron
For modern supercomputers and AI, don't you just use a GPU for highly parallelized stuff like AVX-512?

AMD had the HSA stuff, but it never got adopted with the APUs.
You raise a very valid question which many here might be wondering about, and there is an explanation.
AVX, multithreading, and GPU acceleration are all different types of parallelism, but they work at different scopes.
  • AVX works mixed in with other instructions and has a negligible overhead cost. AVX is primarily parallelization at the data level, not the logic level, which means repeated logic can be eliminated. One AVX operation costs the same as a single FP operation, so with AVX-512 you can do 16 32-bit floats at the cost of a single float. The only "cost" is the normal transfer between CPU registers. So this is parallelization at the finest level, typically a few lines of code or inside a loop.
  • Multithreading is at a coarser level than AVX. When using multiple threads, there are much higher synchronization costs, ranging from sending simple signals to sending larger pieces of data. Data hazards can also very quickly lead to stalls and inefficiency, so the proper way to scale with threads is to divide the workload into independent chunks given to each worker thread. Multiple threads also have to deal with the OS scheduler, which can cause latencies of several ms. Work chunks for threads generally range from milliseconds to seconds, while AVX works in the nanosecond range.
  • GPU acceleration has even larger synchronization costs than multithreading, but the GPU also has more computational power, so if the balance is right, GPU acceleration makes sense. The GPU is very good at computational density, while current GPUs still rely on the CPU to control the workflow at a higher level.
It's worth mentioning that many productive applications use two or all three types of parallelization, as they complement each other.
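The "independent work chunks" pattern above can be sketched in a few lines (my own illustration, using POSIX threads; parallel_sum and the chunk struct are hypothetical names): each worker sums its own slice of the array, the only synchronization is the final join, and the inner loop is exactly the kind of code the compiler can additionally vectorize with AVX.

```c
#include <pthread.h>
#include <stddef.h>

#define NTHREADS 4

/* Each thread gets an independent slice: no shared mutable state,
 * so no locks are needed until the final join. */
struct chunk { const double *data; size_t len; double sum; };

static void *sum_chunk(void *arg) {
    struct chunk *c = arg;
    double s = 0.0;
    for (size_t i = 0; i < c->len; i++)  /* vectorizable inner loop */
        s += c->data[i];
    c->sum = s;
    return NULL;
}

double parallel_sum(const double *data, size_t n) {
    pthread_t tid[NTHREADS];
    struct chunk ck[NTHREADS];
    size_t per = n / NTHREADS, off = 0;
    for (int t = 0; t < NTHREADS; t++) {
        ck[t].data = data + off;
        /* last thread takes the remainder */
        ck[t].len = (t == NTHREADS - 1) ? n - off : per;
        off += per;
        pthread_create(&tid[t], NULL, sum_chunk, &ck[t]);
    }
    double total = 0.0;
    for (int t = 0; t < NTHREADS; t++) {
        pthread_join(tid[t], NULL);
        total += ck[t].sum;
    }
    return total;
}
```

With a large enough array, each chunk amortizes the pthread_create/join overhead, which is the coarse-vs-fine balance described above.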

But when it comes to "AI" for supercomputers, this will soon be accelerated by ASICs. I see no reason why general purpose CPUs should include such features.
Posted on Reply
#23
efikkan
illli
I know of one person that doesn't like avx-512 very much

www.extremetech.com/computing/312673-linus-torvalds-i-hope-avx512-dies-a-painful-death
We already have a discussion about that, you are welcome to join it here: www.techpowerup.com/forums/threads/linus-torvalds-finds-avx-512-an-intel-gimmick-to-invent-and-win-at-benchmarks.269770/
londiste
- He dislikes FP in general. This may or may not be a reasonable stance.
FP is used a lot: in video, rendering, photo editing, games, etc.
And AVX can do integer too, which is why I often refer to the units as vector units, since they can handle both integers and floats. Integer AVX is used heavily in things like file compression.
Posted on Reply
#24
Vya Domus
GoldenX
I can't wait for AMD to finally standardize AVX-512, it seems Intel needs a decade to do so.
I hope not; very wide SIMD is a fallacy in modern computer architecture design. SIMD was introduced in the days when other massively parallel compute hardware didn't exist and everyone thought frequency and transistor counts would just scale forever with increasingly lower power consumption. That didn't hold up; the tension created by simultaneously trying to make a CPU with the fastest possible single-core performance while adding more and more cores and wider SIMD is too great. GPUs make CPU SIMD redundant: I can't think of a single application that couldn't be scaled up from x86 AVX to CUDA/OpenCL, and in fact the latter are way more robust anyway.
Posted on Reply
#25
GoldenX
You add too much latency over PCIe.
Posted on Reply