
EPYC 9965 (192 cores ZEN5c), a design error?

The more cores a CPU has, the more cache memory it needs, since increasing the number of cores increases the "competition" between the cores for access to RAM. And with a larger amount of cache, more data is copied in advance from RAM to cache and read directly from it by the cores, thus preventing cores from idling (and consequently losing performance) due to delays in accessing data in main RAM.

With the EPYC 9965 CPU (192 cores), AMD did exactly the opposite of what logic suggests: it greatly increased the number of cores and halved the amount of L3 cache memory.
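A rough per-core comparison makes the point clearer. The core counts and L3 totals below are the commonly published figures for these SKUs, so treat them as illustrative and check them against AMD's own pages:

Code:
# Back-of-envelope L3-per-core comparison.
# Core counts and L3 totals are commonly published figures for these SKUs;
# treat them as illustrative and verify against AMD's own spec pages.
skus = {
    "EPYC 9965 (192x Zen 5c)": (192, 384),   # (cores, total L3 in MB)
    "EPYC 9755 (128x Zen 5)": (128, 512),
    "EPYC 9575F (64x Zen 5)": (64, 256),
}

for name, (cores, l3_mb) in skus.items():
    print(f"{name}: {l3_mb / cores:.1f} MB of L3 per core")

# EPYC 9965 (192x Zen 5c): 2.0 MB of L3 per core
# EPYC 9755 (128x Zen 5): 4.0 MB of L3 per core
# EPYC 9575F (64x Zen 5): 4.0 MB of L3 per core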

It seems that, in the video encoding test in the link below (done by Tom's Hardware), exactly what was described above happened: the EPYC 9965 CPU (192 cores) had poor performance, similar to the EPYC 9575F CPU, which has only 64 cores.

And on the EPYC CPU specifications pages, AMD did the "favor" of not showing which type of core the processor has (whether ZEN5, ZEN5c, etc.), nor does it show which instruction sets the processor supports or how much cache memory each CPU/chiplet has:





Source:
 
Boost clocks...
 
It seems that, in the video encoding test in the link below (done by Tom's Hardware), exactly what was described above happened: the EPYC 9965 CPU (192 cores) had poor performance, similar to the EPYC 9575F CPU, which has only 64 cores.
There will be less cache-sensitive workloads - to be fair, it's really up to the server buyer / intended workload to dictate what you need it for and which product is best for the job.
There will be a similar issue for Intel's all E-core server parts.

Nobody ever said you can swap one for the other and not see an impact.
The reduced cache size will have an impact in a number of ways, such as on processes that spread threads over many cores, not just on memory buffering. Again, workload choice will make a difference.

And on the EPYC CPU specifications pages, AMD did the "favor" of not showing which type of core the processor has (whether ZEN5, ZEN5c, etc.), nor does it show which instruction sets the processor supports or how much cache memory each CPU/chiplet has:
I have to admit, not identifying the core type isn't great but the cache amounts, etc., will give it away (for those who know what they're looking for).
As for CPU capabilities, unlike Intel, Zen and Zen-c cores share the same CPU caps.
A full list is handy to have, although at this point in time it supports pretty much everything except Intel's AVX10 (or AVX-512, second attempt).
 
Last edited:
Get a load of that test rig.

I could think of better things to do with it than encode videos... like virtualize several small companies' worth of servers.
 
It seems that, in the video encoding test in the link below (done by Tom's Hardware), exactly what was described above happened: the EPYC 9965 CPU (192 cores) had poor performance, similar to the EPYC 9575F CPU, which has only 64 cores.
You are hyperfocusing on a single task. First of all, video encoding has an upper limit on how much it can be parallelized, and that limit also depends on the encoder being used and on the config of said encoder.
For SVT-AV1, here are some different behaviours for different presets:
Screenshot 2024-10-15 at 00.34.52.png


A 16-core beating a 64-core CPU.
Screenshot 2024-10-15 at 00.36.34.png

Here the 64-core with higher frequencies takes the lead.

Another thing is that those encoding tests are NOT dependent on memory bandwidth. This can be easily seen here:
Screenshot 2024-10-15 at 00.38.34.png


Anyhow, Zen 5c is not meant to be encoding videos; it's meant for hyperscalers that are looking for really high core density.

And on the EPYC CPU specifications pages, AMD did the "favor" of not showing which type of core the processor has (whether ZEN5, ZEN5c, etc.), nor does it show which instruction sets the processor supports or how much cache memory each CPU/chiplet has:
AMD's spec page is awfully bad, but they did give a list of which models had Zen5c cores and which did not:
Screenshot 2024-10-15 at 00.40.41.png


The instruction set includes AVX-512; for the caches you need to dig somewhere else.
 
I don't get to use many AMD-branded prototypes. Whose case is that? Doesn't look like a Supermicro or MiTAC. Custom water loop, or is it one of those Dynatron setups? They were a bit scant on the details of the system, and I can't see many of the part numbers.
 
"SVT-AV1's greatest strength is its parallelization capability, where it outclasses other AV1 encoders by a significant margin. SVT-AV1's parallelization techniques do not involve tiling & don't harm video quality, & can comfortably utilize up to 16 cores given 1080p source video. This is while maintaining competitive coding efficiency to mainline aomenc. Perceptually, mainline SVT-AV1 is outperformed by well-tuned community forks of aomenc, but according to many the gap has begun to close with the introduction of SVT-AV1-PSY."
 
Anyhow, Zen 5c is not meant to be encoding videos; it's meant for hyperscalers that are looking for really high core density.

I'm still going to read your entire post and the whole topic carefully.

But even so, it is as I said in the first post: the more cores a CPU has, the more cache it needs so that the cores are always well supplied with data from RAM and don't sit idle waiting for data from main memory.


can comfortably utilize up to 16 cores given 1080p source video

So, SVT-AV1 is only capable of dividing the frame into blocks of 480x270 pixels (1920÷4 x 1080÷4), one per core? That must be why other CPUs with more than 64 cores did not perform much better encoding the 4K video in the test. (A 1920x1080 video has 16 blocks of 480x270 pixels, and a 3840x2160 video has 64 such blocks.)
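Checking my arithmetic with a quick script (the 480x270 block size is just my reading of the '16 cores at 1080p' quote, not something the SVT-AV1 developers state):

Code:
# How many 480x270 blocks fit in a frame at each resolution
# (integer division, ignoring partial blocks).
BLOCK_W, BLOCK_H = 480, 270

for label, (w, h) in {"1080p": (1920, 1080), "4K": (3840, 2160)}.items():
    blocks = (w // BLOCK_W) * (h // BLOCK_H)
    print(f"{label}: {blocks} blocks of {BLOCK_W}x{BLOCK_H}")

# 1080p: 16 blocks of 480x270
# 4K: 64 blocks of 480x270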

If SVT-AV1 can only give each core a block of 480x270 pixels, that is very poor programming. The team developing SVT-AV1 should have already optimized it to work very well with CPUs with hundreds of cores, dividing the video into very small blocks for each core.
 
Last edited:
SVT-AV1 rev 1.2.xx sucks performance-wise. It doesn't utilize all cores 100%. There was even a short time where it only used 1 core.
If you run some of the old 1.1 versions of SVT-AV1, they run much faster and encode at almost double the FPS.
New features like variance-boost, variance-boost-strength and variance-octile might have something to do with the performance decrease.
 
It seems that, in the video encoding test in the link below (done by Tom's Hardware), exactly what was described above happened: the EPYC 9965 CPU (192 cores) had poor performance, similar to the EPYC 9575F CPU, which has only 64 cores.
This is a bit like saying "Hey this CPU is inferior to that GPU when it comes to rendering".

I.e. you're not wrong, but you're asking the wrong question.
 
toms said:
the testing for the 192-core model isn't yet done

You can see more benchmarks on Phoronix. The Zen 5c cores have their niche applications.
And if you are buying these kinds of CPUs, you are probably paid to study the one slide with the SKUs.
 
It's just a massively niche product where the extra cores will only benefit certain kinds of workloads. Need to know what you're buying rather than just saying more cores = better.

Actually that's the same right down the product stack to a large degree. Generalised reviews and benchmarks can be incredibly misleading if your use case isn't the same...
 
I'm still going to read your entire post and the whole topic carefully.

But even so, it is as I said in the first post: the more cores a CPU has, the more cache it needs when running memory-intensive workloads so that the cores are always well supplied with data from RAM and don't sit idle waiting for data from main memory.
FTFY
 
The more cores a CPU has, the more cache memory it needs, since increasing the number of cores increases the "competition" between the cores for access to RAM. And with a larger amount of cache, more data is copied in advance from RAM to cache and read directly from it by the cores, thus preventing cores from idling (and consequently losing performance) due to delays in accessing data in main RAM.

With the EPYC 9965 CPU (192 cores), AMD did exactly the opposite of what logic suggests: it greatly increased the number of cores and halved the amount of L3 cache memory.
That's because it's aimed at cloud and virtualization workloads, an area where ARM is trying to make inroads. EPYC 9965 Turin Dense is a perfect counter to them. Same with Intel Xeon 6 "Sierra Forest".

The logic is that the core size is 25% smaller than a regular Zen 5 core, and the smaller caches make it smaller still.
 
These are perfect chips to host hundreds of VMs with bursty workloads: think things that are sold as SaaS to a lot of enterprise customers, scalable webapps/webservers that scale up and down with load during the day, and things like load balancers for large orgs.

It also means people can now condense tens of physical servers into single-box platforms, cutting down costs in power/cooling/U space.
 
That's because it's aimed at cloud and virtualization workloads, an area where ARM is trying to make inroads. EPYC 9965 Turin Dense is a perfect counter to them. Same with Intel Xeon 6 "Sierra Forest".

The logic is that the core size is 25% smaller than a regular Zen 5 core, and the smaller caches make it smaller still.

The EPYC 9965 processor has 192 cores and only 12 DDR5 memory channels (each split into two 32-bit subchannels, for 24 in total). Therefore, there is only one 32-bit subchannel for every 8 x86 cores, and the cores are constantly "competing" with each other for access to RAM.
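Spelling the ratio out both ways, since DDR5 splits each channel into two 32-bit subchannels:

Code:
# Cores per memory channel on the EPYC 9965 (192 cores, 12 DDR5 channels).
cores = 192
channels = 12                  # full DDR5 channels
subchannels = channels * 2     # each DDR5 channel = two 32-bit subchannels

print(cores / channels)        # 16.0 cores per full DDR5 channel
print(cores / subchannels)     # 8.0 cores per 32-bit subchannel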

AMD chose to make the EPYC CPUs in an MCM scheme, instead of a single die, for several reasons... In this scheme, there is an increase in latency when accessing main RAM and, to reduce or compensate for this increase in latency, the chiplets need large amounts of cache memory so they are always supplied, in advance, with data from RAM.

Apparently, AMD chose to reduce the amount of cache memory in its "c" chiplets because cache memory (SRAM), like DRAM, does not shrink with new lithography as well as the rest of the chiplet's logic does. When there is a major advance in the process used for certain parts of a chip, from 14 to 7 nm for example (50% smaller), the SRAM cache area decreases by only 3 to 5% (and sometimes it doesn't decrease at all, remaining the same as on the previous node). In short, cache memory occupies a large area (mm²) of silicon, making the chip much more expensive.
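A toy die-area example of why that matters; the 50/50 starting split and the exact shrink factors below are made-up illustrative numbers, not AMD's:

Code:
# If logic area halves on the new node but SRAM barely shrinks,
# cache eats a growing share of the die. All numbers are invented
# purely to illustrate the scaling argument above.
logic_mm2, sram_mm2 = 50.0, 50.0     # old node: 100 mm2 die, half of it cache
new_logic = logic_mm2 * 0.5          # logic area shrinks ~50%
new_sram = sram_mm2 * 0.96           # SRAM area shrinks only ~4%

total = new_logic + new_sram
print(f"new die: {total:.1f} mm2, cache share: {new_sram / total:.0%}")
# new die: 73.0 mm2, cache share: 66%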

In any case, reducing the amount of cache memory ALWAYS ends up being a false economy, as it directly affects the performance of the x86 cores: if the data is not in cache, the cores lose a lot of performance sitting idle while the memory controller fetches it from RAM.


If the processor has more cache per core, the cores spend less time idle. A processor with fewer cores but more cache per core can do as much as, or more than, one with more cores and less cache per core.
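The usual way to put a number on this is average memory access time; a minimal sketch with placeholder latencies (generic figures, not measurements of any EPYC part):

Code:
# AMAT = hit_time + miss_rate * miss_penalty.
# Latency values are generic placeholders, not measurements of any specific CPU.
def amat(hit_ns: float, miss_rate: float, miss_penalty_ns: float) -> float:
    return hit_ns + miss_rate * miss_penalty_ns

# More cache per core generally means a lower miss rate for the same workload.
print(amat(hit_ns=10, miss_rate=0.02, miss_penalty_ns=100))  # 12.0 ns
print(amat(hit_ns=10, miss_rate=0.10, miss_penalty_ns=100))  # 20.0 ns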

See also:



 
The EPYC 9965 processor has 192 cores and only 12 DDR5 memory channels (each split into two 32-bit subchannels, for 24 in total). Therefore, there is only one 32-bit subchannel for every 8 x86 cores, and the cores are constantly "competing" with each other for access to RAM.

Tl;dr
You are undoubtedly correct in your theoretical statements. But... if you can draw up a design with all those 192 cores, far more cache, and basically anything else you want, I will be very interested in how you recreate it in silicon. It will not fit on the die; the area would be too large. Perhaps by 2030, on a much smaller node produced with the new ASML 5200 scanners. But those will work with smaller maximum die sizes, so hardly... You'd probably need a special custom die. The thing is, unless you're a multi-millionaire, you're unlikely to be able to afford to buy something of this class.
 
This CPU is not aimed at video encoding, but I'm sure your vendor will have something that'll work wonders in your specific video encoding workload if you are intending to buy this scale of product. If you don't have a vendor, then this thing is not aimed at you at all.

It's meant to replace several older servers with a single node, say older Xeon Platinums that top out at 56 cores. As many have said before, this is for hyperscalers who sell their stuff by the core (and memory). To place 192 cores on a single chip, you need to cut the cache sizes; it's as simple as that. With 192 cores in a 1U platform, you can replace 2-5 older-gen servers with a single unit, giving you more opportunities for business, as there are usually two things at a premium in datacenters: space and power.
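Even on core count alone the consolidation math is easy to sketch (dual-socket on both sides is my assumption; power, memory and licensing deliberately left out):

Code:
# Core-count-only consolidation estimate: how many dual-socket 56-core
# Xeon servers one dual-socket EPYC 9965 node matches. Purely illustrative;
# it ignores per-core performance, memory capacity and licensing.
old_cores = 2 * 56     # dual-socket, 56 cores per socket
new_cores = 2 * 192    # dual-socket, 192 cores per socket

print(round(new_cores / old_cores, 1))  # ~3.4 older servers' worth of cores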
 
I could think of better things to do with it than encode videos
In a world where streaming video is a major industry, you underestimate just how important encoding video really is.

Could we please not discuss the garbage that is Cloudflare?
 
In a world where streaming video is a major industry, you underestimate just how important encoding video really is.


Could we please not discuss the garbage that is Cloudflare?
Anyone in that space that is pushing lots of video streaming or requires better scaling will be unlikely to be using CPU only for encode/transcode.
Hell, even Intel had a Xe product for that before the graphics cards were out - even zen c cores with extra density can't beat an ASIC or dedicated IP-ASIC-like blocks (when it comes to efficiency - obviously throw 100+ CPU cores at a problem and it can probably beat a lowly simple ASIC in just 1 task/stream).
 
Anyone in that space that is pushing lots of video streaming or requires better scaling will be unlikely to be using CPU only for encode/transcode.
Hell, even Intel had a Xe product for that before the graphics cards were out - even zen c cores with extra density can't beat an ASIC or dedicated IP-ASIC-like blocks (when it comes to efficiency - obviously throw 100+ CPU cores at a problem and it can probably beat a lowly simple ASIC in just 1 task/stream).
You're not understanding the context here. It doesn't matter if there's something better.
 
You're not understanding the context here. It doesn't matter if there's something better.
It seems all but a few are following different contexts of what products are good at what tasks in this thread.

I get your point and performance doing a certain task (even if it's not the ideal platform / candidate for it) will be an important metric in terms of rating performance, but that in itself defines the purpose that these things are used for.
Yeah, for sure MS/Amazon/Google/etc., will spin you up an instance of the crappiest ALU/FPU performing cloud VM to use for a set of tasks which doesn't utilise them properly - that's on you for choosing the wrong option. But...
In a world where streaming video is a major industry
... In that world, any 'major industry' participant will be looking to get the best bang per buck, and if that means specialist / specific hardware to achieve it, then awesome. You think Google, Amazon, etc., are using pure CPU power to do the video gruntwork on their offerings (like Twitch, YouTube, etc.)?
The irony is that having a server filled with several GPU cards to handle all that transcoding workload means the Zen c CPUs are actually a fine tradeoff, as they'll generally not be taxed with workloads they really aren't optimised for and can easily handle the shifting of data using less energy.
Somewhat akin to when crypto miners were connecting way more than 4 GPUs to standard motherboards, sometimes with the cheapest CPUs they could find - just keep them GPUs fed...
 
Last edited:
I don't get it. They literally have three product lines with differing amounts of cache per core depending on the workload. Obviously some tasks are less dependent on cache and are highly parallel, which is where this shines. If you need more cache, get one of the regular Zen 5 variants with 128 cores. Need even more cache? Go X3D.

How this is a design error is beyond me. They fit as many cores as they could because of customer requests. I'd wager a guess and say video encoding wasn't part of their consideration, nor were other cache sensitive apps.

These CPUs aren't meant for general tasks; whoever deploys them would only do so knowing that the workload they are optimizing for would be fastest on the EPYC 9965. And there are plenty of tasks where that is the case.
 