
AMD Ryzen 9 9950X

It'd be especially egregious considering Windows only utilizes two of the four security rings in x86 CPUs, the innermost "supervisor" and outermost "user" mode rings (rings 0 and 3). As I understand it, though, even when operating under a root-level account, common applications will still be in user mode; there just wouldn't be anything to inhibit system-wide (including kernel, protected, and reserved regions) access.

In the name of science, one can always try to run something with NT Authority permissions to see if that alone bypasses the problem... :oops:

View attachment 359150
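For what it's worth, Sysinternals PsExec can launch a shell as SYSTEM (psexec -s -i cmd.exe) for exactly this kind of experiment. And before trusting any numbers it's worth confirming what token the benchmark actually got; here's a minimal Windows sketch that just checks the elevation bit (note that elevation is not the same thing as a true NT AUTHORITY\SYSTEM token):

Code:
#include <windows.h>
#include <stdio.h>

/* Query whether the current process token is elevated.
   Link against advapi32. This only reports admin elevation,
   not whether we are actually running as SYSTEM. */
int main(void)
{
    HANDLE token = NULL;
    TOKEN_ELEVATION elev = { 0 };
    DWORD len = 0;

    if (OpenProcessToken(GetCurrentProcess(), TOKEN_QUERY, &token)) {
        if (GetTokenInformation(token, TokenElevation, &elev,
                                sizeof(elev), &len))
            printf("Elevated: %s\n", elev.TokenIsElevated ? "yes" : "no");
        CloseHandle(token);
    }
    return 0;
}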
Well if you are using memory integrity on 11 you'll get the hypervisor ring too, but yeah, most gamers don't do that.

It'd be worth asking if w1zzard has the memory integrity setting on in these tests, I suppose.
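For anyone wanting to check their own box before comparing numbers: as far as I know the memory integrity (HVCI) toggle is exposed under a DeviceGuard registry key, so a sketch like this should read it (the key and value names are my assumption of where Windows 11 keeps it):

Code:
#include <windows.h>
#include <stdio.h>

/* Read the Memory Integrity (HVCI) toggle from the registry.
   Key/value names assumed; an absent value usually means "off". */
int main(void)
{
    DWORD enabled = 0, size = sizeof(enabled);
    LONG rc = RegGetValueA(
        HKEY_LOCAL_MACHINE,
        "SYSTEM\\CurrentControlSet\\Control\\DeviceGuard\\Scenarios\\"
        "HypervisorEnforcedCodeIntegrity",
        "Enabled", RRF_RT_REG_DWORD, NULL, &enabled, &size);

    if (rc == ERROR_SUCCESS)
        printf("Memory integrity: %s\n", enabled ? "on" : "off");
    else
        printf("Value not found (rc=%ld), likely off\n", rc);
    return 0;
}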
 
From past experience with the 7900X3D I used to have, the difference was pretty marginal. 7800/8000 c36/38 generally gave worse latency and a 5-8% bandwidth increase, and was incredibly difficult to stabilize except on the Tachyon/Gene. Meanwhile 6400 c32 was much easier to run/stabilize (as well as getting other OCN'ers stabilized here), with better latency and only slightly less bandwidth at much more sane voltages.

Remains to be seen if DDR5 8000 is worth it when something along the lines of 6400 with tight timings provides similar results. This is especially true when only very specific workloads or synthetics can take advantage of it; potentially useless for 90%+ of users.
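For a rough sanity check on why tuned 6400 hangs with 8000: first-word latency in ns is just CL x 2000 / MT/s, and that math ignores the 1:2 MCLK:UCLK penalty most boards need above ~6400, so it actually flatters the 8000 kits. Quick sketch with illustrative kits:

Code:
#include <stdio.h>

/* First-word CAS latency: CL clocks at half the transfer rate,
   so ns = CL * 2000 / (MT/s). Kits below are just examples. */
static double cas_ns(int cl, int mts) { return cl * 2000.0 / mts; }

int main(void)
{
    struct { const char *kit; int cl, mts; } kits[] = {
        { "DDR5-6400 CL32", 32, 6400 },   /* -> 10.0 ns */
        { "DDR5-8000 CL34", 34, 8000 },   /* ->  8.5 ns */
        { "DDR5-8000 CL38", 38, 8000 },   /* ->  9.5 ns */
    };
    for (int i = 0; i < 3; i++)
        printf("%s: %.1f ns\n", kits[i].kit,
               cas_ns(kits[i].cl, kits[i].mts));
    return 0;
}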

Oh yeah, stabilizing it on the 2DPC X670 boards is a nightmare and there are very few 1DPC boards out there. But apparently that's what they are aiming for with X870, not sure how. I do hear there's a Tachyon being released, and if it's not vaporware I'll try to snag one.

I was thinking more along the lines that 8000 c34 with tuned subtimings should be possible, and then seeing how it fares vs tuned 6400.
 

LOL, @W1zzard you might have some sleepless nights ahead of you mate

I can't even. Using the hidden sys admin account changes Zen 4 and 5 performance results, basically invalidating all reviews

I wonder if this affects Zen 3 and Intel too :laugh:

Be real, invalidating? It changes nearly nothing.
 
I thought the analysis over on Chips and Cheese of the behavior of AMD's new clustered decoders on Zen 5, and of the CCD-to-CCD latency, was interesting. For whatever reason I had assumed that CCD-to-CCD latency was not vastly slower on Zen 5 than on Zen 4, and that the two decode clusters were not each tied directly to an SMT thread, so that on a lightly loaded system, or one with SMT disabled, the main thread would benefit.

 
Oh yeah, stabilizing it on the 2DPC X670 boards is a nightmare and there are very few 1DPC boards out there. But apparently that's what they are aiming for with X870, not sure how. I do hear there's a Tachyon being released, and if it's not vaporware I'll try to snag one.

I was thinking more along the lines that 8000 c34 with tuned subtimings should be possible, and then seeing how it fares vs tuned 6400.

8000 c34 will require 1.6+ vdimm and active cooling; c38 is more in the realm of 24/7 possible.

For some comparison, when I had that AM5 system up and running 6400 c32 tightened primary/secondary/tertiary required 1.46 vdd, 1.43 vddq, 1.35 vddio, 1.26 vsoc, and 1.1 vmisc.

Getting 7800 to boot on B650 required similar voltages at c38; 8000 was a nightmare and never fully stable at 1.48-1.5, and the board or IMC was clearly holding clocks back.

Many 6400/6600 and 7800/8000+ results require in excess of 1.55-1.6+ vdimm at c30-32 and c36 respectively.

I do hope there have been physical memory design improvements for the 800 series. Consistency in getting 2:1 working at DDR5 8000 across many boards would be much nicer to see than a maximum-frequency push.
 
So it looks like there are going to be some changes: a confirmed core parking/scheduling issue (guessing it will be fixed by AMD and/or Microsoft), a confirmed "admin" issue to be fixed by Microsoft, and rumored TDP updates on some models.

I don't think the difference will be huge but it won't be negligible either. Also some of the above will help Zen 4 as well.
 
Be real, invalidating? It changes nearly nothing.

A consistent 5-6% with room for variance, affecting a previous generation product as well? I'd argue it's at least worth a quick retest. Lest you forget, AMD fanboys routinely claim the Supreme Victory Royale™ over "le ebil ngreedia" whenever the 7900 XTX is exactly 2% ahead of the RTX 4080 in raster games... please make up your minds.
 
I'm starting to think that this whole 9000-series Ryzen is made to kill the AM5 platform.
Who will jump from 7xxx to 9xxx? Nobody.
Then the new "10-series" appears, granting a 20-30% "performance increase over last gen" and everybody gets hyped. Maybe the X870 chipset will help? How?

Is there any test where the 7950X fell behind the 5950X? This is pure nonsense.
 
I'm starting to think that this whole 9000-series Ryzen is made to kill the AM5 platform.
Who will jump from 7xxx to 9xxx? Nobody.
Then the new "10-series" appears, granting a 20-30% "performance increase over last gen" and everybody gets hyped. Maybe the X870 chipset will help? How?

Is there any test where the 7950X fell behind the 5950X? This is pure nonsense.
The only problem here is that AMD advertises this as a whole generational jump when really it's a refresh. If they had just said it was a refresh, very few would be surprised by the results. Intel also makes this mistake when it claims to release a new generation every year when in reality it does not.
 
Can someone please remind me why AMD needs this "Xbox game bar" nonsense for their dual-CCD CPUs to work properly, when Intel doesn't?
 
The only problem here is that AMD advertises this as a whole generational jump when really it's a refresh. If they had just said it was a refresh, very few would be surprised by the results. Intel also makes this mistake when it claims to release a new generation every year when in reality it does not.

AMD did do a lot of work on the front end, not just a mild tweak, but it doesn't seem to bear a lot of fruit. I think this is more than a refresh, which I think of as throwing cache and clock speed at a design, but it didn't hit the mark. Maybe the closest release I can compare this to from Intel is 10th vs 11th gen? Intel claimed an 18% IPC increase over Skylake, upped L1/L2, redid some things around the front end, but the CPU fell flat.

 
Can someone please remind me why AMD needs this "Xbox game bar" nonsense for their dual-CCD CPUs to work properly, when Intel doesn't?
Yes, that's another nonsense; probably the software/driver department found it was the easiest solution for detecting whether a program is a game so they can manage the power profile. How Discord manages to know is a mystery.
 
Can someone please remind me why AMD needs this "Xbox game bar" nonsense for their dual-CCD CPUs to work properly, when Intel doesn't?
They need to be treated like X3D parts because the new (server-driven) architecture has over 2x the cross-CCD latency of Zen 4.
So core parking is a must, assigning game threads to one CCD.
While this doesn't affect productivity workloads much, it vastly affects gaming ones.

Gamers can wait for the X3D parts.

AMD's new era of CPU segmentation.
We are just a little slow to wrap our heads around it, mostly because previous Ryzen gen-to-gen upgrades were very different.
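For anyone who wants to approximate what core parking does without waiting on Game Bar, hard-pinning a process to one CCD is a few lines of Windows API. A sketch, assuming a dual-CCD part where CCD0 is logical CPUs 0-15 (8 cores x 2 SMT threads); verify your own topology before trusting the mask:

Code:
#include <windows.h>
#include <stdio.h>

/* Pin the current process to CCD0 via affinity mask.
   0xFFFF = logical CPUs 0-15, an assumption about topology. */
int main(void)
{
    DWORD_PTR ccd0_mask = 0xFFFF;
    if (SetProcessAffinityMask(GetCurrentProcess(), ccd0_mask))
        printf("Pinned to CCD0 (mask 0x%llx)\n",
               (unsigned long long)ccd0_mask);
    else
        printf("SetProcessAffinityMask failed: %lu\n", GetLastError());
    /* ...launch the game from here so it inherits the mask... */
    return 0;
}

That's essentially what tools like Process Lasso automate; the Game Bar path instead signals the chipset driver to park the second CCD rather than hard-masking it.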

 
I thought the analysis over on Chips and Cheese of the behavior of AMD's new clustered decoders on Zen 5, and of the CCD-to-CCD latency, was interesting. For whatever reason I had assumed that CCD-to-CCD latency was not vastly slower on Zen 5 than on Zen 4, and that the two decode clusters were not each tied directly to an SMT thread, so that on a lightly loaded system, or one with SMT disabled, the main thread would benefit.

From the same review ~

gnr_sys_level.drawio.png


As with Zen 2, each CCD connects to the IO die through an Infinity Fabric link. On desktop, this link is 32 bytes per cycle in the read direction and 16 bytes per cycle in the write direction. That differs from AMD’s mobile parts, where the Infinity Fabric link from a core cluster can do 32 bytes per cycle in both directions. Infinity Fabric runs at 2 GHz on both setups, just as it did on desktop Zen 4. That’s not a surprise, since AMD has re-used Zen 4’s IO die for Zen 5. At that clock speed, each cluster has 64 GB/s of read bandwidth and 32 GB/s of write bandwidth to the rest of the system.
This was an issue with Zen 4 as well; they need to address it with Zen 6 or they might as well not even pretend to care about MSDT anymore!
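The quoted per-CCD figures are just bytes-per-cycle times the 2 GHz fabric clock, easy to verify:

Code:
#include <stdio.h>

/* Per-CCD Infinity Fabric bandwidth = bytes/cycle * FCLK. */
int main(void)
{
    double fclk_ghz = 2.0;
    printf("desktop read : %.0f GB/s\n", 32 * fclk_ghz); /* 64 */
    printf("desktop write: %.0f GB/s\n", 16 * fclk_ghz); /* 32 */
    printf("mobile  write: %.0f GB/s\n", 32 * fclk_ghz); /* 64 */
    return 0;
}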
 
From the same review ~

gnr_sys_level.drawio.png


This was an issue with Zen 4 as well; they need to address it with Zen 6 or they might as well not even pretend to care about MSDT anymore!
I thought Strix Point was a slightly modified Zen 5 for laptops. Zen 4 doesn't have the two decoder blocks.
 
This is just the memory subsystem or hierarchy.
z5_desktop_block_diagram.jpg

z5_mobile_block_diagram.jpg

You're probably talking about this.
 
Hi again. Don't you think the 12th, 13th, and 14th gen, well, primarily the 14th gen Intel processors, will benchmark higher if you use higher-frequency memory than what you've been using? I know we've talked about this, but that was in regard to AMD CPUs. I think if you use higher-frequency memory for the 12th, 13th, and 14th gen Intel processors they will give higher results. The thing is, with 12th gen maybe 6400 MT/s, with 13th something around 6800 MT/s. The 14th gen can scale to even higher memory frequencies for better results. I know you know this.
 
From the same review ~

gnr_sys_level.drawio.png


This was an issue with Zen 4 as well; they need to address it with Zen 6 or they might as well not even pretend to care about MSDT anymore!

Those diagrams contradict each other.
It seems quite obvious now that the AGESA has it stuck at 16 bytes per cycle on desktop when it supports the full-fat 32 bytes per cycle it needs for AVX-512 support??? Was it supposed to be able to change dynamically based on detected load or something, because that's a 50% bottleneck for writes. Lastly, that does not explain the increase in CCD-to-CCD latency at all, since cache bandwidth was increased from L1 to L2 and L2 to L3; they even mention how latencies went down a cycle for some of them.

It's honestly more impressive that the AGESA microcode could be messed up that badly by someone lol
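If someone wants to eyeball whether writes from one CCD really cap out near that figure, a crude streaming test pinned to one CCD would look something like this (Linux-flavored sketch; buffer size and iteration count are arbitrary, and a serious test would use nontemporal stores to dodge read-for-ownership traffic):

Code:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Crude write-bandwidth probe: stream a 1 GiB buffer repeatedly
   and divide bytes written by elapsed time. Run pinned to one CCD
   (e.g. via taskset) and compare against the fabric write limit. */
int main(void)
{
    size_t bytes = 1ull << 30;            /* 1 GiB, well past L3 */
    int    passes = 8;
    char  *buf = malloc(bytes);
    if (!buf) return 1;

    memset(buf, 1, bytes);                /* fault in the pages */

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < passes; i++)
        memset(buf, i, bytes);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double s = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    printf("~%.1f GB/s write (buf[0]=%d)\n",
           passes * (double)bytes / s / 1e9, buf[0]); /* keep buf live */
    free(buf);
    return 0;
}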
 
They need to be treated like X3D parts because the new (server-driven) architecture has over 2x the cross-CCD latency of Zen 4.
So core parking is a must, assigning game threads to one CCD.
While this doesn't affect productivity workloads much, it vastly affects gaming ones.

Gamers can wait for the X3D parts.

AMD's new era of CPU segmentation.
We are just a little slow to wrap our heads around it, mostly because previous Ryzen gen-to-gen upgrades were very different.

Here is something I don't get. If the new server-driven arch has so much more cross-CCD latency, wouldn't that be bad for virtualization platforms too, when a VM's virtual processors cross CCDs? I feel like this is a use case where AMD shot themselves in the foot with the increased latency, and perhaps it explains why the virtualization benchmark regressed so much from last gen instead of showing some improvement.

for reference

1723767504576.png
1723767472746.png
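One way to put a number on the cross-CCD penalty directly, inside a guest or on bare metal, is a two-thread ping-pong through a shared atomic: pin the pair to the same CCD, then to different CCDs, and compare. Linux-flavored sketch; the CPU indices are placeholders for whatever your topology actually is:

Code:
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdio.h>
#include <time.h>

/* Two threads bounce a flag; the average round trip approximates
   core-to-core latency between the two pinned CPUs. */
#define ITERS 1000000
static atomic_int flag;

static void pin(int cpu)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}

static void *ponger(void *arg)
{
    pin(*(int *)arg);
    for (int i = 0; i < ITERS; i++) {
        while (atomic_load(&flag) != 1) ;  /* wait for ping */
        atomic_store(&flag, 0);            /* pong */
    }
    return NULL;
}

int main(void)
{
    int cpu_a = 0, cpu_b = 8;  /* same-CCD vs cross-CCD: adjust these */
    pthread_t t;
    pthread_create(&t, NULL, ponger, &cpu_b);
    pin(cpu_a);

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < ITERS; i++) {
        atomic_store(&flag, 1);            /* ping */
        while (atomic_load(&flag) != 0) ;  /* wait for pong */
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);
    pthread_join(t, NULL);

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    printf("avg round trip: %.0f ns\n", ns / ITERS);
    return 0;
}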
 
Here is something I don't get. If the new server-driven arch has so much more cross-CCD latency, wouldn't that be bad for virtualization platforms too, when a VM's virtual processors cross CCDs? I feel like this is a use case where AMD shot themselves in the foot with the increased latency, and perhaps it explains why the virtualization benchmark regressed so much from last gen instead of being a 20%+ improvement.
I'm not going to claim that I've seen or know everything, but I assume you are talking about virtualization on Windows, right?
How does this workload perform on non-Windows platforms, like Linux or whatever else servers may run? And how important is it really compared to other workloads on EPYC?
 
I'm not going to claim that I've seen or know everything, but I assume you are talking about virtualization on Windows, right?
How does this workload perform on non-Windows platforms, like Linux or whatever else servers may run? And how important is it really compared to other workloads on EPYC?
Those are good questions. For the charts I posted I had to go back to the test setup in the review; it's a bit sparse on describing the Oracle VirtualBox configuration, but if I understand it correctly it was a Win11 host and a Win11 guest. I would think it wouldn't matter whether it was a type 1 or type 2 hypervisor, or which OSes were involved; the potential for cross-CCD latency would still be there in all cases unless the hypervisor and/or host OS was smart enough to prevent it. Perhaps I'm making a fuss over nothing, but I wonder if, just like for games, the other tests marked in red performed poorly for the same reasons? (edit) sorry for the word salad.
 
This is just the memory subsystem or hierarchy.
z5_desktop_block_diagram.jpg

z5_mobile_block_diagram.jpg

You're probably talking about this.
In part yes, but the Zen 4 front end looks like the below. You can see that the decoder was duplicated in Zen 5 but each cluster only works on its own SMT thread, not together when one thread is idle or disabled like on some of Intel's chips. It generally feels like they made good progress on the front end but either don't have a wide enough core, can't retire instructions fast enough, are RAM/cache limited, or hit some other bottleneck that makes these improvements more limited than expected. I thought the Chips and Cheese article makes some reasonable assumptions around this.



1723768880729.png
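Side note for anyone poking at the SMT-tied decoder behavior: it helps to know which logical CPUs are siblings before pinning test threads. On Linux that's exposed in sysfs; a quick sketch (the CPU count of 32 is assumed for a 9950X):

Code:
#include <stdio.h>

/* Print SMT sibling pairs from sysfs so test threads can be
   pinned to one physical core (or deliberately split across two). */
int main(void)
{
    char path[128], buf[64];
    for (int cpu = 0; cpu < 32; cpu++) {
        snprintf(path, sizeof(path),
                 "/sys/devices/system/cpu/cpu%d/topology/"
                 "thread_siblings_list", cpu);
        FILE *f = fopen(path, "r");
        if (!f) break;
        if (fgets(buf, sizeof(buf), f))
            printf("cpu%-2d siblings: %s", cpu, buf);
        fclose(f);
    }
    return 0;
}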
 
Those are good questions. For the charts I posted I had to go back to the test setup in the review; it's a bit sparse on describing the Oracle VirtualBox configuration, but if I understand it correctly it was a Win11 host and a Win11 guest. I would think it wouldn't matter whether it was a type 1 or type 2 hypervisor, or which OSes were involved; the potential for cross-CCD latency would still be there in all cases unless the hypervisor and/or host OS was smart enough to prevent it. Perhaps I'm making a fuss over nothing, but I wonder if, just like for games, the other tests marked in red performed poorly for the same reasons? (edit) sorry for the word salad.
Fair enough... And something else on this cross-CCD latency:
what we see is on the desktop variants, with the same, probably dated, IOD carried over from desktop 7000.
We don't really know what AMD has cooked up for EPYC on the cross-chiplet communication front.

Desktop variants are apparently not that important to them (not saying they don't care at all), so they kept cost low (no new IOD), while the X3D parts will bring the extra cache and (maybe over-) compensate for the IOD bottleneck in some workloads. Games first.
In the meantime, apps and Windows may slowly improve things by a few % by updating for the new structure.

Time will tell.
I'm not willing to dismiss them before I see the whole picture, even though the first batch has fallen below my expectations based on past experience.
Anyway, I care more about gaming performance than anything else, and those results are yet to come.
 
In part yes, but the Zen 4 front end looks like the below. You can see that the decoder was duplicated in Zen 5 but each cluster only works on its own SMT thread, not together when one thread is idle or disabled like on some of Intel's chips. It generally feels like they made good progress on the front end but either don't have a wide enough core, can't retire instructions fast enough, are RAM/cache limited, or hit some other bottleneck that makes these improvements more limited than expected. I thought the Chips and Cheese article makes some reasonable assumptions around this.



View attachment 359170
Except that contradicts what the Zen engineer said in the YouTube interview with Chips & Cheese?

Chips & Cheese is also contradicting its own interview in parts of its own review.

It is totally supposed to be able to execute with both decode pipelines and predictors when running a single thread through the core. Otherwise all those parts just sit idle doing nothing: predictor, decoder, and pipeline.

In his testing, Chips & Cheese proved it is not doing what the engineer claimed it was supposed to, or could, do in the YouTube interview. All three duplicated parts are idle when SMT is disabled. It is only using one predictor, one pipeline, and one decoder.
 