Tuesday, November 23rd 2021

PlayStation 3 Emulator Delivers Modest Speed-Ups with Disabled E-Cores on Intel Alder Lake Processors

According to testing performed by the team behind RPCS3, a free and open-source PlayStation 3 emulator, Intel's Alder Lake processors enjoy a modest performance boost when the E-cores are disabled. Alder Lake processors feature a hybrid configuration with high-performance P-cores and low-power E-cores. The P-cores are based on the Golden Cove architecture and can execute AVX-512 instructions, but that capability is only usable when the E-cores are disabled, since software queries the feature set of the whole package. Officially, Alder Lake processors don't support AVX-512 at all, as the little E-cores cannot execute AVX-512 instructions.
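To illustrate what "software looks at the whole package" means in practice, here is a minimal sketch of runtime feature detection on Linux, reading the flags list from /proc/cpuinfo (the path and flag names are standard Linux conventions; the helper function names are our own, and this is not RPCS3's actual detection code):

```python
# Sketch: detect whether the running CPU advertises AVX-512, the way a
# feature-detecting application might on Linux. On Alder Lake with E-cores
# enabled, "avx512f" is absent, because the reported flag set reflects what
# every core in the package can execute.

def cpu_flags(path="/proc/cpuinfo"):
    """Return the set of feature flags reported for the first logical CPU."""
    try:
        with open(path) as f:
            for line in f:
                if line.startswith("flags"):
                    return set(line.split(":", 1)[1].split())
    except OSError:
        pass  # non-Linux system or unreadable path
    return set()

def supports_avx512(flags=None):
    """AVX-512 Foundation ("avx512f") is the baseline subset the others build on."""
    if flags is None:
        flags = cpu_flags()
    return "avx512f" in flags

if __name__ == "__main__":
    print("AVX-512F supported:", supports_avx512())
```

Passing an explicit flag set makes the check testable independently of the host machine.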

Thanks to the team behind the RPCS3 emulator, we have tests suggesting that turning the E-cores off boosts both emulation speed and in-game FPS. With the E-cores disabled and only the P-cores left, the processor can execute AVX-512 code and run a higher ring ratio, which presumably lowers latency on the ring bus. The team benchmarked an Intel Core i9-12900K and a Core i9-11900K, both clocked at 5.2 GHz, with the Alder Lake chip's E-cores disabled. In God of War: Ascension, the Rocket Lake processor produced 68 FPS, while Alder Lake produced 78 FPS, a roughly 15% improvement.
This suggests that more applications could take advantage of disabling E-cores, especially applications that support AVX-512 instructions, which only the P-cores can execute. Whether more cases like this appear remains to be seen through trial and error.
Source: RPCS3

39 Comments on PlayStation 3 Emulator Delivers Modest Speed-Ups with Disabled E-Cores on Intel Alder Lake Processors

#26
Dr. Dro
chrcoluk: Post-processing AA
Save states
Massively less cable clutter and room required to house a console.
Ability to memory hack games.

Definite advantages.
I personally would not buy a processor with RPCS3 in mind, as the emulator will definitely mature in the future and the team does do targeted optimizations for Ryzen, so it's not like you're missing out here.

But I also have to confess to being the lucky owner of a model CECHA console with full hardware backwards compatibility (physical EE/GS, 4 USB ports, card readers), so until RPCS3 reaches about the same level of maturity as PCSX2 has, I can't say I'm too eager to play the few PS3 games I play through it (not to mention that most have since received PC ports), mostly because, barring save states, my console can do everything else and more. This console, with Rebug firmware and the dev kernel installed on it, is nothing short of a treat.
windwhirl: I'll add to this that there are reports of the 5800X splitting its cores over two dies instead of a single one. Not sure if those are true (so much can go wrong if the testing isn't meticulous), but it's a possibility.

Ah, nevermind, the second chiplet is always disabled
Yeah, they're disabled. Ryzen 7 and below parts have one of the chiplet slots on the packaging completely vacant.
Posted on Reply
#27
londiste
qubit: Something is badly wrong with a CPU architecture if disabling half the cores results in a performance improvement.
Software, not hardware. The result is not from disabling the cores but from software running on the correct cores.
Edit: I might be wrong about that due to AVX512.
lilunxm12Because P-core only enables AVX512, which wasn't very useful outside of several cases and may cause unexpected throttling
AFAIK it does not enable AVX-512 automatically. Either way, I am actually impressed that RPCS3 does support AVX-512 :)
Cobain: Not to mention you can play every PS3 game for free using a cloud service. They are all locked to 720p 30 FPS natively anyway. I don't believe anyone has a reason to spend their time playing 20 PS3 titles. Maybe the occasional gem here and there, like Red Dead. You complete it and move on to other games.
Cloud service is a whole different ballgame, both in terms of visual quality, due to compression artifacts, and in terms of huge input lag. On the other hand, with an emulator like RPCS3 you are not locked to 720p 30 FPS, far from it.
Posted on Reply
#28
seth1911
I gave it up. Yeah, you can run it at 4K and 60 FPS, but with what sort of CPU, a $500 one?

I now have two PS3 Slims, one on normal FW and one modded.
I can still play games via the modded console if they require a connection to a PS server.
Posted on Reply
#29
auxy
mb194dc: Pretty pointless application? You can get a PS3 off eBay for about £50 and run anything on the native hardware, saving yourself the huge upgrade cost to Alder Lake for this purpose!
Cobain: Not to mention you can play every PS3 game for free using a cloud service. They are all locked to 720p 30 FPS natively anyway. I don't believe anyone has a reason to spend their time playing 20 PS3 titles. Maybe the occasional gem here and there, like Red Dead. You complete it and move on to other games.
As others have pointed out, you two are grossly mistaken. Besides the main purpose of preservation once PS3 hardware is no longer around, emulators also allow people to play these games with various improvements, including higher frame rates (even unlocking FPS caps), higher resolutions, alternate control schemes, and quality-of-life features such as save states and memory patches ("cheats", though many are more "mods" than "cheats").
DuxCro: I actually bought a PS3 Super Slim 500GB this summer. Never had a PS3 before. Some games clearly look amazing, like Resistance 3 and Killzone 3. However, low resolution prevents them from shining. I played The Legend of Zelda: BOTW on CEMU at 4K/60 FPS and it's a game changer.
Try out Heavenly Sword on RPCS3! It's a blast when it's not running at 12 FPS!
Punkenjoy: The thing is, on Zen 2, communication between CCXes had to go through the I/O die. The Infinity Fabric could become saturated by all those accesses, and it had to compete with memory and I/O access too. And this round trip to the I/O die was costly in latency and power usage.

On Zen 3, all cores within a CCD can communicate directly with each other, but they still have to go through the I/O die via Infinity Fabric, and this has a latency impact. There are applications that are faster on the 5800X than on the 5900X because they are affected by that latency. For example:

(image removed for brevity)

But those are rare, and generally the higher frequency compensates for the latency problem. It's true that the OS should just use the 5950X as a single CCD, but that's harder to implement in real life than in theory. It's more up to the application to establish that.
This is almost, but not quite correct. Zen 2 has two separate 4-core CCXes, each with 16MB of L3 cache, per "Core Complex Die" or CCD. Ergo, a Ryzen 9 3950X has two CCDs, each with two CCXes.

Like in Zen 1 (which did not have a separate I/O die), the CCXes on a single die communicate with each other across the Infinity Fabric interface on the die itself; the signal never goes to the cIOD (the I/O die).

Otherwise you are correct, though.
Ferrum Master: More cases like this will not appear.

Where else would you need to mimic the instruction sets of a super complex Cell CPU? The raw single-core boost of over 5 GHz also contributes a lot, not only AVX-512. The added performance correlates more with the added frequency gap.
The testing was done at iso clocks, meaning the two processors were locked to the same clock rate. Also, both processors tested support AVX-512. The difference between the two is simply down to the changes between the Cypress Cove cores used in the 11900K and the Golden Cove cores in the 12900K.
Ferrum Master: Actually the emulator is usable; I have played Metal Gear Solid 4 on it. Occasional freezing is more of an issue than the lack of CPU power. It is limited to 30 FPS in-game either way, so what's the fuss?
RPCS3 has the ability to bypass 30 FPS locks in many titles.
Posted on Reply
#30
Ferrum Master
auxy: RPCS3 has the ability to bypass 30 FPS locks in many titles.
I was actually talking about the graph below in the comments. Some over-philosophizing about AMD-specific deficiencies, etc.

It is clearly seen that the gain from 6 to 8 cores is minimal on both Intel and AMD; you get more just from the clock boost within the same arch. The 5950X bench shows that the app totally doesn't know what to do with 16 threads while in-game. It may during the initial code translation phase.

As with any shit code, it likes one fast single thread, and then it escalates even further. Praising just one extension because it speeds up that one clearly unoptimized thread is kind of like licking your own balls. I understand that the Intel Software Development Emulator is nice to use. But it is still wacky code at the core, with very poor multithreading.

It will still need years of development. This news will die down just like when they added the dreaded TSX support, which was disabled afterwards in CPU firmware due to a few HW bugs. I wonder why that option even lingers in the emulator.
Posted on Reply
#31
auxy
Ferrum Master: It is clearly seen that the gain from 6 to 8 cores is minimal on both Intel and AMD; you get more just from the clock boost within the same arch. The 5950X bench shows that the app totally doesn't know what to do with 16 threads while in-game. It may during the initial code translation phase.

As with any shit code, it likes one fast single thread, and then it escalates even further. Praising just one extension because it speeds up that one clearly unoptimized thread is kind of like licking your own balls. I understand that the Intel Software Development Emulator is nice to use. But it is still wacky code at the core, with very poor multithreading.
Multithreading isn't magic. You can't make something that can't be parallelized faster by throwing more threads at it. Learn about Amdahl's law.
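Amdahl's law can be made concrete with a few lines of arithmetic (a generic sketch of the formula, not RPCS3 code):

```python
# Amdahl's law: if a fraction p of the work is parallelizable, the best
# possible speedup on n threads is 1 / ((1 - p) + p / n), because the
# serial fraction (1 - p) cannot be sped up no matter how many threads
# you throw at it.

def amdahl_speedup(p, n):
    """Upper bound on speedup for parallel fraction p across n threads."""
    return 1.0 / ((1.0 - p) + p / n)

# With 80% of the work parallelizable, extra threads hit a wall fast:
for n in (4, 8, 16):
    print(f"{n} threads -> {amdahl_speedup(0.8, n):.2f}x")
# 4 threads give 2.5x, 16 threads only 4.0x, and the limit is 5x
# no matter how many threads are available.
```

This is why adding cores past a point barely moves emulator frame rates: the serial portion dominates.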
Ferrum Master: It will still need years of development. This news will die down just like when they added the dreaded TSX support, which was disabled afterwards in CPU firmware due to a few HW bugs. I wonder why that option even lingers in the emulator.
And the option is there because it works fine on processors with functional TSX, and it gives a big speed-up.
Posted on Reply
#32
londiste
Ferrum Master: It is clearly seen that the gain from 6 to 8 cores is minimal on both Intel and AMD; you get more just from the clock boost within the same arch. The 5950X bench shows that the app totally doesn't know what to do with 16 threads while in-game. It may during the initial code translation phase.

As with any shit code, it likes one fast single thread, and then it escalates even further. Praising just one extension because it speeds up that one clearly unoptimized thread is kind of like licking your own balls.
This is a PS3 emulator. The PS3's CPU has 1 PPE thread and 6 SPE threads (plus one for security and internal stuff): 7 threads, if the game developer has done their job well.
lilunxm12: That's basically a SIMD test. The 2500, as a desktop processor, only has 1/3 the performance of the 7700HQ because the former lacks AVX2 support.
The 2500 has 1/3 the performance because it only has 4 threads; the 7700HQ has 8. When a game uses more threads, the slowdown is going to be huge. And RDR is one of those games.
Posted on Reply
#33
zlobby
Dr. Dro: Windows ecosystem is not exactly ready for such advanced technology yet.
Mhm! So advanced that one needs to press Scroll Lock to disable half of their CPU. I humbly bow down! :roll:
Posted on Reply
#34
Vya Domus
napata: as it's the best way to scale up multicore CPU performance.
This is unequivocally wrong, it's the worst way to scale multicore performance.

You can see this, for example, in CPU-Z, where the 12900K achieves about 13X scaling and the 5950X achieves about 18X.

The 12900K is a "16-core" CPU just like the 5950X, yet it can't match its multicore scaling; it's not even close, and all of this while the 5950X is also a lot more power efficient. Are you sure it isn't you who is ignorant here?
Posted on Reply
#35
Dr. Dro
zlobby: Mhm! So advanced that one needs to press Scroll Lock to disable half of their CPU. I humbly bow down! :roll:
I mean, that's precisely why. However, it's not Intel at fault here. Alder Lake actually has some amazing, state-of-the-art technology; that hardware scheduler they call the "Intel Thread Director" is, imo, hands down the best improvement an x86 processor has seen in quite some time. The truth is that Windows hopelessly relies on decades-old legacy code that nobody working at Microsoft currently understands or can do anything about, either because the OS is a Jenga tower that directly relies on it by its very design, or because of legal/patent issues...

Here, right-click your desktop, create a new text document and try naming it "COM1" or "LPT1", and you'll see what I mean. I could even go a step further: it's not only remnants from the DOS days four-plus decades past, it still contains the dialer application from the NT 3 days and all of the surrounding cruft that makes it work. Why on Earth does Windows 11 need this?



My point is that Windows is long since past its prime, and no amount of makeover Microsoft ever does to it is gonna change that. Since Windows 8's release, Microsoft's primary focus seems to have been keeping Windows' rotting corpse as neatly embalmed and dressed as possible, but major hardware design changes like this bring the nastiness outside. Eventually they'll have rewritten enough of the kernel and OS's low level functions that such a design will work, but who knows? If you have to disable half of your cores for your operating system to simply behave, something's wrong with it, and we all know what it is, we've just been telling ourselves otherwise over sheer convenience, to be frank.
Posted on Reply
#36
lilunxm12
londiste: This is a PS3 emulator. The PS3's CPU has 1 PPE thread and 6 SPE threads (plus one for security and internal stuff): 7 threads, if the game developer has done their job well.
The 2500 has 1/3 the performance because it only has 4 threads; the 7700HQ has 8. When a game uses more threads, the slowdown is going to be huge. And RDR is one of those games.
No, hyper-threading doesn't bring in real cores; it just makes shared resources be utilized more efficiently. Even in the best-case scenario (a compute workload that scales perfectly with threads and isn't memory/cache-bandwidth bound), the performance gain is usually about 30% versus hyper-threading disabled. In most games hyper-threading actually has a negative impact because of the context switching it introduces...
AVX2 is clearly the deciding factor here.
Posted on Reply
#37
zlobby
Dr. Dro: I mean, that's precisely why. However, it's not Intel at fault here. Alder Lake actually has some amazing, state-of-the-art technology; that hardware scheduler they call the "Intel Thread Director" is, imo, hands down the best improvement an x86 processor has seen in quite some time. The truth is that Windows hopelessly relies on decades-old legacy code that nobody working at Microsoft currently understands or can do anything about, either because the OS is a Jenga tower that directly relies on it by its very design, or because of legal/patent issues...

Here, right click your desktop, create a new text document and try naming it "COM1" or "LPT1", and you'll see what I mean. I could even go a step further, it's not only remnants from the DOS days four decades plus past, it still contains the dialer application from the NT 3 days in it and all of the surrounding cruft that makes it work, why on Earth does Windows 11 need this?



My point is that Windows is long since past its prime, and no amount of makeover Microsoft ever does to it is gonna change that. Since Windows 8's release, Microsoft's primary focus seems to have been keeping Windows' rotting corpse as neatly embalmed and dressed as possible, but major hardware design changes like this bring the nastiness outside. Eventually they'll have rewritten enough of the kernel and OS's low level functions that such a design will work, but who knows? If you have to disable half of your cores for your operating system to simply behave, something's wrong with it, and we all know what it is, we've just been telling ourselves otherwise over sheer convenience, to be frank.
Yes, we are on the same page, it seems.

It was my take on Intel, as they have their fair share of BS to this day.

M$ should build a modern OS from the ground up, instead of re-skinning the same old PoS each year while asking even higher prices.
Posted on Reply
#38
Ferrum Master
auxy: Multithreading isn't magic. You can't make something that can't be parallelized faster by throwing more threads at it. Learn about Amdahl's law.

And the option is there because it works fine on processors with functional TSX, and it gives a big speed-up.
Show me those functional CPUs...

Everything from Haswell to Kaby Lake has it disabled in microcode. Later ones don't have the instruction set at all.

Call me up when the emulator doesn't choke on one single thread while in-game... LLVM limitations are the key factor in the shit multithreading here, not Amdahl's law, no matter how you try to defend it. Instead of relying on brute-force AVX-512, they should be leaning on GPGPU/OpenCL to help with the complex instruction sets.

It's totally nuts to read people going on about needing a rare AVX-512 that nobody uses in home/gaming scenarios. For professional use you use a different caliber of gear, with ECC RAM included, if you want serious calculations and aren't fooling around.

Learn.
In August 2014, Intel announced that a bug exists in the TSX/TSX-NI implementation on Haswell, Haswell-E, Haswell-EP and early Broadwell CPUs, which resulted in disabling the TSX/TSX-NI feature on affected CPUs via a microcode update.[9][10][23] The bug was fixed in F-0 steppings of the vPro-enabled Core M-5Y70 Broadwell CPU in November 2014.[24]

The bug was found and then reported during a diploma thesis in the School of Electrical and Computer Engineering of the National Technical University of Athens.[25]

In October 2018, Intel disclosed a TSX/TSX-NI memory ordering issue found in Skylake processors.[26] As a result of a microcode update, HLE support was disabled in the affected CPUs, and RTM transactions would always abort in SGX and SMM modes of operation. System software would have to implement a workaround for the RTM memory ordering issue. In June 2021, Intel published a microcode update that further disables TSX/TSX-NI on various Xeon and Core processor models from Skylake through Coffee Lake and Whiskey Lake as a mitigation for unreliable behavior of a performance counter in the Performance Monitoring Unit (PMU).[27] By default, with the updated microcode, the processor would still indicate support for RTM but would always abort the transaction. System software is able to detect this mode of operation and mask support for TSX/TSX-NI from the CPUID instruction, preventing detection of TSX/TSX-NI by applications. System software may also enable the "Unsupported Software Development Mode", where RTM is fully active, but in this case RTM usage may be subject to the issues described earlier, and therefore this mode should not be enabled on production systems.

According to Intel 64 and IA-32 Architectures Optimization Reference Manual from May 2020, Volume 1, Chapter 2.5 Intel Instruction Set Architecture And Features Removed,[18] HLE has been removed from Intel products released in 2019 and later. RTM is not documented as removed. However, Intel 10th generation Comet Lake and Ice Lake CPUs, which were released in 2020, do not support TSX/TSX-NI,[28][29][30][31][32] including both HLE and RTM.

In Intel Architecture Instruction Set Extensions Programming Reference revision 41 from October 2020,[33] a new TSXLDTRK instruction set extension was documented and slated for inclusion in the upcoming Sapphire Rapids processors.
Posted on Reply
#39
Camm
ViperXTR: This is what I've been waiting to see; hope others will use RPCS3 as a CPU benchmark like they did before with Dolphin.
RPCS3 is heavily AVX-accelerated, which is cool for this use case, but AVX has little relevance in most use cases, and AVX-512 even more so.
Posted on Reply