Thursday, August 10th 2023

Atlas Fallen Optimization Fail: Gain 50% Additional Performance by Turning off the E-cores

Action RPG "Atlas Fallen" joins a long line of RPGs this Summer for you to grind into—Baldur's Gate 3, Diablo 4, and Starfield. We've been testing the game for our GPU performance article, and found something interesting—the game isn't optimized for Intel Hybrid processors, such as the Core i9-13900K "Raptor Lake" in our bench. The game scales across all CPU cores—which is normally a good thing—until we realize that not only does it saturate all of the 8 P-cores, but also the 16 E-cores. It ends up with under 80 FPS in busy gameplay at 1080p with a GeForce RTX 4090. Performance is "restored" only when the E-cores are disabled.

Normally, when a game saturates all of the E-cores, we don't interpret it as the game being "aware" of E-cores, but rather "unaware" of them. An ideal Hybrid-aware game should saturate the P-cores with its main workload, and use the E-cores for errands such as processing the audio stack (DSPs from the game), the network stack (the game's unique multiplayer network component), physics, and in-flight decompression of assets from the disk, all of which show up in Task Manager as intermittent, irregular load. "Atlas Fallen" appears to be using the E-cores for its main worker threads, and this imposes a performance penalty, as we found out by disabling the E-cores. The penalty arises because the E-cores run at lower clock speeds than the P-cores, have much lower IPC, and are cache-starved. Frame data processed on the P-cores ends up having to wait for data from the E-cores, which drags the overall framerate down.
In the Task Manager screenshot above, the game is running in the foreground; we set Task Manager to "always on top" so that the game keeps focus and Thread Director doesn't treat it as a background task. Thread Director prefers to allocate the P-cores to foreground tasks, but that doesn't happen here, because the developers chose to specifically put work on the E-cores.
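For developers, Windows 11 exposes enough topology information to implement the split described above. The following is a minimal, illustrative C++ sketch (not code from Atlas Fallen or the Fledge engine; the helper names are ours) that uses the Win32 GetSystemCpuSetInformation API to find the CPU sets with the highest EfficiencyClass, which are the P-cores on a hybrid chip, and softly restricts a worker thread to them:

```cpp
// Illustrative only: enumerate CPU sets and keep a worker thread on the
// highest-performance core class. Helper names are ours, not the game's.
#include <windows.h>
#include <vector>

// Collect the CPU set IDs whose EfficiencyClass is the highest reported;
// on an Intel hybrid CPU the E-cores report class 0, the P-cores a higher value.
static std::vector<ULONG> PCoreCpuSetIds()
{
    ULONG length = 0;
    GetSystemCpuSetInformation(nullptr, 0, &length, GetCurrentProcess(), 0);
    std::vector<BYTE> buffer(length);
    auto* info = reinterpret_cast<PSYSTEM_CPU_SET_INFORMATION>(buffer.data());
    if (!GetSystemCpuSetInformation(info, length, &length, GetCurrentProcess(), 0))
        return {};

    BYTE maxClass = 0;
    for (ULONG off = 0; off < length; ) {
        auto* e = reinterpret_cast<PSYSTEM_CPU_SET_INFORMATION>(buffer.data() + off);
        if (e->Type == CpuSetInformation && e->CpuSet.EfficiencyClass > maxClass)
            maxClass = e->CpuSet.EfficiencyClass;
        off += e->Size;
    }

    std::vector<ULONG> ids;
    for (ULONG off = 0; off < length; ) {
        auto* e = reinterpret_cast<PSYSTEM_CPU_SET_INFORMATION>(buffer.data() + off);
        if (e->Type == CpuSetInformation && e->CpuSet.EfficiencyClass == maxClass)
            ids.push_back(e->CpuSet.Id);
        off += e->Size;
    }
    return ids; // the 16 logical P-core CPU sets on a Core i9-13900K
}

// Keep a main worker thread off the E-cores; the OS remains free to
// schedule audio, networking and I/O threads onto them.
static void KeepWorkerOnPCores(HANDLE workerThread)
{
    std::vector<ULONG> pCores = PCoreCpuSetIds();
    if (!pCores.empty())
        SetThreadSelectedCpuSets(workerThread, pCores.data(),
                                 static_cast<ULONG>(pCores.size()));
}
```

CPU sets are a soft scheduling hint rather than a hard mask, which makes this approach gentler than disabling the E-cores in the BIOS.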

For comparison we took four screenshots, with E-cores enabled and disabled (through the BIOS). We picked a "typical average" scene instead of a worst case, which is why the FPS numbers are a bit higher. As you can see, framerates with E-cores enabled are pretty low (136 / 152 FPS), whereas turning off the E-cores instantly lifts performance right up to the engine's internal FPS cap (187 / 197 FPS).

With the E-cores disabled, the game is confined to what is essentially an 8-core/16-thread processor with just P-cores, which boost well above the 5.00 GHz mark and have the full 36 MB slab of L3 cache to themselves. The framerate now shoots up to 200 FPS, which is a hard framerate limit set by the developer. Our RTX 4090 should be capable of higher framerates, and developer Deck13 Interactive should consider raising the cap, given that monitor refresh rates are on the rise and it's fairly easy to find a 240 Hz or 360 Hz monitor in the high-end segment. The game is based on the Fledge engine, and supports both the DirectX 12 and Vulkan APIs. We used GeForce 536.99 WHQL in our testing. Be sure to check out our full performance review of Atlas Fallen later today.

120 Comments on Atlas Fallen Optimization Fail: Gain 50% Additional Performance by Turning off the E-cores

#1
Assimilator
Do games themselves have to be optimised/aware of P- versus E-cores? I was under the impression that Intel Thread Director + the Win11 scheduling was sufficient for this, but I guess if there's a bug in either of those components it would also manifest in this regard.
#2
Sithaer
@btarunr
You meant Diablo 4 and not 3. :)

Well I don't have E-cores anyway. :laugh: 'looking forward to the performance test'
I'm kind of interested in the game itself since it looks fun enough to me, maybe at a later point tho. 'got enough games to play for now'
#3
R0H1T
AMD right about now!
Assimilator: Intel Thread Director + the Win11 scheduling
OS scheduling is independent of Thread Director. I'm yet to see what TD actually does & how efficient/better it is compared to a similar but much better software solution I posted in the other thread!
#4
W1zzard
Assimilator: Do games themselves have to be optimised/aware of P- versus E-cores? I was under the impression that Intel Thread Director + the Win11 scheduling was sufficient for this, but I guess if there's a bug in either of those components it would also manifest in this regard.
Games don't have to be optimized, Thread Director will do the right thing automagically. The developers specifically put load on the E-Cores, which is a mechanism explicitly allowed by Thread Director / Windows 11. It seems Atlas Fallen developers either forgot that E-Cores exist (and simply designed the game to load all cores, no matter their capability), or thought they'd be smarter than Intel
#5
Nyek
Mine were disabled day 1, the day I bought the new CPU.
#6
Assimilator
W1zzard: Games don't have to be optimized, Thread Director will do the right thing automagically. The developers specifically put load on the E-Cores, which is a mechanism explicitly allowed by Thread Director / Windows 11. It seems Atlas Fallen developers either forgot that E-Cores exist (and simply designed the game to load all cores, no matter their capability), or thought they'd be smarter than Intel
Is this simply a result of bad console porting then? Given that current console games don't have to be aware of the difference between P and E cores, since said difference doesn't exist on consoles?
#7
phanbuey
Disappointing, but I guess I will have to take the 15 seconds to lasso the game.

Probably the same thing on the 7950X3D.
#8
W1zzard
Assimilator: Is this simply a result of bad console porting then? Given that current console games don't have to be aware of the difference between P and E cores, since said difference doesn't exist on consoles?
Yeah, or just "bad programming", possibly also "lack of QA testing"
phanbuey: Disappointing, but I guess I will have to take the 15 seconds to lasso the game.
In a quick test I tried setting affinity, but it didn't have the expected results, FPS are still low. I suspect the game sees x cores on startup and spawns x workers across the available cores. If you later move those x threads onto x minus 16 cores, the workloads clash
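A speculative sketch of that failure mode (we haven't seen the game's source, and the function names are invented): a worker pool sized once at startup keeps all of its threads even if the affinity mask is later restricted, so 32 workers end up contending for 16 logical processors.

```cpp
// Hypothetical illustration, not Atlas Fallen code: the pool size is fixed
// at startup, so lassoing the process to the P-cores afterwards leaves
// 32 workers fighting over 16 logical processors.
#include <thread>
#include <vector>

void StartWorkerPool()
{
    // 32 on a Core i9-13900K (8 P-cores with HT + 16 E-cores)
    const unsigned workers = std::thread::hardware_concurrency();

    std::vector<std::thread> pool;
    for (unsigned i = 0; i < workers; ++i)
        pool.emplace_back([] {
            // per-frame job loop would run here
        });
    for (std::thread& w : pool)
        w.join();
}
```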
#9
I wonder how much has changed with E-cores in gaming in general, e.g. in the latest games: are they still a work in progress, have they started making a bigger difference, or does their impact look the same as when it was popular to test them (a year or two ago)?
#10
efikkan
W1zzard: Games don't have to be optimized, Thread Director will do the right thing automagically. The developers specifically put load on the E-Cores, which is a mechanism explicitly allowed by Thread Director / Windows 11. It seems Atlas Fallen developers either forgot that E-Cores exist (and simply designed the game to load all cores, no matter their capability), or thought they'd be smarter than Intel
So in other words: software needs to work around this issue. :rolleyes:
Well, I saw this coming when I first heard of Alder Lake.

But if you argue that games should know how many cores are "fast" for spawning threads, then I would argue spawning threads dynamically for any synchronized workload is a bad idea anyways. (it doesn't matter for async workloads though…)

So how do you propose the software should know how many fast cores there are?
Using something like the CPUID instruction on each core? Is there an opcode which says something about the performance of each core?
Or are there new features in the Windows API to query for this?
(Keep in mind that there are already hybrid designs with three different core classes for ARM designs.)
#11
phanbuey
efikkan: So in other words: software needs to work around this issue. :rolleyes:
Well, I saw this coming when I first heard of Alder Lake.

But if you argue that games should know how many cores are "fast" for spawning threads, then I would argue spawning threads dynamically for any synchronized workload is a bad idea anyways. (it doesn't matter for async workloads though…)

So how do you propose the software should know how many fast cores there are?
Using something like the CPUID instruction on each core? Is there an opcode which says something about the performance of each core?
Or are there new features in the Windows API to query for this?
(Keep in mind that there are already hybrid designs with three different core classes for ARM designs.)
I can count on one hand how many times I've heard of or experienced anything like this since Alder Lake. As W1z wrote above, the developers needed to do nothing... they did something silly and we got this.
#12
W1zzard
efikkan: So in other words: software needs to work around this issue.
No, the system is designed to automatically do the right thing, like in 99% of other games out there. The only other case that I know of and have researched is CP2077, which does something "smart" using its own scheduler, which makes it fail on X3D
efikkan: So how do you propose the software should know how many fast cores there are?
There are Windows APIs for that, and also various CPU instructions
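For reference, the CPU-instruction route is CPUID: leaf 0x1A reports the type of the core the calling thread is currently running on, and leaf 0x07 carries a hybrid flag. A hedged sketch using MSVC intrinsics; note the calling thread must be pinned to one core first, or it may migrate between calls and the answer becomes meaningless.

```cpp
// Sketch of hybrid detection via CPUID (MSVC intrinsics).
#include <intrin.h>

bool IsHybridCpu()
{
    int regs[4]; // EAX, EBX, ECX, EDX
    __cpuidex(regs, 0x07, 0);
    return (regs[3] >> 15) & 1; // CPUID.07H:EDX[15] = hybrid flag
}

const char* CurrentCoreType()
{
    int regs[4];
    __cpuidex(regs, 0x1A, 0); // leaf 0x1A: hybrid information
    switch ((regs[0] >> 24) & 0xFF) { // EAX[31:24] = core type
        case 0x20: return "E-core (Intel Atom)";
        case 0x40: return "P-core (Intel Core)";
        default:   return "unknown";
    }
}
```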
#13
Assimilator
efikkan: So how do you propose the software should know how many fast cores there are?
Using something like the CPUID instruction on each core? Is there an opcode which says something about the performance of each core?
Or are there new features in the Windows API to query for this?
(Keep in mind that there are already hybrid designs with three different core classes for ARM designs.)
All of these were added with ADL and Win11.
#14
Squared
efikkan: But if you argue that games should know how many cores are "fast" for spawning threads, then I would argue spawning threads dynamically for any synchronized workload is a bad idea anyways. (it doesn't matter for async workloads though…)
Yeah, I don't know how games are supposed to use threads, but ideally you never put work that needs to be synchronous on another thread. And you especially never want to do that on all cores; what if the OS needs to do something, or the user is streaming, or the user is running a rendering task limited to just a few cores so that it wouldn't interfere much with the game? Based on what I've heard here, this game would have abysmal performance in any of those situations, even on a 7800X3D.
#15
pressing on
Assimilator: All of these were added with ADL and Win11.
And although it is hidden away in Task Manager, you can use CPU affinity in Windows 10 to get the game to use P-cores only.
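Programmatically, the Windows 10 era equivalent is a plain affinity mask. A rough sketch, assuming the common Core i9-13900K enumeration where logical processors 0-15 are the hyper-threaded P-cores (this ordering is not guaranteed on every platform):

```cpp
// Rough equivalent of the Task Manager affinity trick. Assumes logical
// processors 0-15 are the P-core hardware threads, the usual (but not
// guaranteed) enumeration on a Core i9-13900K.
#include <windows.h>

bool RestrictToPCores(HANDLE process)
{
    const DWORD_PTR pCoreMask = 0xFFFF; // bits 0-15 set
    return SetProcessAffinityMask(process, pCoreMask) != 0;
}
```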
#16
Squared
I suspect performance would also improve if just some of the cores were disabled, even the P-cores. And I suspect performance would also suffer on the 7950X. If the game is trying to use threads that are inter-dependent, then it's spending a lot of CPU resources on thread synchronization, and that will get worse as the core count increases.
#18
bug
W1zzard: Games don't have to be optimized, Thread Director will do the right thing automagically. The developers specifically put load on the E-Cores, which is a mechanism explicitly allowed by Thread Director / Windows 11. It seems Atlas Fallen developers either forgot that E-Cores exist (and simply designed the game to load all cores, no matter their capability), or thought they'd be smarter than Intel
Even so, the behavior is still strange. I mean, Cinebench also puts load on all cores, but still runs faster when also employing the E-cores. There's something fishy in that code, beyond the sloppy scheduling.
#19
atomek
The main failure was to introduce the P/E idea to desktop CPUs. Who needs efficient cores on the desktop anyway?
#20
bug
atomek: The main failure was to introduce the P/E idea to desktop CPUs. Who needs efficient cores on the desktop anyway?
They're a win for people running multithreaded workloads. Not because they're "efficient", but because they squeeze more perf per sq mm (i.e. you can fit 3-4 E-cores where only 2 P-cores would fit, and get better performance as a result).

E-cores are not a failure, but, like any heterogeneous design, the results are not uniform anymore; they will vary with workload.
#21
atomek
bug: They're a win for people running multithreaded workloads. Not because they're "efficient", but because they squeeze more perf per sq mm (i.e. you can fit 3-4 E-cores where only 2 P-cores would fit, and get better performance as a result).

E-cores are not a failure, but, like any heterogeneous design, the results are not uniform anymore; they will vary with workload.
They are, but in efficiency-limited environments like laptops. If perf per sq mm were all that mattered, then all cores should be "efficient" ones. But this is not the case. Efficient cores are nice to have in a laptop, to handle background tasks. Having efficient cores in a desktop CPU is just waste, and leads to performance degradation like here, where you can improve performance by 50% by turning them off.
#22
Squared
bug: Even so, the behavior is still strange. I mean, Cinebench also puts load on all cores, but still runs faster when also employing the E-cores. There's something fishy in that code, beyond the sloppy scheduling.
Cinebench probably has a different kind of workload. You can give a thread one large task and tell it to communicate at the end, or you can have tasks that need to communicate with each other frequently. If each thread has to communicate too frequently, it'll use a lot of its cycles just talking to other threads. And if it's waiting for another thread to finish, it won't get any work done. The more time each thread spends on its own task without talking to another thread, the better multithreading works.
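To make that concrete, here is an illustrative C++ contrast (not taken from Cinebench or the game): the first variant synchronizes once at the end, the second hits a barrier every step, so the whole group advances at the pace of the slowest core, which on a hybrid chip is an E-core.

```cpp
// Illustrative contrast between a Cinebench-style split (sync once at
// the end) and a lockstep split (sync every step). Requires C++20.
#include <barrier>
#include <cstddef>
#include <thread>
#include <vector>

// One big slice per thread; threads only "talk" when they join. A slow
// E-core delays completion but never stalls the P-cores mid-run.
void coarse_grained(std::vector<float>& data, unsigned threads)
{
    std::vector<std::thread> pool;
    const std::size_t chunk = data.size() / threads;
    for (unsigned t = 0; t < threads; ++t)
        pool.emplace_back([&data, t, chunk] {
            for (std::size_t i = t * chunk; i < (t + 1) * chunk; ++i)
                data[i] *= 2.0f;
        });
    for (std::thread& th : pool) th.join();
}

// Lockstep: every thread waits at the barrier after each step, so each
// step takes as long as the slowest participant.
void fine_grained(std::vector<float>& data, unsigned threads, int steps)
{
    std::barrier sync(static_cast<std::ptrdiff_t>(threads));
    std::vector<std::thread> pool;
    const std::size_t chunk = data.size() / threads;
    for (unsigned t = 0; t < threads; ++t)
        pool.emplace_back([&data, &sync, t, chunk, steps] {
            for (int s = 0; s < steps; ++s) {
                for (std::size_t i = t * chunk; i < (t + 1) * chunk; ++i)
                    data[i] *= 2.0f;
                sync.arrive_and_wait();
            }
        });
    for (std::thread& th : pool) th.join();
}
```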
#23
Dr_b_
Have all my E-cores disabled anyway
#24
bug
Squared: Cinebench probably has a different kind of workload. You can give a thread one large task and tell it to communicate at the end, or you can have tasks that need to communicate with each other frequently. If each thread has to communicate too frequently, it'll use a lot of its cycles just talking to other threads. And if it's waiting for another thread to finish, it won't get any work done. The more time each thread spends on its own task without talking to another thread, the better multithreading works.
It definitely has a different kind of workload. But it still doesn't make sense to reduce the overall computing power available and see the performance go up.
#25
atomek
The big.LITTLE architecture in desktop CPUs is a completely retarded idea by Intel. It came as a response to ARM efficiency, which is (and will be) out of reach for Intel, or the x86 arch in general. It makes some sense in laptops, to designate E-cores for background tasks and save battery, but it is not something that can help catch up with ARM in terms of efficiency (this is impossible due to architectural limitations of CISC - and if you respond that "Intel is RISC internally", it doesn't matter; the problem is with non-fixed-length instructions, which make optimisation of the branch predictor miserable). The funny part is that AMD without P/E is way more efficient than Intel (but 5-7 times less efficient than ARM, especially Apple's implementation of this ISA).