
Intel "Nova Lake-S" Core Ultra 3, Ultra 5, Ultra 7, and Ultra 9 Core Configurations Surface

Before that, E-cores were actually less efficient clock-for-clock. (See TPU E-core benchmarks for 12900K.) But I agree with your main point that E-cores are more space efficient than P-cores.
1 for 1, yes, E-cores are less efficient, but that's not the proper comparison: Intel isn't replacing 1 P-core with 1 E-core but with ~3.5. So the efficiency comparison should be 1 P-core vs 3.5 E-cores, and it goes without saying that 3.5 E-cores slam dunk the P-core in performance per watt.
 
I feel Intel has no focus, it just switches direction constantly. I also feel getting rid of Gelsinger was a mistake... but time will tell.

Unfortunately, shareholders were very angry and some heads had to roll. Gelsinger was a sacrificial lamb. LBT took over an Intel that is facing critical challenges in most of its businesses; his task of steering this ship back to safe waters will not be an easy one. If only Krzanich could have kept it in his pants, I wonder where Intel would be today.

More cores do not necessarily a better CPU make.

Guessing AMD will keep gaining market share. It seems to me, and I could be wrong, that Intel is going down another dead-end road à la NetBurst, where it chased higher MHz at the cost of all else. Now they're cramming in as many cores as they can for apps that mostly can't use them.

Comparing NVL-S to NetBurst might as well be the most exquisite, glowing hot take I have ever seen... We've been stalled on CPU core counts for a while; this is a step in the right direction as long as the power and heat can be reliably and sustainably managed.
 
Next year, nice... it will be hard to do a hand-me-down or sell my delidded 9950X3D...
 
1 for 1, yes, E-cores are less efficient, but that's not the proper comparison: Intel isn't replacing 1 P-core with 1 E-core but with ~3.5. So the efficiency comparison should be 1 P-core vs 3.5 E-cores, and it goes without saying that 3.5 E-cores slam dunk the P-core in performance per watt.

1 P-core is 66% of the area of a 4 E-core LP island. So it's 2.6 E-cores.
Lion Cove = 4.53 mm², Skymont = 1.73 mm².
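Spelling out the arithmetic with those figures: 4 × 1.73 mm² ≈ 6.92 mm² for the island, 4.53 / 6.92 ≈ 0.65 (the ~66%), and 4.53 / 1.73 ≈ 2.6 E-cores per P-core.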
 
Yes, there will always be diminishing returns.
The usefulness beyond 8 cores in interactive applications is slim.


SMT has always been about utilizing idle cycles for other work, in order to get closer to a fully saturated core. But as we know, the cost and complexity of SMT are increasing, and the benefits are diminishing as the CPU front-end gets better. This engineering effort and transistor budget is much better spent elsewhere.

SMT is only going to get even more irrelevant when APX is added, as it will greatly reduce the number of pipeline stalls. We know it's coming with Diamond Rapids, which has a similar launch window to Nova Lake, but there is no confirmation for Nova Lake yet.
Well exactly, that's what I'm saying: the less the core is bottlenecked (idling due to a bottleneck), the less Hyper-Threading is useful.
 
Well exactly, that's what I'm saying: the less the core is bottlenecked (idling due to a bottleneck), the less Hyper-Threading is useful.
It's both bottlenecking and free resources. The wider the core and the lower the IPC an application's thread can actually utilize, the more SMT/HT can show a benefit, even outside of a bottleneck or a stall on, say, a memory access. I think it might be Chips and Cheese that shows Intel's current P-cores do great vs AMD in things that are not games, because games are low-IPC implementations in general.
 
Games are low-IPC implementations in general.
Some of them are low IQ as well.
Next year, nice... it will be hard to do a hand-me-down or sell my delidded 9950X3D...
[GIF reaction]
 
It's both bottlenecking and free resources. The wider the core and the lower the IPC an application's thread can actually utilize, the more SMT/HT can show a benefit, even outside of a bottleneck or a stall on, say, a memory access.
Not the way Intel and AMD implement SMT; they just switch between two pipelines when one stalls, so it's not like the other thread can utilize free ALUs and vector units. The two most common causes of stalls are branch mispredictions and cache misses, along with various limitations in the front-end which in turn often worsen cache misses (e.g. complex logic prevents the front-end from dereferencing pointers in time, leading to cache misses). It's actually quite common that complex logic leads to more stalls, and even when the core isn't stalled, there is usually little instruction-level parallelism, which means that when another thread jumps in, many execution ports can be idle. Intel has a much more capable front-end, which is why we see larger relative gains from SMT on AMD.
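To make the ILP point a bit more concrete, here is a toy sketch (hypothetical code, not taken from any benchmark): a dependent chain like a linked-list walk spends most of its time stalled with execution ports idle, while an unrolled sum with independent accumulators keeps a wide core busy on its own.

```c
#include <stddef.h>

/* Toy illustration of low vs. high instruction-level parallelism
 * (hypothetical code, not taken from any benchmark). */

/* Low ILP: each load depends on the previous one (a linked-list walk),
 * so the core spends most of its time waiting on memory and many
 * execution ports sit idle. */
struct node { struct node *next; };

size_t walk(const struct node *n)
{
    size_t hops = 0;
    while (n) {             /* serial dependency chain: load -> load -> load */
        n = n->next;
        hops++;
    }
    return hops;
}

/* High ILP: the four partial sums are independent, so a wide core can
 * issue several adds per cycle on a single thread.
 * (Tail elements are ignored for brevity.) */
long sum4(const long *a, size_t len)
{
    long s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    for (size_t i = 0; i + 4 <= len; i += 4) {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    return s0 + s1 + s2 + s3;
}
```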

I think it might be Chips and Cheese that shows Intel's current P-cores do great vs AMD in things that are not games, because games are low-IPC implementations in general.
I think the term you're looking for is bloat, not IPC.
Games tend to have complex logic and bloat, a combination which often leads to a chain of mispredictions and cache misses, effectively making the job for the front-end very hard.
 
Huh?

Jim Keller left Intel to care for his cancer stricken sister. Look it up.
Jim Keller also left Intel because he found it very difficult to work with them; they outright refused to listen to him or his ideas.

Look it up.

I still wonder why so many cores. Are there really that many power users? My 6-core Ryzen 7600X can still handle basic tasks.

I still see those useless E-cores.

I'll wait and see how those new Intel processors will do basic tasks.
Any CPU can do basic tasks. A dual-core Pentium can browse the internet just like a Core i9.

When it comes to productivity though, even an Arrow Lake ultra-super-name i3 is going to give your 7600 a very hard time. An Ultra 5 or above will dominate it.

That productivity could be as simple as video editing, code compilation, or photo editing, all of which are very large niches. Those "useless E-cores" make a big difference in highly threaded software that used to be AMD's domain alone.
 
Not the way Intel and AMD implement SMT; they just switch between two pipelines when one stalls, so it's not like the other thread can utilize free ALUs and vector units. The two most common causes of stalls are branch mispredictions and cache misses, along with various limitations in the front-end which in turn often worsen cache misses (e.g. complex logic prevents the front-end from dereferencing pointers in time, leading to cache misses). It's actually quite common that complex logic leads to more stalls, and even when the core isn't stalled, there is usually little instruction-level parallelism, which means that when another thread jumps in, many execution ports can be idle. Intel has a much more capable front-end, which is why we see larger relative gains from SMT on AMD.

I'm not sure this is true. Googling around, I'm seeing that it will use idle resources, with stalls, mispredictions, and data dependencies being the biggest causes of idle cycles, but not exclusively. However, I can't find anything with good details on how Intel implemented SMT. AMD's branch prediction generally isn't as good as Intel's, so HT would work better for them, but I don't see AMD's HT implementation being vastly better in general.
 
And another new socket type...sheesh. New mobo, new brackets for cooling, etc etc.
Coolers for Intel platforms last across many sockets with minimal changes.
 
I'm not sure this is true. Googling around, I'm seeing that it will use idle resources, with stalls, mispredictions, and data dependencies being the biggest causes of idle cycles, but not exclusively. However, I can't find anything with good details on how Intel implemented SMT. AMD's branch prediction generally isn't as good as Intel's, so HT would work better for them, but I don't see AMD's HT implementation being vastly better in general.
AMD and Intel's branch predictors are very close to each other in accuracy.

[Chart: branch predictor accuracy comparison, AMD vs Intel]


As for SMT, a lot of the commenters are conflating it with coarse-grained multithreading. David Kanter's deep dive into Poulson, the last microarchitecture based on Itanium, has a nice diagram showing the difference between the various forms of hardware multithreading. A processor implementing SMT doesn't rely on L3 misses or other stalls to issue instructions from the other thread; both threads share the CPU in all cycles.

[Diagram: forms of hardware multithreading, from David Kanter's Poulson deep dive]
 
Well, based on the Chips and Cheese article, it's probably due to the tile layout design Intel did with ARL. Though impressively, the L0 and L1 caches are doing their job perfectly; it's like the L3 isn't used that much at all if caching data in L0 and L1 is effective to begin with.
You are misunderstanding the purpose of the L3. Look at the high latency of DRAM access for Arrow Lake; notice how it's worse than Zen 5 despite using much faster memory (DDR5 5600 CL36 vs DDR5 8000 CL40). The L3, like all of the other caches, helps hide DRAM latency. The L0 and L1, on their own, wouldn't suffice for most workloads.

[Chart: DRAM access latency, Arrow Lake vs Zen 5]
 
If that Core Ultra 9 with 144 MB of L3 cache works properly, that is, doesn't have weird RAM and cache latency problems, then AMD's next generation will need significant improvements. On the other hand, Intel might just cancel everything.
 
You are misunderstanding the purpose of the L3. Look at the high latency of DRAM access for Arrow Lake; notice how it's worse than Zen 5 despite using much faster memory (DDR5 5600 CL36 vs DDR5 8000 CL40). The L3, like all of the other caches, helps hide DRAM latency. The L0 and L1, on their own, wouldn't suffice for most workloads.

Nope, it's not that I misunderstood it. In that test, the ARL system isn't using the 200S BIOS (which, aside from the clock boost, also carries microcode fixes alongside the Windows update from December 2024), and the tester isn't tweaking anything at all, just loading XMP..lmao. Prior to that, I was using an old Windows preview build that had gotten my RPL platform's performance back on its feet, and it somehow didn't feel right using it with ARL, so I made a custom image based on the March 2025 24H2 image and reformatted my ARL setup. Everything is working as it should now: the bottleneck in COD is gone, and the FPS from the game is just 8-10 fps off what my 9950X3D does at 4K with the same GPU (and clock settings). That makes those tests a large misrepresentation of the real performance of ARL now that the 200S BIOS is shipped by OEMs. You guys really like what's being falsely fed to you by the Internet..lol

This is my 285K, on the 200S BIOS, running 8600 MT/s CL36 (my daily; I also have an 8800 MT/s CL38 kit, and a 9000 MT/s CUDIMM kit on the way). I'm normally around 76 ns when nothing is running in the background or foreground; right now I have the browser and HWiNFO running. It's a far cry from what you are showing latency-wise vs AMD. I also have a Zen 5 platform, a 9950X3D, and running the same benchmark I get around the same 76 ns with DDR5 8200 MT/s CL34 on an X870E Hero board.
[Screenshot: memory latency benchmark result on the 285K]

Their article is a good read, but some things are even sketchier than sketch. I don't really trust the Internet for tech information unless they know what they are doing, seriously...

So where did that 99 ns or whatever come from? My test is based on a real-life scenario that I use and play around with daily; it's a FAR CRY from what I am getting.
 
This is my 285K, on the 200S BIOS, running 8600 MT/s CL36 (my daily; I also have an 8800 MT/s CL38 kit, and a 9000 MT/s CUDIMM kit on the way). I'm normally around 76 ns when nothing is running in the background or foreground; right now I have the browser and HWiNFO running. It's a far cry from what you are showing latency-wise vs AMD. I also have a Zen 5 platform, a 9950X3D, and running the same benchmark I get around the same 76 ns with DDR5 8200 MT/s CL34 on an X870E Hero board.
Their article is a good read, but some things are even sketchier than sketch. I don't really trust the Internet for tech information unless they know what they are doing, seriously...

So where did that 99 ns or whatever come from? My test is based on a real-life scenario that I use and play around with daily; it's a FAR CRY from what I am getting.

I really wonder how much of Arrow Lake's technically superior IPC we lose to its actually dogshit memory subsystem. Your mem kit should beat the daylights out of mine... but a 21 ns regression despite running 1000 MT/s faster at similar timings is just a dreadful result.

[Screenshot: memory latency benchmark result for comparison]
 
I really wonder how much of Arrow Lake's technically superior IPC we lose to its actually dogshit memory subsystem. Your mem kit should beat the daylights out of mine... but a 21 ns regression despite running 1000 MT/s faster at similar timings is just a dreadful result.

My RPL is 51 ns on this test. The trade-off for the IMC on ARL vs RPL is that ARL was built to run higher speeds; I know a few folks daily driving 9400-9466 MT/s CL34 CUDIMM at around 1.5 V in Gear 2. One thing I recently discovered is that I can run the ARL IMC in 1T mode at around 8000 MT/s in Gear 2, which is leagues better than some RPL samples that can only do it at 6200 MT/s; my RPL KS runs 6000 MT/s 1T max. It's not a dreadful result. Mainly, RPL has stronger P-cores than ARL; it's the E-cores that are chugging on ARL. I would prefer the E-cores from ARL glued to the RPL P-cores, that would be a blast. I run my RPL at 6 GHz all-P and 4.7 GHz all-E; omit the devastating power draw and it's an absolute monster.

Now for IPC, the ARL E-cores are just better; the P-cores are OK as long as you clock them at a minimum of 5.5 GHz (some folks I know run them at 5.7 to 5.8). DLVR is really next level in terms of voltage regulation: you have a bypass mode and a regulation mode, which is fun to play with. I think it will be mature on the NVL platform.
 
…so I made a custom image based on the March 2025 24H2 image and reformatted my ARL setup. Everything is working as it should now: the bottleneck in COD is gone, and the FPS from the game is just 8-10 fps off what my 9950X3D does at 4K with the same GPU (and clock settings). That makes those tests a large misrepresentation of the real performance of ARL now that the 200S BIOS is shipped by OEMs. You guys really like what's being falsely fed to you by the Internet..lol
While we can't expect random people on the Internet to be experts and do honest and fair assessments, I wish the big reviewers did a much better job. Especially the popular ones on YouTube; it's often very clear from their "technological assessments" where the limits of their understanding are. And even decades of experience with benchmarking and OC doesn't mean someone has a deeper understanding of how software and hardware work.

I don't think many of the early benchmarks of Arrow Lake were fair at all, but I find it more interesting how the public perception changed because of misrepresentation. Two months earlier, when Zen 5 launched, the reception was lukewarm at best, and I was one of the few who pointed out it was a decent step forward. Then when Arrow Lake launched, since it "underperformed" in 720p/1080p and got some very "unfair" reviews, all of a sudden "everyone's" opinion shifted and Zen 5 was excellent. If anything, this is a case study in how easily the masses are persuaded to take on a polar opposite position, and it's often comical how boldly some can hold the strongest opinions without any deeper understanding of the subject matter.

This is my 285K, on the 200S BIOS, running 8600 MT/s CL36 (my daily; I also have an 8800 MT/s CL38 kit, and a 9000 MT/s CUDIMM kit on the way). I'm normally around 76 ns when nothing is running in the background or foreground; right now I have the browser and HWiNFO running. It's a far cry from what you are showing latency-wise vs AMD. I also have a Zen 5 platform, a 9950X3D, and running the same benchmark I get around the same 76 ns with DDR5 8200 MT/s CL34 on an X870E Hero board.
I'll try to keep this brief, but there is something people need to know about such benchmarks: they are at best a very rough estimate of performance. Firstly, the system clock isn't precise enough to measure something that tiny; it's not even precise down to microseconds, let alone nanoseconds, so they are clearly doing larger batches and interpolating something. Secondly, you can't control the caches like that, so what they are probably doing is using access patterns that try to circumvent every CPU feature to "force" cache misses. Thirdly, in Windows you can't control the environment enough to get consistent results on this level; you'd need a special real-time kernel and no background processes/drivers, nothing that can disturb anything. Most likely, such benchmarks use an algorithm or "correction factors" to produce "expected" results.

This also means that such benchmarks are not precise across architectures, but are rather more of a "relative" performance measure for that CPU with various settings or memory configurations. So I would strongly advise against using the results for assessing anything across different CPUs. And keep in mind, if a new CPU architecture has a very different way of using caches, the measured results would probably be wild. (We've seen with other synthetics like Geekbench that results can be all over the place, and then it's "fixed" in the next version. So keep this in mind: they are not measuring what you think they are measuring.)
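To make that concrete, this is roughly the kind of pointer-chasing loop such latency tests are generally built around; a simplified, illustrative sketch only, not the actual code of any specific tool:

```c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Simplified pointer-chasing sketch (illustrative only, not the code of any
 * real benchmark). Every load depends on the previous one, so the average
 * time per hop approximates the load-to-use latency of whatever level of
 * the memory hierarchy the buffer size lands in. */
int main(void)
{
    const size_t n = (64u << 20) / sizeof(size_t);  /* ~64 MB, well past any L3 */
    const size_t iters = 100000000;                 /* average over 100M hops   */

    size_t *buf = malloc(n * sizeof *buf);
    if (!buf) return 1;

    /* Sattolo's algorithm builds one big random cycle through the buffer,
     * so hardware prefetchers can't guess the next address. */
    for (size_t i = 0; i < n; i++) buf[i] = i;
    for (size_t i = n - 1; i > 0; i--) {
        size_t j = (size_t)rand() % i;
        size_t t = buf[i]; buf[i] = buf[j]; buf[j] = t;
    }

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    size_t p = 0;
    for (size_t i = 0; i < iters; i++)
        p = buf[p];        /* serial dependency: the next address needs this load */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (double)(t1.tv_nsec - t0.tv_nsec);
    printf("avg ~%.1f ns per hop (p=%zu)\n", ns / (double)iters, p);

    free(buf);
    return 0;
}
```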

As I always say, the only true benchmark is real world applications. Synthetics may have their role in development etc., or for a technical deep-dive to explain the various differences in hardware. But it shouldn't be the base of any purchasing decisions.

Having done a fair bit of low-level optimization, I know how hard it is just to benchmark correctly. If it's a larger algorithm that takes seconds or more to run, then all the variances become insignificant, but when dealing with tiny functions of just a few lines of code, and trying to make the correct assessments to extract the maximum performance, it actually becomes quite hard. You might think profiling is enough, but that doesn't run it in real time, so it doesn't tell everything. And just running it once is far too short to measure precisely (the variance will be much larger than the average runtime). So what is often done is to run it millions of times in a loop, which is very much synthetic (and has loop overhead), but at the very least it starts to give consistent results good enough to compare implementations and test various hardware. When testing larger sets of implementations across various CPUs, it starts to paint a picture of the various characteristics between generations and between Intel and AMD.
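Something along these lines is what I mean by the millions-of-iterations loop; a simplified sketch with a made-up tiny_op function, not any real project's harness:

```c
#include <stdint.h>
#include <stdio.h>
#include <time.h>

/* Hypothetical tiny function under test; stands in for "a few lines of code". */
static inline uint64_t tiny_op(uint64_t x)
{
    return (x * 0x9E3779B97F4A7C15ull) ^ (x >> 31);
}

static double now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec * 1e9 + ts.tv_nsec;
}

int main(void)
{
    const uint64_t iters = 100000000;   /* run the tiny function 100M times */
    uint64_t x = 1;

    /* Warm-up pass so caches, branch predictors and clock boost settle first. */
    for (uint64_t i = 0; i < iters / 10; i++) x = tiny_op(x);

    double t0 = now_ns();
    for (uint64_t i = 0; i < iters; i++)
        x = tiny_op(x);                 /* the synthetic loop, with its own overhead */
    double t1 = now_ns();

    /* Print the running value so the compiler can't delete the loop as dead code. */
    printf("~%.3f ns per call (x=%llu)\n", (t1 - t0) / iters, (unsigned long long)x);
    return 0;
}
```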

And to the surprise of some, the differences between the CPUs are quite significant at this level. The difference can be like 2x, not 0.1%. The overall trend is that AMD has higher peak throughput, while Intel handles complex logic better. Some simple operations on AMD CPUs are extremely fast, so it's just about getting the code as computationally dense as possible, and sometimes even just getting rid of some register shuffling is enough to yield a good performance bump, and the bump is usually greater for AMD. And since most people have misconceptions about optimizations: after decades of programming, I still don't know of ways to intentionally optimize for Intel over AMD even if I wanted to, other than not doing any optimizing, as the only attribute that has consistently remained in Intel's favor is the ability to handle complex logic better.

This doesn't mean Intel is somehow universally better; even though the CPUs are remarkably different at the finer level, the aggregated scores do paint a more similar picture. But it's important to note that the differences can be quite significant at an application level, and the overall trend holds true: AMD excels at larger batch loads (and now even more with AVX-512), while Intel holds an edge in more interactive applications, especially latency, which is why so many people "feel" Intel is more responsive. (And no, SpecInt is not a very stressful benchmark for the branch predictor.)

Looking forward, we could see some interesting changes in the coming CPU generations. As we know, Zen 5 increased the number of ALUs from 4 to 6 without a massive performance gain as a result, but once Intel and AMD roll out APX support, this may very well shift things more in AMD's favor, and as we've seen with AVX-512, even when Intel does it first (even over several tries), AMD often does it better. With AMD already having wider execution, APX will help saturate that and scale even further (and should help the front-end become much more efficient).

I really wonder how much of Arrow Lake's technically superior IPC we lose to its actually dogshit memory subsystem. Your mem kit should beat the daylights out of mine... but a 21 ns regression despite running 1000 MT/s faster at similar timings is just a dreadful result.
If you study the changes in CPU architectures over time, you'll see a clear trend that caches are trading latency for bandwidth, especially since the Haswell era. The same goes for DDR5 vs. DDR4. And if you were to extrapolate "real world" performance based on theoretical specs, you would certainly conclude that all new hardware must suck big time, but it clearly doesn't.

There is a big disconnect because these synthetic benchmarks are usually trying to create the worst case possible, which isn't representative of real workloads. So in reality, the newer CPU is probably slower 1-2% of the time, but it's probably equal or faster the rest of the time, and the clock cycles saved there make up for the cycles lost in the worst-case scenarios. You also have to remember that as CPU front-ends become more and more efficient, the chances of a cache miss decrease, so even if the worst case is slightly worse, the overall performance may still be better. This is one of many reasons why synthetic benchmarks shouldn't be used for this purpose.

So to answer your question:
I believe the latency is far less important than you might think. While there are some edge cases where Arrow Lake falls a little short, you almost have to select unrealistic benchmarks to paint that picture. Also remember that most reviews run Raptor Lake without power limits, or with higher limits than stock, which skews some results. In most applications it scales fine, and it does very well in Linux benchmarks. Overall it's faster, but some regressions are normal with architectural changes. By the next iteration, the overall improvements will probably even out any such disadvantage.
 
As I always say, the only true benchmark is real world applications.
That's why I use it "normally", and why people should stop with those comments about "that platform is slow as hell in gaming" when they don't actually own any of these modern platforms.
 
I agree with you guys. I had to forgo buying the 285K; the timing of my Z690 board's malfunction was pretty bad. ARL hadn't launched, I needed a working system ASAP, and there were two choices: either buy a motherboard to get it back up and running, or buy a lower-end AMD motherboard and a lower-end Ryzen... With a KS in my hands it seemed a no-brainer, especially since I had the chance to buy any exotic board I could think of, and naturally I went for the refreshed Apex.

The rocky-at-best launch of the 285K and the multiple gaming regressions eventually killed off my interest, especially given the cost and the fact that it's basically impossible to obtain one. The 265K is usually in stock, and sometimes the 285K at stratospheric prices.

I still haven't diagnosed, cleaned, or done anything to my Z690 Ace to this day; it's been about a year now. It was showing symptoms similar to a short, and I thought it wasn't worth risking my gear at the time.

Maybe sometime I'll remove all of the bits and pieces, do an inspection, and get something like an i3-14100 for it if it somehow works again. No rush.
 
Nope, it's not that I misunderstood it. In that test, the ARL system isn't using the 200S BIOS (which, aside from the clock boost, also carries microcode fixes alongside the Windows update from December 2024), and the tester isn't tweaking anything at all, just loading XMP..lmao. Prior to that, I was using an old Windows preview build that had gotten my RPL platform's performance back on its feet, and it somehow didn't feel right using it with ARL, so I made a custom image based on the March 2025 24H2 image and reformatted my ARL setup. Everything is working as it should now: the bottleneck in COD is gone, and the FPS from the game is just 8-10 fps off what my 9950X3D does at 4K with the same GPU (and clock settings). That makes those tests a large misrepresentation of the real performance of ARL now that the 200S BIOS is shipped by OEMs. You guys really like what's being falsely fed to you by the Internet..lol

This is my 285K, on the 200S BIOS, running 8600 MT/s CL36 (my daily; I also have an 8800 MT/s CL38 kit, and a 9000 MT/s CUDIMM kit on the way). I'm normally around 76 ns when nothing is running in the background or foreground; right now I have the browser and HWiNFO running. It's a far cry from what you are showing latency-wise vs AMD. I also have a Zen 5 platform, a 9950X3D, and running the same benchmark I get around the same 76 ns with DDR5 8200 MT/s CL34 on an X870E Hero board.
Their article is a good read, but some things are even sketchier than sketch. I don't really trust the Internet for tech information unless they know what they are doing, seriously...

So where did that 99 ns or whatever come from? My test is based on a real-life scenario that I use and play around with daily; it's a FAR CRY from what I am getting.
Thanks for sharing your results; you are using Chips and Cheese's benchmark for latency testing. The point isn't the absolute latency; rather it's the incredible difference in latency between even the L3 cache and DRAM which your own results show as well. In addition, the bandwidth of the L3 is far greater than DRAM could satisfy.

[Chart: L3 cache vs DRAM bandwidth and latency]
 
I agree with you guys. I had to forgo buying the 285K; the timing of my Z690 board's malfunction was pretty bad. ARL hadn't launched
I had already bought the board before then, when it was on sale because nobody where I'm at was buying it. The processor was free; somebody sponsored the hobby. Random retail, not too shabby, not too good either, but no complaints so far.
The point isn't the absolute latency; rather it's the incredible difference in latency between even the L3 cache and DRAM which your own results show as well.
Yes, I would agree, the L3 is just far too hungry. It will scale as long as your memory controller and I/O don't slack (to be honest, Zen 5's bottleneck is the I/O die; full effective bandwidth doesn't scale much beyond 95 GB/s). But to be honest, we're all looking at "theoretical performance". I don't think there'd be any mainstream app that would need that much bandwidth or require latency that low; some games do, just "some".
[Screenshot: memory bandwidth benchmark result]
 
I had already bought the board before then, when it was on sale because nobody where I'm at was buying it. The processor was free; somebody sponsored the hobby. Random retail, not too shabby, not too good either, but no complaints so far.

Yes, I would agree, the L3 is just far too hungry. It will scale as long as your memory controller and I/O don't slack (to be honest, Zen 5's bottleneck is the I/O die; full effective bandwidth doesn't scale much beyond 95 GB/s). But to be honest, we're all looking at "theoretical performance". I don't think there'd be any mainstream app that would need that much bandwidth or require latency that low; some games do, just "some".
You have a point with bandwidth, but many applications are incredibly latency sensitive.
 
You have a point with bandwidth, but many applications are incredibly latency sensitive.
In a strict environment where results matter, it would. For my normal personal household use, maybe, maybe not; it depends on what you want to run. The latency difference between my 8800 MT/s and 8600 MT/s kits is negligible, 0.02-0.04 ns apart. I don't see anything unusual with those; I just choose to run the latter for lower voltages.
 