
Intel "Nova Lake‑AX" APU Enhances iGPU Performance with Plenty of Xe3 Cores

AleksandarK
News Editor, Staff member
Intel is reportedly preparing "Nova Lake-AX," a high-end laptop SoC that combines a massive 52-core CPU complex with an expanded Xe3 graphics tile. While the standard Nova Lake‑S is set to arrive in 2026, followed by the H and HX mobile variants, the AX model will debut later as the flagship SKU. Built on Intel's second‑generation Foveros technology, Nova Lake‑AX stacks two compute tiles, each with eight "Coyote Cove" P‑cores and 16 "Arctic Wolf" E‑cores, plus a separate low‑power island with four LPE cores. A cache-boosted passive tile could add over 100 MB of Last Level Cache (bLLC), which feeds both the CPU cores and the Celestial Xe3 iGPU, potentially scaling up to 20-24 Xe3 cores. Intel already uses bLLC in its Clearwater Forest server processors, where a passive interposer integrates cache beneath the active tiles, so its inclusion here could deliver a significant performance uplift.

Unlike the regular Nova Lake‑S, H, and HX variants, which are expected to feature half as many Xe3 cores or fewer, the AX model promises a truly gaming‑worthy APU experience with a massive iGPU. With a combined TDP approaching 150 W in mobile workstations, active cooling will be mandatory in all form factors. Intel clearly aims to challenge AMD's long‑standing lead in the APU space. Historically, Intel CPUs have shipped with modest iGPUs, while AMD's APUs have offered solid integrated performance capable of driving many games at 1080p and even 1440p. Nova Lake‑AX pairs substantial CPU horsepower with a massive Xe3 graphics engine, marking Intel's most aggressive entry yet for enthusiast and gaming laptops. As Nova Lake news ramps up, we expect more details, such as clock speeds, exact Xe3 core count, and pricing, to emerge in the coming months.




 
The math is not mathing here.
Up to 52-core is the Nova Lake-HX/S with 2 compute tiles of each type plus 4 LPE ones - 8*2 + 16*2 + 4 = 52.
Nova Lake-AX is 1 compute tile of each type - 8 + 16 + 4 = 28.
Edit: looks like I read the sources incorrectly, my bad.

The 100MB bLLC might be the real headliner here.
 
The math is not mathing here.
Up to 52-core is the Nova Lake-HX/S with 2 compute tiles of each type plus 4 LPE ones - 8*2 + 16*2 + 4 = 52.
Nova Lake-AX is 1 compute tile of each type - 8 + 16 + 4 = 28.

The 100MB bLLC might be the real headliner here.
Yeah two of each. 2×(8+16)+4LPE on a separate island.
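The tile math being argued over can be checked in a few lines (figures are the rumored ones from the article, not official specs):

```python
# Rumored Nova Lake-AX core counts per the article:
# two compute tiles plus a separate low-power island.
P_CORES_PER_TILE = 8    # "Coyote Cove" P-cores
E_CORES_PER_TILE = 16   # "Arctic Wolf" E-cores
COMPUTE_TILES = 2
LPE_CORES = 4           # low-power island, not on the compute tiles

total = COMPUTE_TILES * (P_CORES_PER_TILE + E_CORES_PER_TILE) + LPE_CORES
print(total)  # 52
```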
 
@londiste

If I'm not mistaken the four LPE cores are not on the compute tile. Therefore the math adds up.
 
Isn't really fast dedicated/unified memory needed to make the most of these APUs?
That large slab of cache can go a long way in accelerating the iGPU; think Infinity Cache.
 
Isn't really fast dedicated/unified memory needed to make the most of these APUs?
Intel is currently the only CPU maker to support CUDIMMs and next-gen form factors like CAMM2.
 
That large slab of cache can go a long way in accelerating the iGPU; think Infinity Cache.
That depends on how it's connected.
For example, in the chiplet versions of Zen 4/5, the iGPU has no access to the L3 caches located on the CCDs, since the iGPU sits on the IO die; it has its own small L2 cache, no L3 (Infinity Cache), and goes back over Infinity Fabric for RAM access.
In Intel Arrow Lake the iGPU doesn't have access to the CPU's L3 cache either, since the L3 is on the compute tile; the iGPU sits on its own tile with an exclusive 4 MB L2 cache and goes back to the SoC tile, which hosts the RAM controllers.
As far as I know, even monolithic AMD APUs do not share L3 between the iGPU and CPU.
Cache coherency is a very complex problem, so I'm curious whether Intel is going to share this new big cache with the iGPU, but I remain sceptical. Maybe dynamic partitioning of this huge cache between iGPU and CPU depending on the workload?
 
That depends on how it's connected.
For example, in the chiplet versions of Zen 4/5, the iGPU has no access to the L3 caches located on the CCDs, since the iGPU sits on the IO die; it has its own small L2 cache, no L3 (Infinity Cache), and goes back over Infinity Fabric for RAM access.
In Intel Arrow Lake the iGPU doesn't have access to the CPU's L3 cache either, since the L3 is on the compute tile; the iGPU sits on its own tile with an exclusive 4 MB L2 cache and goes back to the SoC tile, which hosts the RAM controllers.
As far as I know, even monolithic AMD APUs do not share L3 between the iGPU and CPU.
Cache coherency is a very complex problem, so I'm curious whether Intel is going to share this new big cache with the iGPU, but I remain sceptical. Maybe dynamic partitioning of this huge cache between iGPU and CPU depending on the workload?
I understand your concern. But going by the article:
A separate cache-boosted passive tile could add over 100 MB of Last Level Cache (bLLC), which feeds both the CPU cores and the Celestial Xe3 iGPU
maybe Intel can pull a rabbit out of the hat?
 
If this is still dual-channel DDR5 (8000 MT/s), it's going to be bandwidth limited. Strix Halo uses double the lanes to feed its iGPU, which is more of a difference than cache alone seems able to make up for. Maybe there's some high-bandwidth scenario here too?
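Rough peak-bandwidth math behind this point (nominal figures only; real throughput varies with timings and efficiency):

```python
def peak_bandwidth_gbps(bus_width_bits, transfers_per_sec):
    """Peak DRAM bandwidth in GB/s: (bus width in bytes) x transfer rate."""
    return bus_width_bits / 8 * transfers_per_sec / 1e9

# Dual-channel DDR5-8000: 2 x 64-bit = 128-bit bus in total.
print(peak_bandwidth_gbps(128, 8000e6))  # 128.0 GB/s
# A Strix Halo-style 256-bit bus at the same transfer rate: double the lanes.
print(peak_bandwidth_gbps(256, 8000e6))  # 256.0 GB/s
```

Double the lanes at the same transfer rate is a flat 2x in peak bandwidth, which a cache can only partially hide for bandwidth-bound iGPU workloads.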
 
If this is still dual-channel DDR5 (8000 MT/s), it's going to be bandwidth limited. Strix Halo uses double the lanes to feed its iGPU, which is more of a difference than cache alone seems able to make up for. Maybe there's some high-bandwidth scenario here too?

I have been wondering about this too. Intel and AMD can only sit on dual channel for so long if core/thread counts keep increasing rapidly, unless RAM speeds ramp just as quickly.
 
Isn't really fast dedicated/unified memory needed to make the most of these APUs?
The RAM would need to be soldered to the motherboard; CUDIMM or CAMM2 in dual channel is still a bandwidth limitation for the iGPU, which is why Strix Halo, for example, has unified memory.
 
maybe Intel can pull a rabbit out of the hat?
They did try twice before, with the eDRAM-as-L4 designs on Broadwell and Skylake. The Ponte Vecchio compute accelerator also had a giant cache in the base tile, but it turned out to under-perform, especially in terms of latency; so much so that its successor was canceled.
AMD's 3D V-Cache design is terrific (barely increased latency for a significant size increase); let's hope Intel can match it in the client segment for the sake of competition. My concern is that Intel's design will be complex to manufacture, and possibly more expensive as a result.
 
These sound super cool and I'm all on board for seeing where they go with this. But every time I read something about "look at our crazy flagship SKU idea" and it's a year or more out... I gotta really wait and see. I think this would hit the performance/portability middle ground a lot of people are looking for, and now that Intel's graphics drivers are pretty well matured, I could see these doing really well.
I'd like to see these in a desktop form factor, just a cracked-out desktop APU that could be shoved into USFF systems (which will be 50% heatsink by volume).
 
What apps are actually using 50+ cores other than rendering/compiling? What is the point? Most of the time 1 to 8 cores will be fairly well loaded, and the rest will be wasted silicon. Is there some epeen award for laptops with the most wasted silicon I'm not aware of?
 
What apps are actually using 50+ cores other than rendering/compiling? What is the point? Most of the time 1 to 8 cores will be fairly well loaded, and the rest will be wasted silicon. Is there some epeen award for laptops with the most wasted silicon I'm not aware of?
The point is probably the best efficiency across various workloads, ranging from idle to single-threaded and then multi-threaded (moderate and heavy) work.

Having just one type of core means (very) good performance but likely poor efficiency, something like killing a fly with a sledgehammer.

But the operating system needs to be optimized to handle this hybrid arch properly, otherwise it will be a semi-flop.
 
which is why Strix Halo, for example, has unified memory.
All APUs/CPUs with iGPUs have "unified memory".
What sets Strix Halo apart from other x86 consumer CPUs is its 256-bit memory bus instead of a 128-bit one.

What apps are actually using 50+ cores other than rendering/compiling? What is the point? Most of the time 1 to 8 cores will be fairly well loaded, and the rest will be wasted silicon. Is there some epeen award for laptops with the most wasted silicon I'm not aware of?
I guess don't buy such a high-end laptop if you're not doing any of the tasks it could be useful for?
 
All APUs/CPUs with iGPUs have "unified memory".
What sets Strix Halo apart from other x86 consumer CPUs is its 256-bit memory bus instead of a 128-bit one.
If anything, Lunar Lake has memory packaged alongside the SoC. Strix Halo doesn't do the same, afaik.
 
If anything, Lunar Lake has memory packaged alongside the SoC. Strix Halo doesn't do the same, afaik.
Where you place the memory is not really relevant to whether it's "unified", fwiw. It does help with power consumption, and it may help with reaching higher clocks due to signal integrity, but it's not a direct way to get more performance.
In the end, even an LNL or an Apple Mx chip just ends up running a higher frequency on the JEDEC standard, nothing out of the ordinary.
 
Where you place the memory is not really relevant to whether it's "unified", fwiw. It does help with power consumption, and it may help with reaching higher clocks due to signal integrity, but it's not a direct way to get more performance.
In the end, even an LNL or an Apple Mx chip just ends up running a higher frequency on the JEDEC standard, nothing out of the ordinary.
Please, I wasn't disagreeing with you or anything, just pointing out something that could cause terminology confusion, especially where marketing gets involved.
 
Please, I wasn't disagreeing with you or anything, just pointing out something that could cause terminology confusion, especially where marketing gets involved.
No worries, sorry if it came across as some sort of disagreement; I was just adding extra info to your point.

Just to clarify what we both seem to be talking about, soldered/on-package != "unified".
And having it soldered on the PCB vs on-package has no direct performance implications.
 
The point is probably the best efficiency in various workloads. That is ranging from idle, to single-threaded and then multi-threaded (moderate and heavy) workloads.

Having just one type of core means (very) good performance but likely poor efficiency, something like killing a fly with a sledgehammer.

But the operating system needs to be optimized in order to handle properly this hybrid arch, otherwise it will be a semi-flop.
The operating system should be Linux with QEMU/KVM, utilizing the host CPU, SR-IOV, PCIe passthrough, virtio, and CPU affinity.
The guest operating systems are then multiple Windows/Linux guests, HEVC/AV1 streaming, etc.
Not one single OS.

That's how you use all those cores.
 
A Last Level Cache similar to AMD's Infinity Cache doesn't work like the usual L1/L2/L3 caches. It's cache attached to the memory controller, and memory operations are cached at that level. This means a CPU core or a GPU core will first try to find the data in its own caches and, if it's not there, will issue a normal memory operation to access the data; the memory controller then performs a lookup to see whether the data is in its cache.

This simplifies the cache topology, but it probably has higher latency than a usual large L3 cache like on Ryzen X3D. It will still probably be way better than nothing.
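That lookup order can be sketched as a toy model (class names and structure are purely illustrative, not Intel's or AMD's actual design):

```python
class Core:
    """A CPU/GPU core with its own private cache hierarchy (L1/L2...)."""
    def __init__(self, private_cache):
        self.private_cache = private_cache  # dict: address -> data

    def load(self, addr, memory_controller):
        # 1. The core first looks in its own caches.
        if addr in self.private_cache:
            return self.private_cache[addr]
        # 2. On a miss it issues a normal memory request; the memory-side
        #    cache check happens transparently at the controller.
        return memory_controller.read(addr)


class MemorySideLLC:
    """Memory-side cache attached to the memory controller (bLLC-style)."""
    def __init__(self, dram):
        self.cache = {}    # address -> data
        self.dram = dram

    def read(self, addr):
        # 3. The controller checks its LLC before going out to DRAM,
        #    filling the cache on a miss.
        if addr not in self.cache:
            self.cache[addr] = self.dram[addr]
        return self.cache[addr]


dram = {0x100: "payload"}
mc = MemorySideLLC(dram)
core = Core(private_cache={})
print(core.load(0x100, mc))  # "payload": private-cache miss, LLC fill, DRAM hit
```

Because the LLC sits behind the normal memory path, the cores need no knowledge of it, which is what keeps the coherency story simple compared with a shared L3.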
 
A Last Level Cache similar to AMD's Infinity Cache doesn't work like the usual L1/L2/L3 caches. It's cache attached to the memory controller, and memory operations are cached at that level. This means a CPU core or a GPU core will first try to find the data in its own caches and, if it's not there, will issue a normal memory operation to access the data; the memory controller then performs a lookup to see whether the data is in its cache.

This simplifies the cache topology, but it probably has higher latency than a usual large L3 cache like on Ryzen X3D. It will still probably be way better than nothing.
Look up Broadwell. It had a blazing-fast 4th-level cache sitting in front of the memory controller.

 