Die-shot Suggests "Phoenix 2" is AMD's First Hybrid Processor

JustBenching · Sep 11, 2023

Squared said:
What's different is that a Zen 4c core will behave exactly like a Zen 4 core when at the same clock speed (except maybe for cache, I'm not sure). So if the Zen 4 cores are loaded, a third Zen 4 core couldn't boost enough to outperform a Zen 4c core. So there shouldn't be any difference in performance. Now Windows does have to schedule to the Zen 4 cores first, but that's trivial. Remember that because of simultaneous multi-threading, this processor presents 12 logical cores to Windows, and Windows has to choose just one thread on each core until it gets to 7 threads. I don't hear concern about how that's working.

If the Zen 4c core would perform as well as the full fat core then there wouldn't be any full fat cores. Obviously that is not the case, zen 4c will be slower so it will have the same "issues" ecores do.

atomsymbol · Sep 11, 2023

fevgatos said:
And how is this different in practice? If a workload decides to load the C core instead of the full fat core, the end result is the same

One factor that makes a practical difference arises when the number of tasks running on the CPU is higher than the number of P-cores (irrespective of whether it is an AMD or an Intel P-core): will another task run faster if it is scheduled on an already half-used P-core (which has HT/SMT) or on an unused E-core? The answer has to take into account that running a 2nd thread on a P-core will slow down the 1st thread on that P-core by about 40%. Instead of time, the question can also be reformulated in terms of power usage. Given the numbers and performance ratios, the usual core allocation order (when optimizing for time and not for power usage) is the following: 1 thread per P-core, then 1 thread per E-core, then the 2nd threads on P-cores, then the 2nd threads on E-cores (if the E-cores have HT/SMT). If optimizing for power usage instead of time, the allocation order is different. The allocation order can also be different if the application (or the OS) knows that certain threads running on the CPU are heavily sharing data via L1D / L1I / L2 caches.
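
The trade-off between an SMT sibling and an idle E-core can be put into rough numbers. The following is only a sketch: the 40% slowdown is the figure from this post, while the E-core relative speed of 0.55 and the assumption that both SMT threads slow down equally are hypothetical values chosen for illustration.

```python
# All numbers are illustrative assumptions, not measurements.
SMT_SLOWDOWN = 0.40     # from the post: a 2nd thread slows the 1st by about 40%
E_CORE_RELATIVE = 0.55  # hypothetical E-core speed relative to an unshared P-core


def added_throughput_smt(slowdown: float = SMT_SLOWDOWN) -> float:
    """Throughput gained by placing a 2nd thread on a busy P-core.

    Assumes both threads then run at (1 - slowdown) of single-thread speed,
    so the core goes from 1.0 to 2 * (1 - slowdown) units of work per second.
    """
    return 2 * (1 - slowdown) - 1.0


def added_throughput_e_core(relative_speed: float = E_CORE_RELATIVE) -> float:
    """Throughput gained by placing the thread on an idle E-core instead."""
    return relative_speed


def better_placement(slowdown: float, e_core_speed: float) -> str:
    """Pick the placement that adds more total throughput."""
    smt_gain = added_throughput_smt(slowdown)
    e_gain = added_throughput_e_core(e_core_speed)
    return "E-core" if e_gain > smt_gain else "P-core SMT sibling"


if __name__ == "__main__":
    print(better_placement(SMT_SLOWDOWN, E_CORE_RELATIVE))
```

Under these assumptions the SMT sibling adds about 0.2 units of throughput and the idle E-core adds 0.55, which is why the idle E-core comes before the SMT sibling in the allocation order. A much smaller SMT penalty would flip the result.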

AnotherReader · Sep 11, 2023

atomsymbol said:
One factor that makes a practical difference arises when the number of tasks running on the CPU is higher than the number of P-cores (irrespective of whether it is an AMD or an Intel P-core): will another task run faster if it is scheduled on an already half-used P-core (which has HT/SMT) or on an unused E-core? The answer has to take into account that running a 2nd thread on a P-core will slow down the 1st thread on that P-core by about 40%. Instead of time, the question can also be reformulated in terms of power usage. Given the numbers and performance ratios, the usual core allocation order (when optimizing for time and not for power usage) is the following: 1 thread per P-core, then 1 thread per E-core, then the 2nd threads on P-cores, then the 2nd threads on E-cores (if the E-cores have HT/SMT). If optimizing for power usage instead of time, the allocation order is different. The allocation order can also be different if the application (or the OS) knows that certain threads running on the CPU are heavily sharing data via L1D / L1I / L2 caches.

Optimizing for power is more complex especially with Intel's E cores which are less power efficient than P cores for many tasks.

Assimilator · Sep 11, 2023

fevgatos said:
If the Zen 4c core would perform as well as the full fat core then there wouldn't be any full fat cores. Obviously that is not the case, zen 4c will be slower so it will have the same "issues" ecores do.

It will almost certainly be slower in some scenarios. What those scenarios are we don't know yet, and they may be irrelevant to most users.

atomsymbol · Sep 11, 2023

AnotherReader said:
Optimizing for power is more complex especially with Intel's E cores which are less power efficient than P cores for many tasks.

It is considered to be complex today (year 2023). A major issue is that most operating systems aren't designed to measure total task power. If they were able to measure it, optimizing for total task consumption would be very simple from the end-user perspective (i.e., just a few key presses or mouse button clicks to turn such an optimization target on or off). Sometime in the future, it will be simple.
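
To illustrate what "total task consumption" would mean if the OS could sample per-task power, here is a minimal sketch that integrates power samples over time. The function name and both power traces are hypothetical; no real OS interface is being described.

```python
def task_energy_joules(samples: list[tuple[float, float]]) -> float:
    """Integrate (time_s, power_w) samples with the trapezoidal rule."""
    energy = 0.0
    for (t0, p0), (t1, p1) in zip(samples, samples[1:]):
        energy += (t1 - t0) * (p0 + p1) / 2
    return energy


# Hypothetical traces for the same task on two core types.
fast_core = [(0.0, 12.0), (1.0, 12.0), (2.0, 12.0)]  # finishes in 2 s at 12 W
slow_core = [(0.0, 5.0), (2.0, 5.0), (4.0, 5.0)]     # finishes in 4 s at 5 W

if __name__ == "__main__":
    print(task_energy_joules(fast_core), task_energy_joules(slow_core))
```

With these made-up traces the slower core uses less energy for the task. A slower core that needed 5 s at the same 5 W would use more, which is the kind of case where an E-core ends up less power efficient than a P-core.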

persondb · Sep 11, 2023

atomsymbol said:
One factor that makes a practical difference arises when the number of tasks running on the CPU is higher than the number of P-cores (irrespective of whether it is an AMD or an Intel P-core): will another task run faster if it is scheduled on an already half-used P-core (which has HT/SMT) or on an unused E-core? The answer has to take into account that running a 2nd thread on a P-core will slow down the 1st thread on that P-core by about 40%. Instead of time, the question can also be reformulated in terms of power usage. Given the numbers and performance ratios, the usual core allocation order (when optimizing for time and not for power usage) is the following: 1 thread per P-core, then 1 thread per E-core, then the 2nd threads on P-cores, then the 2nd threads on E-cores (if the E-cores have HT/SMT). If optimizing for power usage instead of time, the allocation order is different. The allocation order can also be different if the application (or the OS) knows that certain threads running on the CPU are heavily sharing data via L1D / L1I / L2 caches.

Most of what you say is already applicable to systems with only one core type. Note that a lot of the things you take as assumptions aren't necessarily true either.

You talked about a P-core that is 'half-used', by which you seem to mean one thread running on one of the two hardware threads. Will scheduling a new task on the other hardware thread end up harming the performance of the first? Maybe.

There is actually no way of knowing without analyzing exactly what each of those threads is doing. For all we know, the first thread might be stalled because it's accessing some peripheral (say, an SSD) that has a latency on the order of microseconds (note that clock cycles are in the nanosecond range or less at GHz frequencies); if it's already not using the core resources, then it likely won't matter. The same goes if, say, the first thread doesn't use FP/vector code and the second one uses it very heavily. And so on: threads can end up stalling for whatever reason (OoO execution tries to hide that, but it's not perfect), and so one thread basically has all the core resources for some time.

Also, it's not as simple as your solution seems to suggest; you are just locking threads to core types as they are spawned. One reason, to start with, is that any modern OS will have more than a hundred threads running at the same time...

Well, there are a lot of reasons, but this is already too long.

atomsymbol · Sep 11, 2023

persondb said:
Most of what you say is already applicable to systems with only one core type. Note that a lot of the things you take as assumptions aren't necessarily true either.

You talked about a P-core that is 'half-used', by which you seem to mean one thread running on one of the two hardware threads. Will scheduling a new task on the other hardware thread end up harming the performance of the first? Maybe.

There is actually no way of knowing without analyzing exactly what each of those threads is doing. For all we know, the first thread might be stalled because it's accessing some peripheral (say, an SSD) that has a latency on the order of microseconds (note that clock cycles are in the nanosecond range or less at GHz frequencies); if it's already not using the core resources, then it likely won't matter. The same goes if, say, the first thread doesn't use FP/vector code and the second one uses it very heavily. And so on: threads can end up stalling for whatever reason (OoO execution tries to hide that, but it's not perfect), and so one thread basically has all the core resources for some time.

Also, it's not as simple as your solution seems to suggest; you are just locking threads to core types as they are spawned. One reason, to start with, is that any modern OS will have more than a hundred threads running at the same time...

Well, there are a lot of reasons, but this is already too long.

The above arguments seem obvious from my perspective. So, I agree.

Geofrancis · Sep 11, 2023

persondb said:
Notice the same clock speed. This isn't going to have the same clock speed at all; if AMD could do a core that was half the size and had roughly the same clock speed, they would just do that...

If the 2 Zen 4 cores are loaded and the program needs more cores, then the Zen 4c will be the bottleneck, just like Gracemont is for Golden Cove. Or there might be cases where the OS schedules tasks to Zen 4c (which will boost to the highest clock) instead of the Zen 4 cores.

The physical implementation is different anyhow too, so who knows the effect of stuff like the different memory cell that they are using for Zen 4c.

Zen 4c: AMD’s Response to Hyperscale ARM & Intel Atom

Bergamo Volumes, ASP, Performance, Hyperscale Order Shift, Die Shot, Floorplan, Physical Design, and Future Use of Dense Core Variants Bergamo, AMD’s upcoming 128-core server part sets new heights …

www.semianalysis.com

My understanding is that Windows normally wants to send the most resource-hungry process to the core with the highest clock speed. With the 7950X3D that was a problem because the non-V-Cache cores ran faster than the ones with the cache. That's a totally different scenario from this one, where the Zen 4 cores will always clock faster than the Zen 4c cores and so will always be preferred.

Squared · Sep 11, 2023

fevgatos said:
If the Zen 4c core would perform as well as the full fat core then there wouldn't be any full fat cores. Obviously that is not the case, zen 4c will be slower so it will have the same "issues" ecores do.

Zen 4 is a balanced core; it is designed to get the best of power efficiency, density, and clock speed from 15W notebooks to 170W desktops. Zen 4c has the same logic but rearranged for power efficiency and density. The result is that at a given lower clock speed, it needs less power than Zen 4 at that same clock speed, but it performs the same. But it consumes more power than Zen 4 at higher clock speeds, so it can't clock as high. A single Zen 4 core can consume 15W, so in a 15W laptop processor, two Zen 4 cores cannot reach their maximum clock speed simultaneously. So as the number of cores in use goes up, the clock speed goes down to stay within the power and heat limits of the laptop. At some point, the clock speed will go down to a speed where Zen 4 and Zen 4c are equally efficient. Below that point, Zen 4c will actually be able to clock higher in the same power limit, or use less power so that Zen 4 can keep up.

So in a 15W laptop processor, it's quite possible and I think more likely than not that two Zen 4 plus four Zen 4c will be faster in most tasks than six Zen 4 cores.
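
The crossover argument can be sketched with a toy power model. Everything here is an assumption for illustration: the cubic power law is only a rough rule of thumb, and the static power, coefficients, and maximum clocks are made up rather than AMD figures. Equal performance per clock is taken from the claim above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CoreModel:
    """Toy per-core power model: P(f) = static_w + k * f**3 (assumed shape)."""
    name: str
    static_w: float      # hypothetical power at near-zero clock, in watts
    k_w_per_ghz3: float  # hypothetical dynamic coefficient, W per GHz^3
    fmax_ghz: float      # hypothetical maximum clock, in GHz

    def power_w(self, f_ghz: float) -> float:
        return self.static_w + self.k_w_per_ghz3 * f_ghz ** 3

    def max_clock_ghz(self, budget_w: float) -> float:
        """Highest clock reachable within a per-core power budget."""
        if budget_w <= self.static_w:
            return 0.0
        f = ((budget_w - self.static_w) / self.k_w_per_ghz3) ** (1 / 3)
        return min(f, self.fmax_ghz)


def crossover_ghz(a: CoreModel, b: CoreModel) -> float:
    """Clock at which both models draw the same power."""
    return ((a.static_w - b.static_w) / (b.k_w_per_ghz3 - a.k_w_per_ghz3)) ** (1 / 3)


# Made-up parameters: the dense core has less static power but a steeper curve.
CLASSIC = CoreModel("Zen 4 (toy)", static_w=0.5, k_w_per_ghz3=0.10, fmax_ghz=5.0)
DENSE = CoreModel("Zen 4c (toy)", static_w=0.2, k_w_per_ghz3=0.13, fmax_ghz=3.5)

if __name__ == "__main__":
    print(f"crossover at about {crossover_ghz(CLASSIC, DENSE):.2f} GHz")
    for budget in (1.0, 2.5):
        print(budget, CLASSIC.max_clock_ghz(budget), DENSE.max_clock_ghz(budget))
```

With these invented numbers the dense core clocks higher at a 1 W per-core budget and the classic core clocks higher at 2.5 W. Where the real crossover sits depends entirely on the actual V/F curves.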

R0H1T · Sep 11, 2023

persondb said:
If the 2 Zen 4 cores are loaded and the program needs more cores, then the Zen 4c will be the bottleneck, just like Gracemont is for Golden Cove. Or there might be cases where the OS schedules tasks to Zen 4c (which will boost to the highest clock) instead of the Zen 4 cores.

That's only because of the alleged low clock speeds of Zen 4c on servers; we don't know how they'll clock on desktops, if they're even released there.

persondb said:
The physical implementation is different anyhow too, so who knows the effect of stuff like the different memory cell that they are using for Zen 4c.

Why does it matter how they implement the cores? They could throw in a fake 4D effect for all I care ~ AMD Ryzen Z1 APU Features Zen 4c Cores

The only thing that matters is the performance & with a shared L3 it looks like there could be minimal difference there!

persondb · Sep 11, 2023

R0H1T said:
Why does it matter how they implement the cores? They could throw in a fake 4D effect for all I care ~ AMD Ryzen Z1 APU Features Zen 4c Cores

The only thing that matters is the performance & with a shared L3 it looks like there could be minimal difference there!

Because physical implementation is part of the performance. There are a lot of details in physical implementation, such as register duplication, which in many cases is much faster than going without it because you reduce the critical paths, routing, etc.

There is very likely a reason why they have chosen to only use the 6T pseudo-dual port memory cell for Zen 4c and not the normal cores. Those details are likely going to bring clocks considerably down.

About the shared L3, it can also depend on whether they are doing a single CCX or not. I remember that there was talk about an APU with 4 Zen 5 cores and 8 Zen 5c cores (or was it Zen 4 variants for both?) that had each of the core types in a CCX of its own. The new memory cells might also have higher latency; that doesn't contradict anything AMD said about Zen 4c afaik, since that isn't viewed as part of the architecture and they only said the architecture remains the same.

Either way, the result is likely that it isn't going to clock anywhere near as high as normal Zen 4, and it's also going to have different V/F curves.

R0H1T · Sep 11, 2023

And AMD can simply artificially limit the zen4 clocks if they really need to, though that would defeat the purpose of going this route in essence. Also for the current performance targets, for Z1 APU, the clocks are perfectly reasonable.

What you're saying is also not unheard of, in fact *dozer probably had some of these higher density variants on 28nm(?) IIRC which were supposedly designed to save space. It was around that timeline, though I'm not 100% sure if it were the same products.

qcmadness · Sep 11, 2023

R0H1T said:
And AMD can simply artificially limit the zen4 clocks if they really need to, though that would defeat the purpose of going this route in essence. Also for the current performance targets, for Z1 APU, the clocks are perfectly reasonable.

What you're saying is also not unheard of, in fact *dozer probably had some of these higher density variants on 28nm(?) IIRC which were supposedly designed to save space. It was around that timeline, though I'm not 100% sure if it were the same products.

Saving die space and thus current leakage.

zlobby · Sep 11, 2023

Space Lynx said:
@lexluthermiester for context, I hate ecores with a passion. I saw total war warhammer utilize 81% in ecore and like 10% in pcore randomly at times, causing the game to have low fps dips once in awhile. 99% of games work fine, but I want ALL games to work fine.

down with ecores!!! DOWN WITH THEM!!!!

Amen!

lexluthermiester · Sep 12, 2023

fevgatos said:
the end result is the same

No, it isn't.

persondb said:
There is no difference between P-cores and E-cores in Intel client implementation, ISA-wise.

Moose muffins. There absolutely is a difference. Just because you don't understand the difference doesn't mean there isn't one. However, it is VERY complicated and I'm not going to take the time to explain it.

dyonoctis · Sep 12, 2023

lexluthermiester said:
No, it isn't.

Moose muffins. There absolutely is a difference. Just because you don't understand the difference doesn't mean there isn't one. However, it is VERY complicated and I'm not going to take the time to explain it.

I just feel like this whole debate will ultimately depend on whether or not AMD decides to limit the clock of Zen 4c vs classic Zen 4.

lexluthermiester · Sep 12, 2023

dyonoctis said:
I just feel like this whole debate will ultimately depend on whether or not AMD decides to limit the clock of Zen 4c vs classic Zen 4.

It seems they have already done that. The thing is, the cores are seemingly electrically and functionally the same, just clock limited. With Intel's Big/Little, the P-Cores are functionally different from the E-Cores. The E-Cores are an enhanced Atom generation CPU core, whereas the P-Cores are the new hotness. Windows has to have a different set of runtimes for one core vs the other on Intel, whereas Windows does not need to do anything different for this new Ryzen CPU as the cores are functionally the same, just at different speeds, something far easier to manage.

Squared · Sep 12, 2023

lexluthermiester said:
Moose muffins. There absolutely is a difference. Just because you don't understand the difference doesn't mean there isn't one. However, it is VERY complicated and I'm not going to take the time to explain it.

The instruction set architecture or ISA is the language that the processor and software use to talk to one another. Software either uses a subset of the ISA that's common to all current x86 processors or it checks what instructions are available and uses what it can. Windows will move software around between P cores and E cores while the software is running, so if the software starts on a P core and starts using AVX-512, then gets moved to an E core that doesn't have it, it'll probably crash. Intel disabled all instructions in Alder Lake that weren't available to both Golden Cove (P core) and Gracemont (E core) to avoid this issue. So they have an identical ISA.

What's different is their microarchitecture, which is the inner workings of the CPU core. Zen 4 and Zen 4c on the other hand have nearly identical microarchitectures. They're differentiated by tracing, layout, and cache, with AMD claiming that the end result is identical performance when at the same clock speed. But a Golden Cove core is a lot faster than a Gracemont core when running at the same clock speed.
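
The "check what's available and use what you can" pattern, and why a hybrid chip can only expose what every core supports, can be sketched as follows. The feature sets are heavily simplified and the kernel names are hypothetical; real software would query CPUID or an OS API rather than hard-coded sets.

```python
def common_isa(*core_feature_sets: set[str]) -> set[str]:
    """Features that are safe when a thread may migrate to any of the cores."""
    return set.intersection(*core_feature_sets)


def pick_kernel(available: set[str]) -> str:
    """Choose the widest code path whose required feature is available."""
    # Ordered from most to least capable; the kernel names are made up.
    for feature, kernel in (("avx512f", "avx512_kernel"), ("avx2", "avx2_kernel")):
        if feature in available:
            return kernel
    return "baseline_kernel"


# Simplified illustration: Golden Cove has AVX-512 in hardware, Gracemont does not.
P_CORE = {"sse2", "avx2", "avx512f"}
E_CORE = {"sse2", "avx2"}

if __name__ == "__main__":
    exposed = common_isa(P_CORE, E_CORE)
    print(sorted(exposed), pick_kernel(exposed))
```

Because only the intersection is exposed, software that dispatches this way never picks the AVX-512 path on such a chip, so it keeps working when it is migrated between core types.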

persondb · Sep 12, 2023

lexluthermiester said:
Moose muffins. There absolutely is a difference. Just because you don't understand the difference doesn't mean there isn't one. However, it is VERY complicated and I'm not going to take the time to explain it.

No, there isn't. The ISA/feature level is exactly the same, it runs the same code.

If you are talking about performance, then yes, for sure, there is a huge difference.

Are you talking about specific MSRs? Those do seem to differ, especially for the performance counters, but that isn't meaningful in this discussion. The difference we are talking about is what it can execute or not at different performance levels. Sure, a Golden Cove is going to be much faster than a Gracemont, but so is Zen 4 vs Zen 4c, which is where the complication of scheduling arises.

https://perfmon-events.intel.com/ahybrid.html

Note that people have been implementing the same core in different ways for quite some time. There are plenty of A53 SoCs that have two clusters running at different clocks.

Wirko · Sep 12, 2023

dyonoctis said:
I just feel like this whole debate will ultimately depend on whether or not AMD decides to limit the clock of Zen 4c vs classic Zen 4.

lexluthermiester said:
It seems they have already done that. The thing is, the cores are seemingly electrically and functionally the same

Sure, AMD did limit the clock of the 4C core, but they did it at the design stage, not only after the processor was powered on. Larger transistors = faster transistors, at least on average, because the necessary size of each transistor depends on its role in the circuit. If its role is to drive some signal to many other transistors and/or over longer wires and/or at a higher speed, it needs to be larger in order to overcome the capacitances in the circuit. I'm posting a link to an article by David Kanter here again, it's not an easy read but it is very informative.

High-speed designs like the server processor tend to use more custom circuit design and larger transistors that have greater drive strength and reduced variability. In modern FinFET-based designs, this translates into more transistors with 2 fins, 3 fins, or even more. In contrast, lower-speed logic like an explicitly parallel GPU or ASICs often employ the densest transistors that use just a single fin, sacrificing clock speed to improve density. Similar to high-speed logic, ultra-low leakage transistors are often larger as well.

persondb said:
There is very likely a reason why they have chosen to only use the 6T pseudo-dual port memory cell for Zen 4c and not the normal cores.

They did that for the L2 cache IIRC, right?

JustBenching · Sep 12, 2023

lexluthermiester said:
It seems they have already done that. The thing is, the cores are seemingly electrically and functionally the same, just clock limited. With Intel's Big/Little, the P-Cores are functionally different from the E-Cores. The E-Cores are an enhanced Atom generation CPU core, whereas the P-Cores are the new hotness. Windows has to have a different set of runtimes for one core vs the other on Intel, whereas Windows does not need to do anything different for this new Ryzen CPU as the cores are functionally the same, just at different speeds, something far easier to manage.

So if a thread is sent to the 4c core, performance will suffer, just like with ecores. Yes?

Wirko · Sep 12, 2023

fevgatos said:
So if a thread is sent to the 4c core, performance will suffer, just like with ecores. Yes?

Yes, and in a way, it's worse: performance will suffer even more if two threads are sent to a 4c core when it's not absolutely necessary. That is, when you have another free 4c core but you choose to keep it idle to save power.

qcmadness · Sep 12, 2023

fevgatos said:
So if a thread is sent to the 4c core, performance will suffer, just like with ecores. Yes?

Theoretically, yes.
Practically, no.

AnotherReader · Sep 12, 2023

I think there's much ado about nothing here. Regular processors, even ones with the same cores such as the 12600k or AMD's entire Ryzen portfolio before this SKU, already have different maximum clock speeds. In this case, it looks like the Zen 4 and Zen 4c cores share the same L3. If that's the case, then there will be no difference in IPC. However, the Zen 4c cores will clock lower than the Zen 4 cores which is a consequence of their physical design. In a power constrained scenario, that's unlikely to matter as these cores will only have threads scheduled onto them if the Zen 4 cores are occupied. In that case, the entire SOC will be running below peak clocks. Remember that Windows allocates threads in this manner:

1. Cores get threads allocated in order of their speed, with thread 1 going to core 0, thread 2 going to core 1, and so on.
2. Once the number of threads in the active task reaches the number of cores, simultaneous multithreading kicks in, and threads are again allocated in order of core speed.

As can be seen from the above, for a hypothetical process that can utilize 12 threads, it'll get scheduled onto the Zen 4 cores first (2 threads) and then 4 threads will be scheduled onto the Zen 4c cores. The remaining 6 threads will also be scheduled in a similar fashion. If it spawned only 2 threads, then they will be scheduled onto the Zen 4 cores.
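
The allocation rule described above can be written down as a small sketch. This models only the rule as stated in this post, not the actual Windows scheduler, and the clock values are placeholders used purely for ordering.

```python
def allocation_order(core_speeds: list[float], smt_ways: int = 2) -> list[tuple[int, int]]:
    """Return (core_index, smt_slot) pairs in the order threads get placed.

    Cores are filled fastest first, one thread each; only then are the
    second hardware threads used, again fastest core first.
    """
    # sorted() is stable, so cores with equal speed keep their index order.
    by_speed = sorted(range(len(core_speeds)), key=lambda i: -core_speeds[i])
    return [(core, slot) for slot in range(smt_ways) for core in by_speed]


# Placeholder clocks: cores 0-1 stand in for Zen 4, cores 2-5 for Zen 4c.
PHOENIX2_TOY = [5.0, 5.0, 3.5, 3.5, 3.5, 3.5]

if __name__ == "__main__":
    for n, (core, slot) in enumerate(allocation_order(PHOENIX2_TOY), start=1):
        print(f"thread {n:2d} -> core {core}, SMT slot {slot}")
```

For the 2+4 layout this places threads 1 and 2 on the Zen 4 cores, threads 3 to 6 on the Zen 4c cores, and threads 7 to 12 on the second hardware threads in the same order.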

Squared · Sep 12, 2023

fevgatos said:
So if a thread is sent to the 4c core, performance will suffer, just like with ecores. Yes?

Theoretically, if Zen 4c is more power-efficient at low power and it is in a low-power device, it could be faster than Zen 4. Especially if the two Zen 4 cores are already busy.
