
Apple Introduces M1 Pro and M1 Max: the Most Powerful Chips Apple Has Ever Built

Aquinus

Resident Wat-man
Joined
Jan 28, 2012
Messages
12,440 (3.45/day)
Location
Concord, NH
System Name Apollo
Processor Intel Core i9 9880H
Memory 64GB DDR4-2667
Video Card(s) AMD Radeon Pro 5600M, 8GB HBM2
Storage 1TB Apple NVMe, 4TB External
Display(s) Laptop @ 3072x1920 + 2x LG 5k Ultrafine TB3 displays
Case MacBook Pro (16", 2019)
Audio Device(s) AirPods Pro, Sennheiser HD 380s w/ FIIO Alpen 2, or Logitech 2.1 Speakers
Power Supply 96w Power Adapter
Mouse Logitech MX Master 3
Keyboard Full Size Wireless Apple Magic Keyboard
Software MacOS 10.15.7
That is indeed an assumption. Irrespective of that, I still don't think their core architecture is that much better; virtually all of SPEC's tests are known to rely heavily on memory ops, favoring either huge caches or fast system memory.
I think you're underestimating the benefits of improving the cache hit ratio. In the environment I work in, caching performance determines a large share of overall performance, since latency is otherwise dominated by reaching out to do I/O. Granted, this is caching at a different level of the memory hierarchy, but the idea is the same. Every time you improve the hit ratio, you improve performance, because you take a fraction of the time to do the same thing, and you are also far less likely to stall the pipeline.

Just look at AMD. Infinity cache is serving a very important purpose and it's the same purpose as why Apple has a very large cache as well. More cache means better hit ratios which yields better performance. It might seem like an oversimplification, but it's really not.
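
To put rough numbers on the hit-ratio argument, here is a minimal sketch using the textbook average-access-time formula (hit time plus miss rate times miss penalty). The 4 ns and 80 ns figures are made-up placeholders, not measurements of any real chip.

```python
def average_access_time_ns(hit_ratio, hit_ns, miss_penalty_ns):
    """Average access time: every access pays hit_ns, misses add the penalty."""
    if not 0.0 <= hit_ratio <= 1.0:
        raise ValueError("hit_ratio must be between 0 and 1")
    return hit_ns + (1.0 - hit_ratio) * miss_penalty_ns


# Hypothetical figures: 4 ns for a cache hit, 80 ns extra on a miss.
for ratio in (0.90, 0.95, 0.99):
    avg = average_access_time_ns(ratio, 4.0, 80.0)
    print(f"hit ratio {ratio:.2f}: {avg:.1f} ns average")
```

With these placeholder numbers, going from a 90% to a 99% hit ratio cuts the average access time by more than half, which is the effect described above.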
 
Joined
Jan 8, 2017
Messages
7,047 (3.93/day)
System Name Good enough
Processor AMD Ryzen R7 1700X - 4.0 Ghz / 1.350V
Motherboard ASRock B450M Pro4
Cooling Deepcool Gammaxx L240 V2
Memory 16GB - Corsair Vengeance LPX - 3333 Mhz CL16
Video Card(s) OEM Dell GTX 1080 with Kraken G12 + Water 3.0 Performer C
Storage 1x Samsung 850 EVO 250GB , 1x Samsung 860 EVO 500GB
Display(s) 4K Samsung TV
Case Deepcool Matrexx 70
Power Supply GPS-750C
I think you're underestimating the benefits of improving the cache hit ratio.

I'm not, I'm just saying that everyone can throw more cache or faster system memory at the problem and get good results. I am just not impressed by that. Apple has paired server-level system bandwidth with their SoC, so yeah, it's going to be very fast compared to a lot of other CPUs. It'd be laughable if it wasn't.

On this note, I do find it really pathetic that PC manufacturers haven't moved to a new configuration that allows for wider interfaces. It's insane that we have to wait for years on end for a new DDR standard in order to get more bandwidth. This is really the only concrete area where Intel and AMD can't do jack shit, not by themselves anyway.
 
Joined
Oct 15, 2019
Messages
376 (0.48/day)
IPC has never been a system level metric, it has always been processor specific. You can search for papers regarding measurements of IPC and you'll never come across a system level study of IPC because it doesn't make sense, the CPU is a constant, the system isn't. They always focus on isolating the characteristic of the CPU alone.
I’m pretty sure that the system is also constant enough. For Zen 3 it is JEDEC 3200 RAM etc. For the M1 Max it is whatever Apple decided to pair it with. Pretty simple.




Plus to say that it's a system level thing implies that everything should be measured together, what do we do if we want to measure FLOPS throughput ? Do you count the GPU and all the various other accelerators in as well ? After all it's all on the same SoC, same system, right ?
If the software you are using to calculate IPC with happens to use the GPU things get pretty complicated, but I don’t see why it should not be used. IPC is always application specific.


It results in more instructions being executed per clock some of the time; the upper and lower bounds of IPC and its behavior remain exactly the same.
Upper bound isn’t interesting, and neither is lower bound. Trying to write software to get the min and max of a given processor is a pointless exercise. I’m interested in IPC measured with software that is actually used by people to do something productive.

edit: and for max, you can just read the wikichip page of a given processor, and check how many instructions it can dispatch every cycle. Is that something that relates to actual application performance? No.
 
Joined
Mar 10, 2010
Messages
9,485 (2.21/day)
Location
Manchester uk
System Name RyzenGtEvo/ Asus strix scar II
Processor Amd R7 3800X@4.350/525/ Intel 8750H
Motherboard Crosshair hero7 @bios 2703/?
Cooling 360EK extreme rad+ 360$EK slim all push, cpu Monoblock Gpu full cover all EK
Memory Corsair Vengeance Rgb pro 3600cas14 32Gb in four sticks./16Gb
Video Card(s) Sapphire refference Rx vega 64 EK waterblocked/Rtx 2060
Storage Silicon power qlc nvmex3 in raid 0/8Tb external/1Tb samsung Evo nvme 2Tb sata ssd
Display(s) Samsung UAE28"850R 4k freesync.
Case Lianli p0-11 dynamic
Audio Device(s) Xfi creative 7.1 on board ,Yamaha dts av setup, corsair void pro headset
Power Supply corsair 1200Hxi
Mouse Roccat Kova/ Logitech G wireless
Keyboard Roccat Aimo 120
VR HMD Oculus rift
Software Win 10 Pro
Benchmark Scores 8726 vega 3dmark timespy/ laptop Timespy 6506
I’m pretty sure that the system is also constant enough. For Zen 3 it is JEDEC 3200 RAM etc. For the M1 Max it is whatever Apple decided to pair it with. Pretty simple.





If the software you are using to calculate IPC with happens to use the GPU things get pretty complicated, but I don’t see why it should not be used. IPC is always application specific.



Upper bound isn’t interesting, and neither is lower bound. Trying to write software to get the min and max of a given processor is a pointless exercise. I’m interested in IPC measured with software that is actually used by people to do something productive.

edit: and for max, you can just read the wikichip page of a given processor, and check how many instructions it can dispatch every cycle. Is that something that relates to actual application performance? No.
I don't think I like your definition of IPC; thankfully, we already have adequate ways to test.

IPC is chip-specific, or rather core-specific; everything else in a system is changeable.

And I do get your point, and so do others; that's why reviews exist showing different application performance metrics.
 
Joined
May 2, 2017
Messages
5,496 (3.27/day)
Location
Norway, currently in Lund, Sweden
System Name Hotbox
Processor AMD Ryzen 7 5800X
Motherboard ASRock Phantom Gaming B550 ITX/ax
Cooling Aquanaut + Laing DDC 1T Plus PWM + Corsair XR5 280mm + 2x Arctic P14
Memory 32GB G.Skill FlareX 3200c14
Video Card(s) PowerColor Radeon 6900XT Liquid Devil Ultimate, UV@950mV/2050MHz/180W
Storage 2TB Adata SX8200 Pro
Display(s) Dell U2711 main, AOC 24P2C secondary
Case SSUPD Meshlicious
Audio Device(s) Optoma Nuforce μDAC 3
Power Supply Corsair SF750 Platinum
Mouse Logitech G602
Keyboard Cooler Master MasterKeys Pro M w/DSA profile caps
Software Windows 10 Pro
That is indeed an assumption. Irrespective of that, I still don't think their core architecture is that much better; virtually all of SPEC's tests are known to rely heavily on memory ops, favoring either huge caches or fast system memory.
It really isn't. There's no reason to expect an M1 in a Mac Mini to throttle under a single-core workload - it's neither thermally nor power constrained. And a 5950X under any type of reasonable cooling can maintain its max turbo (or even exceed it) in any ST workload indefinitely.
If the pro max matches the 3080 in workloads it will be impressive... mark gurman said the desktop m1 max is 128 core gpu thats 3090 desktop performance
It won't - they don't even claim that. They claim ballpark 3080 mobile performance, but even in their vague and unlabeled graph they don't reach the same level. The Pro is compared to a mobile 3050 Ti, with the Max compared to the 3080 at 100 and 160W, beating the former and coming close to the latter.
Sure, but if that's the case let's stop thinking that their chips are the greatest thing since sliced bread.
Nobody is saying that, we're just recognizing that they're pulling off some pretty astounding performance from a ~3GHz ARM core, matching or beating the fastest X86 cores at a fraction of the power and clock speed.
I don't think anyone buys 3000$+ laptops for office work, or if they do they're incredibly unintelligent. What Apple knows is that people want to use some of the professional software available for mac, not necessarily their hardware.
You'd be surprised. Not to mention the rich kids with $3000+ MBPs for school/studies etc. Though tbf, most of those have the air or at worst the 13" M1 MBP, which are both cheaper and less powerful. These will sell like hotcakes to photographers, videographers, animators, journalists, musicians, all kinds of creative professionals, and a whackton of image-obsessed rich people.
L3 caches are sometimes too slow to provide meaningful speed ups, it's usually the L1 caches that get hammered.
Then why are AMD claiming a 15% IPC (in gaming workloads) bump from their stacked 3D cache? Given that those are among the most latency sensitive workloads, and a stacked, via-connected cache is a bit of a worst case scenario, that seems to contradict what you're saying here.
Then it doesn't even make sense talking about IPC in that case, because any CPU will suddenly have higher IPC if it gets faster system memory or larger caches.
Only if it manages to keep latencies down while increasing cache size or memory speed - that's part of why Rocket Lake has such weird performance characteristics.
IPC has never been a system level metric, it has always been processor specific. You can search for papers regarding measurements of IPC and you'll never come across a system level study of IPC because it doesn't make sense, the CPU is a constant, the system isn't. They always focus on isolating the characteristic of the CPU alone.

Plus to say that it's a system level thing implies that everything should be measured together, what do we do if we want to measure FLOPS throughput ? Do you count the GPU and all the various other accelerators in as well ? After all it's all on the same SoC, same system, right ?
I think what they mean is that you can't measure IPC outside of the influence of the OS and its systems and workings - you need software to run and ways for that software to communicate with the CPU, after all. So in that way it is a system-level metric, as non-hardware changes can also affect it. The L3 latency bug in W11 would seem to noticeably lower Zen2/Zen3 IPC, for example.
I'm not, I'm just saying that everyone can throw more cache or faster system memory at the problem and get good results. I am just not impressed by that. Apple has paired server-level system bandwidth with their SoC, so yeah, it's going to be very fast compared to a lot of other CPUs. It'd be laughable if it wasn't.

On this note, I do find it really pathetic that PC manufacturers haven't moved to a new configuration that allows for wider interfaces. It's insane that we have to wait for years on end for a new DDR standard in order to get more bandwidth.
I agree that it's about time PC CPU manufacturers start breaking down some walls, but you don't quite seem to appreciate the scope of Apple's engineering on their recent SoCs and CPU core architectures. I'd highly recommend reading AnandTech's M1/A14 deep dive, as it goes in depth (with self-written feature tests which are excellent illustrations) on everything from L1 cache behaviour to numbering the various execution ports in the CPU and estimating an overall layout of the CPU.

In short, what it shows is that Apple is somehow managing L1 and L2 caches several times the size of the competition's (6x the L1I size!) with lower latency - which is downright incredible, as conventional logic says that any cache size increase will increase latency too (which has been borne out over several generations of Intel and AMD CPUs, for example) - while also having re-order buffers 2-3x the size of Intel's and AMD's, an 8-wide decoder (compared to 4-wide for both Intel and AMD), and 2-3x the execution ports, etc. Managing to design a CPU core this wide without significant performance or power penalties, and managing to keep it fed, is very impressive - and likely highly dependent on tightly integrated RAM as well as those massive caches, but that doesn't take away from the performance results. The main drawback of ultra-wide core designs is clock speed, but Apple seems to be doing decently there as well, with >3GHz sustained and even 3GHz on the mobile A14.

Is this "the best CPU out there"? Not necessarily. That depends on your use case and software needs. But is it the most advanced architecture out there? Without a doubt. Do AMD and Intel have their work cut out for them to keep up, let alone catch up? Absolutely.

Me? I really hope this leads AMD to bet on more integrated APUs, and unified memory. I would love a balls-to-the-wall APU with heaps of LPDDR5 for my next laptop. 20-30CUs at low clocks? That would be amazing. It wouldn't be cheap, but it would be fantastic, as long as they can get unified memory working in Windows.
 
Joined
Oct 15, 2019
Messages
376 (0.48/day)
I don't think I like your definition of IPC; thankfully, we already have adequate ways to test.

IPC is chip-specific, or rather core-specific; everything else in a system is changeable.

And I do get your point, and so do others; that's why reviews exist showing different application performance metrics.
It is not my definition of IPC, it is the definition of IPC. You simply divide the number of instructions executed by the number of clock cycles it took and you have a result. You can’t do that without running the DUT in a system, and it makes no sense to use any system components other than the fastest ones that the manufacturer suggests for the given DUT.
 
Joined
Oct 12, 2005
Messages
293 (0.05/day)
The PC here has the drawbacks of its advantages, and these Apple M1* chips have the advantages of their drawbacks.

PC parts need to be supported in various systems, and they need to be upgradable (for example, expanding memory). This is an advantage over the M1, but the drawbacks are slower adoption of new standards and more latency, because the memory varies from system to system and sits further away from the CPU. They also have less flexibility in the memory design, since adding channels requires a new socket.

The M1 parts, on the other hand, are designed for specific form factors. The memory isn't upgradable and is soldered on the motherboard close to the CPU. Their design allows them to scale the memory bus up and down and to adopt new standards rapidly, since they don't have to deal with a standard form factor for upgrades. This also allows them to place the memory very close to the CPU, for better latency and better energy efficiency. But if you want more memory because you didn't buy enough, you have to buy a new device. This is good for Apple, because people will tend to buy more than they need to avoid a costly upgrade later.


Apple is just pushing its advantages, since no one seems to care about the drawbacks of its platform. But if AMD and Intel did something similar, many PC enthusiasts wouldn't like it.

It still makes a lot of sense on a laptop since, most of the time, it will never be upgraded. Also, Apple owns its entire stack. If they want to add an accelerator, they can expose it through an API in the OS and make their compiler use it whenever needed.

In reality, I think they are where they are supposed to be regarding their own performance. The point isn't that they outperform now, it's that they sucked for two decades while being slowed down by Intel chips. They are just where a company that owns its full stack should be right now.

And the fun thing is you can buy one if you want, and you can buy a PC if you prefer. The PC isn't dead.

It is not my definition of IPC, it is the definition of IPC. You simply divide the number of instructions executed by the number of clock cycles it took and you have a result. You can’t do that without running the DUT in a system, and it makes no sense to use any system components other than the fastest ones that the manufacturer suggests for the given DUT.

In the purest form, this is IPC.

But IPC is problematic across different instruction sets.
Let's say, in theory, you have a CISC instruction that loads a number, increments it by 1, then saves it to memory. It can do that across 3 cycles. On the other side, you have a RISC CPU that needs a load instruction, an increment instruction, and a store instruction to do the same amount of work, but each takes 1 cycle to run. This means this CPU runs 3 times the number of instructions for the same amount of work. We could say it has 3x the IPC of the CISC CPU, but in the end nothing more was done.

This is why, in its purest form, IPC is only a good comparison within the same instruction set. And it's probably only really useful for comparing two CPUs from the same manufacturer once you factor in the frequency they run at.
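
The theoretical CISC/RISC case above, written out as arithmetic. This is only a sketch of that hypothetical; the function names are mine, and "one unit of work" means one load-increment-store.

```python
from fractions import Fraction


def ipc(instructions, cycles):
    """Instructions per clock as an exact fraction."""
    return Fraction(instructions, cycles)


def work_per_clock(work_units, cycles):
    """Units of useful work per clock as an exact fraction."""
    return Fraction(work_units, cycles)


# One unit of work = load a number, add 1, store it back.
cisc_ipc = ipc(1, 3)   # one combined instruction that takes 3 cycles
risc_ipc = ipc(3, 3)   # load + increment + store, 1 cycle each
cisc_wpc = work_per_clock(1, 3)
risc_wpc = work_per_clock(1, 3)

print("IPC ratio (RISC/CISC):", risc_ipc / cisc_ipc)
print("Same work per clock:", cisc_wpc == risc_wpc)
```

The IPC differs by a factor of 3 while the work done per clock is identical, which is why the raw instruction count is a poor basis for comparison across instruction sets.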

Also, the same processor can get a higher IPC at a lower frequency than at a higher one if it has to wait fewer cycles for I/O or memory. Waiting 60 ns for data to arrive at 2 GHz costs fewer cycles than the same wait at 5 GHz. This is why it's hard to extract IPC in its purest form from a benchmark.
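
A quick sketch of the 60 ns example: the same wall-clock wait costs more cycles at a higher clock. The workload (1000 instructions, 500 busy cycles, one memory wait) is invented purely for illustration.

```python
def stall_cycles(latency_ns, clock_ghz):
    """Clock cycles that pass during a fixed wait (1 GHz = 1 cycle per ns)."""
    return latency_ns * clock_ghz


def ipc_with_one_miss(instructions, busy_cycles, latency_ns, clock_ghz):
    """IPC when the work needs busy_cycles plus a single memory wait."""
    total_cycles = busy_cycles + stall_cycles(latency_ns, clock_ghz)
    return instructions / total_cycles


# Hypothetical workload: 1000 instructions, 500 busy cycles, one 60 ns wait.
for ghz in (2.0, 5.0):
    cycles = 500 + stall_cycles(60, ghz)
    runtime_ns = cycles / ghz
    print(f"{ghz:.0f} GHz: stall={stall_cycles(60, ghz):.0f} cycles, "
          f"IPC={ipc_with_one_miss(1000, 500, 60, ghz):.2f}, "
          f"runtime={runtime_ns:.0f} ns")
```

With these made-up numbers the 5 GHz run shows a lower IPC but still finishes sooner, so a lower IPC at a higher clock does not by itself mean lower performance.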

What most people call IPC these days is mostly a somewhat standardized metric like the SPEC benchmark. It's no longer the number of instructions per clock but the amount of work per clock. And in the end, that is what really matters.

But we should say WPC or something similar instead of IPC.
 
Joined
Oct 15, 2019
Messages
376 (0.48/day)
In the purest form, this is IPC.

But IPC is problematic across different instruction sets.
Let's say, in theory, you have a CISC instruction that loads a number, increments it by 1, then saves it to memory. It can do that across 3 cycles. On the other side, you have a RISC CPU that needs a load instruction, an increment instruction, and a store instruction to do the same amount of work, but each takes 1 cycle to run. This means this CPU runs 3 times the number of instructions for the same amount of work. We could say it has 3x the IPC of the CISC CPU, but in the end nothing more was done.

This is why, in its purest form, IPC is only a good comparison within the same instruction set. And it's probably only really useful for comparing two CPUs from the same manufacturer once you factor in the frequency they run at.

Also, the same processor can get a higher IPC at a lower frequency than at a higher one if it has to wait fewer cycles for I/O or memory. Waiting 60 ns for data to arrive at 2 GHz costs fewer cycles than the same wait at 5 GHz. This is why it's hard to extract IPC in its purest form from a benchmark.

What most people call IPC these days is mostly a somewhat standardized metric like the SPEC benchmark. It's no longer the number of instructions per clock but the amount of work per clock. And in the end, that is what really matters.

But we should say WPC or something similar instead of IPC.
Yep, WPC is a much more meaningful metric, and it is also always system specific, as well as application specific in the same way.

IPC is a bit silly; for example, AVX-512 lowers IPC but improves WPC.
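
A toy illustration of the AVX-512 point. A 512-bit vector holds 16 32-bit floats, so one vector add does the work of 16 scalar adds. The throughput figures (4 scalar adds per cycle, 1 vector add per cycle) are assumptions chosen for the example, not numbers for any real core.

```python
def ipc_and_wpc(elements, lanes, instr_per_cycle):
    """Return (instructions per clock, elements processed per clock)."""
    instructions = elements // lanes          # one instruction handles `lanes` elements
    cycles = instructions / instr_per_cycle   # assumes perfectly steady throughput
    return instructions / cycles, elements / cycles


# Adding 1024 floats: scalar (1 lane) vs 512-bit vectors (16 lanes of float32).
scalar_ipc, scalar_wpc = ipc_and_wpc(1024, lanes=1, instr_per_cycle=4)   # assumed rate
vector_ipc, vector_wpc = ipc_and_wpc(1024, lanes=16, instr_per_cycle=1)  # assumed rate

print(f"scalar: IPC={scalar_ipc:.0f}, elements/clock={scalar_wpc:.0f}")
print(f"vector: IPC={vector_ipc:.0f}, elements/clock={vector_wpc:.0f}")
```

Under these assumptions the vector version retires fewer instructions per clock yet gets four times as much work done per clock.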
 
Joined
Jan 8, 2017
Messages
7,047 (3.93/day)
Upper bound isn’t interesting, and neither is lower bound.
It's absolutely important to know the lower bound because that tells you what the worst-case scenario is.

Trying to write software to get the min and max of a given processor is a pointless exercise.

Huh ? You always aim to get the most out of a processor given whatever the time constraints are; I don't know what you are talking about.

I’m interested in IPC measured with software that is actually used by people to do something productive.
That's such a bizarre thing to say. OK, you find out that a CPU can achieve X IPC in a certain application. What can you do with that information ? Absolutely nothing. People measure IPC to generalize what the performance characteristics are. If you are only interested in one application, as you say, then IPC measurements are pointless; you're actually just interested in the performance of that application.

You'd be surprised. Not to mention the rich kids with $3000+ MBPs for school/studies etc.

I guess ? It's still stupid, and it doesn't tell you anything about how relevant the hardware is if you argue that people who didn't actually need one would buy it anyway.

Then why are AMD claiming a 15% IPC (in gaming workloads) bump from their stacked 3D cache? Given that those are among the most latency sensitive workloads, and a stacked, via-connected cache is a bit of a worst case scenario, that seems to contradict what you're saying here.

I have no idea, but I fail to see why it contradicts anything that I said. I just said that more L3 cache doesn't always translate to much improved performance. If your code and data mostly reside in L1 cache, then messing around with the L3 cache won't do anything. Obviously real-world workloads are a mixture of stuff that benefits more or less from different levels of cache, or from none of them at all.
I agree that it's about time PC CPU manufacturers start breaking down some walls, but you don't quite seem to appreciate the scope of Apple's engineering on their recent SoCs and CPU core architectures. I'd highly recommend reading AnandTech's M1/A14 deep dive, as it goes in depth (with self-written feature tests which are excellent illustrations) on everything from L1 cache behaviour to numbering the various execution ports in the CPU and estimating an overall layout of the CPU.

I read many of his articles and while they're very good I can't help but notice he has a particular affinity for everything Apple does.
Managing to design a CPU core this wide without significant performance or power penalties and managing to keep it fed is very impressive

A wide core with huge caches and, most importantly, a very conservative clock speed. That's why I am not impressed; trust me, if their chips ran at much higher clocks comparable to Intel's and AMD's chips while retaining the same characteristics, then I'd be really impressed. But I know that's not possible; ultimately, all they did is a simple tradeoff of clock speed for area/transistor budget, because that way efficiency increases more than linearly. I just cannot give them much credit when they outperform Intel and AMD in some metrics while using, who knows, maybe several times more transistors per core ?
 

Aquinus

Resident Wat-man
Joined
Jan 28, 2012
Messages
12,440 (3.45/day)
Location
Concord, NH
I'm not, I'm just saying that everyone can throw more cache or faster system memory at the problem and get good results. I am just not impressed by that. Apple has paired server-level system bandwidth with their SoC, so yeah, it's going to be very fast compared to a lot of other CPUs. It'd be laughable if it wasn't.
You can't just throw more bandwidth at a problem and expect it to go faster. Take way back when I used an i7 3820. Quad-channel DDR3-2133 gives impressive bandwidth numbers compared to a 2700K, but in reality the 3820 was only something like 5-8% faster at stock, and that difference wasn't down to clock speeds, it was the extra 2MB of L3 that the 3820 had over the regular SB i7 chips. So bandwidth alone doesn't make a chip faster; otherwise the 290(X)/390(X) should have been insanely fast, when in reality NVIDIA was doing the same with half the width.

So to make a long story short, how the different levels of the memory hierarchy are built out really influences how it benefits the SoC as a whole. A huge LLC won't do you a whole lot of good if your L2 is absolutely tiny. So it's a bit more complicated than just throwing more of x, y, or z at a problem.
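
For a sense of the bandwidth gap in the i7 3820 example, here is a rough peak-bandwidth calculation. It assumes standard 64-bit (8-byte) DDR3 channels and, as my own simplification, the same DDR3-2133 speed on the dual-channel side; real sustained bandwidth is lower than these theoretical peaks.

```python
def peak_bandwidth_gb_s(channels, mega_transfers_per_s, bus_bytes=8):
    """Theoretical peak: channels * bytes per transfer * transfers per second."""
    return channels * bus_bytes * mega_transfers_per_s / 1000


quad = peak_bandwidth_gb_s(4, 2133)   # quad-channel DDR3-2133
dual = peak_bandwidth_gb_s(2, 2133)   # dual-channel at the same speed (assumption)

print(f"quad-channel: {quad:.1f} GB/s")
print(f"dual-channel: {dual:.1f} GB/s")
print(f"ratio: {quad / dual:.1f}x")
```

Twice the theoretical bandwidth against a 5-8% real-world difference is the mismatch the post is pointing at.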

On this note, I do find it really pathetic that PC manufacturers haven't moved to a new configuration that allows for wider interfaces. It's insane that we have to wait for years on end for a new DDR standard in order to get more bandwidth. This is really the only concrete area where Intel and AMD can't do jack shit, not by themselves anyway.
Wide memory interfaces for DRAM cost a lot of die space and power, and the traces for the memory chips make boards expensive to produce. It's not a good path forward for traditional DRAM. Now, I would agree with respect to HBM2 given its bandwidth and power characteristics, but it also comes with trade-offs in the sense that it's relatively expensive to produce. Apple is basically doing that with their DRAM, so they have the advantage of economy of scale.
 
Joined
Oct 15, 2019
Messages
376 (0.48/day)
It's absolutely important to know the lower bound because that tells you what the worst-case scenario is.
The lower bound is something stupidly low on modern processors, but exceedingly hard to achieve. You’ll need to write a test that causes the most cache misses and is impossible to predict. Is that something that any proper SW ever experiences? No.

Zen 3 with JEDEC 3200 has a RAM latency of 80 ns or so. The worst possible IPC I can think of would require most program instructions to be fetched from RAM. That just requires a huge 3D LUT lookup and a goto based on the result. So one RAM latency per two instructions, meaning an IPC of around 1/200. If the prediction logic can see through that, you’ll need to add some stupid instruction to do an address conversion that cannot easily be predicted (some hash function maybe, that is a single instruction in some extension), and you end up with an IPC of around 1/133.

edit: a cleaner solution would be to just write a simple routine that reads a byte at addr, then writes some hash of that byte (the processor must have some hash extension, so that it is simply one instruction) to addr, and loops. That would produce an IPC of 1/100 or so.
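
The back-of-the-envelope numbers above can be reproduced like this. The 80 ns latency is from the post; the 5 GHz clock is an assumption (it is what the 1/200 figure implies), and reading the 1/100 case as four instructions per miss is my interpretation.

```python
from fractions import Fraction

RAM_LATENCY_NS = 80     # figure quoted in the post
ASSUMED_CLOCK_GHZ = 5   # assumption: implied by 2 instructions -> IPC of 1/200


def worst_case_ipc(instructions_per_miss,
                   latency_ns=RAM_LATENCY_NS,
                   clock_ghz=ASSUMED_CLOCK_GHZ):
    """IPC when every group of instructions waits for one full RAM access."""
    cycles_per_miss = latency_ns * clock_ghz   # 80 ns * 5 cycles/ns = 400 cycles
    return Fraction(instructions_per_miss, cycles_per_miss)


for n in (2, 3, 4):
    value = worst_case_ipc(n)
    print(f"{n} instructions per miss: IPC = {value} (about 1/{1 / float(value):.0f})")
```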
Huh ? You always aim to use the most out of processor given whatever the time constraints are, I don't know what you are talking about.
I was talking about the min and max IPC. Trying to measure them is pointless.
 
Joined
Jan 8, 2017
Messages
7,047 (3.93/day)
I was talking about the min and max IPC. Trying to measure them is pointless.
It's not like you sit in front of a PC for hours on end every time you want to write something to measure IPC. Plus, you're the one that says you want to measure IPC for every single application you're interested in and you didn't explain why that isn't pointless as well.

The lower bound is something stupidly low on modern processors, but exceedingly hard to achieve.
Not really, all it takes is one or two unfortunate instructions in a loop to have major performance implications. Happens all the time and you never know about it; you just hit compile and assume that that's just how it is. It can sometimes be as simple as choosing a 32-bit variable over a 64-bit one, which leads to some weird instructions under the hood that may run abnormally slowly on some processors.
 
Joined
Oct 15, 2019
Messages
376 (0.48/day)
It's not like you sit in front of a PC for hours on end every time you want to write something to measure IPC. Plus, you're the one that says you want to measure IPC for every single application you're interested in and you didn't explain why that isn't pointless as well.


Not really, all it takes is one or two unfortunate instructions in a loop to have major performance implications. Happens all the time and you never know about it; you just hit compile and assume that that's just how it is. It can sometimes be as simple as choosing a 32-bit variable over a 64-bit one.
Nah, that is nowhere close to being at the lower bound. I updated my post above with a silly scenario that can be much, much worse.
 
Joined
May 2, 2017
Messages
5,496 (3.27/day)
Location
Norway, currently in Lund, Sweden
I guess ? It's still stupid, and it doesn't tell you anything about how relevant the hardware is if you argue that people who didn't actually need one would buy it anyway.
I wasn't talking about how relevant the hardware was, I was responding to you stating that you don't think anyone buys $3000+ laptops for "office work", and your arguments against Apple knowing their audience. It's pretty clear that they do (in part because they've been pissing off their core creative audience for years, and are now finally delivering something they would want).
I have no idea, but I fail to see why it contradicts anything that I said. I just said that more L3 cache doesn't always translate to much improved performance. If your code and data mostly reside in L1 cache, then messing around with the L3 cache won't do anything. Obviously real-world workloads are a mixture of stuff that benefits more or less from different levels of cache, or from none of them at all.
Well, that's why they also have massive L1 and L2 caches (and actually no CPU L3 at all, but a system-level LLC). But it contradicts what you said because you said (without any reservations) that "L3 caches are sometimes too slow to provide meaningful speed ups", which ... well, if they didn't meaningfully do so in real-world workloads it would be pretty odd for AMD to invest millions into stacked L3 cache tech, IMO.
I read many of his articles and while they're very good I can't help but notice he has a particular affinity for everything Apple does.
...and? Is appreciating high-end engineering wrong? I haven't seen a single article that comes even close to the level of depth and quality of analysis of these articles. And nothing to contradict anything said either.
A wide core with huge caches and, most importantly, a very conservative clock speed. That's why I am not impressed; trust me, if their chips ran at much higher clocks comparable to Intel's and AMD's chips while retaining the same characteristics, then I'd be really impressed. But I know that's not possible; ultimately, all they did is a simple tradeoff of clock speed for area/transistor budget, because that way efficiency increases more than linearly.
You could say that, but only if you ignore the latencies and how they're keeping the cores fed. As I said, increasing cache size should balloon latency, yet theirs is lower than the competition despite 3-6x larger caches. And with that wide a core, you're really starting to push the boundaries of what can be effectively fed with conventional software - yet they're pulling it off. It would also be expected that this much larger die, even at lower clock speeds, would be rather power hungry for what it does - yet it isn't. This is no doubt largely down to granular power gating and the large caches saving them a lot of data shuffling (especially into/out of RAM), but that isn't the whole story.
I just cannot give them much credit when they outperform Intel and AMD in some metrics while using who knows, maybe several times more transistors per core ?
Their cores are absolutely massive, that is absolutely true. But so what? They're still managing to use them in smartphones(!) and thin-and-light laptops. This mainly demonstrates that Apple is less margin conscious on this level than AMD and Intel - which is very understandable. That clearly makes this core less suited for budget devices. But less impressive? Nah. A 5950X is a $750 CPU. If Apple sold these at retail they'd no doubt be more than that, but we're not comparing to budget devices, we're comparing to the best they're putting out.

But given that Intel's latest L1 cache size increase (32 to 48K, IIRC) came with a 1-cycle latency penalty, I can't quite see how they (or AMD) would suddenly pull a 3-6x increase in cache sizes out of their sleeves without also dramatically increasing latencies, which begs the question of whether others would even be able to make a similarly huge, wide, and cache-rich core without it being hobbled by slow cache accesses and thus not being fed. That seems to be the case, as we would otherwise most likely see much wider designs for servers and other markets where costs don't matter.

It's not like you sit in front of a PC for hours on end every time you want to write something to measure IPC. Plus, you're the one that says you want to measure IPC for every single application you're interested in and you didn't explain why that isn't pointless as well.
No, that's why we have industry-standard benchmarks based on real-world workloads. It's obvious that no such thing will ever be perfect, but it is a reasonable approximation of performance across a wide range of real-world usage scenarios.
 
Joined
Mar 10, 2010
Messages
9,485 (2.21/day)
Location
Manchester uk
System Name RyzenGtEvo/ Asus strix scar II
Processor Amd R7 3800X@4.350/525/ Intel 8750H
Motherboard Crosshair hero7 @bios 2703/?
Cooling 360EK extreme rad+ 360$EK slim all push, cpu Monoblock Gpu full cover all EK
Memory Corsair Vengeance Rgb pro 3600cas14 32Gb in four sticks./16Gb
Video Card(s) Sapphire refference Rx vega 64 EK waterblocked/Rtx 2060
Storage Silicon power qlc nvmex3 in raid 0/8Tb external/1Tb samsung Evo nvme 2Tb sata ssd
Display(s) Samsung UAE28"850R 4k freesync.
Case Lianli p0-11 dynamic
Audio Device(s) Xfi creative 7.1 on board ,Yamaha dts av setup, corsair void pro headset
Power Supply corsair 1200Hxi
Mouse Roccat Kova/ Logitech G wireless
Keyboard Roccat Aimo 120
VR HMD Oculus rift
Software Win 10 Pro
Benchmark Scores 8726 vega 3dmark timespy/ laptop Timespy 6506
It is not my definition of IPC, it is the definition of IPC. You simply divide the number of instructions executed by the number of clock cycles it took and have a result. You can’t do that without running the DUT in a system, and it makes no sense to use any other system components than the fastest ones that the manufacturer suggests that you use for the given DUT.
Crack on, show a link then. No one calls it a system measurement; IPC is not a system measurement.
It's a core microarchitecture measurement, and is only indicative of performance, not demonstrative of all performance.
 
Joined
Jan 8, 2017
Messages
7,047 (3.93/day)
System Name Good enough
Processor AMD Ryzen R7 1700X - 4.0 Ghz / 1.350V
Motherboard ASRock B450M Pro4
Cooling Deepcool Gammaxx L240 V2
Memory 16GB - Corsair Vengeance LPX - 3333 Mhz CL16
Video Card(s) OEM Dell GTX 1080 with Kraken G12 + Water 3.0 Performer C
Storage 1x Samsung 850 EVO 250GB , 1x Samsung 860 EVO 500GB
Display(s) 4K Samsung TV
Case Deepcool Matrexx 70
Power Supply GPS-750C
Nah, that is nowhere close to being at the lower bound.
But it's exceedingly easy to stumble across, it doesn't have to be silly.

But so what?
It's ridiculously inefficient and it's not scalable. It's funny, they used to make fun of other smartphone manufacturers because "everyone can make something bigger". Now they do the same with their chips.
Well, that's why they also have massive L1 and L2 caches (and actually no CPU L3 at all, but a system-level LLC). But it contradicts what you said because you said (without any reservations) that "L3 caches are sometimes too slow to provide meaningful speed ups", which ... well, if they didn't meaningfully do so in real-world workloads it would be pretty odd for AMD to invest millions into stacked L3 cache tech, IMO.
I don't get it, what's so hard to understand about the word "sometimes" ?
But given that Intel's latest L1 cache size increase (32 to 48K, IIRC) came with a 1-cycle latency penalty, I can't quite see how they (or AMD) would suddenly pull a 3-6x increase in cache sizes out of their sleeves without also dramatically increasing latencies, which begs the question of whether others would even be able to make a similarly huge, wide, and cache-rich core without it being hobbled by slow cache accesses and thus not being fed. That seems to be the case, as we would otherwise most likely see much wider designs for servers and other markets where costs don't matter.
It all has to do with TLB and the size of pages that it uses to map the memory space. Basically Apple is using larger pages than on x86 platforms so the search space for the same amount of memory is smaller, hence lower latencies.
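To put rough numbers on the page-size point: TLB reach is just entry count times page size, so larger pages cover the same memory with fewer translations. A minimal sketch, assuming a hypothetical 256-entry TLB (the entry count and the 8 MiB working set are made up for illustration, not figures for any real chip):

```python
KIB = 1024
MIB = 1024 * KIB


def tlb_reach_bytes(entries: int, page_size_bytes: int) -> int:
    """Memory covered by a fully populated TLB: one page per entry."""
    return entries * page_size_bytes


def pages_needed(working_set_bytes: int, page_size_bytes: int) -> int:
    """Number of pages (and so translations) needed to map a working set."""
    return -(-working_set_bytes // page_size_bytes)  # ceiling division


ENTRIES = 256            # hypothetical TLB size, for illustration only
WORKING_SET = 8 * MIB    # hypothetical working set

reach_4k = tlb_reach_bytes(ENTRIES, 4 * KIB)
reach_16k = tlb_reach_bytes(ENTRIES, 16 * KIB)
pages_4k = pages_needed(WORKING_SET, 4 * KIB)
pages_16k = pages_needed(WORKING_SET, 16 * KIB)

print(f"reach with 4 KiB pages:  {reach_4k // MIB} MiB, pages to map: {pages_4k}")
print(f"reach with 16 KiB pages: {reach_16k // MIB} MiB, pages to map: {pages_16k}")
```

With the same number of entries, 16 KiB pages give four times the reach of 4 KiB pages, which is the "smaller search space" argument in its simplest form. It says nothing about how much of the measured latency difference this actually explains.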
 
Joined
Oct 15, 2019
Messages
376 (0.48/day)
But it's exceedingly easy to stumble across, it doesn't have to be silly.
But exceedingly far from the actual lower bound IPC, which you said was somehow important to know.


Crack on, show a link then. No one calls it a system measurement; IPC is not a system measurement.
It's a core microarchitecture measurement, and is only indicative of performance, not demonstrative of all performance.
Of course it’s indicative of only the performance measured. I’d never generalize IPC measurements to overall performance.

Just read from here: https://en.m.wikipedia.org/wiki/Instructions_per_cycle

”The number of instructions executed per clock is not a constant for a given processor; it depends on how the particular software being run interacts with the processor, and indeed the entire machine, particularly the memory hierarchy.”
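
The definition itself is a one-liner. A small sketch with made-up counter readings (the instruction and cycle counts below are hypothetical, not measurements from any real run), showing how the same core can land at very different IPC depending on the code:

```python
def ipc(instructions_retired: int, cycles: int) -> float:
    """Instructions per cycle: retired instructions divided by elapsed core cycles."""
    if cycles <= 0:
        raise ValueError("cycles must be positive")
    return instructions_retired / cycles


# Hypothetical counter readings for one core running two different programs.
cache_friendly = ipc(instructions_retired=4_000_000, cycles=1_000_000)
pointer_chasing = ipc(instructions_retired=10_000, cycles=1_000_000)

print(f"cache-friendly loop: {cache_friendly:.2f} IPC")
print(f"pointer chasing:     {pointer_chasing:.2f} IPC")
```

On real hardware the two inputs would come from performance counters, and the result only describes that program on that system.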
 
Joined
Jan 8, 2017
Messages
7,047 (3.93/day)
System Name Good enough
Processor AMD Ryzen R7 1700X - 4.0 Ghz / 1.350V
Motherboard ASRock B450M Pro4
Cooling Deepcool Gammaxx L240 V2
Memory 16GB - Corsair Vengeance LPX - 3333 Mhz CL16
Video Card(s) OEM Dell GTX 1080 with Kraken G12 + Water 3.0 Performer C
Storage 1x Samsung 850 EVO 250GB , 1x Samsung 860 EVO 500GB
Display(s) 4K Samsung TV
Case Deepcool Matrexx 70
Power Supply GPS-750C
But exceedingly far from the actual lower bound IPC, which you said was somehow important to know.
How is it "exceedingly" far ? Even misaligned SSE/AVX loads or the odd floating point division here and there can destroy the IPC, you don't even have to mess around with the code to find out how you can create the most cache misses or mispredictions. Plus you're not even testing the IPC at that point, you're just writing terrible code in general.
 
Joined
May 2, 2017
Messages
5,496 (3.27/day)
Location
Norway, currently in Lund, Sweden
System Name Hotbox
Processor AMD Ryzen 7 5800X
Motherboard ASRock Phantom Gaming B550 ITX/ax
Cooling Aquanaut + Laing DDC 1T Plus PWM + Corsair XR5 280mm + 2x Arctic P14
Memory 32GB G.Skill FlareX 3200c14
Video Card(s) PowerColor Radeon 6900XT Liquid Devil Ultimate, UV@950mV/2050MHz/180W
Storage 2TB Adata SX8200 Pro
Display(s) Dell U2711 main, AOC 24P2C secondary
Case SSUPD Meshlicious
Audio Device(s) Optoma Nuforce μDAC 3
Power Supply Corsair SF750 Platinum
Mouse Logitech G602
Keyboard Cooler Master MasterKeys Pro M w/DSA profile caps
Software Windows 10 Pro
It's ridiculously inefficient and it's not scalable. It's funny, they used to make fun of other smartphone manufacturers because "everyone can make something bigger". Now they do the same with their chips.
Well, depends what type of efficiency you're looking for. It's not efficient per area or transistor, but per unit of power or clock speed? Massively so.
I don't get it, what's so hard to understand about the word "sometimes" ?
Apparently just as hard as it is to understand that your "sometimes" isn't particularly applicable, either in this case or in other relevant comparisons. That doesn't mean it's untrue, it just means it's not particularly relevant as an objection.
It all has to do with TLB and the size of pages that it uses to map the memory space. Basically Apple is using larger pages than on x86 platforms so the search space for the same amount of memory is smaller, hence lower latencies.
Again: if it was that simple, why isn't everyone doing that? Given how many server chips Intel sells, if they could make a huge core like this for servers and deliver 50% higher IPC and ISO performance at half the power per core, they would do so, regardless of the area needed. You could always blame server vendors for not wanting to adopt such a system, but frankly I don't think that would be a problem. Google and Facebook would gobble them up for R&D purposes if nothing else, and they wouldn't care if the CPUs were $10 000 apiece. (Also, they use 16KB TLB pages, but they are also compatible with 4KB pages.)

And that's the key here: the individual parts of what Apple is doing here might not be that impressive, but that they're managing to make all of this into a functional, well balanced and highly performant and efficient core? That is impressive. Very much so. In the end, what matters at the user end is performance and power consumption, which are always in tension, especially in mobile and SFF use cases. The M1 (and upcoming siblings) manages to shift to an entirely different level in this balance, most likely delivering 5800X-level performance (if not higher) at half the power or less (a 5800X is ~140W under full boost and an all-core load after all, these are 50-60W chips), while also containing either a mid-range or high-end dGPU-level iGPU. That is obviously impressive. Will it come with tradeoffs? Of course it will. Concurrent CPU and GPU loads will be power and/or thermally limited, as always, and they do spend an almost silly amount of silicon per chip. But does that matter when the laptop is comparably priced to competitors? No. And sure, you can no doubt find a comparable laptop for less. But a 5980HX+3080/Quadro RTX workstation isn't going to cost you any less than an M1 Max MBP, and both that and the cheaper consumer-focused version are going to be much bigger and heavier, and have terrible battery life. Making a product is, when it comes down to it, about the full package. These chips clearly have downsides, but they are downsides that are largely immaterial in the context of the overall package. And that's what makes them impressive.
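
As a back-of-the-envelope check on the "half the power or less" figure: at equal performance, the perf-per-watt advantage is simply the ratio of the two power draws. A rough sketch using the ballpark wattages above; equal performance is an assumption here, not a measured result, and the function name is only for illustration:

```python
def perf_per_watt_advantage(baseline_watts: float, other_watts: float) -> float:
    """Perf/W advantage of `other` over `baseline`, assuming equal performance."""
    if baseline_watts <= 0 or other_watts <= 0:
        raise ValueError("power draw must be positive")
    return baseline_watts / other_watts


DESKTOP_WATTS = 140.0               # ~5800X all-core, ballpark from the post
LAPTOP_WATTS_RANGE = (50.0, 60.0)   # ballpark for the new chips, from the post

low = perf_per_watt_advantage(DESKTOP_WATTS, LAPTOP_WATTS_RANGE[1])
high = perf_per_watt_advantage(DESKTOP_WATTS, LAPTOP_WATTS_RANGE[0])

print(f"perf/W advantage at equal performance: {low:.2f}x to {high:.2f}x")
```

If the real performance turns out higher or lower than a 5800X, the ratio scales accordingly.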
 
Joined
Jan 8, 2017
Messages
7,047 (3.93/day)
System Name Good enough
Processor AMD Ryzen R7 1700X - 4.0 Ghz / 1.350V
Motherboard ASRock B450M Pro4
Cooling Deepcool Gammaxx L240 V2
Memory 16GB - Corsair Vengeance LPX - 3333 Mhz CL16
Video Card(s) OEM Dell GTX 1080 with Kraken G12 + Water 3.0 Performer C
Storage 1x Samsung 850 EVO 250GB , 1x Samsung 860 EVO 500GB
Display(s) 4K Samsung TV
Case Deepcool Matrexx 70
Power Supply GPS-750C
Again: if it was that simple, why isn't everyone doing that?

I guess it would need a lot of changes on many levels, I am not sure, I am not that well versed in these details to be able to tell. The point is there is nothing that incomprehensible about how they achieved these things.

It's not efficient per area or transistor, but per unit of power or clock speed? Massively so.

But what's the point if, say, you achieve X times the efficiency using more than X times the area ? The disadvantages will eventually outpace the advantages, and you'll be stuck with a design that's hard to change.

And that's the key here: the individual parts of what Apple is doing here might not be that impressive, but that they're managing to make all of this into a functional, well balanced and highly performant and efficient core? That is impressive. Very much so.
It's not that the end product isn't impressive, it's how they got there that isn't.

A 400+ mm^2 SoC on the newest node with 400GB/s bandwidth that's really fast ? Wow... I guess.
 
Joined
Oct 15, 2019
Messages
376 (0.48/day)
How is it "exceedingly" far ? Even misaligned SSE/AVX loads or the odd floating point division here and there can destroy the IPC, you don't even have to mess around with the code to find out how you can create the most cache misses or mispredictions. Plus you're not even testing the IPC at that point, you're just writing terrible code in general.
But IPC is always application specific!!!
Terrible IPC always requires terrible code to go with it.

My example got to around one instruction per 100 clock cycles, and is likely the worst you can get to without disabling processor features. What is the IPC in your misaligned AVX loads?
 
Joined
Jan 8, 2017
Messages
7,047 (3.93/day)
System Name Good enough
Processor AMD Ryzen R7 1700X - 4.0 Ghz / 1.350V
Motherboard ASRock B450M Pro4
Cooling Deepcool Gammaxx L240 V2
Memory 16GB - Corsair Vengeance LPX - 3333 Mhz CL16
Video Card(s) OEM Dell GTX 1080 with Kraken G12 + Water 3.0 Performer C
Storage 1x Samsung 850 EVO 250GB , 1x Samsung 860 EVO 500GB
Display(s) 4K Samsung TV
Case Deepcool Matrexx 70
Power Supply GPS-750C
But IPC is always application specific!!!
Terrible IPC always requires terrible code to go with it.
Then what's the point of measuring it ? What do you do with that information ?
 

r9

Joined
Jul 28, 2008
Messages
2,918 (0.60/day)
System Name Primary | Secondary | Poweredge r410|Dell XPS
Processor i7 9700k| Ryzen 1600| 2 x E5620 |i5 5500U
Memory 16GB DDR4 |16GB DDR4 | 32GB ECC DDR3|8GB DDR4
Video Card(s) GTX 1070|2 x RX570 |On-Board|On-Board|
Storage 512GB SSD+1TB SSD|512GB SSD+1TB|2x256GBSSD 2x2TBGB
Display(s) 50" 4k TV | 27" + 2 x 24" LCD Setup
VR HMD Samsung Odyssey+
Software Windows 10 |Windows 10| Server 2012 r2
Personally, when I first heard of the M1 I thought Apple was going to make a laptop-shaped phone that would be cheap to make, sold at the same price, and with poor software support. But after it got released I watched a few YouTube videos, and all I know is that people were very happy with it, especially with things like rendering and battery life, and having the same performance on battery as on mains power, something neither AMD nor Intel can ever do with x86. Video rendering discussion can be closed, as Intel/AMD won't be able to come even close to the Pro/Max. Also, OSX's ARM support is light-years ahead of Windows for ARM. ARM can be the future if Microsoft can make something as efficient as Rosetta. IMO 95% of people use only 10% of the instruction set, so why not reap all the benefits of a RISC chip for the majority of people, and the other 5% can always go with Intel/AMD. So it makes much more sense for ARM to be the mainstream option, not the other way around. The problem is that the only way we get a proper PC/Windows ARM platform is for AMD and Intel to enter that market. And it will all depend on what Apple does with it, but Apple being Apple, they create their own markets and sell expensive laptops to only a very small portion of the global laptop market, so AMD or Intel will never be in a position where they have no choice but to switch to ARM.
 
Joined
Oct 15, 2019
Messages
376 (0.48/day)
Then what's the point of measuring it ? What do you do with that information ?
You are the one who insisted that knowing the lower bound of the IPC was important!!

I have never stated that knowing it is of any importance.


IPC in itself (or rather WPC) is a great way to compare different systems analytically. I.e. to understand differences in generic application performance and where the differences might come from.
 