Monday, October 18, 2021

Apple Introduces M1 Pro and M1 Max: the Most Powerful Chips Apple Has Ever Built

Apple today announced M1 Pro and M1 Max, the next breakthrough chips for the Mac. Scaling up M1's transformational architecture, M1 Pro offers amazing performance with industry-leading power efficiency, while M1 Max takes these capabilities to new heights. The CPU in M1 Pro and M1 Max delivers up to 70 percent faster CPU performance than M1, so tasks like compiling projects in Xcode are faster than ever. The GPU in M1 Pro is up to 2x faster than M1, while M1 Max is up to an astonishing 4x faster than M1, allowing pro users to fly through the most demanding graphics workflows.

M1 Pro and M1 Max introduce a system-on-a-chip (SoC) architecture to pro systems for the first time. The chips feature fast unified memory, industry-leading performance per watt, and incredible power efficiency, along with increased memory bandwidth and capacity. M1 Pro offers up to 200 GB/s of memory bandwidth with support for up to 32 GB of unified memory. M1 Max delivers up to 400 GB/s of memory bandwidth—2x that of M1 Pro and nearly 6x that of M1—and support for up to 64 GB of unified memory. And while the latest PC laptops top out at 16 GB of graphics memory, having this huge amount of memory enables graphics-intensive workflows previously unimaginable on a notebook. The efficient architecture of M1 Pro and M1 Max means they deliver the same level of performance whether MacBook Pro is plugged in or using the battery. M1 Pro and M1 Max also feature enhanced media engines with dedicated ProRes accelerators specifically for pro video processing. M1 Pro and M1 Max are by far the most powerful chips Apple has ever built.
"M1 has transformed our most popular systems with incredible performance, custom technologies, and industry-leading power efficiency. No one has ever applied a system-on-a-chip design to a pro system until today with M1 Pro and M1 Max," said Johny Srouji, Apple's senior vice president of Hardware Technologies. "With massive gains in CPU and GPU performance, up to six times the memory bandwidth, a new media engine with ProRes accelerators, and other advanced technologies, M1 Pro and M1 Max take Apple silicon even further, and are unlike anything else in a pro notebook."

M1 Pro: A Whole New Level of Performance and Capability
Utilizing the industry-leading 5-nanometer process technology, M1 Pro packs in 33.7 billion transistors, more than 2x the amount in M1. A new 10-core CPU, including eight high-performance cores and two high-efficiency cores, is up to 70 percent faster than M1, resulting in unbelievable pro CPU performance. Compared with the latest 8-core PC laptop chip, M1 Pro delivers up to 1.7x more CPU performance at the same power level and achieves the PC chip's peak performance using up to 70 percent less power. Even the most demanding tasks, like high-resolution photo editing, are handled with ease by M1 Pro.
M1 Pro has an up-to-16-core GPU that is up to 2x faster than M1 and up to 7x faster than the integrated graphics on the latest 8-core PC laptop chip. Compared to a powerful discrete GPU for PC notebooks, M1 Pro delivers more performance while using up to 70 percent less power. And M1 Pro can be configured with up to 32 GB of fast unified memory, with up to 200 GB/s of memory bandwidth, enabling creatives like 3D artists and game developers to do more on the go than ever before.
M1 Max: The World's Most Powerful Chip for a Pro Notebook
M1 Max features the same powerful 10-core CPU as M1 Pro and adds a massive 32-core GPU for up to 4x faster graphics performance than M1. With 57 billion transistors—70 percent more than M1 Pro and 3.5x more than M1—M1 Max is the largest chip Apple has ever built. In addition, the GPU delivers performance comparable to a high-end GPU in a compact pro PC laptop while consuming up to 40 percent less power, and performance similar to that of the highest-end GPU in the largest PC laptops while using up to 100 watts less power. This means less heat is generated, fans run quietly and less often, and battery life is amazing in the new MacBook Pro. M1 Max transforms graphics-intensive workflows, including up to 13x faster complex timeline rendering in Final Cut Pro compared to the previous-generation 13-inch MacBook Pro.
M1 Max also offers a higher-bandwidth on-chip fabric, and doubles the memory interface compared with M1 Pro for up to 400 GB/s, or nearly 6x the memory bandwidth of M1. This allows M1 Max to be configured with up to 64 GB of fast unified memory. With its unparalleled performance, M1 Max is the most powerful chip ever built for a pro notebook.

Fast, Efficient Media Engine, Now with ProRes
M1 Pro and M1 Max include an Apple-designed media engine that accelerates video processing while maximizing battery life. M1 Pro also includes dedicated acceleration for the ProRes professional video codec, allowing playback of multiple streams of high-quality 4K and 8K ProRes video while using very little power. M1 Max goes even further, delivering up to 2x faster video encoding than M1 Pro, and features two ProRes accelerators. With M1 Max, the new MacBook Pro can transcode ProRes video in Compressor up to a remarkable 10x faster compared with the previous-generation 16-inch MacBook Pro.
Advanced Technologies for a Complete Pro System
Both M1 Pro and M1 Max are loaded with advanced custom technologies that help push pro workflows to the next level:
  • A 16-core Neural Engine for on-device machine learning acceleration and improved camera performance.
  • A new display engine drives multiple external displays.
  • Additional integrated Thunderbolt 4 controllers provide even more I/O bandwidth.
  • Apple's custom image signal processor, along with the Neural Engine, uses computational video to enhance image quality for sharper video and more natural-looking skin tones on the built-in camera.
  • Best-in-class security, including Apple's latest Secure Enclave, hardware-verified secure boot, and runtime anti-exploitation technologies.

A Huge Step in the Transition to Apple Silicon
The Mac is now one year into its two-year transition to Apple silicon, and M1 Pro and M1 Max represent another huge step forward. These are the most powerful and capable chips Apple has ever created, and together with M1, they form a family of chips that lead the industry in performance, custom technologies, and power efficiency.
macOS and Apps Unleash the Capabilities of M1 Pro and M1 Max
macOS Monterey is engineered to unleash the power of M1 Pro and M1 Max, delivering breakthrough performance, phenomenal pro capabilities, and incredible battery life. By designing Monterey for Apple silicon, the Mac wakes instantly from sleep, and the entire system is fast and incredibly responsive. Developer technologies like Metal let apps take full advantage of the new chips, and optimizations in Core ML utilize the powerful Neural Engine so machine learning models can run even faster. Pro app workload data is used to help optimize how macOS assigns multi-threaded tasks to the CPU cores for maximum performance, and advanced power management features intelligently allocate tasks between the performance and efficiency cores for both incredible speed and battery life.

The combination of macOS with M1, M1 Pro, or M1 Max also delivers industry-leading security protections, including hardware-verified secure boot, runtime anti-exploitation technologies, and fast, in-line encryption for files. All of Apple's Mac apps are optimized for—and run natively on—Apple silicon, and there are over 10,000 Universal apps and plug-ins available. Existing Mac apps that have not yet been updated to Universal will run seamlessly with Apple's Rosetta 2 technology, and users can also run iPhone and iPad apps directly on the Mac, opening a huge new universe of possibilities.
Apple's Commitment to the Environment
Today, Apple is carbon neutral for global corporate operations, and by 2030, plans to have net-zero climate impact across the entire business, which includes manufacturing supply chains and all product life cycles. This also means that every chip Apple creates, from design to manufacturing, will be 100 percent carbon neutral.

156 Comments on Apple Introduces M1 Pro and M1 Max: the Most Powerful Chips Apple Has Ever Built

#101
Dredi
Vya Domus: IPC has never been a system level metric, it has always been processor specific. You can search for papers regarding measurements of IPC and you'll never come across a system level study of IPC because it doesn't make sense, the CPU is a constant, the system isn't. They always focus on isolating the characteristics of the CPU alone.
I'm pretty sure that the system is also constant enough. For Zen 3 it is JEDEC 3200 RAM, etc. For M1 Max it is whatever Apple decided to pair it with. Pretty simple.
Vya Domus: Plus, to say that it's a system level thing implies that everything should be measured together. What do we do if we want to measure FLOPS throughput? Do you count the GPU and all the various other accelerators in as well? After all, it's all on the same SoC, same system, right?
If the software you use to calculate IPC happens to use the GPU, things get pretty complicated, but I don't see why it should not be used. IPC is always application specific.
Vya Domus: It results in more instructions being executed per clock some of the time; the upper and lower bounds of IPC and its behavior remain exactly the same.
Upper bound isn't interesting, and neither is lower bound. Trying to write software to get the min and max of a given processor is a pointless exercise. I'm interested in IPC measured with software that is actually used by people to do something productive.

edit: and for max, you can just read the WikiChip page of a given processor and check how many instructions it can dispatch every cycle. Is that something that relates to actual application performance? No.
Posted on Reply
#102
TheoneandonlyMrK
Dredi: I'm pretty sure that the system is also constant enough. […] I'm interested in IPC measured with software that is actually used by people to do something productive.
I don't think I like your definition of IPC; thankfully we already have adequate ways to test.

IPC is chip specific, or rather core specific; everything else in a system is changeable.

And I do get your point, as do others; that's why reviews exist showing different application performance metrics.
#103
Valantar
Vya Domus: That is indeed an assumption. Irrespective of that, I still don't think their core architecture is that much better; virtually all of SPEC's tests are known to rely heavily on memory ops, favoring either huge caches or fast system memory.
It really isn't. There's no reason to expect an M1 in a Mac Mini to throttle under a single-core workload - it's neither thermally nor power constrained. And a 5950X under any type of reasonable cooling can maintain its max turbo (or even exceed it) in any ST workload indefinitely.
Richards: If the Pro Max matches the 3080 in workloads it will be impressive... Mark Gurman said the desktop M1 Max is a 128-core GPU, and that's 3090 desktop performance.
It won't; they don't even claim that. They claim ballpark mobile 3080 performance, but even in their vague and unlabeled graph they don't reach the same level. The Pro is compared to a mobile 3050 Ti, and the Max to the 3080 at 100 W and 160 W, beating the former and coming close to the latter.
Vya Domus: Sure, but if that's the case let's stop thinking that their chips are the greatest thing since sliced bread.
Nobody is saying that; we're just recognizing that they're pulling off some pretty astounding performance from a ~3 GHz ARM core, matching or beating the fastest x86 cores at a fraction of the power and clock speed.
Vya Domus: I don't think anyone buys $3000+ laptops for office work, or if they do they're incredibly unintelligent. What Apple knows is that people want to use some of the professional software available for Mac, not necessarily their hardware.
You'd be surprised. Not to mention the rich kids with $3000+ MBPs for school/studies etc. Though tbf, most of those have the air or at worst the 13" M1 MBP, which are both cheaper and less powerful. These will sell like hotcakes to photographers, videographers, animators, journalists, musicians, all kinds of creative professionals, and a whackton of image-obsessed rich people.
Vya Domus: L3 caches are sometimes too slow to provide meaningful speed ups; it's usually the L1 caches that get hammered.
Then why are AMD claiming a 15% IPC (in gaming workloads) bump from their stacked 3D cache? Given that those are among the most latency sensitive workloads, and a stacked, via-connected cache is a bit of a worst case scenario, that seems to contradict what you're saying here.
Vya Domus: Then it doesn't even make sense talking about IPC in that case, because any CPU will suddenly have higher IPC if it gets faster system memory or larger caches.
Only if it manages to keep latencies down while increasing cache sizes and memory speeds; that's part of why Rocket Lake has such weird performance characteristics.
Vya Domus: IPC has never been a system level metric, it has always been processor specific. You can search for papers regarding measurements of IPC and you'll never come across a system level study of IPC because it doesn't make sense, the CPU is a constant, the system isn't. They always focus on isolating the characteristics of the CPU alone.

Plus to say that it's a system level thing implies that everything should be measured together, what do we do if we want to measure FLOPS throughput ? Do you count the GPU and all the various other accelerators in as well ? After all it's all on the same SoC, same system, right ?
I think what they mean is that you can't measure IPC outside of the influence of the OS and its systems and workings - you need software to run and ways for that software to communicate with the CPU, after all. So in that way it is a system-level metric, as non-hardware changes can also affect it. The L3 latency bug in W11 would seem to noticeably lower Zen2/Zen3 IPC, for example.
Vya Domus: I'm not, I'm just saying that everyone can throw more cache or faster system memory at the problem and get good results. I am just not impressed by that. Apple has paired server-level system bandwidth with their SoC; yeah, it's going to be very fast compared to a lot of other CPUs. It'd be laughable if it wasn't.

On this note, I do find it really pathetic that PC manufacturers haven't moved to a new configuration that allows for wider interfaces. It's insane that we have to wait for years on end to move to a new DDR standard in order to get more bandwidth.
I agree that it's about time PC CPU manufacturers start breaking down some walls, but you don't quite seem to appreciate the scope of Apple's engineering on their recent SoCs and CPU core architectures. I'd highly recommend reading AnandTech's M1/A14 deep dive, as it goes in depth (with self-written feature tests which are excellent illustrations) on everything from L1 cache behaviour to numbering the various execution ports in the CPU and estimating an overall layout of the core.

In short, what it shows is that Apple is somehow managing L1 and L2 caches several times the size of the competition (6x the L1I size!) with lower latency, which is downright incredible, as conventional logic says that any cache size increase will increase latency too (which has been borne out over several generations of Intel and AMD CPUs, for example), while also having re-order buffers 2-3x the size of Intel's and AMD's, an 8-wide decoder (compared to 4-wide for both Intel and AMD), and 2-3x the execution ports, etc. Managing to design a CPU core this wide without significant performance or power penalties and managing to keep it fed is very impressive, and likely highly dependent on tightly integrated RAM as well as those massive caches, but that doesn't take away from the performance results. The main drawback of ultra-wide core designs is clock speeds, but Apple seems to be doing decently there as well, with >3GHz sustained and even 3GHz on the mobile A14.

Is this "the best CPU out there"? Not necessarily. That depends on your use case and software needs. But is it the most advanced architecture out there? Without a doubt. Do AMD and Intel have their work cut out for them to keep up, let alone catch up? Absolutely.

Me? I really hope this leads AMD to bet on more integrated APUs, and unified memory. I would love a balls-to-the-wall APU with heaps of LPDDR5 for my next laptop. 20-30CUs at low clocks? That would be amazing. It wouldn't be cheap, but it would be fantastic, as long as they can get unified memory working in Windows.
#104
Dredi
TheoneandonlyMrK: I don't think I like your definition of IPC; thankfully we already have adequate ways to test. […] And I do get your point, as do others; that's why reviews exist showing different application performance metrics.
It is not my definition of IPC, it is the definition of IPC. You simply divide the number of instructions executed by the number of clock cycles it took, and you have a result. You can't do that without running the DUT in a system, and it makes no sense to use any other system components than the fastest ones that the manufacturer suggests you use for the given DUT.
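As a toy sketch of that definition (the counter values below are made up; on real hardware they would come from performance counters, e.g. Linux `perf stat`):

```python
# Toy illustration of the IPC definition: instructions retired divided by
# core clock cycles elapsed. The counts are hypothetical; real numbers
# would come from hardware performance counters.

def ipc(instructions_retired: int, core_cycles: int) -> float:
    """Instructions per clock for one measured run of one workload."""
    return instructions_retired / core_cycles

# Same workload, same CPU, but a different memory subsystem: the
# instruction count is identical while stall cycles inflate the cycle count.
fast_ram = ipc(instructions_retired=8_000_000_000, core_cycles=4_000_000_000)
slow_ram = ipc(instructions_retired=8_000_000_000, core_cycles=5_000_000_000)

print(f"IPC with fast memory: {fast_ram:.2f}")  # 2.00
print(f"IPC with slow memory: {slow_ram:.2f}")  # 1.60
```

This is why the surrounding system matters to the measurement: the instruction count is a property of the program, but the cycle count depends on everything the core has to wait for.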
#105
Punkenjoy
The PC has the downsides of its advantages, and these Apple M1 chips have the advantages of their downsides.

PC parts need to be supported in various systems and need to be upgradable (like expanding memory). This is an advantage over the M1, but the downside is slower standard adoption and more latency, because the memory is standardized and sits further away from the CPU. They also have less flexibility in the memory design, since adding channels requires a new socket.

On the M1 side, the chips are designed for specific form factors. The memory isn't upgradable and is soldered on the motherboard close to the CPU. This design lets Apple scale the memory bus up and down and adopt new standards rapidly, since they don't have to deal with a standard form factor for upgrades. It also puts the memory very close for better latency and better energy efficiency. But if you want more memory because you didn't buy enough, you have to buy a new device. This is good for Apple, because people will tend to buy more than they need rather than face a costly upgrade later.


Apple is just pushing their advantages, since no one seems to care about the downsides of their platform. But if AMD and Intel did something similar, many PC enthusiasts wouldn't like it.

It still makes a lot of sense on a laptop, since much of the time it will never be upgraded. Also, Apple owns their entire stack: if they want to add an accelerator, they can expose it through an API in the OS and make their compiler use it whenever needed.

In reality, I think they are where they are supposed to be given their own performance. The point isn't that they outperform now; it's that they were held back for two decades by Intel chips. They are just where a company that owns its full stack should be right now.

And the fun thing is, you can buy one if you want, and you can buy a PC if you prefer. The PC isn't dead.
Dredi: It is not my definition of IPC, it is the definition of IPC. You simply divide the number of instructions executed by the number of clock cycles it took and have a result. You can't do that without running the DUT in a system, and it makes no sense to use any other system components than the fastest ones that the manufacturer suggests you use for the given DUT.
In its purest form, this is IPC.

But IPC is problematic across different instruction sets.
Say, in theory, you have a CISC instruction that loads a number, increments it by 1, then stores it to memory, and it does that across 3 cycles. On the other side, you have a RISC CPU that needs a load instruction, an increment instruction, and a store instruction to do the same amount of work, but each takes 1 cycle to run. This means the RISC CPU runs 3 times the number of instructions for the same amount of work. We could say it has 3x the IPC of the CISC CPU, but in the end nothing more was done.

This is why, in its purest form, IPC is only a good comparison within the same instruction set. And it's probably only really useful for comparing two CPUs from the same manufacturer, once you factor in the frequency they run at.

Also, the same processor can get a higher IPC at a lower frequency than at a higher one if it has to wait less for I/O or memory. Waiting 60 ns for data to arrive costs fewer cycles at 2 GHz than the same wait does at 5 GHz. This is why it's hard to extract IPC in its purest form from benchmarks.

What most people call IPC these days is mostly a somewhat standardized metric like the SPEC benchmarks. It's no longer the amount of instructions per clock but the amount of work per clock. And in the end, that is what really matters.

But we should say WPC or something similar instead of IPC.
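The two examples above can be put into rough numbers (a toy sketch using only the hypothetical figures from the post; none of these are measured values):

```python
# 1) Same work, different instruction counts (the CISC vs RISC example).
# One CISC instruction does load + increment + store in 3 cycles; the RISC
# CPU needs 3 single-cycle instructions for the same work.
cisc_ipc = 1 / 3   # 1 instruction  / 3 cycles
risc_ipc = 3 / 3   # 3 instructions / 3 cycles: 3x the IPC...
cisc_wpc = 1 / 3   # ...but 1 unit of work / 3 cycles either way,
risc_wpc = 1 / 3   # so work-per-clock is identical
assert risc_ipc == 3 * cisc_ipc and risc_wpc == cisc_wpc

# 2) A fixed memory wait costs more cycles at a higher clock, so the
# same core's measured IPC drops as frequency rises.
def cycles_lost(wait_ns: float, clock_ghz: float) -> float:
    """Core cycles spent stalled during one memory wait."""
    return wait_ns * clock_ghz

print(cycles_lost(60, 2.0))  # 120.0 cycles stalled at 2 GHz
print(cycles_lost(60, 5.0))  # 300.0 cycles stalled at 5 GHz
```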
#106
Dredi
Punkenjoy: In its purest form, this is IPC. But IPC is problematic across different instruction sets. […] What most people call IPC these days is mostly a somewhat standardized metric like the SPEC benchmarks. It's no longer the amount of instructions per clock but the amount of work per clock. […] But we should say WPC or something similar instead of IPC.
Yep, WPC is a much more meaningful metric, and it's also always system specific, as well as application specific in the same way.

IPC is a bit silly; for example, AVX-512 lowers IPC but improves WPC.
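A toy illustration of that AVX-512 point, with made-up throughput numbers (2 scalar adds per cycle vs. 1 sixteen-lane fp32 vector add per cycle is an assumption for the sake of the arithmetic, not a measured figure for any real core):

```python
# Adding 1,600 fp32 values: scalar code vs a hypothetical AVX-512 loop.
N = 1_600

scalar_instr  = N       # one add instruction per element
scalar_cycles = N / 2   # assume 2 scalar adds retired per cycle
vector_instr  = N / 16  # one 512-bit instruction covers 16 fp32 elements
vector_cycles = N / 16  # assume 1 vector add retired per cycle

scalar_ipc, scalar_wpc = scalar_instr / scalar_cycles, N / scalar_cycles
vector_ipc, vector_wpc = vector_instr / vector_cycles, N / vector_cycles

print(scalar_ipc, scalar_wpc)  # 2.0 instructions/clock,  2.0 elements/clock
print(vector_ipc, vector_wpc)  # 1.0 instructions/clock, 16.0 elements/clock
```

The vectorized loop retires fewer instructions per clock (lower IPC) while finishing eight times more useful work per clock (higher WPC), which is exactly why counting instructions alone misleads.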
#107
Vya Domus
Dredi: Upper bound isn't interesting, and neither is lower bound.
It's absolutely important to know the lower bound, because that tells you what the worst case scenario is.
Dredi: Trying to write software to get the min and max of a given processor is a pointless exercise.
Huh? You always aim to get the most out of a processor given whatever the time constraints are; I don't know what you are talking about.
Dredi: I'm interested in IPC measured with software that is actually used by people to do something productive.
That's such a bizarre thing to say. OK, you find out that a CPU can achieve X IPC in a certain application. What can you do with that information? Absolutely nothing; people measure IPC to generalize what the performance characteristics are. If you are only interested in one application, as you say, then IPC measurements are pointless; you're actually just interested in the performance of that application.
Valantar: You'd be surprised. Not to mention the rich kids with $3000+ MBPs for school/studies etc.
I guess? It's still stupid, and it doesn't tell you anything about how relevant the hardware is if you argue that people who didn't actually need one would buy it anyway.
Valantar: Then why are AMD claiming a 15% IPC (in gaming workloads) bump from their stacked 3D cache? Given that those are among the most latency sensitive workloads, and a stacked, via-connected cache is a bit of a worst case scenario, that seems to contradict what you're saying here.
I have no idea, but I fail to see why it contradicts anything I said. I just said that more L3 cache doesn't always translate to much improved performance. If your code and data mostly reside in L1 cache, then messing around with the L3 cache won't do anything. Obviously, real world workloads are a mixture of stuff that benefits more or less from different levels of caches, or from none of them at all.
Valantar: I agree that it's about time PC CPU manufacturers start breaking down some walls, but you don't quite seem to appreciate the scope of Apple's engineering on their recent SoCs and CPU core architectures. I'd highly recommend reading AnandTech's M1/A14 deep dive, as it goes in depth (with self-written feature tests which are excellent illustrations) on everything from L1 cache behaviour to numbering the various execution ports in the CPU and estimating an overall layout of the core.
I've read many of his articles, and while they're very good, I can't help but notice he has a particular affinity for everything Apple does.
Valantar: Managing to design a CPU core this wide without significant performance or power penalties and managing to keep it fed is very impressive
A wide core with huge caches and, most importantly, a very conservative clock speed. That's why I am not impressed; trust me, if their chips ran at much higher clocks, comparable to Intel's and AMD's, while retaining the same characteristics, then I'd be really impressed. But I know that's not possible. Ultimately, all they did is a simple clock for area/transistor budget tradeoff, because that way efficiency increases more than linearly. I just cannot give them much credit when they outperform Intel and AMD in some metrics while using, who knows, maybe several times more transistors per core.
#108
Aquinus
Resident Wat-man
Vya Domus: I'm not, I'm just saying that everyone can throw more cache or faster system memory at the problem and get good results. I am just not impressed by that. Apple has paired server-level system bandwidth with their SoC; yeah, it's going to be very fast compared to a lot of other CPUs. It'd be laughable if it wasn't.
You can't just throw more bandwidth at a problem and expect it to go faster. Take way back when I used an i7 3820: quad-channel DDR3-2133 gives impressive bandwidth numbers compared to a 2700K, but in reality the 3820 was only something like 5-8% faster at stock, and that difference wasn't the clock speeds, it was the extra 2 MB of L3 that the 3820 had over the regular Sandy Bridge i7 chips. So bandwidth alone doesn't make a chip faster; otherwise the 290(X)/390(X) should have been insanely fast, when in reality Nvidia was doing the same with half the bus width.

So to make a long story short, how the different levels of the memory hierarchy are built out really influences how it benefits the SoC as a whole. A huge LLC won't do you a whole lot of good if your L2 is absolutely tiny. So it's a bit more complicated than just throwing more of x, y, or z at a problem.
Vya Domus: On this note, I do find it really pathetic that PC manufacturers haven't moved to a new configuration that allows for wider interfaces. It's insane that we have to wait for years on end to move to a new DDR standard in order to get more bandwidth. This is really the only concrete area where Intel and AMD can't do jack shit, not by themselves anyway.
Wide memory interfaces for DRAM cost a lot of die space and power, and the traces for the memory chips make boards expensive to produce. It's not a good path forward for traditional DRAM. Now, I would agree with respect to HBM2 given its bandwidth and power characteristics, but it also comes with trade-offs in the sense that it's relatively expensive to produce. Apple is basically doing that with their DRAM, so they have the advantage of economies of scale.
#109
Dredi
Vya Domus: It's absolutely important to know the lower bounds because that tells you what the worst case scenario is.
The lower bound is something stupidly low on modern processors, but exceedingly hard to reach. You'd need to write a test that causes the most cache misses and is impossible to predict. Is that something that any proper SW ever experiences? No.

Zen 3 with JEDEC 3200 has a RAM latency of 80 ns or so. The worst possible IPC I can think of would require the most program instructions to be fetched from RAM. That calls for a huge 3D LUT lookup and a goto based on the result, so one RAM latency per two instructions, meaning an IPC of around 1/200. If the prediction logic can see through that, you'd need to add some stupid instruction to do an address conversion that cannot easily be predicted (some hash function, maybe, that is a single instruction in some extension), and you end up with an IPC of around 1/133.

edit: a cleaner solution would be to just write a simple routine that reads a byte at addr, then writes some hash of that byte to addr (the processor must have some hash extension, so that it is simply one instruction) and loops. That would produce an IPC of around 1/100.
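The back-of-the-envelope figures above can be reproduced with simple arithmetic (a sketch; the 5 GHz clock is an assumption implied by the quoted numbers, and the 80 ns latency is the figure from the post):

```python
# Worst-case IPC estimate: every instruction pair waits out a full
# DRAM round trip. Assumes a 5 GHz core clock (not stated in the post)
# and the quoted ~80 ns latency for Zen 3 with JEDEC DDR4-3200.
RAM_LATENCY_NS = 80
CLOCK_GHZ = 5.0

stall_cycles = RAM_LATENCY_NS * CLOCK_GHZ  # 400 cycles per miss

# Two instructions (load + dependent goto) per full-latency miss:
ipc_two = 2 / stall_cycles    # 1/200
# Three instructions (adding an unpredictable address-hash step):
ipc_three = 3 / stall_cycles  # ~1/133

print(f"1/{1 / ipc_two:.0f}")    # 1/200
print(f"1/{1 / ipc_three:.0f}")  # 1/133
```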
Vya Domus: Huh? You always aim to get the most out of a processor given whatever the time constraints are; I don't know what you are talking about.
I was talking about the min and max IPC. Trying to measure them is pointless.
#110
Vya Domus
Dredi: I was talking about the min and max IPC. Trying to measure them is pointless.
It's not like you sit in front of a PC for hours on end every time you want to write something to measure IPC. Plus, you're the one who says you want to measure IPC for every single application you're interested in, and you didn't explain why that isn't pointless as well.
Dredi: The lower bound is something stupidly low on modern processors, but exceedingly hard to achieve.
Not really; all it takes is one or two unfortunate instructions in a loop to have major performance implications. It happens all the time and you never know about it; you just hit compile and assume that's just how it is. It can sometimes be as simple as choosing a 32-bit variable over a 64-bit one, which leads to some weird instructions under the hood that may run abnormally slowly on some processors.
#111
Dredi
Vya Domus: It's not like you sit in front of a PC for hours on end every time you want to write something to measure IPC. […] Not really, all it takes is one or two unfortunate instructions in a loop to have major performance implications.
Nah, that is nowhere close to the lower bound. I edited a much, much worse scenario into my post above.
Posted on Reply
#112
Valantar
Vya DomusI guess ? It's still stupid and that doesn't tell you anything about how relevant the hardware is if you argue that people who didn't actually need one would buy it anyway.
I wasn't talking about how relevant the hardware was, I was responding to you stating that you don't think anyone buys $3000+ laptops for "office work", and your arguments against Apple knowing their audience. It's pretty clear that they do (in part because they've been pissing off their core creative audience for years, and are now finally delivering something they would want).
Vya DomusI have no idea but I fail to see why it contradicts anything that I said. I just said that more L3 cache doesn't always translate to much improved performance. If your code and data mostly reside in L1 cache then messing around with the L3 cache won't do anything. Obviously real world workloads are a mixture of stuff that benefits more or less from different levels of caches or from none of them at all.
Well, that's why they also have massive L1 and L2 caches (and actually no CPU L3 at all, but a system-level LLC). But it contradicts what you said because you said (without any reservations) that "L3 caches are sometimes too slow to provide meaningful speed ups", which ... well, if they didn't meaningfully do so in real-world workloads it would be pretty odd for AMD to invest millions into stacked L3 cache tech, IMO.
Vya DomusI read many of his articles and while they're very good I can't help but notice he has a particular affinity for everything Apple does.
...and? Is appreciating high-end engineering wrong? I haven't seen a single article that comes even close to the level of depth and quality of analysis of these articles. And nothing to contradict anything said either.
Vya DomusA wide core with huge caches and, most importantly, a very conservative clock speed. That's why I am not impressed; trust me, if their chips ran at much higher clocks, comparable to Intel's and AMD's chips, while retaining the same characteristics, then I'd be really impressed. But I know that's not possible; ultimately all they did is a simple clock-for-area/transistor-budget tradeoff, because that way efficiency increases more than linearly.
You could say that, but only if you ignore the latencies and how they're keeping the cores fed. As I said, increasing cache size should balloon latency, yet theirs is lower than the competition despite 3-6x larger caches. And with that wide a core, you're really starting to push the boundaries of what can be effectively fed with conventional software - yet they're pulling it off. It would also be expected that this much larger die, even at lower clock speeds, would be rather power hungry for what it does - yet it isn't. This is no doubt largely down to granular power gating and the large caches saving them a lot of data shuffling (especially into/out of RAM), but that isn't the whole story.
Vya DomusI just cannot give them much credit when they outperform Intel and AMD in some metrics while using who knows, maybe several times more transistors per core ?
Their cores are absolutely massive, that is absolutely true. But so what? They're still managing to use them in smartphones(!) and thin-and-light laptops. This mainly demonstrates that Apple is less margin conscious on this level than AMD and Intel - which is very understandable. That clearly makes this core less suited for budget devices. But less impressive? Nah. A 5950X is a $750 CPU. If Apple sold these at retail they'd no doubt be more than that, but we're not comparing to budget devices, we're comparing to the best they're putting out.

But given that Intel's latest L1 cache size increase (32 to 48K, IIRC) came with a 1-cycle latency penalty, I can't quite see how they (or AMD) would suddenly pull a 3-6x increase in cache sizes out of their sleeves without also dramatically increasing latencies, which raises the question of whether others would even be able to make a similarly huge, wide, and cache-rich core without it being hobbled by slow cache accesses and thus not being fed. That seems to be the case, as we would otherwise most likely see much wider designs for servers and other markets where costs don't matter.
Vya DomusIt's not like you sit in front of a PC for hours on end every time you want to write something to measure IPC. Plus, you're the one that says you want to measure IPC for every single application you're interested in and you didn't explain why that isn't pointless as well.
No, that's why we have industry-standard benchmarks based on real-world workloads. It's obvious that no such thing will ever be perfect, but it is a reasonable approximation of performance across a wide range of real-world usage scenarios.
Posted on Reply
#113
TheoneandonlyMrK
DrediIt is not my definition of IPC, it is the definition of IPC. You simply divide the number of instruction executed by the number of clock cycles it took and have a result. You can’t do that without running the DUT in a system, and it makes no sense to use any other system components than the fastest ones that the manufacturer suggest that you use for the given DUT.
Crack on, show a link then; no one calls it a system measurement. IPC is not a system measurement.
It's a core microarchitecture measurement, and is only indicative of performance, not demonstrative of all performance.
Posted on Reply
#114
Vya Domus
DrediNah, that is nowhere close to being at the lower bound.
But it's exceedingly easy to stumble across; it doesn't have to be silly.
ValantarBut so what?
It's ridiculously inefficient and it's not scalable. It's funny, they used to make fun of other smartphone manufacturers because "everyone can make something bigger". Now they do the same with their chips.
ValantarWell, that's why they also have massive L1 and L2 caches (and actually no CPU L3 at all, but a system-level LLC). But it contradicts what you said because you said (without any reservations) that "L3 caches are sometimes too slow to provide meaningful speed ups", which ... well, if they didn't meaningfully do so in real-world workloads it would be pretty odd for AMD to invest millions into stacked L3 cache tech, IMO.
I don't get it, what's so hard to understand about the word "sometimes" ?
ValantarBut given that Intel's latest L1 cache size increase (32 to 48K, IIRC) came with a 1-cycle latency penalty, I can't quite see how they (or AMD) would suddenly pull a 3-6x increase in cache sizes out of their sleeves without also dramatically increasing latencies, which raises the question of whether others would even be able to make a similarly huge, wide, and cache-rich core without it being hobbled by slow cache accesses and thus not being fed. That seems to be the case, as we would otherwise most likely see much wider designs for servers and other markets where costs don't matter.
It all has to do with TLB and the size of pages that it uses to map the memory space. Basically Apple is using larger pages than on x86 platforms so the search space for the same amount of memory is smaller, hence lower latencies.
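Back-of-the-envelope illustration of that point (128 entries is a made-up TLB size, not any specific chip's): with a fixed number of TLB entries, 16 KB pages quadruple how much memory the TLB covers compared to 4 KB pages, so fewer accesses miss and fall back to a page walk.

```c
#include <stdint.h>

// TLB "reach" = entries * page size: the span of address space the TLB
// can translate without triggering a page table walk.
uint64_t tlb_reach_bytes(uint64_t entries, uint64_t page_bytes) {
    return entries * page_bytes;
}
// 128 entries * 4 KiB pages  = 512 KiB of reach (typical x86 base pages)
// 128 entries * 16 KiB pages = 2 MiB of reach (Apple silicon base pages)
```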
Posted on Reply
#115
Dredi
Vya DomusBut it's exceedingly easy to stumble across, it doesn't have to be silly.
But exceedingly far from the actual lower bound IPC, which you said was somehow important to know.
TheoneandonlyMrKCrack on, show a link then; no one calls it a system measurement. IPC is not a system measurement.
It's a core microarchitecture measurement, and is only indicative of performance, not demonstrative of all performance.
Of course it’s indicative of only the performance measured. I’d never generalize IPC measurements to overall performance.

Just read from here: en.m.wikipedia.org/wiki/Instructions_per_cycle

”The number of instructions executed per clock is not a constant for a given processor; it depends on how the particular software being run interacts with the processor, and indeed the entire machine, particularly the memory hierarchy.”
Posted on Reply
#116
Vya Domus
DrediBut exceedingly far from the actual lower bound IPC, which you said was somehow important to know.
How is it "exceedingly" far ? Even misaligned SSE/AVX loads or the odd floating point division here and there can destroy the IPC; you don't even have to mess around with the code to find out how you can create the most cache misses or mispredictions. Plus you're not even testing the IPC at that point, you're just writing terrible code in general.
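For instance, a chain of dependent floating point divides like this C sketch (a toy example of mine, not a benchmark) serializes on the divider, whose latency is on the order of 13-20 cycles on recent x86 cores, so IPC collapses even though the source looks harmless:

```c
// Each divide consumes the previous iteration's result, so the
// out-of-order core has nothing independent to run: throughput
// degenerates to one long-latency divide per loop iteration.
double dependent_divs(double x, int n) {
    for (int i = 0; i < n; i++)
        x = 1.0 / (x + 1.0);  // depends on the previous iteration
    return x;
}
```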
Posted on Reply
#117
Valantar
Vya DomusIt's ridiculously inefficient and it's not scalable. It's funny, they used to make fun of other smartphone manufacturers because "everyone can make something bigger". Now they do the same with their chips.
Well, depends what type of efficiency you're looking for. It's not efficient per area or transistor, but per unit of power or clock speed? Massively so.
Vya DomusI don't get it, what's so hard to understand about the word "sometimes" ?
Apparently just as hard as it is to understand that your "sometimes" isn't particularly applicable here, nor in other relevant comparisons. That doesn't mean it's untrue; it just means it's not particularly relevant as an objection.
Vya DomusIt all has to do with TLB and the size of pages that it uses to map the memory space. Basically Apple is using larger pages than on x86 platforms so the search space for the same amount of memory is smaller, hence lower latencies.
Again: if it was that simple, why isn't everyone doing that? Given how many server chips Intel sells, if they could make a huge core like this for servers and deliver 50% higher IPC and iso performance at half the power per core, they would do so, regardless of the area needed. You could always blame server vendors for not wanting to adopt such a system, but frankly I don't think that would be a problem. Google and Facebook would gobble them up for R&D purposes if nothing else, and they wouldn't care if the CPUs were $10,000 apiece. (Also, they use 16KB TLB pages, but they are also compatible with 4KB pages.)

And that's the key here: the individual parts of what Apple is doing here might not be that impressive, but that they're managing to make all of this into a functional, well balanced and highly performant and efficient core? That is impressive. Very much so. In the end, what matters at the user end is performance and power consumption, which are always in tension, especially in mobile and SFF use cases. The M1 (and upcoming siblings) manages to shift to an entirely different level in this balance, most likely delivering 5800X-level performance (if not higher) at half the power or less (a 5800X is ~140W under full boost and an all-core load after all, these are 50-60W chips), while also containing either a mid-range or high-end dGPU-level iGPU. That is obviously impressive. Will it come with tradeoffs? Of course it will. Concurrent CPU and GPU loads will be power and/or thermally limited, as always, and they do spend an almost silly amount of silicon per chip. But does that matter when the laptop is comparably priced to competitors? No. And sure, you can no doubt find a comparable laptop for less. But a 5980HX+3080/Quadro RTX workstation isn't going to cost you any less than an M1 Max MBP, and both that and the cheaper consumer-focused version are going to be much bigger and heavier, and have terrible battery life. Making a product is, when it comes down to it, about the full package. These chips clearly have downsides, but they are downsides that are largely immaterial in the context of the overall package. And that's what makes them impressive.
Posted on Reply
#118
Vya Domus
ValantarAgain: if it was that simple, why isn't everyone doing that?
I guess it would need a lot of changes on many levels, I am not sure; I am not well versed enough in these details to be able to tell. The point is there is nothing that incomprehensible about how they achieved these things.
ValantarIt's not efficient per area or transistor, but per unit of power or clock speed? Massively so.
But what's the point if, say, you achieve X times the efficiency using more than X times the area ? The disadvantages will eventually outpace the advantages, and you'll be stuck with a design that's hard to change.
ValantarAnd that's the key here: the individual parts of what Apple is doing here might not be that impressive, but that they're managing to make all of this into a functional, well balanced and highly performant and efficient core? That is impressive. Very much so.
It's not that the end product isn't impressive, it's how they got there that isn't.

A 400+ mm^2 SoC on the newest node with 400GB/s bandwidth that's really fast ? Wow... I guess.
Posted on Reply
#119
Dredi
Vya DomusHow is it "exceedingly" far ? Even misaligned SSE/AVX loads or the odd floating point division here and there can destroy the IPC; you don't even have to mess around with the code to find out how you can create the most cache misses or mispredictions. Plus you're not even testing the IPC at that point, you're just writing terrible code in general.
But IPC is always application specific!!!
Terrible IPC always requires terrible code to go with it.

My example got to around one instruction per 100 clock cycles, and is likely the worst you can get to without disabling processor features. What is the IPC in your misaligned AVX loads?
Posted on Reply
#120
Vya Domus
DrediBut IPC is always application specific!!!
Terrible IPC always requires terrible code to go with it.
Then what's the point of measuring it ? What do you do with that information ?
Posted on Reply
#121
r9
Personally, when I heard of the M1 I thought Apple was going to make a laptop-shaped phone that would be cheap to make but sold at the same price, with poor software support. But after it got released I watched a few YouTube videos, and all I know is that people were very happy with it, especially with things like rendering and battery life, and having the same performance on battery as on mains power, something neither AMD nor Intel can do with x86. The video rendering discussion can be closed, as Intel/AMD won't be able to come even close to the Pro/Max. Also, OSX support is light-years ahead of Windows for ARM. ARM can be the future if Microsoft can make something as efficient as Rosetta. IMO 95% of people use only 10% of the instruction set, so why not reap all the benefits of a RISC chip for the majority of people? The other 5% can always go with Intel/AMD. It makes much more sense for ARM to be the mainstream option, not the other way around. The problem is the only way we get a proper PC/Windows ARM platform is for AMD and Intel to enter that market. And it will all depend on what Apple does with it, but Apple being Apple, they create their own markets and sell expensive laptops to only a very small portion of the global laptop market, so it won't be like AMD or Intel will ever be in a position where they have no choice but to switch to ARM.
Posted on Reply
#122
Dredi
Vya DomusThen what's the point of measuring it ? What do you do with that information ?
You are the one who insisted that knowing the lower bound of the IPC was important!!

I have never stated that knowing it is of any importance.


IPC in itself (or rather WPC) is a great way to compare different systems analytically. I.e. to understand differences in generic application performance and where the differences might come from.
Posted on Reply
#123
R0H1T
Now this would be interesting :nutkick:
For those who think the M1 Pro and M1 Max in the MacBook Pro are impressive, the new Mac Pro desktop is expected to come in at least two variations: 2X and 4X the number of CPU and GPU cores as the M1 Max. That’s up to 40 CPU cores and 128 GPU cores on the high-end.
Posted on Reply
#124
Vya Domus
DrediYou are the one who insisted that knowing the lower bound of the IPC was important!!
Nah, don't spin this around: you first claimed that IPC is a system-specific metric, which it isn't, it's a property of the processor. Then you went on about how you're only interested in finding out what the IPC is in applications that people use, at which point I asked what you do with that information, and you never gave a proper answer.
Dredito understand differences in generic application performance and where the differences might come from.
That makes no sense; IPC is simply used to estimate generic performance. You can't infer what the differences between applications might be from that, and that's the point.
Posted on Reply
#125
Dredi
Vya DomusIPC is simply used to estimate generic performance
How?? There is no ”generic software” to use in measuring IPC.
Vya Domusyou first claimed that IPC is a system specific metric
It definitely is. Try measuring IPC on your PC without a motherboard, or main memory. I’m not holding my breath.
Posted on Reply