
Apple Introduces M1 Pro and M1 Max: the Most Powerful Chips Apple Has Ever Built

IPC (instructions per clock) is a hypothetical unbending max of the hardware...

That value is 6+ for Intel/AMD, depending on some details of the uop caches I don't quite remember. If the code falls out of the uop cache, it drops to 4 (the fastest the decoders can operate out of the L1 instruction cache). Furthermore, Zen, Zen2, and Zen3 all have the same value.

Which is nonsense. We can clearly see that "average IPC" in video game applications goes up from Zen -> Zen2 -> Zen3. IPC is... this vague, wishy-washy term that people use to sound technical but that is really extremely poorly defined. We want to calculate IPC so that we can multiply it by clock speed and come up with an idea of which processor is faster. But it turns out that reality isn't very kind to us, and that these CPUs are horribly, horribly complicated beasts.

No practical computer program sits in the uop cache of Skylake/Zen and reaches 6 IPC. None. Maybe micro-benchmarks like SuperPi get close, but that's the kind of program it takes to get anywhere near the max IPC on today's systems. Very, very few programs are written like SuperPi / HyperPi.
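To put rough numbers on it, here's a minimal sketch of what "measured IPC" actually is - retired instructions divided by core cycles. All the counts below are made-up placeholders; on Linux you'd pull the real ones from `perf stat`.

```c
#include <stdio.h>

int main(void) {
    /* placeholder counters - in practice, read them from `perf stat` output */
    double instructions = 8.4e9;   /* retired instructions (hypothetical)          */
    double cycles       = 6.0e9;   /* core cycles over the same run (hypothetical) */
    double clock_ghz    = 4.5;     /* sustained core clock (hypothetical)          */
    double issue_width  = 6.0;     /* theoretical uops/cycle ceiling claimed above */

    double ipc = instructions / cycles;
    printf("measured IPC     : %.2f\n", ipc);
    printf("fraction of peak : %.0f%%\n", 100.0 * ipc / issue_width);
    /* what actually matters is IPC x clock, i.e. instructions per second */
    printf("work rate        : %.1f billion instr/s\n", ipc * clock_ghz);
    return 0;
}
```

Even a healthy-looking 1.4 IPC is barely a quarter of the theoretical 6-wide ceiling, which is the whole point.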
 
I agree, IPC should not be used in discussions of modern processors the way people tend to use it, but there are benchmark programs out there that can measure average IPC with a modicum of logic to the end result, like Pi calculators.
That's comparable only to another chip on the same ISA, though.
It's not something that translates into a useful performance metric for a chip or core anymore.
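Here's a toy illustration (all numbers invented) of why IPC only makes sense within one ISA: the same task compiles to a different number of instructions on a different ISA, so the chip with the higher IPC can still be the slower one.

```c
#include <stdio.h>

struct chip {
    const char *label;
    double instructions;   /* instructions the compiler emits for the same task */
    double ipc;            /* average instructions retired per cycle            */
    double clock_ghz;
};

int main(void) {
    /* invented numbers: B retires more instructions per cycle, but the same
     * program also compiles to more instructions on its ISA */
    struct chip chips[2] = {
        { "ISA A (denser code)", 1.0e9, 2.5, 4.0 },
        { "ISA B (more instrs)", 1.6e9, 3.5, 3.2 },
    };
    for (int i = 0; i < 2; i++) {
        double seconds = chips[i].instructions /
                         (chips[i].ipc * chips[i].clock_ghz * 1e9);
        printf("%s  IPC %.1f -> %.3f s for the same task\n",
               chips[i].label, chips[i].ipc, seconds);
    }
    return 0;
}
```

ISA B "wins" on IPC (3.5 vs 2.5) yet finishes the task later, which is why cross-ISA IPC comparisons tell you nothing on their own.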
 
It's marketing bullshit.

1) You can't compare TFLOPS directly across different architectures. It's nonsensical (see the sketch after this list).
3090 - 35.6 TFLOPS, 6900 XT - 20.6 TFLOPS. Roughly a 40% difference on paper, yet the real performance difference is 3-10% depending on the game/task.

2) They didn't even show what exactly was tested. Show us some games running with the same settings and the same resolution! Oh, wait, there are no games. /sarcasm

3) Forget about AAA games on M1. Metal API + ARM => no proper gaming. The Metal API essentially killed Mac gaming even on x86 (Boot Camp excluded).
Going the Metal route instead of Vulkan was a huge mistake.
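On point 1, here's a minimal sketch of where those paper numbers come from: peak FP32 TFLOPS is just shader ALUs x 2 FLOPs per FMA x clock. The specs below are the commonly quoted ones; treat them as approximate.

```c
#include <stdio.h>

/* peak FP32 throughput: shader ALUs x 2 FLOPs per FMA x clock (GHz) -> TFLOPS */
static double peak_tflops(int shaders, double clock_ghz) {
    return shaders * 2.0 * clock_ghz / 1000.0;
}

int main(void) {
    double rtx3090  = peak_tflops(10496, 1.70);   /* boost clock, commonly quoted */
    double rx6900xt = peak_tflops(5120, 2.015);   /* game clock, commonly quoted  */
    printf("RTX 3090   : %.1f TFLOPS\n", rtx3090);
    printf("RX 6900 XT : %.1f TFLOPS\n", rx6900xt);
    printf("paper gap  : %.0f%%\n", 100.0 * (rtx3090 - rx6900xt) / rtx3090);
    /* ...yet reviews put the two within a few percent of each other in most
     * games, because utilization, memory subsystem and drivers dominate. */
    return 0;
}
```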

I have no doubt the M1 Max will make for a great laptop for video editing and the like... but if you're thinking about getting it in hopes of running proper non-mobile games with good graphics settings, resolution and performance, then think twice...
 
I agree, IPC should not be used in discussions of modern processors the way people tend to use it, but there are benchmark programs out there that can measure average IPC with a modicum of logic to the end result, like Pi calculators.
That's comparable only to another chip on the same ISA, though.
It's not something that translates into a useful performance metric for a chip or core anymore.
Consider it another way. The maximum theoretical throughput of a given CPU is (typically) known. What benchmarks help us determine (indirectly) is how much time is spent with the pipeline stalled or, worse, recovering from a branch misprediction. Those characteristics are what make CPUs different; otherwise the number of threads, cores, and clock speed would tell you what you want to know. What benchmarks tell us is, for whatever speed we're running at with whatever resources, how much we actually get done.

Now, what in the world does this have to do with cache? Well, if the majority of pipeline stalls are due to memory access (and not branch misprediction), then larger caches are likely to limit those stalls, which means the CPU spends more time doing work and less time waiting on memory. This gets worse as the working set of data grows, which depends entirely on the application.
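To make that concrete, here's a minimal pointer-chasing sketch (assuming a POSIX system with clock_gettime; compile with something like `cc -O2 chase.c`). It walks a randomly shuffled chain of dependent loads; once the working set outgrows the caches, every hop turns into a memory stall and the time per load jumps.

```c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

static unsigned long long rng_state = 0x9E3779B97F4A7C15ull;
static size_t rng(void) {                 /* xorshift64: cheap index generator */
    rng_state ^= rng_state << 13;
    rng_state ^= rng_state >> 7;
    rng_state ^= rng_state << 17;
    return (size_t)rng_state;
}

static double now_sec(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

int main(void) {
    /* working sets from 32 KiB (cache-resident) up to 128 MiB (DRAM) */
    for (size_t kib = 32; kib <= 128 * 1024; kib *= 4) {
        size_t n = kib * 1024 / sizeof(size_t);
        size_t *next = malloc(n * sizeof(size_t));
        for (size_t i = 0; i < n; i++) next[i] = i;
        /* Sattolo shuffle: one big cycle, so hardware prefetchers can't help */
        for (size_t i = n - 1; i > 0; i--) {
            size_t j = rng() % i;
            size_t t = next[i]; next[i] = next[j]; next[j] = t;
        }
        size_t hops = 20 * 1000 * 1000, p = 0;
        double t0 = now_sec();
        for (size_t i = 0; i < hops; i++) p = next[p];  /* dependent loads: each waits on the previous */
        double ns = (now_sec() - t0) * 1e9 / (double)hops;
        printf("%7zu KiB working set: %5.1f ns per load (p=%zu)\n", kib, ns, p);
        free(next);
    }
    return 0;
}
```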

After reading the last several pages, I feel like people are basically saying, "It's all handwavy and vague, so trust none of it," which is amusing in a sad sort of way.
 
That's kind of how I see the term as well. Sure, many people use it as a generic term for "performance per clock", and even then it's reasonably accurate. (Please don't get me started on the people who think "IPC" means "performance".) Theoretical maximum IPC isn't relevant in real-world use cases, as processors, systems and applications are far too complex for this to be a 1:1 relation, so the point of calculating "IPC" for comparison is to see how well the system as a whole is able to run a specific application or collection of applications while accounting for clock speed. Is it strictly a measure of instructions? Obviously not - CPU-level instructions aren't generally visible to users, after all, and if you're using real-world applications for testing you can't really know the instruction counts unless you have access to the source code. So it's all speculative to some degree. Which is perfectly fine.

This means that the only practically applicable and useful way of defining IPC in real-world use is clock-normalized performance in known application tests. These must of course be reasonably well written, and should ideally be representative of overall usage of the system. That last point is where it gets iffy, as it is extremely variable, and it's why, for example, SPEC is a poor representation of gaming workloads - the tests are just too different. Does that make SPEC testing any less useful? Not whatsoever. It just means you need a modicum of knowledge of how to interpret the results. Which is always the case anyhow.

This also obviously means there's an argument for having more (collections of) tests, as no benchmark collection will ever be wholly representative, and any single score/average/geomean calculated from a collection of tests will never be so either. But again, this is obvious, and it is not a problem. Measured IPC should always be understood as having "averaged across a selection of tested applications" tacked on at the end. And that's perfectly fine. This is neither hand-wavy, vague, nor problematic, but a necessary consequence of PCs and their uses being complex.
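For what it's worth, here's a minimal sketch (scores and clocks invented; compile with -lm for pow) of that "clock-normalized, averaged across applications" definition: divide each benchmark score by clock, take the per-test ratio between the two chips, and geometric-mean the ratios.

```c
#include <stdio.h>
#include <math.h>

#define NTESTS 3

int main(void) {
    /* invented per-application scores (higher is better) and sustained clocks */
    double cpu_a[NTESTS] = { 142.0, 88.0, 310.0 };  double clk_a = 4.9; /* GHz */
    double cpu_b[NTESTS] = { 130.0, 95.0, 285.0 };  double clk_b = 4.0; /* GHz */

    double product = 1.0;
    for (int i = 0; i < NTESTS; i++) {
        /* per-test ratio of performance-per-clock, B relative to A */
        product *= (cpu_b[i] / clk_b) / (cpu_a[i] / clk_a);
    }
    double ipc_ratio = pow(product, 1.0 / NTESTS);  /* geometric mean */

    printf("B vs A, clock-normalized (\"IPC\"): %+.1f%%\n", (ipc_ratio - 1.0) * 100.0);
    /* change the application selection and this "IPC uplift" changes with it */
    return 0;
}
```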
 
ReBAR doesn't have anything to do with this - it allows the CPU to write to the entire VRAM rather than in smaller chunks, but the CPU still can't work off of VRAM - data needs to be copied to system RAM for the CPU to work on it. You're right that shared memory has its downsides, but with many times the memory bandwidth of any x86 CPU (and equal to many dGPUs), I doubt that will be a problem, especially considering Apple's penchant for massive caches.
1. It comes down to a hardware configuration that is optimal for the use case.

PC's DirectX 12 has options to minimize copying of the game's resources.


[Image: PC DirectX12 Memory Layout]


2. On the subject of massive caches, ARM's code density is inferior to x86 and x86-64.
 
1. It comes down to a hardware configuration that is optimal for the use case.
Well, yes. That is kind of obvious, no? "The hardware best suited to the task is best suited to the task" is hardly a revelation.
PC's DirectX 12 has options to minimize copying of the game's resources.


[Image: PC DirectX12 Memory Layout]
That presentation has nothing to do with unified memory architectures, so I don't see why you bring it up. All it does is present the advantages in Frostbite of an aliasing memory layout, reducing the memory footprint of transient data. While this no doubt reduces copying, it has no bearing on whether or not these are unified memory layouts.
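For anyone unfamiliar with the technique in that slide, here's a minimal plain-C sketch of the aliasing idea (names invented, standing in for what a D3D12 placed-resource setup does): two transient buffers whose lifetimes never overlap share one backing allocation, so the peak footprint shrinks.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define TRANSIENT_SIZE (8u << 20)   /* 8 MiB per transient resource (made up) */

int main(void) {
    unsigned char *heap = malloc(TRANSIENT_SIZE);   /* one shared backing allocation */

    /* pass 1: a temporary render target lives at offset 0 */
    unsigned char *scratch_pass1 = heap;
    memset(scratch_pass1, 0xAA, TRANSIENT_SIZE);

    /* pass 2: runs strictly after pass 1, so its scratch aliases the same bytes */
    unsigned char *scratch_pass2 = heap;            /* same offset - aliased */
    memset(scratch_pass2, 0xBB, TRANSIENT_SIZE);

    printf("peak footprint: %u MiB aliased vs %u MiB without aliasing\n",
           TRANSIENT_SIZE >> 20, (2 * TRANSIENT_SIZE) >> 20);
    free(heap);
    return 0;
}
```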
2. On the subject of massive caches, ARM's code density is inferior to x86 and x86-64.
And? Are you saying it's sufficiently inferior to make up for a 3-6x cache size disadvantage? 'Cause Anandtech's benchmarks show otherwise.
 