
Apple Introduces M1 Pro and M1 Max: the Most Powerful Chips Apple Has Ever Built

Joined
Apr 24, 2020
Messages
2,560 (1.75/day)
IPC (instructions per clock) is a hypothetical, unbending max of the hardware...

That value is 6+ for Intel/AMD, depending on some details of the uop caches I don't quite remember. If the code doesn't fit in the uop cache, it drops to 4 (the fastest the decoders can operate out of the L1 instruction cache). Furthermore, Zen, Zen2, and Zen3 all have the same value.

Which is nonsense. We can clearly see that "average IPC" in video game workloads goes up from Zen -> Zen2 -> Zen3. IPC is... this vague, wishy-washy term that people use to sound technical but which is really extremely poorly defined. We want to calculate IPC so that we can multiply it by GHz and come up with an idea of which processor is faster. But it turns out that reality isn't very kind to us, and that these CPUs are horribly, horribly complicated beasts.

No practical computer program sits in the uop cache of Skylake/Zen and reaches 6 IPC. None. Maybe micro-benchmark programs like SuperPi get close, but that's the kind of program you'd need to get anywhere close to the max-IPC on today's systems. Very, very few programs are written like SuperPi / HyperPi.
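To make that concrete, here's a rough C++ sketch (my own illustration, not from any real benchmark): two loops that retire a similar number of adds, but only the one with independent work can get anywhere near the core's issue width. Compile without auto-vectorization (e.g. -O1) for the effect to show up cleanly.

```cpp
#include <cstdint>
#include <vector>

// Loop A: four independent accumulators. The out-of-order core can issue several
// adds per cycle, so measured IPC climbs toward the machine's width.
uint64_t sum_independent(const std::vector<uint64_t>& v) {
    uint64_t s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    for (std::size_t i = 0; i + 4 <= v.size(); i += 4) {
        s0 += v[i];
        s1 += v[i + 1];
        s2 += v[i + 2];
        s3 += v[i + 3];
    }
    return s0 + s1 + s2 + s3;
}

// Loop B: one serial dependency chain. Every add waits for the previous one, so the
// adds retire at roughly one per cycle no matter how wide the core is.
uint64_t sum_dependent(const std::vector<uint64_t>& v) {
    uint64_t s = 0;
    for (std::size_t i = 0; i < v.size(); ++i) {
        s += v[i];
    }
    return s;
}
```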
 
Joined
Mar 10, 2010
Messages
11,878 (2.30/day)
Location
Manchester uk
System Name RyzenGtEvo/ Asus strix scar II
Processor Amd R5 5900X/ Intel 8750H
Motherboard Crosshair hero8 impact/Asus
Cooling 360EK extreme rad+ 360$EK slim all push, cpu ek suprim Gpu full cover all EK
Memory Corsair Vengeance Rgb pro 3600cas14 16Gb in four sticks./16Gb/16GB
Video Card(s) Powercolour RX7900XT Reference/Rtx 2060
Storage Silicon power 2TB nvme/8Tb external/1Tb samsung Evo nvme 2Tb sata ssd/1Tb nvme
Display(s) Samsung UAE28"850R 4k freesync.dell shiter
Case Lianli 011 dynamic/strix scar2
Audio Device(s) Xfi creative 7.1 on board ,Yamaha dts av setup, corsair void pro headset
Power Supply corsair 1200Hxi/Asus stock
Mouse Roccat Kova/ Logitech G wireless
Keyboard Roccat Aimo 120
VR HMD Oculus rift
Software Win 10 Pro
Benchmark Scores 8726 vega 3dmark timespy/ laptop Timespy 6506
That value is 6+ for Intel/AMD, depending on some details of the uop caches I don't quite remember. If the code doesn't fit in the uop cache, it drops to 4 (the fastest the decoders can operate out of the L1 instruction cache). Furthermore, Zen, Zen2, and Zen3 all have the same value.

Which is nonsense. We can clearly see that "average IPC" in video game workloads goes up from Zen -> Zen2 -> Zen3. IPC is... this vague, wishy-washy term that people use to sound technical but which is really extremely poorly defined. We want to calculate IPC so that we can multiply it by GHz and come up with an idea of which processor is faster. But it turns out that reality isn't very kind to us, and that these CPUs are horribly, horribly complicated beasts.

No practical computer program sits in the uop cache of Skylake/Zen and reaches 6 IPC. None. Maybe micro-benchmark programs like SuperPi get close, but that's the kind of program you'd need to get anywhere close to the max-IPC on today's systems. Very, very few programs are written like SuperPi / HyperPi.
I agree, IPC shouldn't be a term used in discussions of modern processors the way people use it, but there are benchmark programs out there that can measure average IPC with a modicum of logic to the end result, like Pi.
Comparable to another chip on the same ISA only.
It's not something that translates to a useful performance metric of a chip or core anymore.
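For what it's worth, the measurement itself is trivial; it's the interpretation that's limited. A minimal sketch, assuming the two counter values are placeholders for what a profiler like perf or VTune would report:

```cpp
#include <cstdint>
#include <cstdio>

int main() {
    // Placeholder values standing in for hardware performance counters read over a run.
    std::uint64_t retired_instructions = 48'000'000'000ull;
    std::uint64_t core_cycles          = 21'000'000'000ull;

    double ipc = static_cast<double>(retired_instructions) / static_cast<double>(core_cycles);
    std::printf("average IPC = %.2f\n", ipc);  // ~2.29 here, nowhere near a 6-wide peak

    // This only orders chips running the same binary (same ISA): an ARM core and an
    // x86 core retire different instruction counts for the same amount of work.
    return 0;
}
```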
 
Joined
Feb 29, 2016
Messages
10 (0.00/day)
Processor Intel Core i7-3770K @4.4
Motherboard ASUS SABERTOOTH Z77
Memory 16GB DDR3 2133 Mhz
Video Card(s) intel hd 4000
Power Supply ATX 900W Antec HCG-900
It's marketing bullshit.

1) You can't compare TFLOPS directly across different architectures. It's nonsensical.
RTX 3090: 35.6 TFLOPS, RX 6900 XT: 20.6 TFLOPS. Roughly a 40% difference on paper, yet the real performance difference is 3-10% depending on the game/task (see the sketch at the end of this post).

2) They didn't even show what exactly was tested. Show us some games running at the same settings and the same resolution! Oh wait, there are no games. /sarcasm

3) Forget about AAA games on the M1. Metal API + ARM => no proper gaming. The Metal API essentially killed Mac gaming even on the x86 architecture (Boot Camp excluded).
Going the Metal route instead of Vulkan was a huge mistake.

I have no doubt the M1 Max will make for a great laptop for video editing and the like... but if you're thinking about getting it in hopes of running proper non-mobile games with good graphics settings, resolution and performance, then think twice...
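The sketch mentioned above: where those headline TFLOPS numbers come from, and why they describe peak arithmetic rather than game performance. The clocks are the advertised boost/game clocks, so treat the results as approximations.

```cpp
#include <cstdio>

int main() {
    // Theoretical FP32 throughput: shader ALUs * 2 FLOPs per FMA * clock in GHz / 1000 => TFLOPS.
    double rtx3090_tflops  = 10496 * 2 * 1.695 / 1000.0;  // ~35.6 TFLOPS
    double rx6900xt_tflops =  5120 * 2 * 2.015 / 1000.0;  // ~20.6 TFLOPS

    std::printf("RTX 3090:   %.1f TFLOPS\n", rtx3090_tflops);
    std::printf("RX 6900 XT: %.1f TFLOPS\n", rx6900xt_tflops);

    // Peak math only: actual frame rates also depend on memory bandwidth, caches,
    // drivers, and how well the shaders keep those ALUs occupied.
    return 0;
}
```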
 

Aquinus

Resident Wat-man
Joined
Jan 28, 2012
Messages
13,147 (2.94/day)
Location
Concord, NH, USA
System Name Apollo
Processor Intel Core i9 9880H
Motherboard Some proprietary Apple thing.
Memory 64GB DDR4-2667
Video Card(s) AMD Radeon Pro 5600M, 8GB HBM2
Storage 1TB Apple NVMe, 4TB External
Display(s) Laptop @ 3072x1920 + 2x LG 5k Ultrafine TB3 displays
Case MacBook Pro (16", 2019)
Audio Device(s) AirPods Pro, Sennheiser HD 380s w/ FIIO Alpen 2, or Logitech 2.1 Speakers
Power Supply 96w Power Adapter
Mouse Logitech MX Master 3
Keyboard Logitech G915, GL Clicky
Software MacOS 12.1
I agree, IPC should not be a term used in the discussion of modern processor's in the way people do, but there are Benchmark programs out there that can measure average IPC with a modicum of logic to the end result like Pi
Comparable to another chip on the same ISA only.
It's not something that translates to a useful performance metric of a chip or core anymore.
Consider it another way. The maximum theoretical throughput of a given CPU is (typically) known. What benchmarks help us determine (indirectly) is how much time the pipeline spends stalled or, worse, recovering from a branch misprediction. Those characteristics are what make CPUs different; otherwise the number of threads, cores, and clock speeds would tell you what you want to know. What benchmarks tell us is, at whatever speed we're running and with whatever resources we have, how much we can actually get done.

Now, what in the world does this have to do with cache? Well, if the majority of pipeline stalls are due to memory access (and not branch mispredictions), then larger caches are likely to limit those stalls, which means the CPU spends more time doing work and less time waiting on memory. This gets worse as the working set of data grows, which completely depends on the application.
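A rough way to see this on any machine (my own illustration; the sizes and numbers are arbitrary): chase pointers through an array much bigger than the last-level cache and nearly every load stalls on DRAM, whereas a sequential pass over the same data keeps the prefetchers fed.

```cpp
#include <chrono>
#include <cstdio>
#include <numeric>
#include <random>
#include <utility>
#include <vector>

int main() {
    const std::size_t n = 1u << 23;          // ~8M entries (~64 MB), larger than most L3 caches
    std::vector<std::size_t> next(n);
    std::iota(next.begin(), next.end(), 0);

    // Sattolo's algorithm: builds a single cycle, so the chase below visits every element.
    std::mt19937_64 rng{42};
    for (std::size_t i = n - 1; i > 0; --i) {
        std::uniform_int_distribution<std::size_t> pick(0, i - 1);
        std::swap(next[i], next[pick(rng)]);
    }

    auto t0 = std::chrono::steady_clock::now();
    std::size_t idx = 0, sum = 0;
    for (std::size_t i = 0; i < n; ++i) {     // each load's address depends on the previous load
        idx = next[idx];
        sum += idx;
    }
    auto t1 = std::chrono::steady_clock::now();

    auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(t1 - t0).count();
    std::printf("pointer chase over %zu entries: %lld ms (checksum %zu)\n",
                n, static_cast<long long>(ms), sum);
    // A plain sequential sum over the same array typically finishes an order of magnitude
    // faster, because the hardware prefetcher hides most of the memory latency.
    return 0;
}
```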

After reading the last several pages, I feel like people are basically saying, "It's all handwavy and vague, so trust none of it," which is amusing in a sad sort of way.
 
Joined
May 2, 2017
Messages
7,762 (3.05/day)
Location
Back in Norway
System Name Hotbox
Processor AMD Ryzen 7 5800X, 110/95/110, PBO +150Mhz, CO -7,-7,-20(x6),
Motherboard ASRock Phantom Gaming B550 ITX/ax
Cooling LOBO + Laing DDC 1T Plus PWM + Corsair XR5 280mm + 2x Arctic P14
Memory 32GB G.Skill FlareX 3200c14 @3800c15
Video Card(s) PowerColor Radeon 6900XT Liquid Devil Ultimate, UC@2250MHz max @~200W
Storage 2TB Adata SX8200 Pro
Display(s) Dell U2711 main, AOC 24P2C secondary
Case SSUPD Meshlicious
Audio Device(s) Optoma Nuforce μDAC 3
Power Supply Corsair SF750 Platinum
Mouse Logitech G603
Keyboard Keychron K3/Cooler Master MasterKeys Pro M w/DSA profile caps
Software Windows 10 Pro
Consider it another way. The maximum theoretical throughput of a given CPU is (typically) known. What benchmarks help us determine (indirectly) is how much time the pipeline spends stalled or, worse, recovering from a branch misprediction. Those characteristics are what make CPUs different; otherwise the number of threads, cores, and clock speeds would tell you what you want to know. What benchmarks tell us is, at whatever speed we're running and with whatever resources we have, how much we can actually get done.

Now, what in the world does this have to do with cache? Well, if the majority of pipeline stalls are due to memory access (and not branch mispredictions), then larger caches are likely to limit those stalls, which means the CPU spends more time doing work and less time waiting on memory. This gets worse as the working set of data grows, which completely depends on the application.

After reading the last several pages, I feel like people are basically saying, "It's all handwavy and vague, so trust none of it," which is amusing in a sad sort of way.
That's kind of how I see the term as well. Sure, many people use it as a generic term for "performance per clock", but even then it's reasonably accurate. (Please don't get me started on the people who think "IPC" means "performance".) Theoretical maximum IPC isn't relevant in real-world use cases, as processors, systems and applications are far too complex for that to be a 1:1 relation. The point of calculating "IPC" for comparison is to see how well the system as a whole runs a specific application or collection of applications once you account for clock speed. Is it strictly a measure of instructions? Obviously not - CPU-level instructions aren't generally visible to users, and if you're testing with real-world applications you can't really count them without access to the source code anyway. So it's all speculative to some degree. Which is perfectly fine.

This means that the only practically applicable and useful way of defining IPC in real-world use is clock-normalized performance in known application tests. These must of course be reasonably well written, and should ideally be representative of overall usage of the system. That last point is where it gets iffy, as this is extremely variable, and why for example SPEC is a poor representation of gaming workloads - the tests are just too different. Does that make SPEC testing any less useful? Not whatsoever. It just means you need a modicum of knowledge of how to interpret the results. Which is always the case anyhow.

This also obviously means there's an argument for having more (collections of) tests, as no benchmark collection will ever be wholly representative, and any single score/average/geomean calculated from a collection of tests will never be so either. But again, this is obvious, and it is not a problem. Measured IPC should always be understood as having "averaged across a selection of tested applications" tacked on at the end. And that's perfectly fine. This is neither hand-wavy, vague, nor problematic, but a necessary consequence of PCs and their uses being complex.
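As a concrete (purely hypothetical) illustration of that definition, here's how clock-normalized performance is usually aggregated: score per GHz for each workload, then a geometric mean across the suite. The scores and clocks below are invented just to show the method.

```cpp
#include <cmath>
#include <cstdio>

int main() {
    // Hypothetical suite of three tests; higher score is better.
    const double scores_a[] = {142.0,  98.0, 311.0};   // CPU A, running at 4.9 GHz
    const double scores_b[] = {150.0,  92.0, 280.0};   // CPU B, running at 4.4 GHz
    const double clk_a = 4.9, clk_b = 4.4;
    const int n = 3;

    double log_sum = 0.0;
    for (int i = 0; i < n; ++i) {
        // Per-clock ratio of B over A for this one workload.
        double ratio = (scores_b[i] / clk_b) / (scores_a[i] / clk_a);
        log_sum += std::log(ratio);
    }
    double geomean = std::exp(log_sum / n);
    std::printf("clock-normalized advantage of B over A: %.1f%%\n", (geomean - 1.0) * 100.0);

    // The result is only meaningful "averaged across this selection of applications";
    // a different suite would legitimately give a different number.
    return 0;
}
```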
 
Last edited:
Joined
Nov 3, 2011
Messages
690 (0.15/day)
Location
Australia
System Name Eula
Processor AMD Ryzen 9 7900X PBO
Motherboard ASUS TUF Gaming X670E Plus Wifi
Cooling Corsair H115i Elite Capellix XT
Memory Trident Z5 Neo RGB DDR5-6000 64GB (4x16GB F5-6000J3038F16GX2-TZ5NR) EXPO II, OCCT Tested
Video Card(s) Gigabyte GeForce RTX 4080 GAMING OC
Storage Corsair MP600 XT NVMe 2TB, Samsung 980 Pro NVMe 2TB and Toshiba N300 NAS 10TB HDD
Display(s) 2X LG 27UL600 27in 4K HDR FreeSync/G-Sync DP
Case Phanteks Eclipse P500A D-RGB White
Audio Device(s) Creative Sound Blaster Z
Power Supply Corsair HX1000 Platinum 1000W
Mouse SteelSeries Prime Pro Gaming Mouse
Keyboard SteelSeries Apex 5
Software MS Windows 11 Pro
ReBAR doesn't have anything to do with this - it allows the CPU to write to the entire VRAM rather than smaller chunks, but the CPU still can't work off of VRAM - it needs copying to system RAM for the CPU to work on it. You're right that shared memory has its downsides, but with many times the bandwidth of any x86 CPU (and equal to many dGPUs) I doubt that will be a problem, especially considering Apple's penchant for massive caches.
1. It comes down to the hardware configuration being optimal for the use case.

DirectX 12 on the PC has options to minimize copies of a game's resources.


PC DirectX12 Memory Layout.png


2. On the subject of massive caches, ARM's code density is inferior to x86 and x86-64.
 
Joined
May 2, 2017
Messages
7,762 (3.05/day)
Location
Back in Norway
System Name Hotbox
Processor AMD Ryzen 7 5800X, 110/95/110, PBO +150Mhz, CO -7,-7,-20(x6),
Motherboard ASRock Phantom Gaming B550 ITX/ax
Cooling LOBO + Laing DDC 1T Plus PWM + Corsair XR5 280mm + 2x Arctic P14
Memory 32GB G.Skill FlareX 3200c14 @3800c15
Video Card(s) PowerColor Radeon 6900XT Liquid Devil Ultimate, UC@2250MHz max @~200W
Storage 2TB Adata SX8200 Pro
Display(s) Dell U2711 main, AOC 24P2C secondary
Case SSUPD Meshlicious
Audio Device(s) Optoma Nuforce μDAC 3
Power Supply Corsair SF750 Platinum
Mouse Logitech G603
Keyboard Keychron K3/Cooler Master MasterKeys Pro M w/DSA profile caps
Software Windows 10 Pro
1. It comes down to the hardware configuration being optimal for the use case.
Well, yes. That is kind of obvious, no? "The hardware best suited to the task is best suited to the task" is hardly a revelation.
DirectX 12 on the PC has options to minimize copies of a game's resources.


View attachment 222141
That presentation has nothing to do with unified memory architectures, so I don't see why you bring it up. All it does is present the advantages in Frostbite of an aliasing memory layout, reducing the memory footprint of transient data. While this no doubt reduces copying, it has no bearing on whether or not these are unified memory layouts.
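For anyone curious what that aliasing actually looks like in D3D12, here's a minimal sketch of the concept (my own illustration, not Frostbite's code; sizes, formats and error handling are assumptions/omitted): two transient render targets placed at the same offset of one heap, sharing the same memory.

```cpp
#include <d3d12.h>
#include <wrl/client.h>
using Microsoft::WRL::ComPtr;

// Creates two render-target textures that alias the same memory in one heap.
// Only one holds valid contents at a time; a D3D12_RESOURCE_BARRIER_TYPE_ALIASING
// barrier on the command list marks the hand-over. The caller keeps the heap alive,
// since placed resources don't keep a reference to it.
void CreateAliasedTransients(ID3D12Device* device,
                             ComPtr<ID3D12Heap>& heap,
                             ComPtr<ID3D12Resource>& texA,
                             ComPtr<ID3D12Resource>& texB)
{
    D3D12_HEAP_DESC heapDesc = {};
    heapDesc.SizeInBytes = 64ull * 1024 * 1024;   // assumption: large enough for either texture
    heapDesc.Properties.Type = D3D12_HEAP_TYPE_DEFAULT;
    heapDesc.Alignment = D3D12_DEFAULT_RESOURCE_PLACEMENT_ALIGNMENT;
    heapDesc.Flags = D3D12_HEAP_FLAG_ALLOW_ONLY_RT_DS_TEXTURES;
    device->CreateHeap(&heapDesc, IID_PPV_ARGS(&heap));

    D3D12_RESOURCE_DESC desc = {};
    desc.Dimension = D3D12_RESOURCE_DIMENSION_TEXTURE2D;
    desc.Width = 1920;
    desc.Height = 1080;
    desc.DepthOrArraySize = 1;
    desc.MipLevels = 1;
    desc.Format = DXGI_FORMAT_R16G16B16A16_FLOAT;
    desc.SampleDesc.Count = 1;
    desc.Flags = D3D12_RESOURCE_FLAG_ALLOW_RENDER_TARGET;

    // Both resources are placed at offset 0, so they occupy the same physical memory.
    device->CreatePlacedResource(heap.Get(), 0, &desc, D3D12_RESOURCE_STATE_RENDER_TARGET,
                                 nullptr, IID_PPV_ARGS(&texA));
    device->CreatePlacedResource(heap.Get(), 0, &desc, D3D12_RESOURCE_STATE_RENDER_TARGET,
                                 nullptr, IID_PPV_ARGS(&texB));
}
```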
2. On the subject of massive caches, ARM's code density is inferior to x86 and x86-64.
And? Are you saying it's sufficiently inferior to make up for a 3-6x cache size disadvantage? 'Cause Anandtech's benchmarks show otherwise.
 