
Apple Introduces M1 Pro and M1 Max: the Most Powerful Chips Apple Has Ever Built

IPC (instructions per clock) is a hypothetical unbending max of the hardware...

That value is 6+ for Intel/AMD, depending on some details of the uop caches I don't quite remember. If the code falls out of the uop cache, it drops to 4 (the fastest the decoders can operate out of the L1 instruction cache). Furthermore, Zen, Zen2, and Zen3 all have the same value.

Which is nonsense. We can clearly see that "average IPC" in video game applications goes up from Zen -> Zen2 -> Zen3. IPC is... this vague, wishy-washy term that people use to sound technical but that is really extremely poorly defined. We want to calculate IPC so that we can multiply it by clock speed and come up with an idea of which processor is faster. But it turns out that reality isn't very kind to us, and that these CPUs are horribly, horribly complicated beasts.

No practical computer program sits in the uop cache of Skylake/Zen and reaches 6 IPC. None. Maybe micro-benchmarks like SuperPi get close, but that's the kind of program it takes to get anywhere near the max IPC on today's systems. Very, very few programs are written like SuperPi / HyperPi.
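To put rough numbers on it, here's a minimal sketch of what "measured IPC" actually is - retired instructions divided by core cycles. All the counts below are made-up placeholders; on Linux you'd pull the real ones from `perf stat`.

```c
#include <stdio.h>

int main(void) {
    /* placeholder counters - in practice, read them from `perf stat` output */
    double instructions = 8.4e9;   /* retired instructions (hypothetical)          */
    double cycles       = 6.0e9;   /* core cycles over the same run (hypothetical) */
    double clock_ghz    = 4.5;     /* sustained core clock (hypothetical)          */
    double issue_width  = 6.0;     /* theoretical uops/cycle ceiling claimed above */

    double ipc = instructions / cycles;
    printf("measured IPC     : %.2f\n", ipc);
    printf("fraction of peak : %.0f%%\n", 100.0 * ipc / issue_width);
    /* what actually matters is IPC x clock, i.e. instructions per second */
    printf("work rate        : %.1f billion instr/s\n", ipc * clock_ghz);
    return 0;
}
```

Even a healthy-looking 1.4 IPC is barely a quarter of the theoretical 6-wide ceiling, which is the whole point.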
 
I agree, IPC should not be used in discussions of modern processors the way people tend to use it, but there are benchmark programs out there that can measure average IPC with a modicum of logic to the end result, like Pi calculators.
That's comparable only to another chip on the same ISA, though.
It's not something that translates into a useful performance metric for a chip or core anymore.
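Here's a toy illustration (all numbers invented) of why IPC only makes sense within one ISA: the same task compiles to a different number of instructions on a different ISA, so the chip with the higher IPC can still be the slower one.

```c
#include <stdio.h>

struct chip {
    const char *label;
    double instructions;   /* instructions the compiler emits for the same task */
    double ipc;            /* average instructions retired per cycle            */
    double clock_ghz;
};

int main(void) {
    /* invented numbers: B retires more instructions per cycle, but the same
     * program also compiles to more instructions on its ISA */
    struct chip chips[2] = {
        { "ISA A (denser code)", 1.0e9, 2.5, 4.0 },
        { "ISA B (more instrs)", 1.6e9, 3.5, 3.2 },
    };
    for (int i = 0; i < 2; i++) {
        double seconds = chips[i].instructions /
                         (chips[i].ipc * chips[i].clock_ghz * 1e9);
        printf("%s  IPC %.1f -> %.3f s for the same task\n",
               chips[i].label, chips[i].ipc, seconds);
    }
    return 0;
}
```

ISA B "wins" on IPC (3.5 vs 2.5) yet finishes the task later, which is why cross-ISA IPC comparisons tell you nothing on their own.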
 
It's marketing bullshit.

1) You can't compare TFLOPS directly across different architectures. It's nonsensical (see the sketch after this list).
3090 - 35.6 TFLOPS, 6900 XT - 20.6 TFLOPS. Roughly a 40% difference on paper, yet the real performance difference is 3-10% depending on the game/task.

2) They didn't even show what exactly was tested. Show us some games running with the same settings and the same resolution! Oh, wait, there are no games. /sarcasm

3) Forget about AAA games on M1. Metal API + ARM => no proper gaming. The Metal API essentially killed Mac gaming even on x86 (Boot Camp excluded).
Going the Metal route instead of Vulkan was a huge mistake.
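On point 1, here's a minimal sketch of where those paper numbers come from: peak FP32 TFLOPS is just shader ALUs x 2 FLOPs per FMA x clock. The specs below are the commonly quoted ones; treat them as approximate.

```c
#include <stdio.h>

/* peak FP32 throughput: shader ALUs x 2 FLOPs per FMA x clock (GHz) -> TFLOPS */
static double peak_tflops(int shaders, double clock_ghz) {
    return shaders * 2.0 * clock_ghz / 1000.0;
}

int main(void) {
    double rtx3090  = peak_tflops(10496, 1.70);   /* boost clock, commonly quoted */
    double rx6900xt = peak_tflops(5120, 2.015);   /* game clock, commonly quoted  */
    printf("RTX 3090   : %.1f TFLOPS\n", rtx3090);
    printf("RX 6900 XT : %.1f TFLOPS\n", rx6900xt);
    printf("paper gap  : %.0f%%\n", 100.0 * (rtx3090 - rx6900xt) / rtx3090);
    /* ...yet reviews put the two within a few percent of each other in most
     * games, because utilization, memory subsystem and drivers dominate. */
    return 0;
}
```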

I have no doubt the M1 Max will make for a great laptop for video editing and the like... but if you're thinking about getting it in hopes of running proper non-mobile games with good graphics settings, resolution and performance, then think twice...
 
I agree, IPC should not be used in discussions of modern processors the way people tend to use it, but there are benchmark programs out there that can measure average IPC with a modicum of logic to the end result, like Pi calculators.
That's comparable only to another chip on the same ISA, though.
It's not something that translates into a useful performance metric for a chip or core anymore.
Consider it another way. The maximum theoretical throughput of a given CPU is (typically) known. What benchmarks help us determine (indirectly) is how much time is spent with the pipeline stalled or, worse, recovering from a branch misprediction. Those characteristics are what make CPUs different; otherwise the number of threads, cores, and clock speed would tell you what you want to know. What benchmarks tell us is, for whatever speed we're running at with whatever resources, how much we actually get done.

Now, what in the world does this have to do with cache? Well, if the majority of pipeline stalls are due to memory access (and not branch misprediction), then larger caches are likely to limit those stalls, which means the CPU spends more time doing work and less time waiting on memory. This gets worse as the working set of data grows, which depends entirely on the application.
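To make that concrete, here's a minimal pointer-chasing sketch (assuming a POSIX system with clock_gettime; compile with something like `cc -O2 chase.c`). It walks a randomly shuffled chain of dependent loads; once the working set outgrows the caches, every hop turns into a memory stall and the time per load jumps.

```c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

static unsigned long long rng_state = 0x9E3779B97F4A7C15ull;
static size_t rng(void) {                 /* xorshift64: cheap index generator */
    rng_state ^= rng_state << 13;
    rng_state ^= rng_state >> 7;
    rng_state ^= rng_state << 17;
    return (size_t)rng_state;
}

static double now_sec(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

int main(void) {
    /* working sets from 32 KiB (cache-resident) up to 128 MiB (DRAM) */
    for (size_t kib = 32; kib <= 128 * 1024; kib *= 4) {
        size_t n = kib * 1024 / sizeof(size_t);
        size_t *next = malloc(n * sizeof(size_t));
        for (size_t i = 0; i < n; i++) next[i] = i;
        /* Sattolo shuffle: one big cycle, so hardware prefetchers can't help */
        for (size_t i = n - 1; i > 0; i--) {
            size_t j = rng() % i;
            size_t t = next[i]; next[i] = next[j]; next[j] = t;
        }
        size_t hops = 20 * 1000 * 1000, p = 0;
        double t0 = now_sec();
        for (size_t i = 0; i < hops; i++) p = next[p];  /* dependent loads: each waits on the previous */
        double ns = (now_sec() - t0) * 1e9 / (double)hops;
        printf("%7zu KiB working set: %5.1f ns per load (p=%zu)\n", kib, ns, p);
        free(next);
    }
    return 0;
}
```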

After reading the last several pages, I feel like people are basically saying, "It's all handwavy and vague, so trust none of it," which is amusing in a sad sort of way.
 
That's kind of how I see the term as well. Sure, many people use it as a generic term for "performance per clock", and even then it's reasonably accurate. (Please don't get me started on the people who think "IPC" means "performance".) Theoretical maximum IPC isn't relevant in real-world use cases, as processors, systems and applications are far too complex for this to be a 1:1 relation, so the point of calculating "IPC" for comparison is to see how well the system as a whole is able to run a specific application or collection of applications while accounting for clock speed. Is it strictly a measure of instructions? Obviously not - CPU-level instructions aren't generally visible to users, after all, and if you're using real-world applications for testing you can't really know the instruction counts unless you have access to the source code. So it's all speculative to some degree. Which is perfectly fine.

This means that the only practically applicable and useful way of defining IPC in real-world use is clock-normalized performance in known application tests. These must of course be reasonably well written, and should ideally be representative of overall usage of the system. That last point is where it gets iffy, as it is extremely variable, and it's why, for example, SPEC is a poor representation of gaming workloads - the tests are just too different. Does that make SPEC testing any less useful? Not whatsoever. It just means you need a modicum of knowledge of how to interpret the results. Which is always the case anyhow.

This also obviously means there's an argument for having more (collections of) tests, as no benchmark collection will ever be wholly representative, and any single score/average/geomean calculated from a collection of tests will never be so either. But again, this is obvious, and it is not a problem. Measured IPC should always be understood as having "averaged across a selection of tested applications" tacked on at the end. And that's perfectly fine. This is neither hand-wavy, vague, nor problematic, but a necessary consequence of PCs and their uses being complex.
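For what it's worth, here's a minimal sketch (scores and clocks invented; compile with -lm for pow) of that "clock-normalized, averaged across applications" definition: divide each benchmark score by clock, take the per-test ratio between the two chips, and geometric-mean the ratios.

```c
#include <stdio.h>
#include <math.h>

#define NTESTS 3

int main(void) {
    /* invented per-application scores (higher is better) and sustained clocks */
    double cpu_a[NTESTS] = { 142.0, 88.0, 310.0 };  double clk_a = 4.9; /* GHz */
    double cpu_b[NTESTS] = { 130.0, 95.0, 285.0 };  double clk_b = 4.0; /* GHz */

    double product = 1.0;
    for (int i = 0; i < NTESTS; i++) {
        /* per-test ratio of performance-per-clock, B relative to A */
        product *= (cpu_b[i] / clk_b) / (cpu_a[i] / clk_a);
    }
    double ipc_ratio = pow(product, 1.0 / NTESTS);  /* geometric mean */

    printf("B vs A, clock-normalized (\"IPC\"): %+.1f%%\n", (ipc_ratio - 1.0) * 100.0);
    /* change the application selection and this "IPC uplift" changes with it */
    return 0;
}
```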
 
ReBAR doesn't have anything to do with this - it allows the CPU to write to the entire VRAM rather than in smaller chunks, but the CPU still can't work off of VRAM - data needs to be copied to system RAM for the CPU to work on it. You're right that shared memory has its downsides, but with many times the memory bandwidth of any x86 CPU (and equal to many dGPUs), I doubt that will be a problem, especially considering Apple's penchant for massive caches.
1. It comes down to a hardware configuration that is optimal for the use case.

PC's DirectX 12 has options to minimize copying of the game's resources.


[Image: PC DirectX12 Memory Layout]


2. On the subject of massive caches, ARM's code density is inferior to x86 and x86-64.
 
1. It comes down to a hardware configuration that is optimal for the use case.
Well, yes. That is kind of obvious, no? "The hardware best suited to the task is best suited to the task" is hardly a revelation.
PC's DirectX 12 has options to minimize copying of the game's resources.


[Image: PC DirectX12 Memory Layout]
That presentation has nothing to do with unified memory architectures, so I don't see why you bring it up. All it does is present the advantages in Frostbite of an aliasing memory layout, reducing the memory footprint of transient data. While this no doubt reduces copying, it has no bearing on whether or not these are unified memory layouts.
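For anyone unfamiliar with the technique in that slide, here's a minimal plain-C sketch of the aliasing idea (names invented, standing in for what a D3D12 placed-resource setup does): two transient buffers whose lifetimes never overlap share one backing allocation, so the peak footprint shrinks.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define TRANSIENT_SIZE (8u << 20)   /* 8 MiB per transient resource (made up) */

int main(void) {
    unsigned char *heap = malloc(TRANSIENT_SIZE);   /* one shared backing allocation */

    /* pass 1: a temporary render target lives at offset 0 */
    unsigned char *scratch_pass1 = heap;
    memset(scratch_pass1, 0xAA, TRANSIENT_SIZE);

    /* pass 2: runs strictly after pass 1, so its scratch aliases the same bytes */
    unsigned char *scratch_pass2 = heap;            /* same offset - aliased */
    memset(scratch_pass2, 0xBB, TRANSIENT_SIZE);

    printf("peak footprint: %u MiB aliased vs %u MiB without aliasing\n",
           TRANSIENT_SIZE >> 20, (2 * TRANSIENT_SIZE) >> 20);
    free(heap);
    return 0;
}
```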
2. On the subject of massive caches, ARM's code density is inferior to x86 and x86-64.
And? Are you saying it's sufficiently inferior to make up for a 3-6x cache size disadvantage? 'Cause Anandtech's benchmarks show otherwise.
 