- Jun 5, 2016
This sounds very impressive, 29% IPC for integer workloads... but that is one specific workload type, not a general-use scenario with a 29% improvement, so don't get too hyped. And for those trying to call this out: it's pretty honest in its information, but only if your workload is integer-heavy. Overall, hopefully they can get a 10%+ improvement in IPC and clocks go up as well.
It's a mixed floating-point and integer workload with what should be a pretty good hit rate in the L2. It tells us what the core can do on its own in a roughly ideal situation for extracting IPC. Papermaster showed exactly why... there's no missing explanation for the specific benchmark result.
From what we know, the breakdown of performance improvements for this workload probably looks something like this:
- Fetch: 0-5% (from L2/L3/IMC)
- Dispatch: 30-35% (next-instruction counter, larger uop cache, wider dispatch width)
- ALU: 5-15% (the instructions in play are all too simple to see much improvement, so this would be the predictor improvement as it relates to these simple tests)
- FPU: 15-33% (non-AVX workload; the advantage comes from the doubling of load bandwidth)
- Retire: 70-80% (from the doubling of retirement bandwidth, 128-bit to 256-bit; not 100% because of naturally imperfect scaling)
These values would average together to give the IPC increase for this particular workload. These should be the ranges to expect for any program going through the CPU, with some major caveats: the fetch and ALU improvements are not well represented in this workload, and dispatch and retire rule the day.
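As a back-of-the-envelope check, here is that averaging assumption written out, using the speculative ranges above as inputs. This follows the post's own simplification that the per-stage gains average together; the stage names and numbers are the guesses from the list, not measured data:

```python
# Toy estimate of overall IPC gain by averaging per-stage gains,
# under the post's simplifying assumption that stage improvements
# average together. Ranges are the speculative ones listed above.
stage_gains = {
    "fetch":    (0.00, 0.05),
    "dispatch": (0.30, 0.35),
    "alu":      (0.05, 0.15),
    "fpu":      (0.15, 0.33),
    "retire":   (0.70, 0.80),
}

low = sum(lo for lo, hi in stage_gains.values()) / len(stage_gains)
high = sum(hi for lo, hi in stage_gains.values()) / len(stage_gains)
print(f"estimated IPC gain: {low:.0%} to {high:.0%}")
# -> estimated IPC gain: 24% to 34%
```

The 24-34% band happens to bracket the quoted 29% figure, which is the point of the breakdown.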
Also, x86 has plenty of room for improvement. We just have to start walking away from relative energy efficiency.
If we had a process that allowed us to execute and fetch memory with almost no power usage, we would easily double IPC. Everything in a modern CPU is a compromise for power efficiency... including how aggressively you do predictive computation.
Heck, if we created a semi-dedicated pipeline for predictions and left another dedicated path for in-order execution (leaving instruction bubbles and all, but with power gating), we would see branch misprediction penalties drop close to zero. We could execute both possibilities for a branch outcome, then just move over each stage's results once the branch prediction is shown true, removing the instruction bubble with a single cycle of latency and giving nearly perfect prediction performance. This is insane in a world where power consumption is important... you would be executing (partly or in full) nearly every instruction in a program, even for branches not taken. We're talking about potentially more than doubling how much is executed every clock cycle. Still, this would be something like a 50% IPC increase.
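A rough cycle model makes the trade-off concrete. This is a sketch of the dual-path idea only: run both sides of every branch and keep the correct one when the branch resolves, so mispredicts never flush the pipeline. The miss rate and penalty are hypothetical round numbers chosen for illustration, not figures for any real core:

```python
# Toy cycle model for eager dual-path execution vs. conventional
# branch prediction. All parameters are illustrative assumptions.

def cycles_speculative(branches, miss_rate, miss_penalty):
    """Conventional prediction: 1 cycle per branch region, plus a
    pipeline-flush penalty on each mispredicted branch."""
    return branches * ((1 - miss_rate) + miss_rate * (1 + miss_penalty))

def cycles_dual_path(branches):
    """Eager dual-path: both outcomes execute, so resolving a branch
    never flushes; the selection latency hides in the pipeline."""
    return branches

n, miss_rate, penalty = 1_000_000, 0.05, 10  # hypothetical workload
base = cycles_speculative(n, miss_rate, penalty)
eager = cycles_dual_path(n)
print(f"speedup: {base / eager:.2f}x, work executed: ~2x")
# -> speedup: 1.50x, work executed: ~2x
```

With a 5% miss rate and a 10-cycle flush, the model lands on the same ~50% throughput gain the post guesses at, while making the cost explicit: roughly double the instructions executed per cycle of useful work.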