With Intel Itanium getting its last shipment, I've taken some time to reflect on the VLIW methodology. How did Itanium go so wrong? The embedded world has plenty of DSPs (including the ones inside your cell phone right now) that perform all kinds of useful calculations in the VLIW style, and yet VLIW never really gained mainstream acceptance outside of AMD's TeraScale GPUs. There are all sorts of theories and hypotheses about why this happened. The mainstream one is that Intel bet too much on "magic compilers" that never existed and never could exist: compilers that would turn your code into a form that VLIW CPUs could execute efficiently.
I argue that the mainstream view is narrow-minded. Let's start with the basics.
Fine-grained parallelism: the CPU's (and compiler's) #1 job
Let's start at the very beginning, and unlike other articles that discuss pipelines, out-of-order execution, or Turing machines, I'll try to keep this non-technical. The fundamental job of a CPU is to 1. discover latent parallelism in code, and 2. execute that parallelism. That's all a pipeline is: a mechanism for executing sequential code in parallel. (A pipeline stall is the CPU detecting that some instructions can't safely execute in parallel, so it waits until the first instruction is done instead of executing them simultaneously.) Out-of-order execution is the same thing, just more complicated. VLIW took a slightly different approach: the compiler would have to discover the parallelism ahead of time, and the CPU core itself would then "simply" execute this pre-computed parallelism. That's essentially what VLIW is: a very long instruction word, an explicit way for a compiler to tell the CPU "hey, these three instructions can be executed in parallel".
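To show what I mean, here's a toy C example (the variable names and values are made up purely for illustration) with fine-grained parallelism hiding in perfectly ordinary sequential code:

```c
/* A minimal sketch of instruction-level parallelism in ordinary C code.
 * The variable names and values are made up for illustration. */
#include <stdio.h>

int main(void) {
    int a = 3, b = 5, c = 7, d = 11;

    /* These three operations have no data dependencies on each other, so a
     * VLIW compiler could pack them into one long instruction word, and an
     * out-of-order core could issue them in the same clock cycle. */
    int x = a + b;
    int y = c * d;
    int z = a - c;

    /* This one depends on x, y, and z, so it cannot share a bundle or a
     * cycle with them; a simple pipeline would stall here until the
     * results above are ready. */
    int w = x + y + z;

    printf("%d\n", w);
    return 0;
}
```

An out-of-order core discovers that independence on the fly at runtime; a VLIW compiler has to discover it ahead of time and encode it directly into the instruction stream.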
That's the dirty little secret: whenever a programmer writes code, there's fine-grained parallelism in it, often called "instruction-level parallelism," that the compiler can figure out even though most programmers don't know or care about it. Automatically finding and executing this parallelism leads to improved performance. As it turns out, AMD's Opteron was able to "discover" parallelism with its decoder and out-of-order execution machinery to a greater degree than Intel's VLIW Itanium ever managed.
Despite the fact that Itanium was built from the ground up to be a fine-grained parallel system, the compilers that were supposed to automatically discover parallelism were never competitive against the decoder designs of traditional CPUs. As such, AMD Opteron's decoder "beat" most compilers in practice.
Why did Itanium fail?
So, bringing it back to the beginning: why did Itanium fail? I think the main issue is that traditional CPUs (in particular, AMD's Opteron back in the 2000s) learned to extract parallelism from code at far higher levels than anyone thought possible. Today we have the Apple M1 with 8-wide decode (that is, up to eight instructions decoded per clock tick), demonstrating the enormous parallel width of modern processors. Even worse: some latencies simply cannot be predicted at compile time. A fetch from DDR4 can take more or less time depending on MESI state (the cache-coherency protocol the cores use to coordinate with each other: when one core wants to write to a cache line, every other core's copy of that line has to be invalidated first, and any core that still needs the line must wait, to avoid race conditions).
Variable latency is the absolute killer of VLIW compilers. The compiler doesn't know how long any memory read or write will take, and that makes scheduling instructions difficult. In contrast, the traditional CPU's decoder and scheduler do know that information (!!!), because they schedule the parallelism while the code is running. It's a complete rout: it seems like a traditional CPU is better than a VLIW design could ever be at extracting fine-grained parallelism out of code!
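To make the variable-latency point concrete, here's a rough pointer-chasing sketch in C (it assumes a POSIX system for clock_gettime, and the array size and the numbers you'd actually measure are machine-dependent). It runs the exact same load instruction in a loop; the only thing that changes is whether the loads hit cache:

```c
/* Rough sketch: the same dependent load takes a few nanoseconds when it
 * hits cache and far longer when every access misses to DRAM. A compiler
 * cannot know at compile time which case it is scheduling for. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (1u << 23)   /* 8M entries (~64 MB), larger than typical caches */

static double chase_ns_per_load(const size_t *next) {
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    size_t i = 0;
    for (size_t step = 0; step < N; step++)
        i = next[i];                    /* every load depends on the previous one */
    clock_gettime(CLOCK_MONOTONIC, &t1);
    volatile size_t sink = i;           /* keep the chain from being optimized away */
    (void)sink;
    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    return ns / N;
}

int main(void) {
    size_t *seq  = malloc(N * sizeof *seq);   /* sequential chain: prefetch-friendly */
    size_t *rnd  = malloc(N * sizeof *rnd);   /* random chain: mostly cache misses   */
    size_t *perm = malloc(N * sizeof *perm);
    if (!seq || !rnd || !perm) return 1;

    for (size_t i = 0; i < N; i++) {
        seq[i]  = (i + 1) % N;
        perm[i] = i;
    }
    srand(42);                                 /* Fisher-Yates shuffle */
    for (size_t i = N - 1; i > 0; i--) {
        size_t j = (size_t)rand() % (i + 1);
        size_t tmp = perm[i]; perm[i] = perm[j]; perm[j] = tmp;
    }
    for (size_t i = 0; i < N; i++)             /* one big cycle through the permutation */
        rnd[perm[i]] = perm[(i + 1) % N];

    printf("sequential chase: %.1f ns/load\n", chase_ns_per_load(seq));
    printf("random chase:     %.1f ns/load\n", chase_ns_per_load(rnd));
    free(seq); free(rnd); free(perm);
    return 0;
}
```

On typical desktop hardware the random chase comes out many times slower per load. An out-of-order core simply schedules around the misses as they happen, while a VLIW compiler would have had to commit to a guessed latency at compile time.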
So how can VLIW ever possibly see the light of day?
Well, the current VLIW niche does show some hope for this weird ISA style. The DSPs inside our cellphones are being used to perform camera filters and HDR calculations at outstanding speed and efficiency. Because of the regular structure of camera filters, there are very few unpredictable memory accesses. With a more dependable, easier-to-predict schedule, VLIW suddenly becomes usable again, and the decoder of a traditional CPU becomes a hot, power-hungry, unnecessary appendage.
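For a taste of why that kind of code is so VLIW-friendly, here's a minimal sketch of a camera-filter-style kernel in C (the resolution and the tone curve are made up for illustration): fixed loop bounds, streaming memory accesses, and plenty of arithmetic per pixel, all known at compile time.

```c
/* Minimal sketch of a regular, predictable image kernel: a per-pixel
 * tone-map over a fixed-size frame. The resolution and the curve are
 * invented for illustration. Loop bounds, access pattern, and arithmetic
 * are all known at compile time, so a VLIW compiler can schedule every
 * cycle statically instead of guessing at latencies. */
#include <stdint.h>

#define W 1920
#define H 1080

void tone_map(const uint16_t in[H][W], uint8_t out[H][W]) {
    for (int y = 0; y < H; y++) {
        for (int x = 0; x < W; x++) {
            float v = in[y][x] / 65535.0f;      /* streaming load, no pointer chasing */
            float mapped = v / (v + 0.25f);     /* simple Reinhard-style tone curve   */
            out[y][x] = (uint8_t)(mapped * 255.0f + 0.5f);
        }
    }
}
```

Every iteration looks exactly like the last one, so the compiler can unroll and software-pipeline the loop and keep every VLIW slot busy without ever having to guess how long a load will take.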
There's a second wildcard: today's compilers are, in fact, much smarter than compilers from 20 years ago. It is very possible that the magic compilers Intel wanted for Itanium do exist today. NVIDIA's PTX compiler, for example, attaches read/write dependency information to every instruction it emits. I don't know the internals of NVIDIA's decoder architecture on their GPUs, but the read/write dependencies are plainly visible in NVIDIA's assembly language! Just under our noses, NVIDIA seems to have created the compilers needed to extract fine-grained parallelism, and they're kind of keeping it a soft secret (they're certainly not shouting about this feature from the rooftops, despite it shipping in the last four or so GPU generations).
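As a purely hypothetical sketch of what "compiler-attached dependencies" amount to (the struct and field names below are mine, not NVIDIA's actual encoding; as far as I know the real format has only been described publicly through reverse-engineering efforts), the idea is that every machine instruction carries its own scheduling hints:

```c
/* Hypothetical illustration only: this is NOT NVIDIA's real instruction
 * format. It just shows the concept of a compiler baking dependency and
 * timing information into each emitted instruction. */
#include <stdint.h>

typedef struct {
    uint64_t opcode_and_operands;  /* the instruction itself                         */
    uint8_t  stall_cycles;         /* compiler-computed wait before this can issue   */
    uint8_t  write_barrier;        /* scoreboard slot signaled when the result lands */
    uint8_t  read_barrier_mask;    /* scoreboard slots this instruction waits on     */
} scheduled_insn;                  /* name invented for this sketch */
```

With that information already in the binary, the hardware no longer has to rediscover the dependencies at runtime, which is exactly the bargain VLIW was always trying to strike.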
So maybe it is possible to design a modern VLIW ISA that is superior to everything else out there today? Maybe VLIW suddenly has a chance to rise from the ashes? But only if some CPU manufacturer is willing to make a big bet on it once more, after the disaster of Intel's Itanium. Ironically, the lessons learned would come from cell-phone camera filters and NVIDIA GPU assembly, and not really from Itanium itself.