
VLIW: Will this ISA style ever get its time in the sun?

With Intel Itanium getting its last shipment, I've taken some time to reflect on the VLIW methodology. How did Itanium go so wrong? The embedded world has plenty of DSPs (including the ones inside your cell phone right now) that perform all kinds of useful calculations in the VLIW style, and yet... VLIW never really gained mainstream acceptance outside of AMD TeraScale. There are all sorts of theories and hypotheticals about why this happened; the mainstream one is that Intel bet too much on "magic compilers" that never existed and never could exist, compilers that would turn your code into a form that VLIW CPUs could execute efficiently.

I argue that this mainstream view is narrow-minded. Let's start with the basics.

Fine-grained parallelism: the CPU's (and compiler's) #1 job

Let's start at the very beginning, and unlike other articles that discuss pipelines, out-of-order execution, or Turing machines... I'll try to keep this nontechnical. The fundamental job of a CPU is to 1. discover latent parallelism in code, and 2. execute that parallelism. That's all a pipeline is: a mechanism for executing sequential code, but in parallel. (A pipeline stall is the CPU detecting that some instructions can't execute in parallel safely, so it waits until the 1st instruction is done instead of executing them simultaneously). Out-of-order execution, same thing except more complicated. VLIW took a slightly different approach: compilers would have to discover the parallelism ahead of time, and the CPU core itself would then "simply" execute this pre-figured out parallelism. That's what VLIW essentially is: a very long instruction word, the explicit ability for a compiler to tell the CPU "hey, this set of 3 instructions can be executed in parallel".
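To make the "latent parallelism" idea concrete, here's a minimal C sketch of my own (not from any Itanium material): both functions sum the same array, but the first is one long dependency chain, while the second exposes four independent adds per iteration that a superscalar core, or a VLIW bundle, could issue together.

```c
/* A minimal instruction-level-parallelism sketch. Both loops do the same work,
 * but sum_chain forms one long dependency chain (each add needs the previous
 * result), while sum_split uses four independent accumulators the hardware --
 * or a VLIW compiler -- may execute side by side. */
#include <stdio.h>
#include <stddef.h>

double sum_chain(const double *a, size_t n)
{
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        s += a[i];               /* every add depends on the previous add */
    return s;
}

double sum_split(const double *a, size_t n)
{
    double s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    size_t i;
    for (i = 0; i + 4 <= n; i += 4) {
        s0 += a[i];              /* these four adds are independent of     */
        s1 += a[i + 1];          /* each other, so they can be issued      */
        s2 += a[i + 2];          /* in the same cycle                      */
        s3 += a[i + 3];
    }
    for (; i < n; i++)
        s0 += a[i];
    return (s0 + s1) + (s2 + s3);
}

int main(void)
{
    enum { N = 1 << 20 };
    static double a[N];
    for (size_t i = 0; i < N; i++)
        a[i] = 1.0;
    printf("%f %f\n", sum_chain(a, N), sum_split(a, N));
    return 0;
}
```

Compile both with -O2 and time them; on most modern CPUs the split version tends to run noticeably faster purely because of the extra instruction-level parallelism, even though the source code "looks" sequential.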

That's the dirty little secret: whenever a programmer writes code, there's fine-grained parallelism in it, often called "instruction-level parallelism", that the compiler or CPU can figure out and that most programmers don't know or care about. Automatically finding, and executing, this parallelism leads to improved performance. As it turns out, AMD Opteron was able to "discover" parallelism using its decoder and out-of-order execution methodology to a better degree than Intel's VLIW Itanium.

Despite the fact that Itanium was built from the ground up to be a fine-grained parallel system... the compilers meant to automatically discover parallelism were never competitive against the decoder-driven designs of traditional CPUs. As such AMD Opteron's decoder "beat" most compilers in practice.

Why did Itanium fail?

So bringing it back to the beginning: why did Itanium fail? I think the main issue is that traditional CPUs (in particular: AMD Opteron back in the 00s) learned to extract parallelism from code at far higher levels than ever thought possible. Even today, we have the Apple M1 with 8-wide decoding (that is: up to 8 instructions decoded per clock tick), demonstrating the huge parallel width that today's processors have. Even worse: some "latencies" cannot be predicted at compile time. Fetches from DDR4 can take more, or less, time depending on the MESI state (the core-to-core cache-coherence protocol. When one core is reading or writing to DDR4 RAM, no other core can touch it. Other cores must wait, in order to avoid race conditions).

Variable-length latency is the absolute killer of VLIW compilers. The compiler doesn't know how long any memory read or write will take, and that makes scheduling instructions difficult. In contrast, the traditional CPU decoder knows that information (!!!), because CPU decoders are trying to schedule the parallelism as the code is running. It's a complete rout: it seems like a traditional CPU decoder is better than a VLIW ever could be at extracting fine-grained parallelism out of code!
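To see just how variable that latency is, here's a rough pointer-chasing sketch of my own (assuming Linux/POSIX clock_gettime and typical cache sizes): the very same dependent load costs a couple of nanoseconds when the working set fits in L1 and far more once it spills to DRAM, and nothing in the source code tells a compiler which case it will hit.

```c
/* Rough latency demo: chase a randomly permuted cycle of pointers so every
 * load depends on the previous one and the prefetcher can't help. Small
 * working set -> cache-hit latency; huge working set -> DRAM latency. */
#define _POSIX_C_SOURCE 199309L
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

static double chase(size_t *next, size_t steps)
{
    struct timespec t0, t1;
    size_t i = 0;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (size_t s = 0; s < steps; s++)
        i = next[i];                      /* each load depends on the previous one */
    clock_gettime(CLOCK_MONOTONIC, &t1);
    volatile size_t sink = i; (void)sink; /* keep the chain from being optimised away */
    return ((t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec)) / steps;
}

static void run(size_t bytes)
{
    size_t n = bytes / sizeof(size_t);
    size_t *next = malloc(n * sizeof *next);
    if (!next) return;
    for (size_t i = 0; i < n; i++) next[i] = i;
    for (size_t i = n - 1; i > 0; i--) {  /* Sattolo shuffle: one big random cycle */
        size_t j = (size_t)rand() % i;
        size_t tmp = next[i]; next[i] = next[j]; next[j] = tmp;
    }
    printf("%9zu KiB working set: %.1f ns per load\n",
           bytes / 1024, chase(next, 10 * 1000 * 1000));
    free(next);
}

int main(void)
{
    run(16 * 1024);          /* fits in L1      */
    run(256 * 1024 * 1024);  /* spills to DRAM  */
    return 0;
}
```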

So how can VLIW ever possibly see the light of day?

Well, the current VLIW niche does show some hope for this weird ISA style. DSPs inside our cellphones are being used to perform camera-filter / HDR calculations at outstanding speed and efficiency. Because of the regular structure of camera filters, there are very few unpredictable memory accesses. With a more dependable and easier-to-predict schedule, VLIW suddenly becomes usable again, and the decoder (of traditional CPUs) becomes a hot, power-hungry, unnecessary appendage.
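For contrast with the pointer chase above, here's the sort of loop I mean, written as a hypothetical 3x3 box blur in C rather than any vendor's actual DSP kernel: every address it touches is a fixed function of the loop counters, so a VLIW compiler could lay out loads, multiply-accumulates, and stores into slots ahead of time.

```c
/* A camera-filter-style kernel with completely regular, predictable access:
 * a 3x3 box blur over an 8-bit greyscale image. Trip counts and addresses
 * follow directly from the loop bounds, so static scheduling is easy. */
#include <stdio.h>
#include <stdint.h>
#include <stddef.h>

void box_blur_3x3(const uint8_t *src, uint8_t *dst, size_t width, size_t height)
{
    for (size_t y = 1; y + 1 < height; y++) {
        for (size_t x = 1; x + 1 < width; x++) {
            unsigned sum = 0;
            for (size_t dy = 0; dy < 3; dy++)       /* fixed 3x3 neighbourhood:  */
                for (size_t dx = 0; dx < 3; dx++)   /* 9 loads per output pixel, */
                    sum += src[(y + dy - 1) * width + (x + dx - 1)];
            dst[y * width + x] = (uint8_t)(sum / 9);
        }
    }
}

int main(void)
{
    enum { W = 16, H = 16 };
    static uint8_t src[W * H], dst[W * H];
    for (size_t i = 0; i < W * H; i++)
        src[i] = (uint8_t)(i & 0xFF);
    box_blur_3x3(src, dst, W, H);
    printf("dst[1][1] = %u\n", dst[1 * W + 1]);
    return 0;
}
```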

There's a second wildcard: today's compilers are, in fact, much smarter than compilers 20 years ago. It is very possible that the magic compilers Intel wanted for Itanium do exist today. NVidia's PTX compiler, for example, attaches read/write dependencies to every instruction. I don't know the internals of NVidia's decoder architecture on their GPUs... but the read/write dependencies are very clear in the NVidia assembly language!! Just under our noses, it seems like NVidia has created the compilers needed to extract fine-grained parallelism, and is kind of keeping it a soft secret (they're certainly not shouting this feature from the rooftops, despite it being present in the last 4 or so GPU generations).

So maybe it is possible to design a modern VLIW ISA superior to all others today? Maybe VLIW suddenly has a chance to rise from the ashes? But only if there's some CPU manufacturer willing to make a big bet on it once more, after the disaster of Intel's Itanium. The lessons learned, however, would ironically come from cell-phone camera filters and NVidia GPU assembly, and not really from the Itanium itself.
 
Thanks for bringing this up.

First of all, let me point out that the launch of RDNA was heralded as the second coming of the TeraScale architecture. Although that was erroneous, it did suggest that TeraScale and VLIW were finally going to get the recognition they never fully received. The reason is that TeraScale's replacement, the GCN architecture, was billed as the definitive 'out-of-order' performance upgrade that would resolve the compiler issues. Let me restate that it did not. It only allowed scheduling at a latency of 4, which is a lot by GPU standards; execute latency throughput was not the consideration in this design. RDNA, unlike GCN, has a 1-cycle issue cadence. This has the potential to streamline buffers, because streamlining is paramount by default in the case of TeraScale, and automatic with the RDNA architecture. There is no 4x cache buffering needed in order to run a 'single' instruction every consecutive cycle.

TeraScale's designers knew this was the best method; however, it was not until RDNA was developed that it became clear how scalarization was best handled in software.

These days, GPUs don't just load 400% of their buffers and throw away what little memory independence they can keep up with...
 
I will preface this by saying that VLIW is flawed from the get go, it will never be good for general purpose processors.

Here are my points regarding the OP:
That's all a pipeline is: a mechanism for executing sequential code, but in parallel.
A pipeline doesn't necessarily deal with sequential code, it takes independent instructions, splits them into smaller units of work, and completes one step on one instruction while the hardware for the previous step is working on another instruction. Any group of instructions can be processed in a pipelined way, as long as they can be divided into those smaller chunks of work and the instructions are not dependent on each other.
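A toy way to visualise that overlap (purely illustrative on my part, not a model of any real core): print which stage each of five independent instructions occupies on each cycle of a classic 4-stage pipeline.

```c
/* Toy pipeline diagram: instruction i enters Fetch on cycle i, so while I0 is
 * executing, I1 is decoding and I2 is being fetched -- that overlap is where
 * the throughput comes from, as long as the instructions are independent. */
#include <stdio.h>

int main(void)
{
    const char *stages[] = { "F", "D", "E", "W" };  /* Fetch, Decode, Execute, Writeback */
    const int n_stages = 4, n_instr = 5;

    printf("cycle:");
    for (int c = 0; c < n_instr + n_stages - 1; c++)
        printf(" %2d", c);
    printf("\n");

    for (int i = 0; i < n_instr; i++) {
        printf("  I%d: ", i);
        for (int c = 0; c < n_instr + n_stages - 1; c++) {
            int s = c - i;                          /* stage this instruction occupies on cycle c */
            printf(" %2s", (s >= 0 && s < n_stages) ? stages[s] : ".");
        }
        printf("\n");
    }
    return 0;
}
```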


Out-of-order execution, same thing except more complicated.
I think this is important to elaborate on: Out of order improves the performance of a pipeline processing instructions which have dependencies. If we have instruction 1 and 2, where 2 is dependent on the output of 1, we cannot compute 2 until 1 has entirely finished (some ways to slightly get around this, but not considered for simplicity). Out of order allows other instructions to proceed down the pipeline while instruction 2 is waiting, then once the data it needs is ready, instruction 2 will go through the pipeline and execute.
CPU core itself would then "simply" execute this pre-figured out parallelism.
I would add to this that the compiler assigns instructions to "slots" of the execution units in the core. So rather than giving the CPU 50 shaped wooden blocks to put through a square, circle, and triangle hole, the compiler tries to find a square, circle, and triangle block to give to the CPU in each cycle.
That's the dirty little secret
For both superscalar out of order and (superscalar) VLIW machines ;)

Both of these are techniques to improve performance by using ILP, the difference is that VLIW requires the compiler to correctly place all dependent instructions before execution while a superscalar out of order machine does dependency checking and, where necessary, handling in the processor.
As it turns out, AMD Opteron was able to "discover" parallelism using its decoder and out-of-order execution methodology to a better degree than Intel's VLIW Itanium.
I am not particularly familiar with the performance figures of these processors with well optimised code, but I would be hesitant to believe Itanium didn't perform well if the code was designed and compiled around that ISA. It's the same as if you just ran some arm programs after shoddily compiling them for x86: the performance would be lackluster, except that arm and x86 are much more similar to each other than x86 and IA-64 are.


As such AMD Opteron's decoder "beat" most compilers in practice.
It's not really down to the decoder or even opteron in particular I would say. Compiling for superscalar out of order processors at the time was already quite a developed field, as such compilers were already capable of optimising code reasonably well for those contemporary x86 architectures. x86 and other CISC compilers are far from dumb, code optimisation is a huge part of performance. Despite the fact that I strongly believe that avoiding increased programmer burden is the biggest obstacle in building ultra efficient computer systems nowadays, I would argue that modern uArchs actually bank on good coding practices and good compilers to keep delivering performance improvements.

Even today, we have the Apple M1 with 8-wide decoding (that is: up to 8 instructions decoded per clock tick), demonstrating the huge parallel width that today's processors have.
This is a bad example because arm does less work on average per instruction than x86. Not to mention that you can have a pretty big range of performance from a certain width of decode by improving other aspects of the processor. Core 2 had a 4 wide decoder, and today Zen 3 still runs on that 4 wide decoder (ice lake also appears to be 4 wide despite Intel's slides, see Agner).
Even worse: some "latencies" cannot be predicted at compile time.
Bingo, this is the big killer

When one core is reading or writing to DDR4 RAM, no other core can touch it. Other cores must wait, in order to avoid race conditions
I think this is poorly worded, you can make simultaneous reads and writes, the problem is you can't do it to same addresses. Irrelevant either way since it would apply to VLIW as well if it was the case.

I believe you are focusing on the wrong thing here though, the big thing here is caches... Caches make memory accesses inherently variable, depending on which level of cache (or if it is in a cache at all) the data you are interested in is located you can get from under 5 cycles to over 300 cycles of access latency. And this is impossible to predict without knowing exactly what is running on the system (as other executing programs can cause cache evictions) and what data is being processed (different data set may cause an eviction at a different time).
In contrast, the traditional CPU decoder knows that information (!!!), because CPU decoders are trying to schedule the parallelism as the code is running.
The CPU decoder doesn't actually schedule anything, it just decodes the x86 instructions and transforms them into micro ops which are generally single units of computation in the back end.

The reason superscalar out of order is less susceptible to performance loss with variable memory latency is simply because it is out of order. If one instruction gets hung up, e.g. on a memory access, the following instructions will still be decoded and passed into the pipeline. Any independent instructions will be executed, and any instructions dependent on that memory access will pile up until the reorder buffer is full. Once the memory access is complete, the instructions dependent on it will get worked on again and the pile-up will get cleaned out. I will place some links to good lectures talking about this at the end of the post.
So how can VLIW ever possibly see the light of day?
Hold on, we're getting ahead of ourselves here.

There's another problem with VLIW: changing microarchitecture requires recompilation.

Because the VLIW compiler is handing the CPU those shaped blocks to match the shapes of the execution hardware within the CPU, when we change the execution hardware available within the CPU we need to change the shaped blocks the compiler is giving it. Depending on how drastic the changes on the uArch are, not doing so can have anywhere from a substantial performance impact (one or several execution units unutilised in a faster uArch) to basically being unusable (internal latencies are incompatible, processor has entirely different shaped slots than before). While optimising for specific superscalar out of order uArchs is a thing, you will barely ever end up with one execution unit being left entirely idle just because you didn't re-optimise the code for the new uArch.

decoder (of traditional CPUs)
The decoder isn't that big of a burden; rather, it's that superscalar out-of-order execution requires a bunch of other hardware to check for dependencies and resolve them.
There's a second wildcard: today's compilers are, in fact, much smarter than compilers 20 years ago
This is actually because of VLIW... VLIW spurred a lot of research in compilers, which led to many new forms of optimisation that were needed to make true VLIW work but which, because superscalar out of order also uses execution units and can only handle a limited amount of out-of-order-ness without losing throughput, apply to superscalar out of order ISAs as well: x86 and arm most notably.
Just under our noses, it seems like NVidia has created the compilers needed to extract fine-grained parallelism, and is kind of keeping it a soft secret (they're certainly not shouting this feature from the rooftops, despite it being present in the last 4 or so GPU generations).
These kinds of optimisations are abundant in x86 compilers, that's why Nvidia aren't shouting about it from the rooftops ;)
So maybe it is possible to design a modern VLIW ISA superior to all others today?
It won't unfortunately...

I'll elaborate on my general thoughts here...

VLIW is great under certain conditions: Your code is sufficiently simple and regular not to require conventional caches (as opposed to directly managed scratchpads), you know what code you will run, you know your processor/hardware configuration, and you won't change the processor microarchitecture or hardware configuration.

General purpose processors (where modern x86 and arm processors are used) work in an environment that doesn't fulfill any of these requirements: Code is complex and everyone is trying to use the processor for a different purpose, caches are not directly managed, you don't know what code may be running elsewhere on the machine, there is a large variety of different hardware that you need to run on, and hardware configurations and microarchitectures are changing on a very regular basis.

VLIW for general purpose processors is like trying to replace a family car like the VW Passat with a (3-seat) Ford Transit, based on the argument that you can more easily take all the furniture you buy at IKEA home yourself.

Links and stuff:

Agner: https://www.agner.org/optimize/
The microarchitecture optimisation guide (3) "contains details about the internal working of various microprocessors from Intel, AMD and VIA. Topics include: Out-of-order execution, register renaming, pipeline structure, execution unit organization and branch prediction algorithms for each type of microprocessor. Describes many details that cannot be found in manuals from microprocessor vendors or anywhere else. The information is based on my own research and measurements rather than on official sources. This information will be useful to programmers who want to make CPU-specific optimizations as well as to compiler makers and students of microarchitecture."

CMU computer architecture lecture series (spring 2015): See lectures 5 to 16 for microarchitecture and in particular 12-15 for VLIW. Overall a great "traditional" computer architecture course.

ETH digital design and computer architecture lecture series (spring 2021): Also as above, but a bit more expanded focus away from general purpose processors, also more focus on branch prediction.

ETH (Advanced) computer architecture lecture series (late 2020): A more advanced, research/future focused look at computer architecture, covers a lot of interconnects and such which are relevant for growing multicore designs.

AES lecture 2.2.1 (Basic VLIW Approach)
A short overview of VLIW principles, advantages, and disadvantages. Rest of this course may also be interesting.
 
TL;DR: What killed IA-64 was AMD64 (x86-64)

Wasn't VLIW used in AMD GPUs before the HD 7000 series?
 
TL;DR: What killed IA-64 was AMD64 (x86-64)
surprisedpikachu when the only competitor in the market kills off a design that is inherently badly suited for that market lol
 
We can address the point that VLIW is bad at compute, as Fermi absolutely obliterated VLIW in compute whilst the 5000 and 6000 series at the time looked better on paper, spec-wise.
AMD got their act together with GCN, but they did not put the effort that Nvidia did into tessellation.
 
We can address the point that VLIW is bad at compute, as Fermi absolutely obliterated VLIW in compute whilst the 5000 and 6000 series at the time looked better on paper, spec-wise.
I am not certain that was entirely down to VLIW (though interestingly the only other VLIW-like GPU architecture, Kepler, happened to also be bad at compute), I expect other parts of the implementation of terascale made it suboptimal for compute, that said my understanding of GPUs isn't good enough for me to really unpick that question. Though let's also not forget that at the time Nvidia was full steam ahead with developing CUDA as a compute platform, so it would be natural that they had a product that pushed hard on that front.
 
I am not certain that was entirely down to VLIW (though interestingly the only other VLIW-like GPU architecture, Kepler, happened to also be bad at compute), I expect other parts of the implementation of terascale made it suboptimal for compute. Let's also not forget that at the time Nvidia was full steam ahead with developing CUDA as a compute platform, so it would be natural that they had a product that pushed hard on that front.
You make good valid points.

I saw your edit.
 
These kinds of optimisations are abundant in x86 compilers, that's why Nvidia aren't shouting about it from the rooftops ;)

Oh, I'm talking about NVidia SASS. Give this page a read:

Seem familiar? "Bundles" of instructions, 3-instructions + 1 control block (Pascal). The most recent NVidia GPUs even have the purported ability to perform a floating-point instruction simultaneously with an integer-instruction. The control information contains read / write barrier information published by the compiler.

Sure, Volta has control info + 1 instruction (so hardly VLIW), but we're starting to see the compiler figure out some pretty interesting information, almost certainly associated with the internal details of the microarchitectural pipelines. The compiler is statically scheduling something, I don't know what, but... these barriers and stalls are indicating something to the NVidia "decoder".

There's another problem with VLIW: changing microarchitecture requires recompilation.

Indeed. But are we ahead of ourselves? NVidia PTX is recompiled every generation. The PTX pseudo-assembly code is reoptimized by the PTX compiler to create microarchitecture-specific SASS assembly.

The VLIW golden age might be slowly dawning. NVidia's Volta / Turing machines are kind of taking some weird cues from VLIW. I'm not sure if I'd call them a VLIW instruction set, but the similarities are uncanny.

What if all VLIW needed was a pseudo-assembly language layer that was recompiled every time a new GPU came out? What if this layer was PTX (or something like it)?

This NVidia PTX technology, or something like it, is probably what's needed to make VLIW reasonable to use in practice. I'm not quite calling NVidia's GPUs a VLIW architecture... but they seem to have solved many problems associated with VLIW.
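For what it's worth, that per-generation recompile is visible right at the driver API level. Below is a hedged C sketch of my own using the CUDA driver API (link with -lcuda; the trivial PTX string and its .version/.target directives are purely illustrative and may need adjusting for a given toolkit): the program ships PTX text, and cuModuleLoadData() JIT-compiles it into native SASS for whatever GPU generation is actually installed.

```c
/* Sketch: load a PTX module via the CUDA driver API. The same PTX text gets
 * JIT-compiled into architecture-specific SASS for the GPU in this machine. */
#include <stdio.h>
#include <cuda.h>

static const char *ptx =
    ".version 6.0\n"
    ".target sm_50\n"
    ".address_size 64\n"
    ".visible .entry noop()\n"
    "{\n"
    "    ret;\n"
    "}\n";

int main(void)
{
    CUdevice dev;
    CUcontext ctx;
    CUmodule mod;
    CUfunction fn;

    cuInit(0);
    cuDeviceGet(&dev, 0);
    cuCtxCreate(&ctx, 0, dev);

    /* JIT-compile the PTX for the installed GPU's real microarchitecture */
    if (cuModuleLoadData(&mod, ptx) != CUDA_SUCCESS) {
        fprintf(stderr, "PTX JIT failed\n");
        return 1;
    }
    cuModuleGetFunction(&fn, mod, "noop");
    printf("PTX was JIT-compiled to native SASS for device 0\n");

    cuModuleUnload(mod);
    cuCtxDestroy(ctx);
    return 0;
}
```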
 
Seem familiar? "Bundles" of instructions, 3-instructions + 1 control block (Pascal).
Bundling instructions isn't VLIW, in fact a smart x86 or ARM compiler will also try to ensure certain placement of instructions to make the job of the decoder easier (many x86 decoders for example could only decode 1 complex instruction per cycle, so it made sense to group instructions into groups of 1 complex and 3 simple where possible). Sending control information is also not necessarily VLIW, it is just avoided in CPU ISAs because it generally ties your program to a specific hardware configuration.

The biggest aspect of VLIW is that you tie the organisation of instructions in the word to the configuration of your execution units within the core, this enables you to cull off a lot of the complexity related to routing instructions to different parts of the core and queueing instructions which are stalled by dependencies.

This cost factor is why they are so attractive for embedded devices, the code and hardware are locked in when something goes into production, and any code updates can just be compiled for that hardware. And since you don't care if people can or cannot run other programs on your embedded device, you don't care about whether other code needs to be recompiled to run on it. So in the end the lack of flexibility is not a problem, and you gain a big (design) cost and power advantage because of reducing the complexity of the processor.
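To make that slot idea concrete, here's a toy C sketch of a made-up three-slot machine (two ALUs plus one memory port; no real VLIW encoding looks like this): the bundle format literally mirrors the hardware configuration, which is why adding or removing a unit obsoletes the compiled code.

```c
/* Toy VLIW machine: one bundle slot per execution unit. Only the mem slot may
 * load/store, so with a single memory port the two loads below cannot share a
 * bundle; the "compiler" already resolved all dependencies at compile time. */
#include <stdio.h>

typedef enum { NOP, ADD, MUL, LOAD, STORE } Op;

typedef struct { Op op; int dst, src1, src2; } Slot;

typedef struct { Slot alu0, alu1, mem; } Bundle;   /* one slot per execution unit */

static int regs[8];
static int memory[16] = { 5, 7 };

static void run_slot(Slot s)
{
    switch (s.op) {
    case ADD:   regs[s.dst] = regs[s.src1] + regs[s.src2]; break;
    case MUL:   regs[s.dst] = regs[s.src1] * regs[s.src2]; break;
    case LOAD:  regs[s.dst] = memory[s.src1];              break;
    case STORE: memory[s.dst] = regs[s.src1];              break;
    case NOP:   break;
    }
}

int main(void)
{
    Bundle program[] = {
        { { NOP }, { NOP }, { LOAD, 0, 0 } },              /* r0 = mem[0]            */
        { { NOP }, { NOP }, { LOAD, 1, 1 } },              /* r1 = mem[1]            */
        { { ADD, 2, 0, 1 }, { MUL, 3, 0, 1 }, { NOP } },   /* r2 = r0+r1, r3 = r0*r1 */
        { { NOP }, { NOP }, { STORE, 2, 2 } },             /* mem[2] = r2            */
    };

    for (size_t i = 0; i < sizeof program / sizeof program[0]; i++) {
        run_slot(program[i].alu0);     /* in real hardware these three slots  */
        run_slot(program[i].alu1);     /* would all issue in the same cycle   */
        run_slot(program[i].mem);
    }
    printf("sum=%d product=%d\n", memory[2], regs[3]);
    return 0;
}
```

Add a second memory port to the hardware and this bundle format, and every binary compiled against it, is obsolete; that's the recompilation cost discussed above.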


What if all VLIW needed was a pseudo-assembly language layer that was recompiled every time a new GPU came out?
This is already kinda how GPU drivers work, they transform your rendering code into stuff that works on your GPU architecture, but it's not something that enables VLIW per se.

Unfortunately for something intended to be general purpose like a CPU, you can't expect this to survive in the market, it has been attempted before but it just adds more effort and everyone ends up going down the path of least resistance with general purpose stuff.
 
surprisedpikachu when the only competitor in the market kills off a design that is inherently badly suited for that market lol
Not so much that it was bad for that market, but that it was terrible for all existing code without a recompile. And we know how much existing x86 code there is, and how likely you are to get a "free" recompile.
 
I believe ATI/AMD's TeraScale architecture was VLIW; that would be the 2000-6000 series cards.
Correct.

BTW, Transmeta tried to make "regular" VLIW CPUs as well (for the mobile market) under the names Crusoe and Efficeon. They did, however, "translate" from x86 code, running some form of virtualization on the things, if I understood correctly.
 
We can address the point that VLIW is bad at compute, as Fermi absolutely obliterated VLIW in compute whilst the 5000 and 6000 series at the time looked better on paper, spec-wise.
AMD got their act together with GCN, but they did not put the effort that Nvidia did into tessellation.
That is a joke, right? Since the HD 5000 cards were the first graphics card series with built-in D3D11 tessellation, it couldn't have been any other way...
 
That is a joke, right? Since the HD 5000 cards were the first graphics card series with built-in D3D11 tessellation, it couldn't have been any other way...
No, it's really not a joke at all.

 
This is already kinda how GPU drivers work, they transform your rendering code into stuff that works on your GPU architecture, but it's not something that enables VLIW per se.

Ish. The thing about PTX is that it's a portable assembly language for NVidia's GPUs.

As you can see in the document: the raw assembly languages (aka SASS) for NVidia Kepler, Pascal, and Volta all differ dramatically. The reason that fascinates me so much is that the GPU world has built up what appears to be an impossibility: a high-performance, architecture-agnostic, recompiled set of code to... calculate the color of Gordon Freeman's eyebrows from multiple angles in real time. (Gordon Freeman is a character from the video game "Half-Life".)

But ignoring the silly video game graphics for a sec... this seems to have implications on how we could hypothetically build a general purpose CPU and/or computer model that achieves portable high-performance compute. And maybe, that future solution would leverage VLIW.
 
No, it's really not a joke at all.

You know it was proprietary junk, right? I don't want to lose precious time, but tessellation realigns textures and allows for "free antialiasing" (I cannot find the sphere-made-of-chubby-triangles photo from AMD's launch presentation right now) with regard to how textures are sampled. That is it. It is not used to generate textures, which, if you listened to yourself, you would notice is a stupid idea.
 
You know it was proprietary junk, right? I don't want to lose precious time, but tessellation realigns textures and allows for "free antialiasing" with regard to how textures are sampled. That is it. It is not used to generate textures, which, if you listened to yourself, you would notice is a stupid idea.
Tessellation is not proprietary at all; it was and has always been a feature of DX11 and onwards. ATi at the time just did not put as much effort into it as Nvidia did, but Nvidia released a card that was far less efficient for gaming; when the real compute work happened, though, Fermi would flog the VLIW 5000 series.
 
Ish. The thing about PTX is that it's a portable assembly language for NVidia's GPUs.

As you can see in the document: the raw assembly languages (aka SASS) for NVidia Kepler, Pascal, and Volta all differ dramatically. The reason that fascinates me so much is that the GPU world has built up what appears to be an impossibility: a high-performance, architecture-agnostic, recompiled set of code to... calculate the color of Gordon Freeman's eyebrows from multiple angles in real time. (Gordon Freeman is a character from the video game "Half-Life".)

But ignoring the silly video game graphics for a sec... this seems to have implications on how we could hypothetically build a general purpose CPU and/or computer model that achieves portable high-performance compute. And maybe, that future solution would leverage VLIW.
I honestly think it is a grave mistake to consider existing GPU approaches architecture agnostic... Especially when it comes to graphics.

Also, the paper you linked above is exactly proof of that, it's written to analytically discover the latencies and other characteristics of the architecture to optimise performance.
 
ATi at the time just did not put as much effort into it as Nvidia did, but Nvidia released a card that was far less efficient for gaming; when the real compute work happened, though, Fermi would flog the VLIW 5000 series.
That is lame. If Nvidia couldn't be first, nor do it more efficiently, how is it of any significance that their latecomer (which, let me remind you, failed for two iterations: the 480 and 560) can claim to match the features of 'the' first D3D11 tessellator card...
 
Correct.

BTW, Transmeta tried to make "regular" VLIW CPUs as well (for the mobile market) under the names Crusoe and Efficeon. They did however "translate" from x86 code running some form of virtualization on the things, If I understood correctly.
Nvidia also attempted a VLIW CPU with the Parker chip; turns out that while it was good in benchmarks, the moment it met spaghetti code of any kind it ground to a halt.
 
That is lame. If Nvidia couldn't be first, nor do it more efficiently, how is it of any significance that their latecomer (which, let me remind you, failed for two iterations: the 480 and 560) can claim to match the features of 'the' first D3D11 tessellator card...
ATi and AMD did not put the effort in, and it did not magically stop with the Fermi series. Radeon cards lagged behind in tessellation performance up until roughly Polaris, where they worked on it more closely and were basically attempting to make a 390X with great efficiency and improvements to tessellation and rendering.



The GTX 580, which was a refresh of the 480 with the full die unlocked, was right up the 7970's ass, and that was AMD's second-generation DX11 architecture (GCN).
The 680, whilst missing its compute chops, still beat out the compute-orientated 7970, whilst Nvidia were aiming for efficiency earlier on than AMD.


This gap would mostly remain the same up until they moved away from GCN and onto Navi.

 
I honestly think it is a grave mistake to consider existing GPU approaches architecture agnostic... Especially when it comes to graphics.

Also, the paper you linked above is exactly proof of that, it's written to analytically discover the latencies and other characteristics of the architecture to optimise performance.

I'm not saying you reach optimal levels of performance from high level code.

But what I'm saying is: OpenCL / DirectX shaders / CUDA have demonstrated the advances in compiler technology over the last 20 years. It's now possible to achieve higher levels of performance than ever before, even across diverse sets of hardware (AMD TeraScale, AMD GCN, AMD RDNA, NVidia Kepler, Pascal, and Volta).

The microarchitectural details always matter. But this advancement cannot be denied... especially as we look at the fundamental SASS assembly language of Kepler and see the details of dependencies / write barriers / etc. being added by the PTX compiler. These GPUs are exposing details of their pipelines to the ISA like never before, and yet GPU programmers are able to keep some semblance of portability when writing DirectX shaders.
 
ATi and AMD did not put the effort in, and it did not magically stop with the Fermi series. Radeon cards lagged behind in tessellation performance
How can you be behind when it was featured on AMD (ATi) cards even before DirectX 10? Something doesn't add up in your story; maybe somebody is sweeping something under the rug...
 
How can you be behind when it was featured on AMD (ATi) cards even before DirectX 10? Something doesn't add up in your story; maybe somebody is sweeping something under the rug...
You can be first to anything and smug enough to think you already have what it takes to compete with an aging architecture.
Then give way for a competitor to leave you in the dust due to complacency and arrogance.
Intel and AMD right now, AMD and Nvidia back then.
 