
VLIW: Will this ISA style ever get its time in the sun?

With Intel Itanium getting its last shipment, I've taken some time to reflect on the VLIW methodology. How did Itanium go so wrong? The embedded world has plenty of DSPs (including the ones inside your cell phone right now) that perform all kinds of useful calculations in the VLIW style, and yet... VLIW never really gained mainstream acceptance outside of AMD TeraScale. There are all sorts of theories and hypotheticals about why this happened; the mainstream one is that Intel bet too much on "magic compilers" that never existed and never could exist, compilers that would turn your code into a form that VLIW CPUs could execute efficiently.

I argue that this mainstream view is narrow-minded. Let's start with the basics.

Fine-grained parallelism: the CPU's (and compiler's) #1 job

Let's start at the very beginning, and unlike other articles that discuss pipelines, out-of-order execution, or Turing machines... I'll try to keep this nontechnical. The fundamental job of a CPU is to 1. discover latent parallelism in code, and 2. execute that parallelism. That's all a pipeline is: a mechanism for executing sequential code, but in parallel. (A pipeline stall is the CPU detecting that some instructions can't execute in parallel safely, so it waits until the 1st instruction is done instead of executing them simultaneously). Out-of-order execution, same thing except more complicated. VLIW took a slightly different approach: compilers would have to discover the parallelism ahead of time, and the CPU core itself would then "simply" execute this pre-figured out parallelism. That's what VLIW essentially is: a very long instruction word, the explicit ability for a compiler to tell the CPU "hey, this set of 3 instructions can be executed in parallel".
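To make the "latent parallelism" idea concrete, here's a minimal C sketch of my own (not from any Itanium material): both functions sum the same array, but the first is one long dependency chain, while the second exposes four independent adds per iteration that a superscalar core, or a VLIW bundle, could issue together.

```c
/* A minimal instruction-level-parallelism sketch. Both loops do the same work,
 * but sum_chain forms one long dependency chain (each add needs the previous
 * result), while sum_split uses four independent accumulators the hardware --
 * or a VLIW compiler -- may execute side by side. */
#include <stdio.h>
#include <stddef.h>

double sum_chain(const double *a, size_t n)
{
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        s += a[i];               /* every add depends on the previous add */
    return s;
}

double sum_split(const double *a, size_t n)
{
    double s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    size_t i;
    for (i = 0; i + 4 <= n; i += 4) {
        s0 += a[i];              /* these four adds are independent of     */
        s1 += a[i + 1];          /* each other, so they can be issued      */
        s2 += a[i + 2];          /* in the same cycle                      */
        s3 += a[i + 3];
    }
    for (; i < n; i++)
        s0 += a[i];
    return (s0 + s1) + (s2 + s3);
}

int main(void)
{
    enum { N = 1 << 20 };
    static double a[N];
    for (size_t i = 0; i < N; i++)
        a[i] = 1.0;
    printf("%f %f\n", sum_chain(a, N), sum_split(a, N));
    return 0;
}
```

Compile both with -O2 and time them; on most modern CPUs the split version tends to run noticeably faster purely because of the extra instruction-level parallelism, even though the source code "looks" sequential.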

That's the dirty little secret: whenever a programmer writes code, there's fine-grained parallelism in it, often called "instruction-level parallelism", that the compiler or CPU can figure out and that most programmers don't know or care about. Automatically finding, and executing, this parallelism leads to improved performance. As it turns out, AMD Opteron was able to "discover" parallelism using its decoder and out-of-order execution methodology to a better degree than Intel's VLIW Itanium.

Despite the fact that Itanium was built from the ground up to be a fine-grained parallel system... the compilers meant to automatically discover parallelism were never competitive against the decoder-driven designs of traditional CPUs. As such AMD Opteron's decoder "beat" most compilers in practice.

Why did Itanium fail?

So bringing it back to the beginning: why did Itanium fail? I think the main issue is that traditional CPUs (in particular: AMD Opteron back in the 00s) learned to extract parallelism from code at far higher levels than ever thought possible. Even today, we have the Apple M1 with 8-wide decoding (that is: up to 8 instructions decoded per clock tick), demonstrating the huge parallel width that today's processors have. Even worse: some "latencies" cannot be predicted at compile time. Fetches from DDR4 can take more, or less, time depending on the MESI state (the core-to-core cache-coherence protocol. When one core is reading or writing to DDR4 RAM, no other core can touch it. Other cores must wait, in order to avoid race conditions).

Variable-length latency is the absolute killer of VLIW compilers. The compiler doesn't know how long any memory read or write will take, and that makes scheduling instructions difficult. In contrast, the traditional CPU decoder knows that information (!!!), because CPU decoders are trying to schedule the parallelism as the code is running. It's a complete rout: it seems like a traditional CPU decoder is better than a VLIW ever could be at extracting fine-grained parallelism out of code!
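To see just how variable that latency is, here's a rough pointer-chasing sketch of my own (assuming Linux/POSIX clock_gettime and typical cache sizes): the very same dependent load costs a couple of nanoseconds when the working set fits in L1 and far more once it spills to DRAM, and nothing in the source code tells a compiler which case it will hit.

```c
/* Rough latency demo: chase a randomly permuted cycle of pointers so every
 * load depends on the previous one and the prefetcher can't help. Small
 * working set -> cache-hit latency; huge working set -> DRAM latency. */
#define _POSIX_C_SOURCE 199309L
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

static double chase(size_t *next, size_t steps)
{
    struct timespec t0, t1;
    size_t i = 0;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (size_t s = 0; s < steps; s++)
        i = next[i];                      /* each load depends on the previous one */
    clock_gettime(CLOCK_MONOTONIC, &t1);
    volatile size_t sink = i; (void)sink; /* keep the chain from being optimised away */
    return ((t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec)) / steps;
}

static void run(size_t bytes)
{
    size_t n = bytes / sizeof(size_t);
    size_t *next = malloc(n * sizeof *next);
    if (!next) return;
    for (size_t i = 0; i < n; i++) next[i] = i;
    for (size_t i = n - 1; i > 0; i--) {  /* Sattolo shuffle: one big random cycle */
        size_t j = (size_t)rand() % i;
        size_t tmp = next[i]; next[i] = next[j]; next[j] = tmp;
    }
    printf("%9zu KiB working set: %.1f ns per load\n",
           bytes / 1024, chase(next, 10 * 1000 * 1000));
    free(next);
}

int main(void)
{
    run(16 * 1024);          /* fits in L1      */
    run(256 * 1024 * 1024);  /* spills to DRAM  */
    return 0;
}
```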

So how can VLIW ever possibly see the light of day?

Well, the current VLIW niche does show some hope for this weird ISA style. DSPs inside our cellphones are being used to perform camera-filter / HDR calculations at outstanding speed and efficiency. Because of the regular structure of camera filters, there are very few unpredictable memory accesses. With a more dependable and easier-to-predict schedule, VLIW suddenly becomes usable again, and the decoder (of traditional CPUs) becomes a hot, power-hungry, unnecessary appendage.
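For contrast with the pointer chase above, here's the sort of loop I mean, written as a hypothetical 3x3 box blur in C rather than any vendor's actual DSP kernel: every address it touches is a fixed function of the loop counters, so a VLIW compiler could lay out loads, multiply-accumulates, and stores into slots ahead of time.

```c
/* A camera-filter-style kernel with completely regular, predictable access:
 * a 3x3 box blur over an 8-bit greyscale image. Trip counts and addresses
 * follow directly from the loop bounds, so static scheduling is easy. */
#include <stdio.h>
#include <stdint.h>
#include <stddef.h>

void box_blur_3x3(const uint8_t *src, uint8_t *dst, size_t width, size_t height)
{
    for (size_t y = 1; y + 1 < height; y++) {
        for (size_t x = 1; x + 1 < width; x++) {
            unsigned sum = 0;
            for (size_t dy = 0; dy < 3; dy++)       /* fixed 3x3 neighbourhood:  */
                for (size_t dx = 0; dx < 3; dx++)   /* 9 loads per output pixel, */
                    sum += src[(y + dy - 1) * width + (x + dx - 1)];
            dst[y * width + x] = (uint8_t)(sum / 9);
        }
    }
}

int main(void)
{
    enum { W = 16, H = 16 };
    static uint8_t src[W * H], dst[W * H];
    for (size_t i = 0; i < W * H; i++)
        src[i] = (uint8_t)(i & 0xFF);
    box_blur_3x3(src, dst, W, H);
    printf("dst[1][1] = %u\n", dst[1 * W + 1]);
    return 0;
}
```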

There's a second wildcard: today's compilers are, in fact, much smarter than compilers 20 years ago. It is very possible that the magic compilers Intel wanted for Itanium do exist today. NVidia's PTX compiler, for example, attaches read/write dependencies to every instruction. I don't know the internals of NVidia's decoder architecture on their GPUs... but the read/write dependencies are very clear in the NVidia assembly language!! Just under our noses, it seems like NVidia has created the compilers needed to extract fine-grained parallelism, and is kind of keeping it a soft secret (they're certainly not shouting this feature from the rooftops, despite it being present in the last 4 or so GPU generations).

So maybe it is possible to design a modern VLIW ISA superior to all others today? Maybe VLIW suddenly has a chance to rise from the ashes? But only if there's some CPU manufacturer willing to make a big bet on it once more, after the disaster of Intel's Itanium. The lessons learned, however, would ironically come from cell-phone camera filters and NVidia GPU assembly, and not really from the Itanium itself.
 
Thanks for bringing this up.

First of all, let me point out that the launch of RDNA was heralded as the second coming of the TeraScale architecture. Although that was erroneous, it did suggest that TeraScale and VLIW were finally going to get the recognition they never fully received. The reason is that TeraScale's replacement, the GCN architecture, was billed as the definitive 'out-of-order' performance upgrade that would resolve the compiler issues. Let me restate that it did not. It only allowed scheduling at a latency of 4, which is a lot by GPU standards; execute latency throughput was not the consideration in this design. RDNA, unlike GCN, has a 1-cycle issue cadence. This has the potential to streamline buffers, because streamlining is paramount by default in the case of TeraScale, and automatic with the RDNA architecture. There is no 4x cache buffering needed in order to run a 'single' instruction every consecutive cycle.

TeraScale's designers knew this was the best method; however, it was not until RDNA was developed that it became clear how scalarization was best handled in software.

These days, GPUs don't just load 400% of their buffers and throw away what little memory independence they can keep up with...
 
I will preface this by saying that VLIW is flawed from the get go, it will never be good for general purpose processors.

Here are my points regarding the OP:
That's all a pipeline is: a mechanism for executing sequential code, but in parallel.
A pipeline doesn't necessarily deal with sequential code, it takes independent instructions, splits them into smaller units of work, and completes one step on one instruction while the hardware for the previous step is working on another instruction. Any group of instructions can be processed in a pipelined way, as long as they can be divided into those smaller chunks of work and the instructions are not dependent on each other.
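A toy way to visualise that overlap (purely illustrative on my part, not a model of any real core): print which stage each of five independent instructions occupies on each cycle of a classic 4-stage pipeline.

```c
/* Toy pipeline diagram: instruction i enters Fetch on cycle i, so while I0 is
 * executing, I1 is decoding and I2 is being fetched -- that overlap is where
 * the throughput comes from, as long as the instructions are independent. */
#include <stdio.h>

int main(void)
{
    const char *stages[] = { "F", "D", "E", "W" };  /* Fetch, Decode, Execute, Writeback */
    const int n_stages = 4, n_instr = 5;

    printf("cycle:");
    for (int c = 0; c < n_instr + n_stages - 1; c++)
        printf(" %2d", c);
    printf("\n");

    for (int i = 0; i < n_instr; i++) {
        printf("  I%d: ", i);
        for (int c = 0; c < n_instr + n_stages - 1; c++) {
            int s = c - i;                          /* stage this instruction occupies on cycle c */
            printf(" %2s", (s >= 0 && s < n_stages) ? stages[s] : ".");
        }
        printf("\n");
    }
    return 0;
}
```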


Out-of-order execution, same thing except more complicated.
I think this is important to elaborate on: Out of order improves the performance of a pipeline processing instructions which have dependencies. If we have instruction 1 and 2, where 2 is dependent on the output of 1, we cannot compute 2 until 1 has entirely finished (some ways to slightly get around this, but not considered for simplicity). Out of order allows other instructions to proceed down the pipeline while instruction 2 is waiting, then once the data it needs is ready, instruction 2 will go through the pipeline and execute.
CPU core itself would then "simply" execute this pre-figured out parallelism.
I would add to this that the compiler assigns instructions to "slots" of the execution units in the core. So rather than giving the CPU 50 shaped wooden blocks to put through a square, circle, and triangle hole, the compiler tries to find a square, circle, and triangle block to give to the CPU in each cycle.
That's the dirty little secret
For both superscalar out of order and (superscalar) VLIW machines ;)

Both of these are techniques to improve performance by using ILP, the difference is that VLIW requires the compiler to correctly place all dependent instructions before execution while a superscalar out of order machine does dependency checking and, where necessary, handling in the processor.
As it turns out, AMD Opteron was able to "discover" parallelism using its decoder and out-of-order execution methodology to a better degree than Intel's VLIW Itanium.
I am not particularly familiar with the performance figures of these processors with well optimised code, but I would be hesitant to believe Itanium didn't perform well if the code was designed and compiled around that ISA. It's the same as if you just ran some arm programs after shoddily compiling them for x86: the performance would be lackluster, except that arm and x86 are much more similar to each other than x86 and IA-64 are.


As such AMD Opteron's decoder "beat" most compilers in practice.
It's not really down to the decoder or even opteron in particular I would say. Compiling for superscalar out of order processors at the time was already quite a developed field, as such compilers were already capable of optimising code reasonably well for those contemporary x86 architectures. x86 and other CISC compilers are far from dumb, code optimisation is a huge part of performance. Despite the fact that I strongly believe that avoiding increased programmer burden is the biggest obstacle in building ultra efficient computer systems nowadays, I would argue that modern uArchs actually bank on good coding practices and good compilers to keep delivering performance improvements.

Even today, we have the Apple M1 with 8-wide decoding (that is: up to 8 instructions decoded per clock tick), demonstrating the huge parallel width that today's processors have.
This is a bad example because arm does less work on average per instruction than x86. Not to mention that you can have a pretty big range of performance from a certain width of decode by improving other aspects of the processor. Core 2 had a 4 wide decoder, and today Zen 3 still runs on that 4 wide decoder (ice lake also appears to be 4 wide despite Intel's slides, see Agner).
Even worse: some "latencies" cannot be predicted at compile time.
Bingo, this is the big killer

When one core is reading or writing to DDR4 RAM, no other core can touch it. Other cores must wait, in order to avoid race conditions
I think this is poorly worded, you can make simultaneous reads and writes, the problem is you can't do it to same addresses. Irrelevant either way since it would apply to VLIW as well if it was the case.

I believe you are focusing on the wrong thing here though, the big thing here is caches... Caches make memory accesses inherently variable, depending on which level of cache (or if it is in a cache at all) the data you are interested in is located you can get from under 5 cycles to over 300 cycles of access latency. And this is impossible to predict without knowing exactly what is running on the system (as other executing programs can cause cache evictions) and what data is being processed (different data set may cause an eviction at a different time).
In contrast, the traditional CPU decoder knows that information (!!!), because CPU decoders are trying to schedule the parallelism as the code is running.
The CPU decoder doesn't actually schedule anything, it just decodes the x86 instructions and transforms them into micro ops which are generally single units of computation in the back end.

The reason superscalar out of order is less susceptible to performance loss with variable memory latency is simply because it is out of order. If one instruction gets hung up, e.g. on a memory access, the following instructions will still be decoded and passed into the pipeline. Any independent instructions will be executed, and any instructions dependent on that memory access will pile up until the reorder buffer is full. Once the memory access is complete, the instructions dependent on it will get worked on again and the pile-up will get cleaned out. I will place some links to good lectures talking about this at the end of the post.
So how can VLIW ever possibly see the light of day?
Hold on, we're getting ahead of ourselves here.

There's another problem with VLIW: changing microarchitecture requires recompilation.

Because the VLIW compiler is handing the CPU those shaped blocks to match the shapes of the execution hardware within the CPU, when we change the execution hardware available within the CPU we need to change the shaped blocks the compiler is giving it. Depending on how drastic the changes on the uArch are, not doing so can have anywhere from a substantial performance impact (one or several execution units unutilised in a faster uArch) to basically being unusable (internal latencies are incompatible, processor has entirely different shaped slots than before). While optimising for specific superscalar out of order uArchs is a thing, you will barely ever end up with one execution unit being left entirely idle just because you didn't re-optimise the code for the new uArch.

decoder (of traditional CPUs)
The decoder isn't that big of a burden; rather, it's that superscalar out-of-order execution requires a bunch of other hardware to check for dependencies and resolve them.
There's a second wildcard: today's compilers are, in fact, much smarter than compilers 20 years ago
This is actually because of VLIW... VLIW spurred a lot of research in compilers, which led to many new forms of optimisation that were needed to make true VLIW work but which, because superscalar out of order also uses execution units and can only handle a limited amount of out-of-order-ness without losing throughput, apply to superscalar out of order ISAs as well: x86 and arm most notably.
Just under our noses, it seems like NVidia has created the compilers needed to extract fine-grained parallelism, and is kind of keeping it a soft secret (they're certainly not shouting this feature from the rooftops, despite it being present in the last 4 or so GPU generations).
These kinds of optimisations are abundant in x86 compilers, that's why Nvidia aren't shouting about it from the rooftops ;)
So maybe it is possible to design a modern VLIW ISA superior to all others today?
It won't unfortunately...

I'll elaborate on my general thoughts here...

VLIW is great under certain conditions: Your code is sufficiently simple and regular not to require conventional caches (as opposed to directly managed scratchpads), you know what code you will run, you know your processor/hardware configuration, and you won't change the processor microarchitecture or hardware configuration.

General purpose processors (where modern x86 and arm processors are used) work in an environment that doesn't fulfill any of these requirements: Code is complex and everyone is trying to use the processor for a different purpose, caches are not directly managed, you don't know what code may be running elsewhere on the machine, there is a large variety of different hardware that you need to run on, and hardware configurations and microarchitectures are changing on a very regular basis.

VLIW for general purpose processors is like trying to replace a family car like the VW Passat with a (3-seat) Ford Transit, based on the argument that you can more easily take all the furniture you buy at IKEA home yourself.

Links and stuff:

Agner: https://www.agner.org/optimize/
The microarchitecture optimisation guide (3) "contains details about the internal working of various microprocessors from Intel, AMD and VIA. Topics include: Out-of-order execution, register renaming, pipeline structure, execution unit organization and branch prediction algorithms for each type of microprocessor. Describes many details that cannot be found in manuals from microprocessor vendors or anywhere else. The information is based on my own research and measurements rather than on official sources. This information will be useful to programmers who want to make CPU-specific optimizations as well as to compiler makers and students of microarchitecture."

CMU computer architecture lecture series (spring 2015): See lectures 5 to 16 for microarchitecture and in particular 12-15 for VLIW. Overall a great "traditional" computer architecture course.

ETH digital design and computer architecture lecture series (spring 2021): Also as above, but a bit more expanded focus away from general purpose processors, also more focus on branch prediction.

ETH (Advanced) computer architecture lecture series (late 2020): A more advanced, research/future focused look at computer architecture, covers a lot of interconnects and such which are relevant for growing multicore designs.

AES lecture 2.2.1 (Basic VLIW Approach)
A short overview of VLIW principles, advantages, and disadvantages. Rest of this course may also be interesting.
 
TL;DR: What killed IA-64 was AMD64 (x86-64)

Wasn't VLIW used in AMD GPUs before the HD 7000 series?
 
TL;DR: What killed IA-64 was AMD64 (x86-64)
surprisedpikachu when the only competitor in the market kills off a design that is inherently badly suited for that market lol
 
We can address the point that VLIW is bad at compute, as Fermi absolutely obliterated VLIW in compute whilst the 5000 and 6000 series at the time looked better on paper, spec-wise.
AMD got their act together with GCN, but they did not put the effort that Nvidia did into tessellation.
 
We can address the point that VLIW is bad at compute, as Fermi absolutely obliterated VLIW in compute whilst the 5000 and 6000 series at the time looked better on paper, spec-wise.
I am not certain that was entirely down to VLIW (though interestingly the only other VLIW-like GPU architecture, Kepler, happened to also be bad at compute), I expect other parts of the implementation of terascale made it suboptimal for compute, that said my understanding of GPUs isn't good enough for me to really unpick that question. Though let's also not forget that at the time Nvidia was full steam ahead with developing CUDA as a compute platform, so it would be natural that they had a product that pushed hard on that front.
 
I am not certain that was entirely down to VLIW (though interestingly the only other VLIW-like GPU architecture, Kepler, happened to also be bad at compute), I expect other parts of the implementation of terascale made it suboptimal for compute. Let's also not forget that at the time Nvidia was full steam ahead with developing CUDA as a compute platform, so it would be natural that they had a product that pushed hard on that front.
You make good valid points.

I saw your edit.
 
These kinds of optimisations are abundant in x86 compilers, that's why Nvidia aren't shouting about it from the rooftops ;)

Oh, I'm talking about NVidia SASS. Give this page a read:

Seem familiar? "Bundles" of instructions, 3-instructions + 1 control block (Pascal). The most recent NVidia GPUs even have the purported ability to perform a floating-point instruction simultaneously with an integer-instruction. The control information contains read / write barrier information published by the compiler.

Sure, Volta has control info + 1 instruction (so hardly VLIW), but we're starting to see the compiler figure out some pretty interesting information, almost certainly associated with the internal details of the microarchitectural pipelines. The compiler is statically scheduling something, I don't know what, but... these barriers and stalls are indicating something to the NVidia "decoder".

There's another problem with VLIW: changing microarchitecture requires recompilation.

Indeed. But are we ahead of ourselves? NVidia PTX is recompiled every generation. The PTX pseudo-assembly code is reoptimized by the PTX compiler to create microarchitecture-specific SASS assembly.

The VLIW golden age might be slowly dawning. NVidia's Volta / Turing machines are kind of taking some weird cues from VLIW. I'm not sure if I'd call them a VLIW instruction set, but the similarities are uncanny.

What if all VLIW needed was a pseudo-assembly language layer that was recompiled every time a new GPU came out? What if this layer was PTX (or something like it)?

This NVidia PTX technology, or something like it, is probably what's needed to make VLIW reasonable to use in practice. I'm not quite calling NVidia's GPUs a VLIW architecture... but they seem to have solved many problems associated with VLIW.
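For what it's worth, that per-generation recompile is visible right at the driver API level. Below is a hedged C sketch of my own using the CUDA driver API (link with -lcuda; the trivial PTX string and its .version/.target directives are purely illustrative and may need adjusting for a given toolkit): the program ships PTX text, and cuModuleLoadData() JIT-compiles it into native SASS for whatever GPU generation is actually installed.

```c
/* Sketch: load a PTX module via the CUDA driver API. The same PTX text gets
 * JIT-compiled into architecture-specific SASS for the GPU in this machine. */
#include <stdio.h>
#include <cuda.h>

static const char *ptx =
    ".version 6.0\n"
    ".target sm_50\n"
    ".address_size 64\n"
    ".visible .entry noop()\n"
    "{\n"
    "    ret;\n"
    "}\n";

int main(void)
{
    CUdevice dev;
    CUcontext ctx;
    CUmodule mod;
    CUfunction fn;

    cuInit(0);
    cuDeviceGet(&dev, 0);
    cuCtxCreate(&ctx, 0, dev);

    /* JIT-compile the PTX for the installed GPU's real microarchitecture */
    if (cuModuleLoadData(&mod, ptx) != CUDA_SUCCESS) {
        fprintf(stderr, "PTX JIT failed\n");
        return 1;
    }
    cuModuleGetFunction(&fn, mod, "noop");
    printf("PTX was JIT-compiled to native SASS for device 0\n");

    cuModuleUnload(mod);
    cuCtxDestroy(ctx);
    return 0;
}
```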
 
Seem familiar? "Bundles" of instructions, 3-instructions + 1 control block (Pascal).
Bundling instructions isn't VLIW, in fact a smart x86 or ARM compiler will also try to ensure certain placement of instructions to make the job of the decoder easier (many x86 decoders for example could only decode 1 complex instruction per cycle, so it made sense to group instructions into groups of 1 complex and 3 simple where possible). Sending control information is also not necessarily VLIW, it is just avoided in CPU ISAs because it generally ties your program to a specific hardware configuration.

The biggest aspect of VLIW is that you tie the organisation of instructions in the word to the configuration of your execution units within the core, this enables you to cull off a lot of the complexity related to routing instructions to different parts of the core and queueing instructions which are stalled by dependencies.

This cost factor is why they are so attractive for embedded devices, the code and hardware are locked in when something goes into production, and any code updates can just be compiled for that hardware. And since you don't care if people can or cannot run other programs on your embedded device, you don't care about whether other code needs to be recompiled to run on it. So in the end the lack of flexibility is not a problem, and you gain a big (design) cost and power advantage because of reducing the complexity of the processor.
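To make that slot idea concrete, here's a toy C sketch of a made-up three-slot machine (two ALUs plus one memory port; no real VLIW encoding looks like this): the bundle format literally mirrors the hardware configuration, which is why adding or removing a unit obsoletes the compiled code.

```c
/* Toy VLIW machine: one bundle slot per execution unit. Only the mem slot may
 * load/store, so with a single memory port the two loads below cannot share a
 * bundle; the "compiler" already resolved all dependencies at compile time. */
#include <stdio.h>

typedef enum { NOP, ADD, MUL, LOAD, STORE } Op;

typedef struct { Op op; int dst, src1, src2; } Slot;

typedef struct { Slot alu0, alu1, mem; } Bundle;   /* one slot per execution unit */

static int regs[8];
static int memory[16] = { 5, 7 };

static void run_slot(Slot s)
{
    switch (s.op) {
    case ADD:   regs[s.dst] = regs[s.src1] + regs[s.src2]; break;
    case MUL:   regs[s.dst] = regs[s.src1] * regs[s.src2]; break;
    case LOAD:  regs[s.dst] = memory[s.src1];              break;
    case STORE: memory[s.dst] = regs[s.src1];              break;
    case NOP:   break;
    }
}

int main(void)
{
    Bundle program[] = {
        { { NOP }, { NOP }, { LOAD, 0, 0 } },              /* r0 = mem[0]            */
        { { NOP }, { NOP }, { LOAD, 1, 1 } },              /* r1 = mem[1]            */
        { { ADD, 2, 0, 1 }, { MUL, 3, 0, 1 }, { NOP } },   /* r2 = r0+r1, r3 = r0*r1 */
        { { NOP }, { NOP }, { STORE, 2, 2 } },             /* mem[2] = r2            */
    };

    for (size_t i = 0; i < sizeof program / sizeof program[0]; i++) {
        run_slot(program[i].alu0);     /* in real hardware these three slots  */
        run_slot(program[i].alu1);     /* would all issue in the same cycle   */
        run_slot(program[i].mem);
    }
    printf("sum=%d product=%d\n", memory[2], regs[3]);
    return 0;
}
```

Add a second memory port to the hardware and this bundle format, and every binary compiled against it, is obsolete; that's the recompilation cost discussed above.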


What if all VLIW needed was a pseudo-assembly language layer that was recompiled every time a new GPU came out?
This is already kinda how GPU drivers work, they transform your rendering code into stuff that works on your GPU architecture, but it's not something that enables VLIW per se.

Unfortunately for something intended to be general purpose like a CPU, you can't expect this to survive in the market, it has been attempted before but it just adds more effort and everyone ends up going down the path of least resistance with general purpose stuff.
 
surprisedpikachu when the only competitor in the market kills off a design that is inherently badly suited for that market lol
Not so much that it was bad for that market, but that it was terrible for all existing code without a recompile. And we know how much existing x86 code there is, and how likely you are to get a "free" recompile.
 
I believe ATI/AMD's TeraScale architecture was VLIW; that would be the 2000-6000 series cards.
Correct.

BTW, Transmeta tried to make "regular" VLIW CPUs as well (for the mobile market) under the names Crusoe and Efficeon. They did, however, "translate" from x86 code, running some form of virtualization on the things, if I understood correctly.
 
We can address the point that VLIW is bad at compute, as Fermi absolutely obliterated VLIW in compute whilst the 5000 and 6000 series at the time looked better on paper, spec-wise.
AMD got their act together with GCN, but they did not put the effort that Nvidia did into tessellation.
That is a joke, right? Since the HD 5000 cards were the first graphics card series with built-in D3D11 tessellation, it couldn't have been any other way...
 
That is a joke, right? Since the HD 5000 cards were the first graphics card series with built-in D3D11 tessellation, it couldn't have been any other way...
No, it's really not a joke at all.

 
This is already kinda how GPU drivers work, they transform your rendering code into stuff that works on your GPU architecture, but it's not something that enables VLIW per se.

Ish. The thing about PTX is that it's a portable assembly language for NVidia's GPUs.

As you can see in the document: the raw assembly languages (aka SASS) for NVidia Kepler, Pascal, and Volta all differ dramatically. The reason that fascinates me so much is that the GPU world has built up what appears to be an impossibility: a high-performance, architecture-agnostic, recompiled set of code to... calculate the color of Gordon Freeman's eyebrows from multiple angles in real time. (Gordon Freeman is a character from the video game "Half-Life".)

But ignoring the silly video game graphics for a sec... this seems to have implications on how we could hypothetically build a general purpose CPU and/or computer model that achieves portable high-performance compute. And maybe, that future solution would leverage VLIW.
 
No, it's really not a joke at all.

You know it was proprietary junk, right? I don't want to lose precious time, but tessellation realigns textures and allows for "free antialiasing" (I cannot find the sphere-made-of-chubby-triangles photo from AMD's launch presentation right now) with regard to how textures are sampled. That is it. It is not used to generate textures, which, if you listened to yourself, you would notice is a stupid idea.
 
You know it was proprietary junk, right? I don't want to lose precious time, but tessellation realigns textures and allows for "free antialiasing" with regard to how textures are sampled. That is it. It is not used to generate textures, which, if you listened to yourself, you would notice is a stupid idea.
Tessellation is not proprietary at all; it was and has always been a feature of DX11 and onwards. ATi at the time just did not put as much effort into it as Nvidia did, but Nvidia released a card that was far less efficient for gaming; when the real compute work happened, though, Fermi would flog the VLIW 5000 series.
 
Ish. The thing about PTX is that it's a portable assembly language for NVidia's GPUs.

As you can see in the document: the raw assembly languages (aka SASS) for NVidia Kepler, Pascal, and Volta all differ dramatically. The reason that fascinates me so much is that the GPU world has built up what appears to be an impossibility: a high-performance, architecture-agnostic, recompiled set of code to... calculate the color of Gordon Freeman's eyebrows from multiple angles in real time. (Gordon Freeman is a character from the video game "Half-Life".)

But ignoring the silly video game graphics for a sec... this seems to have implications on how we could hypothetically build a general purpose CPU and/or computer model that achieves portable high-performance compute. And maybe, that future solution would leverage VLIW.
I honestly think it is a grave mistake to consider existing GPU approaches architecture agnostic... Especially when it comes to graphics.

Also, the paper you linked above is exactly proof of that, it's written to analytically discover the latencies and other characteristics of the architecture to optimise performance.
 
ATi at the time just did not put as much effort into it as Nvidia did, but Nvidia released a card that was far less efficient for gaming; when the real compute work happened, though, Fermi would flog the VLIW 5000 series.
That is lame. If Nvidia couldn't be first, nor do it more efficiently, how is it of any significance that their latecomer (which, let me remind you, failed for two iterations: the 480 and 560) can claim to match the features of 'the' first D3D11 tessellator card...
 
Correct.

BTW, Transmeta tried to make "regular" VLIW CPUs as well (for the mobile market) under the names Crusoe and Efficeon. They did however "translate" from x86 code running some form of virtualization on the things, If I understood correctly.
Nvidia also attempted a VLIW CPU with the Parker chip; turns out that while it was good in benchmarks, the moment it met spaghetti code of any kind it ground to a halt.
 
That is lame. If Nvidia couldn't be first, nor do it more efficiently, how is it of any significance that their latecomer (which, let me remind you, failed for two iterations: the 480 and 560) can claim to match the features of 'the' first D3D11 tessellator card...
ATi and AMD did not put the effort in, and it did not magically stop with the Fermi series. Radeon cards lagged behind in tessellation performance up until roughly Polaris, where they worked on it more closely and were basically attempting to make a 390X with great efficiency and improvements to tessellation and rendering.



The GTX 580, which was a refresh of the 480 with the full die unlocked, was right up the 7970's ass, and that was AMD's second-generation DX11 architecture (GCN).
The 680, whilst missing its compute chops, still beat out the compute-orientated 7970, whilst Nvidia were aiming for efficiency earlier on than AMD.


This gap would mostly remain the same up until they moved away from GCN and onto Navi.

 
I honestly think it is a grave mistake to consider existing GPU approaches architecture agnostic... Especially when it comes to graphics.

Also, the paper you linked above is exactly proof of that, it's written to analytically discover the latencies and other characteristics of the architecture to optimise performance.

I'm not saying you reach optimal levels of performance from high level code.

But what I'm saying is: OpenCL / DirectX shaders / CUDA have demonstrated the advances in compiler technology over the last 20 years. It's now possible to achieve higher levels of performance than ever before, even across diverse sets of hardware (AMD TeraScale, AMD GCN, AMD RDNA, NVidia Kepler, Pascal, and Volta).

The microarchitectural details always matter. But this advancement cannot be denied... especially as we look at the fundamental SASS assembly language of Kepler and see the details of dependencies / write barriers / etc. being added by the PTX compiler. These GPUs are exposing details of their pipelines to the ISA like never before, and yet GPU programmers are able to keep some semblance of portability when writing DirectX shaders.
 
ATi and AMD did not put the effort in, and it did not magically stop with the Fermi series. Radeon cards lagged behind in tessellation performance
How can you be behind when it was featured on AMD (ATi) cards even before DirectX 10? Something doesn't add up in your story; maybe somebody is sweeping something under the rug...
 
How can you be behind when it was featured on AMD (ATi) cards even before DirectX 10? Something doesn't add up in your story; maybe somebody is sweeping something under the rug...
You can be first to anything and smug enough to think you already have what it takes to compete with an aging architecture.
Then give way for a competitor to leave you in the dust due to complacency and arrogance.
Intel and AMD right now, AMD and Nvidia back then.
 