Editorial x86 Lacks Innovation, Arm is Catching up. Enough to Replace the Giant?

dragontamer5788 · Jul 27, 2020

Vya Domus said:
you can't change the hardware otherwise you need to recompile or reinterpret in some way the instructions at the silicon level which more or less negates the advantage of not having to add complex scheduling logic on chip.

Opcode 0x1 was "ReadLane" in GCN1.0/1.1.

But opcode 0x1 is "Floating-point 32-bit Add" in GCN 1.2 (aka: Polaris / 400-series / 500-series.)

This sort of change requires a recompile. A summary of the opcode changes between GCN 1.0 and 1.2 can be found here: https://clrx.nativeboinc.org/wiki2/wiki/wiki/GcnInstrsVop2

This dramatic change in opcodes requires a recompile. Surely the reason you suggest is mistaken. Both AMD and NVidia regularly change the hardware assembly language associated with their GPUs regularly, without much announcement either. Indeed, that's why NVidia has PTX, so that they can have an assembly-like language that retargets many different machines (be it Pascal, Volta, or Turing).

DrJ · Aug 1, 2020

I wrote a very long and complex answer, but found in order to make it coherent it had too much stuff along the lines of the article, so it's getting saved for another day, as the OP mighty feel "hey I said that" too many times. Here's a couple of short bits.

I think making a super-fast ARM will be an issue as it's a complicated business once you get to the very high performance area. However software issues are as much of a pain (i.e. making best use of the hardware someone has). Apple could be helped here by having less combinations.

The x86 instruction set is best thought of as a compressed RISC instruction set, so you get better use out of memory, memory bandwidth and caches. That's a plus. (Although ARM seem to sometimes add new instruction sets on Tuesdays... well -ish.)

ARM is a teeny company, it would be "interesting" if someone unexpected bought them out of petty cash and changed the business model completely (it's not like they make so much money it would be a big dent). BTW I considered using a PA-Risc CPU back in the day, remember them... unfortunate for some... (Apple bought them and killed the CPU line.) I believe Apple has an "Architecture" ARM license, which if open-ended (date-wise) would certainly help with any ARM issues.

It's been a long time since Intel last did anything extremely clever (Pentium 4 BTW). Can they make a big jump again? (Plus this time not be screwed by the process technology not hitting the promised 7-8GHz and taking too much power. Although strictly some parts were double clocked so going that fast.)

Final thought - how much better would it be not to spend a ton of effort on the CPU's internal architecture but to speed everything that's an actual roadblock up, especially the off-chip stuff? ( E.g. more memory and I/O interfaces.) Is having a ton of chip-PCB contacts and a few extra PCB layers that much of an issue at the high end of PCs these days? (Me and Seymour, separated at birth...)

*** Oh, my posts got combined, this was supposed to be a completely separate answer to one thing, sorry...

medi01 said:
I did.
It would apply if RISC CPUs were faster, but more expensive. They used to be faster. At some point they have become slower.
I am not buying "but that's because of R&D money" argument.

As for having savings in the server market, by selling desktop chips: heck, just have a look at AMD. The market is so huge, you can have decent R&D while having only tiny fraction of the market.

The whole "RISC beats CISC" was largely based on CISC being much harder to scale up by implementing multiple ops ahead, at once, since instruction set was so rich. But hey, as transistor counts went up, suddenly it was doable, on the other hand, RISCs could not go much further ahead in the execution queue, and, flop, no RISCs.

And, curiously, no EPIC took off either.

Note I edited the second part of this answer as I wrote it super-quickly and combined some stuff. Hopefully it's more correct now. The stuff above this line hasn't been changed.

x86 used to be CISC. Following some analysis (this is the very short version) of compilers it was found they used very few of the instructions (less true today BTW). So some groups tried to make CPUs just executing those instructions, but doing it very quickly (and often very messily, but that's another story). These out-performed the contemporary CISC CPUs (e.g. 386, 68030). This led to WindowsNT supporting several of them (so you could run full/standard WinNT on a DEC Alpha, for example).

Intel's (really excellent BTW) solution was the 486. This was a RISC-ish x86 (edit IMHO). It worked by reading the CISC instructions from memory and (edit) executing the simple ones in a single clock (I cocked this up in the first pass as I wrote this reply way too quickly and combined two generations, apologies). This boosted x86 processing speed up to the same territory as the RISC chips, who declined after that. Also WinNT became x86 only.

Aside - I was looking at designing a board using the Fairchild Clipper RISC chip. (IMHO the only one with a complete and sophisticated architecture - the patents made a lot of money for many years after the chips stopped production, as everyone used the technology.) This beat-up the previous Intel 386 chip very well, but the 486 came along and was a problem for it, so the project died (probably a good idea, O/S licencing for RISC was a nightmare back then). (The Clipper also suffered from Fairchild's process technology, with the caches in separate chips.)

Anyway all x86 for a long time have RISC cores (edit, but the next big change was somewhat later converting all instructions to micro-ops) and basically use the x86 instruction set as a compressed instruction set, so you can store more work in less bytes of memory, requiring less memory fetches and less cache space to store the same amount of functionality. The RISC chips would need more space all through the system due to using larger instruction (BTW ARM's later Thumb set of alternate instructions was intended to shrink the instruction size). As processor speed gets so far ahead of external memory speed (where is can take vast numbers of clocks to do a memory fetch) this is even more important. Of course the catch is you need some very clever instruction decoders for x86, but that has had vast amounts of work optimising it.

The original ARM CPU was mostly interesting as, following on from path of the Mostek 6502, Acorn designed a simple instruction set (with some annoyances) and used a minimal number of transistors in the design, when most others were using lots more. This kept the chip price well down, plus also allowed a low enough complexity for them to actually design a working CPU. (The 6502 was probably the biggest early home computing CPU, it was an 8-bit microprocessor designed with a lot less transistors than its competition, so was noticeably cheaper back when CPUs were really expensive - Acorn, the A in ARM, used the 6502 in their BBC and Electron computers.)

The big problem with the ARM CPUs was, IMHO, the instruction set wasn't great, so they've been adding assorted stuff ever since (e.g. Thumb, basically a complete different set of instructions).

The brilliant thing about ARM (over time) is they licensed it cheaply and at levels down to the gates, the synthesis files or the architecture (no-one else would give you that sort of stuff - well, Intel let AMD make exact x86 copies for a while back when the market was a lot smaller). People loved it. The down-side is they weren't making billions selling chips so are a much much smaller company, even now, than most people realise. (It is a bit of a risk that someone awkward could buy them.)

lexluthermiester · Aug 1, 2020

DrJ said:
Anyway all x86 since the 486 have RISC cores and basically use the x86 instruction set as a compressed instruction set

Citation?

londiste · Aug 1, 2020

I seem to remember this came later - in Pentium or Pentium Pro - but this is effectively true. x86 ISA is CISC but the implementation in microarchitecture is really RISC. Obviously there is a big subset of x86 instructions that are simple enough to be executed directly. But for the ones that are not that jump from one to another is mainly in instruction decoder (and maybe with help from scheduler) that separates a complex instruction into several simple micro-operations. These micro-operations that are simpler are then the actual parts that are executed.

CPUs themselves are not really RISC because the ISA you are using is x86 which is classically CISC. However, the stuff happening in execution units is not (always) x86 instructions but a different set of operations that is much closer to RISC. Much closer because there are clearly some operations in hardware that would not fit the classical RISC definitions for being too complex.

A hybrid, much like everything these days. At the same time, ARM has been extended considerably over the years and would likely not completely fit in the classical narrow RISC definition. For example think about any extended instruction sets like VFP or Neon.

Edit:
So, why keep x86? All the existing software is definitely one thing. x86 itself is another - it is a stable and mature ISA. You could think of x86 CPUs today as a sort of virtual machines - they take in x86 instructions and execute them however they want while outputting stuff you'd expect from x86. This is probably not feasible in the exact described way by interpreting or translating instructions (Transmeta comes to mind from while ago and ARM x86 layers from recent times) because of the huge performance hit but when that change is embedded in the microarchitecture itself, if there even is any speed penalty it does not outweigh the mature ISA and ready-made software.

By the way, Nvidia is doing something similar in GPU space. All the close-to-metal stuff like drivers use PTX ISA that purportedly is the one GPUs execute. That... just is not the case. PTX is a middle layer between hardware and anything else and PTX is translated into whatever GPU actually does. I bet that is exactly what is behind their relatively stable drivers as well.

TheoneandonlyMrK · Aug 1, 2020

lexluthermiester said:
Citation?

What did you asume a micro Op was?.

A good video to watch shedding some light from the architect point would be Lex Fridman interviewing the man Jim Keller.
Keller is clear of speech ,direct and intellectually stimulating and has an engineer's way with words, he makes it quite clear what the realities of it are.

dragontamer5788 · Aug 1, 2020

theoneandonlymrk said:
What did you asume a micro Op was?.

x86 has a micro-op called "AES Encode" and another for "SHA256".

"Microcode" is far more complex than RISC or CISC. I dare say the RISC vs CISC issue is dead, neither ARM, x86, nor RISC-V (ironically) follow RISC or CISC concepts anymore. ARM has highly specialized instructions like AES-encode. Heck, all architectures do. All instruction sets have single-uops that cover very complicated SIMD concepts.

From x86's perspective: the only thing that "microcode" does these days is convert x86's register-memory architecture into a load/store architecture. PEXT / PDEP on Intel are single microops (executing in one clock tick). I guess division and vgather remain as a sequence of microcode... but the vast majority of x86's instruction set is implemented in one uop.

Case in point: the entire set of x86 "LEA" instructions is implemented as a singular micro-op (I dare you to tell me that LEA is RISC). Furthermore, some of x86's instruction pairs are converted into one uop. (Ex: cmp / jz pairs are converted into one uop). Mind you, ARM follows in x86's footsteps here. (AESE / AESMC are fused together in most ARM cores today: two instructions become one uop for faster performance)

----------

Load/Store is the winner, which is a very small piece of the RISC vs CISC debate. The CISC concept of creating new instructions whenever you want to speed up applications (PDEP, PEXT on x86. ARM's AESE, AESD. SIMD-instructions. AVX512. Etc. etc.) is very much alive today. Modern cores have take the best bits of RISC (which as far as I can tell, is just the load/store architecture), as well as many concepts from CISC.

Case in point: FJCVTZS leads to far faster Javascript code on ARM systems. (Floating point Javascript convert to Signed Fixed Point). Yeah, an instruction invented to literally make Javascript code faster. And lets not forget about ARM/Jazelle either (even though no one uses Jazelle anymore, it was highly popular in the 00s on ARM devices. Specific instructions that implement Java's bytecode machine, speeding up Java code on ARM)

DrJ · Aug 1, 2020

Note edited the first para for clarity.

This is weird, as I didn't think the 486 being a RISC core executing decoded CISC instructions (edit - note this only applies to the simple instructions, not the whole instruction set) was news to anyone...
The performance step from 386 to 486 was very large due to this, 2 clock instructions down to 1.

Pentium Pro was the next big step as it was super-scalar (well, to a modest degree, plus that means whoever owned the Clipper patents then made a few more bob) and allowed out-of-order instruction execution, with in-order retirement. Also registers ceased to be physical. In the P54C (its predecessor, ran at 100MHz) the EAX register was a particular bunch of flip-flops. In the P6 it could be any one of a pool of registers at a particular time and somewhere else a fraction of a second later.

I was amazed they could make that level of complexity work. Although at only 60-66MHz. Also quite pleased that my P54C board was usually faster than my colleagues P6 board. (Partly as you needed the compilers optimised for the P6, which they weren't.)

bug · Aug 1, 2020

DrJ said:
This is weird, as I didn't think the 486 being a RISC core executing decoded CISC instructions was news to anyone... I have some rev A0 ones knocking around somewhere, maybe I should get paperweights made up saying "Yes, I'm RISC"... (A0 was the first silicon, will boot DOS provided you aren't fussed about having all the columns in a directory listing correct, plus has some notable heating issues so need a lot of air blown their way...)
The performance step from 386 to 486 was very large due to this, 3 clock instructions down to 1.

Pentium Pro was the next big step as it was super-scalar (well, to a modest degree, plus that means whoever owned the Clipper patents then made a few more bob) and allowed out-of-order instruction execution, with in-order retirement. Also registers ceased to be physical. In the P54C (its predecessor, ran at 100MHz) the EAX register was a particular bunch of flip-flops. In the P6 it could be any one of a pool of registers at a particular time and somewhere else a fraction of a second later.

I was amazed they could make that level of complexity work. Although at only 60-66MHz. Also quite pleased that my P54C board was usually faster than my colleagues P6 board. (Partly as you needed the compilers optimised for the P6, which they weren't.)

Oh and hey, this is a P54C board I did back in the day...

ISA SINGLE BOARD COMPUTER SBC EMBEDDED SYSTEM PC MICROBUS MAT-818 fd3f2 | eBay

Find many great new & used options and get the best deals for ISA SINGLE BOARD COMPUTER SBC EMBEDDED SYSTEM PC MICROBUS MAT-818 fd3f2 at the best online prices at eBay! Free delivery for many products.

www.ebay.co.uk

486? Wasn't Pentium the first that broke down instructions to feed micro-ops to the pipeline?

DrJ · Aug 1, 2020

bug said:
486? Wasn't Pentium the first that broke down instructions to feed micro-ops to the pipeline?

This was a scary long time ago and I'm under caffeinated. The 486 did execute many simple x86 instructions in a single clock, so RISC-speeds and that bit of the processor being like a RISC CPU for those instructions. But I don't recall it broke down the complicated ones in the way the later CPUs did, just executed them differently, so sorry for that bit (it's too hot). So take your pick. I see it as containing a RISC CPU as it ran the simple instructions in a clock. It didn't (as far as I recall) convert all instructions to micro-ops and that was me shrinking the history and combining what should have been two bits of the answer, as I wrote it super quickly, so many apologies.

dragontamer5788 · Aug 1, 2020

londiste said:
By the way, Nvidia is doing something similar in GPU space. All the close-to-metal stuff like drivers use PTX ISA that purportedly is the one GPUs execute. That... just is not the case. PTX is a middle layer between hardware and anything else and PTX is translated into whatever GPU actually does. I bet that is exactly what is behind their relatively stable drivers as well.

PTX is closer to Java Bytecode. PTX is recompiled into a lower-level assembly language by the drivers.

https://arxiv.org/pdf/1804.06826.pdf

This article gives you an idea of what Volta's actual assembly language is like. Its pretty different from Pascal. Since NVidia changes the assembly language very few generations, its better to have compilers target PTX, and then have PTX recompile to the specific assembly language of a GPU.

ARF · Aug 1, 2020

yeeeeman said:
What could they say.
Arm always wanted to get into High Performance computing, whereas x86 manufacturers always wanted to get into ultra low power devices. They never quite made it, because they develop optimal tools for completely different scenarios.

x86 lacks innovation towards low power consumption. That is a fundamental flaw in CISC and you know it.
It has never been addressed. Actually, quite the opposite - Intel pursues ultra power with the hungry 10900KS, 9900KS, etc, with 200-watt and higher power consumption.

Today, you can use a smartphone with negligible power consumption and running on its battery for days, instead of a power-hungry office PC.

lexluthermiester · Aug 1, 2020

DrJ said:
This is weird, as I didn't think the 486 being a RISC core executing decoded CISC instructions (edit - note this only applies to the simple instructions, not the whole instruction set) was news to anyone...

That is because you are mistaken. RISC integration did not being until the Pentium generation and then was very limited in implementation(still is).

DrJ · Aug 1, 2020

lexluthermiester said:
That is because you are mistaken. RISC integration did not being until the Pentium generation and then was very limited in implementation(still is).

The 486 included a single-cycle execution unit for the simpler x86 instructions. I see that as a RISC part of the CPU, perhaps you don't... It certainly gave it the performance to kill off the WinNT RISC Desktop computers - Alpha, MIPS, PowerPC, Itanium and ARM; of which Alpha was the only platform to have a half-decent lead on the 486, but subsequently (some time post WInNT support being removed) got killed by the architects fearing Itanium, which didn't do any of the stuff they feared... ho hum...

bug · Aug 2, 2020

DrJ said:
The 486 included a single-cycle execution unit for the simpler x86 instructions. I see that as a RISC part of the CPU, perhaps you don't... It certainly gave it the performance to kill off the WinNT RISC Desktop computers - Alpha, MIPS, PowerPC, Itanium and ARM; of which Alpha was the only platform to have a half-decent lead on the 486, but subsequently (some time post WInNT support being removed) got killed by the architects fearing Itanium, which didn't do any of the stuff they feared... ho hum...

Maybe that wasn't a RISC part of the CPU, but rather just CISC instructions that happened to only need one execution cycle?
The similarity is certainly there, but the birth of CISC emulating RISC is Pentium and the moment is started breaking down CISC instructions. At least that's how I learned it back in the day and I think that's how it went down in history.

Anyway, I think we all got the same picture by now, we're arguing mostly about terms now

lexluthermiester · Aug 2, 2020

DrJ said:
I see that as a RISC part of the CPU, perhaps you don't...

That's because it's not RISC. Making an execution unit more efficient is NOT the same as making it RISC. Your understanding of the RISC architechure and the differences between RISC vs CISC instruction sets needs further study.

DrJ · Aug 2, 2020

I think we've been back and forth enough for people to get bored. Although I wasn't suggesting it was RISC, just it had a RISC part, where simple instructions were executed very quickly.

I am paying a big price for writing an answer to someone at super speed after a more considered post. But I have my view. (Deleted stuff about my history in computing that was to have followed, as I try to stay away from that sort of stuff. I'd rather make my case than say "but I...". Plus however much experience you have that doesn't stop you saying something iffy, or taking a view on something that is heavily debated rather than a known answer.)

mtcn77 · Aug 2, 2020

You know, this could amount to something since Nvidia dropped all immediate memory targetting. Their instruction targets are similarly local.
Nvidia could have an arm based gpu, or vice versa anytime they wished.

mtcn77 · Mar 29, 2021

Time to necro-post this topic, light up the signs!

Processor	R5 5600X
Motherboard	ASUS ROG STRIX B550-I GAMING
Cooling	Alpenföhn Black Ridge
Memory	2*16GB DDR4-2666 VLP @3800
Video Card(s)	EVGA Geforce RTX 3080 XC3
Storage	1TB Samsung 970 Pro, 2TB Intel 660p
Display(s)	ASUS PG279Q, Eizo EV2736W
Case	Dan Cases A4-SFX
Power Supply	Corsair SF600
Mouse	Corsair Ironclaw Wireless RGB
Keyboard	Corsair K60
VR HMD	HTC Vive

System Name	RyzenGtEvo/ Asus strix scar II
Processor	Amd R5 5900X/ Intel 8750H
Motherboard	Crosshair hero8 impact/Asus
Cooling	360EK extreme rad+ 360$EK slim all push, cpu ek suprim Gpu full cover all EK
Memory	Corsair Vengeance Rgb pro 3600cas14 16Gb in four sticks./16Gb/16GB
Video Card(s)	Powercolour RX7900XT Reference/Rtx 2060
Storage	Silicon power 2TB nvme/8Tb external/1Tb samsung Evo nvme 2Tb sata ssd/1Tb nvme
Display(s)	Samsung UAE28"850R 4k freesync.dell shiter
Case	Lianli 011 dynamic/strix scar2
Audio Device(s)	Xfi creative 7.1 on board ,Yamaha dts av setup, corsair void pro headset
Power Supply	corsair 1200Hxi/Asus stock
Mouse	Roccat Kova/ Logitech G wireless
Keyboard	Roccat Aimo 120
VR HMD	Oculus rift
Software	Win 10 Pro
Benchmark Scores	8726 vega 3dmark timespy/ laptop Timespy 6506

Processor	Intel i5-12600k
Motherboard	Asus H670 TUF
Cooling	Arctic Freezer 34
Memory	2x16GB DDR4 3600 G.Skill Ripjaws V
Video Card(s)	EVGA GTX 1060 SC
Storage	500GB Samsung 970 EVO, 500GB Samsung 850 EVO, 1TB Crucial MX300 and 2TB Crucial MX500
Display(s)	Dell U3219Q + HP ZR24w
Case	Raijintek Thetis
Audio Device(s)	Audioquest Dragonfly Red :D
Power Supply	Seasonic 620W M12
Mouse	Logitech G502 Proteus Core
Keyboard	G.Skill KM780R
Software	Arch Linux + Win10

Processor	Intel i5-12600k
Motherboard	Asus H670 TUF
Cooling	Arctic Freezer 34
Memory	2x16GB DDR4 3600 G.Skill Ripjaws V
Video Card(s)	EVGA GTX 1060 SC
Storage	500GB Samsung 970 EVO, 500GB Samsung 850 EVO, 1TB Crucial MX300 and 2TB Crucial MX500
Display(s)	Dell U3219Q + HP ZR24w
Case	Raijintek Thetis
Audio Device(s)	Audioquest Dragonfly Red :D
Power Supply	Seasonic 620W M12
Mouse	Logitech G502 Proteus Core
Keyboard	G.Skill KM780R
Software	Arch Linux + Win10

Editorial x86 Lacks Innovation, Arm is Catching up. Enough to Replace the Giant?

dragontamer5788

DrJ

New Member

lexluthermiester

londiste

TheoneandonlyMrK

dragontamer5788

DrJ

New Member

bug

ISA SINGLE BOARD COMPUTER SBC EMBEDDED SYSTEM PC MICROBUS MAT-818 fd3f2 | eBay

DrJ

New Member

dragontamer5788

ARF

lexluthermiester

DrJ

New Member

bug

lexluthermiester

DrJ

New Member

mtcn77

mtcn77