
Trinity (Piledriver) Integer/FP Performance Higher Than Bulldozer, Clock-for-Clock

Joined
Feb 13, 2012
Messages
522 (0.12/day)
Except you can't process a regular application through a pipeline like a GPU's, because GPU data is all the same, whereas a computer program has multiple different instructions per clock cycle. A GPU is given a large set of data and told to do a single task to all of it, so it does it the same way. A CPU runs instruction after instruction; there isn't a whole lot of that which maps onto what a GPU can do.

A shader is small because it has a limited number of instructions it can perform, no control mechanism, and no write-back. There is no concept of threads in a GPU; it takes an array of one or more sets of data and performs the same operation on the entire set. A shader is also SIMD, not MIMD as you're describing.

Where a CPU can carry out instructions like "move 10 bytes from memory location A to memory location B," a GPU does something more like "multiply every item in the array by 1.43."




If it is so simple, why hasn't anyone else figured it out? I'm still convinced that you don't quite know what you're talking about.



I do have a bachelor's degree in computer science, not to mention I'm employed as a systems admin and a developer.


True, unless you can get the CPU to sort things out and let the GPU do what it's best at, meaning the CPU can fetch and decode, and then execution is assigned to whichever is more efficient, the CPU or the GPU.
I think that's the approach AMD is taking with APUs in the future (HSA).
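
To make the quoted "move 10 bytes" vs "multiply every item" contrast concrete, here is a minimal C sketch; everything in it is illustrative, and a real GPU would apply the second loop's body to the whole buffer in parallel rather than iterating:

```c
#include <stdio.h>
#include <string.h>

#define N 4096

int main(void)
{
    /* "Move 10 bytes from memory location A to memory location B":
     * a small, specific, sequential operation -- typical CPU work. */
    char a[16] = "0123456789ABCDE";
    char b[16] = {0};
    memcpy(b, a, 10);

    /* "Multiply every item in the array by 1.43": the same operation
     * applied uniformly to a whole buffer -- the shape of GPU work.
     * A GPU would run this loop body over every element at once. */
    float data[N];
    for (int i = 0; i < N; i++) data[i] = (float)i;
    for (int i = 0; i < N; i++) data[i] *= 1.43f;

    printf("b = %s, data[100] = %f\n", b, data[100]);
    return 0;
}
```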
 
Joined
Mar 10, 2010
Messages
11,878 (2.32/day)
Location
Manchester uk
System Name RyzenGtEvo/ Asus strix scar II
Processor Amd R5 5900X/ Intel 8750H
Motherboard Crosshair hero8 impact/Asus
Cooling 360EK extreme rad+ 360$EK slim all push, cpu ek suprim Gpu full cover all EK
Memory Corsair Vengeance Rgb pro 3600cas14 16Gb in four sticks./16Gb/16GB
Video Card(s) Powercolour RX7900XT Reference/Rtx 2060
Storage Silicon power 2TB nvme/8Tb external/1Tb samsung Evo nvme 2Tb sata ssd/1Tb nvme
Display(s) Samsung UAE28"850R 4k freesync.dell shiter
Case Lianli 011 dynamic/strix scar2
Audio Device(s) Xfi creative 7.1 on board ,Yamaha dts av setup, corsair void pro headset
Power Supply corsair 1200Hxi/Asus stock
Mouse Roccat Kova/ Logitech G wireless
Keyboard Roccat Aimo 120
VR HMD Oculus rift
Software Win 10 Pro
Benchmark Scores 8726 vega 3dmark timespy/ laptop Timespy 6506
Originally Posted by Steevo
A single thread on a CPU might run the four, but if we have a hardware scheduler that reads ahead and prefetches data (branching), performs the decode at twice the rate, and programs shaders to do the work, and they then execute it and store it in the contiguous memory pool, what difference does it make whether the CPU transistors do it or, when the same instruction is run 5,000 times in the program, the GPU transistors do it?

Except you can't process a regular application through a pipeline like a GPU's, because GPU data is all the same, whereas a computer program has multiple different instructions per clock cycle. A GPU is given a large set of data and told to do a single task to all of it, so it does it the same way. A CPU runs instruction after instruction; there isn't a whole lot of that which maps onto what a GPU can do.

A shader is small because it has a limited number of instructions it can perform, no control mechanism, and no write-back. There is no concept of threads in a GPU; it takes an array of one or more sets of data and performs the same operation on the entire set. A shader is also SIMD, not MIMD as you're describing.

Where a CPU can carry out instructions like "move 10 bytes from memory location A to memory location B," a GPU does something more like "multiply every item in the array by 1.43."

Modern GPUs are MIMD; the 7970 and Fermi/Kepler (the 7970 and the 680) have write-back support in hardware. As has been said, work is being put in to make GPUs more useful computationally; the hard bit, and the thing they are doing well at, is maintaining graphics performance at the same time.

A stacked APU with four Steamroller modules of two cores (eight), plus 8 gigs of on-die memory (dreaming :)) and dual GCN GPU units, would be a truly killer app. :)
 
Joined
Nov 4, 2005
Messages
11,642 (1.74/day)
System Name Compy 386
Processor 7800X3D
Motherboard Asus
Cooling Air for now.....
Memory 64 GB DDR5 6400Mhz
Video Card(s) 7900XTX 310 Merc
Storage Samsung 990 2TB, 2 SP 2TB SSDs and over 10TB spinning
Display(s) 56" Samsung 4K HDR
Audio Device(s) ATI HDMI
Mouse Logitech MX518
Keyboard Razer
Software A lot.
Benchmark Scores Its fast. Enough.
True, unless you can get the CPU to sort things out and let the GPU do what it's best at, meaning the CPU can fetch and decode, and then execution is assigned to whichever is more efficient, the CPU or the GPU.
I think that's the approach AMD is taking with APUs in the future (HSA).

That is exactly what I was talking about: fetch and decode, plus a hardware scheduler to push work to the GPU cores or the CPU cores based on the type of work and the load. Even if the GPU is twice as slow at one instruction, if the CPU cores are busy and it's not....


And if the GPU and CPU can read and write to a shared cache, the GPU could execute instructions A, C, F, H, I, J, and M while the CPU runs B (which is dependent on A) and D & E (which are then stored for F to run two iterations of); the checked results are then stored back for the CPU to run G... and so on and so forth.
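
A toy sketch of that idea in C, assuming a shared result store; the task names, dependencies, and CPU/GPU assignments are invented for illustration and nothing here reflects how real scheduler hardware would work:

```c
#include <stdio.h>

enum unit { CPU, GPU };

struct task {
    char        name;
    enum unit   unit;   /* where the scheduler would prefer to run it */
    const char *deps;   /* names of tasks whose results it needs      */
    int         done;
};

int main(void)
{
    /* A hypothetical dependency graph in the spirit of the post above. */
    struct task tasks[] = {
        {'A', GPU, "",   0}, {'B', CPU, "A",  0}, {'C', GPU, "",   0},
        {'D', CPU, "",   0}, {'E', CPU, "",   0}, {'F', GPU, "DE", 0},
        {'G', CPU, "F",  0},
    };
    int n = sizeof tasks / sizeof tasks[0], remaining = n;

    while (remaining > 0) {
        int ready[16] = {0};
        /* pass 1: find tasks whose dependencies are already complete */
        for (int i = 0; i < n; i++) {
            if (tasks[i].done) continue;
            ready[i] = 1;
            for (const char *d = tasks[i].deps; *d; d++)
                for (int j = 0; j < n; j++)
                    if (tasks[j].name == *d && !tasks[j].done) ready[i] = 0;
        }
        /* pass 2: "issue" the ready tasks to their preferred unit */
        printf("cycle:");
        for (int i = 0; i < n; i++) {
            if (!ready[i]) continue;
            printf(" %c->%s", tasks[i].name,
                   tasks[i].unit == GPU ? "GPU" : "CPU");
            tasks[i].done = 1;
            remaining--;
        }
        printf("\n");
    }
    return 0;
}
```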
 
Joined
Feb 13, 2012
Messages
522 (0.12/day)
That is exactly what I was talking about: fetch and decode, plus a hardware scheduler to push work to the GPU cores or the CPU cores based on the type of work and the load. Even if the GPU is twice as slow at one instruction, if the CPU cores are busy and it's not....


And if the GPU and CPU can read and write to a shared cache, the GPU could execute instructions A, C, F, H, I, J, and M while the CPU runs B (which is dependent on A) and D & E (which are then stored for F to run two iterations of); the checked results are then stored back for the CPU to run G... and so on and so forth.

Yeah, and you are totally right; even Intel is sort of taking this approach in Haswell, as far as I know. However, instead of doing it through GPU cores, they are working on some sort of scheduler to make use of all the cores on the chip in hardware, instead of relying on software being optimized for multithreading.
So in essence it seems like both AMD and Intel are taking a new approach to fetching/decoding, letting the hardware decide the execution part so it is as efficient as possible. That is by far the future of computing: trying to take advantage of every piece of silicon on the die. Even now I can't think of one piece of software that will use even a Core 2 Quad or a Phenom X4 at 100% on all 4 cores, and making cores fatter and fatter can only get so far; even Intel seems to be slowing down on that approach now that they realize fabrication process shrinking is getting to its limits and is only getting more difficult.
 

Aquinus

Resident Wat-man
Joined
Jan 28, 2012
Messages
13,147 (2.97/day)
Location
Concord, NH, USA
System Name Apollo
Processor Intel Core i9 9880H
Motherboard Some proprietary Apple thing.
Memory 64GB DDR4-2667
Video Card(s) AMD Radeon Pro 5600M, 8GB HBM2
Storage 1TB Apple NVMe, 4TB External
Display(s) Laptop @ 3072x1920 + 2x LG 5k Ultrafine TB3 displays
Case MacBook Pro (16", 2019)
Audio Device(s) AirPods Pro, Sennheiser HD 380s w/ FIIO Alpen 2, or Logitech 2.1 Speakers
Power Supply 96w Power Adapter
Mouse Logitech MX Master 3
Keyboard Logitech G915, GL Clicky
Software MacOS 12.1
That is exactly what I was talking about: fetch and decode, plus a hardware scheduler to push work to the GPU cores or the CPU cores based on the type of work and the load. Even if the GPU is twice as slow at one instruction, if the CPU cores are busy and it's not....


And if the GPU and CPU can read and write to a shared cache, the GPU could execute instructions A, C, F, H, I, J, and M while the CPU runs B (which is dependent on A) and D & E (which are then stored for F to run two iterations of); the checked results are then stored back for the CPU to run G... and so on and so forth.

Take a computer architecture course and you will understand why this is not feasible, not just with the hardware but with x86 as well. First of all, your logic fails when you consider the application. Your GPU does not execute x86 instructions; it is informed what to do, it does something, and it gives the result back. The video card doesn't just do something; it does the same thing to everything in the buffer provided to it.

What if instruction B depends on data from instruction A, or F depends on data from instruction E? Between the PCI-E latency and the time it takes for the GPU to execute the instruction, store it, and then send it back over PCI-E, you have just hammered your performance ten-fold. GPUs aren't designed for procedural code; they're designed to process large amounts of data in a similar fashion, and I think you're confusing what a GPU can actually do. GPUs process many kilobytes to many megabytes of data per instruction, not just two operands.
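
A rough back-of-the-envelope model of that latency argument in C; every cost figure here is invented purely for illustration, and real numbers vary wildly by hardware:

```c
#include <stdio.h>

int main(void)
{
    double bus_latency_us = 5.0;    /* assumed round-trip dispatch cost    */
    double copy_us_per_kb = 0.25;   /* assumed transfer cost over the bus  */
    double cpu_us_per_kb  = 1.0;    /* assumed CPU processing cost         */
    double gpu_us_per_kb  = 0.05;   /* assumed GPU processing cost         */

    /* Offloading only pays off once the batch is big enough to amortize
     * the fixed latency and the copies in both directions. */
    for (double kb = 1; kb <= 1 << 20; kb *= 8) {
        double cpu = cpu_us_per_kb * kb;
        double gpu = 2 * bus_latency_us + 2 * copy_us_per_kb * kb
                   + gpu_us_per_kb * kb;
        printf("%10.0f KB: CPU %12.1f us, offload %12.1f us -> %s wins\n",
               kb, cpu, gpu, cpu < gpu ? "CPU" : "GPU");
    }
    return 0;
}
```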

Learn what you're talking about before you start saying that something can be done when people who do this for a living, and have had 8+ years of schooling to do this stuff, haven't done it. Honestly, what you're describing isn't feasible, and I think I pointed this out before.
 
Joined
Feb 13, 2012
Messages
522 (0.12/day)
Take a computer architecture course and you will understand why this is not feasible, not just with the hardware but with x86 as well. First of all, your logic fails when you consider the application. Your GPU does not execute x86 instructions; it is informed what to do, it does something, and it gives the result back. The video card doesn't just do something; it does the same thing to everything in the buffer provided to it.

What if instruction B depends on data from instruction A, or F depends on data from instruction E? Between the PCI-E latency and the time it takes for the GPU to execute the instruction, store it, and then send it back over PCI-E, you have just hammered your performance ten-fold. GPUs aren't designed for procedural code; they're designed to process large amounts of data in a similar fashion, and I think you're confusing what a GPU can actually do. GPUs process many kilobytes to many megabytes of data per instruction, not just two operands.

Learn what you're talking about before you start saying that something can be done when people who do this for a living, and have had 8+ years of schooling to do this stuff, haven't done it. Honestly, what you're describing isn't feasible, and I think I pointed this out before.

GPUs being unable to process x86 is years ago, when GPUs only did graphical tasks; with today's architectures GPUs are very much capable of GPGPU (general-purpose computing). AMD designed GCN with compute in mind, and so did NVIDIA with Fermi.
However, you have a point, and these are pretty much the challenges that both Intel and AMD are going through, but it is possible.
Also, you need to remember that we are not talking about using a GPU and forgetting the CPU; we are talking about both working together, meaning certain kinds of instructions like the one you stated would most likely be crunched on the CPU, while things like floating-point operations would be done by the GPU.
Also note that there will be no more PCI Express linking the GPU and CPU when both are integrated together.
It seems like you are talking about a discrete GPU and a CPU doing the method that Steevo mentioned, but that is NOT the case; we are talking about the GPU and CPU integrated at the architecture level.
 

Aquinus

Resident Wat-man
Joined
Jan 28, 2012
Messages
13,147 (2.97/day)
GPUs being unable to process x86 is years ago, when GPUs only did graphical tasks; with today's architectures GPUs are very much capable of GPGPU (general-purpose computing). AMD designed GCN with compute in mind, and so did NVIDIA with Fermi.
However, you have a point, and these are pretty much the challenges that both Intel and AMD are going through, but it is possible.
Also, you need to remember that we are not talking about using a GPU and forgetting the CPU; we are talking about both working together, meaning certain kinds of instructions like the one you stated would most likely be crunched on the CPU, while things like floating-point operations would be done by the GPU.
Also note that there will be no more PCI Express linking the GPU and CPU when both are integrated together.
It seems like you are talking about a discrete GPU and a CPU doing the method that Steevo mentioned, but that is NOT the case; we are talking about the GPU and CPU integrated at the architecture level.

Okay, so the iGPU is linked to the IMC/north bridge in a Llano chip. The issue is that the CPU still only sees the iGPU as a GPU. The CPU has no instructions for telling the GPU what to do; everything still goes through the drivers at the software level.

I'm not saying that this is the way things are moving, but with how GPUs and CPUs exist now, there isn't enough coherency between the two architectures to really be able to do as you describe. The CPU still has to dispatch GPU work at the software level, because there is no instruction that says "use the GPU cores to calculate blar."

Also keep in mind that for a single-threaded, non-optimized application, you have to wait for the last operation to complete before being able to execute the next one. This isn't strictly true for modern superscalar processors; however, if the next instruction requires the output of the last, then you have to wait. You can't just execute all the instructions at once. It's not a matter of whether it can be done; it's a matter of how practical it would be. Right now, with how GPUs are designed, providing large sets of data to be calculated at once is the way to go.
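
In plain C the difference looks something like this (illustrative only): the first loop is inherently serial because each iteration needs the previous result, while the second is the kind of independent, element-wise work a GPU is built for:

```c
#include <stdio.h>

#define N 1024

int main(void)
{
    float serial[N], parallel[N];

    /* Loop-carried dependency: step i cannot start until step i-1 is done,
     * so this chain cannot simply be spread across thousands of shaders. */
    serial[0] = 1.0f;
    for (int i = 1; i < N; i++)
        serial[i] = serial[i - 1] * 1.001f + 0.5f;

    /* Independent elements: no ordering requirement, so every iteration
     * could in principle run at the same time. */
    for (int i = 0; i < N; i++)
        parallel[i] = (float)i * 1.43f;

    printf("%f %f\n", serial[N - 1], parallel[N - 1]);
    return 0;
}
```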

What you're describing is some kind of computational "core" that has the ability to do GPU-like calculations on a CPU core. The closest thing to that is Bulldozer (and most likely its successors, which will be more like what you're describing), where a Bulldozer "module" contains a number of components for executing threads concurrently. A CPU could have an instruction that computes a Fast Fourier Transform on a matrix (an array of data), where the CPU would have control over that many ALUs to do the task at once; on the other hand, if multiple different ALU operations are being performed at once, only that number of resources is used.

This is what AMD would like to do and BD was step 1.

The issues are:
No software supports it.
No compiler supports it.
No OS supports it.

Could it be fast: Absolutely.
Could it be feasible in the near future: I don't think so.

I think there will be a middle ground between a CPU core and a GPU core when all is said and done; we just haven't got there yet, and I think it will be some time before we do. I'm just saying that what you're describing, between a distinct CPU and a distinct GPU core, isn't feasible with how CPUs with an iGPU work, and even GCN doesn't overcome these issues.
 
Joined
Nov 4, 2005
Messages
11,642 (1.74/day)
CUDA/GCN is a prime example of what we CAN do, and yes, Windows DX11 supports both when run under the drivers, or under OpenCL by itself.

The idea of making a GPU with x86/x64 compute has been realized; now it is just the memory addressing between two independent processors (CPU & GPU) that makes it hard to do, but with both on one die and one hardware controller that controls dispatches for both......
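
A minimal C sketch of why the shared addressing matters, with both "devices" faked as ordinary functions; the staging copy stands in for a discrete card's separate memory, and all names are invented for illustration:

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define N 1024

static void device_kernel(float *buf, int n)      /* stands in for GPU work */
{
    for (int i = 0; i < n; i++) buf[i] *= 1.43f;
}

static void run_discrete(float *host, int n)
{
    /* Separate address space: copy in, compute, copy the result back. */
    float *staging = malloc(n * sizeof *staging);
    memcpy(staging, host, n * sizeof *staging);
    device_kernel(staging, n);
    memcpy(host, staging, n * sizeof *staging);
    free(staging);
}

static void run_unified(float *shared, int n)
{
    /* Shared address space: the "device" works on the caller's memory. */
    device_kernel(shared, n);
}

int main(void)
{
    float a[N], b[N];
    for (int i = 0; i < N; i++) a[i] = b[i] = (float)i;
    run_discrete(a, N);
    run_unified(b, N);
    printf("%f %f\n", a[N - 1], b[N - 1]);
    return 0;
}
```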
 

Aquinus

Resident Wat-man
Joined
Jan 28, 2012
Messages
13,147 (2.97/day)
one hardware controller that controls dispatches for both......

Memory is becoming generalized, correct. What the instructions are, how they are dispatched, and how long they take to run are not. That is what you're describing, and it cannot be done with current iGPU solutions, at least not well enough to be worth it. The iGPU might be on the same die, but computationally speaking, there is nothing that couples the CPU cores to the GPU. Also, OpenCL requires drivers that support it; you can't just use OpenCL without telling the computer how to do it, which is what the drivers do. Basically (using Llano as an example), the iGPU had the PCI-E/memory bus replaced by the CPU's north bridge, and that is it. Like I said before, GPUs don't work the same way as a CPU, and they're not close enough to do what you're describing.

Also, once again, as I said, if it really were as easy as you describe it, someone who has been working on this for years with a doctoral degree in either Comp Sci or EE would have figured it out before you, and since it practically doesn't exist, I'm very certain that what you're describing is purely theoretical, without any real-world background to back up such claims.
 
Joined
Nov 4, 2005
Messages
11,642 (1.74/day)
What part of "a new hardware controller to perform those functions" do you NOT understand?


We have the technology and we have the capability; it will be hard and require a few spins to get right, but even a 25% IPC improvement is well worth it.


So: hardware that sits on the die between the RAM and the L3 and can read and write to either, that also allows DMA for both the CPU and the GPU while maintaining the TLB, reading the branching, decoding instructions, and issuing them to the proper channel, or even to the next available core (think hardware-level multithreading of single-threaded apps).......



Sounds complex until you realize that half the issues were dealt with when we moved away from PIO and to DMA.


The primary issue of data could be overcome by the controller issuing thread A to core 0 and assigning thread B to GPU subunit 1

Operation X waits on the results of both; the next instruction Y is fetched, and core 1 is assigned the two corresponding memory addresses that will hold the results of thread A and thread B, programmed to perform a multiply of the two results and save to a new memory address.

This clock cycle is over, and now core 1 is able to perform the work while the hardware dispatcher marks all the locations involved in the previous operation dirty and starts issuing the next set of commands, including wiping the previously used locations.
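
A toy fork-join version of that flow, using two POSIX threads as stand-ins for core 0 and the GPU subunit; it is only a sketch of the dataflow, not of any real hardware (build with -pthread):

```c
#include <stdio.h>
#include <pthread.h>

static double result_a, result_b;        /* the two shared result locations */

static void *thread_a(void *arg)         /* "thread A on core 0"            */
{
    (void)arg;
    double sum = 0.0;
    for (int i = 1; i <= 1000; i++) sum += 1.0 / i;
    result_a = sum;
    return NULL;
}

static void *thread_b(void *arg)         /* "thread B on GPU subunit 1"     */
{
    (void)arg;
    double prod = 1.0;
    for (int i = 0; i < 20; i++) prod *= 1.1;
    result_b = prod;
    return NULL;
}

int main(void)
{
    pthread_t a, b;
    pthread_create(&a, NULL, thread_a, NULL);
    pthread_create(&b, NULL, thread_b, NULL);
    pthread_join(a, NULL);               /* operation X waits on both...    */
    pthread_join(b, NULL);
    double x = result_a * result_b;      /* ...then the combine step runs   */
    printf("X = %f\n", x);
    return 0;
}
```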

It would optimally require the use of an x64 environment, resulting in a bit of over-allocation of cache, unless we knew that we wouldn't need or use any registers that large.

Anyway, the point is, we could do it, we are moving to it, and it is going to happen.
 

Aquinus

Resident Wat-man
Joined
Jan 28, 2012
Messages
13,147 (2.97/day)
The primary issue of data could be overcome by the controller issuing thread A to core 0 and assigning thread B to GPU subunit 1

The GPU has no concept of threads or sequential code. The CPU dispatches these things and the GPU does it. I'm not saying it doesn't work, I'm saying it doesn't work the way you're describing.
 
Joined
Feb 13, 2012
Messages
522 (0.12/day)
Memory is becoming generalized, correct. What the instructions are, how they are dispatched, and how long they take to run are not. That is what you're describing, and it cannot be done with current iGPU solutions, at least not well enough to be worth it. The iGPU might be on the same die, but computationally speaking, there is nothing that couples the CPU cores to the GPU. Also, OpenCL requires drivers that support it; you can't just use OpenCL without telling the computer how to do it, which is what the drivers do. Basically (using Llano as an example), the iGPU had the PCI-E/memory bus replaced by the CPU's north bridge, and that is it. Like I said before, GPUs don't work the same way as a CPU, and they're not close enough to do what you're describing.

Also, once again, as I said, if it really were as easy as you describe it, someone who has been working on this for years with a doctoral degree in either Comp Sci or EE would have figured it out before you, and since it practically doesn't exist, I'm very certain that what you're describing is purely theoretical, without any real-world background to back up such claims.

Yes, but with HSA there won't be a separate CPU and GPU any more; the line between CPU and GPU is blurring until they become one, mushed together to pretty much build a CPU with GPU capabilities, or vice versa. Integer cores and graphics cores will be integrated at the architecture level, and probably sharing the same L2/L3 cache for all we know. I'm no expert in the nitty-gritty details, but I've read about it a bit.
Also, we have to note that AMD is developing a compiler for this; that's the research we heard about, some university research getting a performance boost by software optimization or something, which turned out to be research toward making a new compiler.
 

Aquinus

Resident Wat-man
Joined
Jan 28, 2012
Messages
13,147 (2.97/day)
Yes, but with HSA there won't be a separate CPU and GPU any more; the line between CPU and GPU is blurring until they become one, mushed together to pretty much build a CPU with GPU capabilities, or vice versa. Integer cores and graphics cores will be integrated at the architecture level, and probably sharing the same L2/L3 cache for all we know. I'm no expert in the nitty-gritty details, but I've read about it a bit.
Also, we have to note that AMD is developing a compiler for this; that's the research we heard about, some university research getting a performance boost by software optimization or something, which turned out to be research toward making a new compiler.

But we're not there yet. :) You're seeing what happens to AMD when they try to change things, and you see what happens when Intel makes the same thing better. Also you will find that making a true HSA processor is no easy task. Hence why AMD started with Llano and Bulldozer.
 
Joined
Feb 13, 2012
Messages
522 (0.12/day)
But we're not there yet. :) You're seeing what happens to AMD when they try to change things, and you see what happens when Intel makes the same thing better. Also you will find that making a true HSA processor is no easy task. Hence why AMD started with Llano and Bulldozer.

Yes, true, but it's good they are taking it a step at a time, and I love how AMD is usually the brave one making the major change even though Intel has all the money. Notice that x86 computing today is pretty much AMD's making: it was AMD who pushed Intel to take the approach they are taking right now with high IPC. Also remember AMD64; it was really good, so Intel supported it. If HSA turns out well, it can also be a game changer for both teams.
 
Joined
Mar 24, 2011
Messages
2,356 (0.50/day)
Location
VT
Processor Intel i7-10700k
Motherboard Gigabyte Aurorus Ultra z490
Cooling Corsair H100i RGB
Memory 32GB (4x8GB) Corsair Vengeance DDR4-3200MHz
Video Card(s) MSI Gaming Trio X 3070 LHR
Display(s) ASUS MG278Q / AOC G2590FX
Case Corsair X4000 iCue
Audio Device(s) Onboard
Power Supply Corsair RM650x 650W Fully Modular
Software Windows 10
Yes, true, but it's good they are taking it a step at a time, and I love how AMD is usually the brave one making the major change even though Intel has all the money. Notice that x86 computing today is pretty much AMD's making: it was AMD who pushed Intel to take the approach they are taking right now with high IPC. Also remember AMD64; it was really good, so Intel supported it. If HSA turns out well, it can also be a game changer for both teams.

Intel was well on its way to a hard switch to 64-bit computing, which would have meant that, around the time the AMD Athlon 64s came out, everyone would make the jump to 64-bit. It was awesome that AMD got x86-64 working for the sake of backwards compatibility, but it's also the reason that to this day we're still stuck with 32-bit programs when most of us have 64-bit computers.

AMD definitely was on the right path with high IPC, but with Bulldozer they decided to go all NetBurst and start using lower IPC with substantially higher clocks. Worked well for Intel back then, right?
 
Joined
Feb 13, 2012
Messages
522 (0.12/day)
Intel was well on its way to a hard switch to 64-bit computing, which would have meant that, around the time the AMD Athlon 64s came out, everyone would make the jump to 64-bit. It was awesome that AMD got x86-64 working for the sake of backwards compatibility, but it's also the reason that to this day we're still stuck with 32-bit programs when most of us have 64-bit computers.

AMD definitely was on the right path with high IPC, but with Bulldozer they decided to go all NetBurst and start using lower IPC with substantially higher clocks. Worked well for Intel back then, right?

No, AMD didn't go "low IPC".
They just decided to not only focus on IPC but also get more efficient execution. In theory a Bulldozer module has 25% more compute capability than a Phenom II core, but due to improper tuning and high latencies it takes a good 30-40% hit, which puts its per-core IPC behind Phenom II. If things had worked out perfectly they would be years ahead, and I'm sure that's what AMD thought as well when they tested on simulated machines.
There is no point for AMD to go against Intel in the "fat cores with high IPC" battle, because if AMD does so it will always be behind, due to Intel having more money, which allows it to stay ahead in fabrication process. AMD has to take a more efficient architecture approach to get anywhere, and that's where cores with shared resources came from (Bulldozer was supposed to share the hardware that isn't always used by a core between two cores, so while the integer cores were crunching data in their cycle, the shared resources would be feeding the second core). I think Bulldozer was a good move in theory, it just had very bad execution; what Piledriver is turning out to be is what Bulldozer was supposed to be. Also, there is nothing fundamentally wrong with Bulldozer: they can add more decoders and FPU units to it and increase the module's IPC, and with fine tuning, sharing will have no effect on performance whatsoever. It wasn't designed that way because Bulldozer was meant for 45 nm, so adding more hardware would only have made the modules too big.
Now, if Bulldozer had been competing with Nehalem it would be seen as a much better CPU, but SB was the problem.
Remember, Bulldozer was years late to market.
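
Taking the figures in that post at face value, the arithmetic works out like this (a tiny illustrative calculation using the quoted numbers, not real benchmark data):

```c
#include <stdio.h>

int main(void)
{
    /* Quoted claim: a module has ~25% more compute than a Phenom II core,
     * but loses 30-40% to latency and poor tuning. */
    double phenom = 1.00;                    /* Phenom II core as baseline */
    double module_peak = phenom * 1.25;      /* claimed +25% compute       */
    for (double hit = 0.30; hit <= 0.401; hit += 0.10)
        printf("with a %2.0f%% hit: %.2f of a Phenom II core\n",
               hit * 100, module_peak * (1.0 - hit));
    return 0;
}
```

On those assumptions the module lands at roughly 0.75-0.88 of the baseline, which is consistent with the post's point that the per-core result ends up behind Phenom II despite the higher theoretical throughput.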
 