
AVX vs non-AVX

Joined
Jan 8, 2017
Messages
5,033 (4.05/day)
System Name Good enough
Processor AMD Ryzen R7 1700X - 4.0 Ghz / 1.350V
Motherboard ASRock B450M Pro4
Cooling Scythe Katana 4 - 3x 120mm case fans
Memory 16GB - Corsair Vengeance LPX
Video Card(s) OEM Dell GTX 1080
Storage 1x Samsung 850 EVO 250GB , 1x Samsung 860 EVO 500GB
Display(s) 4K Samsung TV
Case Zalman R1
Power Supply 500W
In light of the Vulkan vs OpenGL comparison posted some time ago, about why developers supposedly don't want to use Vulkan, I thought it would make sense to show something similar on the CPU side of things :).

Scalar, non-AVX:

Code:
#include <vector>
using std::vector;

// Fill every even row with the midpoint of the rows above and below it.
void interpolate(vector<vector<int>>& mat)
{
    for (size_t i = 2; i < mat.size() - 1; i += 2)
        for (size_t j = 0; j < mat[0].size(); j++)
        {
            mat[i][j] = mat[i-1][j] + 0.5f * (mat[i+1][j] - mat[i-1][j]);
        }
}
AVX:

Code:
#include <immintrin.h>  // AVX intrinsics

// Same interpolation, 8 columns at a time. Assumes the row width is a
// multiple of 8; leftover columns would need a scalar tail loop.
void interpolate_avx(vector<vector<int>>& mat)
{
    for (size_t i = 2; i < mat.size() - 1; i += 2)
        for (size_t j = 0; j < mat[0].size(); j += 8)
        {
            // load 8 ints from the rows below/above and convert to float
            __m256 below = _mm256_cvtepi32_ps(_mm256_loadu_si256((__m256i *)&mat[i+1][j]));
            __m256 above = _mm256_cvtepi32_ps(_mm256_loadu_si256((__m256i *)&mat[i-1][j]));
            // above + 0.5f * (below - above), converted back to int and stored
            __m256 mid = _mm256_add_ps(_mm256_mul_ps(_mm256_sub_ps(below, above), _mm256_set1_ps(0.5f)), above);
            _mm256_storeu_si256((__m256i *)&mat[i][j], _mm256_cvtps_epi32(mid));
        }
}
That's why game developers stay away from optimizing games using AVX, I guess :). In all seriousness, there is always a price to pay for better performance: the code above runs about 6.5 times faster, even though it looks unintelligible. Same thing with Vulkan vs OpenGL vs DirectX or whatever; it makes no sense to compare things like this.
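For anyone who wants to reproduce that kind of comparison, here is a minimal sketch of a timing harness (the matrix size and repeat count are arbitrary choices of mine, and it assumes the two functions above are in scope):

Code:
#include <chrono>
#include <cstdio>
#include <vector>
using std::vector;

int main()
{
    vector<vector<int>> mat(4096, vector<int>(4096, 1));

    auto t0 = std::chrono::steady_clock::now();
    for (int run = 0; run < 100; run++)
        interpolate(mat);               // swap in interpolate_avx to compare
    auto t1 = std::chrono::steady_clock::now();

    std::chrono::duration<double> elapsed = t1 - t0;
    printf("%.3f s\n", elapsed.count());
}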
 
Joined
Apr 8, 2010
Messages
518 (0.14/day)
Processor Intel Core i5 8400
Motherboard Gigabyte Z370N-Wifi
Cooling Silverstone AR05
Memory Micron Crucial 16GB DDR4-2400
Video Card(s) Gigabyte GTX1080 G1 Gaming 8G
Storage Micron Crucial MX300 275GB
Display(s) Dell U2415
Case Silverstone RVZ02B
Power Supply Silverstone SSR-SX550
Keyboard Ducky One Red Switch
Software Windows 10 Pro 1909
The question comes down to:
Is it worth putting in XXX% more work to achieve a program that runs YYY% faster? Nothing is free.
 
Joined
Oct 2, 2015
Messages
2,564 (1.50/day)
Location
Argentina
System Name Ciel / Yukino
Processor AMD Ryzen R5 3400G / Intel Core i3 5005U
Motherboard Gigabyte B450 Aorus M / HP 240 G5
Cooling AM3 Wraith + Spire fan / Stock
Memory 2x 8GB Corsair Vengeance LPX DDR4 3200MHz / 2x 4GB Hynix + Kingston DDR3L 1600MHz
Video Card(s) AMD Radeon RX Vega 11 + Nvidia Geforce GT 1030 2GB / Intel HD 5500
Storage SSD WD Green 240GB M.2 + HDD Toshiba 2TB / SSD Kingston A400 120GB SATA
Display(s) Samsung S22F350 @ 75Hz/ Integrated 1366x768 @ 94Hz
Case Generic / Stock
Audio Device(s) Realtek ALC892 / Realtek ALC282
Power Supply Sentey XPP 525W / Power Brick
Mouse Logitech G203 / Elan Touchpad
Keyboard Generic / Stock
Software Windows 10 x64
Shouldn't the compiler handle that on its own?
 

Solaris17

Dainty Moderator
Staff member
Joined
Aug 16, 2005
Messages
21,507 (3.98/day)
Location
Florida
System Name Venslar
Processor I9 7980XE
Motherboard MSI x299 Tomahawk Arctic
Cooling EK Custom
Memory 32GB Corsair DDR4 3000mhz
Video Card(s) Nvidia Titan RTX
Storage 1x 250GB 960 EVO | 1x 500gb Intel 720p | 32TB SAN
Display(s) 3x AOC Q2577PWQ (2k IPS)
Case Inwin 303 White (Thermaltake Ring 120mm Purple accent)
Audio Device(s) Schiit Fulla 3 on Beyerdynamic DT 990 Pros
Power Supply Seasonic 1050W Snow
Mouse Roccat Kone Aimo White
Keyboard Ducky Shine 6 Snow White
Software Windows 10 x64 Pro
Joined
Feb 18, 2020
Messages
33 (0.31/day)
Location
Russian Federation
Processor Ryzen 7 2700 (3900MHz@1.275V, Vsoc 1.025V)
Motherboard ASUS ROG Crosshair VIII Impact (BIOS 1302, Vsb 1.04V, SenseMI Skew 273)
Cooling Noctua NH-U12A (full custom loop building in progress)
Memory 2*8GB G.Skill Trident Z B&W (3600 14-14-14-14-28-42 1T 1.45V, GDM off, ProcODT 53, CAD 24-24-24-24)
Video Card(s) Radeon RX Vega64 Frontier Edition Air 16G (Core 1667MHz@1.07625V, Mem 1330@0.9375V)
Storage SSD: 120GiB + 240GiB + 480GiB. HDD: 1TiB + 2TiB. Backup: Ultrium LTO-6 6.25TiB.
Display(s) AOC C27G1
Case Fractal Design Define Nano S
Audio Device(s) Audio-technica ATH-AD900X
Power Supply BeQuiet! SFX-L Power 600W
Mouse A4Tech XL-747H
Keyboard Zalman ZM-K500
Software Gentoo-latest amd64-hardened, Windows 10 Pro WS x64.
The question comes down to:
Is it worth putting in XXX% more work to achieve a program that runs YYY% faster? Nothing is free.
Nope. It is all on the developer's head.
This is why devs are reluctant to use AVX or FMA instructions.
These instructions significantly increase performance, but require more time to debug and optimize the code.

Source: four years of developing code for the onboard computer of missile engines (РД type)
 
Joined
Nov 13, 2007
Messages
7,936 (1.73/day)
Location
Austin Texas
System Name _
Processor 8700K @ 5.2 Ghz / -1 avx 24/7
Motherboard MSI Z370-A PRO
Cooling 120mm Custom Liquid
Memory 32 GB 3900 Mhz DDR4 17-17-17-34-400 trfc - 2T
Video Card(s) Gigabyte GTX 2080 Ti Windforce (Undervolted OC 1905MHz)
Storage 3x1TB SSDs
Display(s) Alienware 34" 3440x1440 120hz, G-Sync
Case Jonsbo U4
Audio Device(s) Bose Solo
Power Supply Corsair SF750
Mouse logitech hero
Keyboard tenkeyless
Software Windows 10 64 Bit
Benchmark Scores pretty fast!
The compiler and IDE can only babysit you so much.
Challenge accepted. AI IDE/compilers incoming. :p

Seriously though, that looks awful lol.
 
Joined
Jan 8, 2017
Messages
5,033 (4.05/day)
System Name Good enough
Processor AMD Ryzen R7 1700X - 4.0 Ghz / 1.350V
Motherboard ASRock B450M Pro4
Cooling Scythe Katana 4 - 3x 120mm case fans
Memory 16GB - Corsair Vengeance LPX
Video Card(s) OEM Dell GTX 1080
Storage 1x Samsung 850 EVO 250GB , 1x Samsung 860 EVO 500GB
Display(s) 4K Samsung TV
Case Zalman R1
Power Supply 500W
Shouldn't the compiler handle that on its own?
Theoretically, yes; in practice GCC didn't vectorize this using AVX, probably because of the implicit conversion between integer and float, so it decided it shouldn't touch it.


No YMM registers in sight in the non-AVX function, which means no AVX instructions.

Seriously though that looks awful lol.
It does, but it doesn't necessarily need to: there are libraries that wrap the same thing in much more legible code (see the sketch below). The point was that it makes no sense to compare things like that; one does something, the other something else.
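To illustrate, a minimal hand-rolled wrapper (hypothetical, not any particular library) can hide the intrinsics behind operators, so the kernel reads almost like the scalar version:

Code:
#include <immintrin.h>
#include <vector>

// Eight floats with overloaded operators backed by AVX intrinsics.
struct vec8
{
    __m256 v;
    explicit vec8(const int* p)   // load 8 ints and convert to float
        : v(_mm256_cvtepi32_ps(_mm256_loadu_si256((const __m256i *)p))) {}
    explicit vec8(float x) : v(_mm256_set1_ps(x)) {}
    vec8(__m256 x) : v(x) {}
    void store(int* p) const      // convert back to int and store
    { _mm256_storeu_si256((__m256i *)p, _mm256_cvtps_epi32(v)); }
};
inline vec8 operator+(vec8 a, vec8 b) { return vec8(_mm256_add_ps(a.v, b.v)); }
inline vec8 operator-(vec8 a, vec8 b) { return vec8(_mm256_sub_ps(a.v, b.v)); }
inline vec8 operator*(vec8 a, vec8 b) { return vec8(_mm256_mul_ps(a.v, b.v)); }

// The kernel now reads almost like the scalar code:
void interpolate_wrapped(std::vector<std::vector<int>>& mat)
{
    for (size_t i = 2; i < mat.size() - 1; i += 2)
        for (size_t j = 0; j + 8 <= mat[0].size(); j += 8)
        {
            vec8 above(&mat[i-1][j]), below(&mat[i+1][j]);
            (above + vec8(0.5f) * (below - above)).store(&mat[i][j]);
        }
}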
 
Joined
Apr 8, 2010
Messages
518 (0.14/day)
Processor Intel Core i5 8400
Motherboard Gigabyte Z370N-Wifi
Cooling Silverstone AR05
Memory Micron Crucial 16GB DDR4-2400
Video Card(s) Gigabyte GTX1080 G1 Gaming 8G
Storage Micron Crucial MX300 275GB
Display(s) Dell U2415
Case Silverstone RVZ02B
Power Supply Silverstone SSR-SX550
Keyboard Ducky One Red Switch
Software Windows 10 Pro 1909
Nope. It is all on the developer's head.
This is why devs are reluctant to use AVX or FMA instructions.
These instructions significantly increase performance, but require more time to debug and optimize the code.

Source: four years of developing code for the onboard computer of missile engines (РД type)
That's exactly what I mean lol; maybe I should replace "more work" with "more time" to make it clearer?
 
Joined
Jan 8, 2017
Messages
5,033 (4.05/day)
System Name Good enough
Processor AMD Ryzen R7 1700X - 4.0 Ghz / 1.350V
Motherboard ASRock B450M Pro4
Cooling Scythe Katana 4 - 3x 120mm case fans
Memory 16GB - Corsair Vengeance LPX
Video Card(s) OEM Dell GTX 1080
Storage 1x Samsung 850 EVO 250GB , 1x Samsung 860 EVO 500GB
Display(s) 4K Samsung TV
Case Zalman R1
Power Supply 500W
Here's something funny: https://godbolt.org/z/tKVQXA

With those flags it decided it could use AVX-512 of all things, so presumably it figured AVX2 wasn't fast enough? More often than not you can't rely on the compiler to do this sort of thing, and keep in mind this computation is extremely simple. Things usually aren't like this.
 
Joined
Aug 18, 2017
Messages
254 (0.25/day)
In the previous godbolt link (#7) you didn't set any optimization flags, only the instruction set ;) No way it could've ended up vectorized.

-O2 with -ftree-vectorize is enough to see AVX+ instructions. -O3 generates an insane amount of code trying to cover and fully optimize almost every possible scenario, and does heavy unrolling. In most cases it isn't needed just to see whether certain optimizations happen.

This code may look obvious and simple, but when it comes to code generation there are many things to take into consideration. If something cannot be proven, extra code has to be generated to cover that case:
Can the vectors used in the inner loop alias?
Do the vectors contain an uneven number of elements, so that a few at the end have to be treated one by one?
Are there any language rules that make certain optimizations impossible?

All of this requires a certain dialogue with the compiler, to make sure it can both generate optimal code and not generate too much unnecessary code. It takes time and effort; the sketch below shows what that dialogue can look like.
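For instance (my own rewrite of the inner loop, and hedged: whether it vectorizes at -O2 -ftree-vectorize still depends on compiler version and target flags):

Code:
#include <cstddef>

// __restrict answers the aliasing question for the compiler: the three
// rows are promised not to overlap, so it can vectorize without runtime
// overlap checks. (The uneven-length question still costs a scalar tail.)
void interpolate_row(int* __restrict dst,
                     const int* __restrict above,
                     const int* __restrict below,
                     size_t n)
{
    for (size_t j = 0; j < n; j++)
        dst[j] = above[j] + 0.5f * (below[j] - above[j]);
}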
 
Joined
Jan 8, 2017
Messages
5,033 (4.05/day)
System Name Good enough
Processor AMD Ryzen R7 1700X - 4.0 Ghz / 1.350V
Motherboard ASRock B450M Pro4
Cooling Scythe Katana 4 - 3x 120mm case fans
Memory 16GB - Corsair Vengeance LPX
Video Card(s) OEM Dell GTX 1080
Storage 1x Samsung 850 EVO 250GB , 1x Samsung 860 EVO 500GB
Display(s) 4K Samsung TV
Case Zalman R1
Power Supply 500W
In the previous godbolt link (#7) you didn't set any optimization flags, only the instruction set ;) No way it could've ended up vectorized.

-O2 with -ftree-vectorize is enough to see AVX+ instructions. -O3 generates an insane amount of code trying to cover and fully optimize almost every possible scenario, and does heavy unrolling. In most cases it isn't needed just to see whether certain optimizations happen.
But that's the thing: compilers are inconsistent and unpredictable about these things. With -march=native it generated AVX-512 code; it's impractical to always have to look at the assembly to check whether the right instructions are being used.

Edit:

I can't actually get it to generate AVX with -O2 -ftree-vectorize: https://godbolt.org/z/XMNrNC
 
Joined
Aug 18, 2017
Messages
254 (0.25/day)
It's because you didn't specify an instruction set. Either set it explicitly (-mavx...), or set a target architecture that supports it with -march=skylake (or any other).
By the way, don't use -march=native on Compiler Explorer, as it uses the architecture the compiler runs on. Today it may run on an AVX-512 machine, tomorrow it may not ;)

Major compilers provide additional flags that report when vectorization happens and, when it doesn't, the reason why.
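For example (the file name is a placeholder, and exact report wording varies by compiler version):

Code:
# GCC: report vectorized loops and missed opportunities
g++ -O2 -ftree-vectorize -march=skylake -fopt-info-vec-optimized -fopt-info-vec-missed interp.cpp

# Clang: remarks from the loop vectorizer
clang++ -O2 -march=skylake -Rpass=loop-vectorize -Rpass-missed=loop-vectorize interp.cpp

# MSVC: vectorizer report
cl /O2 /arch:AVX2 /Qvec-report:2 interp.cpp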
 
Joined
Aug 20, 2007
Messages
12,976 (2.78/day)
System Name Pioneer
Processor Intel i9 9900k
Motherboard ASRock Z390 Taichi
Cooling Noctua NH-D15 + A whole lotta Sunon and Corsair Maglev blower fans...
Memory G.SKILL TridentZ Series 32GB (4 x 8GB) DDR4-3200 @ 14-14-14-34-2T
Video Card(s) AMD RX 5700 XT (XFX THICC Ultra III)
Storage Mushkin Pilot-E 2TB NVMe SSD w/ EKWB M.2 Heatsink
Display(s) 55" LG 55" B9 OLED 4K Display
Case Thermaltake Core X31
Audio Device(s) VGA HDMI->Panasonic SC-HTB20/Schiit Modi MB/Asgard 2 DAC/Amp to AKG Pro K7712 Headphones
Power Supply SeaSonic Prime 750W 80Plus Titanium
Mouse ROCCAT Kone EMP
Keyboard WASD CODE 104-Key w/ Cherry MX Green Keyswitches, Doubleshot Vortex PBT White Transluscent Keycaps
Software Windows 10 Enterprise (yes, it's legit.)
Joined
Jan 8, 2017
Messages
5,033 (4.05/day)
System Name Good enough
Processor AMD Ryzen R7 1700X - 4.0 Ghz / 1.350V
Motherboard ASRock B450M Pro4
Cooling Scythe Katana 4 - 3x 120mm case fans
Memory 16GB - Corsair Vengeance LPX
Video Card(s) OEM Dell GTX 1080
Storage 1x Samsung 850 EVO 250GB , 1x Samsung 860 EVO 500GB
Display(s) 4K Samsung TV
Case Zalman R1
Power Supply 500W
Damn straight: That's the compiler's job.



But I like high level languages. Go home, code ricer.
Write sufficiently convoluted code and the compiler will do jack; not everything is as easily analyzable as linear algebra. Something like the loop below, for example.
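A sketch of the kind of loop auto-vectorizers typically give up on (my example, stream compaction): the store position depends on data computed in every previous iteration.

Code:
// Copy only the positive values; k is a loop-carried, data-dependent
// store index, which defeats straightforward vectorization.
void compact_positive(const int* in, int* out, int n, int& count)
{
    int k = 0;
    for (int i = 0; i < n; i++)
        if (in[i] > 0)
            out[k++] = in[i];
    count = k;
}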
 
Joined
Oct 2, 2015
Messages
2,564 (1.50/day)
Location
Argentina
System Name Ciel / Yukino
Processor AMD Ryzen R5 3400G / Intel Core i3 5005U
Motherboard Gigabyte B450 Aorus M / HP 240 G5
Cooling AM3 Wraith + Spire fan / Stock
Memory 2x 8GB Corsair Vengeance LPX DDR4 3200MHz / 2x 4GB Hynix + Kingston DDR3L 1600MHz
Video Card(s) AMD Radeon RX Vega 11 + Nvidia Geforce GT 1030 2GB / Intel HD 5500
Storage SSD WD Green 240GB M.2 + HDD Toshiba 2TB / SSD Kingston A400 120GB SATA
Display(s) Samsung S22F350 @ 75Hz/ Integrated 1366x768 @ 94Hz
Case Generic / Stock
Audio Device(s) Realtek ALC892 / Realtek ALC282
Power Supply Sentey XPP 525W / Power Brick
Mouse Logitech G203 / Elan Touchpad
Keyboard Generic / Stock
Software Windows 10 x64
Nothing some good old machine language can't solve. Bonus points for using punch cards.
 
Joined
Aug 20, 2007
Messages
12,976 (2.78/day)
System Name Pioneer
Processor Intel i9 9900k
Motherboard ASRock Z390 Taichi
Cooling Noctua NH-D15 + A whole lotta Sunon and Corsair Maglev blower fans...
Memory G.SKILL TridentZ Series 32GB (4 x 8GB) DDR4-3200 @ 14-14-14-34-2T
Video Card(s) AMD RX 5700 XT (XFX THICC Ultra III)
Storage Mushkin Pilot-E 2TB NVMe SSD w/ EKWB M.2 Heatsink
Display(s) 55" LG 55" B9 OLED 4K Display
Case Thermaltake Core X31
Audio Device(s) VGA HDMI->Panasonic SC-HTB20/Schiit Modi MB/Asgard 2 DAC/Amp to AKG Pro K7712 Headphones
Power Supply SeaSonic Prime 750W 80Plus Titanium
Mouse ROCCAT Kone EMP
Keyboard WASD CODE 104-Key w/ Cherry MX Green Keyswitches, Doubleshot Vortex PBT White Transluscent Keycaps
Software Windows 10 Enterprise (yes, it's legit.)
Write sufficiently convoluted code and the compiler will do jack; not everything is as easily analyzable as linear algebra.
If it can figure out the Linux kernel inlining optimizations, it can figure out almost anything. I daresay it'd be more challenging to write convoluted code it couldn't figure out than to just write normal code.

And that's GCC, honestly the shittiest of the bunch.

My opinion, but of course it's right, because I have no idea what I'm talking about, and in this day and age that stops nobody (this is me razzing myself; I do indeed have an idea, but not much of one).

I primarily write Java and C# and only dabble in low-level stuff, remember. These languages take care of everything, heck, even garbage collection. You can be as messy as you want. It's good for lazy people like me.
 
Joined
Jun 10, 2014
Messages
2,032 (0.93/day)
This is why devs are reluctant to use AVX or FMA instructions.

These instructions significantly increase performance, but require more time to debug and optimize the code.
AVX is certainly harder to use, the syntax of the intrinsics is nasty, and it can quickly get out of hand for complex stuff, like the example in post #1. I would have to really study that line to understand it, compared to the plain C++ one.

This might seem like an "impossible" task, but for most applications only a small part of the code is really performance critical, and often only a few lines of AVX intrinsics are needed to give a huge performance boost to the entire application.

Before even considering SIMD (like AVX), a few general optimizations are needed. Firstly, the data should be dense (and this may require restructuring in the non-AVX code too), and it should be traversed linearly; this is part of "cache optimizing" (see the sketch below). Secondly, bloat such as branching, function calls etc. inside the critical path (especially loops) should be reduced to a minimum. These optimizations help a lot even before you use AVX, and they also help the compiler do more for you, but they are largely incompatible with the coding paradigms used in most applications today, which is the primary reason why we don't see low-level optimizations in many applications, and especially not in games.
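A common form of that restructuring (a generic sketch of mine, not from any particular engine) is going from array-of-structs to struct-of-arrays:

Code:
#include <vector>

// Array-of-structs: each object's fields are interleaved, so a loop that
// only touches x wastes most of every cache line and is awkward to load
// into SIMD registers.
struct ParticleAoS { float x, y, z, mass; };

// Struct-of-arrays: each field is stored densely and traversed linearly,
// which is what both the cache and the vectorizer want.
struct ParticlesSoA
{
    std::vector<float> x, y, z, mass;
};

void advance_x(ParticlesSoA& p, const std::vector<float>& vx, float dt)
{
    // contiguous loads/stores, no branching: trivial to vectorize
    for (size_t i = 0; i < p.x.size(); i++)
        p.x[i] += vx[i] * dt;
}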

There are also some pitfalls with AVX features like FMA: while FMA is super fast, it gives higher precision than the unfused multiply-then-add (since it rounds only once, after the whole calculation), which may be a problem for some applications, especially if you ship two versions of the software.
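A small self-contained illustration of that single-rounding difference (my own toy numbers):

Code:
#include <cmath>
#include <cstdio>

int main()
{
    // compile with -ffp-contract=off so the plain expression below is
    // not silently turned into an FMA by the compiler
    double a = 1.0 + std::ldexp(1.0, -27);   // 1 + 2^-27
    double plain = a * a - 1.0;              // a*a is rounded first
    double fused = std::fma(a, a, -1.0);     // one rounding at the end
    std::printf("plain: %.20e\nfused: %.20e\n", plain, fused);
}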

And since this thread is angled towards games, there are a few other things to consider as well; these are some of the reasons why AVX is rarely used in games:
- Most games use off-the-shelf engines, and the studio does no engine code itself.
- Most games are rushed, "stitched together" code bases which are "discarded" after release. No one usually cares about code quality or maintenance; it's usually just "good enough". This is the reason why most shipping games today are "broken".
- Hardware support: AVX needs Sandy Bridge or Bulldozer and newer, and is missing from Celeron, Pentium, Atom and various low-end CPUs (correct me if I'm wrong here).
- In most games, the rendering thread(s) are usually not bottlenecked by math in the CPU code, or the math on the CPU side is not dense enough.
Don't get me wrong, I certainly wish AVX were used extensively, squeezing out every drop of performance we can get. But games are probably one of the use cases where the benefits are the least measurable. One good argument for AVX is energy efficiency, even if it doesn't necessarily help your FPS.
 
Joined
Apr 24, 2020
Messages
79 (1.98/day)
Write sufficiently convoluted code and the compiler will do jack; not everything is as easily analyzable as linear algebra.
You might be surprised.


The code you posted in the first post is clearly recognized by GCC's auto-vectorizer and cleanly compiles into AVX2 instructions, including vaddps. Auto-vectorizers are getting surprisingly good these days: not as good as hand-crafted assembly, but good enough that it's worth learning how to use the "-mavx2 -O3" compiler flags at least.

EDIT: Wrong link the first time. It doesn't seem like -O2 actually auto-vectorizes. You need -O3 for that; there's probably something math-unsafe in the vectorized version of the code that needs the higher-level compiler flag to ram through.
 
Joined
Aug 20, 2007
Messages
12,976 (2.78/day)
System Name Pioneer
Processor Intel i9 9900k
Motherboard ASRock Z390 Taichi
Cooling Noctua NH-D15 + A whole lotta Sunon and Corsair Maglev blower fans...
Memory G.SKILL TridentZ Series 32GB (4 x 8GB) DDR4-3200 @ 14-14-14-34-2T
Video Card(s) AMD RX 5700 XT (XFX THICC Ultra III)
Storage Mushkin Pilot-E 2TB NVMe SSD w/ EKWB M.2 Heatsink
Display(s) 55" LG 55" B9 OLED 4K Display
Case Thermaltake Core X31
Audio Device(s) VGA HDMI->Panasonic SC-HTB20/Schiit Modi MB/Asgard 2 DAC/Amp to AKG Pro K7712 Headphones
Power Supply SeaSonic Prime 750W 80Plus Titanium
Mouse ROCCAT Kone EMP
Keyboard WASD CODE 104-Key w/ Cherry MX Green Keyswitches, Doubleshot Vortex PBT White Transluscent Keycaps
Software Windows 10 Enterprise (yes, it's legit.)
Yep. Thanks for proving I did in fact have an inkling of an idea of what I was talking about.

*goes back to writing shitty java code*
 
Joined
Apr 24, 2020
Messages
79 (1.98/day)
Yep. Thanks for proving I did in fact have an inkling of an idea of what I was talking about.

*goes back to writing shitty java code*
The REAL benefit of SIMD programming requires a paradigm shift. There are specially designed programming languages, such as CUDA or OpenCL (or, for CPU programmers, ISPC), which are better high-level representations that compile more easily into highly optimized SIMD / SSE code.

The sad truth is, C / C++ is awkward to compile into highly optimized SIMD. However, if you just barely change the language (CUDA is almost identical to C++), it becomes pleasant, even easy, to write highly optimized SIMD. Intel clearly got the message, and AVX512 includes a ton of stuff that's useful to the compiler rather than the programmer, kind of putting AVX512 on a similar footing to NVidia PTX or AMD GCN.

---------

With that being said, a ton of effort has been put into auto-vectorizing standard C / C++ code. And it's going to get easier, with OpenMP extensions (which are similar to the CUDA extensions) providing a language extension to C/C++ for playing with SIMD compilers more easily. This is relatively state-of-the-art stuff, but feel free to play with OpenMP's "#pragma omp for simd" statement in GCC or Clang if you don't believe me. It's surprisingly easy... well, easy if you know the paradigm of OpenMP to start with.
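For a taste, here's a minimal sketch using the standalone "#pragma omp simd" cousin of that construct (compile with -fopenmp, or -fopenmp-simd to enable just the SIMD directives):

Code:
#include <cstddef>

void interpolate_row(int* dst, const int* above, const int* below, size_t n)
{
    // asks the compiler to vectorize this loop; no parallel threads involved
    #pragma omp simd
    for (size_t j = 0; j < n; j++)
        dst[j] = above[j] + 0.5f * (below[j] - above[j]);
}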
 
Joined
Jan 8, 2017
Messages
5,033 (4.05/day)
System Name Good enough
Processor AMD Ryzen R7 1700X - 4.0 Ghz / 1.350V
Motherboard ASRock B450M Pro4
Cooling Scythe Katana 4 - 3x 120mm case fans
Memory 16GB - Corsair Vengeance LPX
Video Card(s) OEM Dell GTX 1080
Storage 1x Samsung 850 EVO 250GB , 1x Samsung 860 EVO 500GB
Display(s) 4K Samsung TV
Case Zalman R1
Power Supply 500W
It doesn't seem like -O2 actually auto-vectorizes. You need -O3 for that; there's probably something math-unsafe in the vectorized version of the code that needs the higher-level compiler flag to ram through.
Probably the integer-float conversion. This is a basic example, though; few things are this explicit and easy to auto-vectorize. Compilers also do a performance-cost analysis, which might also be wrong and lead to unoptimized code. For instance, they'll hardly ever touch memory I/O instructions unless they absolutely have to; by that I mean they'll never really try to optimize data movement, as they can't know whether it would actually be faster, or even safe.

However, if you just barely change the language (CUDA is almost identical to C++), it becomes pleasant, even easy, to write highly optimized SIMD.
CUDA isn't really SIMD-based, as the vectorization isn't exposed anywhere at the software level: there is no exact concept of width, or of executing everything in lock-step as is the case in SIMD; the warps are just logical groups as far as the software side of things is concerned. In fact, the hardware isn't organized in a SIMD fashion either; it's meant to implement per-thread parallelism. For these reasons CUDA and C++ aren't comparable in this regard.
 
Joined
Apr 24, 2020
Messages
79 (1.98/day)
CUDA isn't really SIMD-based, as the vectorization isn't exposed anywhere at the software level: there is no exact concept of width, or of executing everything in lock-step as is the case in SIMD; the warps are just logical groups as far as the software side of things is concerned. In fact, the hardware isn't organized in a SIMD fashion either; it's meant to implement per-thread parallelism. For these reasons CUDA and C++ aren't comparable in this regard.
PTX is itself width 32. Every PTX assembly instruction is either a scalar (1x width) or 32x width (SIMD).

If you don't believe me, then focus on the CPU version of high-level SIMD. ISPC applies the lessons of CUDA, except it outputs x86 AVX code instead. All the concepts of ISPC and CUDA remain compatible at a high level.

SIMD is almost identical to per-thread parallelism. That's the lesson of ISPC and CUDA. The compiler can convert per-thread parallelism into SIMD very, very easily. And OpenMP 4.5 proves that the approach works in standard C/C++ code.
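To show the idea in plain C++ (a toy sketch of mine, not ISPC itself): write the body as a scalar "per-thread" function, and let the loop over indices supply the parallelism for a vectorizer to map onto lanes.

Code:
// One "thread": plain scalar code for a single index.
inline int lerp_mid(int above, int below)
{
    return (int)(above + 0.5f * (below - above));
}

// The launch: consecutive iterations become SIMD lanes when vectorized.
void interpolate_row(int* dst, const int* above, const int* below, int n)
{
    for (int j = 0; j < n; j++)      // j plays the role of the thread index
        dst[j] = lerp_mid(above[j], below[j]);
}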
 

bug

Joined
May 22, 2015
Messages
7,551 (4.11/day)
Processor Intel i5-6600k (AMD Ryzen5 3600 in a box, waiting for a mobo)
Motherboard ASRock Z170 Extreme7+
Cooling Arctic Cooling Freezer i11
Memory 2x16GB DDR4 3600 G.Skill Ripjaws V (@3200)
Video Card(s) EVGA GTX 1060 SC
Storage 500GB Samsung 970 EVO, 500GB Samsung 850 EVO, 1TB Crucial MX300 and 3TB Seagate
Display(s) HP ZR24w
Case Raijintek Thetis
Audio Device(s) Audioquest Dragonfly Red :D
Power Supply Seasonic 620W M12
Mouse Logitech G502 Proteus Core
Keyboard G.Skill KM780R
Software Arch Linux + Win10
The question comes down to:
Is it worth putting in XXX% more work to achieve a program that runs YYY% faster? Nothing is free.
In the world of programming, few things are worse than making the code harder to read. It is never only about the cost of writing that code, but also about maintaining it. The cost snowballs.
 
Joined
Jan 8, 2017
Messages
5,033 (4.05/day)
System Name Good enough
Processor AMD Ryzen R7 1700X - 4.0 Ghz / 1.350V
Motherboard ASRock B450M Pro4
Cooling Scythe Katana 4 - 3x 120mm case fans
Memory 16GB - Corsair Vengeance LPX
Video Card(s) OEM Dell GTX 1080
Storage 1x Samsung 850 EVO 250GB , 1x Samsung 860 EVO 500GB
Display(s) 4K Samsung TV
Case Zalman R1
Power Supply 500W
PTX is itself width 32. Every PTX assembly instruction is either a scalar (1x width) or 32x width (SIMD).
PTX is not assembly; there is no physical 32-wide vector register/instruction in the hardware. It's a software abstraction; GPUs aren't SIMD machines. They used to be, but not anymore; this is basically a legacy abstraction, and it wouldn't surprise me if in the near future manufacturers drop the concept of a wavefront/warp entirely. It used to be that with OpenCL you had to use actual vector formats, because the hardware was indeed configured like that.

SIMD is almost identical to per-thread parallelism.
It just isn't. While you can emulate SIMD architectures with SIMT with practically no performance penalty, the other way around can't be done efficiently, because there are things that you simply can't do with SIMD. ISPC is basically a wrapper for SIMD; there is nothing special about it. Both CUDA and ISPC are ways to express parallelism while writing essentially scalar code.

Just because the GPU can more independently track warps and dynamically schedule between them doesn't change the SIMD architecture of Pascal, Turing or Volta.
OK, let's clarify this: if there isn't a single register on which one decoded instruction operates in lock-step until it's done, then it's not SIMD.

It looks like it's you who doesn't understand what happens at the lowest level. Basically, SIMD means one instruction stream (aka thread), SIMT means multiple; that's the whole idea behind why GPUs are so fast. With SIMD, when the I/O is blocked, nothing happens until the instruction finishes executing, because it's part of just one thread.

With SIMT, when one or more threads are blocked, the SM can pause all threads within that warp and switch to a different instruction for which the data is available, within a generic register space. That's right, there is a register space, not just one fixed register that is part of a set, as would be the case with SIMD. Moreover, a SIMD EU would never proceed unless all data elements are loaded from memory, whereas a SIMT EU can, with limited performance, continue execution on a subset of the total threads within the warp (that's how GPUs do branching).
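For contrast, this is how a pure SIMD ISA typically handles a branch: compute both sides for all lanes, then mask-select the result per lane (a minimal AVX sketch of mine):

Code:
#include <immintrin.h>

// Per-lane "if (x > 0) x *= 2; else x = 0;" for 8 floats at once.
__m256 branch_by_mask(__m256 x)
{
    __m256 zero  = _mm256_setzero_ps();
    __m256 mask  = _mm256_cmp_ps(x, zero, _CMP_GT_OQ);  // per-lane condition
    __m256 taken = _mm256_mul_ps(x, _mm256_set1_ps(2.0f));
    return _mm256_blendv_ps(zero, taken, mask);         // select per lane
}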

The fundamental difference is: SIMD depends exclusively on the data; SIMT does not.
 
Joined
Apr 24, 2020
Messages
79 (1.98/day)
PTX is not assembly
Fine. I'll talk about SASS Assembly then. SASS assembly matches 1-to-1 with the actual machine code that NVidia GPUs execute, but it changes from generation to generation. SASS is undocumented because PTX is close enough to machine code that most people don't care. But if you care, I'm more than willing to talk about SASS instead. https://arxiv.org/pdf/1804.06826.pdf

there is no physical 32 wide vector register/instruction in the hardware
This is blatantly false. All NVidia Volta and Turing machines have a 32-wide warp, even at the SASS assembly level. The SASS assembly has warp-specific barriers, read/write ordering, and other very low-level details revealed. I suggest you study Chapter 2 of the paper carefully, to see how these 32-wide warps actually execute at the SASS / Machine code level for Volta.

this is basically a legacy abstraction, and it wouldn't surprise me if in the near future manufacturers drop the concept of a wavefront/warp entirely.
You are confused about independent thread scheduling. While a program counter is now tracked on a per-thread basis, execution continues to happen on a 32-wide warp basis. Just because the GPU can more independently track warps and dynamically schedule between them doesn't change the SIMD architecture of Pascal, Turing or Volta.


Starting with the Volta architecture, Independent Thread Scheduling allows full concurrency between threads, regardless of warp. With Independent Thread Scheduling, the GPU maintains execution state per thread, including a program counter and call stack, and can yield execution at a per-thread granularity, either to make better use of execution resources or to allow one thread to wait for data to be produced by another. A schedule optimizer determines how to group active threads from the same warp together into SIMT units. This retains the high throughput of SIMT execution as in prior NVIDIA GPUs, but with much more flexibility: threads can now diverge and reconverge at sub-warp granularity.
It's a bit of a difficult read. But let me emphasize this point:

A schedule optimizer determines how to group active threads from the same warp together into SIMT units.
The SIMT units are still executing in warps, even on Volta / Turing with Independent Thread Scheduling turned on. You are misunderstanding what is happening at the lowest level.

----------

Don't be swayed by NVidia's "SIMT" marketing. SIMT is damn near identical to late-80s SIMD code, including StarLisp or StarC. NVidia certainly is pushing the envelope with some of their designs, but a huge chunk of their literature is just marketing hype. I do think NVidia makes the most advanced SIMD chips available today, but AMD GCN and Intel AVX512 are surprisingly close in capability.
 