
Intel Partners with Heidelberg University Computing Center to Establish oneAPI Academic Center of Excellence

AleksandarK

News Editor
Staff member
Joined
Aug 19, 2017
Messages
2,224 (0.91/day)
Intel and Heidelberg University Computing Center (URZ) today announced that they have established a oneAPI Academic Center of Excellence (CoE) at URZ. The newly established CoE aims to further develop Intel's oneAPI standard and enable it to work on AMD GPUs. That may come as a bit of a surprise, but Intel believes the technology should work on a wide range of processors, no matter the vendor. Heterogeneous hardware programming is the main goal here. In a Twitter thread, an Intel employee specifies that Intel has also been working with Arm and NVIDIA to bring Data-Parallel C++ (DPC++), the core of oneAPI, to those vendors as well. That should bring this universal programming model to every device and platform, which is the goal of heterogeneous programming: whatever you need to program, be it a CPU, GPU, or some other ASIC, is covered by a single API, specifically oneAPI.

URZ said:
URZ's work as a oneAPI CoE will add advanced DPC++ capabilities into hipSYCL, which supports systems based on AMD GPUs, NVIDIA GPUs, and CPUs. New DPC++ extensions are part of the SYCL 2020 provisional specification that brings features such as unified shared memory to hipSYCL and the platforms it supports - furthering the promise of oneAPI application support across system architectures and vendors.


View at TechPowerUp Main Site
 
Joined
Jul 16, 2014
Messages
8,116 (2.28/day)
Location
SE Michigan
System Name Dumbass
Processor AMD Ryzen 7800X3D
Motherboard ASUS TUF gaming B650
Cooling Arctic Liquid Freezer 2 - 420mm
Memory G.Skill Sniper 32gb DDR5 6000
Video Card(s) GreenTeam 4070 ti super 16gb
Storage Samsung EVO 500gb & 1Tb, 2tb HDD, 500gb WD Black
Display(s) 1x Nixeus NX_EDG27, 2x Dell S2440L (16:9)
Case Phanteks Enthoo Primo w/8 140mm SP Fans
Audio Device(s) onboard (realtek?) - SPKRS:Logitech Z623 200w 2.1
Power Supply Corsair HX1000i
Mouse SteelSeries Esports Wireless
Keyboard Corsair K100
Software windows 10 H
Benchmark Scores https://i.imgur.com/aoz3vWY.jpg?2
oneAPI will fail; the 'vendors' will see to it.
 
Joined
Jan 8, 2017
Messages
8,929 (3.36/day)
System Name Good enough
Processor AMD Ryzen R9 7900 - Alphacool Eisblock XPX Aurora Edge
Motherboard ASRock B650 Pro RS
Cooling 2x 360mm NexXxoS ST30 X-Flow, 1x 360mm NexXxoS ST30, 1x 240mm NexXxoS ST30
Memory 32GB - FURY Beast RGB 5600 Mhz
Video Card(s) Sapphire RX 7900 XT - Alphacool Eisblock Aurora
Storage 1x Kingston KC3000 1TB 1x Kingston A2000 1TB, 1x Samsung 850 EVO 250GB , 1x Samsung 860 EVO 500GB
Display(s) LG UltraGear 32GN650-B + 4K Samsung TV
Case Phanteks NV7
Power Supply GPS-750C
There is SYCL, there is also ACC, and who knows what else; what makes this special, I don't know.
 

bug

Joined
May 22, 2015
Messages
13,213 (4.06/day)
Processor Intel i5-12600k
Motherboard Asus H670 TUF
Cooling Arctic Freezer 34
Memory 2x16GB DDR4 3600 G.Skill Ripjaws V
Video Card(s) EVGA GTX 1060 SC
Storage 500GB Samsung 970 EVO, 500GB Samsung 850 EVO, 1TB Crucial MX300 and 2TB Crucial MX500
Display(s) Dell U3219Q + HP ZR24w
Case Raijintek Thetis
Audio Device(s) Audioquest Dragonfly Red :D
Power Supply Seasonic 620W M12
Mouse Logitech G502 Proteus Core
Keyboard G.Skill KM780R
Software Arch Linux + Win10
I would have picked Rust over C++ for this, simply because security across so many different devices is going to be a b*tch.
 

Cheeseball

Not a Potato
Supporter
Joined
Jan 2, 2009
Messages
1,842 (0.33/day)
Location
Pittsburgh, PA
System Name Titan
Processor AMD Ryzen™ 7 7950X3D
Motherboard ASUS ROG Strix X670E-I Gaming WiFi
Cooling ID-COOLING SE-207-XT Slim Snow
Memory TEAMGROUP T-Force Delta RGB 2x16GB DDR5-6000 CL30
Video Card(s) ASRock Radeon RX 7900 XTX 24 GB GDDR6 (MBA)
Storage 2TB Samsung 990 Pro NVMe
Display(s) AOpen Fire Legend 24" (25XV2Q), Dough Spectrum One 27" (Glossy), LG C4 42" (OLED42C4PUA)
Case ASUS Prime AP201 33L White
Audio Device(s) Kanto Audio YU2 and SUB8 Desktop Speakers and Subwoofer, Cloud Alpha Wireless
Power Supply Corsair SF1000L
Mouse Logitech Pro Superlight (White), G303 Shroud Edition
Keyboard Wooting 60HE / NuPhy Air75 v2
VR HMD Oculus Quest 2 128GB
Software Windows 11 Pro 64-bit 23H2 Build 22631.3447
There is SYCL, there is also ACC, and who knows what else; what makes this special, I don't know.

oneAPI currently is basically SYCL with certain Intel extensions (which also run on AMD/NVIDIA hardware). The extensions include access to certain AVX2 and AVX-512 instructions and their subsets.
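
For reference, here's a bare-bones sketch of what that looks like in DPC++/SYCL (my own toy example, not from the announcement; it assumes a SYCL 2020 toolchain like Intel's dpcpp or hipSYCL, and uses the unified shared memory feature the quoted announcement mentions):

Code:
#include <sycl/sycl.hpp>

int main() {
    sycl::queue q;                               // picks a default device
    const int n = 1024;
    int* data = sycl::malloc_shared<int>(n, q);  // USM: visible to host and device
    q.parallel_for(sycl::range<1>(n), [=](sycl::id<1> i) {
        data[i] = static_cast<int>(i[0]) * 2;    // runs on the device
    }).wait();
    sycl::free(data, q);
    return 0;
}

The same source can target Intel, AMD, or NVIDIA hardware depending on which backend the compiler was built with, which is the whole pitch here.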
 
Joined
Aug 8, 2019
Messages
430 (0.25/day)
System Name R2V2 *In Progress
Processor Ryzen 7 2700
Motherboard Asrock X570 Taichi
Cooling W2A... water to air
Memory G.Skill Trident Z3466 B-die
Video Card(s) Radeon VII repaired and resurrected
Storage Adata and Samsung NVME
Display(s) Samsung LCD
Case Some ThermalTake
Audio Device(s) Asus Strix RAID DLX upgraded op amps
Power Supply Seasonic Prime something or other
Software Windows 10 Pro x64
If oneAPI gains traction, CUDA is screwed.

CUDA only prospers as long as the only options are the broken and painful OpenCL and Vulkan Compute. Both of which NV makes sure remain broken.

If AMD enables oneAPI on GCN and RDNA/CDNA... the 'CUDA is easier' argument goes away. You get more compute grunt for far less... NV has got to be sweating, because CUDA drives a lot of money.
 

bug

Joined
May 22, 2015
Messages
13,213 (4.06/day)
Processor Intel i5-12600k
Motherboard Asus H670 TUF
Cooling Arctic Freezer 34
Memory 2x16GB DDR4 3600 G.Skill Ripjaws V
Video Card(s) EVGA GTX 1060 SC
Storage 500GB Samsung 970 EVO, 500GB Samsung 850 EVO, 1TB Crucial MX300 and 2TB Crucial MX500
Display(s) Dell U3219Q + HP ZR24w
Case Raijintek Thetis
Audio Device(s) Audioquest Dragonfly Red :D
Power Supply Seasonic 620W M12
Mouse Logitech G502 Proteus Core
Keyboard G.Skill KM780R
Software Arch Linux + Win10
If oneAPI gains traction, CUDA is screwed.

CUDA only prospers as long as the only options are the broken and painful OpenCL and Vulkan Compute. Both of which NV makes sure remain broken.

If AMD enables oneAPI on GCN and RDNA/CDNA... the 'CUDA is easier' argument goes away. You get more compute grunt for far less... NV has got to be sweating, because CUDA drives a lot of money.
Not sure how Nvidia breaks those. I haven't read about Vulkan Compute, but Nvidia implemented OpenCL years ago. Still, everybody prefers CUDA.
 
Joined
Apr 24, 2020
Messages
2,560 (1.76/day)
I would have picked Rust over C++ for this, simply because security across so many different devices is going to be a b*tch.

These GPU APIs are all about sharing pointers, tree structures, and more between CPU and GPU without skipping a beat. You can have a giant graph created by your CPU code, copied to the GPU, and then the pointers traversed in parallel on the GPU these days (NVidia, Intel, or AMD).

Rust is good for memory safety, but really bad for actual pointer arithmetic, pointer sharing, and other such details. And this kind of pointer sharing is very important for performance across heterogeneous systems. A C-like language with raw pointer manipulation is absolutely necessary, and C++ is the most advanced language that supports that level of detail.
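
To make the pointer sharing concrete, here's a minimal sketch (my own illustration, not vendor sample code; the sumList kernel name is made up) of a CPU-built linked structure traversed by GPU code through CUDA unified memory:

Code:
#include <cstdio>
#include <cuda_runtime.h>

struct Node {
    int   value;
    Node* next;   // a raw pointer, valid on both CPU and GPU
};

__global__ void sumList(Node* head, int* out) {
    int sum = 0;
    for (Node* n = head; n != nullptr; n = n->next)
        sum += n->value;                  // GPU dereferences CPU-built pointers
    *out = sum;
}

int main() {
    Node* head = nullptr;
    for (int i = 0; i < 4; i++) {         // build the list on the CPU
        Node* n;
        cudaMallocManaged(&n, sizeof(Node));
        n->value = i;
        n->next  = head;
        head     = n;
    }
    int* out;
    cudaMallocManaged(&out, sizeof(int));
    sumList<<<1, 1>>>(head, out);         // same pointers, no marshalling step
    cudaDeviceSynchronize();
    printf("sum = %d\n", *out);
    return 0;
}

Note how nothing translates or serializes the pointers: the addresses the CPU wrote into next are the addresses the GPU chases.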
 

bug

Joined
May 22, 2015
Messages
13,213 (4.06/day)
Processor Intel i5-12600k
Motherboard Asus H670 TUF
Cooling Arctic Freezer 34
Memory 2x16GB DDR4 3600 G.Skill Ripjaws V
Video Card(s) EVGA GTX 1060 SC
Storage 500GB Samsung 970 EVO, 500GB Samsung 850 EVO, 1TB Crucial MX300 and 2TB Crucial MX500
Display(s) Dell U3219Q + HP ZR24w
Case Raijintek Thetis
Audio Device(s) Audioquest Dragonfly Red :D
Power Supply Seasonic 620W M12
Mouse Logitech G502 Proteus Core
Keyboard G.Skill KM780R
Software Arch Linux + Win10
These GPU APIs are all about sharing pointers, tree structures, and more between CPU and GPU without skipping a beat. You can have a giant graph created by your CPU code, copied to the GPU, and then the pointers traversed in parallel on the GPU these days (NVidia, Intel, or AMD).

Rust is good for memory safety, but really bad for actual pointer arithmetic, pointer sharing, and other such details. And this kind of pointer sharing is very important for performance across heterogeneous systems. A C-like language with raw pointer manipulation is absolutely necessary, and C++ is the most advanced language that supports that level of detail.
Exactly. It's that pointer arithmetic that causes all sorts of hard-to-trace crashes (or worse, silent errors), so why would they still drag it around?
 
Joined
Apr 24, 2020
Messages
2,560 (1.76/day)
Exactly. It's that pointer arithmetic that causes all sorts of hard-to-trace crashes (or worse, silent errors), so why would they still drag it around?

Performance. Which is why people are using GPUs in the first place (and why you can now share pointers between GPU code and CPU code).
 

bug

Joined
May 22, 2015
Messages
13,213 (4.06/day)
Processor Intel i5-12600k
Motherboard Asus H670 TUF
Cooling Arctic Freezer 34
Memory 2x16GB DDR4 3600 G.Skill Ripjaws V
Video Card(s) EVGA GTX 1060 SC
Storage 500GB Samsung 970 EVO, 500GB Samsung 850 EVO, 1TB Crucial MX300 and 2TB Crucial MX500
Display(s) Dell U3219Q + HP ZR24w
Case Raijintek Thetis
Audio Device(s) Audioquest Dragonfly Red :D
Power Supply Seasonic 620W M12
Mouse Logitech G502 Proteus Core
Keyboard G.Skill KM780R
Software Arch Linux + Win10
Performance. Which is why people are using GPUs in the first place (and why you can now share pointers between GPU code and CPU code).
I very much doubt that. But since I'm out of touch with these APIs, I'll take your word for it.
 
Joined
Apr 24, 2020
Messages
2,560 (1.76/day)
I very much doubt that. But since I'm out of touch with these APIs, I'll take your word for it.



One of the remarkable features of OpenCL™ 2.0 is shared virtual memory (SVM). This feature enables OpenCL developers to write code with extensive use of pointer-linked data structures like linked lists or trees that are shared between the host and a device side of an OpenCL application.

It's been available on CUDA for even longer. I don't remember the history exactly, but OpenCL 2.0 could do this way back in 2013 or so. This feature is probably 10 years old, if we include whenever it started working on CUDA.

-------

EDIT: Think about raytracing and BVH trees. How exactly do you expect the GPU and CPU to share the same BVH tree, unless memory pointers work exactly the same on both systems? This feature is a necessary precursor to RTX and other such features now showing up in today's GPUs.
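
As a sketch of that idea (the field names are hypothetical, not any engine's real layout), the CPU can build the tree and the GPU can walk the very same nodes:

Code:
// Illustrative BVH node allocated in unified/shared virtual memory.
struct BVHNode {
    float    bounds[6];   // AABB min/max corners
    BVHNode* left;        // pointer values meaningful on CPU *and* GPU
    BVHNode* right;
    int      primIndex;   // leaf payload; -1 for interior nodes
};

// GPU-side traversal of the exact tree the CPU built.
__device__ int countLeaves(const BVHNode* node) {
    if (node->primIndex >= 0) return 1;   // leaf
    return countLeaves(node->left) + countLeaves(node->right);
}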
 

bug

Joined
May 22, 2015
Messages
13,213 (4.06/day)
Processor Intel i5-12600k
Motherboard Asus H670 TUF
Cooling Arctic Freezer 34
Memory 2x16GB DDR4 3600 G.Skill Ripjaws V
Video Card(s) EVGA GTX 1060 SC
Storage 500GB Samsung 970 EVO, 500GB Samsung 850 EVO, 1TB Crucial MX300 and 2TB Crucial MX500
Display(s) Dell U3219Q + HP ZR24w
Case Raijintek Thetis
Audio Device(s) Audioquest Dragonfly Red :D
Power Supply Seasonic 620W M12
Mouse Logitech G502 Proteus Core
Keyboard G.Skill KM780R
Software Arch Linux + Win10

I don't doubt you can share pointers; I doubt that it's not better to leverage Rust while at it (even if using unsafe blocks).
 
Joined
Apr 24, 2020
Messages
2,560 (1.76/day)
I don't doubt you can share pointers; I doubt that it's not better to leverage Rust while at it (even if using unsafe blocks).

If you're going to use unsafe Rust anyway, what's the benefit of Rust?

On both AMD ROCm and CUDA systems, we already have good libraries (like CUB) that are accelerating most GPU-style paradigms (https://nvlabs.github.io/cub/). You could rewrite that all in unsafe Rust, but why? What's the benefit?

I'm not even sure how Rust's "ownership model" applies to SIMD-style or prefix-sum style code. This isn't "normal" code that you see in the CPU world... things are written in a different manner entirely. GPU performance characteristics are very alien, and I don't expect Rust's memory model to be efficient on GPUs at all.

The closest code that works like GPUs is OpenMP (which is only implemented for Fortran, C, and C++). Julia is kind of making progress in its own way too, but Rust is nowhere close to usable on GPUs. There's a certain "parallelism mindset" you need to put yourself into to write effective GPU stuff, and I'm just not seeing how Rust supports that mindset in any way.
 

bug

Joined
May 22, 2015
Messages
13,213 (4.06/day)
Processor Intel i5-12600k
Motherboard Asus H670 TUF
Cooling Arctic Freezer 34
Memory 2x16GB DDR4 3600 G.Skill Ripjaws V
Video Card(s) EVGA GTX 1060 SC
Storage 500GB Samsung 970 EVO, 500GB Samsung 850 EVO, 1TB Crucial MX300 and 2TB Crucial MX500
Display(s) Dell U3219Q + HP ZR24w
Case Raijintek Thetis
Audio Device(s) Audioquest Dragonfly Red :D
Power Supply Seasonic 620W M12
Mouse Logitech G502 Proteus Core
Keyboard G.Skill KM780R
Software Arch Linux + Win10
If you're going to use unsafe Rust anyway, what's the benefit of Rust?

On both AMD ROCm and CUDA systems, we already have good libraries (like CUB) that are accelerating most GPU-style paradigms (https://nvlabs.github.io/cub/). You could rewrite that all in unsafe Rust, but why? What's the benefit?

I'm not even sure how Rust's "ownership model" applies to SIMD-style prefix-sum code. This isn't "normal" code that you see in the CPU world... things are written in a different manner entirely. GPU performance characteristics are very alien, and I don't expect Rust's memory model to be efficient on GPUs at all.
The catch is not to rewrite everything in unsafe Rust. The unsafe blocks are there because when you interface with C/ASM code, you cannot actually guarantee much about it. But you can still enforce the rules for the rest of the code.
I mean, look at this: https://github.com/xiph/rav1e
60% assembly code, but it still makes sense to use Rust for the rest. And encoders are a breed of their own; most projects would do just fine with 10% or less unsafe code.
 
Joined
Apr 24, 2020
Messages
2,560 (1.76/day)
The catch is not to rewrite everything in unsafe Rust. The unsafe blocks are there because when you interface with C/ASM code, you cannot actually guarantee much about it. But you can still enforce the rules for the rest of the code.
I mean, look at this: https://github.com/xiph/rav1e
60% assembly code, but it still makes sense to use Rust for the rest. And encoders are a breed of their own; most projects would do just fine with 10% or less unsafe code.

Let's get a bit more technical; I think this high-level discussion we're having is getting in the way.

In CUDA, when you want to get a section of GPU memory, you call cudaMalloc(&gpuPTR, size_of_blah); to fill in a void* gpuPTR. There are many kinds of cudaMalloc, depending on details that are pretty important to performance. This is all CPU-side still; we haven't even touched GPU code yet.

Once you've set up the data structures inside this gpuPTR as appropriate, you can send the pointer to the GPU with a kernel invocation, such as "fooBar<<<2, 64>>>(gpuPTR)", representing 2x64 cudaThreads of fooBar to be run on the GPU, with gpuPTR being passed to all 128 of those cudaThreads. After you call this, your CPU code runs in parallel with the GPU code.

fooBar is any C++ function marked __global__, such as:

Code:
__global__ void fooBar(void* gpuPTR){
    // C++ code here; every cudaThread runs this body with the same gpuPTR
}
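
And for completeness, a host-side sketch matching the description above (error checks omitted; size_of_blah is a stand-in size):

Code:
#include <cuda_runtime.h>

__global__ void fooBar(void* gpuPTR) { /* C++ code here */ }

int main() {
    void*  gpuPTR = nullptr;
    size_t size_of_blah = 1 << 20;
    cudaMalloc(&gpuPTR, size_of_blah);  // CPU-side call, allocates GPU memory
    fooBar<<<2, 64>>>(gpuPTR);          // 2 blocks x 64 = 128 cudaThreads
    // ...CPU code here runs in parallel with the GPU...
    cudaDeviceSynchronize();            // ...until we wait for the kernel
    cudaFree(gpuPTR);
    return 0;
}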

Now, I presume you want to write Rust code for fooBar. Where exactly will you be able to ensure memory safety of the gpuPTR? Which of the 128 threads has "ownership" of the memory? Or do you leave the CPU with ownership?

Also, "CUDA threads" do not have the same characteristics as "real" CPU Threads. Its an abstraction (one that NVidia keeps getting their GPUs closer and closer to over time... but its still not quite perfect). For example, 32-threads is the minimum practical CUDA-thread count. Intra-block threads can communicate effectively, but inter-block threads cannot communicate very easily (we have two thread blocks of 64: meaning thread 0 through 63 can communicate effectively, but thread0 and thread75 cannot. Thread75 is in the block of threads64 to thread128 block).

EDIT: Ah, to finish my point. It seems to me like the entirety of the GPU code (i.e., everything inside the __global__ fooBar function) will be inherently unsafe. Even if you made a safeCudaMalloc() that was managed on the CPU side, the poor communication mechanisms between GPU blocks (i.e., thread 0 vs thread 75) make any such "memory-safety communication" on the GPU side a fool's errand. It doesn't seem like GPU-side code could be written in safe Rust at all, at least in my opinion.
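
To show why single ownership is hard to even state here, consider a bog-standard block-wide reduction (my sketch, assuming a 128-thread block): every thread in the block reads and writes the same shared array, with barriers doing the coordinating instead of an owner.

Code:
__global__ void blockSum(const int* in, int* out) {
    __shared__ int buf[128];                   // shared by all threads in the block
    int tid = threadIdx.x;
    buf[tid] = in[blockIdx.x * blockDim.x + tid];
    __syncthreads();                           // barrier: all writes now visible
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (tid < stride)
            buf[tid] += buf[tid + stride];     // many threads write shared memory
        __syncthreads();
    }
    if (tid == 0) out[blockIdx.x] = buf[0];
}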
 
Last edited:

bug

Joined
May 22, 2015
Messages
13,213 (4.06/day)
Processor Intel i5-12600k
Motherboard Asus H670 TUF
Cooling Arctic Freezer 34
Memory 2x16GB DDR4 3600 G.Skill Ripjaws V
Video Card(s) EVGA GTX 1060 SC
Storage 500GB Samsung 970 EVO, 500GB Samsung 850 EVO, 1TB Crucial MX300 and 2TB Crucial MX500
Display(s) Dell U3219Q + HP ZR24w
Case Raijintek Thetis
Audio Device(s) Audioquest Dragonfly Red :D
Power Supply Seasonic 620W M12
Mouse Logitech G502 Proteus Core
Keyboard G.Skill KM780R
Software Arch Linux + Win10
@dragontamer5788 Well, I'm only a beginner in Rust and my C/C++ days are way behind me, but the thing is, if you want to change that memory, you can't in Rust. Only one reference can have write access at any given time. So you'd need to split that up somehow. If you only need to read that zone and put your results elsewhere, then you can have shared ownership, no problem.
The upside being, of course, that if only one function can have write access, you don't need to worry about concurrent modifications (hence the motto "fearless concurrency").
 
Joined
Apr 24, 2020
Messages
2,560 (1.76/day)
The upside being, of course, that if only one function can have write access, you don't need to worry about concurrent modifications (hence the motto "fearless concurrency").

Given the huge thread counts that are common in GPU code (64 is... very tiny. I've made 1024-sized blocks on a regular basis, and you need 65536 cudaThreads on a Vega64 before you even get decent performance)... it's unlikely that you'll ever be able to guarantee "one function with write access".

Furthermore, a GPU kernel call spawns many identical copies of the same program. There can be a degree of cudaThread communication within a block (and even outside the block, if you're willing to accept severe performance penalties). It's just that efficient execution of code is the primary goal when writing GPU stuff. Which means you're inherently going to have many of your threads reading, AND writing, these blocks of RAM.

The standard "GPU Style" is prefix-sum to coordinate who is writing, and where, to minimize issues. I suggest reading this paper for GPU-Stream Compaction, which is highly efficient, and allows an entire block of cudaThreads (up to 1024 cudaThreads) to efficiently read/write to the same array without stepping on each other's toes: http://www.cse.chalmers.se/~uffe/streamcompaction.pdf. I simply don't see how this kind of (very common) GPU-style of code can ever be written in "Safe Rust".
 

bug

Joined
May 22, 2015
Messages
13,213 (4.06/day)
Processor Intel i5-12600k
Motherboard Asus H670 TUF
Cooling Arctic Freezer 34
Memory 2x16GB DDR4 3600 G.Skill Ripjaws V
Video Card(s) EVGA GTX 1060 SC
Storage 500GB Samsung 970 EVO, 500GB Samsung 850 EVO, 1TB Crucial MX300 and 2TB Crucial MX500
Display(s) Dell U3219Q + HP ZR24w
Case Raijintek Thetis
Audio Device(s) Audioquest Dragonfly Red :D
Power Supply Seasonic 620W M12
Mouse Logitech G502 Proteus Core
Keyboard G.Skill KM780R
Software Arch Linux + Win10
Given the huge thread counts that are common in GPU code (64 is... very tiny. I've made 1024-sized blocks on a regular basis, and you need 65536 threads on a Vega64 before you even get decent performance)... it's unlikely that you'll ever be able to guarantee "one function with write access".
Well, you already do that in C/C++, only you need to synchronize access for that to happen. And when you fail to do that, you get memory corruption and we're back where we started.
But it's true, with Rust you'd need to rethink the code and feed each thread just the data it's supposed to modify.

Somewhat unrelated, but even my novice understanding of Rust's way of thinking has enabled me to model problems much better, even if not programming in Rust.
 
Joined
Apr 24, 2020
Messages
2,560 (1.76/day)
Well, you already do that in C/C++, only you need to synchronize access for that to happen. And when you fail to do that, you get memory corruption and we're back where we started.

GPU style is heavily based on GPU barriers. As long as everyone reads at the same time, then steps together, then writes at the same time, you're actually pretty safe. You also need assurances that all writes go to different locations (which is somewhat straightforward to prove, and happens pretty often in practice). If you cannot prove that all writes go to different locations, you can still have safe concurrent read-modify-writes by using atomic operations (with a severe performance penalty; you don't want to use atomics unless an aliasing issue is at hand).
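
For the aliasing case, the fallback looks something like this hypothetical histogram kernel (my sketch): many threads may hit the same bin, so the read-modify-write goes through an atomic.

Code:
__global__ void histogram(const unsigned char* data, int n, unsigned int* bins) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        atomicAdd(&bins[data[i]], 1u);  // safe even when threads collide on a bin
}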

EDIT:
[Image: parallel stream compaction diagram, edited from the paper linked above]


Just editing a picture from the paper I posted a bit earlier, showing how an array can be stream-compacted in parallel, safely and simply. For clarity, there are 16 threads (0 through 15) compacting this array.

This style is alien to CPU programmers. But once you get used to it, it's surprisingly effective and simple to think about.
 
Last edited:

bug

Joined
May 22, 2015
Messages
13,213 (4.06/day)
Processor Intel i5-12600k
Motherboard Asus H670 TUF
Cooling Arctic Freezer 34
Memory 2x16GB DDR4 3600 G.Skill Ripjaws V
Video Card(s) EVGA GTX 1060 SC
Storage 500GB Samsung 970 EVO, 500GB Samsung 850 EVO, 1TB Crucial MX300 and 2TB Crucial MX500
Display(s) Dell U3219Q + HP ZR24w
Case Raijintek Thetis
Audio Device(s) Audioquest Dragonfly Red :D
Power Supply Seasonic 620W M12
Mouse Logitech G502 Proteus Core
Keyboard G.Skill KM780R
Software Arch Linux + Win10
GPU style is heavily based on GPU barriers. As long as everyone reads at the same time, then steps together, then writes at the same time, you're actually pretty safe.

This style is alien to CPU programmers. But once you get used to it, it's surprisingly effective and simple to think about.
So you only need to get everyone to do three things together, in sequence, at the same time. Super-safe, what can possibly go wrong? :p

Joking aside, you can use C/C++ for the GPU if that's safe enough, while using Rust for the more mundane code elsewhere. That's all I was trying to say.
 