
Intel Partners with Heidelberg University Computing Center to Establish oneAPI Academic Center of Excellence

AleksandarK

News Editor
Staff member
Joined
Aug 19, 2017
Messages
2,224 (0.91/day)
Intel and Heidelberg University Computing Center (URZ) today announced that they have established a oneAPI Academic Center of Excellence (CoE) at URZ. The newly established CoE aims to further develop Intel's oneAPI standard and enable it to work on AMD GPUs. That may come as a bit of a surprise, but Intel believes the technology should work on a wide range of processors, no matter the vendor. Heterogeneous hardware programming is the main goal here. In a Twitter thread, an Intel employee specifies that Intel has also been working with Arm and NVIDIA to bring Data-Parallel C++ (DPC++), the core of oneAPI, to those vendors as well. That should bring this universal programming model to every device and platform, which is the goal of heterogeneous programming: whatever you need to program, be it a CPU, GPU, or some other ASIC, is covered by a single API, specifically oneAPI.

URZ said:
URZ's work as a oneAPI CoE will add advanced DPC++ capabilities into hipSYCL, which supports systems based on AMD GPUs, NVIDIA GPUs, and CPUs. New DPC++ extensions are part of the SYCL 2020 provisional specification that brings features such as unified shared memory to hipSYCL and the platforms it supports - furthering the promise of oneAPI application support across system architectures and vendors.


View at TechPowerUp Main Site
 
Joined
Jul 16, 2014
Messages
8,116 (2.28/day)
Location
SE Michigan
System Name Dumbass
Processor AMD Ryzen 7800X3D
Motherboard ASUS TUF gaming B650
Cooling Arctic Liquid Freezer 2 - 420mm
Memory G.Skill Sniper 32gb DDR5 6000
Video Card(s) GreenTeam 4070 ti super 16gb
Storage Samsung EVO 500gb & 1Tb, 2tb HDD, 500gb WD Black
Display(s) 1x Nixeus NX_EDG27, 2x Dell S2440L (16:9)
Case Phanteks Enthoo Primo w/8 140mm SP Fans
Audio Device(s) onboard (realtek?) - SPKRS:Logitech Z623 200w 2.1
Power Supply Corsair HX1000i
Mouse SteelSeries Esports Wireless
Keyboard Corsair K100
Software windows 10 H
Benchmark Scores https://i.imgur.com/aoz3vWY.jpg?2
oneAPI will fail; the 'vendors' will see to it.
 
Joined
Jan 8, 2017
Messages
8,929 (3.36/day)
System Name Good enough
Processor AMD Ryzen R9 7900 - Alphacool Eisblock XPX Aurora Edge
Motherboard ASRock B650 Pro RS
Cooling 2x 360mm NexXxoS ST30 X-Flow, 1x 360mm NexXxoS ST30, 1x 240mm NexXxoS ST30
Memory 32GB - FURY Beast RGB 5600 Mhz
Video Card(s) Sapphire RX 7900 XT - Alphacool Eisblock Aurora
Storage 1x Kingston KC3000 1TB 1x Kingston A2000 1TB, 1x Samsung 850 EVO 250GB , 1x Samsung 860 EVO 500GB
Display(s) LG UltraGear 32GN650-B + 4K Samsung TV
Case Phanteks NV7
Power Supply GPS-750C
There is SYCL, there is also ACC, and who knows what else; what makes this special, I don't know.
 

bug

Joined
May 22, 2015
Messages
13,213 (4.06/day)
Processor Intel i5-12600k
Motherboard Asus H670 TUF
Cooling Arctic Freezer 34
Memory 2x16GB DDR4 3600 G.Skill Ripjaws V
Video Card(s) EVGA GTX 1060 SC
Storage 500GB Samsung 970 EVO, 500GB Samsung 850 EVO, 1TB Crucial MX300 and 2TB Crucial MX500
Display(s) Dell U3219Q + HP ZR24w
Case Raijintek Thetis
Audio Device(s) Audioquest Dragonfly Red :D
Power Supply Seasonic 620W M12
Mouse Logitech G502 Proteus Core
Keyboard G.Skill KM780R
Software Arch Linux + Win10
I would have picked Rust over C++ for this, simply because security across so many different devices is going to be a b*tch.
 

Cheeseball

Not a Potato
Supporter
Joined
Jan 2, 2009
Messages
1,842 (0.33/day)
Location
Pittsburgh, PA
System Name Titan
Processor AMD Ryzen™ 7 7950X3D
Motherboard ASUS ROG Strix X670E-I Gaming WiFi
Cooling ID-COOLING SE-207-XT Slim Snow
Memory TEAMGROUP T-Force Delta RGB 2x16GB DDR5-6000 CL30
Video Card(s) ASRock Radeon RX 7900 XTX 24 GB GDDR6 (MBA)
Storage 2TB Samsung 990 Pro NVMe
Display(s) AOpen Fire Legend 24" (25XV2Q), Dough Spectrum One 27" (Glossy), LG C4 42" (OLED42C4PUA)
Case ASUS Prime AP201 33L White
Audio Device(s) Kanto Audio YU2 and SUB8 Desktop Speakers and Subwoofer, Cloud Alpha Wireless
Power Supply Corsair SF1000L
Mouse Logitech Pro Superlight (White), G303 Shroud Edition
Keyboard Wooting 60HE / NuPhy Air75 v2
VR HMD Oculus Quest 2 128GB
Software Windows 11 Pro 64-bit 23H2 Build 22631.3447
There is SYCL, there is also ACC, and who knows what else; what makes this special, I don't know.

oneAPI currently is basically SYCL with certain Intel extensions (which also run on AMD/NVIDIA hardware). The extensions include access to certain AVX2 and AVX-512 instructions and their subsets.
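
For reference, here's a bare-bones sketch of what that looks like in DPC++/SYCL (my own toy example, not from the announcement; it assumes a SYCL 2020 toolchain like Intel's dpcpp or hipSYCL, and uses the unified shared memory feature the quoted announcement mentions):

Code:
#include <sycl/sycl.hpp>

int main() {
    sycl::queue q;                               // picks a default device
    const int n = 1024;
    int* data = sycl::malloc_shared<int>(n, q);  // USM: visible to host and device
    q.parallel_for(sycl::range<1>(n), [=](sycl::id<1> i) {
        data[i] = static_cast<int>(i[0]) * 2;    // runs on the device
    }).wait();
    sycl::free(data, q);
    return 0;
}

The same source can target Intel, AMD, or NVIDIA hardware depending on which backend the compiler was built with, which is the whole pitch here.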
 
Joined
Aug 8, 2019
Messages
430 (0.25/day)
System Name R2V2 *In Progress
Processor Ryzen 7 2700
Motherboard Asrock X570 Taichi
Cooling W2A... water to air
Memory G.Skill Trident Z3466 B-die
Video Card(s) Radeon VII repaired and resurrected
Storage Adata and Samsung NVME
Display(s) Samsung LCD
Case Some ThermalTake
Audio Device(s) Asus Strix RAID DLX upgraded op amps
Power Supply Seasonic Prime something or other
Software Windows 10 Pro x64
If oneAPI gains traction, CUDA is screwed.

CUDA only prospers as long as the only options are the broken and painful OpenCL and Vulkan Compute. Both of which NV makes sure remain broken.

If AMD enables oneAPI on GCN and RDNA/CDNA... the 'CUDA is easier' argument goes away. You get more compute grunt for far less... NV has got to be sweating, because CUDA drives a lot of money.
 

bug

Joined
May 22, 2015
Messages
13,213 (4.06/day)
Processor Intel i5-12600k
Motherboard Asus H670 TUF
Cooling Arctic Freezer 34
Memory 2x16GB DDR4 3600 G.Skill Ripjaws V
Video Card(s) EVGA GTX 1060 SC
Storage 500GB Samsung 970 EVO, 500GB Samsung 850 EVO, 1TB Crucial MX300 and 2TB Crucial MX500
Display(s) Dell U3219Q + HP ZR24w
Case Raijintek Thetis
Audio Device(s) Audioquest Dragonfly Red :D
Power Supply Seasonic 620W M12
Mouse Logitech G502 Proteus Core
Keyboard G.Skill KM780R
Software Arch Linux + Win10
If oneAPI gains traction, CUDA is screwed.

CUDA only prospers as long as the only options are the broken and painful OpenCL and Vulkan Compute. Both of which NV makes sure remain broken.

If AMD enables oneAPI on GCN and RDNA/CDNA... the 'CUDA is easier' argument goes away. You get more compute grunt for far less... NV has got to be sweating, because CUDA drives a lot of money.
Not sure how Nvidia breaks those. I haven't read about Vulkan Compute, but Nvidia implemented OpenCL years ago. Still, everybody prefers CUDA.
 
Joined
Apr 24, 2020
Messages
2,560 (1.76/day)
I would have picked Rust over C++ for this, simply because security across so many different devices is going to be a b*tch.

These GPU APIs are all about sharing pointers, tree structures, and more between CPU and GPU without skipping a beat. You can have a giant graph created by your CPU code, copied to the GPU, and then the pointers traversed in parallel on the GPU these days (NVidia, Intel, or AMD).

Rust is good for memory safety, but really bad for actual pointer arithmetic, pointer sharing, and other such details. And this kind of pointer sharing is very important for performance across heterogeneous systems. A C-like language with raw pointer manipulation is absolutely necessary, and C++ is the most advanced language that supports that level of detail.
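
To make the pointer sharing concrete, here's a minimal sketch (my own illustration, not vendor sample code; the sumList kernel name is made up) of a CPU-built linked structure traversed by GPU code through CUDA unified memory:

Code:
#include <cstdio>
#include <cuda_runtime.h>

struct Node {
    int   value;
    Node* next;   // a raw pointer, valid on both CPU and GPU
};

__global__ void sumList(Node* head, int* out) {
    int sum = 0;
    for (Node* n = head; n != nullptr; n = n->next)
        sum += n->value;                  // GPU dereferences CPU-built pointers
    *out = sum;
}

int main() {
    Node* head = nullptr;
    for (int i = 0; i < 4; i++) {         // build the list on the CPU
        Node* n;
        cudaMallocManaged(&n, sizeof(Node));
        n->value = i;
        n->next  = head;
        head     = n;
    }
    int* out;
    cudaMallocManaged(&out, sizeof(int));
    sumList<<<1, 1>>>(head, out);         // same pointers, no marshalling step
    cudaDeviceSynchronize();
    printf("sum = %d\n", *out);
    return 0;
}

Note how nothing translates or serializes the pointers: the addresses the CPU wrote into next are the addresses the GPU chases.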
 

bug

Joined
May 22, 2015
Messages
13,213 (4.06/day)
Processor Intel i5-12600k
Motherboard Asus H670 TUF
Cooling Arctic Freezer 34
Memory 2x16GB DDR4 3600 G.Skill Ripjaws V
Video Card(s) EVGA GTX 1060 SC
Storage 500GB Samsung 970 EVO, 500GB Samsung 850 EVO, 1TB Crucial MX300 and 2TB Crucial MX500
Display(s) Dell U3219Q + HP ZR24w
Case Raijintek Thetis
Audio Device(s) Audioquest Dragonfly Red :D
Power Supply Seasonic 620W M12
Mouse Logitech G502 Proteus Core
Keyboard G.Skill KM780R
Software Arch Linux + Win10
These GPU APIs are all about sharing pointers, tree structures, and more between CPU and GPU without skipping a beat. You can have a giant graph created by your CPU code, copied to the GPU, and then the pointers traversed in parallel on the GPU these days (NVidia, Intel, or AMD).

Rust is good for memory safety, but really bad for actual pointer arithmetic, pointer sharing, and other such details. And this kind of pointer sharing is very important for performance across heterogeneous systems. A C-like language with raw pointer manipulation is absolutely necessary, and C++ is the most advanced language that supports that level of detail.
Exactly. It's that pointer arithmetic that causes all sorts of hard-to-trace crashes (or worse, silent errors), so why would they still drag it around?
 
Joined
Apr 24, 2020
Messages
2,560 (1.76/day)
Exactly. It's that pointer arithmetic that causes all sorts of hard-to-trace crashes (or worse, silent errors), so why would they still drag it around?

Performance. Which is why people are using GPUs in the first place (and why you can now share pointers between GPU code and CPU code).
 

bug

Joined
May 22, 2015
Messages
13,213 (4.06/day)
Processor Intel i5-12600k
Motherboard Asus H670 TUF
Cooling Arctic Freezer 34
Memory 2x16GB DDR4 3600 G.Skill Ripjaws V
Video Card(s) EVGA GTX 1060 SC
Storage 500GB Samsung 970 EVO, 500GB Samsung 850 EVO, 1TB Crucial MX300 and 2TB Crucial MX500
Display(s) Dell U3219Q + HP ZR24w
Case Raijintek Thetis
Audio Device(s) Audioquest Dragonfly Red :D
Power Supply Seasonic 620W M12
Mouse Logitech G502 Proteus Core
Keyboard G.Skill KM780R
Software Arch Linux + Win10
Performance. Which is why people are using GPUs in the first place (and why you can now share pointers between GPU code and CPU code).
I very much doubt that. But since I'm out of touch with these APIs, I'll take your word for it.
 
Joined
Apr 24, 2020
Messages
2,560 (1.76/day)
I very much doubt that. But since I'm out of touch with these APIs, I'll take your word for it.



One of the remarkable features of OpenCL™ 2.0 is shared virtual memory (SVM). This feature enables OpenCL developers to write code with extensive use of pointer-linked data structures like linked lists or trees that are shared between the host and a device side of an OpenCL application.

It's been available on CUDA for even longer. I don't remember the history exactly, but OpenCL 2.0 could do this way back in 2013 or so. This feature is probably 10 years old, if we include whenever it started working on CUDA.

-------

EDIT: Think about raytracing and BVH trees. How exactly do you expect the GPU and CPU to share the same BVH tree, unless memory pointers work exactly the same on both systems? This feature is a necessary precursor to RTX and other such features now showing up in today's GPUs.
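
As a sketch of that idea (the field names are hypothetical, not any engine's real layout), the CPU can build the tree and the GPU can walk the very same nodes:

Code:
// Illustrative BVH node allocated in unified/shared virtual memory.
struct BVHNode {
    float    bounds[6];   // AABB min/max corners
    BVHNode* left;        // pointer values meaningful on CPU *and* GPU
    BVHNode* right;
    int      primIndex;   // leaf payload; -1 for interior nodes
};

// GPU-side traversal of the exact tree the CPU built.
__device__ int countLeaves(const BVHNode* node) {
    if (node->primIndex >= 0) return 1;   // leaf
    return countLeaves(node->left) + countLeaves(node->right);
}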
 

bug

Joined
May 22, 2015
Messages
13,213 (4.06/day)
Processor Intel i5-12600k
Motherboard Asus H670 TUF
Cooling Arctic Freezer 34
Memory 2x16GB DDR4 3600 G.Skill Ripjaws V
Video Card(s) EVGA GTX 1060 SC
Storage 500GB Samsung 970 EVO, 500GB Samsung 850 EVO, 1TB Crucial MX300 and 2TB Crucial MX500
Display(s) Dell U3219Q + HP ZR24w
Case Raijintek Thetis
Audio Device(s) Audioquest Dragonfly Red :D
Power Supply Seasonic 620W M12
Mouse Logitech G502 Proteus Core
Keyboard G.Skill KM780R
Software Arch Linux + Win10

I don't doubt you can share pointers; I doubt that it's not better to leverage Rust while at it (even if using unsafe blocks).
 
Joined
Apr 24, 2020
Messages
2,560 (1.76/day)
I don't doubt you can share pointers; I doubt that it's not better to leverage Rust while at it (even if using unsafe blocks).

If you're going to use unsafe Rust anyway, what's the benefit of Rust?

On both AMD ROCm and CUDA systems, we already have good libraries (like CUB) that are accelerating most GPU-style paradigms (https://nvlabs.github.io/cub/). You could rewrite that all in unsafe Rust, but why? What's the benefit?

I'm not even sure how Rust's "ownership model" applies to SIMD-style or prefix-sum style code. This isn't "normal" code that you see in the CPU world... things are written in a different manner entirely. GPU performance characteristics are very alien, and I don't expect Rust's memory model to be efficient on GPUs at all.

The closest code that works like GPUs is OpenMP (which is only implemented for Fortran, C, and C++). Julia is kind of making progress in its own way too, but Rust is nowhere close to usable on GPUs. There's a certain "parallelism mindset" you need to put yourself into to write effective GPU stuff, and I'm just not seeing how Rust supports that mindset in any way.
 

bug

Joined
May 22, 2015
Messages
13,213 (4.06/day)
Processor Intel i5-12600k
Motherboard Asus H670 TUF
Cooling Arctic Freezer 34
Memory 2x16GB DDR4 3600 G.Skill Ripjaws V
Video Card(s) EVGA GTX 1060 SC
Storage 500GB Samsung 970 EVO, 500GB Samsung 850 EVO, 1TB Crucial MX300 and 2TB Crucial MX500
Display(s) Dell U3219Q + HP ZR24w
Case Raijintek Thetis
Audio Device(s) Audioquest Dragonfly Red :D
Power Supply Seasonic 620W M12
Mouse Logitech G502 Proteus Core
Keyboard G.Skill KM780R
Software Arch Linux + Win10
If you're going to use unsafe Rust anyway, what's the benefit of Rust?

On both AMD ROCm and CUDA systems, we already have good libraries (like CUB) that are accelerating most GPU-style paradigms (https://nvlabs.github.io/cub/). You could rewrite that all in unsafe Rust, but why? What's the benefit?

I'm not even sure how Rust's "ownership model" applies to SIMD-style prefix-sum code. This isn't "normal" code that you see in the CPU world... things are written in a different manner entirely. GPU performance characteristics are very alien, and I don't expect Rust's memory model to be efficient on GPUs at all.
The catch is not to rewrite everything in unsafe Rust. The unsafe blocks are there because when you interface with C/ASM code, you cannot actually guarantee much about it. But you can still enforce the rules for the rest of the code.
I mean, look at this: https://github.com/xiph/rav1e
60% assembly code, but it still makes sense to use Rust for the rest. And encoders are a breed of their own; most projects would do just fine with 10% or less unsafe code.
 
Joined
Apr 24, 2020
Messages
2,560 (1.76/day)
The catch is not to rewrite everything in unsafe Rust. The unsafe blocks are there because when you interface with C/ASM code, you cannot actually guarantee much about it. But you can still enforce the rules for the rest of the code.
I mean, look at this: https://github.com/xiph/rav1e
60% assembly code, but it still makes sense to use Rust for the rest. And encoders are a breed of their own; most projects would do just fine with 10% or less unsafe code.

Let's get a bit more technical; I think this high-level discussion we're having is getting in the way.

In CUDA, when you want to get a section of GPU memory, you call cudaMalloc(&gpuPTR, size_of_blah); to fill in a void* gpuPTR. There are many kinds of cudaMalloc, depending on details that are pretty important to performance. This is all CPU-side still; we haven't even touched GPU code yet.

Once you've set up the data structures inside this gpuPTR as appropriate, you can send the pointer to the GPU with a kernel invocation, such as "fooBar<<<2, 64>>>(gpuPTR)", representing 2x64 cudaThreads of fooBar to be run on the GPU, with gpuPTR being passed to all 128 of those cudaThreads. After you call this, your CPU code runs in parallel with the GPU code.

fooBar is any C++ function marked __global__, such as:

Code:
__global__ void fooBar(void* gpuPTR){
    // C++ code here; every cudaThread runs this body with the same gpuPTR
}
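
And for completeness, a host-side sketch matching the description above (error checks omitted; size_of_blah is a stand-in size):

Code:
#include <cuda_runtime.h>

__global__ void fooBar(void* gpuPTR) { /* C++ code here */ }

int main() {
    void*  gpuPTR = nullptr;
    size_t size_of_blah = 1 << 20;
    cudaMalloc(&gpuPTR, size_of_blah);  // CPU-side call, allocates GPU memory
    fooBar<<<2, 64>>>(gpuPTR);          // 2 blocks x 64 = 128 cudaThreads
    // ...CPU code here runs in parallel with the GPU...
    cudaDeviceSynchronize();            // ...until we wait for the kernel
    cudaFree(gpuPTR);
    return 0;
}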

Now, I presume you want to write Rust code for fooBar. Where exactly will you be able to ensure memory safety of the gpuPTR? Which of the 128 threads has "ownership" of the memory? Or do you leave the CPU with ownership?

Also, "CUDA threads" do not have the same characteristics as "real" CPU Threads. Its an abstraction (one that NVidia keeps getting their GPUs closer and closer to over time... but its still not quite perfect). For example, 32-threads is the minimum practical CUDA-thread count. Intra-block threads can communicate effectively, but inter-block threads cannot communicate very easily (we have two thread blocks of 64: meaning thread 0 through 63 can communicate effectively, but thread0 and thread75 cannot. Thread75 is in the block of threads64 to thread128 block).

EDIT: Ah, to finish my point. It seems to me like the entirety of the GPU code (i.e., everything inside the __global__ fooBar function) will be inherently unsafe. Even if you made a safeCudaMalloc() that was managed on the CPU side, the poor communication mechanisms between GPU blocks (i.e., thread 0 vs thread 75) make any such "memory-safety communication" on the GPU side a fool's errand. It doesn't seem like GPU-side code could be written in safe Rust at all, at least in my opinion.
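
To show why single ownership is hard to even state here, consider a bog-standard block-wide reduction (my sketch, assuming a 128-thread block): every thread in the block reads and writes the same shared array, with barriers doing the coordinating instead of an owner.

Code:
__global__ void blockSum(const int* in, int* out) {
    __shared__ int buf[128];                   // shared by all threads in the block
    int tid = threadIdx.x;
    buf[tid] = in[blockIdx.x * blockDim.x + tid];
    __syncthreads();                           // barrier: all writes now visible
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (tid < stride)
            buf[tid] += buf[tid + stride];     // many threads write shared memory
        __syncthreads();
    }
    if (tid == 0) out[blockIdx.x] = buf[0];
}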
 
Last edited:

bug

Joined
May 22, 2015
Messages
13,213 (4.06/day)
Processor Intel i5-12600k
Motherboard Asus H670 TUF
Cooling Arctic Freezer 34
Memory 2x16GB DDR4 3600 G.Skill Ripjaws V
Video Card(s) EVGA GTX 1060 SC
Storage 500GB Samsung 970 EVO, 500GB Samsung 850 EVO, 1TB Crucial MX300 and 2TB Crucial MX500
Display(s) Dell U3219Q + HP ZR24w
Case Raijintek Thetis
Audio Device(s) Audioquest Dragonfly Red :D
Power Supply Seasonic 620W M12
Mouse Logitech G502 Proteus Core
Keyboard G.Skill KM780R
Software Arch Linux + Win10
@dragontamer5788 Well, I'm only a beginner in Rust and my C/C++ days are way behind me, but the thing is, if you want to change that memory, you can't in Rust. Only one reference can have write access at any given time. So you'd need to split that up somehow. If you only need to read that zone and put your results elsewhere, then you can have shared ownership, no problem.
The upside being, of course, that if only one function can have write access, you don't need to worry about concurrent modifications (hence the motto "fearless concurrency").
 
Joined
Apr 24, 2020
Messages
2,560 (1.76/day)
The upside being, of course, that if only one function can have write access, you don't need to worry about concurrent modifications (hence the motto "fearless concurrency").

Given the huge thread counts that are common in GPU code (64 is... very tiny. I've made 1024-sized blocks on a regular basis, and you need 65536 cudaThreads on a Vega64 before you even get decent performance)... it's unlikely that you'll ever be able to guarantee "one function with write access".

Furthermore, a GPU kernel call spawns many identical copies of the same program. There can be a degree of cudaThread communication within a block (and even outside the block, if you're willing to accept severe performance penalties). It's just that efficient execution of code is the primary goal when writing GPU stuff. Which means you're inherently going to have many of your threads reading, AND writing, these blocks of RAM.

The standard "GPU Style" is prefix-sum to coordinate who is writing, and where, to minimize issues. I suggest reading this paper for GPU-Stream Compaction, which is highly efficient, and allows an entire block of cudaThreads (up to 1024 cudaThreads) to efficiently read/write to the same array without stepping on each other's toes: http://www.cse.chalmers.se/~uffe/streamcompaction.pdf. I simply don't see how this kind of (very common) GPU-style of code can ever be written in "Safe Rust".
 

bug

Joined
May 22, 2015
Messages
13,213 (4.06/day)
Processor Intel i5-12600k
Motherboard Asus H670 TUF
Cooling Arctic Freezer 34
Memory 2x16GB DDR4 3600 G.Skill Ripjaws V
Video Card(s) EVGA GTX 1060 SC
Storage 500GB Samsung 970 EVO, 500GB Samsung 850 EVO, 1TB Crucial MX300 and 2TB Crucial MX500
Display(s) Dell U3219Q + HP ZR24w
Case Raijintek Thetis
Audio Device(s) Audioquest Dragonfly Red :D
Power Supply Seasonic 620W M12
Mouse Logitech G502 Proteus Core
Keyboard G.Skill KM780R
Software Arch Linux + Win10
Given the huge thread counts that are common in GPU code (64 is... very tiny. I've made 1024-sized blocks on a regular basis, and you need 65536 threads on a Vega64 before you even get decent performance)... it's unlikely that you'll ever be able to guarantee "one function with write access".
Well, you already do that in C/C++, only you need to synchronize access for that to happen. And when you fail to do that, you get memory corruption and we're back where we started.
But it's true, with Rust you'd need to rethink the code and feed each thread just the data it's supposed to modify.

Somewhat unrelated, but even my novice understanding of Rust's way of thinking has enabled me to model problems much better, even if not programming in Rust.
 
Joined
Apr 24, 2020
Messages
2,560 (1.76/day)
Well, you already do that in C/C++, only you need to synchronize access for that to happen. And when you fail to do that, you get memory corruption and we're back where we started.

GPU style is heavily based on GPU barriers. As long as everyone reads at the same time, then steps together, then writes at the same time, you're actually pretty safe. You also need assurances that all writes go to different locations (which is somewhat straightforward to prove, and happens pretty often in practice). If you cannot prove that all writes go to different locations, you can still have safe concurrent read-modify-writes by using atomic operations (with a severe performance penalty; you don't want to use atomics unless an aliasing issue is at hand).
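
For the aliasing case, the fallback looks something like this hypothetical histogram kernel (my sketch): many threads may hit the same bin, so the read-modify-write goes through an atomic.

Code:
__global__ void histogram(const unsigned char* data, int n, unsigned int* bins) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        atomicAdd(&bins[data[i]], 1u);  // safe even when threads collide on a bin
}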

EDIT:
[Image: parallel stream compaction diagram, edited from the paper linked above]


Just editing a picture from the paper I posted a bit earlier, showing how an array can be stream-compacted in parallel, safely and simply. For clarity, there are 16 threads (0 through 15) compacting this array.

This style is alien to CPU programmers. But once you get used to it, it's surprisingly effective and simple to think about.
 
Last edited:

bug

Joined
May 22, 2015
Messages
13,213 (4.06/day)
Processor Intel i5-12600k
Motherboard Asus H670 TUF
Cooling Arctic Freezer 34
Memory 2x16GB DDR4 3600 G.Skill Ripjaws V
Video Card(s) EVGA GTX 1060 SC
Storage 500GB Samsung 970 EVO, 500GB Samsung 850 EVO, 1TB Crucial MX300 and 2TB Crucial MX500
Display(s) Dell U3219Q + HP ZR24w
Case Raijintek Thetis
Audio Device(s) Audioquest Dragonfly Red :D
Power Supply Seasonic 620W M12
Mouse Logitech G502 Proteus Core
Keyboard G.Skill KM780R
Software Arch Linux + Win10
GPU style is heavily based on GPU barriers. As long as everyone reads at the same time, then steps together, then writes at the same time, you're actually pretty safe.

This style is alien to CPU programmers. But once you get used to it, it's surprisingly effective and simple to think about.
So you only need to get everyone to do three things together, in sequence, at the same time. Super-safe, what can possibly go wrong? :p

Joking aside, you can use C/C++ for the GPU if that's safe enough, while using Rust for the more mundane code elsewhere. That's all I was trying to say.
 