
GPGPU API for C# that can use multiple GPUs and CPU at the same time

Joined
Apr 13, 2017
Messages
116 (0.05/day)
System Name AMD System
Processor Ryzen 7900 at 180Watts 5650 MHz, vdroop from 1.37V to 1.24V
Motherboard MSI MAG x670 Tomahawk Wifi
Cooling AIO240 for CPU, Wraith Prism's Fan for RAM but suspended above it without touching anything in case.
Memory 32GB dual channel Gskill DDR6000CL30 tuned for CL28, at 1.42Volts
Video Card(s) Msi Ventus 2x Rtx 4070 and Gigabyte Gaming Oc Rtx 4060 ti
Storage Samsung Evo 970
Display(s) Old 1080p 60FPS Samsung
Case Normal atx
Audio Device(s) Dunno
Power Supply 1200Watts
Mouse wireless & quiet
Keyboard wireless & quiet
VR HMD No
Software Windows 11
Benchmark Scores 1750 points in cinebench 2024 42k 43k gpu cpu points in timespy 50+ teraflops total compute power.
"Cekirdekler API" is an open-source project that I recently uploaded to GitHub.

This API helps developers rewrite a bottlenecking hotspot loop or a fairly simple algorithm as C99 code and run it on all selected OpenCL-capable devices at the same time. At each compute iteration, every device gets a share of the work that matches its performance and capabilities. The devices can be GPUs from entirely different vendors and market segments.
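To make the "fair share per iteration" idea concrete, here is a minimal sketch in plain C#. It is not the library's actual balancer: `LoadBalanceSketch`, `Rebalance` and all of its parameters are hypothetical names, and it assumes the caller has already measured how long each device took for its previous share.

```csharp
using System;
using System.Linq;

// Hypothetical helper, not part of Cekirdekler: splits a global range across
// devices in proportion to how fast each one ran in the previous iteration.
public static class LoadBalanceSketch
{
    // lastShares[i] : work items device i processed last iteration
    // lastMillis[i] : time device i needed for that share (assumed to be measured by the caller)
    // localRange    : work-group size; every share stays a multiple of it
    public static int[] Rebalance(int[] lastShares, double[] lastMillis,
                                  int globalRange, int localRange)
    {
        if (globalRange % localRange != 0)
            throw new ArgumentException("globalRange must be a multiple of localRange");

        // Throughput = work items per millisecond observed on each device.
        double[] throughput = lastShares
            .Select((items, i) => items / Math.Max(lastMillis[i], 1e-6))
            .ToArray();
        double total = throughput.Sum();
        if (total <= 0)
            throw new ArgumentException("at least one device must have done some work");

        int groups = globalRange / localRange;
        int[] shares = new int[throughput.Length];
        int assigned = 0;
        for (int i = 0; i < shares.Length; i++)
        {
            // Whole work groups only, rounded down.
            int g = (int)Math.Floor(groups * throughput[i] / total);
            shares[i] = g * localRange;
            assigned += g;
        }

        // Work groups lost to rounding go to the fastest device.
        int fastest = Array.IndexOf(throughput, throughput.Max());
        shares[fastest] += (groups - assigned) * localRange;
        return shares;
    }
}
```

For example, if two devices each processed 500 items and took 10 ms and 30 ms, calling `Rebalance` with a global range of 1000 and a local range of 100 gives the faster device 800 items and the slower one 200. A real balancer would probably also smooth the measurements over several iterations so that a single noisy timing does not swing the split.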

You can find it on GitHub:

(wiki) https://github.com/tugrul512bit/Cekirdekler/wiki

(download) https://github.com/tugrul512bit/Cekirdekler

There is also a short tutorial about it here:

https://www.codeproject.com/Articles/1181213/Easy-OpenCL-Multiple-Device-Load-Balancing-and-Pip

The traditional hello-world looks like this:

Code:
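// Select the GPU devices and compile the kernel source for them.
ClNumberCruncher cr = new ClNumberCruncher(
    AcceleratorType.GPU, @"
        __kernel void hello(__global char * arr)
        {
            printf(""hello world"");
        }
    ");

// 1000-element array that is passed to the kernel as 'arr'.
ClArray<byte> array = new ClArray<byte>(1000);

// Arguments: number cruncher, compute id, kernel name,
// global range (1000 work items), local range (100 per work group).
// Every work item calls printf, so expect the message once per work item.
array.compute(cr, 1, "hello", 1000, 100);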
 
Here you can see the load balancer in action

 
As of version 1.2.0, the device-to-device pipelining feature is working.

If more than one OpenCL kernel needs to run consecutively and none of them can be distributed across multiple GPUs, this new feature can run all of them at the same time as stages of a single pipeline, with double buffering to overlap computation with data movement between stages.



Each stage is built from a list of kernels, input and output arrays, and an OpenCL device. The stages are then linked together to create a pipeline that runs whenever client code pushes data into its entrance. Each push makes a new result pop out of the end.
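The push/pop behaviour can be modelled in a few lines of plain C#. This is only a sketch of the idea, not the library's pipeline classes: `PipelineSketch`, `AddStage` and `Push` are hypothetical names, each "kernel" is just a delegate, and the stages run one after another here, whereas on real hardware they would run concurrently on their own devices.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical model, not the Cekirdekler pipeline classes: every stage keeps
// its own input slot, so all stages can compute in the same step while the
// previous outputs are handed one stage forward.
public sealed class PipelineSketch
{
    private readonly List<Func<int[], int[]>> stages = new List<Func<int[], int[]>>();
    private readonly List<int[]> inputs = new List<int[]>(); // data each stage works on next

    public void AddStage(Func<int[], int[]> kernel)
    {
        stages.Add(kernel);
        inputs.Add(null); // empty until the pipeline fills up
    }

    // Pushes one array into the entrance and returns the array leaving the
    // last stage, or null while the pipeline is still filling.
    public int[] Push(int[] data)
    {
        int n = stages.Count;
        if (n == 0) return data; // nothing to do without stages

        // 1) Compute: each stage processes the input it already holds.
        int[][] outputs = new int[n][];
        for (int i = 0; i < n; i++)
            outputs[i] = inputs[i] == null ? null : stages[i](inputs[i]);

        // 2) Move: each output becomes the next stage's input,
        //    and the new data enters stage 0.
        for (int i = n - 1; i > 0; i--)
            inputs[i] = outputs[i - 1];
        inputs[0] = data;

        return outputs[n - 1];
    }
}
```

In this model a pipeline with N stages returns null for the first N pushes, and after that every push returns the finished result of the array pushed N steps earlier. The latency of the real feature may differ, so check the wiki page for the actual numbers.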


https://github.com/tugrul512bit/Cekirdekler/wiki/Pipelining:-Device-to-Device
 

It now has a batch computing option with task-pool and device-pool features.


It uses all pipelines of a GPU, and multi-GPU scaling is better than with load balancing, even with asymmetric GPU setups.
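One way to picture why a pool can scale well on mismatched GPUs is a shared queue that devices pull from whenever they go idle. The simulation below is a rough sketch under that assumption. It does not use the library's task-pool or device-pool classes: `TaskPoolSketch`, `Simulate`, the task costs and the device speeds are all made up for illustration.

```csharp
using System;

// Hypothetical simulation, not Cekirdekler's task-pool or device-pool classes.
public static class TaskPoolSketch
{
    // taskCosts    : abstract work units per task, in queue order
    // deviceSpeeds : work units per millisecond for each device (assumed known here)
    // Returns how many tasks each device ended up executing.
    public static int[] Simulate(double[] taskCosts, double[] deviceSpeeds)
    {
        var busyUntil = new double[deviceSpeeds.Length]; // time at which each device goes idle
        var taken = new int[deviceSpeeds.Length];

        foreach (double cost in taskCosts)
        {
            // The next task goes to whichever device is idle first, which is
            // how a shared queue behaves when devices pull work from it.
            int d = 0;
            for (int i = 1; i < busyUntil.Length; i++)
                if (busyUntil[i] < busyUntil[d]) d = i;

            busyUntil[d] += cost / deviceSpeeds[d];
            taken[d]++;
        }
        return taken;
    }
}
```

With ten equal tasks and two devices where one is four times faster than the other, the faster device ends up with eight tasks and the slower one with two, without anyone having to compute a split up front.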
 
The OpenCL 2.0 dynamic parallelism feature can now be used as well.

 