
Flow Computing Parallel Processing Unit (PPU) Architecture Achieves End-to-End CPU Operations in Alpha Testing

Nomad76

News Editor
Staff member
Flow Computing, which licenses on-die, ultra-high-performance parallel computing solutions to CPU vendors across all architectures, today announced that it has reached a critical milestone on its development roadmap toward commercializing its parallel processing ecosystem. The company, which claims its Parallel Processing Unit (PPU) can boost the performance of any CPU architecture by up to 100X, has been developing a compiler that lets existing source code take advantage of the PPU, and that compiler has now entered Alpha testing.

The first target compilations show that simple parallel workloads execute a massive number of loop iterations on RISC-V CPU models without PPU assistance, whereas on RISC-V models incorporating the PPU, simply recompiling the existing code reduces those iterations significantly. The company presents this as demonstrable proof that a PPU-enhanced CPU design can deliver a significant performance boost, up to its claimed 100X.



At its most elemental usage level, the compiler analyzes existing source code to identify the parts that can be effectively sped up by the PPU. It then assigns that parallelizable functionality directly to the PPU, bypassing CPU bottlenecks to achieve improved performance over the baseline CPU architecture.
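At this level the idea resembles classic automatic loop parallelization: the compiler looks for loops whose iterations are independent of one another. A minimal sketch of the distinction, in Python purely for illustration (Flow's compiler targets RISC-V binaries, not Python):

```python
# Illustrative only: the kind of loop a parallelizing compiler can offload
# versus one it cannot.

def scale_parallelizable(xs, k):
    # Independent iterations: out[i] depends only on xs[i], so iterations
    # may run in any order, or all at once on a parallel unit.
    return [k * x for x in xs]

def prefix_sum_sequential(xs):
    # A loop-carried dependence: each iteration reads the previous result,
    # so a naive per-iteration offload cannot speed this up.
    out, acc = [], 0
    for x in xs:
        acc += x
        out.append(acc)
    return out

print(scale_parallelizable([1, 2, 3], 10))  # [10, 20, 30]
print(prefix_sum_sequential([1, 2, 3]))     # [1, 3, 6]
```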

In summary, Flow has now achieved full end-to-end operation: it can compile high-level programs into extended RISC-V binaries and execute them in its gem5-based simulator, which models a PPU integrated into the RISC-V CPU system.

With this milestone reached, full commercialization of the Parallel Processing Unit (PPU) architecture remains on schedule. The next milestone is the completion of the next set of PPU performance modeling over the coming months.

 
This sounds an awful lot like Intel Itanium and its magical compiler, which never managed to deliver despite years of effort. Or like trying to apply a GPU-ish design to classic general-purpose code.
Claiming a 100x increase from parallelism seems a bit much, especially with Amdahl's law to consider.
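To put a number on the Amdahl's law objection: overall speedup is 1 / ((1 - p) + p/s), where p is the parallelizable fraction of runtime and s the speedup applied to it. Even an infinitely fast PPU needs 99% of the workload to be parallelizable just to reach 100x overall. A quick check:

```python
def amdahl_speedup(parallel_fraction, parallel_speedup):
    # Overall speedup when only a fraction of runtime is accelerated.
    serial = 1.0 - parallel_fraction
    return 1.0 / (serial + parallel_fraction / parallel_speedup)

# 99% parallel code with an infinitely fast parallel unit caps at 100x:
print(amdahl_speedup(0.99, float("inf")))       # 100.0
# A more typical 90%-parallel workload caps at 10x no matter what:
print(amdahl_speedup(0.90, float("inf")))       # 10.0
# And a 100x-faster PPU on that 90% yields only about 9.17x overall:
print(round(amdahl_speedup(0.90, 100), 2))      # 9.17
```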
FLOW PPU: Latency of memory references is hidden by executing other threads while accessing the memory. No coherency problems since no caches are placed in the front of the network. Scalability is provided via a high-bandwidth network-on-chip.
How is that different from the superscalar execution and pipelining used by almost every current CPU design? I suppose without caches there's no coherency to worry about, but then what about memory latency?
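For what it's worth, the quoted mechanism reads less like superscalar/out-of-order and more like a barrel processor or GPU-style multithreading: keep enough hardware threads that whenever one stalls on memory, another is ready to issue. A toy cycle-count model (my own assumption about how the scheme works, not Flow's documented microarchitecture) shows why that hides latency without caches:

```python
def total_cycles(n_threads, ops_per_thread, mem_latency):
    # Toy round-robin model: each memory op issues in 1 cycle, then waits
    # mem_latency cycles. With at least mem_latency threads to rotate
    # through, every wait is overlapped and the pipeline never idles.
    total_ops = n_threads * ops_per_thread
    if n_threads >= mem_latency:
        return total_ops  # one issue per cycle, latency fully hidden
    # Too few threads: each issue round is followed by idle cycles.
    idle_per_round = mem_latency - n_threads
    return total_ops + ops_per_thread * idle_per_round

# 1 thread, 10 memory ops, 20-cycle latency: mostly stalled.
print(total_cycles(1, 10, 20))    # 200 cycles for 10 ops
# 20 threads cover the 20-cycle latency: 20x the work, same cycle count.
print(total_cycles(20, 10, 20))   # 200 cycles for 200 ops
```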
FLOW PPU: Synchronizations are needed only once per step since the threads are independent of each other within a step
I see, so the golden path applies only to independent data. That narrows the use cases that can be accelerated.

An interesting idea, but a more complete whitepaper would be useful.
 
Sounds like another startup project that will promise the world and ultimately go nowhere.
 