Monday, June 23rd 2025
Researchers Unveil Real-Time GPU-Only Pipeline for Fully Procedural Trees
A research team from Coburg University of Applied Sciences and Arts in Germany, together with AMD Germany, has introduced a game-changing approach to procedural tree creation that runs entirely on the GPU, delivering speed and flexibility unlike anything we've seen before. Showcased at High-Performance Graphics 2025 in Copenhagen, the new pipeline uses DirectX 12 work graphs and mesh nodes to construct detailed tree models on the fly, without any CPU muscle. Artists and developers can tweak more than 150 parameters in real time, covering everything from seasonal leaf color shifts and branch pruning styles to complex animations and automatic level-of-detail adjustments. When tested on an AMD Radeon RX 7900 XTX, the system generated unique tree geometries and pushed them into the geometry buffer in just over three milliseconds. It then automatically tunes detail levels to maintain a target frame rate, holding a stable 120 FPS under heavy workloads.
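To give a feel for how such a pipeline can be organized, here is a rough CPU-side sketch of the kind of recursive expansion a tree-generation work graph performs on the GPU: each node invocation either spawns child branch records or, at the deepest level, emits leaf geometry. The TreeParams fields, record layout, and branching logic are illustrative assumptions, not the authors' actual node setup.

```cpp
// Illustrative only: models one "node" of a tree-generation work graph on the
// CPU. The real pipeline runs this expansion entirely on the GPU and feeds
// mesh nodes; names and parameters here are assumptions for illustration.
#include <cstdint>
#include <random>
#include <vector>

struct TreeParams {            // stand-in for a few of the ~150 parameters
    uint32_t branchesPerNode;  // fan-out at each recursion level
    uint32_t depth;            // trunk -> branches -> twigs -> leaves
    float    pruneChance;      // probability a child branch gets culled
};

struct BranchRecord {          // payload passed from one node to its children
    uint32_t level;
    uint64_t seed;             // deterministic seed keeps each tree stable
};

// One invocation: either emit leaf geometry (conceptually, a mesh-node
// launch) or push child records for the next level of the graph.
void ExpandBranch(const TreeParams& p, const BranchRecord& rec,
                  std::vector<BranchRecord>& outChildren, uint64_t& leafCount)
{
    if (rec.level == p.depth) {   // deepest level reached: emit a leaf
        ++leafCount;
        return;
    }
    std::mt19937_64 rng(rec.seed);
    std::uniform_real_distribution<float> uni(0.0f, 1.0f);
    for (uint32_t i = 0; i < p.branchesPerNode; ++i) {
        if (uni(rng) < p.pruneChance)          // branch pruning style
            continue;
        outChildren.push_back({rec.level + 1, rng()});
    }
}
```

With branchesPerNode = 128 and depth = 4, this recursion bottoms out at the 268-million-leaf bound mentioned later in the article.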
Wind effects and environmental interactions update seamlessly, and the CPU's only job is to fill a small set of constants (camera matrices, timestamps, and so on) before dispatching a single work graph. There is no need for continuous host-device chatter or asset streaming, which simplifies integration into existing engines. Perhaps the most eye-opening result is how little memory the transient data consumes. A traditional buffer-heavy approach might need tens of gigabytes, but the researchers' demo holds onto just 51 KB of persistent state per frame, a staggering 99.9999% reduction compared to conventional methods. A scratch buffer of up to 1.5 GB is allocated for work-graph execution, though actual usage varies by GPU driver, and the buffer can be released or reused afterward. Static assets, such as meshes and textures, remain unaffected, leaving future opportunities for neural compression or procedural texturing to push memory savings even further.
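A minimal sketch of that per-frame CPU work, assuming the DirectX 12 work-graphs API from the Agility SDK (ID3D12GraphicsCommandList10::DispatchGraph) with the work-graph program already bound via SetProgram; the PerFrameConstants layout and entry-point index are hypothetical, since the paper only says the CPU fills a small set of constants before launching one graph.

```cpp
// Sketch of the per-frame host work described above. Assumes an Agility SDK
// build with work-graph support; struct layout and entry point are made up.
#include <cstring>
#include <directx/d3d12.h>   // Agility SDK header (assumed include path)
#include <DirectXMath.h>

struct PerFrameConstants {            // the "small set of constants"
    DirectX::XMFLOAT4X4 viewProj;     // camera matrices
    float               timeSeconds;  // timestamp driving wind animation
    float               targetFps;    // LOD controller aims for this rate
    float               pad[2];
};

void RecordFrame(ID3D12GraphicsCommandList10* cmdList,
                 void* mappedConstantBuffer,
                 PerFrameConstants frame)
{
    // 1. The only host-to-device traffic: a handful of constants.
    std::memcpy(mappedConstantBuffer, &frame, sizeof(frame));

    // 2. One dispatch of the whole work graph; tree generation, LOD
    //    selection, and mesh-node launches all happen GPU-side from here.
    D3D12_DISPATCH_GRAPH_DESC desc = {};
    desc.Mode = D3D12_DISPATCH_MODE_NODE_CPU_INPUT;
    desc.NodeCPUInput.EntrypointIndex     = 0;   // hypothetical root node
    desc.NodeCPUInput.NumRecords          = 1;
    desc.NodeCPUInput.pRecords            = &frame;
    desc.NodeCPUInput.RecordStrideInBytes = sizeof(frame);
    cmdList->DispatchGraph(&desc);
}
```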
The key to this achievement is work graphs, which can orchestrate millions of tasks without exploding dispatch counts. Traditional ExecuteIndirect calls would struggle with trees that can have up to 128^4 leaves (around 268 million), but work graphs handle it with ease. Widespread adoption will take time since current support is limited to AMD's RDNA 3+ and NVIDIA's 30-series and newer GPUs. Full game-engine integration and console support are still on the horizon. Looking forward, the researchers are exploring how to extend this flexible, GPU-driven pipeline into ray tracing, possibly by building on-GPU bounding volume hierarchies with the same work-graph framework.
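For reference, the leaf bound quoted above is just four levels of 128-way fan-out; the snippet below does nothing more than that arithmetic, with the numbers taken from the article.

```cpp
// Back-of-the-envelope check of the 128^4 leaf bound: a single work-graph
// dispatch covers this entire expansion, where indirect draws would need the
// CPU to size and issue the intermediate dispatches.
#include <cstdint>
#include <cstdio>

int main() {
    constexpr uint64_t fanOut = 128;   // children per hierarchy level
    constexpr uint64_t levels = 4;     // trunk -> branches -> twigs -> leaves
    uint64_t leaves = 1;
    for (uint64_t i = 0; i < levels; ++i) leaves *= fanOut;
    std::printf("max leaves per tree: %llu\n",
                static_cast<unsigned long long>(leaves));   // 268,435,456
    return 0;
}
```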
Source: HPG Paper
34 Comments on Researchers Unveil Real-Time GPU-Only Pipeline for Fully Procedural Trees
hard decision.
Look at Strix Halo: a weakish RDNA 3.5 GPU (compared to what you can otherwise fit in a $2k build, and no FSR 4), soldered memory (there's no way to get the desired speeds/channel count without it) and a soldered CPU. Is that what you want for desktop? Be careful what you wish for. And the M series isn't chiplets or tiles either, at least for the normal/Pro tiers; it's one piece of silicon apart from the RAM until you get to the Ultra chips, where two or even four of them are connected.
Should we have a transistor that is simultaneously SRAM, TLC storage, CPU logic, GPU logic, cache, NPU, etc.? Good luck figuring that out.
If you're talking about compute-in-memory / in-memory processing, that's also not what you seem to want; it's just a lower-power, lower-latency architecture, and it still doesn't really work in most use cases.
The concept of one architecture that does it all is both inefficient and a pipe dream. Optimization only really works for one, or at most a couple, of task families.
Yes, GPUs evolved towards a unified architecture to avoid specialist parts sitting idle while the others are busy doing something else, but that's not comparable to what we have now with SoCs and GPGPU. Some tasks are still better handled by a CPU architecture, and are a pain to accelerate on a GPU.
And on the contrary, that uber-chip would have to share more of its resources to handle everything. Unified GPUs meant that for the same number of transistors, the GPU could effectively do more work, because the whole chip can be used at all times.
It happened with crypto versus fiat banking. Where's that going these days, I wonder.
This shit never flies because reality gets in the way. It's why they're called utopian thoughts.
It has nothing to do with human ingenuity that 9 women cannot deliver a baby in 1 month or that 10 pilots cannot make the plane reach the destination 10x faster than a single pilot.
Of course, we can always hallucinate it and get around the entire problem ;) Our brains prove we don't even need Turing completeness to calculate anything.
en.wikipedia.org/wiki/Amdahl's_law
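For anyone who doesn't follow the link, here is the law in code form, assuming a task that is 95% parallelizable; the ceiling it predicts is the point being made about the baby and the pilots.

```cpp
// Amdahl's law: speedup(n) = 1 / ((1 - p) + p / n), where p is the parallel
// fraction of the work and n the number of processors. p = 0.95 is assumed.
#include <cstdio>

double amdahl(double p, double n) { return 1.0 / ((1.0 - p) + p / n); }

int main() {
    const double p = 0.95;
    for (double n : {1.0, 10.0, 100.0, 1e9})
        std::printf("n = %-10.0f speedup = %.2fx\n", n, amdahl(p, n));
    // Even with effectively unlimited processors the speedup caps at
    // 1/(1-p) = 20x: more workers cannot touch the serial 5%.
    return 0;
}
```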
Everything is about tradeoffs: shrinking nodes gives you the ability to dedicate silicon to specific functions, gaining efficiency.
Generic computation = inefficient everything. We are in the era of accelerators and DSPs, which increase efficiency at the cost of die space.
You are suggesting we turn tail and throw away 20 years of progress; it's pure idiocy.
Making up analogies of what we currently can’t do in unrelated scenarios doesn’t change anything. By the way, it may be possible to grow multiple human babies in one month in a single incubator. No physical laws of the universe say otherwise. The reason I’m a good scientist is that I don’t stop or have any problem being wrong. I also don’t accept the limited understandings of the professors and textbooks that taught me in school. What came before is a foundation of learning. It’s up to us to build something never conceived before on that foundation.
Stop trying to win the internet and go out and create something. It’s fun.
This has already been tried, and it failed miserably. Yeah, and there's a REASON we stopped doing that: we hit a wall pretty quickly in terms of capability. Voodoo GPUs were not significantly more advanced than Pentiums, but the performance difference for rendering 3D models was night and day.
Some tasks benefit greatly from parallelization, others do not. That is basic computing 101.