Monday, June 23rd 2025
Researchers Unveil Real-Time GPU-Only Pipeline for Fully Procedural Trees
A research team from Coburg University of Applied Sciences and Arts in Germany, together with AMD Germany, has introduced a game-changing approach to procedural tree creation that runs entirely on the GPU, delivering speed and flexibility unlike anything we've seen before. Showcased at High-Performance Graphics 2025 in Copenhagen, the new pipeline uses DirectX 12 work graphs and mesh nodes to construct detailed tree models on the fly, without any CPU muscle. Artists and developers can tweak more than 150 parameters in real time, covering everything from seasonal leaf color shifts and branch pruning styles to complex animations and automatic level-of-detail adjustments. When tested on an AMD Radeon RX 7900 XTX, the system generated unique tree geometries and pushed them into the geometry buffer in just over three milliseconds. It then automatically tunes detail levels to maintain a target frame rate, holding a stable 120 FPS under heavy workloads.
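To give a feel for how such a pipeline can be organized, here is a rough CPU-side sketch of the kind of recursive expansion a tree-generation work graph performs on the GPU: each node invocation either spawns child branch records or, at the deepest level, emits leaf geometry. The TreeParams fields, record layout, and branching logic are illustrative assumptions, not the authors' actual node setup.

```cpp
// Illustrative only: models one "node" of a tree-generation work graph on the
// CPU. The real pipeline runs this expansion entirely on the GPU and feeds
// mesh nodes; names and parameters here are assumptions for illustration.
#include <cstdint>
#include <random>
#include <vector>

struct TreeParams {            // stand-in for a few of the ~150 parameters
    uint32_t branchesPerNode;  // fan-out at each recursion level
    uint32_t depth;            // trunk -> branches -> twigs -> leaves
    float    pruneChance;      // probability a child branch gets culled
};

struct BranchRecord {          // payload passed from one node to its children
    uint32_t level;
    uint64_t seed;             // deterministic seed keeps each tree stable
};

// One invocation: either emit leaf geometry (conceptually, a mesh-node
// launch) or push child records for the next level of the graph.
void ExpandBranch(const TreeParams& p, const BranchRecord& rec,
                  std::vector<BranchRecord>& outChildren, uint64_t& leafCount)
{
    if (rec.level == p.depth) {   // deepest level reached: emit a leaf
        ++leafCount;
        return;
    }
    std::mt19937_64 rng(rec.seed);
    std::uniform_real_distribution<float> uni(0.0f, 1.0f);
    for (uint32_t i = 0; i < p.branchesPerNode; ++i) {
        if (uni(rng) < p.pruneChance)          // branch pruning style
            continue;
        outChildren.push_back({rec.level + 1, rng()});
    }
}
```

With branchesPerNode = 128 and depth = 4, this recursion bottoms out at the 268-million-leaf bound mentioned later in the article.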
Wind effects and environmental interactions update seamlessly, and the CPU's only job is to fill a small set of constants (camera matrices, timestamps, and so on) before dispatching a single work graph. There is no need for continuous host-device chatter or asset streaming, which simplifies integration into existing engines. Perhaps the most eye-opening result is how little memory the transient data consumes. A traditional buffer-heavy approach might need tens of gigabytes, but the researchers' demo holds onto just 51 KB of persistent state per frame, a staggering 99.9999% reduction compared to conventional methods. A scratch buffer of up to 1.5 GB is allocated for work-graph execution, though actual usage varies by GPU driver, and the buffer can be released or reused afterward. Static assets, such as meshes and textures, remain unaffected, leaving future opportunities for neural compression or procedural texturing to push memory savings even further.
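A minimal sketch of that per-frame CPU work, assuming the DirectX 12 work-graphs API from the Agility SDK (ID3D12GraphicsCommandList10::DispatchGraph) with the work-graph program already bound via SetProgram; the PerFrameConstants layout and entry-point index are hypothetical, since the paper only says the CPU fills a small set of constants before launching one graph.

```cpp
// Sketch of the per-frame host work described above. Assumes an Agility SDK
// build with work-graph support; struct layout and entry point are made up.
#include <cstring>
#include <directx/d3d12.h>   // Agility SDK header (assumed include path)
#include <DirectXMath.h>

struct PerFrameConstants {            // the "small set of constants"
    DirectX::XMFLOAT4X4 viewProj;     // camera matrices
    float               timeSeconds;  // timestamp driving wind animation
    float               targetFps;    // LOD controller aims for this rate
    float               pad[2];
};

void RecordFrame(ID3D12GraphicsCommandList10* cmdList,
                 void* mappedConstantBuffer,
                 PerFrameConstants frame)
{
    // 1. The only host-to-device traffic: a handful of constants.
    std::memcpy(mappedConstantBuffer, &frame, sizeof(frame));

    // 2. One dispatch of the whole work graph; tree generation, LOD
    //    selection, and mesh-node launches all happen GPU-side from here.
    D3D12_DISPATCH_GRAPH_DESC desc = {};
    desc.Mode = D3D12_DISPATCH_MODE_NODE_CPU_INPUT;
    desc.NodeCPUInput.EntrypointIndex     = 0;   // hypothetical root node
    desc.NodeCPUInput.NumRecords          = 1;
    desc.NodeCPUInput.pRecords            = &frame;
    desc.NodeCPUInput.RecordStrideInBytes = sizeof(frame);
    cmdList->DispatchGraph(&desc);
}
```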
The key to this achievement is work graphs, which can orchestrate millions of tasks without exploding dispatch counts. Traditional ExecuteIndirect calls would struggle with trees that can have up to 128^4 leaves (around 268 million), but work graphs handle it with ease. Widespread adoption will take time since current support is limited to AMD's RDNA 3+ and NVIDIA's 30-series and newer GPUs. Full game-engine integration and console support are still on the horizon. Looking forward, the researchers are exploring how to extend this flexible, GPU-driven pipeline into ray tracing, possibly by building on-GPU bounding volume hierarchies with the same work-graph framework.
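For reference, the leaf bound quoted above is just four levels of 128-way fan-out; the snippet below does nothing more than that arithmetic, with the numbers taken from the article.

```cpp
// Back-of-the-envelope check of the 128^4 leaf bound: a single work-graph
// dispatch covers this entire expansion, where indirect draws would need the
// CPU to size and issue the intermediate dispatches.
#include <cstdint>
#include <cstdio>

int main() {
    constexpr uint64_t fanOut = 128;   // children per hierarchy level
    constexpr uint64_t levels = 4;     // trunk -> branches -> twigs -> leaves
    uint64_t leaves = 1;
    for (uint64_t i = 0; i < levels; ++i) leaves *= fanOut;
    std::printf("max leaves per tree: %llu\n",
                static_cast<unsigned long long>(leaves));   // 268,435,456
    return 0;
}
```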
Source: HPG Paper
34 Comments on Researchers Unveil Real-Time GPU-Only Pipeline for Fully Procedural Trees
hard decision.
Look at Strix Halo: a weakish RDNA 3.5 GPU (compared to what you can otherwise fit in a $2k build, and no FSR 4), soldered memory (there's no way to get the desired speeds/channel count without it) and a soldered CPU. Is that what you want for desktop? Be careful what you wish for. And the M series isn't chiplets or tiles either, at least for the normal/Pro tiers; it's one piece of silicon apart from the RAM until you get to the Ultra chips, where two or even four of them are connected.
Should we have a transistor that is simultaneously SRAM, TLC storage, CPU logic, GPU logic, cache, NPU, etc.? Good luck figuring that out.
If you're talking about compute-in-memory / in-memory processing, that's also not what you seem to want; it's just a lower-power, lower-latency architecture, and it still doesn't really work in most use cases.
The concept of one architecture that does it all is both inefficient and a pipe dream. Optimization only really works for one, or at most a couple, of task families.
Yes, GPUs evolved towards a unified architecture to avoid specialist parts sitting idle while the others are busy doing something else, but that's not comparable to what we have now with SoCs and GPGPU. Some tasks are still better handled by a CPU architecture, and are a pain to accelerate on a GPU.
And on the contrary, that uber-chip would have to share more of its resources to handle everything. Unified GPUs meant that for the same number of transistors, the GPU could effectively do more work, because the whole chip can be used at all times.
It happened with crypto versus fiat banking. Where's that going these days, I wonder.
This shit never flies because reality gets in the way. It's why they're called utopian thoughts.
It has nothing to do with human ingenuity that 9 women cannot deliver a baby in 1 month or that 10 pilots cannot make the plane reach the destination 10x faster than a single pilot.
Of course, we can always hallucinate it and get around the entire problem ;) Our brains prove we don't even need Turing completeness to calculate anything.
en.wikipedia.org/wiki/Amdahl's_law
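For anyone who doesn't follow the link, here is the law in code form, assuming a task that is 95% parallelizable; the ceiling it predicts is the point being made about the baby and the pilots.

```cpp
// Amdahl's law: speedup(n) = 1 / ((1 - p) + p / n), where p is the parallel
// fraction of the work and n the number of processors. p = 0.95 is assumed.
#include <cstdio>

double amdahl(double p, double n) { return 1.0 / ((1.0 - p) + p / n); }

int main() {
    const double p = 0.95;
    for (double n : {1.0, 10.0, 100.0, 1e9})
        std::printf("n = %-10.0f speedup = %.2fx\n", n, amdahl(p, n));
    // Even with effectively unlimited processors the speedup caps at
    // 1/(1-p) = 20x: more workers cannot touch the serial 5%.
    return 0;
}
```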
Everything is about tradeoffs: shrinking nodes gives you the ability to dedicate silicon to specific functions, gaining efficiency.
Generic computation = inefficient everything. We are in the era of accelerators and DSPs, which increase efficiency at the cost of die space.
You are suggesting we turn tail and throw away 20 years of progress; it's pure idiocy.
Making up analogies of what we currently can’t do in unrelated scenarios doesn’t change anything. By the way, it may be possible to grow multiple human babies in one month in a single incubator. No physical laws of the universe say otherwise. The reason I’m a good scientist is that I don’t stop or have any problem being wrong. I also don’t accept the limited understandings of the professors and textbooks that taught me in school. What came before is a foundation of learning. It’s up to us to build something never conceived before on that foundation.
Stop trying to win the internet and go out and create something. It’s fun.
This has already been tried, and it failed miserably. Yeah, and there's a REASON we stopped doing that: we hit a wall pretty quickly in terms of capability. Voodoo GPUs were not significantly more advanced than Pentiums, but the performance difference for rendering 3D models was night and day.
Some tasks benefit greatly from parallelization, others do not. That is basic computing 101.