
NVIDIA TensorRT Boosts Stable Diffusion 3.5 Performance on NVIDIA GeForce RTX and RTX PRO GPUs

GFreeman

News Editor
Staff member
Generative AI has reshaped how people create, imagine and interact with digital content. As AI models continue to grow in capability and complexity, they require more VRAM, or video random access memory. The base Stable Diffusion 3.5 Large model, for example, uses over 18 GB of VRAM - limiting the number of systems that can run it well. By applying quantization to the model, noncritical layers can be removed or run with lower precision. NVIDIA GeForce RTX 40 Series and the Ada Lovelace generation of NVIDIA RTX PRO GPUs support FP8 quantization to help run these quantized models, and the latest-generation NVIDIA Blackwell GPUs also add support for FP4.
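To make the precision trade-off concrete, here is a minimal sketch of per-tensor FP8 (E4M3) quantization in PyTorch. The float8_e4m3fn dtype requires a recent PyTorch build, and the tensor below is a stand-in for a real model weight, not the actual SD3.5 quantization recipe:

```python
import torch

def quantize_to_fp8(weight: torch.Tensor):
    """Per-tensor quantization: scale into the FP8 range, cast down,
    and keep the scale so values can be dequantized at runtime."""
    fp8_max = torch.finfo(torch.float8_e4m3fn).max  # ~448 for E4M3
    scale = weight.abs().max() / fp8_max
    q = (weight / scale).to(torch.float8_e4m3fn)
    return q, scale

def dequantize(q, scale):
    return q.to(torch.float16) * scale

w = torch.randn(4096, 4096, dtype=torch.float16)  # stand-in for a model weight
q, s = quantize_to_fp8(w)
print(w.element_size(), "->", q.element_size(), "bytes per weight")  # 2 -> 1
```

Halving the bytes per weight is where the bulk of the VRAM savings comes from; production pipelines add calibration so that quality holds up, which is the part NVIDIA and Stability AI tuned here.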

NVIDIA collaborated with Stability AI to quantize its latest model, Stable Diffusion (SD) 3.5 Large, to FP8 - reducing VRAM consumption by 40%. Further optimizations to SD3.5 Large and Medium with the NVIDIA TensorRT software development kit (SDK) double performance. In addition, TensorRT has been reimagined for RTX AI PCs, combining its industry-leading performance with just-in-time (JIT), on-device engine building and an 8x smaller package size for seamless AI deployment to more than 100 million RTX AI PCs. TensorRT for RTX is now available as a standalone SDK for developers.



RTX-Accelerated AI
NVIDIA and Stability AI are boosting the performance and reducing the VRAM requirements of Stable Diffusion 3.5, one of the world's most popular AI image models. With NVIDIA TensorRT acceleration and quantization, users can now generate and edit images faster and more efficiently on NVIDIA RTX GPUs.


Stable Diffusion 3.5 quantized FP8 (right) generates images in half the time with similar quality as FP16 (left). Prompt: A serene mountain lake at sunrise, crystal clear water reflecting snow-capped peaks, lush pine trees along the shore, soft morning mist, photorealistic, vibrant colors, high resolution.

To address the VRAM limitations of SD3.5 Large, the model was quantized with TensorRT to FP8, reducing the VRAM requirement by 40% to 11 GB. This means five GeForce RTX 50 Series GPU models have enough VRAM to run the model entirely from memory, up from just one before quantization.
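As a back-of-envelope check on those numbers (the 8.1 billion parameter figure is from Stability AI's SD3.5 Large announcement; real VRAM use is higher because of activations, text encoders and the VAE, which is how the model gets from roughly 16 GB of weights to the 18+ GB cited above):

```python
# Rough weight footprint per precision for an 8.1B-parameter model.
PARAMS = 8.1e9

for name, bytes_per_param in [("FP16/BF16", 2), ("FP8", 1), ("FP4", 0.5)]:
    gib = PARAMS * bytes_per_param / 2**30
    print(f"{name:>9}: ~{gib:.1f} GiB of weights")
```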

SD3.5 Large and Medium models were also optimized with TensorRT, an AI backend for taking full advantage of Tensor Cores. TensorRT optimizes a model's weights and graph - the instructions on how to run a model - specifically for RTX GPUs.
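For readers who haven't used it, the classic TensorRT flow the article alludes to looks roughly like this: parse an ONNX export of the model, set precision flags, and build a GPU-specific engine. A minimal sketch only; the ONNX filename is hypothetical, and the FP8 builder flag is available only in sufficiently recent TensorRT releases:

```python
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network()  # explicit batch is the default in recent TensorRT
parser = trt.OnnxParser(network, logger)

# Hypothetical path to an ONNX export of the SD3.5 diffusion transformer.
if not parser.parse_from_file("sd35_large_transformer.onnx"):
    raise RuntimeError(parser.get_error(0))

config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.FP16)
# config.set_flag(trt.BuilderFlag.FP8)  # requires a TensorRT version with FP8 support

# Serialize the optimized engine; the weights and graph are now RTX-specific.
engine_bytes = builder.build_serialized_network(network, config)
with open("sd35_large_transformer.plan", "wb") as f:
    f.write(engine_bytes)
```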


FP8 TensorRT boosts SD3.5 Large performance by 2.3x vs. BF16 PyTorch, with 40% less memory use. For SD3.5 Medium, BF16 TensorRT delivers a 1.7x speedup.

Combined, FP8 TensorRT delivers a 2.3x performance boost on SD3.5 Large compared with running the original model in BF16 PyTorch, while using 40% less memory. For SD3.5 Medium, BF16 TensorRT provides a 1.7x performance increase over BF16 PyTorch.

The optimized models are now available on Stability AI's Hugging Face page.

NVIDIA and Stability AI are also collaborating to release SD3.5 as an NVIDIA NIM microservice, making it easier for creators and developers to access and deploy the model for a wide range of applications. The NIM microservice is expected to be released in July.

TensorRT for RTX SDK Released
Announced at Microsoft Build - and already available as part of the new Windows ML framework in preview - TensorRT for RTX is now available as a standalone SDK for developers.

Previously, developers needed to pre-generate and package TensorRT engines for each class of GPU - a process that would yield GPU-specific optimizations but required significant time.

With the new version of TensorRT, developers can create a generic TensorRT engine that's optimized on device in seconds. This JIT compilation can run in the background during installation, or the first time the user invokes the feature.
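TensorRT for RTX's actual API isn't shown in the article, but the pattern it describes is the standard build-once-and-cache JIT flow. A minimal sketch, assuming an illustrative cache path and a build_fn callable (such as the build step sketched earlier) that returns serialized engine bytes:

```python
import os

ENGINE_CACHE = os.path.expanduser("~/.cache/myapp/sd35.plan")  # illustrative path

def load_or_build_engine(build_fn):
    """JIT pattern: compile the engine once on the user's actual GPU,
    then reuse the cached result on every later launch."""
    if os.path.exists(ENGINE_CACHE):
        with open(ENGINE_CACHE, "rb") as f:
            return f.read()
    engine_bytes = build_fn()  # slow step, runs only on first use
    os.makedirs(os.path.dirname(ENGINE_CACHE), exist_ok=True)
    with open(ENGINE_CACHE, "wb") as f:
        f.write(engine_bytes)
    return engine_bytes
```

Because the engine is built on the user's installed GPU, developers ship one generic package instead of pre-built engines for every GPU class, which is where the 8x smaller SDK footprint comes in.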

The easy-to-integrate SDK is now 8x smaller and can be invoked through Windows ML - Microsoft's new AI inference backend in Windows. Developers can download the new standalone SDK from the NVIDIA Developer page or test it in the Windows ML preview.

For more details, read this NVIDIA technical blog and this Microsoft Build recap.

Join NVIDIA at GTC Paris
At NVIDIA GTC Paris at VivaTech - Europe's biggest startup and tech event - NVIDIA founder and CEO Jensen Huang yesterday delivered a keynote address on the latest breakthroughs in cloud AI infrastructure, agentic AI and physical AI. Watch a replay.

GTC Paris runs through Thursday, June 12, with hands-on demos and sessions led by industry leaders. Whether attending in person or joining online, there's still plenty to explore at the event.

View at TechPowerUp Main Site | Source
 
So now we can create fake things and frame people for crimes they didn't commit with a higher degree of precision.. Yay.. :twitch::wtf:

AI needs to be banned until very strong regulations are created for it.
 
Is that really worthy of being a news article here? Echoing complaints that others have voiced in other topics: it seems like any post on Nvidia's blog/news site ends up here without much filtering.

- TensorRT is old af, nothing new in there
- Stability AI's optimizations for certain hardware are not really much news, here's an example from almost 2 years ago.
- That specific ONNX model that's being announced has been available for almost a month now.

Some extra fact checking before copy-pasting whatever is on Nvidia's site would be good to filter noise.
 
Is that really worthy of being a news article here?
Yes.

You may not be playing around with Stable Diffusion, but I am.

Although in fairness, it would have been nice if they specified which five RTX 5000-series cards can actually run SD 3.5.
 
You may not be playing around with Stable Diffusion, but I am.
I also run it, I don't see your point.
If you disagree with what I said, what exactly is news for you in there that only became a thing as of today?
Although in fairness, it would have been nice if they specified which five RTX 5000-series cards can actually run SD 3.5.
Any of them, as long as you don't mind offloading some of the models into CPU/RAM. A 5090 if you want to run GPU-only in FP16; for FP8 you'd need at least 12 GB of VRAM. That's for the Large version.
Same applies to the 4000 series.

I did not look into the medium models.
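For anyone wanting to try that offload route, a minimal diffusers sketch (the checkpoint ID is Stability AI's published one; dtype and step count are illustrative):

```python
import torch
from diffusers import StableDiffusion3Pipeline

pipe = StableDiffusion3Pipeline.from_pretrained(
    "stabilityai/stable-diffusion-3.5-large", torch_dtype=torch.bfloat16
)
# Streams submodules between RAM and VRAM instead of keeping
# the whole pipeline resident on the GPU.
pipe.enable_model_cpu_offload()

image = pipe("a serene mountain lake at sunrise", num_inference_steps=28).images[0]
image.save("out.png")
```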

This is based on Stable Diffusion...
Not really, SD 3.5 is actually based on Flux. Flux came with the idea of using a transformer for the diffusion steps instead of the U-Net that previous SD versions used, then SD 3.5 followed suit and did the same.
 
Not really, SD 3.5 is actually based on Flux. Flux came with the idea of using a transformer for the diffusion steps instead of the U-Net that previous SD versions used, then SD 3.5 followed suit and did the same.
SD has been evolving toward transformer-based approaches for years, even before Flux introduced its diffusion method. The use of transformers for generative models is a broader trend, not exclusive to Flux.

Flux uses a unique hybrid approach, but SD3.5 integrates transformers in a different way, focusing on efficiency and style adaptability within the established SD framework.

Furthermore, Flux itself is based on Stable Diffusion, meaning any architectural similarities between SD3.5 and Flux could stem from their shared SD lineage rather than one being directly based on the other.
 
Another solution would be to just increase the goddamn amount of VRAM on their GPUs, but no, of course we can't have that, else people would be able to run larger non-sponsored models on their own without depending on subscriptions.
 