Friday, September 25th 2020

RTX 3080 Crash to Desktop Problems Likely Connected to AIB-Designed Capacitor Choice

Igor's Lab has posted an interesting investigative article in which he advances a possible reason for the recent crash-to-desktop problems experienced by RTX 3080 owners. For one, Igor mentions that the launch timings were much tighter than usual, leaving NVIDIA's AIB partners with less time than would be adequate to prepare and thoroughly test their designs. One reason this apparently happened is that NVIDIA released the compatible driver stack to AIB partners much later than usual; this meant that their actual testing and QA of produced RTX 3080 graphics cards was mostly limited to power-on and voltage-stability testing, rather than actual gaming/graphics workload testing. This might have allowed some less-than-stellar chip samples to be employed on some of the companies' OC products, which, with higher operating frequencies and the consequent broadband frequency mixtures, hit the apparent 2 GHz frequency wall that produces the crash to desktop.

Another reason, according to Igor, is the "reference board" PG132 design itself, which is used as a reference "Base Design" for partners to architect their custom cards around. The thing here is that NVIDIA's BOM apparently left open choices in terms of power cleanup and regulation for the mounted capacitors. The Base Design features six mandatory capacitor positions for filtering high frequencies on the voltage rails (NVVDD and MSVDD). There are a number of capacitor choices that can be installed here, with varying levels of capability. POSCAPs (Conductive Polymer Tantalum Solid Capacitors) are generally worse than SP-CAPs (Conductive Polymer Aluminium Electrolytic Capacitors), which are in turn superseded in quality by MLCCs (Multilayer Ceramic Chip Capacitors, which have to be deployed in groups). Below is the circuitry arrangement employed beneath the BGA array where NVIDIA's GA102 chip is seated, which corresponds to the central area on the back of the PCB.
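To get a feel for why these capacitor classes behave differently, here is a rough numeric sketch (not from Igor's article; all part values are hypothetical, chosen only to illustrate the trend) using the standard series R-L-C capacitor model. A single large polymer cap and a group of ten MLCCs can provide the same total capacitance, but the MLCC group's much lower parasitic inductance (ESL) gives it far lower impedance at GHz-range frequencies:

```python
import math

def cap_impedance(f, c, esr, esl):
    """|Z| at frequency f (Hz) for the usual series R-L-C capacitor model."""
    w = 2 * math.pi * f
    return math.sqrt(esr ** 2 + (w * esl - 1.0 / (w * c)) ** 2)

def parallel_group(n, c, esr, esl):
    """n identical caps in parallel: capacitance adds, ESR and ESL divide by n."""
    return (n * c, esr / n, esl / n)

# Hypothetical part values, chosen only to illustrate the trend
poscap = (470e-6, 5e-3, 1e-9)                          # one 470 uF polymer cap
mlcc_group = parallel_group(10, 47e-6, 2e-3, 0.5e-9)   # 10 x 47 uF MLCCs

for f in (75.0, 1e6, 2.1e9):
    zp = cap_impedance(f, *poscap)
    zm = cap_impedance(f, *mlcc_group)
    print(f"{f:>12.0f} Hz   single cap {zp:10.4g} ohm   MLCC group {zm:10.4g} ohm")
```

At 75 Hz the two options are essentially identical (same total capacitance dominates), but around the ~2 GHz region the parasitic inductance dominates and the MLCC group's impedance is an order of magnitude lower — which is consistent with the idea that POSCAP-only designs struggle precisely near the 2 GHz boost region.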
In the images below, you can see how NVIDIA and its AIBs designed this regulator circuitry (NVIDIA Founders Edition, MSI Gaming X, ZOTAC Trinity, and ASUS TUF Gaming OC, in order, from our reviews' high-resolution teardowns). NVIDIA's Founders Edition design uses a hybrid capacitor deployment, with four SP-CAPs and two MLCC groups of 10 individual capacitors each in the center. MSI uses a single MLCC group in the central arrangement, with five SP-CAPs handling the rest of the cleanup duties. ZOTAC went the cheapest way (which may be one of the reasons their cards are also among the cheapest), with a six-POSCAP design (which are worse than MLCCs, remember). ASUS, however, designed their TUF with six MLCC arrangements - no savings were made in this area of the power circuitry.

It's likely that the crash-to-desktop problems are related to both these issues - and this would also explain why some cards cease crashing when underclocked by 50-100 MHz: at lower frequencies (which generally keep boost clocks below the 2 GHz mark) there is less broadband frequency mixture happening, which means POSCAP-only solutions can do their job - even if just barely.
Source: Igor's Lab

297 Comments on RTX 3080 Crash to Desktop Problems Likely Connected to AIB-Designed Capacitor Choice

#151
fynxer
windwhirl
I don't get why AIBs went cheap for this board. I mean, it's the second highest-tier GPU! You should never go cheap in that kind of product!
It is, in general terms, called GREED. Imagine: they save only a few cents on a $699 card and still go cheap.

Greed is one of the most important things we must get rid of in human society if we are to avoid extinction.

TO ALL 3080 OWNERS: DO NOT ACCEPT BIOS UPDATES THAT REDUCE YOUR PERFORMANCE. RETURN YOUR CARD NOW WHILE YOU CAN !!!!!
Posted on Reply
#152
BoboOOZ
lexluthermiester
His are better than yours it would seem...
And there I was, thinking that you'd run out of dumb arguments... Ignored.
Posted on Reply
#153
Vya Domus
The quality people seem to find acceptable has fallen to an all-time low; that's the root cause of all this. Had people said no to these $800 POSs, or however much they cost, I can guarantee you NVIDIA and the AIBs would magically have found just the right designs and components to use so that this didn't occur, because they'd need to impress people.

But when all they see is "unprecedented demand" all that takes a back seat and next time around you'll get an even shittier and more expensive product.
Posted on Reply
#154
TechLurker
I'd like to point out that the only guaranteed performance one is paying for is whatever is written on the box, so one can't really take NVIDIA to court if they wanted to just because a firmware update reduced the max auto-OC limits of early batch GPUs (and OC'ing, whether manual or automatic, is considered operating out of spec). The GPUs are still usable at spec. Only that even under ideal conditions, they will not boost as far as later GPU revisions will due to whatever issues are found and corrected with the next major batch.

That said, it doesn't make for a good reputation and hurts the brand when some users lose a bit of performance from a firmware update, and it will hurt even further given that it'll take some time before revised cards are made and released. Then there is the fact that some companies may or may not allow RMAs/exchanges for revised GPUs (although most will probably bend and allow GPU exchanges, if only for publicity purposes). At the very least, the AIBs might be able to re-market the early-batch GPUs as cheaper SI cards for prebuilts from, say, Target, Walmart, Best Buy, Dell (non-gaming-oriented lines), foreign equivalents, and so forth, where they're great as stock GPUs, since they do work at stock and just have a lower max OC ceiling.

Personally, I suspect that any major revisions will not make it to mainstream until November/December, or even until after the new year, since all the early batches (including the "more incoming stock" being promised) were produced according to the original, finalized setup that's currently in the wild, and it'll take a bit of time to sort out the issues and get the change passed through the manufacturing process. And we're still dealing with a virus that's hampering the world's recovery.
Posted on Reply
#155
Vya Domus
TechLurker
I'd like to point out that the only guaranteed performance one is paying for is whatever is written on the box, so one can't really take NVIDIA to court if they wanted to just because a firmware update reduced the max auto-OC limits of early batch GPUs (and OC'ing, whether manual or automatic, is considered operating out of spec). The GPUs are still usable at spec. Only that even under ideal conditions, they will not boost as far as later GPU revisions will due to whatever issues are found and corrected with the next major batch.
That's not how it works; a manufacturer can't just say "clock speed higher than zero" or something like that and get away with any value afterward. This BS wouldn't last a second in court.

You were sold a product that had a specific performance characteristic which was diminished by a software update because the item was defective; that alone would be enough to win the case. You don't even need to worry about what's written on the box.

And may I remind you of this: www.theverge.com/2020/3/2/21161271/apple-settlement-500-million-throttling-batterygate-class-action-lawsuit
Posted on Reply
#156
asdkj1740
lexluthermiester
Who's? What response are we talking about?
JayzTwoCents posted a follow-up video on his Twitter after his first video on YouTube.
Posted on Reply
#157
jayseearr
I did not expect the launch to be good or smooth but seemingly this one has been exceptionally terrible...

Early adopters beware...For those of you who did not figure this out a long time ago.
Posted on Reply
#158
lexluthermiester
asdkj1740
JayzTwoCents posted a follow-up video on his Twitter after his first video on YouTube.
Please post the link.
Posted on Reply
#159
Khonjel
jayseearr
I did not expect the launch to be good or smooth but seemingly this one has been exceptionally terrible...

Early adopters beware...For those of you who did not figure this out a long time ago.
TBH I'm just waiting for when AMD fucks up their launch somehow and people start shitting on them.
lexluthermiester
Please post the link.
[MEDIA=twitter]1309617232201175040[/MEDIA]
Posted on Reply
#160
jayseearr
Khonjel
TBH I'm just waiting for when AMD fucks up their launch somehow and people start shitting on them.
That wouldn't surprise me at all...I'm hopeful they can do better than this but I certainly wouldn't wager any money on it :D
Posted on Reply
#161
TechLurker
Vya Domus
That's not how it works; a manufacturer can't just say "clock speed higher than zero" or something like that and get away with any value afterward. This BS wouldn't last a second in court.

You were sold a product that had a specific performance characteristic which was diminished by a software update because the item was defective; that alone would be enough to win the case. You don't even need to worry about what's written on the box.

And may I remind you of this: www.theverge.com/2020/3/2/21161271/apple-settlement-500-million-throttling-batterygate-class-action-lawsuit
It's a good thing then that on their official pages, the clocks are only officially advertised to a specific range, and that is officially what has been paid for. Take ASUS' Strix 3080 OC for example:
Engine Clock
OC Mode - 1740 MHz (Boost Clock)

Gaming Mode (Default) - GPU Boost Clock : 1710 MHz , GPU Base Clock : 1440 MHz
Nowhere did they promise more than that on the product page. And since most of the crashes seem to be happening above 1800 MHz, closer to the 2 GHz limit, that's already operating "out of spec", and thus they could technically get away with a firmware tweak that hard-limits things to, say, 1900 MHz, since they're only preventing cards from operating too far out of spec. In this instance, they are NOT throttling the promised performance, which is only up to 1740 MHz. Your example would have more relevance had the GPUs gotten a firmware that locks them below the originally advertised clocks, such as a hard limit of 1730 MHz in OC mode.
Posted on Reply
#162
kiriakost
dragontamer5788
While @TiN is being a bit aggressive with his words, I ultimately believe he's making the correct point.

From a PDN perspective, the only thing that matters is the frequencies at which power is drawn. It doesn't matter how NVidia's pipelines or local memory or whatever work. What matters is that they draw power at 2.1GHz increments, generating a 2.1GHz "ring frequency" across the power network... at roughly 100+ Amps.

Which will be roughly:

* 2.1GHz (Clockspeed of GPU)
* 5.25 GHz (rough clockspeed of the GDDR6x)
* 75Hz (The 75-Hz "pulse" every time a 75Hz monitor refreshes: the GPU will suddenly become very active, then stop drawing power waiting for VSync).
* Whatever the GHz is for PCIe communications
* Etc. etc. (Anything else that varies power across time)

Satisfying the needs of both a 5GHz and 75Hz simultaneously (and everything else) is what makes this so difficult. On the one hand, MLCC is traditionally considered great for high-frequency signals (like 5GHz). But guess what's considered best for low-frequency (75Hz) signals? You betcha: huge, high ESR Aluminum caps (and big 470 uF POSCAPs or 220uF caps would similarly better tackle lower-frequency problems).

----------

COULD the issue be the 2.1GHz signal? Do we KNOW FOR SURE that the issue is that the PDN runs out of power inside of a nanosecond?

Or is the card running out of power on VSyncs (~75Hz)? If the card is running out of power at that point, maybe more POSCAPs is better.

I wouldn't be convinced MLCC is "better" until someone posts a sub-nanosecond transient on a $50,000 10GHz oscilloscope showing the problem. Heck, I've seen no actual evidence posted in this entire discussion that even suggests the PDN is the issue yet. (Yeah, EVGA says they had issues when making the card. But other companies clearly have a different power-network design and EVGA's issues may not match theirs).
This is an interesting post, packed with thoughts!
TiN and I have some sort of connection, as we have both been engaged in the electrical test and measurement sector for at least a decade.
I am a maintenance electrician working mostly with industrial electronics and power supplies.
We differ by a lot, as he has specialization in, and a deeper understanding of, electronic circuits.
But he should be the one to inform this forum that this kind of GPU circuit analysis requires a 100,000-Euro 10 GHz (or better) oscilloscope, plus special probes worth 4,000 Euro or more apiece.
In simple English... we are both missing the very damn expensive tools required for an in-depth analysis of what is happening.
Therefore it is wise for all of us to wait for the findings of the well-paid engineers working at the big brands.
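One back-of-the-envelope piece of the quoted analysis can at least be sketched without any expensive gear: how much voltage droop a fixed decoupling capacitance allows for a given load step over different timescales. All values below are hypothetical, purely for illustration:

```python
# droop = I * dt / C : voltage sag across an ideal capacitor supplying a
# constant current step for time dt before the regulator responds.
def droop_volts(i_amps, dt_s, c_farads):
    return i_amps * dt_s / c_farads

C_LOCAL = 470e-6   # total local decoupling capacitance (hypothetical)
I_STEP = 100.0     # sudden load step in amps (order of magnitude from the post)

for label, dt in [("one 2.1 GHz clock cycle", 1 / 2.1e9),
                  ("typical VRM response (~1 us)", 1e-6),
                  ("one 75 Hz frame", 1 / 75)]:
    print(f"{label:<30} droop = {droop_volts(I_STEP, dt, C_LOCAL):.3g} V")
```

The absurd kilovolt-scale figure for the 75 Hz case just means that timescale cannot be served by decoupling capacitance at all; the VRM control loop has to handle it, which is why the bulk and ceramic populations only need to bridge the gap until the regulator catches up.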
Posted on Reply
#163
Julhes
Hi all,

Recently there has been some discussion about the EVGA GeForce RTX 3080 series.

During our mass production QC testing we discovered a full 6 POSCAPs solution cannot pass the real world applications testing. It took almost a week of R&D effort to find the cause and reduce the POSCAPs to 4 and add 20 MLCC caps prior to shipping production boards, this is why the EVGA GeForce RTX 3080 FTW3 series was delayed at launch. There were no 6 POSCAP production EVGA GeForce RTX 3080 FTW3 boards shipped.

But, due to the time crunch, some of the reviewers were sent a pre-production version with 6 POSCAPs; we are working with those reviewers directly to replace their boards with production versions.
EVGA GeForce RTX 3080 XC3 series with 5 POSCAPs + 10 MLCC solution is matched with the XC3 spec without issues.


Also note that we have updated the product pictures at EVGA.com to reflect the production components that shipped to gamers and enthusiasts since day 1 of product launch.
Once you receive the card you can compare for yourself, EVGA stands behind its products!


Thanks
EVGA
Posted on Reply
#164
lexluthermiester
Khonjel
TBH I'm just waiting for when AMD fucks up their launch somehow and people start shitting on them.


[MEDIA=twitter]1309617232201175040[/MEDIA]
Thank You.

Decided to reactivate my old Twitter account... Surprised it was still there.
Posted on Reply
#165
OneMoar
There is Always Moar
POSCAP is just a product line from Panasonic. IT IS NOT A PARTICULAR TYPE OF CAPACITOR.
You need two types of capacitors in the circuit, because MLCCs as a general rule do not have enough capacitance, and other types of polymer capacitors don't have the same filtering/frequency capability.

Think of it this way: the larger polymer caps act as a reservoir, and the MLCCs handle the heavy lifting of dealing with the extreme changes in demand coming from the GPU core.

It's entirely possible to build a circuit with nothing but MLCCs, but that requires a lot of PCB space, it's expensive, and in the end probably overkill.

You can also do it with other types of caps, so long as you keep the capacitance and frequency requirements in mind.
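The reservoir-plus-MLCC division of labor described above can be sketched with the standard series R-L-C capacitor model; the values below are hypothetical, purely for illustration. In parallel, the bulk polymer cap sets the (lower) impedance at low frequency and the MLCC group sets it at high frequency, so at both ends of the band the combination is at least as good as the better of the two parts:

```python
import math

def z_cap(f, c, esr, esl):
    """Complex impedance of a series R-L-C capacitor model at frequency f (Hz)."""
    w = 2 * math.pi * f
    return complex(esr, w * esl - 1.0 / (w * c))

def z_parallel(*zs):
    """Combined impedance of several parts in parallel (admittances add)."""
    return 1.0 / sum(1.0 / z for z in zs)

# Hypothetical values: one bulk polymer "reservoir" cap + 20 small MLCCs
bulk = (330e-6, 6e-3, 1.5e-9)    # 330 uF polymer cap
mlcc = (1e-6, 3e-3, 0.3e-9)      # 1 uF MLCC, used 20x

for f in (1e3, 1e6, 1e9):
    zb = abs(z_cap(f, *bulk))
    zm = abs(z_parallel(*[z_cap(f, *mlcc)] * 20))
    zc = abs(z_parallel(z_cap(f, *bulk), *[z_cap(f, *mlcc)] * 20))
    print(f"{f:>10.0f} Hz   bulk {zb:.4g}   MLCC x20 {zm:.4g}   combined {zc:.4g}")
```

This is only a first-order sketch: real PDN design also has to watch for anti-resonance peaks between the two capacitor populations, which is part of why "just add more caps" isn't a universal fix.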

YouTubers that don't have electrical degrees should not talk about shit that requires said degree :roll:

What the article should read is this:

AIBs screw up PCB design
Posted on Reply
#166
tigger
I'm the only one
OneMoar
POSCAP is just a brand name from Panasonic IT IS NOT A PARTICULAR TYPE OF CAPACITORS
I don't think many actually watched this ^ video; people still think POSCAP is a type, not a brand.
Posted on Reply
#167
$ReaPeR$
All this arguing should be funny, but it really isn't. At the end of the day, this is a defective product that is ridiculously expensive. And I do not understand how some people have the nerve to defend NVIDIA when we all know the level of control they have over their partners. If you want better quality, vote with your wallets and don't buy this. But you won't, because at the end of the day rationality is not something people use. This behavior is literally the reason we can't have nice things.
Posted on Reply
#168
BoboOOZ
tigger
Don't think many actually watched this ^ video. still think Poscap is a type not brand
Just do a Google search; POSCAPs are Panasonic-only.
Posted on Reply
#169
jayseearr
OneMoar
youtubers that don't have electrical degrees should not talk about shit that requires said degree :roll:
I get your point, but I would still rather they talk about it; more awareness is a good thing, and stuff like this can often accelerate a fix.
OneMoar
what the article should read is this

AIBs screw up PCB design, issue revisions
Didn't the same video that you posted explain, within the first couple of minutes, why/how it's just as much a reflection on NVIDIA as on the AIBs, if not more so?

"The thing is, as far as I'm concerned this is like NVIDIA screwing up the design guidelines. I wouldn't really throw this on the board partners, because as far as I'm aware you can't even ship an NVIDIA GPU without running it through NVIDIA's Green Light program. So if NVIDIA doesn't approve your PCB design, you can't sell it."
Posted on Reply
#170
mtcn77
$ReaPeR$
All this arguing should be funny, but it really isn't. At the end of the day, this is a defective product that is ridiculously expensive. And I do not understand how some people have the nerve to defend NVIDIA when we all know the level of control they have over their partners.
Well, it is a halo product plus planned obsolescence, all rolled into one. Two birds with one stone. And people don't appreciate it enough already.
Posted on Reply
#171
Vya Domus
TechLurker
It's a good thing then that on their official pages, the clocks are only officially advertised to a specific range, and that is officially what has been paid for. Take ASUS' Strix 3080 OC for example:



Nowhere did they promise more than that on the product page. And since most of the crashes seem to be happening above 1800 MHz, closer to the 2 GHz limit, that's already operating "out of spec", and thus they could technically get away with a firmware tweak that hard-limits things to, say, 1900 MHz, since they're only preventing cards from operating too far out of spec. In this instance, they are NOT throttling the promised performance, which is only up to 1740 MHz. Your example would have more relevance had the GPUs gotten a firmware that locks them below the originally advertised clocks, such as a hard limit of 1730 MHz in OC mode.
A vendor must provide specifications that match how the product actually sold behaves, no matter what. The only real "spec" is written in the BIOS of the card, which allows the card to operate within certain parameters, i.e. close to 2000 MHz; if it can't do that, then it's defective and can't match its specification. It's as simple as that. There is no auto-OC; it's all default stock settings that NVIDIA and the AIBs came up with.

Again, they are simply pushing the limits of advertising; in a real legal matter those figures would never hold up, just like they didn't in the Apple case. If the world operated how you think it does, then you could, for example, sometimes get water in your car's tank from the gas station, because the oil company claimed "up to 95% petrol content", and that would be fine and no one could ever take them to court. That'd obviously be completely absurd; companies can't just claim whatever they want and sell crap with no repercussions.
Posted on Reply
#172
semantics
Clearly this was done to snipe all the ebay scalpers.
Posted on Reply
#173
mtcn77
semantics
Clearly this was done to snipe all the ebay scalpers.
Right? It is a feature.
Posted on Reply
#174
Vya Domus
semantics
Clearly this was done to snipe all the ebay scalpers.
I hope that's sarcasm. :kookoo:
Posted on Reply
#175
mtcn77
Vya Domus
I hope that's sarcasm. :kookoo:
Don't you want competition, progress? How very not liberal-minded of you...
This creates a new echelon of enthusiast: the ones who are privileged enough to be able to run the cards.
Posted on Reply
Add your own comment