Friday, September 25th 2020

RTX 3080 Crash to Desktop Problems Likely Connected to AIB-Designed Capacitor Choice

Igor's Lab has posted an interesting investigative article in which he advances a possible reason for the recent crash-to-desktop problems affecting RTX 3080 owners. For one, Igor mentions that the launch timings were much tighter than usual, leaving NVIDIA's AIB partners far less time than adequate to prepare and thoroughly test their designs. One reason this apparently happened is that NVIDIA released the compatible driver stack to AIB partners much later than usual; this meant that their actual testing and QA for produced RTX 3080 graphics cards was mostly limited to power-on and voltage-stability testing, rather than actual gaming/graphics workload testing. That might have allowed some less-than-stellar chip samples to be employed on some companies' OC products, which, with their higher operating frequencies and the consequent broadband frequency mixtures, hit the apparent 2 GHz frequency wall that produces the crash to desktop.

Another factor, according to Igor, is the actual reference board design, PG132, which serves as a "base design" for partners to architect their custom cards around. The thing here is that NVIDIA's BOM apparently left the choice of capacitors for power filtering and regulation open. The base design features six mandatory capacitors for filtering high frequencies on the voltage rails (NVVDD and MSVDD). There are a number of capacitor options available here, with varying levels of capability: POSCAPs (Conductive Polymer Tantalum Solid Capacitors) are generally worse than SP-CAPs (Conductive Polymer Aluminium Electrolytic Capacitors), which in turn are surpassed in quality by MLCCs (Multilayer Ceramic Chip Capacitors, which have to be deployed in groups). This circuitry sits beneath the BGA array where NVIDIA's GA102 chip is seated, corresponding to the central area on the back of the PCB.
In the images below, you can see how NVIDIA and its AIBs designed this regulator circuitry (NVIDIA Founders Edition, MSI Gaming X, ZOTAC Trinity, and ASUS TUF Gaming OC, in order, from our reviews' high-resolution teardowns). NVIDIA's Founders Edition design uses a hybrid capacitor deployment, with four SP-CAPs and two MLCC groups of 10 individual capacitors each in the center. MSI uses a single MLCC group in the central arrangement, with five SP-CAPs handling the rest of the cleanup duties. ZOTAC went the cheapest way (which may be one of the reasons their cards are also among the cheapest), with a six-POSCAP design (which are worse than MLCCs, remember). ASUS, however, designed their TUF with six MLCC arrangements; no corners were cut in this power circuitry area.

It's likely that the crash-to-desktop problems are related to both of these issues, and this would also explain why some cards stop crashing when underclocked by 50-100 MHz: at lower frequencies (which generally keep boost clocks below the 2 GHz mark), there is less broadband frequency mixture happening, which means POSCAP-based solutions can do their job, even if just barely.
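To make the high-frequency argument concrete: modelling each capacitor as a series R-L-C network, the impedance a 2 GHz current ripple sees is dominated by parasitic inductance (ESL), which is exactly where a group of small MLCCs beats a single large polymer capacitor. Below is a rough Python sketch; the parasitic values are purely illustrative, not taken from any datasheet or board.

```python
import math

def cap_impedance(f, c, esr, esl):
    """|Z| of a real capacitor modelled as a series R-L-C network."""
    w = 2 * math.pi * f
    return math.sqrt(esr ** 2 + (w * esl - 1.0 / (w * c)) ** 2)

F = 2e9  # ~2 GHz, near the reported crash threshold

# Illustrative parasitics: one large polymer cap vs. ten small MLCCs.
z_poscap = cap_impedance(F, c=330e-6, esr=6e-3, esl=1.0e-9)

# Ten MLCCs in parallel: capacitance adds, ESR and ESL divide by n.
n = 10
z_mlcc_group = cap_impedance(F, c=n * 4.7e-6, esr=3e-3 / n, esl=0.4e-9 / n)

print(f"polymer cap   @ 2 GHz: {z_poscap * 1e3:7.1f} mOhm")
print(f"10-MLCC group @ 2 GHz: {z_mlcc_group * 1e3:7.1f} mOhm")
```

At these frequencies both parts are well past self-resonance, so ESL dominates; the parallel group's tenfold-lower inductance is what gives it the edge, consistent with the ranking of the three capacitor types above.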
Source: Igor's Lab

297 Comments on RTX 3080 Crash to Desktop Problems Likely Connected to AIB-Designed Capacitor Choice

#126
blobster21
Come on, we need more insightful comments here! (And I'm bored to death anyway, so keep them coming please :p)
#127
lexluthermiester
BoboOOZ
Nvidia never admits being wrong and always blames the partners (TSMC, Apple, etc.), so here they will say that the fault is with the AIBs and the fix will be based on downclocking...
It's not NVidia's fault. The AIBs are solely to blame for not following the recommendations and not doing proper testing. The reality is, people will need to do a little bit of downclocking to keep those cards stable. It's not the end of the world and likely will not even affect overall card performance to a noticeable degree.
#128
Chomiq
lexluthermiester
It's not NVidia's fault. The AIBs are solely to blame for not following the recommendations and not doing proper testing.
Nvidia has to approve each partner board design. Also, AIB partners didn't even get the drivers until review samples were shipped out.
#129
lexluthermiester
Chomiq
Nvidia has to approve each partner board design.
The design, yes. That doesn't mean it was tested by NVidia. That is the responsibility of the AIBs.
Chomiq
Also, AIB partners didn't even get the drivers until review samples were shipped out.
And that is still not NVidia's fault. The problem would not exist if the AIBs had followed the recommendations stated by NVidia. That is what recommendations are for.
#130
zlobby
roccale
It's beautiful :)
Indeed so, most indeedely!
#131
BoboOOZ
lexluthermiester
It's not NVidia's fault. The AIBs are solely to blame for not following the recommendations and not doing proper testing. The reality is, people will need to do a little bit of downclocking to keep those cards stable. It's not the end of the world and likely will not even affect overall card performance to a noticeable degree.
We don't know yet what's happening exactly, but you are already sure Nvidia has no responsibility in this? That's very unbiased of you.
#133
EarthDog
asdkj1740
subbed
just an FYI, there is a "watch" button at the top of the page just for subscribing. :)
BoboOOZ
We don't know yet what's happening exactly, but you are already sure Nvidia has no responsibility in this? That's very unbiased of you.
What is Nvidia's role in this?
#134
lexluthermiester
BoboOOZ
We don't know yet what's happening exactly, but you are already sure Nvidia has no responsibility in this?
So far, these problems are NOT happening with NVidia's own cards, nor the higher-tier cards from AIBs. It's just the lower-tier offerings from AIBs. The responsibility rests with the AIBs. Please review;
BoboOOZ
That's very unbiased of you.
Bias has nothing to do with it. The info out there is showing the problem.
#135
theoneandonlymrk
lexluthermiester
So far, these problems are NOT happening with NVidia's own cards, nor the higher-tier cards from AIBs. It's just the lower-tier offerings from AIBs. The responsibility rests with the AIBs. Please review;

No company shouts more about its work with partners, devs and AIBs.
The reference spec design they passed to AIBs was different from their own reference card's.
And they compressed development and testing time to near zero.
And they allowed such design variation in their development reference kit despite knowing that it needed specific voltage conditioning, instead of informing AIB partners or limiting those AIB designs.

It's not all on Nvidia but they share the blame.
#136
BoboOOZ
lexluthermiester
So far, these problems are NOT happening with NVidia's own cards, nor the higher-tier cards from AIBs. It's just the lower-tier offerings from AIBs. The responsibility rests with the AIBs.


Bias has nothing to do with it. The info out there is showing the problem.
That's not true, and Jays2c is fun and all, but his technical abilities aren't awesome. He might be onto something, but apparently, the FE crashes as well:
[MEDIA=twitter]1309659834468298753[/MEDIA]
Most of the time, in this type of situation, the responsibility is shared, but the chances that Nvidia gave very clear and correct specifications and the AIBs just blatantly disrespected them are close to zero.

Time will tell, but it looks like we were expecting another Pascal and we got another Fermi... They'll fix it soon, I imagine; if it's just a matter of dropping the frequency a tad, it should be easily fixable.
#137
Dave65
lexluthermiester
It's not NVidia's fault. The AIB's are solely to blame for not following the recommendations and not doing proper testing. The reality is, people will need to do a little bit of downclocking to keep those card stable. It's not the end of the world and likely will not even affect over-all card performance to a noticeable degree.
You GOT to be kidding, right?
This is exactly on Nvidia.:shadedshu:
#138
Rado D
Mirrormaster85
So, as an Electronics Engineer and PCB Designer I feel I have to react here.
The point that Igor makes about improper power design causing instability is a very plausible one, especially with first production runs, where it could indeed be the case that they did not have the time/equipment/drivers etc. to do proper design verification.


However, concluding from this that POSCAP = bad and MLCC = good is waaay too harsh and not a conclusion you can make.


Both POSCAPs (or any other 'solid polymer caps') and MLCCs have their own characteristics and use cases.


Some (not all) are ('+' = pos, '-' = neg):
MLCC:
+ cheap
+ small
+ high voltage rating in small package
+ high current rating
+ high temperature rating
+ high capacitance in small package
+ good at high frequencies
- prone to cracking
- prone to piezo effect
- bad temperature characteristics
- DC bias (capacitance changes a lot under different voltages)


POSCAP:
- more expensive
- bigger
- lower voltage rating
+ high current rating
+ high temperature rating
- less good at high frequencies
+ mechanically very strong (no MLCC cracking)
+ not prone to piezo effect
+ very stable over temperature
+ no DC bias (capacitance very stable at different voltages)


As you can see, both have their strengths and weaknesses, and one is not particularly better or worse than the other. It all depends.
In this case, most of these 3080 and 3090 boards may use the same GPU (with its requirements) but they also have very different power circuits driving the chips on the cards.
Each power solution has its own characteristics and behavior and thus its own requirements in terms of capacitors used.
Thus, you cannot simply say: I want the card with only MLCC's because that is a good design.
It is far more likely they just could not (or would not) spend enough time and/or resources to properly verify their designs, and thus were not able to make proper adjustments to their initial component choices.
This will very likely work itself out in time. For now, just buy the card that you like and, if it fails, simply claim warranty. Let them fix the problem, and don't draw too many conclusions based on incomplete information and (educated) guesswork.
Amen and thank you!
Don't think I have to look for a more informative and unbiased opinion.
#139
lexluthermiester
theoneandonlymrk
It's not all on Nvidia but they share the blame.
There's likely some truth to that, but people are acting like it's ALL on NVidia, which is a crock of poop... Example, you ask?...
Dave65
You GOT to be kidding , right?
This is exactly on Nvidia.:shadedshu:
There you go..
#140
MelonGx
[MEDIA=twitter]1309840810880282625[/MEDIA]
For those people who insisted the TUF won't crash, I'm posting video evidence of my TUF crashing.
#141
lexluthermiester
BoboOOZ
That's not true, and Jays2c is fun and all, but his technical abilities aren't awesome.
His are better than yours it would seem...
#142
asdkj1740
lexluthermiester
His are better than yours it would seem...
You should check his latest response on his Twitter.
#144
mtcn77
Rado D
Amen and thank you!
Dont think I have to look for more informative and unbiased opinion.
Agreed. People who show up at such a debate make it almost into a fortune to behold.
#145
BigBonedCartman
The RTX 2000 series had faulty brand-new cards randomly dying, the RTX 3000 series has AIB partners cheaping out on capacitors, AMD constantly has driver issues... WTF is wrong with GPU manufacturing?
#146
lexluthermiester
asdkj1740
you should check his latest response on his twitter.
Whose? What response are we talking about?
BigBonedCartman
WTF is wrong with GPU manufacturing?
Nothing. They are making ever more complex and powerful cards to push the limits of performance under very tight time constraints. I'm not excusing these problems, only offering an explanation. The industry needs to slow down a little and focus more on quality.
#147
Khonjel
I don't get it, tbh. POSCAPs are supposedly more expensive than MLCCs (per that Reddit post). So supposedly overbuilt cards aren't performing as intended or something? But damn, people are gonna run towards ASUS now. Both the Strix and the cheaper TUF use an all-MLCC design.
#148
lexluthermiester
Khonjel
I don't get it, tbh. POSCAPs are supposedly more expensive than MLCCs (per that Reddit post).
The Reddit post was wrong. The whole process of mounting the smaller components is the more expensive one. The components themselves are not all that expensive; it's getting them soldered on that is the more involved process.
#149
blobster21
Khonjel
I don't get it, tbh. POSCAPs are supposedly more expensive than MLCCs (per that Reddit post). So supposedly overbuilt cards aren't performing as intended or something? But damn, people are gonna run towards ASUS now. Both the Strix and the cheaper TUF use an all-MLCC design.
No, it's not THAT easy. If anything, those components will do the job within their respective operating ranges nicely. It's just that the GPU boost is too much for them to handle. Two contributors wrote it already:
Mirrormaster85
concluding from this that POSCAP = bad and MLCC = good is waaay too harsh and not a conclusion you can make.


Both POSCAPs (or any other 'solid polymer caps') and MLCCs have their own characteristics and use cases.


Some (not all) are ('+' = pos, '-' = neg):
MLCC:
+ cheap
+ small
+ high voltage rating in small package
+ high current rating
+ high temperature rating
+ high capacitance in small package
+ good at high frequencies
- prone to cracking
- prone to piezo effect
- bad temperature characteristics
- DC bias (capacitance changes a lot under different voltages)


POSCAP:
- more expensive
- bigger
- lower voltage rating
+ high current rating
+ high temperature rating
- less good at high frequencies
+ mechanically very strong (no MLCC cracking)
+ not prone to piezo effect
+ very stable over temperature
+ no DC bias (capacitance very stable at different voltages)
TiN
Just replacing everything with MLCCs will NOT help the design reach higher speeds and stability. Why? Because one needs to use different caps in tandem, as their frequency response is different, as well as ESR, ESL and other factors.

Having everything MLCC, like glorified ASUS does, means you have a single deep resonance notch, instead of two less prominent notches when using MLCC+POSCAP together. Using three kinds (smaller POSCAPs, bigger POSCAPs, and some MLCCs) gives a better figure with three notches.
But again, with modern DC-DC controllers a lot of this can be tuned via PID control and converter slew-rate tweaks. This adjustability is one of the big reasons why enthusiast cards often use "digital" controllers that allow tweaking such parameters almost on the fly. However, this is almost never exposed to the user, as wrong settings can easily make power phases go brrrrrr with smoke. Don't ask me how I know...


Before looking at the poor six capacitors behind the die: why is nobody talking about the huge POSCAP capacitor bank behind the VRM on the FE card, eh? Custom AIB cards don't have that, just the usual array without much bulk capacitance. If I were designing a card, I'd look at the GPU's power demands and then add enough bulk capacitance first, to make sure of a good power-impedance margin at mid-frequency ranges, while worrying about capacitors for high-frequency decoupling later, as that is the relatively easier job to tweak.

After all, these wild theories are easy to test; you don't need an engineering education to prove this wrong or right. Take a "bad" crashing card with "bad POSCAPs" and test it to confirm the crashes... Then desolder the "bad POSCAPs", put a bunch of 47 uF low-ESR MLCCs in instead, and test again to see if it's "fixed". Something tells me that it would not be such a simple case and the card may still crash, heh. ;-)
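TiN's notch argument can be checked numerically. The Python sketch below models each capacitor bank as a series R-L-C branch, combines the branches in parallel, and counts the local minima ("notches") in the resulting impedance curve from 10 kHz to 1 GHz. The parasitic values are invented purely to illustrate the effect, not taken from any card.

```python
import math

def z_branch(f, c, esr, esl, n=1):
    """Complex impedance of n identical series R-L-C capacitors in parallel."""
    w = 2 * math.pi * f
    return complex(esr, w * esl - 1.0 / (w * c)) / n

def parallel(*zs):
    return 1.0 / sum(1.0 / z for z in zs)

# 10 kHz .. 1 GHz, log-spaced (400 points per decade)
freqs = [10e3 * 10 ** (i / 400) for i in range(2001)]

# Sixty identical MLCCs vs. ten MLCCs plus one bulk polymer cap.
# All parasitic values are illustrative only.
mlcc_only = [abs(z_branch(f, 1e-6, 10e-3, 1e-9, n=60)) for f in freqs]
mixed = [abs(parallel(z_branch(f, 1e-6, 10e-3, 1e-9, n=10),
                      z_branch(f, 470e-6, 2e-3, 2e-9)))
         for f in freqs]

def count_minima(mag):
    """Count local minima ('notches') in an impedance magnitude sweep."""
    return sum(1 for i in range(1, len(mag) - 1)
               if mag[i] < mag[i - 1] and mag[i] < mag[i + 1])

print("notches, identical caps only:", count_minima(mlcc_only))
print("notches, mixed bank:         ", count_minima(mixed))
```

Identical parts share one self-resonant frequency, so the sweep shows a single deep notch; mixing part types spreads the resonances apart (at the cost of an anti-resonance peak between them), which is why the tuning of the whole network, not the part type alone, decides stability.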
#150
dragontamer5788
kiriakost
It is not among my priorities to discover NVIDIA's magic oompa-loompa inside the chip, because I do not make money from VGA card repairs.
I am aware of your measuring gear, but your accident stopped your exploration at the discovery of what an 8846A, by far the most modern design, can do.
Anyway, this is another story, and a boring one for the readers of this forum.
While @TiN is being a bit aggressive with his words, I ultimately believe he's making the correct point.

From a PDN perspective, the only thing that matters is the frequencies at which power is drawn. It doesn't matter how NVidia's pipelines or local memory or whatever work. What matters is that they draw power in 2.1 GHz increments, generating a 2.1 GHz "ring frequency" across the power network... at roughly 100+ amps.

Which will be roughly:

* 2.1GHz (Clockspeed of GPU)
* 5.25 GHz (rough clockspeed of the GDDR6x)
* 75Hz (The 75-Hz "pulse" every time a 75Hz monitor refreshes: the GPU will suddenly become very active, then stop drawing power waiting for VSync).
* Whatever the GHz is for PCIe communications
* Etc. etc. (Anything else that varies power across time)

Satisfying the needs of both a 5 GHz and a 75 Hz load simultaneously (and everything else) is what makes this so difficult. On the one hand, MLCCs are traditionally considered great for high-frequency signals (like 5 GHz). But guess what's considered best for low-frequency (75 Hz) signals? You betcha: huge, high-ESR aluminum caps (and big 470 uF POSCAPs or 220 uF caps would similarly better tackle lower-frequency problems).

----------

COULD the issue be the 2.1GHz signal? Do we KNOW FOR SURE that the issue is that the PDN runs out of power inside of a nanosecond?

Or is the card running out of power on VSyncs (~75Hz)? If the card is running out of power at that point, maybe more POSCAPs is better.

I wouldn't be convinced MLCC is "better" until someone posts a sub-nanosecond transient captured on a $50,000 10 GHz oscilloscope showing the problem. Heck, I've seen no actual evidence posted in this entire discussion that even suggests the PDN is the issue yet. (Yeah, EVGA says they had issues when making the card. But other companies clearly have different power-network designs, and EVGA's issues may not match theirs.)
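The low-frequency side of this argument can be made concrete with back-of-the-envelope charge arithmetic (Q = C·ΔV, t = Q/I): how long could each kind of on-board bank alone hold the rail within a 5% droop window against an assumed 100 A load step, before the VRM loop responds? All numbers below are illustrative, not measurements from any actual board.

```python
# Ride-through estimate under an assumed instantaneous load step.
I_STEP = 100.0            # A, assumed GPU load step
V_RAIL = 1.0              # V, nominal core rail
DV_MAX = 0.05 * V_RAIL    # 5% droop budget

# Illustrative bank sizes: bulk polymer vs. high-frequency MLCC groups.
banks = {
    "bulk polymer bank (6 x 330 uF)": 6 * 330e-6,
    "HF MLCC groups (60 x 4.7 uF)": 60 * 4.7e-6,
}

# Charge available inside the droop budget is C * dV; dividing by the
# step current gives the time before the rail leaves the window.
ride_through = {name: c * DV_MAX / I_STEP for name, c in banks.items()}

for name, t in ride_through.items():
    print(f"{name}: ~{t * 1e9:.0f} ns of ride-through")
```

Bulk banks store far more charge, so they cover the slow (refresh-rate-scale) swings, while MLCC groups cover the sub-nanosecond ones; neither kind can do both jobs alone, which is the point about needing the whole network characterized before blaming one capacitor type.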