Friday, September 25th 2020

RTX 3080 Crash to Desktop Problems Likely Connected to AIB-Designed Capacitor Choice

Igor's Lab has posted an interesting investigative article in which Igor advances a possible reason for the recent crash-to-desktop problems reported by RTX 3080 owners. For one, Igor mentions that the launch timings were much tighter than usual, leaving NVIDIA's AIB partners far less time than they needed to prepare and thoroughly test their designs. One apparent reason is that NVIDIA released the compatible driver stack to AIB partners much later than usual. As a result, their testing and QA for production RTX 3080 graphics cards was mostly limited to power-on and voltage stability testing rather than actual gaming/graphics workload testing. That might have allowed some less-than-stellar chip samples to end up on some of the companies' OC products, which, with their higher operating frequencies and consequent broadband frequency mixtures, hit the apparent 2 GHz frequency wall that produces the crash to desktop.

Another reason for this, according to Igor, is the actual PG132 "reference board" design, which serves as the "Base Design" that partners architect their custom cards around. Apparently, NVIDIA's BOM left choices open for the capacitors that handle power cleanup and regulation. The Base Design features six mandatory capacitors for filtering high frequencies on the voltage rails (NVVDD and MSVDD). A number of capacitor types can be installed here, with varying levels of capability. POSCAPs (Conductive Polymer Tantalum Solid Capacitors) are generally worse than SP-CAPs (Conductive Polymer Aluminium Electrolytic Capacitors), which are in turn surpassed in quality by MLCCs (Multilayer Ceramic Chip Capacitors, which have to be deployed in groups). This circuitry sits beneath the BGA array where NVIDIA's GA102 chip is seated, which corresponds to the central area on the back of the PCB.
In the images below, you can see how NVIDIA and its AIBs designed this regulator circuitry (NVIDIA Founders Edition, MSI Gaming X, ZOTAC Trinity, and ASUS TUF Gaming OC, in that order, from our reviews' high-resolution teardowns). NVIDIA's Founders Edition design uses a hybrid capacitor deployment, with four SP-CAPs and two MLCC groups of 10 individual capacitors each in the center. MSI uses a single MLCC group in the central arrangement, with five SP-CAPs handling the rest of the cleanup duties. ZOTAC went the cheapest way (which may be one of the reasons their cards are also among the cheapest), with a six-POSCAP design (POSCAPs are worse than MLCCs, remember). ASUS, however, designed their TUF with six MLCC groups; no savings were made in this part of the power circuitry.
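To make the trade-off more concrete, each capacitor can be approximated as a series R-L-C element (capacitance, ESR, ESL), and a group of identical parts in parallel divides ESR and ESL by the part count. The sketch below is only an illustration under that simplified model: the capacitance, ESR and ESL figures are placeholder assumptions, not values from NVIDIA's BOM, Igor's article, or any datasheet.

```python
import math


def cap_impedance(freq_hz, c_farad, esr_ohm, esl_henry):
    """Impedance magnitude of one capacitor modelled as series R-L-C."""
    w = 2 * math.pi * freq_hz
    reactance = w * esl_henry - 1.0 / (w * c_farad)
    return math.hypot(esr_ohm, reactance)


def bank_impedance(freq_hz, n, c_farad, esr_ohm, esl_henry):
    """n identical capacitors in parallel: C scales by n, ESR and ESL by 1/n."""
    return cap_impedance(freq_hz, n * c_farad, esr_ohm / n, esl_henry / n)


# Placeholder parts (assumed values, for illustration only).
POLYMER = dict(n=1, c_farad=470e-6, esr_ohm=6e-3, esl_henry=1.5e-9)
MLCC_GROUP = dict(n=10, c_farad=47e-6, esr_ohm=3e-3, esl_henry=0.5e-9)

for freq in (1e5, 1e6, 1e7, 1e8):
    z_polymer = bank_impedance(freq, **POLYMER)
    z_mlcc = bank_impedance(freq, **MLCC_GROUP)
    print(f"{freq:>12.0f} Hz  polymer={z_polymer * 1e3:9.3f} mOhm  "
          f"mlcc_group={z_mlcc * 1e3:9.3f} mOhm")
```

With these placeholder numbers the MLCC group shows the lower impedance at the high-frequency end, which is the behaviour the article attributes to MLCCs. Real parts, DC-bias derating and PCB layout will shift the results.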

It's likely that the crash-to-desktop problems are related to both of these issues. This would also explain why some cards cease crashing when underclocked by 50-100 MHz: at lower frequencies (which will generally keep boost frequencies below the 2 GHz mark) there is less broadband frequency mixture happening, which means POSCAP solutions can do their job, even if just barely.
Source: Igor's Lab

297 Comments on RTX 3080 Crash to Desktop Problems Likely Connected to AIB-Designed Capacitor Choice

#276
mtcn77
This still has the odd possibility of being related to Samsung since Nvidia has been following best practices up until now at TSMC. You cannot establish ground rules let alone known good designs at zero hour.
Posted on Reply
#277
BoboOOZ
mtcn77
This still has the odd possibility of being related to Samsung since Nvidia has been following best practices up until now at TSMC. You cannot establish ground rules let alone known good designs at zero hour.
If you mean historically, that's not completely true. There was one node in the past where Nvidia didn't follow TSMC's spec; the results were sub-mediocre and Nvidia blamed it on TSMC. I can't remember which one, but I'm sure the information is easy to find.
Posted on Reply
#278
dragontamer5788
TiN
After all these wild theories are easy to test, no need any engineering education to prove this wrong or right. Take "bad" crashing card with "bad POSCAPs", test it to confirm crashes... Then desolder "bad POSCAPs", put bunch of 47uF low-ESR MLCCs instead, and test again if its "fixed". Something tells me that it would not be such a simple case and card may still crash, heh. ;-)
This has now been tested:

Gigabyte's board starts with 6x POSCAPs / SP-CAPs... or whatever you wanna call the 470 uF big ones.

der8auer removed 2x 470uF, then replaced them with 20x 47uF MLCCs, achieving a +30MHz clock (0.03 GHz). So yes, it has an effect, but it's quite minor.

I think it's safe to say that this entire "capacitor" issue has been grossly overblown, based on the practical test from der8auer. The stock 6x 470uF caps were still able to hold a +70MHz overclock and were stable initially. Reaching +100MHz (+30MHz higher than before) with 20x MLCCs does show that there's some degree of benefit to the MLCCs, but nothing major.

I admit that der8auer did a 3090 test instead of the 3080, but I doubt that makes a major difference. The question is what's the effect of "6x Big Caps" vs "60x Small Caps", and that's what the video tests.
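As a quick sanity check on those numbers (a sketch assuming nominal values and ideal parallel addition, ignoring tolerance and DC-bias derating), the total bulk capacitance is the same in all three configurations, so any difference would have to come from parasitics such as ESR and ESL rather than from total capacitance:

```python
# Nominal values quoted in this thread, in microfarads.
BIG_UF = 470    # one large polymer cap (POSCAP / SP-CAP)
SMALL_UF = 47   # one MLCC

# Capacitances in parallel simply add.
stock_total = 6 * BIG_UF                      # 6x big caps
modded_total = 4 * BIG_UF + 20 * SMALL_UF     # 2 big caps swapped for 20 MLCCs
all_mlcc_total = 60 * SMALL_UF                # hypothetical "60x small caps"

print(stock_total, modded_total, all_mlcc_total)  # 2820 2820 2820
```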
Posted on Reply
#279
EarthDog
dragontamer5788
This has now been tested
I've seen this vid posted like 3 times at this site. How did it not go here (first)?! :p
Posted on Reply
#280
nguyen
dragontamer5788
Gigabyte's board starts with 6x POSCAPs / SP-CAPs... or whatever you wanna call the 470 uF big ones.

der8auer removed 2x 470uF, then replaced them with 20x 47uF MLCCs, achieving a +30MHz clock (0.03 GHz). So yes, it has an effect, but it's quite minor.

I think it's safe to say that this entire "capacitor" issue has been grossly overblown, based on the practical test from der8auer. The stock 6x 470uF caps were still able to hold a +70MHz overclock and were stable initially. Reaching +100MHz (+30MHz higher than before) with 20x MLCCs does show that there's some degree of benefit to the MLCCs, but nothing major.

I admit that der8auer did a 3090 test instead of the 3080, but I doubt that makes a major difference. The question is what's the effect of "6x Big Caps" vs "60x Small Caps", and that's what the video tests.
The biggest offender here is probably the Zotac Trinity with 6x 330uF SP-CAPs; plenty of news outlets also mention that the Zotac 3080 was the least stable of the bunch before the new driver update.

Well, this whole capacitor issue could also be alleviated if the die's power requirement did not change so rapidly, so I guess Nvidia introduced some clock-ramping hysteresis into the driver to improve stability. That doesn't mean Ampere will run at lower clocks, as people might have thought, just that the clocks will react more slowly, allowing some undershoot of the power target.
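One simple way to picture "clocks reacting more slowly" is a slew-rate limiter on the boost clock. The sketch below is purely illustrative: it is not NVIDIA's driver logic, the function names and the step size are made up for the example, and a slew limit is a simpler mechanism than true hysteresis.

```python
def ramp_clock(current_mhz, target_mhz, max_step_mhz):
    """Move toward the target clock by at most max_step_mhz per tick."""
    delta = target_mhz - current_mhz
    step = max(-max_step_mhz, min(max_step_mhz, delta))
    return current_mhz + step


def simulate(targets_mhz, start_mhz, max_step_mhz):
    """Apply the limiter over a sequence of per-tick target clocks."""
    clock = start_mhz
    trace = []
    for target in targets_mhz:
        clock = ramp_clock(clock, target, max_step_mhz)
        trace.append(clock)
    return trace


# A sudden request for 2000 MHz is reached gradually instead of in one jump.
print(simulate([2000, 2000, 2000, 2000], start_mhz=1700, max_step_mhz=100))
# -> [1800, 1900, 2000, 2000]
```

The final clock is unchanged in this toy model; only the rate at which load (and therefore current demand) steps up is reduced.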
Posted on Reply
#281
redeye
asus rtx3080 TUF, for the win...
Posted on Reply
#282
asdkj1740
Steve reports in his latest video that EVGA informed him that the 6-POSCAP cards that were messed up with the first release driver are now running fine with the latest driver.
Posted on Reply
#284
Mr Ethernet
I'm building a new gaming rig. Bought an i9-10900K and was originally planning to pair it with an RTX 3080. My monitor is 2560x1440 (144 Hz), not 4K, so I'm guessing an RTX 3090 would be overkill for me - hence the 3080.

Do you guys recommend I wait a few months before buying? Sounds like I should wait for these reported crashing issues to be ironed out first. I want to buy my GPU as quickly as possible but I also don't want to be a beta tester for something with known issues.
Posted on Reply
#285
mtcn77
nguyen
allowing some undershoot of the power target.
It is about transient response after all, isn't it?
Posted on Reply
#286
lexluthermiester
Mr Ethernet
so I'm guessing an RTX 3090 would be overkill for me - hence the 3080.
If you have the money, a 3090 would future-proof you for a couple years.
Mr Ethernet
Sounds like I should wait for these reported crashing issues to be ironed out first.
The latest driver update seems to be fixing most of the crashing problems. You should be fine. Waiting a month will not hurt you.

And welcome to TPU!
Posted on Reply
#287
Caring1
lexluthermiester
The latest driver update seems to be fixing most of the crashing problems.
I'd like to see comparison tests between drivers first before stating they appear to have fixed the issues.
Gimped performance is more likely.
Posted on Reply
#288
theoneandonlymrk
Caring1
I'd like to see comparison tests between drivers first before stating they appear to have fixed the issues.
Gimped performance is more likely.
They're not likely to say they might have fixed them, though. It seems fair to say the Windows drivers were the issue, since no one running Linux had C2D issues.
So after weighing up the mediocre test provisioning Nvidia allowed AIBs with any driver, the rush to get them out, and the apparent ease with which Nvidia seems to have fixed the issues with a driver update, it's clear Nvidia is to blame.
No drama, just many an AIB GPU engineer who can now seek treatment for bus injuries :):D.
Posted on Reply
#289
lexluthermiester
theoneandonlymrk
seems fair to say the Windows drivers were the issue, since no one running Linux had C2D issues.
Did Linux have Day1 support for Ampere? Haven't paid attention...
Posted on Reply
#290
Xzibit
Caring1
I'd like to see comparison tests between drivers first before stating they appear to have fixed the issues.
Gimped performance is more likely.
He does a quick comparison @ 10:30

Posted on Reply
#291
Caring1
Xzibit
He does a quick comparison @ 10:30
Good to see no change in performance.
Posted on Reply
#292
Mr Ethernet
lexluthermiester
If you have the money, a 3090 would future-proof you for a couple years.

The latest driver update seems to be fixing most of the crashing problems. You should be fine. Waiting a month will not hurt you.
Thanks. I think you're right. I'll fork out a bit extra for the 3090 in about a month. I'm waiting a bit anyway to save up a bit more cash - and waiting will also give stores time to get more 3090s in stock (all out of stock near me right now). Plus later batches potentially being improved is an added bonus!

In the meantime, I need to figure out what custom loop I'm going to go with. Never put together one of those before but I think my Lian Li O11 Dynamic is going to struggle to keep temperatures under control if I air cool it.
lexluthermiester
And welcome to TPU!
Thanks! Happy to be here!
Posted on Reply
#293
EarthDog
This thread title didn't age well.... :p
Posted on Reply
#294
lexluthermiester
Xzibit
He does a quick comparison @ 10:30


Yeah a lot of people jumped on the Cap bandwagon likely because they didn't know any better, but the big outlets jumped on because they thought they were looking at a credible story.
EarthDog
This thread title didn't age well.... :p
True
Posted on Reply
#295
dragontamer5788
EarthDog
This thread title didn't age well.... :p
It's good enough. "Likely" means it's still a theory.
Igor's Lab has posted an interesting investigative article where he advances a possible reason for the recent crash to desktop problems for RTX 3080 owners
I think that's fine. I know I've been pushing the opposite throughout this thread, but that's mostly because the internet ran away with the idea and started over-hyping the issue. This article, Igor's article, and the title all make it clear that it's a theory, "likely", or a "possible reason". Where things got silly was with some other YouTubers and on Reddit, where people started discussing the issue with certainty.
Posted on Reply
#296
Vayra86
Caps or not, the fix was well predicted I think. Small tweak to GPU boost, some voltage which in turn reduces peak clock automagically.

Good to see they kept the losses at an apparent minimum.
Posted on Reply
#297
MelonGx
MelonGx
[MEDIA=twitter]1309840810880282625[/MEDIA]
For those people who insisted TUF won't crash, I post an evidence video of my TUF crashed.
It seems I resolved my TUF 3080's CTD myself.
I downclocked my RAM from DDR4-4266 to DDR4-4000 19-25-25-45.
The CTDs then disappeared, even when I OCed the card to Pwr +117% and Core +55.
Posted on Reply