Friday, September 25th 2020

RTX 3080 Crash to Desktop Problems Likely Connected to AIB-Designed Capacitor Choice

Igor's Lab has posted an interesting investigative article in which he advances a possible reason for the recent crash-to-desktop problems affecting RTX 3080 owners. For one, Igor mentions that the launch timings were much tighter than usual, leaving NVIDIA's AIB partners far less time than they needed to prepare and thoroughly test their designs. One of the reasons this apparently happened was that NVIDIA released the compatible driver stack to AIB partners much later than usual; this meant that their testing and QA for production RTX 3080 graphics cards was mostly limited to power-on and voltage stability testing rather than actual gaming/graphics workload testing. That might have allowed some less-than-stellar chip samples to end up on some of the companies' OC products (which, with higher operating frequencies and the consequent broadband frequency mixtures, hit the apparent 2 GHz frequency wall that produces the crash to desktop).

Another reason for this, according to Igor, is the actual PG132 "reference board" design, the "Base Design" that partners architect their custom cards around. Apparently NVIDIA's BOM left choices open in terms of power cleanup and regulation in the mounted capacitors. The Base Design features six mandatory capacitors for filtering high frequencies on the voltage rails (NVVDD and MSVDD). There are a number of choices for capacitors to be installed here, with varying levels of capability. POSCAPs (Conductive Polymer Tantalum Solid Capacitors) are generally worse than SP-CAPs (Conductive Polymer Aluminium Electrolytic Capacitors), which are in turn superseded in quality by MLCCs (Multilayer Ceramic Chip Capacitors, which have to be deployed in groups). Below is the circuitry arrangement employed below the BGA array where NVIDIA's GA102 chip is seated, which corresponds to the central area on the back of the PCB.
In the images below, you can see how NVIDIA and its AIBs designed this regulator circuitry (NVIDIA Founders Edition, MSI Gaming X, ZOTAC Trinity, and ASUS TUF Gaming OC in order, from our reviews' high resolution teardowns). NVIDIA in their Founders Edition design uses a hybrid capacitor deployment, with four SP-CAPs and two MLCC groups of 10 individual capacitors each in the center. MSI uses a single MLCC group in the central arrangement, with five SP-CAPs handling the rest of the cleanup duties. ZOTAC went the cheapest way (which may be one of the reasons their cards are also among the cheapest), with a six POSCAP design (which are worse than MLCCs, remember). ASUS, however, designed their TUF with six MLCC arrangements - there were no savings made in this power circuitry area.
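The four layouts described above can be written down as a small table for side-by-side comparison. This is only a sketch of the counts reported in the article; the class and function names are made up for illustration, and ordering by MLCC group count is just one way to sort the cards, not a stability ranking.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CapLayout:
    """Capacitor population of the six positions behind the GPU (hypothetical type)."""
    card: str
    poscap: int = 0       # Conductive Polymer Tantalum Solid Capacitors
    sp_cap: int = 0       # Conductive Polymer Aluminium Electrolytic Capacitors
    mlcc_groups: int = 0  # groups of Multilayer Ceramic Chip Capacitors

    @property
    def positions(self) -> int:
        # Total number of populated capacitor positions.
        return self.poscap + self.sp_cap + self.mlcc_groups


# Counts taken from the article text above.
LAYOUTS = [
    CapLayout("NVIDIA Founders Edition", sp_cap=4, mlcc_groups=2),
    CapLayout("MSI Gaming X", sp_cap=5, mlcc_groups=1),
    CapLayout("ZOTAC Trinity", poscap=6),
    CapLayout("ASUS TUF Gaming OC", mlcc_groups=6),
]


def by_mlcc_groups(layouts):
    """Return the layouts ordered from most to fewest MLCC groups."""
    return sorted(layouts, key=lambda layout: layout.mlcc_groups, reverse=True)


if __name__ == "__main__":
    for layout in by_mlcc_groups(LAYOUTS):
        print(f"{layout.card}: {layout.poscap} POSCAP, "
              f"{layout.sp_cap} SP-CAP, {layout.mlcc_groups} MLCC groups")
```

Every layout adds up to the six mandatory positions of the Base Design, which is a quick sanity check on the counts.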

It's likely that the crash to desktop problems are related to both these issues - and this would also explain why some cards cease crashing when underclocked by 50-100 MHz, since at lower frequencies (which will generally keep boost frequencies below the 2 GHz mark) there is less broadband frequency mixture happening, which means POSCAP solutions can do their job - even if just barely.
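To make the filtering argument more concrete, a real capacitor is commonly approximated as a series R-L-C circuit (capacitance, ESR, ESL), and identical capacitors in parallel divide that impedance by their count. The sketch below uses this textbook model only. The component values are hypothetical placeholders, not datasheet or measured figures for any RTX 3080 board, and the function names are invented for illustration.

```python
import math


def self_resonant_freq(cap_f, esl_h):
    """Frequency (Hz) at which capacitive and inductive reactance cancel."""
    return 1.0 / (2.0 * math.pi * math.sqrt(cap_f * esl_h))


def impedance_magnitude(freq_hz, cap_f, esr_ohm, esl_h, parallel=1):
    """|Z| in ohms of `parallel` identical capacitors (series R-L-C model)."""
    omega = 2.0 * math.pi * freq_hz
    reactance = omega * esl_h - 1.0 / (omega * cap_f)
    return math.hypot(esr_ohm, reactance) / parallel


# HYPOTHETICAL placeholder values, chosen only to exercise the formulas.
POLYMER_CAP = dict(cap_f=330e-6, esr_ohm=6e-3, esl_h=1.5e-9)
MLCC_GROUP = dict(cap_f=47e-6, esr_ohm=3e-3, esl_h=0.5e-9, parallel=10)

if __name__ == "__main__":
    for freq in (1e5, 1e6, 1e7, 1e8):
        z_polymer = impedance_magnitude(freq, **POLYMER_CAP)
        z_mlcc = impedance_magnitude(freq, **MLCC_GROUP)
        print(f"{freq:>12.0f} Hz  polymer {z_polymer:.6f} ohm  "
              f"MLCC group {z_mlcc:.6f} ohm")
```

Above the self-resonant frequency the ESL term dominates, so in this model the parasitic inductance rather than the capacitance decides how well high-frequency noise is shunted. Real values would have to come from the parts actually fitted to a given card.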
Source: Igor's Lab

297 Comments on RTX 3080 Crash to Desktop Problems Likely Connected to AIB-Designed Capacitor Choice

#201
OneMoar
There is Always Moar
lexluthermiester
THAT was awesome!! I love Dr Cox!

@OneMoar & @Rado D

I'm redirecting that video right back at the two of you and I'm going to suggest that you go do some "moar" reading, paying careful attention to context. There are a few subtleties you both seem to be missing.
O looky here it seems Panasonic is not the only manufacturer of Polymer tantalum electrolytic capacitors

www.vishay.com/docs/40254/t50.pdf (vPolyTan )

No I don't care if the cards in question actually use Panasonic's line of Polymer tantalum electrolytic capacitors
Nor do I care what POSCAP stands for or what particular formulation Panasonic is using

unilaterally declaring all Polymer tantalum electrolytic capacitors as POSCAP is misinformation at best and at worst illegal

POSCAP is a registered trademark of Panasonic. And in accordance with trademark law NOBODY else can refer to their product as 'POSCAP'
./thread



the next f***ing smooth brain that mentions it again is going to get an angry pussy thrown in their face
Posted on Reply
#202
lexluthermiester
Once again, people getting lost in the details and not seeing the context...
Posted on Reply
#203
OneMoar
There is Always Moar
lexluthermiester
Once again, people getting lost in the details and not seeing the context...
/me dunks a cat in warm soapy bathwater and chucks it at @lexluthermiester
Posted on Reply
#204
lexluthermiester
OneMoar
the next f***ing smooth brain that mentions it again is going to get an angry pussy thrown in their face
OneMoar
/me dunks a cat in warm soapy bathwater and chucks it at @lexluthermiester
ROFLMBO!!! :laugh::roll::rockout::clap:

And now ladies and gentlemen, we return everyone to the regularly scheduled thread topic.
Posted on Reply
#205
okbuddy
BOM is very important, everything is about budget, we are not dealing with art here
Posted on Reply
#206
lexluthermiester
okbuddy
BOM is very important, everything is about budget, we are not dealing with art here
True. Finding the optimal balance between cost, quality and value to the end user can be a very serious challenge. Everyone wants to make money and as much as possible. In the case of video card AIBs, they want to make money but also boost their brand. Most actually care about making a quality product and hate it when things like the problems being faced currently happen.
Posted on Reply
#207
Chomiq
asdkj1740
Hi Tin,
the evga statement posted by jacob seems to be supporting the "guesswork" from that blogger, what do you think?
forums.evga.com/Message-about-EVGA-GeForce-RTX-3080-POSCAPs-m3095238.aspx
TiN isn't working for evga for some time now so there's no way he would give you any "official" input on this. Even if he was, he'd probably not address it.

IMHO Evga statement is about last minute change in board design, which came up during all this CTD stuff. Their internal testing showed some issues with previous cap layout on their board and they made change for the final design. This is in no way a confirmation of "guesswork" done by "blogger".

Right now we get similar reports about CTDs on almost every partner design as well as on FEs. Bloggers can speculate whatever they want, take a look at jayz "I've just been told that there are multiple type of tantalum caps". Jayz is not a freaking board engineer. He's probably on the same level as anyone on this forum that cares to do some serious research when it comes to power delivery on gpus.

Leave guesswork to people on forum and bloggers and YouTubers that are oriented on clicks and views. They will milk this for as long as they can. Ok, that was a bit harsh, I understand that IgorLab and Jayz want to inform their viewers about an issue with a newly released product. They're not doing it just for clicks and views. But they also get paid based on the type and the amount of content they produce.

Engineers from Nvidia and AiB partners will do the actual work. In the end this will be either solved by driver, firmware or worst case scenario - full blown recall based on the result for each manufacturer RMA and exchange to v2 board design.
Posted on Reply
#208
OneMoar
There is Always Moar
I doubt a recall is likely, and remember, just because a user has experienced a CTD does not mean it's this particular issue at fault; there are plenty of other things that can cause a CTD or TDR event
Posted on Reply
#209
Chomiq
OneMoar
I doubt a recall is likely, and remember, just because a user has experienced a CTD does not mean it's this particular issue at fault; there are plenty of other things that can cause a CTD or TDR event
Yeah I know, we've got reports from people saying they're running their cards on stock, others that say they run theirs on stock but post a screenshot with an overclock applied in Precision X2, etc. There are many variables that need to be considered.

Once again, all of this has to be investigated properly by people that get paid to do it.*

* and it wouldn't be needed if they'd have done it right in the first place.
Posted on Reply
#210
trsskater63
I think this problem has less to do with the cheap caps and more to do with the fact that Nvidia is pushing so much power through this card. Even cards with all 6 of the expensive caps still crash to desktop. I'm pretty sure the companies that went with all 6 cheap caps aren't causing the problem. If anything, they might make the actual problem even worse. I think the problem is that Boost 2.0 allows these cards to push past where they would be stable, since these coolers are over-engineered to deal with the extra heat and are doing a really good job of it. So they stay cool enough for the GPU boost to think it can push farther, since it hasn't hit power limits or temperature limits. There is probably a lot of noise or crosstalk or other things going on causing an error somewhere in the line because of how much power is running through the card with everything so close together. It looks like this card is really being pushed hard, which is why it even needs this much power. It could be that the card is too close to the limits of how it can function.
Posted on Reply
#211
Shatun_Bear
What a nightmare.

Spend $800-1500 on a graphics card only to find out it's been cheaply made and needs to be sent back for RMA, taking several weeks to get a new one as there is limited stock.

I can't believe the stupidity of some spending the best part of a grand on a graphics card from an AIB I have honestly never heard of, like Ventus or Eagle? Crazy risk.
Posted on Reply
#212
GreiverBlade
isn't nvidia using AIBs as scapegoats? because i remember reading somewhere (need to find that again) that founders edition cards are also having these issues ...
Posted on Reply
#213
OneMoar
There is Always Moar
Shatun_Bear
What a nightmare.

Spend $800-1500 on a graphics card only to find out it's been cheaply made and needs to be sent back for RMA, taking several weeks to get a new one as there is limited stock.

I can't believe the stupidity of some spending the best part of a grand on a graphics card from an AIB I have honestly never heard of, like Ventus or Eagle? Crazy risk.
the better cards don't seem to suffer from this particular issue and I have seen no confirmed reports of crashing that were caused by this particular problem on FE cards (despite what some people are claiming because they read it on the internet, the FE cards don't have the problem, at least NOT this particular issue)

this kind of thing is fairly normal when you ride the bleeding edge of new hardware
Posted on Reply
#214
mtcn77
Shatun_Bear
What a nightmare.
This is like the solar flares coming in every 11 years, it is a rinse repeat of the non-eutectic solder joints in 2008.
Posted on Reply
#215
OneMoar
There is Always Moar
mtcn77
This is like the solar flares coming in every 11 years, it is a rinse repeat of the non-eutectic solder joints in 2008.
a bit overly dramatic
Posted on Reply
#216
steen
Investigation of PCB power stages/decoupling/filtering, fw/drivers/boost behavior makes sense. I noticed some further interesting results in a followup article from Igor's Lab.



FE/Ref/AIB cards have current sensing/balancing circuits for each power rail. Why does PEG exceed the 6.5A hard limit?
Posted on Reply
#217
trsskater63
OneMoar
the better cards don't seem to suffer from this particular issue and I have seen no confirmed reports of crashing that were caused by this particular problem on FE cards (despite what some people are claiming because they read it on the internet, the FE cards don't have the problem, at least NOT this particular issue)

this kind of thing is fairly normal when you ride the bleeding edge of new hardware
The YouTuber Tech Yes City just did a video about this yesterday and tested his reviewer sample Asus Tuf OC edition vs a production sample Asus Tuf non-OC edition. Both of them have 6 of the better caps and the production Asus Tuf still would crash to desktop. It looks like it's more than the issue of the caps.
Posted on Reply
#218
kiriakost
TiN
100k oscilloscope or 10 GHz oscilloscope is not required again to do power delivery network analysis of the VGA card. It is required for signal integrity measurements and verification, such as that pretty PAM4 eye diagram everybody saw on marketing presentation slides or for testing PCIe 4.0 signal quality and interface health (and 10GHz for that is not enough, gotta need 32GHz+, which together with probe system would go for 300k$+ mark). Why it is not needed for power testing and VRM tuning? Because major (99.9%) amount of frequency bandwidth involved in switching hundreds of amps on large planes like PCB has is limited to few tens MHz tops, and all fast transient at GHz are handled by GPU package, not the PCB.
You seem lost again.
I do not care about any power delivery network analysis.
If you wish your opinion to be taken seriously, you should pay for and get a 100GHz oscilloscope, so that when NVIDIA shows up saying that they have a fix, you will be able to verify it with measurements.

All that regular gamers care about is that the problem stops appearing on their screens.

My advice ... if you are not part of the solution... then just take a step back.
You acted the same way about the 8846A in the past: you failed to repair it, and you are still spreading misinformation about it. :banghead:

I will again suggest patience .. patience .. patience .. so that the people who are responsible for this work can deliver their decisions on what comes next to the buyers of the RTX 3000 series.
Posted on Reply
#219
OneMoar
There is Always Moar
trsskater63
The YouTuber Tech Yes City just did a video about this yesterday and tested his reviewer sample Asus Tuf OC edition vs a production sample Asus Tuf non-OC edition. Both of them have 6 of the better caps and the production Asus Tuf still would crash to desktop. It looks like it's more than the issue of the caps.
yes but there is more to this than the cap type. I never disputed that; I also said there is a multitude of OTHER issues at play here, everything from it being a new architecture to driver issues that may cause a wide variety of stability issues. we went through this exact same phase with Turing: the drivers were no more stable at launch than Ampere's

now here is my Opinion on what is going on

more than likely all that needs to happen in terms of hardware stability is the driver needs a more aggressive voltage table. if you look at the voltages at load, they are all over the place and are dipping below 1000 mV, which is just not enough voltage for 2000 MHz. this issue is exacerbated by some cards with poorer power designs. we know from Turing that they get unstable at about 2000-2050 MHz for most samples and they don't scale particularly well even with lots of voltage

this is a silicon limitation and likely to be worse on Ampere than Turing because it's a smaller\new process

whatever other problems any given pcb might have, more voltage should help with stability; it buys you more room to breathe when the silicon is already operating at its outer limits. which is really the problem: AIBs want their pre-overclocked 2000 MHz cards and the silicon quite simply is not going to do that as easily as previous generations
Posted on Reply
#220
Caring1
Shatun_Bear
I can't believe the stupidity of some spending the best part of a grand on a graphics card from an AIB I have honestly never heard of, like Ventus or Eagle? Crazy risk.
Ventus is an MSI card, just as Eagle is a Gigabyte card, they are merely naming conventions by known brands.
Posted on Reply
#221
Calmmo
Call me crazy but this could all be drivers.
With some overpushed early chips clocking too high on a chip that is already near the limit at stock. Aka not one single issue. (kinda like zen2, only those don't boost more than they can handle)
And those cards will get a "fixed" bios update. Your Eagles, Strixes etc might be getting a slightly less OC variant bios.
Posted on Reply
#222
Shatun_Bear
OneMoar
the better cards don't seem to suffer from this particular issue and I have seen no confirmed reports of crashing that were caused by this particular problem on FE cards (despite what some people are claiming because they read it on the internet, the FE cards don't have the problem, at least NOT this particular issue)

this kind of thing is fairly normal when you ride the bleeding edge of new hardware
This is totally not normal, software issues are to be expected, sure, but hardware problems like this are not the norm, of course they are not. There is even talk of a mass recall. When was the last time that happened for an Nvidia launch if it's all 'to be expected'?

Secondly, to the bolded, you must have been burying your head in the sand then or not bothered to look, but there are reports everywhere of FE CTD problems, this is clearly not relegated to just AIBs.
Caring1
Ventus is an MSI card, just as Eagle is a Gigabyte card, they are merely naming conventions by known brands.
Oh I see, 'cheap' models of these manufacturers.
Posted on Reply
#224
lexluthermiester
OneMoar
I doubt a recall is likely, and remember, just because a user has experienced a CTD does not mean it's this particular issue at fault; there are plenty of other things that can cause a CTD or TDR event
Right, like drivers that need further refinement.
Posted on Reply
#225
kiriakost
lexluthermiester
Right, like drivers that need further refinement.
I simply wonder how NVIDIA will prioritize drivers that need further refinement.

A few months ago the GTX 1660 Super was presented as a fresh product option, and as a fresh one its drivers should still be gaining performance and compatibility.
Now the RTX 3000 issue will shift the focus of NVIDIA's driver developers in that direction.
In simple English, thousands of people's expectations get put on hold.
Posted on Reply