Friday, September 25th 2020

RTX 3080 Crash to Desktop Problems Likely Connected to AIB-Designed Capacitor Choice

Igor's Lab has posted an interesting investigative article in which Igor advances a possible reason for the recent crash-to-desktop problems affecting RTX 3080 owners. For one, Igor mentions that the launch timings were much tighter than usual, leaving NVIDIA's AIB partners far less time than they needed to prepare and thoroughly test their designs. One apparent reason is that NVIDIA released the compatible driver stack to AIB partners much later than usual; this meant that their testing and QA for production RTX 3080 graphics cards was mostly limited to power-on and voltage stability testing rather than actual gaming/graphics workload testing. That might have allowed some less-than-stellar chip samples to end up on some of the companies' OC products (which, with higher operating frequencies and the consequent broadband frequency mixtures, hit the apparent 2 GHz frequency wall that produces the crash to desktop).

Another reason for this, according to Igor, is the actual reference board design, PG132, which serves as the "Base Design" that partners design their custom cards around. Apparently, NVIDIA's BOM left open choices for the capacitors that handle power cleanup and regulation. The Base Design features six mandatory capacitors for filtering high frequencies on the voltage rails (NVVDD and MSVDD). There are a number of choices for the capacitors to be installed here, with varying levels of capability. POSCAPs (Conductive Polymer Tantalum Solid Capacitors) are generally worse than SP-CAPs (Conductive Polymer Aluminium Electrolytic Capacitors), which are in turn superseded in quality by MLCCs (Multilayer Ceramic Chip Capacitors, which have to be deployed in groups). Below is the circuitry arrangement employed beneath the BGA array where NVIDIA's GA102 chip is seated, which corresponds to the central area on the back of the PCB.
In the images below, you can see how NVIDIA and its AIBs designed this regulator circuitry (NVIDIA Founders' Edition, MSI Gaming X, ZOTAC Trinity, and ASUS TUF Gaming OC, in that order, from our reviews' high-resolution teardowns). NVIDIA's Founders' Edition design uses a hybrid capacitor deployment, with four SP-CAPs and two MLCC groups of 10 individual capacitors each in the center. MSI uses a single MLCC group in the central arrangement, with five SP-CAPs handling the rest of the cleanup duties. ZOTAC went the cheapest way (which may be one of the reasons their cards are also among the cheapest), with a six-POSCAP design (which are worse than MLCCs, remember). ASUS, however, designed their TUF with six MLCC arrangements; no savings were made in this power circuitry area.

It's likely that the crash-to-desktop problems are related to both of these issues. This would also explain why some cards stop crashing when underclocked by 50-100 MHz: at lower frequencies (which will generally keep boost frequencies below the 2 GHz mark) there is less broadband frequency mixture happening, which means POSCAP solutions can do their job, even if just barely.
Source: Igor's Lab

297 Comments on RTX 3080 Crash to Desktop Problems Likely Connected to AIB-Designed Capacitor Choice

#101
TiN
The whole article is based on speculation upon speculation. First of all, a high-current DC-DC PDN (power delivery network) is a real challenge, and indeed must use proper decoupling. However, that does not mean that POSCAP/SP-CAP or MLCC is the best choice in every case. Much more depends on transient tuning, VRM settings, and the PCB layout itself than on using an MLCC or a POSCAP in a specific spot. Just replacing everything with MLCCs will NOT help the design reach higher speeds and stability. Why? Because one needs to use different kinds of caps in tandem, as their frequency response differs, as do ESR, ESL and other factors.

Having everything with MLCC, like the glorified ASUS does, means you have a single deep resonance notch, instead of the two less prominent notches you get when MLCC and POSCAP are used together. Using three kinds (smaller POSCAPs, bigger POSCAPs, and some MLCCs) gives a better figure with three notches. But again, with modern DC-DC controllers a lot of this can be tuned through PID control and converter slew rate tweaks. This adjustability is one of the big reasons why enthusiast cards often use "digital" controllers that allow tweaking such parameters almost on the fly. However, this is almost never exposed to the user, as wrong settings can easily make power phases go brrrrrr with smoke. Don't ask me how I know...
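To make the resonance argument above concrete, here is a minimal Python sketch that models each capacitor as a series R-L-C branch and sweeps the impedance of a parallel bank. The ESR and ESL figures, the part counts, and the helper names are illustrative assumptions, not measured values from any RTX 3080 board, so whether the mixed bank shows one or several distinct minima depends entirely on the numbers you plug in.

```python
import math

# Hypothetical part parameters (capacitance F, ESR ohm, ESL H); not measured values.
POLYMER_470UF = (470e-6, 6e-3, 1.5e-9)
MLCC_47UF = (47e-6, 2e-3, 0.4e-9)


def cap_impedance(freq_hz, c_farad, esr_ohm, esl_henry):
    """Complex impedance of one capacitor modeled as a series R-L-C branch."""
    w = 2 * math.pi * freq_hz
    return complex(esr_ohm, w * esl_henry - 1.0 / (w * c_farad))


def bank_impedance(freq_hz, parts):
    """Impedance of parallel capacitors; parts is a list of (C, ESR, ESL, count)."""
    admittance = sum(n / cap_impedance(freq_hz, c, r, l) for c, r, l, n in parts)
    return 1.0 / admittance


def self_resonance_hz(c_farad, esl_henry):
    """Frequency at which the capacitive and inductive reactances cancel."""
    return 1.0 / (2 * math.pi * math.sqrt(esl_henry * c_farad))


def find_notches(parts, start_decade=3, stop_decade=8, steps_per_decade=100):
    """Return (frequency, |Z|) at each local minimum of |Z| over a log sweep."""
    n = (stop_decade - start_decade) * steps_per_decade + 1
    freqs = [10 ** (start_decade + i / steps_per_decade) for i in range(n)]
    mags = [abs(bank_impedance(f, parts)) for f in freqs]
    return [(freqs[i], mags[i]) for i in range(1, n - 1)
            if mags[i] < mags[i - 1] and mags[i] < mags[i + 1]]


# Assumed example banks: six groups of ten MLCCs vs. four polymer caps plus two MLCC groups.
all_mlcc = [MLCC_47UF + (60,)]
mixed = [POLYMER_470UF + (4,), MLCC_47UF + (20,)]
for name, parts in (("all MLCC", all_mlcc), ("polymer + MLCC", mixed)):
    for freq, mag in find_notches(parts):
        print(f"{name}: minimum of {mag * 1e3:.3f} mOhm near {freq / 1e6:.2f} MHz")
```

A bank of identical parts always has a single minimum at the part's self-resonant frequency, where the magnitude bottoms out at ESR divided by the part count. Mixing part types spreads the low-impedance region across frequency, but how it looks in practice also depends on PCB parasitics that this model leaves out.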

Everybody is going nuts now over MLCC vs. POSCAP, but I didn't see a single note that the actual boards use DIFFERENT capacitances and capacitor models, e.g. some use 220uF, some use 470uF :) There are 680uF or even 1000uF capacitors in D case on the market that can be used behind the GPU die. It is impossible to install that much capacitance with MLCCs in the same spot, as the largest cap in 0603 is 47uF, for example.
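The capacitance point can be checked with simple arithmetic. The sketch below takes the four layouts described in the article and totals their nominal capacitance for several polymer capacitor values. It assumes every MLCC group holds ten 47uF parts (the article states the group size only for the Founders' Edition) and that all polymer caps on a board share one value; both are assumptions for illustration, as are the names used.

```python
# Assumed MLCC value (the largest 0603 part mentioned above) and group size.
MLCC_UF = 47.0
MLCC_PER_GROUP = 10

# Capacitor positions behind the GPU, as described in the article.
LAYOUTS = {
    "NVIDIA Founders' Edition": {"polymer": 4, "mlcc_groups": 2},
    "MSI Gaming X": {"polymer": 5, "mlcc_groups": 1},
    "ZOTAC Trinity": {"polymer": 6, "mlcc_groups": 0},
    "ASUS TUF Gaming OC": {"polymer": 0, "mlcc_groups": 6},
}


def nominal_total_uf(layout, polymer_uf):
    """Nominal capacitance in uF for one layout and one polymer capacitor value."""
    mlcc_total = layout["mlcc_groups"] * MLCC_PER_GROUP * MLCC_UF
    return layout["polymer"] * polymer_uf + mlcc_total


for polymer_uf in (220.0, 470.0, 1000.0):
    print(f"polymer caps at {polymer_uf:.0f} uF:")
    for card, layout in LAYOUTS.items():
        print(f"  {card}: {nominal_total_uf(layout, polymer_uf):.0f} uF nominal")
```

Under these assumptions a ten-part MLCC group only matches a single 470uF polymer capacitor, so with 680uF or 1000uF parts the polymer positions carry more nominal capacitance in the same footprint. These are nominal figures only; the effective capacitance of ceramic parts typically drops under DC bias.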

Before looking at the poor six capacitors behind the die: why does nobody talk about the huge POSCAP capacitor bank behind the VRM on the FE card, eh? Custom AIB cards don't have that, just the usual array without much bulk capacitance. If I were designing a card, I'd look at the GPU's power demands and then add enough bulk capacitance first to ensure a good power impedance margin at mid-frequency ranges, while worrying about capacitors for high-frequency decoupling later, as that is a relatively easier job to tweak.
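The "power demands first" approach described above is commonly expressed as a target impedance plus a rough bulk capacitance estimate. The Python sketch below shows both textbook formulas. The rail voltage, tolerance, load step and regulator response time are made-up illustration values, not NVIDIA specifications, and the droop estimate ignores ESR and ESL.

```python
def target_impedance_ohm(rail_v, tolerance, load_step_a):
    """Maximum PDN impedance that keeps the rail within tolerance for a load step."""
    return rail_v * tolerance / load_step_a


def bulk_capacitance_f(load_step_a, response_time_s, allowed_droop_v):
    """Capacitance needed to supply the load step until the regulator reacts.

    Uses C = I * dt / dV and ignores ESR/ESL, so it is only a lower bound.
    """
    return load_step_a * response_time_s / allowed_droop_v


# Hypothetical numbers: 1.0 V rail, 5 % allowed deviation, 100 A load step,
# and a regulator that needs 1 microsecond to respond.
rail_v, tolerance, load_step_a, response_time_s = 1.0, 0.05, 100.0, 1e-6

z_target = target_impedance_ohm(rail_v, tolerance, load_step_a)
c_bulk = bulk_capacitance_f(load_step_a, response_time_s, rail_v * tolerance)
print(f"target impedance: {z_target * 1e3:.2f} mOhm")
print(f"bulk capacitance lower bound: {c_bulk * 1e6:.0f} uF")
```

With numbers of this order the bulk requirement already lands in the thousands of microfarads, which is why the bulk bank is sized before the handful of high-frequency decoupling parts.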

After all, these wild theories are easy to test; no engineering education is needed to prove them wrong or right. Take a "bad" crashing card with "bad POSCAPs" and test it to confirm the crashes... Then desolder the "bad POSCAPs", put a bunch of 47uF low-ESR MLCCs in instead, and test again to see if it's "fixed". Something tells me that it would not be such a simple case and the card may still crash, heh. ;-)
Posted on Reply
#102
kiriakost
dicktracy
Dis is why you don't want to be an early adopter.
I am an early adopter. I have run my blog for eight long years, and I exclusively explore the world of electrical test and measurement equipment and testers.
Due to lots of reading and practice, and the opportunity to receive the highest-precision parts and measuring tools, I also made my entrance into electrical metrology.
This is the top of the pyramid in that science.
And I won recognition in my sector from the industry itself, as they judged that their blogger, and in a way a trainee early adopter, has the true potential to adopt and understand what their high-tech work can do and how it is used.

But here comes the difference between me and others: I prepared myself for 30 years as a freelance electrician and electronics repairman, studying, practicing, and achieving a very high success rate when I do repairs or troubleshoot real problems for my local customers.
This is the hard, slow and painful way for someone to develop skills and understanding.

Today, because of Igor, a German retiree, all the YouTube actors / product reviewers found a reason to power on their cameras.
But even so, they are clueless about what they are talking about.

Therefore, all consumers should simply wait for NVIDIA and its partners to do their own homework; any new decisions will be officially announced in the market no sooner than 40 days from now.
TiN
After all, these wild theories are easy to test; no engineering education is needed to prove them wrong or right. Take a "bad" crashing card with "bad POSCAPs" and test it to confirm the crashes... Then desolder the "bad POSCAPs", put a bunch of 47uF low-ESR MLCCs in instead, and test again to see if it's "fixed". Something tells me that it would not be such a simple case and the card may still crash, heh. ;-)
I can solder and desolder anything too, but GPU engineering is something that no one can grasp without being part of NVIDIA's R&D team.
Fifteen years ago, the only term that consumers knew was the number of pipelines.
GPU engineering has nothing to do with YOU becoming a mechanic for your own car; it does not work that way, due to the unimaginable complexity of modern designs.
Posted on Reply
#103
gloomfrog
It's doubtful whether using POSCAPs means cheap; in some cases one POSCAP can be more expensive than ten MLCCs.
Posted on Reply
#104
TiN
opportunity to receive the highest-precision parts and measuring tools, I also made my entrance into electrical metrology.
Do tell more :)

Though there is not much need for the highest-precision equipment to capture the Bode plot and response of the relatively slow DC-DC converter that is used on the 3080/3090 GPUs here. One does need decent differential probes, an injector or high-speed load, and a good scope or Bode plot analyzer :)

Again, one does not need to know anything about GPU or silicon design to make a good DC-DC converter that can meet power requirements of the chip. You can measure all this in a typical EE lab that all AIBs already have. No need to work at NVIDIA to do this, as DC-DC converter design is a very common job that is done in the majority of electronics, be it GPU, motherboard, console or TV.

Also, fun fact: MLCC caps produce a lot of acoustic noise. Remember the squeaking cards that customers hate and RMA so much? :)
Posted on Reply
#105
yeeeeman
This is another example why no one should buy a product on its first batches. Let it pass at least a month.
Posted on Reply
#106
Tsukiyomi91
Seems that I'll be getting an RTX3070 or a 3060.
Posted on Reply
#107
Chomiq
yeeeeman
This is another example why no one should buy a product on its first batches. Let it pass at least a month.
More like an example of the proper validation required in the R&D process instead of a rushed release. This falls on both Nvidia and (some) AIB partners.
Posted on Reply
#108
kiriakost
TiN
Do tell more :)

Again, one does not need to know anything about GPU or silicon design to make a good DC-DC converter that can meet power requirements of the chip.
No need to do so; you are well aware that a lack of understanding of this limited your joy in bringing the 8846A back from the dead.
I got one a year ago and I even helped develop logging software for it.
3080/3090 GPUs are more complex than the 8846A. :)
Just keep that in mind.
Posted on Reply
#109
BoboOOZ
Turmania
Does this mean a recall is coming? If so not very good for Nvidia.
Nvidia never admits being wrong and always blames the partners (TSMC, Apple, etc.), so here they will say that the fault is with the AIB and the fix will be based on downclocking...
Posted on Reply
#110
HD64G
Another con of Ampere consumer GPUs being made on Samsung's 8nm and ending up as an ultra-high power draw chip. And power circuit robustness is the same reason that the cheapest AIB models more often than not have higher RMA rates than the higher-quality ones.
Posted on Reply
#111
kiriakost
BoboOOZ
Nvidia never admits being wrong and always blames the partners (TSMC, Apple, etc.), so here they will say that the fault is with the AIB and the fix will be based on down-clocking...
We are all here to verify that, but do not expect to get any solid answers in less than four weeks.
Posted on Reply
#112
Chomiq
For those that didn't get the memo, here's @TiN:

Just so someone doesn't jump the gun and says he's pulling this out of his you know what.
Posted on Reply
#113
kiriakost
HD64G
Another con of Ampere consumer GPUs being made on Samsung's 8nm and ending up as an ultra-high power draw chip. And power circuit robustness is the same reason that the cheapest AIB models more often than not have higher RMA rates than the higher-quality ones.
Until now I thought that the cheapest ones receive a hell of a torture because poor people try to OC them without any sanity. :D
Posted on Reply
#114
BoboOOZ
kiriakost
We are all here to verify that, but do not expect to get any solid answers in less than four weeks.
Of course, we need patience.

EVGA's stance seems to confirm there is a problem with the choice of capacitors, although maybe cheaping out is not the root of the problem, but rather insufficient testing.
On the other hand, FE cards seem to crash, too, so there might be other sources of issues, PSU related or such.
Posted on Reply
#115
kiriakost
Chomiq
For those that didn't get the memo, here's @TiN:

Just so someone doesn't jump the gun and says he's pulling this out of his you know what.
Electronics engineering and GPU architecture are two different mountain tops.
BoboOOZ
Of course, we need patience.

EVGA's stance seems to confirm there is a problem with the choice of capacitors, although maybe cheaping out is not the root of the problem, but rather insufficient testing.
On the other hand, FE cards seem to crash, too, so there might be other sources of issues, PSU related or such.
A 750W PSU has headroom up to 1150W max, so you may expect only about 1% of complaints to be related to it.
Mostly because users are not aware of the actual health status of the PSU in their hands, i.e. its current power delivery in watts given its age.
Posted on Reply
#116
TiN
I'm still missing how GPU architecture or GPU design matters here. One can assume it's a magic oompa-loompa inside the chip doing the math, and it would be the same either way, as long as you know (can measure) how many amps and what voltage margins the loompas need to stay happy. That is the number one test to be done for all new GPUs, before you can even begin to write the specification of the VRM design.

P.S. The lack of joy with the 8846A was not because of its digital issues, but because I am/was not much interested in it, having way more fun with the 3458A/2002/etc. :) Even a fully working 8846A is quite a poor unit for what it costs...

P.P.S. All above are just my personal ramblings, not related to any AIB point of view.
Posted on Reply
#118
kiriakost
TiN
I'm still missing how GPU architecture or GPU design matters here. One can assume it's a magic oompa-loompa inside the chip doing the math, and it would be the same either way, as long as you know (can measure) how many amps and what voltage margins the loompas need to stay happy. That is the number one test to be done for all new GPUs, before you can even begin to write the specification of the VRM design.

P.S. The lack of joy with the 8846A was not because of its digital issues, but because I am/was not much interested in it, having way more fun with the 3458A/2002/etc. :) Even a fully working 8846A is quite a poor unit for what it costs...

P.P.S. All above are just my personal ramblings, not related to any AIB point of view.
It is not among my priorities to discover NVIDIA's magic oompa-loompa inside the chip, because I do not make money from VGA card repairs.
I am aware of your measuring gear, but your accident did stop your exploration of what an 8846A can do as by far the most modern design.
Anyway this is another story, and a boring one for the readers of this forum.
Posted on Reply
#119
mtcn77
kiriakost
Anyway this is another story, and a boring one for the readers of this forum.
Please carry on. You are carrying it like the main event. I appreciate it more than the uninformed opinions.

This component race somehow makes me wonder if there are forbidden cheats that don't meet the regulations. Where there is a rule, there is a violation.
Posted on Reply
#120
jormungand
Looks like the scalpers saved the day!!!! We need to thank them, they sacrificed their wallets in order to protect ours bois!!!!
Re-manufacturing uhmmmm
Now the companies will have to show that they made a reliable product and works fine.
Posted on Reply
#121
Haile Selassie
There seem to be more and more indications that this is poor QC on the yield side, not a PCBA design problem. Either that, or a faulty boost algorithm or a bad VID/FID table.
Over 2GHz seems to be an issue, either MCU design or design process limit or both.
I personally expect there will be BIOS updates that will lower the maximum boost clock.
Posted on Reply
#122
basco
Could this be why MSI put such a low power target on their 3x8-pin Trio cards?
Posted on Reply
#123
AsRock
TPU addict
LabRat 891
Gotta love EMI/RFI design oversights. From what I've read, it is the bane of every freshly college-educated EE and many a veteran EE. I bet somebody on the design teams knew that this would cause a problem and was promptly ignored after referencing datasheets claiming "It'll be fine!"
Posted on Reply
#124
mak1skav
Meh at least with 2xxx series we had Space Invaders but now we just have an ordinary crash to desktop ;)
Posted on Reply