1080 Ti - Degradation of Performance During Neural Network Training

9of9 · Jun 22, 2019

Thanks! Installed that one as well.

After a reboot I ran MSI Afterburner again and it was able to run the test just fine - reports 90% pass.

I've tried it out on a few games and it seems like I'm getting graphics crashes across the board at this point, not just with neural net training :/ Even though the GPU temperature is now far more reasonable... I've attached a GPU-Z log for Assassin's Creed Odyssey. Runs for a few minutes, then crashes hard.

harm9963 · Jun 22, 2019

What is the Event Viewer log showing?
Have you ever used DDU?
What are your specs? Any OC on RAM or anything?
Also try debug mode in the NVIDIA Control Panel, under Help. It will underclock your GPU to 1481 MHz.

9of9 · Jun 22, 2019

Will report back once I've found the Event Viewer log. Don't think I've used DDU on this machine, no. This is actually a relatively recent clean install of Windows after suffering some unrecoverable corruption on my hard drive.

CPU is a Core i7 4820K - I had it overclocked at one point, but I don't think anything is OC'd right now.

Debug mode is a good shout - I'll try that. Judging by its behaviour when it switches to limp mode after a crash, it is a lot more stable when underpowered. Maybe a small underclock would ultimately fix this.

Valantar · Jun 22, 2019

What's the VRM cooling like on that card? And perhaps equally importantly, what's the VRM layout like? This is pure speculation, but your issues don't sound like standard thermal throttling, but more like some badly implemented measure to compensate for poorly cooled VRMs. If the VRMs don't have thermal sensors and also have poor cooling, the issue might be that they start overheating and thus delivering unstable voltages when subjected to heavy enough loads over time, causing the GPU to go into some sort of self-protection mode due to unstable voltages (as I believe someone mentioned before). This would match your recorded behavior as it'd be a "one-off" state rather than normal throttling. The question is whether repeating this process can lead to permanent damage either to the VRM (unlikely) or the GPU due to unstable/noisy voltages. I don't know, but it doesn't sound impossible.

Assimilator · Jun 23, 2019

9of9 said:
Maybe a small underclock would ultimately fix this.

I'd agree that would be a good thing to try. That said, if the card isn't able to run at the clocks it's supposed to, you should consider RMAing it - in particular, the poor fan curve may have caused long-term damage to the card via overheating.

If you do RMA it, see if you can get a different brand. If that's not possible, apply the improved MSI AB fan curve tweak to the replacement.

silkstone · Jun 24, 2019

Either A - You are overclocking and it crashes, which likely means the card can't maintain those clocks any more.
Or B - You are at stock and it continues to crash, in which case test it in a known working PC or RMA it.

9of9 · Jun 24, 2019

Valantar said:
What's the VRM cooling like on that card? And perhaps equally importantly, what's the VRM layout like? This is pure speculation, but your issues don't sound like standard thermal throttling, but more like some badly implemented measure to compensate for poorly cooled VRMs. If the VRMs don't have thermal sensors and also have poor cooling, the issue might be that they start overheating and thus delivering unstable voltages when subjected to heavy enough loads over time, causing the GPU to go into some sort of self-protection mode due to unstable voltages (as I believe someone mentioned before). This would match your recorded behavior as it'd be a "one-off" state rather than normal throttling. The question is whether repeating this process can lead to permanent damage either to the VRM (unlikely) or the GPU due to unstable/noisy voltages. I don't know, but it doesn't sound impossible.

That sounds plausible. The card has three fans covering the entirety of it, so I can't really tell what's going on with the VRMs - and I don't fancy taking it apart either, but the behaviour does match. Also, if you look at e.g. the Odyssey capture, whenever I test games I get this pretty consistent behaviour where everything runs okay for ten minutes or so and the graphs are all steady. But then just at the end, everything starts going wibbly, thermal throttling starts to kick in, temperature rises, and eventually it all crashes.
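For anyone who wants to pin down where in a log like this the graphs stop being steady, here's a minimal sketch in Python. It assumes the sensor log is a comma-separated text file with a header row and that the clock column is called `GPU Core Clock [MHz]`; check the header of your own log, since the exact column names may differ. The function names, the window size and the threshold are made up for illustration, not taken from any tool.

```python
import csv
import io
import statistics


def load_column(log_text, column):
    """Read one numeric column from a comma-separated sensor log.

    The column name is an assumption; match it to your log's header row.
    """
    reader = csv.reader(io.StringIO(log_text))
    header = [name.strip() for name in next(reader)]
    index = header.index(column)
    values = []
    for row in reader:
        if len(row) <= index:
            continue  # skip short or empty lines
        try:
            values.append(float(row[index].strip()))
        except ValueError:
            continue  # skip blank or non-numeric cells
    return values


def find_instability_onset(values, window=5, factor=3.0, floor=1.0):
    """Return the index of the sample that first makes a window unsteady.

    The first `window` samples are treated as the steady baseline. A later
    window counts as unsteady when its standard deviation exceeds `factor`
    times the baseline (with `floor` guarding against a zero baseline).
    Returns None if nothing unsteady is found.
    """
    if len(values) < 2 * window:
        return None
    baseline = max(statistics.pstdev(values[:window]), floor)
    for start in range(window, len(values) - window + 1):
        spread = statistics.pstdev(values[start:start + window])
        if spread > factor * baseline:
            return start + window - 1
    return None
```

Run `load_column` on the log's text for the clock column (or a temperature or voltage column), then pass the result to `find_instability_onset`; comparing the onset index across columns would hint at whether the clocks, the voltage or the temperature goes wibbly first. The thresholds are arbitrary and would need tuning against a real log.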

I've tried small underclocks and it still crashes, however running with 'Debug' mode from the NVIDIA Control Panel seems to work - the GPU can run pretty stable at those clock speeds.

Seems like the way forward will be to try and get it RMA'd and underclock it in the meantime.
