
1080 Ti - Degradation of Performance During Neural Network Training

Thanks! Installed that one as well.

After a reboot I ran MSI Afterburner again and it was able to run the test just fine - reports 90% pass.

I've tried it out on a few games and it seems like I'm getting graphics crashes across the board at this point, not just with neural net training :/ Even though the GPU temperature is now far more reasonable... I've attached a GPU-Z log for Assassin's Creed Odyssey. Runs for a few minutes, then crashes hard.
 


What is the Event Viewer log showing?
Have you ever used DDU?
What are your specs? Any OC on the RAM or anything else?
Also try debug mode in the NVIDIA Control Panel, under Help; it will underclock your GPU to 1481 MHz.
 
Will report back once I've found the Event Viewer log. Don't think I've used DDU on this machine, no. This is actually a relatively recent clean install of Windows after suffering some unrecoverable corruption on my hard drive.

CPU is a Core i7 4820K - I had it overclocked at one point, but I don't think anything is OC'd right now.

Debug mode is a good shout - I'll try that. Judging by its behaviour when it switches to limp mode after a crash, it is a lot more stable when underpowered. Maybe a small underclock would ultimately fix this.
 
What's the VRM cooling like on that card? And perhaps equally importantly, what's the VRM layout like? This is pure speculation, but your issues don't sound like standard thermal throttling, but more like some badly implemented measure to compensate for poorly cooled VRMs. If the VRMs don't have thermal sensors and also have poor cooling, the issue might be that they start overheating and thus delivering unstable voltages when subjected to heavy enough loads over time, causing the GPU to go into some sort of self-protection mode due to unstable voltages (as I believe someone mentioned before). This would match your recorded behavior as it'd be a "one-off" state rather than normal throttling. The question is whether repeating this process can lead to permanent damage either to the VRM (unlikely) or the GPU due to unstable/noisy voltages. I don't know, but it doesn't sound impossible.
 
Maybe a small underclock would ultimately fix this.

I'd agree that would be a good thing to try. That said, if the card isn't able to run at the clocks it's supposed to, you should consider RMAing it - in particular, the poor fan curve may have caused long-term damage to the card via overheating.

If you do RMA it, see if you can get a different brand. If that's not possible, apply the improved MSI AB fan curve tweak to the replacement.
 
Either A - You are overclocking and it crashes, which likely means the card can't maintain those clocks any more.
Or B - You are at stock and it continues to crash, in which case test the card in a known working PC or RMA it.
 
What's the VRM cooling like on that card? And perhaps equally importantly, what's the VRM layout like? This is pure speculation, but your issues don't sound like standard thermal throttling, but more like some badly implemented measure to compensate for poorly cooled VRMs. If the VRMs don't have thermal sensors and also have poor cooling, the issue might be that they start overheating and thus delivering unstable voltages when subjected to heavy enough loads over time, causing the GPU to go into some sort of self-protection mode due to unstable voltages (as I believe someone mentioned before). This would match your recorded behavior as it'd be a "one-off" state rather than normal throttling. The question is whether repeating this process can lead to permanent damage either to the VRM (unlikely) or the GPU due to unstable/noisy voltages. I don't know, but it doesn't sound impossible.

That sounds plausible. The card has three fans covering the entirety of it, so I can't really tell what's going on with the VRMs - and I don't fancy taking it apart either, but the behaviour does match. Also, if you look at e.g. the Odyssey capture, whenever I test games I get this pretty consistent behaviour where everything runs okay for ten minutes or so and the graphs are all steady. But then just at the end, everything starts going wibbly, thermal throttling starts to kick in, temperature rises, and eventually it all crashes.
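For anyone wanting to pin down the moment a capture like this stops being steady, here is a rough Python sketch. It assumes the GPU-Z sensor log is a comma-separated text file with a header row, and that the clock column is named "GPU Clock [MHz]"; check the header of your own log, since the column name, the 5% tolerance, and the sample rows below are all illustrative assumptions rather than anything taken from the attached log.

```python
import csv
import io
import statistics


def load_log(text):
    """Parse GPU-Z style CSV text into a list of dicts with trimmed keys and values."""
    reader = csv.DictReader(io.StringIO(text), skipinitialspace=True)
    rows = []
    for row in reader:
        # Drop the empty column that a trailing comma produces, trim padding.
        rows.append({k.strip(): (v or "").strip()
                     for k, v in row.items() if k and k.strip()})
    return rows


def find_instability_start(rows, clock_col, baseline_samples=5, tolerance=0.05):
    """Return the index of the first row whose clock deviates from the
    baseline (median of the first `baseline_samples` rows) by more than
    `tolerance` (as a fraction), or None if the clock stays steady."""
    clocks = [float(r[clock_col]) for r in rows]
    if len(clocks) <= baseline_samples:
        return None
    baseline = statistics.median(clocks[:baseline_samples])
    for i in range(baseline_samples, len(clocks)):
        if abs(clocks[i] - baseline) > tolerance * baseline:
            return i
    return None


# Synthetic sample (NOT real data): steady clocks, then a sudden drop.
SAMPLE_LOG = """Date , GPU Clock [MHz] , GPU Temperature [C] ,
2019-01-01 12:00:00 , 1900.0 , 70.0 ,
2019-01-01 12:00:01 , 1900.0 , 70.0 ,
2019-01-01 12:00:02 , 1899.0 , 71.0 ,
2019-01-01 12:00:03 , 1900.0 , 71.0 ,
2019-01-01 12:00:04 , 1901.0 , 71.0 ,
2019-01-01 12:00:05 , 1900.0 , 72.0 ,
2019-01-01 12:00:06 , 1750.0 , 78.0 ,
2019-01-01 12:00:07 , 1600.0 , 84.0 ,
"""

if __name__ == "__main__":
    rows = load_log(SAMPLE_LOG)
    idx = find_instability_start(rows, "GPU Clock [MHz]")
    if idx is not None:
        print("Clock first leaves the steady band at:", rows[idx]["Date"])
```

Comparing the timestamp this reports against the temperature and PerfCap columns in the same log should show whether the clocks go unstable before the temperature climbs or the other way round, which would help tell a VRM or voltage problem apart from ordinary thermal throttling.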

I've tried small underclocks and it still crashes, however running with 'Debug' mode from the NVIDIA Control Panel seems to work - the GPU can run pretty stable at those clock speeds.

Seems like the way forward will be to try and get it RMA'd and underclock it in the meantime.
 