
1080 Ti - Degradation of Performance During Neural Network Training

9of9

New Member
Joined
Jun 19, 2019
Messages
8 (0.00/day)
Up front: I have an Inno3D GeForce GTX 1080Ti iChill X3 with an ASUS P9X79 Pro motherboard, running Windows 10. I know Inno3D doesn't have the best reputation, but I'm keeping my fingers crossed I don't need to replace the GPU just yet ;)

I've noticed this issue mainly while running neural network training on the GPU, but I think this is affecting games as well - recently, the card has started running slower.

More precisely, what happens is that I'll start up the computer, begin training, and for a few hours it will run at top capacity without any issues. But all of a sudden, it'll drop in productivity and subsequent work done takes almost exactly twice as long as before. Other times, CUDA will error out or the entire computer will crash and reboot. The thing is that this has only started happening recently - I ran the same training algorithm (StyleGAN) with the same image sizes, exact same load on GPU and VRAM, and I never had any of these problems. Something has started messing it up.

Trying to figure out what could be causing this - I ended up running a GPU-Z capture over the course of a day, which I've attached.

The switch over from one mode of operation to the other is really obvious, looking at the capture, but I'm having trouble figuring out what could cause this. As far as I can tell, there's nothing in the software that would cause it to switch operation, and I can't see anything going on in terms of background processes. If I kill the process and then start again, it continues running slow. Only way to reset it is to restart the computer, at which point the cycle continues.

To summarise the sensor log:
* GPU Clock spikes up and down around 1600, until around the 2h40 mark, at which point it caps out at 2000.
* GPU Temperature starts out close to 90'C, but after the breaking point at 2h40, drops down to about 64'C.
* Accordingly Fan Speed % is at 100 at first, but then drops to 70.
* GPU Load is between 90% and 100%, but after 2h40 it rises up to being 95% to 100%.
* Memory Controller Load goes from averaging 55% to 25%.
* Power Consumption drops from average 240W to 160W.
* PerfCap Reason oscillates between 2 and 16 in the first part, but is consistently 4 in the second part. I.e. it goes from power and thermal caps to voltage reliability cap super consistently.
* VDDC switches up from wiggling around ~0.81V to flat 1.04V
* CPU Temp drops from ~56'C to ~52'C
* Memory usage doesn't change.
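For anyone who wants to locate the switch-over in a sensor log programmatically rather than by eye, here is a minimal sketch. It assumes the log has been exported to CSV; the column names `minutes` and `power_w` and the sample values are made up for illustration (they only mirror the 240W to 160W drop described above) and are not the real GPU-Z column headers. The function tries every split point and keeps the one where both halves are flattest.

```python
import csv
import io
from statistics import mean

# Hypothetical, hand-made sample; real GPU-Z logs use different headers.
SAMPLE_LOG = """minutes,power_w
0,238
40,242
80,241
120,239
160,161
200,159
240,160
"""

def find_regime_switch(values):
    """Return the index that best splits values into two flat segments.

    Every split point is tried; the winner minimises the summed squared
    deviation of each segment from its own mean. This is O(n^2), so a
    day-long log should be downsampled first.
    """
    best_index, best_cost = None, float("inf")
    for i in range(1, len(values)):
        left, right = values[:i], values[i:]
        left_mean, right_mean = mean(left), mean(right)
        cost = sum((v - left_mean) ** 2 for v in left)
        cost += sum((v - right_mean) ** 2 for v in right)
        if cost < best_cost:
            best_index, best_cost = i, cost
    return best_index

rows = list(csv.DictReader(io.StringIO(SAMPLE_LOG)))
power = [float(row["power_w"]) for row in rows]
switch_index = find_regime_switch(power)
print("regime switch at minute", rows[switch_index]["minutes"])
```

On the sample data this flags the row at 160 minutes, i.e. the 2h40 mark. The same function can be pointed at any other column (clock, VDDC, temperature) to check whether they all flip at the same sample.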

I can't make heads or tails of this, personally. My initial guess was that the problem was maybe overheating, or some other process kicking in and using up GPU (maybe I've acquired BitCoin mining malware). But these figures make no sense to me. GPU Clock Speed goes up, but GPU Temperature goes down. VDDC goes up, Power Consumption goes down. GPU Load rises, but the GPU seems to be doing less work.
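One way to put numbers on that contradiction is the usual first-order rule that dynamic power scales with voltage squared times clock frequency at equal switching activity. The sketch below plugs in the figures from the summary. It is a rough model only: it ignores leakage, memory power and board losses, and the "implied activity" it prints is an inference from that model, not something GPU-Z reports.

```python
def dynamic_power_ratio(v_before, v_after, f_before, f_after):
    """Ratio of dynamic power (after / before) under P ~ V^2 * f,
    assuming the chip does the same amount of switching per clock."""
    return (v_after / v_before) ** 2 * (f_after / f_before)

# Figures taken from the sensor log summary above.
expected = dynamic_power_ratio(0.81, 1.04, 1600, 2000)
observed = 160 / 240

# Whatever is left over must come from the chip switching less per clock.
implied_activity = observed / expected

print(f"expected power ratio at equal activity: {expected:.2f}")
print(f"observed power ratio:                   {observed:.2f}")
print(f"implied activity relative to before:    {implied_activity:.2f}")
```

Under these assumptions the card should be drawing roughly twice the power after the switch, yet it draws about two thirds. That would be consistent with the silicon sitting idle or stalled for most of each cycle even though the reported GPU Load stays high, which fits the drop in Memory Controller Load.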

Maybe another process kicks in that's demanding GPU time, but with less occupancy? Even that, I'm not sure that it makes sense.

Please let me know if you have any idea what this might be a symptom of! :)
 

Attachments

  • GPU-Z Sensor Log.zip
    746.6 KB · Views: 192

rtwjunkie

PC Gaming Enthusiast
Supporter
Joined
Jul 25, 2008
Messages
13,909 (2.43/day)
Location
Louisiana -Laissez les bons temps rouler!
System Name Bayou Phantom
Processor Core i7-8700k 4.4Ghz @ 1.18v
Motherboard ASRock Z390 Phantom Gaming 6
Cooling All air: 2x140mm Fractal exhaust; 3x 140mm Cougar Intake; Enermax T40F Black CPU cooler
Memory 2x 16GB Mushkin Redline DDR-4 3200
Video Card(s) EVGA RTX 2080 Ti Xc
Storage 1x 500 MX500 SSD; 2x 6TB WD Black; 1x 4TB WD Black; 1x400GB VelRptr; 1x 4TB WD Blue storage (eSATA)
Display(s) HP 27q 27" IPS @ 2560 x 1440
Case Fractal Design Define R4 Black w/Titanium front -windowed
Audio Device(s) Soundblaster Z
Power Supply Seasonic X-850
Mouse Coolermaster Sentinel III (large palm grip!)
Keyboard Logitech G610 Orion mechanical (Cherry Brown switches)
Software Windows 10 Pro 64-bit (Start10 & Fences 3.0 installed)
What is your power supply? Brand, wattage, age?

How much RAM do you have? What are ambient room temps? How is air flow through the case.

Thanks for the good breakdown. These are a few other things which might or might not be pertinent as well.
 
Joined
Jul 11, 2015
Messages
617 (0.19/day)
System Name Harm's Rig's
Processor 5950X /2700x / AMD 8370e 4500
Motherboard ASUS DARK HERO / ASRock B550 Phantom Gaming 4
Cooling Enermax LIQMAX III ARGB 360 AIO/ Zalman cooler fan 110mm
Memory Patriot Viper Steel DDR4 16GB (4x 8GB) 4000M TRIDENT Z F-43600V15D-16GTZ /G.SKILL DDR4
Video Card(s) ZOTAC AMP EXTREME AIRO 4090 / 1080 Ti /290X CFX
Storage SAMSUNG 980 PRO SSD 1TB/ WD DARK 770 2TB , Sabrent NVMe 512GB / 1 SSD 250GB / 1 HHD 3 TB
Display(s) Thermal Grizzly WireView / TCL 646 55 TV / 50 Xfinity Hisense A6 XUMO TV
Case TT 37 VIEW 200MM'S/ NZXT Tempest custom
Audio Device(s) Sharp Aquos
Power Supply FSP Hydro PTM PRO 1200W ATX 3.0 PCI-E GEN-5 80 Plus Platinum - EVGA 1300G2/Corsair w750
Mouse G502
Keyboard G413

90C! That is too high o_O Poor ventilation? What is your room temp?

My Gigabyte 1080 Ti Gaming OC never goes over 60C.
 

Attachments

  • Capture.PNG1080ti temp.PNG
    464.5 KB · Views: 349

9of9

New Member
Joined
Jun 19, 2019
Messages
8 (0.00/day)
Thanks rtwjunkie!

* The PSU is a HX Series™ HX1050 Power Supply — 1050 Watt 80 PLUS® Gold Certified Modular PSU, which is about six years old at this point.
* 16GB RAM.
* Ambient room temps have been warm, but not too warm - around 22'C to 24'C in the past week.
* Airflow should be pretty good, I'm using this case with quite a few large, additional fans.

That's a good call, harm9963 - I'll have to experiment and see if bringing that down some makes any difference.

Can this kind of behaviour happen through overheating? Crashes and reboots, sure, thermal throttling, sure. But altogether switching down to this kind of... different mode of operation? I don't even know what to call it, since I want to say it's slower, but the clockspeed goes up o_O
 

INSTG8R

Vanguard Beta Tester
Joined
Nov 26, 2004
Messages
7,955 (1.13/day)
Location
Canuck in Norway
System Name Hellbox 5.1(same case new guts)
Processor Ryzen 7 5800X3D
Motherboard MSI X570S MAG Torpedo Max
Cooling TT Kandalf L.C.S.(Water/Air)EK Velocity CPU Block/Noctua EK Quantum DDC Pump/Res
Memory 2x16GB Gskill Trident Neo Z 3600 CL16
Video Card(s) Powercolor Hellhound 7900XTX
Storage 970 Evo Plus 500GB 2xSamsung 850 Evo 500GB RAID 0 1TB WD Blue Corsair MP600 Core 2TB
Display(s) Alienware QD-OLED 34” 3440x1440 144hz 10Bit VESA HDR 400
Case TT Kandalf L.C.S.
Audio Device(s) Soundblaster ZX/Logitech Z906 5.1
Power Supply Seasonic TX~’850 Platinum
Mouse G502 Hero
Keyboard G19s
VR HMD Oculus Quest 2
Software Win 10 Pro x64
Yeah with your temps and the subsequent behaviour it definitely sounds like it’s starting to throttle. Improving your load temps would probably solve this.
 
Joined
Jul 11, 2015
Messages
617 (0.19/day)
System Name Harm's Rig's
Processor 5950X /2700x / AMD 8370e 4500
Motherboard ASUS DARK HERO / ASRock B550 Phantom Gaming 4
Cooling Enermax LIQMAX III ARGB 360 AIO/ Zalman cooler fan 110mm
Memory Patriot Viper Steel DDR4 16GB (4x 8GB) 4000M TRIDENT Z F-43600V15D-16GTZ /G.SKILL DDR4
Video Card(s) ZOTAC AMP EXTREME AIRO 4090 / 1080 Ti /290X CFX
Storage SAMSUNG 980 PRO SSD 1TB/ WD DARK 770 2TB , Sabrent NVMe 512GB / 1 SSD 250GB / 1 HHD 3 TB
Display(s) Thermal Grizzly WireView / TCL 646 55 TV / 50 Xfinity Hisense A6 XUMO TV
Case TT 37 VIEW 200MM'S/ NZXT Tempest custom
Audio Device(s) Sharp Aquos
Power Supply FSP Hydro PTM PRO 1200W ATX 3.0 PCI-E GEN-5 80 Plus Platinum - EVGA 1300G2/Corsair w750
Mouse G502
Keyboard G413
Open your case and put your hand close to GPU, you should feel a lot of air moving against your hand.
 

Attachments

  • Capture.PNGfan.PNG
    1.2 MB · Views: 392
Joined
May 4, 2011
Messages
633 (0.13/day)
System Name Smooth-Operator
Processor AMD Ryzen 7 3800x
Motherboard Asrock x570 Taichi
Cooling AMD Wraith Prism
Memory 2x16GB 3200MHz CL16@CL14 DDR4
Video Card(s) Sapphire Radeon RX 580 8GB NITRO+
Storage 2x4TB WD HGST 7K6 7200RPM 256MB
Display(s) Samsung S24E370DL 24" IPS Freesync 75Hz
Case Fractal Design Focus G Window Blue
Audio Device(s) Creative X-Fi Titanium PCIe x1
Power Supply Corsair HX850 80+ Platinum
Mouse Gigabyte Aorus M3
Keyboard Zalman ZM-K300M
Software Windows 10 x64 Enterprise/Ubuntu Budgie amd64
Yeah with your temps and the subsequent behaviour it definitely sounds like it’s starting to throttle. Improving your load temps would probably solve this.
I partially agree with this, but I think that if the card were going to throttle it would do so sooner, not after a few hours. Machine learning training is an intensive task which I would assume keeps GPU load constantly around 99-100%, so any thermal issues would appear within minutes.

There are also other things to consider, for example the data samples and the processing done on them, and what the neural network is doing from the moment performance drops. Maybe performance drops because the operations are getting more complicated and the GPU simply doesn't have enough performance to keep up. I'm not deep into ML, so these are just the thoughts of someone who knows only the absolute basics, a tiny bit beyond the definition of ML.

On the other hand, we have similar problems in gaming. In that kind of task GPU performance is not constant either, and it drops over time if you play for several hours in one sitting, which has been tested and documented pretty well across the internet.
 
Joined
Oct 21, 2006
Messages
621 (0.10/day)
Location
Oak Ridge, TN
System Name BorgX79
Processor i7-3930k 6/12cores@4.4GHz
Motherboard Sabertoothx79
Cooling Capitan 360
Memory Muhskin DDR3-1866
Video Card(s) Sapphire R480 8GB
Storage Chronos SSD
Display(s) 3x VW266H
Case Ching Mien 600
Audio Device(s) Realtek
Power Supply Cooler Master 1000W Silent Pro
Mouse Logitech G900
Keyboard Rosewill RK-1000
Software Win7x64
If the voltage goes high, and power goes down, I'd bet a power supply chip is locking up or otherwise losing its shit, and the only reaction the board has is to go to a 'limp home' type mode to preserve itself.

I'd work on lowering the temperature first, and see if that extends the "ON" time; if it affects it, that's likely the deal.

Using an IR temperature measuring device can help identify the part that's freaking out; it will cool considerably when it freaks.

Adding a small heatsink to said part might help. The fact that it takes hours makes me think it might not be heatsinked.

If it's a chip that's not being reported over the smbus, the 'throttle monitor' might not see it, and it may just be one of those things that someone said "this will never happen, so we don't need to monitor it, but if it does this, we change to this run profile to keep it from catching on fire".

:)

Change the environment; if the operation time changes, it's heat; if it doesn't, it's software.
 

INSTG8R

Vanguard Beta Tester
Joined
Nov 26, 2004
Messages
7,955 (1.13/day)
Location
Canuck in Norway
System Name Hellbox 5.1(same case new guts)
Processor Ryzen 7 5800X3D
Motherboard MSI X570S MAG Torpedo Max
Cooling TT Kandalf L.C.S.(Water/Air)EK Velocity CPU Block/Noctua EK Quantum DDC Pump/Res
Memory 2x16GB Gskill Trident Neo Z 3600 CL16
Video Card(s) Powercolor Hellhound 7900XTX
Storage 970 Evo Plus 500GB 2xSamsung 850 Evo 500GB RAID 0 1TB WD Blue Corsair MP600 Core 2TB
Display(s) Alienware QD-OLED 34” 3440x1440 144hz 10Bit VESA HDR 400
Case TT Kandalf L.C.S.
Audio Device(s) Soundblaster ZX/Logitech Z906 5.1
Power Supply Seasonic TX~’850 Platinum
Mouse G502 Hero
Keyboard G19s
VR HMD Oculus Quest 2
Software Win 10 Pro x64
I partially agree on this but i think if the card would start to throttle it would do it sooner, not after few hours. Machine learning training is an intensive task which i would assume makes gpu load to be constantly around 99-100% so any thermal issues would appear within minutes.

There are also other things to consider, for example, data samples and processing done on it, what neural network does since the moment performance drops. Maybe performance drops because operations are getting more complicated and gpu simply doesn't have enough performance to keep up. I'm not deep into ML so these are just my thoughts of person who knows just absolute basics on level a tiny bit higher than just definition of ML.

On the other hand, in gaming we have similar problems. In this kind of tasks gpu also are not constant and performance drops over time if you are playing for several hours at one sitting which was tested and documented over internet pretty well.
I totally agree, but high temps, voltage drops and utilization fluctuations definitely point to the card “backing down”. The timing is odd, but all the symptoms are there. Because it’s a genuinely odd case, though, there may be more to it. I know my Vega runs different boosts and temperatures across different games despite 100% utilization in all cases. As you said, this is definitely a heavy utilization scenario, so only the odd timing is the question mark.
 

9of9

New Member
Joined
Jun 19, 2019
Messages
8 (0.00/day)
Thanks guys! So much good insight :)

There are also other things to consider, for example, data samples and processing done on it, what neural network does since the moment performance drops. Maybe performance drops because operations are getting more complicated and gpu simply doesn't have enough performance to keep up. I'm not deep into ML so these are just my thoughts of person who knows just absolute basics on level a tiny bit higher than just definition of ML.

There is an element of stepping up to higher resolutions when training StyleGAN, yes. I'm not sure whether the GPU load increases as such, but what does happen is that the size of the image batches gets reduced at the very least, so as to remain inside the VRAM budget. However, I can see from the logs when those step downs happen and in this case I know I'm comparing like-for-like - it's just crunching through thousands of the same operations over and over again, the GPU usage is super predictable. Moreover, I've got the logs from the previous time I trained the same architecture with the same image sizes and parameters, so I can compare timings for each tick between how fast that training was taking at the same stage, and how fast this training is going. Before the GPU weirdness happens, timings are pretty much identical between themselves, and compared to the older logs. After the weirdness, everything slows by almost exactly one half.
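For reference, the like-for-like comparison described here can be sketched as below. The per-tick timings are invented placeholder values, not figures from the actual StyleGAN logs, and `slowdown_factor` is a hypothetical helper name. It takes the median of the per-tick ratios so that a single odd tick does not skew the result.

```python
from statistics import median

def slowdown_factor(baseline_secs, current_secs):
    """Median ratio of current to baseline time, tick by tick.

    Both lists must cover the same training stage in the same order.
    """
    ratios = [cur / base for base, cur in zip(baseline_secs, current_secs)]
    return median(ratios)

# Hypothetical seconds-per-tick values for illustration only.
old_run = [100.0, 101.0, 99.0, 100.0]
new_run_before_switch = [100.5, 100.0, 99.5, 101.0]
new_run_after_switch = [199.0, 202.0, 198.0, 201.0]

before = slowdown_factor(old_run, new_run_before_switch)
after = slowdown_factor(old_run, new_run_after_switch)
print(f"before the switch: {before:.2f}x the old run's tick time")
print(f"after the switch:  {after:.2f}x the old run's tick time")
```

A factor that sits close to 1.0 before the switch and close to 2.0 after it is what "takes almost exactly twice as long" looks like in numbers. A clean integer factor like that hints at a discrete state change rather than gradual throttling, though that is only a reading of the pattern.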

partially agree on this but i think if the card would start to throttle it would do it sooner, not after few hours. Machine learning training is an intensive task which i would assume makes gpu load to be constantly around 99-100% so any thermal issues would appear within minutes.

If heat is the problem, then what might be happening is a problem with air flow through the room, rather than through the case. If the PC is generating more heat than the room's ventilation is dissipating, perhaps those several hours - plus the effect of time of day - is enough to raise the ambient temperature to a point where it becomes a problem.

If the voltage goes high, and power goes down, I'd bet a power supply chip is locking up or otherwise losing it's shit, and the only reaction the board has is to go to a 'limp home' type mode to preserve itself.

That's an interesting theory. Do you mean a chip on the card, or on the mobo? Or either one?

I've opened up the case, opened windows, air-sprayed some of the dust/cobwebs away, checked that all three fans spin up correctly under load. It seems like it's all in decent working order, but operating at its peak it still levels out at about 90'C... maybe a little closer to 89'C now that the room has cooled. Since it's late on this end, I'll leave it on overnight with the improved airflow and see if that makes any difference in the long term, even while temperature remains high.

Some thoughts before I leave it overnight:
* 1080 Tis do start thermal throttling at 91'C from what I can tell, so it makes sense to me that it would level out at 90'C - 91'C. But this level of throttling seems weird on its own and it never gets anywhere close to exceeding even 100'C, let alone the 105'C cap for a thermal shutdown.
* The fact that opening the case up hasn't made any big difference to the operating temperature is a little concerning, but I can't spot any physical faults with the cooling on the card itself. It doesn't appear congested, and all three fans are running well - the airflow is there. I don't have logs of temperature from when it was working correctly, but I wonder if that might just be due to higher GPU utilisation under the neural network training load, which pushes it up toward the thermal limit because of the type of work it's doing. I should see what the temperatures look like while running a demanding game tomorrow for comparison - if that reaches the 90s, then it does suggest a cooling issue I guess.
* Would it be an option to underclock the card, or reduce the limit for thermal throttling, to see if that improves stability?
* A few sites like this mention that the max RPM of the fans is 1600, but at full load mine is actually hitting 1700 RPM most of the time. I wonder if that's accurate, and if there are any sensor readings for the RPM of the three individual fans.
* What's up with the GPU clock going up when this happens? The fact that clock is higher, VDDC is higher, GPU load is the same but GPU temp goes down and computation is slower really doesn't make any sense in my head. Surely higher clock speeds and more voltage should result in higher temperatures and more compute?
 

Solaris17

Super Dainty Moderator
Staff member
Joined
Aug 16, 2005
Messages
25,774 (3.79/day)
Location
Alabama
System Name Rocinante
Processor I9 14900KS
Motherboard EVGA z690 Dark KINGPIN (modded BIOS)
Cooling EK-AIO Elite 360 D-RGB
Memory 64GB Gskill Trident Z5 DDR5 6000 @6400
Video Card(s) MSI SUPRIM Liquid X 4090
Storage 1x 500GB 980 Pro | 1x 1TB 980 Pro | 1x 8TB Corsair MP400
Display(s) Odyssey OLED G9 G95SC
Case Lian Li o11 Evo Dynamic White
Audio Device(s) Moondrop S8's on Schiit Hel 2e
Power Supply Bequiet! Power Pro 12 1500w
Mouse Lamzu Atlantis mini (White)
Keyboard Monsgeek M3 Lavender, Akko Crystal Blues
VR HMD Quest 3
Software Windows 11
Benchmark Scores I dont have time for that.
Have you looked at the actual CPU/RAM consumption of the machine?

Have you tried different drivers? like older ones? or the studio drivers?
 

Mussels

Freshwater Moderator
Staff member
Joined
Oct 6, 2004
Messages
58,413 (8.21/day)
Location
Oystralia
System Name Rainbow Sparkles (Power efficient, <350W gaming load)
Processor Ryzen R7 5800x3D (Undervolted, 4.45GHz all core)
Motherboard Asus x570-F (BIOS Modded)
Cooling Alphacool Apex UV - Alphacool Eisblock XPX Aurora + EK Quantum ARGB 3090 w/ active backplate
Memory 2x32GB DDR4 3600 Corsair Vengeance RGB @3866 C18-22-22-22-42 TRFC704 (1.4V Hynix MJR - SoC 1.15V)
Video Card(s) Galax RTX 3090 SG 24GB: Underclocked to 1700Mhz 0.750v (375W down to 250W))
Storage 2TB WD SN850 NVME + 1TB Sasmsung 970 Pro NVME + 1TB Intel 6000P NVME USB 3.2
Display(s) Phillips 32 32M1N5800A (4k144), LG 32" (4K60) | Gigabyte G32QC (2k165) | Phillips 328m6fjrmb (2K144)
Case Fractal Design R6
Audio Device(s) Logitech G560 | Corsair Void pro RGB |Blue Yeti mic
Power Supply Fractal Ion+ 2 860W (Platinum) (This thing is God-tier. Silent and TINY)
Mouse Logitech G Pro wireless + Steelseries Prisma XL
Keyboard Razer Huntsman TE ( Sexy white keycaps)
VR HMD Oculus Rift S + Quest 2
Software Windows 11 pro x64 (Yes, it's genuinely a good OS) OpenRGB - ditch the branded bloatware!
Benchmark Scores Nyooom.
When Nvidia drivers crash due to an unstable overclock, the driver resets and locks the card into a lower performance mode until you reboot.
This can also occur due to unstable CPU and RAM. My bet is the GPU heat; repaste it.
 
Joined
May 4, 2011
Messages
633 (0.13/day)
System Name Smooth-Operator
Processor AMD Ryzen 7 3800x
Motherboard Asrock x570 Taichi
Cooling AMD Wraith Prism
Memory 2x16GB 3200MHz CL16@CL14 DDR4
Video Card(s) Sapphire Radeon RX 580 8GB NITRO+
Storage 2x4TB WD HGST 7K6 7200RPM 256MB
Display(s) Samsung S24E370DL 24" IPS Freesync 75Hz
Case Fractal Design Focus G Window Blue
Audio Device(s) Creative X-Fi Titanium PCIe x1
Power Supply Corsair HX850 80+ Platinum
Mouse Gigabyte Aorus M3
Keyboard Zalman ZM-K300M
Software Windows 10 x64 Enterprise/Ubuntu Budgie amd64
When Nvidia drivers crash due to an unstable overclock, the driver resets and locks the card into a lower performance mode until you reboot.
This can also occur due to unstable CPU and RAM. My bet is the GPU heat; repaste it.
In that case it would be worth checking Event Viewer for driver crashes, which should be logged there.
 
Joined
Jul 11, 2015
Messages
617 (0.19/day)
System Name Harm's Rig's
Processor 5950X /2700x / AMD 8370e 4500
Motherboard ASUS DARK HERO / ASRock B550 Phantom Gaming 4
Cooling Enermax LIQMAX III ARGB 360 AIO/ Zalman cooler fan 110mm
Memory Patriot Viper Steel DDR4 16GB (4x 8GB) 4000M TRIDENT Z F-43600V15D-16GTZ /G.SKILL DDR4
Video Card(s) ZOTAC AMP EXTREME AIRO 4090 / 1080 Ti /290X CFX
Storage SAMSUNG 980 PRO SSD 1TB/ WD DARK 770 2TB , Sabrent NVMe 512GB / 1 SSD 250GB / 1 HHD 3 TB
Display(s) Thermal Grizzly WireView / TCL 646 55 TV / 50 Xfinity Hisense A6 XUMO TV
Case TT 37 VIEW 200MM'S/ NZXT Tempest custom
Audio Device(s) Sharp Aquos
Power Supply FSP Hydro PTM PRO 1200W ATX 3.0 PCI-E GEN-5 80 Plus Platinum - EVGA 1300G2/Corsair w750
Mouse G502
Keyboard G413
1700 RPM is low; try a third-party tool such as MSI Afterburner.
 

Attachments

  • Capture.PNG3fans.PNG
    150.6 KB · Views: 366
  • Capture.PNGCUSTOM.PNG
    419.4 KB · Views: 361
Joined
Nov 1, 2008
Messages
4,213 (0.75/day)
Location
Vietnam
System Name Gaming System / HTPC-Server
Processor i7 8700K (@4.8 Ghz All-Core) / R7 5900X
Motherboard Z370 Aorus Ultra Gaming / MSI B450 Mortar Max
Cooling CM ML360 / CM ML240L
Memory 16Gb Hynix @3200 MHz / 16Gb Hynix @3000Mhz
Video Card(s) Zotac 3080 / Colorful 1060
Storage 750G MX300 + 2x500G NVMe / 40Tb Reds + 1Tb WD Blue NVMe
Display(s) LG 27GN800-B 27'' 2K 144Hz / Sony TV
Case Xigmatek Aquarius Plus / Corsair Air 240
Audio Device(s) On Board Realtek
Power Supply Super Flower Leadex III Gold 750W / Andyson TX-700 Platinum
Mouse Logitech G502 Hero / K400+
Keyboard Wooting Two / K400+
Software Windows 10 x64
Benchmark Scores Cinebench R15 = 1542 3D Mark Timespy = 9758
Running it hard and hot for a long time may have degraded the TIM. I'd try replacing it with new compound to try to get those temps down. Even at full load, a 1080Ti shouldn't be reaching 90C. Also check the quality of the thermal pads that should be sitting on/above the memory.
I was getting into the 80's when using mine to mine crypto and that was with a second card in my PC restricting airflow.

Right now, it runs a lot cooler, even at full gaming load.
 

the54thvoid

Intoxicated Moderator
Staff member
Joined
Dec 14, 2009
Messages
12,378 (2.37/day)
Location
Glasgow - home of formal profanity
Processor Ryzen 7800X3D
Motherboard MSI MAG Mortar B650 (wifi)
Cooling be quiet! Dark Rock Pro 4
Memory 32GB Kingston Fury
Video Card(s) Gainward RTX4070ti
Storage Seagate FireCuda 530 M.2 1TB / Samsumg 960 Pro M.2 512Gb
Display(s) LG 32" 165Hz 1440p GSYNC
Case Asus Prime AP201
Audio Device(s) On Board
Power Supply be quiet! Pure POwer M12 850w Gold (ATX3.0)
Software W10
Is there a chance at all that the card thinks the workload is like a power virus? That would allow high clocks but heavy throttling on the power limit.

Other than that, as others have said, 90 degrees isn't Pascal's best temp. Cards actually start thermal throttling above about 50 degrees.
 
Joined
Feb 3, 2017
Messages
3,475 (1.33/day)
Processor R5 5600X
Motherboard ASUS ROG STRIX B550-I GAMING
Cooling Alpenföhn Black Ridge
Memory 2*16GB DDR4-2666 VLP @3800
Video Card(s) EVGA Geforce RTX 3080 XC3
Storage 1TB Samsung 970 Pro, 2TB Intel 660p
Display(s) ASUS PG279Q, Eizo EV2736W
Case Dan Cases A4-SFX
Power Supply Corsair SF600
Mouse Corsair Ironclaw Wireless RGB
Keyboard Corsair K60
VR HMD HTC Vive
Voltage reliability? With the GPU at 90C, what is the VRM doing? Could it simply be the VRM overheating?
The rest of the changes make sense, but GPU voltage going up and GPU clocks going from ~1600 to 2000 seems quite strange and would hint at some software/driver change kicking in.

Definitely try forcing fans at 100% and see if that changes anything.
I would also try lower the power limit - to around 200W maybe - just for testing to see what happens.
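As a sketch of how the power-limit test could be scripted, the snippet below only builds and prints `nvidia-smi` command lines; it does not run them. The `-i`, `-pl`, `--query-gpu`, `--format=csv` and `-l` options are standard `nvidia-smi` flags, but the helper names are made up here, and whether the driver accepts a 200W limit on this particular card under Windows (it normally needs an elevated prompt) is an assumption to verify.

```python
def power_limit_command(watts, gpu_index=0):
    """Build the nvidia-smi call that caps board power for one GPU."""
    return ["nvidia-smi", "-i", str(gpu_index), "-pl", str(watts)]

def sensor_log_command(interval_s=5):
    """Build an nvidia-smi call that prints a CSV sensor row every few seconds."""
    fields = "timestamp,clocks.sm,temperature.gpu,power.draw,fan.speed"
    return [
        "nvidia-smi",
        f"--query-gpu={fields}",
        "--format=csv",
        "-l", str(interval_s),
    ]

# Print the commands so they can be pasted into an elevated prompt.
print(" ".join(power_limit_command(200)))
print(" ".join(sensor_log_command()))
```

Running the logging command alongside the training job gives a second sensor trace to compare with the GPU-Z capture, which helps show whether the 2h40 switch still happens with the limit in place.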
 
Joined
Jan 8, 2017
Messages
8,863 (3.36/day)
System Name Good enough
Processor AMD Ryzen R9 7900 - Alphacool Eisblock XPX Aurora Edge
Motherboard ASRock B650 Pro RS
Cooling 2x 360mm NexXxoS ST30 X-Flow, 1x 360mm NexXxoS ST30, 1x 240mm NexXxoS ST30
Memory 32GB - FURY Beast RGB 5600 Mhz
Video Card(s) Sapphire RX 7900 XT - Alphacool Eisblock Aurora
Storage 1x Kingston KC3000 1TB 1x Kingston A2000 1TB, 1x Samsung 850 EVO 250GB , 1x Samsung 860 EVO 500GB
Display(s) LG UltraGear 32GN650-B + 4K Samsung TV
Case Phanteks NV7
Power Supply GPS-750C
To summarise the sensor log:
* GPU Clock spikes up and down around 1600, until around the 2h40 mark, at which point it caps out at 2000.
* GPU Temperature starts out close to 90'C, but after the breaking point at 2h40, drops down to about 64'C.
* Accordingly Fan Speed % is at 100 at first, but then drops to 70.
* GPU Load is between 90% and 100%, but after 2h40 it rises up to being 95% to 100%.
* Memory Controller Load goes from averaging 55% to 25%.
* Power Consumption drops from average 240W to 160W.
* PerfCap Reason oscillates between 2 and 16 in the first part, but is consistently 4 in the second part. I.e. it goes from power and thermal caps to voltage reliability cap super consistently.
* VDDC switches up from wiggling around ~0.81V to flat 1.04V
* CPU Temp drops from ~56'C to ~52'C
* Memory usage doesn't change.

All of this points to thermal problems: the card hits the 90C wall, at which point it underclocks severely -> power and temperature go down.

The cooling is screwed up, there is no question about it. Check the case ventilation and/or take the card apart and reapply TIM. 1700 RPM isn't really low for a triple fan cooler of this caliber and should keep the card nowhere near 90C.

There are also other things to consider: for example, the data samples and the processing done on them, and what the neural network is doing from the moment performance drops. Maybe performance drops because the operations are getting more complicated and the GPU simply doesn't have enough performance to keep up. I'm not deep into ML, so these are just the thoughts of someone who knows only the absolute basics, a tiny bit beyond the definition of ML.

Compute workloads usually result in less heat output because the silicon that's dedicated for the fixed function graphics stuff isn't in use.
 
Joined
Jul 18, 2007
Messages
2,693 (0.44/day)
System Name panda
Processor 6700k
Motherboard sabertooth s
Cooling raystorm block<black ice stealth 240 rad<ek dcc 18w 140 xres
Memory 32gb ripjaw v
Video Card(s) 290x gamer<ntzx g10<antec 920
Storage 950 pro 250gb boot 850 evo pr0n
Display(s) QX2710LED@110hz lg 27ud68p
Case 540 Air
Audio Device(s) nope
Power Supply 750w superflower
Mouse g502
Keyboard shine 3 with grey, black and red caps
Software win 10
Benchmark Scores http://hwbot.org/user/marsey99/
How hot is the air coming out of the PSU?
 
Joined
Sep 17, 2014
Messages
20,780 (5.97/day)
Location
The Washing Machine
Processor i7 8700k 4.6Ghz @ 1.24V
Motherboard AsRock Fatal1ty K6 Z370
Cooling beQuiet! Dark Rock Pro 3
Memory 16GB Corsair Vengeance LPX 3200/C16
Video Card(s) ASRock RX7900XT Phantom Gaming
Storage Samsung 850 EVO 1TB + Samsung 830 256GB + Crucial BX100 250GB + Toshiba 1TB HDD
Display(s) Gigabyte G34QWC (3440x1440)
Case Fractal Design Define R5
Audio Device(s) Harman Kardon AVR137 + 2.1
Power Supply EVGA Supernova G2 750W
Mouse XTRFY M42
Keyboard Lenovo Thinkpad Trackpoint II
Software W10 x64
Thanks guys! So much good insight :)



There is an element of stepping up to higher resolutions when training StyleGAN, yes. I'm not sure whether the GPU load increases as such, but what does happen is that the size of the image batches gets reduced at the very least, so as to remain inside the VRAM budget. However, I can see from the logs when those step downs happen and in this case I know I'm comparing like-for-like - it's just crunching through thousands of the same operations over and over again, the GPU usage is super predictable. Moreover, I've got the logs from the previous time I trained the same architecture with the same image sizes and parameters, so I can compare timings for each tick between how fast that training was taking at the same stage, and how fast this training is going. Before the GPU weirdness happens, timings are pretty much identical between themselves, and compared to the older logs. After the weirdness, everything slows by almost exactly one half.
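The like-for-like comparison described above could be sketched roughly as follows. This is only a minimal illustration, assuming the per-tick timings from both runs have already been pulled out of the training logs into two plain lists; the function name `find_slowdown`, the threshold, the window size and all the numbers are hypothetical and not taken from the actual logs.

```python
from statistics import median

def find_slowdown(baseline_secs, current_secs, threshold=1.5, window=3):
    """Return (tick_index, typical_ratio) for the first tick where the current
    run stays at least `threshold` times slower than the baseline for `window`
    ticks in a row, or None if that never happens."""
    n = min(len(baseline_secs), len(current_secs))
    ratios = [current_secs[i] / baseline_secs[i] for i in range(n)]
    for start in range(n - window + 1):
        if all(r >= threshold for r in ratios[start:start + window]):
            # Median of everything after the switch is robust to a stray tick.
            return start, median(ratios[start:])
    return None

# Made-up seconds-per-tick values, for illustration only.
old_run = [300.0, 301.0, 299.0, 300.0, 302.0, 300.0, 301.0, 300.0]
new_run = [301.0, 300.0, 300.0, 299.0, 598.0, 603.0, 600.0, 601.0]

result = find_slowdown(old_run, new_run)
if result is not None:
    tick, ratio = result
    print(f"slowdown starts at tick {tick}, about {ratio:.2f}x slower")
```

Requiring several slow ticks in a row is a deliberate choice here, so that a single slow tick (a snapshot being saved, say) is not mistaken for the mode switch.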



If heat is the problem, then what might be happening is a problem with air flow through the room, rather than through the case. If the PC is generating more heat than the room's ventilation is dissipating, perhaps those several hours - plus the effect of time of day - is enough to raise the ambient temperature to a point where it becomes a problem.



That's an interesting theory. Do you mean a chip on the card, or on the mobo? Or either one?

I've opened up the case, opened windows, air-sprayed some of the dust/cobwebs away, checked that all three fans spin up correctly under load. It seems like it's all in decent working order, but operating at its peak it still levels out at about 90'C... maybe a little closer to 89'C now that the room has cooled. Since it's late on this end, I'll leave it on overnight with the improved airflow and see if that makes any difference in the long term, even while temperature remains high.

Some thoughts before I leave it overnight:
* 1080 Tis do start thermal throttling at 91'C from what I can tell, so it makes sense to me that it would level out at 90'C - 91'C. But this level of throttling seems weird on its own and it never gets anywhere close to exceeding even 100'C, let alone the 105'C cap for a thermal shutdown.
* The fact that opening the case up hasn't made any big difference to the operating temperature is a little concerning, but I can't spot any physical faults with the cooling on the card itself. It doesn't appear congested, and all three fans are running well - the airflow is there. I don't have logs of temperature from when it was working correctly, but I wonder if that might just be due to higher GPU utilisation under the neural network training load, which pushes it up toward the thermal limit because of the type of work it's doing. I should see what the temperatures look like while running a demanding game tomorrow for comparison - if that reaches the 90s, then it does suggest a cooling issue I guess.
* Would it be an option to underclock the card, or reduce the limit for thermal throttling, to see if that improves stability?
* A few sites like this mention that the max RPM of the fans is 1600, but at full load mine is actually hitting 1700 RPM most of the time. I wonder if that's accurate, and if there are any sensor readings for the RPM of the three individual fans.
* What's up with the GPU clock going up when this happens? The fact that clock is higher, VDDC is higher, GPU load is the same but GPU temp goes down and computation is slower really doesn't make any sense in my head. Surely higher clock speeds and more voltage should result in higher temperatures and more compute?
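One way to put numbers on the before/after contrast in the sensor log is to split it at the point where PerfCap Reason settles on a single value and average each side. The sketch below assumes the log has already been parsed into a list of dicts; the key names (`perfcap`, `clock`, `temp`, `power`, `vddc`) are hypothetical stand-ins rather than the real GPU-Z column headers, and the sample rows are invented to loosely resemble the summary earlier in the thread.

```python
from statistics import mean

def split_at_steady_perfcap(rows, key="perfcap", steady_value=4, min_run=5):
    """Return the index where `key` first equals `steady_value` for `min_run`
    consecutive samples, or None if it never settles on that value."""
    run = 0
    for i, row in enumerate(rows):
        run = run + 1 if row[key] == steady_value else 0
        if run == min_run:
            return i - min_run + 1
    return None

def summarise(rows, fields):
    """Average each of `fields` over `rows`, rounded to two decimals."""
    return {f: round(mean(r[f] for r in rows), 2) for f in fields}

# Invented samples: PerfCap flips between 2 and 16, then sits at 4.
before = [{"perfcap": 2 if i % 2 == 0 else 16, "clock": 1600, "temp": 90,
           "power": 240, "vddc": 0.81} for i in range(6)]
after = [{"perfcap": 4, "clock": 2000, "temp": 64,
          "power": 160, "vddc": 1.04} for _ in range(6)]
rows = before + after

fields = ["clock", "temp", "power", "vddc"]
cut = split_at_steady_perfcap(rows)
if cut is not None:
    summary_before = summarise(rows[:cut], fields)
    summary_after = summarise(rows[cut:], fields)
    print("before:", summary_before)
    print("after: ", summary_after)
```

Lined up with the training timings, averages like these would at least show whether the slowdown and the PerfCap change start at the same moment.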

Few things:

- Did you ever flex the cooling solution OR apply pressure to it while it was installed? You will notice that if you 'push' or squeeze the back end of the card while it's running, temps will probably skyrocket to 90C+. There are sensors around that area that will report differently at that point. This would explain the erratic behavior if the card has a bend or sag. Ironically, a badly placed anti-sag support for a GPU can have the same effect ;) This is a pretty rare thing, but reading the topic I think this is the sort of odd stuff you should look for. It also possibly explains why things stabilize later on: when everything is warmed up, some minor expansion happens and the problem goes away.

- It's clear the card can run at max load without problems, because your sensor log shows normal behavior when it stops switching power states. You could try forcing a power state OR go into NVCP and play around with the power management options; Prefer max power isn't always the best solution, for example, because Pascal is temperature limited. There might also be a weird conflict between the chosen power state, the Windows power plan and the GPU BIOS. But that does not explain too well why the power state does get stable later on.

- Are you running at stock power target/core volts?

Voltage reliability? With GPU at 90C, what does VRM do, could it simply be VRM overheating?

That seems most likely at this point, but it still does not explain too well why things stabilize later on.
 
Joined
Jul 11, 2015
Messages
617 (0.19/day)
System Name Harm's Rig's
Processor 5950X /2700x / AMD 8370e 4500
Motherboard ASUS DARK HERO / ASRock B550 Phantom Gaming 4
Cooling Enermax LIQMAX III ARGB 360 AIO/ Zalman cooler fan 110mm
Memory Patriot Viper Steel DDR4 16GB (4x 8GB) 4000M TRIDENT Z F-43600V15D-16GTZ /G.SKILL DDR4
Video Card(s) ZOTAC AMP EXTREME AIRO 4090 / 1080 Ti /290X CFX
Storage SAMSUNG 980 PRO SSD 1TB/ WD DARK 770 2TB , Sabrent NVMe 512GB / 1 SSD 250GB / 1 HHD 3 TB
Display(s) Thermal Grizzly WireView / TCL 646 55 TV / 50 Xfinity Hisense A6 XUMO TV
Case TT 37 VIEW 200MM'S/ NZXT Tempest custom
Audio Device(s) Sharp Aquos
Power Supply FSP Hydro PTM PRO 1200W ATX 3.0 PCI-E GEN-5 80 Plus Platinum - EVGA 1300G2/Corsair w750
Mouse G502
Keyboard G413
Can you run a short GPU-bound benchmark, like 2 or 3 min, log it, and compare it to my log?
log benchmark 4k optimize .
 

Attachments

  • GPU-Z Sensor Log.txt
    79.4 KB · Views: 293

9of9

New Member
Joined
Jun 19, 2019
Messages
8 (0.00/day)
1700 rpm is low, try a third party, MSI AFB

So, I'm still testing the results, but I think you've hit the nail on the head - at least as far as my thermal situation goes. Tweaking the fan curve in MSI Afterburner has made a huge difference - the way it was set up:
[Attachment 125390: screenshot of the original fan curve]

meant that literally the moment the temperature dropped below 90'C, fan speed would drop sharply as well. Adjusting to a custom curve closer to yours has made a massive difference - the GPU at full load runs around 65'C now.
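To illustrate why the shape of the curve matters so much, here is a rough sketch of how a piecewise-linear fan curve maps temperature to fan speed. Both curves are invented for illustration and are not read from the attached screenshots; `fan_speed`, `steep_curve` and `gentle_curve` are hypothetical names, and the real Afterburner curve may well behave differently (hysteresis, update period and so on).

```python
def fan_speed(temp_c, curve):
    """Piecewise-linear fan curve. `curve` is a list of (temp_C, fan_percent)
    points sorted by temperature; values outside the range are clamped."""
    if temp_c <= curve[0][0]:
        return curve[0][1]
    for (t0, s0), (t1, s1) in zip(curve, curve[1:]):
        if temp_c <= t1:
            # Linear interpolation between the two neighbouring points.
            return s0 + (s1 - s0) * (temp_c - t0) / (t1 - t0)
    return curve[-1][1]

# Invented curves, for illustration only.
steep_curve = [(30, 30), (85, 40), (90, 100)]             # ramps only right at the limit
gentle_curve = [(30, 30), (50, 50), (65, 80), (75, 100)]  # ramps early

for t in (60, 70, 80, 89):
    print(t, round(fan_speed(t, steep_curve)), round(fan_speed(t, gentle_curve)))
```

With the steep curve the fans sit near 40% until the card is almost at 90'C, so the temperature hovers at the throttle point; the gentle curve reaches full speed long before that.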

Opening the case up and lowering room temperature overnight didn't make any difference - GPU temp stayed at 90'C. But I set up the new fan profile in the morning and after that it seemed to merrily keep working away at 65'C all day without any issues.

That said, I'm not entirely sure I'm in the clear yet. Putting it through its paces this evening, I've had a couple of driver crashes, each after running the training for just ten minutes or so - even without hitting the thermal cap. It's worth pointing out that this past week I've had two different things happen when I run training. Sometimes I'd get what looks like a driver crash some minutes into training, which would cause the training to error out and need to be manually resumed. In that case I'd normally keep trying, sometimes needing to reboot, until it would run without crashing... at which point I'd normally get this other problem, where after a few hours it starts to go slower.

I'm not sure whether or how these issues are linked, though. I've gotten one good run for about nine hours without any issues at all, which is very promising. But subsequently, it has been crashy. If the other issue doesn't return though, I guess I'll classify them as different problems and consider this case closed for now :)

Can you run a short GPU-bound benchmark, like 2 or 3 min, log it, and compare it to my log?

Attached. It looks like I can probably make my fan speed a bit more aggressive still, compared to yours, but I think over longer-term usage the GPU temperature does cap out lower for me now.
 

Attachments

  • GPU-Z Benchmark.txt
    231.4 KB · Views: 278
Joined
Jul 11, 2015
Messages
617 (0.19/day)
System Name Harm's Rig's
Processor 5950X /2700x / AMD 8370e 4500
Motherboard ASUS DARK HERO / ASRock B550 Phantom Gaming 4
Cooling Enermax LIQMAX III ARGB 360 AIO/ Zalman cooler fan 110mm
Memory Patriot Viper Steel DDR4 16GB (4x 8GB) 4000M TRIDENT Z F-43600V15D-16GTZ /G.SKILL DDR4
Video Card(s) ZOTAC AMP EXTREME AIRO 4090 / 1080 Ti /290X CFX
Storage SAMSUNG 980 PRO SSD 1TB/ WD DARK 770 2TB , Sabrent NVMe 512GB / 1 SSD 250GB / 1 HHD 3 TB
Display(s) Thermal Grizzly WireView / TCL 646 55 TV / 50 Xfinity Hisense A6 XUMO TV
Case TT 37 VIEW 200MM'S/ NZXT Tempest custom
Audio Device(s) Sharp Aquos
Power Supply FSP Hydro PTM PRO 1200W ATX 3.0 PCI-E GEN-5 80 Plus Platinum - EVGA 1300G2/Corsair w750
Mouse G502
Keyboard G413

Attachments

  • Display Driver Uninstaller.exe
    1.3 MB · Views: 493
  • Capture.PNGMSI.PNG
    80.8 KB · Views: 344
  • Capture.PNGtestgpu.PNG
    90.4 KB · Views: 307

9of9

New Member
Joined
Jun 19, 2019
Messages
8 (0.00/day)
What Windows version are you using? And have you used DDU?
https://nvidia.custhelp.com/app/answers/detail/a_id/4808 , HOTFIX DRIVER 430.97

I'm on Windows 10 version 1903. Haven't resorted to DDU yet - just going to try out the driver hotfix you've linked.

MSI Overclocking is being a bit weird. When I start testing, it'll peak my GPU usage for a few minutes, but it won't stop the test of its own accord, and when I hit 'Stop' it gives me a C++ Runtime Error and then just crashes :shadedshu: Will have to have a fiddle with it.
 
Joined
Jul 11, 2015
Messages
617 (0.19/day)
System Name Harm's Rig's
Processor 5950X /2700x / AMD 8370e 4500
Motherboard ASUS DARK HERO / ASRock B550 Phantom Gaming 4
Cooling Enermax LIQMAX III ARGB 360 AIO/ Zalman cooler fan 110mm
Memory Patriot Viper Steel DDR4 16GB (4x 8GB) 4000M TRIDENT Z F-43600V15D-16GTZ /G.SKILL DDR4
Video Card(s) ZOTAC AMP EXTREME AIRO 4090 / 1080 Ti /290X CFX
Storage SAMSUNG 980 PRO SSD 1TB/ WD DARK 770 2TB , Sabrent NVMe 512GB / 1 SSD 250GB / 1 HHD 3 TB
Display(s) Thermal Grizzly WireView / TCL 646 55 TV / 50 Xfinity Hisense A6 XUMO TV
Case TT 37 VIEW 200MM'S/ NZXT Tempest custom
Audio Device(s) Sharp Aquos
Power Supply FSP Hydro PTM PRO 1200W ATX 3.0 PCI-E GEN-5 80 Plus Platinum - EVGA 1300G2/Corsair w750
Mouse G502
Keyboard G413
GeForce Hotfix Driver Version 431.18
GeForce Hotfix display driver version 431.18 is based on our latest Game Ready Driver 430.86. This Hotfix driver addresses the following:
  • Fixes BSOD after waking ASUS GL703GS/Asus GL502VML notebook from hibernation
  • Shadow of the Tomb Raider may experience a game crash or TDR when launching game on Pascal GPU
  • Shadow of the Tomb Raider: Benchmark quits when ray tracing is enabled
  • Grand Theft Auto V may experience flickering when MSAA is used
This driver also includes the fixes that were released as part of the GeForce Hotfix 430.97 display driver.

 