
1080 Ti - Degradation of Performance During Neural Network Training

9of9

New Member
Joined
Jun 19, 2019
Messages
8 (0.27/day)
Up front: I have an Inno3D GeForce GTX 1080Ti iChill X3 with an ASUS P9X79 Pro motherboard, running Windows 10. I know Inno3D doesn't have the best reputation, but I'm keeping my fingers crossed I don't need to replace the GPU just yet ;)

I've noticed this issue mainly while running neural network training on the GPU, but I think this is affecting games as well - recently, the card has started running slower.

More precisely, what happens is that I'll start up the computer, begin training, and for a few hours it will run at top capacity without any issues. But all of a sudden, it'll drop in productivity and subsequent work done takes almost exactly twice as long as before. Other times, CUDA will error out or the entire computer will crash and reboot. The thing is that this has only started happening recently - I ran the same training algorithm (StyleGAN) with the same image sizes, exact same load on GPU and VRAM, and I never had any of these problems. Something has started messing it up.

Trying to figure out what could be causing this - I ended up running a GPU-Z capture over the course of a day, which I've attached.

The switch over from one mode of operation to the other is really obvious, looking at the capture, but I'm having trouble figuring out what could cause this. As far as I can tell, there's nothing in the software that would cause it to switch operation, and I can't see anything going on in terms of background processes. If I kill the process and then start again, it continues running slow. Only way to reset it is to restart the computer, at which point the cycle continues.

To summarise the sensor log:
* GPU Clock spikes up and down around 1600 MHz, until around the 2h40 mark, at which point it caps out at 2000 MHz.
* GPU Temperature starts out close to 90'C, but after the breaking point at 2h40, drops down to about 64'C.
* Accordingly Fan Speed % is at 100 at first, but then drops to 70.
* GPU Load is between 90% and 100%, but after 2h40 it rises up to being 95% to 100%.
* Memory Controller Load goes from averaging 55% to 25%.
* Power Consumption drops from average 240W to 160W.
* PerfCap Reason oscillates between 2 and 16 in the first part, but is consistently 4 in the second part. I.e. it goes from power and thermal caps to voltage reliability cap super consistently.
* VDDC switches up from wiggling around ~0.81V to flat 1.04V
* CPU Temp drops from ~56'C to ~52'C
* Memory usage doesn't change.
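For anyone who wants to pin down the exact switch-over point rather than eyeball it, here is a minimal sketch in Python. It assumes the GPU-Z sensor log is a comma-separated text file with one header row, and the column name you pass in (for example the power consumption column) has to be copied from your own log header, since the exact header text varies between GPU-Z versions. The helper names `read_column` and `find_mode_switch` are made up for illustration.

```python
import csv
from statistics import mean


def read_column(path, column):
    """Read one numeric column from a CSV sensor log (header names are stripped)."""
    with open(path, newline="", encoding="utf-8", errors="replace") as f:
        reader = csv.reader(f)
        header = [h.strip() for h in next(reader)]
        idx = header.index(column)  # raises ValueError if the column is missing
        values = []
        for row in reader:
            try:
                values.append(float(row[idx]))
            except (IndexError, ValueError):
                continue  # skip blank or malformed cells
        return values


def find_mode_switch(samples, window=30, min_jump=0.2):
    """Return the index where the mean of the next `window` samples differs most
    from the mean of the previous `window` samples, or None if the largest
    relative change is smaller than `min_jump`."""
    best_index, best_change = None, 0.0
    for i in range(window, len(samples) - window + 1):
        before = mean(samples[i - window:i])
        after = mean(samples[i:i + window])
        if before == 0:
            continue
        change = abs(after - before) / abs(before)
        if change > best_change:
            best_index, best_change = i, change
    return best_index if best_change >= min_jump else None
```

Running `find_mode_switch` on the power column and then on the clock and VDDC columns should show whether all three sensors flip at the same sample, which would suggest a single event rather than a gradual drift.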

I can't make heads or tails of this, personally. My initial guess was that the problem was maybe overheating, or some other process kicking in and using up GPU (maybe I've acquired Bitcoin mining malware). But these figures make no sense to me. GPU Clock Speed goes up, but GPU Temperature goes down. VDDC goes up, Power Consumption goes down. GPU Load rises, but the GPU seems to be doing less work.
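To put a number on how odd this is, here is a rough back-of-the-envelope check in Python. It assumes the textbook CMOS relation that dynamic power scales with activity × V² × f, and it ignores leakage, memory, VRM losses and fans, all of which are included in the board power that GPU-Z reports. Treat the result as an order-of-magnitude hint, not a measurement.

```python
def dynamic_power_ratio(v_before, v_after, f_before, f_after):
    """Expected change in dynamic power if the switching activity stayed the same."""
    return (v_after / v_before) ** 2 * (f_after / f_before)


# Figures from the sensor log summary above.
expected = dynamic_power_ratio(v_before=0.81, v_after=1.04, f_before=1600, f_after=2000)
observed = 160 / 240  # power after the switch divided by power before

# If power ~ activity * V^2 * f, the activity must have changed by this factor.
implied_activity = observed / expected

print(f"expected power ratio at equal activity: {expected:.2f}")
print(f"observed power ratio:                   {observed:.2f}")
print(f"implied relative activity:              {implied_activity:.2f}")
```

Under those assumptions the voltage and clock change alone should roughly double the power draw, yet it falls by a third. That would only add up if the chip is doing something like a third of the switching work per clock that it did before, even though GPU Load still reads 95% to 100%.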

Maybe another process kicks in that's demanding GPU time, but with less occupancy? Even then, I'm not sure that makes sense.

Please let me know if you have any idea what this might be a symptom of! :)
 

Attachments

rtwjunkie

PC Gaming Enthusiast
Supporter
Joined
Jul 25, 2008
Messages
12,197 (3.04/day)
Location
Louisiana -Laissez les bons temps rouler!
System Name Bayou Devil
Processor Core i7-4790k 4.4Ghz @ 1.18v
Motherboard ASUS Z97 Deluxe
Cooling All air: 2x140mm Fractal exhaust; 3x 140mm Cougar Intake; Enermax T40F CPU cooler
Memory 2x 8GB Mushkin Redline DDR-3 1866
Video Card(s) MSI GTX 1080Ti Gaming X
Storage 1x 500 MX500 SSD; 1x 2TB WD Black; 2x 4TB WD Black;1x 2TB WD Green (eSATA)
Display(s) HP 27q 27" IPS @ 2560 x 1440
Case Fractal Design Define R4 Black w/Titanium front -windowed
Audio Device(s) Soundblaster Z
Power Supply Seasonic X-850
Mouse Coolermaster Sentinel III
Keyboard Logitech G610 Orion mechanical (Cherry Brown switches)
Software Windows 10 Pro 64-bit (Start10 & Fences 3.0 installed)
What is your power supply? Brand, wattage, age?

How much RAM do you have? What are ambient room temps? How is air flow through the case?

Thanks for the good breakdown. These are a few other things which might or might not be pertinent as well.
 
Joined
Jul 11, 2015
Messages
27 (0.02/day)
System Name Harm's Rig
Processor AMD 8370e 4500
Motherboard Msi GD80 990fxa
Cooling Air
Memory EVGA 16Gb
Video Card(s) 1080 Ti / 290x/290a cfx
Storage 1SSD / 1HHD
Display(s) 40"TV / 24" Asus
Case NZXT Tempest custom
Audio Device(s) Sharp Aquos
Power Supply EVGA 1300G2
Mouse G502
Keyboard G413

Attachments


9of9

New Member
Joined
Jun 19, 2019
Messages
8 (0.27/day)
Thanks rtwjunkie!

* The PSU is a Corsair HX1050 (1050 W, 80 PLUS Gold certified, modular), which is about six years old at this point.
* 16GB RAM.
* Ambient room temps have been warm, but not too warm - around 22'C to 24'C in the past week.
* Airflow should be pretty good, I'm using this case with quite a few large, additional fans.

That's a good call, harm9963 - I'll have to experiment and see if bringing that down some makes any difference.

Can this kind of behaviour happen through overheating? Crashes and reboots, sure, thermal throttling, sure. But altogether switching down to this kind of... different mode of operation? I don't even know what to call it, since I want to say it's slower, but the clockspeed goes up o_O
 

INSTG8R

My Custom Title
Joined
Nov 26, 2004
Messages
5,510 (1.03/day)
Location
Canuck in Norway
System Name Hellbox 3.0(same case new guts)
Processor i7 4790K 4.6
Motherboard Asus Z97 Sabertooth Mark 1
Cooling TT Kandalf L.C.S.(Water/Air)AC Cuplex Kryos CPU Block/Noctua
Memory 2x8GB Corsair Vengance Pro 2400
Video Card(s) Sapphire Nitro+ Vega 64
Storage WD Caviar Black SATA 3 1TB x2 RAID 0 2xSamsung 850 Evo 500GB RAID 0 1TB WD Blue
Display(s) Samsung CGH70 27” 1440 144hz Freesync 2 HDR
Case TT Kandalf L.C.S.
Audio Device(s) Soundblaster ZX/Logitech Z906 5.1
Power Supply Seasonic X-1050W 80+ Gold
Mouse G502 Proteus Spectrum
Keyboard G19s
Software Win 10 Pro x64
Yeah with your temps and the subsequent behaviour it definitely sounds like it’s starting to throttle. Improving your load temps would probably solve this.
 
Joined
Jul 11, 2015
Messages
27 (0.02/day)
System Name Harm's Rig
Processor AMD 8370e 4500
Motherboard Msi GD80 990fxa
Cooling Air
Memory EVGA 16Gb
Video Card(s) 1080 Ti / 290x/290a cfx
Storage 1SSD / 1HHD
Display(s) 40"TV / 24" Asus
Case NZXT Tempest custom
Audio Device(s) Sharp Aquos
Power Supply EVGA 1300G2
Mouse G502
Keyboard G413
Open your case and put your hand close to the GPU; you should feel a lot of air moving against your hand.
 

Attachments

Joined
May 4, 2011
Messages
323 (0.11/day)
System Name Arena-Fighter6a
Processor AMD Ryzen 7 2700x
Motherboard Asrock x470 Taichi Ultimate
Cooling AMD Wraith Prism
Memory 2x16GB 3000MHz CL16 DDR4
Video Card(s) Gigabyte Radeon R9 380 4GB
Storage JBOD: 2+1+1TB 7200RPM HDD
Display(s) Samsung S24E370DL 24" IPS Freesync 75Hz
Case SilentiumPC Armis AR7 TG-RGB
Audio Device(s) Creative X-Fi Titanium PCIe x1
Power Supply Corsair HX850 80+ Platinum
Mouse Gigabyte Aorus M3
Keyboard Zalman ZM-K300M
Software Windows 10 x64 Enterprise/Ubuntu Budgie amd64
Yeah with your temps and the subsequent behaviour it definitely sounds like it’s starting to throttle. Improving your load temps would probably solve this.
I partially agree with this, but I think that if the card were going to throttle, it would do so sooner, not after a few hours. Machine learning training is an intensive task that I would assume keeps GPU load constantly around 99-100%, so any thermal issues would appear within minutes.

There are also other things to consider, for example the data samples and the processing done on them, and what the neural network is doing from the moment performance drops. Maybe performance drops because the operations are getting more complicated and the GPU simply doesn't have enough performance to keep up. I'm not deep into ML, so these are just the thoughts of someone who knows only the absolute basics, a tiny bit beyond the definition of ML.

On the other hand, we have similar problems in gaming. In those workloads GPU performance isn't constant either, and it drops over time if you play for several hours in one sitting, which has been tested and documented pretty well across the internet.
 
Joined
Oct 21, 2006
Messages
106 (0.02/day)
Location
Oak Ridge, TN
System Name BorgX79
Processor E5-1650v2 6/12cores@4.4GHz
Motherboard Sabertoothx79
Cooling Capitan 360
Memory Muhskin DDR3-1866
Video Card(s) Sapphire R480 8GB
Storage Chronos SSD
Display(s) 3x VW266H
Case Ching Mien 600
Audio Device(s) Realtek
Power Supply Cooler Master 1000W Silent Pro
Mouse Logitech G900
Keyboard Rosewill RK-1000
Software Win7x64
If the voltage goes high, and power goes down, I'd bet a power supply chip is locking up or otherwise losing its shit, and the only reaction the board has is to go to a 'limp home' type mode to preserve itself.

I'd work on lowering the temperature first, and see if that extends the "ON" time; if it affects it, that's likely the deal.

Using an IR temperature measuring device can help ID the part that's freaking out; it will cool considerably when it freaks.

Adding a small heatsink to said part might help. The fact that it takes hours makes me think it might not be heatsinked.

If it's a chip that's not being reported over the smbus, the 'throttle monitor' might not see it, and it may just be one of those things that someone said "this will never happen, so we don't need to monitor it, but if it does this, we change to this run profile to keep it from catching on fire".

:)

Change the environment; if the operation time changes, it's heat; if it doesn't, it's software.
 

INSTG8R

My Custom Title
Joined
Nov 26, 2004
Messages
5,510 (1.03/day)
Location
Canuck in Norway
System Name Hellbox 3.0(same case new guts)
Processor i7 4790K 4.6
Motherboard Asus Z97 Sabertooth Mark 1
Cooling TT Kandalf L.C.S.(Water/Air)AC Cuplex Kryos CPU Block/Noctua
Memory 2x8GB Corsair Vengance Pro 2400
Video Card(s) Sapphire Nitro+ Vega 64
Storage WD Caviar Black SATA 3 1TB x2 RAID 0 2xSamsung 850 Evo 500GB RAID 0 1TB WD Blue
Display(s) Samsung CGH70 27” 1440 144hz Freesync 2 HDR
Case TT Kandalf L.C.S.
Audio Device(s) Soundblaster ZX/Logitech Z906 5.1
Power Supply Seasonic X-1050W 80+ Gold
Mouse G502 Proteus Spectrum
Keyboard G19s
Software Win 10 Pro x64
I partially agree with this, but I think that if the card were going to throttle, it would do so sooner, not after a few hours. Machine learning training is an intensive task that I would assume keeps GPU load constantly around 99-100%, so any thermal issues would appear within minutes.

There are also other things to consider, for example the data samples and the processing done on them, and what the neural network is doing from the moment performance drops. Maybe performance drops because the operations are getting more complicated and the GPU simply doesn't have enough performance to keep up. I'm not deep into ML, so these are just the thoughts of someone who knows only the absolute basics, a tiny bit beyond the definition of ML.

On the other hand, we have similar problems in gaming. In those workloads GPU performance isn't constant either, and it drops over time if you play for several hours in one sitting, which has been tested and documented pretty well across the internet.
I totally agree, but high temps, voltage drops and utilization fluctuations definitely point to the card “backing down”. The timing is odd but all the symptoms are there. But because it’s a genuinely odd case there may be more to it. I know my Vega runs different boosts and temperatures across different games despite 100% utilization in all cases. As you said, this is definitely a heavy utilization scenario, so only the odd timing is the question mark.
 

9of9

New Member
Joined
Jun 19, 2019
Messages
8 (0.27/day)
Thanks guys! So much good insight :)

There are also other things to consider, for example the data samples and the processing done on them, and what the neural network is doing from the moment performance drops. Maybe performance drops because the operations are getting more complicated and the GPU simply doesn't have enough performance to keep up. I'm not deep into ML, so these are just the thoughts of someone who knows only the absolute basics, a tiny bit beyond the definition of ML.
There is an element of stepping up to higher resolutions when training StyleGAN, yes. I'm not sure whether the GPU load increases as such, but what does happen is that the size of the image batches gets reduced at the very least, so as to remain inside the VRAM budget. However, I can see from the logs when those step downs happen and in this case I know I'm comparing like-for-like - it's just crunching through thousands of the same operations over and over again, the GPU usage is super predictable. Moreover, I've got the logs from the previous time I trained the same architecture with the same image sizes and parameters, so I can compare timings for each tick between how fast that training was taking at the same stage, and how fast this training is going. Before the GPU weirdness happens, timings are pretty much identical between themselves, and compared to the older logs. After the weirdness, everything slows by almost exactly one half.
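The like-for-like comparison of tick timings can be sketched roughly as below. This assumes each progress line in the training log starts with `tick <n>` and contains a `sec/tick <seconds>` field, which is how my logs look; if yours are laid out differently, the regular expression needs adjusting. The function names are made up for illustration.

```python
import re

# Matches lines such as: "tick 10  kimg 1400.1  ...  sec/tick 240.0  sec/kimg 1.71 ..."
TICK_RE = re.compile(r"\s*tick\s+(\d+)\b.*?\bsec/tick\s+([\d.]+)")


def sec_per_tick(log_text):
    """Map tick number -> seconds per tick for every progress line in the log."""
    timings = {}
    for line in log_text.splitlines():
        match = TICK_RE.match(line)
        if match:
            timings[int(match.group(1))] = float(match.group(2))
    return timings


def slowdown_by_tick(old, new):
    """For ticks present in both runs, return new_time / old_time."""
    return {tick: new[tick] / old[tick] for tick in sorted(old.keys() & new.keys()) if old[tick] > 0}
```

A ratio close to 1.0 means the two runs are in step at that tick, and a ratio close to 2.0 marks the ticks where the new run takes twice as long as the old one.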

I partially agree with this, but I think that if the card were going to throttle, it would do so sooner, not after a few hours. Machine learning training is an intensive task that I would assume keeps GPU load constantly around 99-100%, so any thermal issues would appear within minutes.
If heat is the problem, then what might be happening is a problem with air flow through the room, rather than through the case. If the PC is generating more heat than the room's ventilation is dissipating, perhaps those several hours - plus the effect of time of day - is enough to raise the ambient temperature to a point where it becomes a problem.

If the voltage goes high, and power goes down, I'd bet a power supply chip is locking up or otherwise losing its shit, and the only reaction the board has is to go to a 'limp home' type mode to preserve itself.
That's an interesting theory. Do you mean a chip on the card, or on the mobo? Or either one?

I've opened up the case, opened windows, air-sprayed some of the dust/cobwebs away, checked that all three fans spin up correctly under load. It seems like it's all in decent working order, but operating at its peak it still levels out at about 90'C... maybe a little closer to 89'C now that the room has cooled. Since it's late on this end, I'll leave it on overnight with the improved airflow and see if that makes any difference in the long term, even while temperature remains high.

Some thoughts before I leave it overnight:
* 1080 Tis do start thermal throttling at 91'C from what I can tell, so it makes sense to me that it would level out at 90'C - 91'C. But this level of throttling seems weird on its own and it never gets anywhere close to exceeding even 100'C, let alone the 105'C cap for a thermal shutdown.
* The fact that opening the case up hasn't made any big difference to the operating temperature is a little concerning, but I can't spot any physical faults with the cooling on the card itself. It doesn't appear congested, and all three fans are running well - the airflow is there. I don't have logs of temperature from when it was working correctly, but I wonder if that might just be due to higher GPU utilisation under the neural network training load, which pushes it up toward the thermal limit because of the type of work it's doing. I should see what the temperatures look like while running a demanding game tomorrow for comparison - if that reaches the 90s, then it does suggest a cooling issue I guess.
* Would it be an option to underclock the card, or reduce the limit for thermal throttling, to see if that improves stability?
* A few sites like this mention that the max RPM of the fans is 1600, but at full load mine is actually hitting 1700 RPM most of the time. I wonder if that's accurate, and if there are any sensor readings for the RPM of the three individual fans.
* What's up with the GPU clock going up when this happens? The fact that clock is higher, VDDC is higher, GPU load is the same but GPU temp goes down and computation is slower really doesn't make any sense in my head. Surely higher clock speeds and more voltage should result in higher temperatures and more compute?
 

Solaris17

Dainty Moderator
Staff member
Joined
Aug 16, 2005
Messages
20,604 (4.05/day)
Location
Florida
System Name Venslar
Processor I9 7980XE
Motherboard MSI x299 Tomahawk Arctic
Cooling EK Custom
Memory 32GB Corsair DDR4 3000mhz
Video Card(s) Nvidia Titan RTX
Storage 2x 2TB Micron SSDs | 1x ADATA 128SSD | 1x Drevo 256SSD | 1x 1TB 850 EVO | 1x 250GB 960 EVO
Display(s) 3x AOC Q2577PWQ (2k IPS)
Case Inwin 303 White (Thermaltake Ring 120mm Purple accent)
Audio Device(s) Realtek ALC 1220 on Audio-Technica ATH-AG1
Power Supply Seasonic 1050W Snow
Mouse Roccat Kone Aimo White
Keyboard Ducky Shine 6 Snow White
Software Windows 10 x64 Pro
Have you looked at the actual CPU/RAM consumption of the machine?

Have you tried different drivers, such as older ones or the Studio drivers?
 

Mussels

Moderprator
Staff member
Joined
Oct 6, 2004
Messages
47,072 (8.72/day)
Location
Australalalalalaia.
System Name LGBTBBQAMDRGBDIYPC
Processor Ryzen R7 2700X (stock/XFR OC)
Motherboard Aorus AX370-Gaming 5 (planned x570 upgrade when they come out)
Cooling Corsair H115i Pro W/ Corsair ML RGB fans
Memory 16GB DDR4 3200 Corsair Vengeance RGB Pro
Video Card(s) MSI GTX 1080 Gaming X (BIOS mod to Gaming Z) w/ NZXT Kraken x52 AIO
Storage 1TB Sasmsung 970 Pro NVME + 1TB Intel 6000 Pro NVME
Display(s) Phillips 328m6fjrmb (32" 1440p 144hz curved) + Sony KD-55X8500F (55" 4K HDR)
Case Fractal Design R6 Gunmetal Grey (Type C TG)
Audio Device(s) Razer Leviathan + Corsair Void pro RGB, Blue Yeti mic
Power Supply Corsair HX 750i (Platinum, fan off til 300W)
Mouse Logitech G903 + PowerPlay mousepad
Keyboard Corsair K65 Rapidfire
Software Windows 10 pro x64 (all systems)
Benchmark Scores Laptops: i7-4510U + 840M 2GB (touchscreen) 275GB SSD + 16GB i7-2630QM + GT 540M + 8GB
When NVIDIA drivers crash due to an unstable overclock, the driver resets and locks the card into a lower performance mode until you reboot.
This can also occur due to unstable CPU and RAM. My bet is the GPU heat; repaste it.
 
Joined
May 4, 2011
Messages
323 (0.11/day)
System Name Arena-Fighter6a
Processor AMD Ryzen 7 2700x
Motherboard Asrock x470 Taichi Ultimate
Cooling AMD Wraith Prism
Memory 2x16GB 3000MHz CL16 DDR4
Video Card(s) Gigabyte Radeon R9 380 4GB
Storage JBOD: 2+1+1TB 7200RPM HDD
Display(s) Samsung S24E370DL 24" IPS Freesync 75Hz
Case SilentiumPC Armis AR7 TG-RGB
Audio Device(s) Creative X-Fi Titanium PCIe x1
Power Supply Corsair HX850 80+ Platinum
Mouse Gigabyte Aorus M3
Keyboard Zalman ZM-K300M
Software Windows 10 x64 Enterprise/Ubuntu Budgie amd64
When NVIDIA drivers crash due to an unstable overclock, the driver resets and locks the card into a lower performance mode until you reboot.
This can also occur due to unstable CPU and RAM. My bet is the GPU heat; repaste it.
In that case it would be worth checking Event Viewer for driver crashes, which should be logged there.
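If clicking through Event Viewer is tedious, something like the following Python sketch could pull the relevant entries from the System log. It assumes that display driver resets show up as Event ID 4101 ("display driver stopped responding and has recovered"), which is the ID usually associated with them, and it shells out to the Windows `wevtutil` tool. The function name is made up, and other event IDs may be worth checking too.

```python
import subprocess
import sys


def build_driver_reset_query(max_events=20):
    """Build a wevtutil command listing the newest System events with ID 4101."""
    xpath = "*[System[(EventID=4101)]]"  # assumption: 4101 marks a display driver reset
    return [
        "wevtutil", "qe", "System",
        f"/q:{xpath}",        # XPath filter on the event ID
        f"/c:{max_events}",   # limit the number of events returned
        "/rd:true",           # newest events first
        "/f:text",            # human-readable output
    ]


if __name__ == "__main__" and sys.platform == "win32":
    result = subprocess.run(build_driver_reset_query(), capture_output=True, text=True)
    print(result.stdout or "No matching events found.")
```

Comparing the timestamps of any hits against the 2h40 mark in the sensor log would show whether the slowdown lines up with a driver reset.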
 
Joined
Jul 11, 2015
Messages
27 (0.02/day)
System Name Harm's Rig
Processor AMD 8370e 4500
Motherboard Msi GD80 990fxa
Cooling Air
Memory EVGA 16Gb
Video Card(s) 1080 Ti / 290x/290a cfx
Storage 1SSD / 1HHD
Display(s) 40"TV / 24" Asus
Case NZXT Tempest custom
Audio Device(s) Sharp Aquos
Power Supply EVGA 1300G2
Mouse G502
Keyboard G413
1700 RPM is low; try a third-party tool such as MSI Afterburner.
 

Attachments

Joined
Nov 1, 2008
Messages
3,865 (0.99/day)
Location
Vietnam
System Name Gaming System / Laptop / HTPC / Miner
Processor i5 8700K @4.8Ghz / i5 540m / i7 970 / i5 2500k
Motherboard Z370 Aorus Ultra Gaming / Acer / Shuttle sx58j3/ P67 Pro 3
Cooling CM Seidon 120 XL / Laptop Cooling / SE-903 / Stock
Memory 16Gb Nighthawk 3000 MHz/ 4GB DDR3 / 16gb DDR3 / 12 GB DDR3
Video Card(s) Colorful 1080Ti / G210m / 7870XT / 2x1060 + 1080
Storage 750G MX300 + 3TB HDDs / 250G Ultra II /250G 850 EVO/ 250gb Mechanical
Display(s) Dell U2515H + Asus VX239H/ 15.6" Laptop Screen / 720p 42" Plasma TV/ None
Case Cooler master HAF 922 / Laptop Case / Corsair Air 240 / Custom
Audio Device(s) On Board Realtek
Power Supply Andyson N700 Titanium / Laptop Power / ACBell 700 W / FSP 850 W
Mouse Logitech G700s
Keyboard CM Quickfire XT (Cherry MX Reds)
Software Windows 10 x64
Benchmark Scores 3DMark Firestrike = Timespy = 9427 Heaven = 3735
Running it hard and hot for a long time may have degraded the TIM. I'd try replacing it with new compound to try to get those temps down. Even at full load, a 1080Ti shouldn't be reaching 90C. Also check the quality of the thermal pads that should be sitting on/above the memory.
I was getting into the 80's when using mine to mine crypto and that was with a second card in my PC restricting airflow.

Right now, it runs a lot cooler, even at full gaming load.
 
Joined
Dec 14, 2009
Messages
7,315 (2.09/day)
Location
Glasgow - home of formal profanity
System Name New Ho'Ryzen
Processor Ryzen 1700X @ 3.82Ghz
Motherboard Asus Crosshair VI Hero
Cooling TR Le Grand Macho & custom GPU loop
Memory 16Gb G.Skill 3200 RGB
Video Card(s) RTX 2080ti MSI Duke @2Ghz ish
Storage Samsumg 960 Pro m2. 512Gb
Display(s) LG 32" 165Hz 1440p GSYNC
Case Lian Li PC-V33WX
Audio Device(s) On Board
Power Supply Seasonic Prime TItanium 850
Software W10
Benchmark Scores Look, it's a Ryzen on air........ What's the point?
Is there any chance that the card thinks the workload is something like a power virus? That would allow high clocks but heavy throttling on the power limit.

Other than that, as others have said, 90 degrees isn't Pascal's best temp. Cards actually start thermal throttling above about 50 degrees.
 
Joined
Feb 3, 2017
Messages
1,511 (1.69/day)
Processor i5-8400
Motherboard ASUS ROG STRIX Z370-I GAMING
Cooling CRYORIG C7 Cu
Memory 2*16GB DDR4-3200 CL16
Video Card(s) Gainward GeForce RTX 2080 Phoenix
Storage 1TB Samsung 970 Pro, 1TB Samsung 850 EVO, 1TB Crucial MX500
Display(s) ASUS PG279Q, Eizo EV2736W
Case Dan Cases A4-SFX
Power Supply Corsair SF600
Mouse Logitech G700
Keyboard Corsair K60
Voltage reliability? With the GPU at 90C, what is the VRM doing? Could it simply be the VRM overheating?
The rest of the changes make sense, but GPU voltage going up and GPU clocks going from ~1600 to 2000 seem quite strange and would hint at some software/driver change kicking in.

Definitely try forcing fans at 100% and see if that changes anything.
I would also try lowering the power limit - to around 200W maybe - just for testing to see what happens.
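For the power limit test, one option is `nvidia-smi -pl 200` from an administrator prompt, assuming the driver allows changing the limit on this card. To see what the card does afterwards, a small logger like this Python sketch could run alongside the training. It assumes `nvidia-smi` is on the PATH, and some fields may come back as not supported on GeForce cards under Windows. The helper names are made up for illustration.

```python
import csv
import subprocess
import time

# Query fields listed by `nvidia-smi --help-query-gpu`.
FIELDS = ["timestamp", "temperature.gpu", "clocks.sm", "power.draw", "utilization.gpu"]


def query_command():
    """Build the nvidia-smi call that prints one CSV line per GPU."""
    return ["nvidia-smi", "--query-gpu=" + ",".join(FIELDS), "--format=csv,noheader,nounits"]


def parse_sample(line):
    """Turn one CSV line of nvidia-smi output into a {field: text} dict."""
    values = [v.strip() for v in next(csv.reader([line]))]
    return dict(zip(FIELDS, values))


def log_samples(path, interval_s=5.0, count=None):
    """Append one sample per GPU to `path` every `interval_s` seconds.
    Runs until interrupted when `count` is None."""
    taken = 0
    with open(path, "a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        while count is None or taken < count:
            out = subprocess.run(query_command(), capture_output=True, text=True, check=True).stdout
            for line in out.splitlines():
                if line.strip():
                    writer.writerow(parse_sample(line))
            f.flush()  # keep the file current in case the machine crashes
            taken += 1
            time.sleep(interval_s)
```

Calling `log_samples("gpu_log.csv")` in a separate console gives a second record to compare against the GPU-Z capture, which helps rule out the monitoring tool itself as the source of the odd readings.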
 
Joined
Jan 8, 2017
Messages
4,182 (4.54/day)
System Name Good enough
Processor AMD Ryzen R7 1700X - 4.0 Ghz / 1.350V
Motherboard ASRock B450M Pro4
Cooling Scythe Katana 4 - 3x 120mm case fans
Memory 16GB - Corsair Vengeance LPX
Video Card(s) OEM Dell GTX 1080
Storage 1x Samsung 850 EVO 250GB , 1x Samsung 860 EVO 500GB
Display(s) 4K Samsung TV
Case Zalman R1
Power Supply 500W
To summarise the sensor log:
* GPU Clock spikes up and down around 1600, until around the 2h40 mark, at which point it caps out at 2000.
* GPU Temperature starts out close to 90'C, but after the breaking point at 2h40, drops down to about 64'C.
* Accordingly Fan Speed % is at 100 at first, but then drops to 70.
* GPU Load is between 90% and 100%, but after 2h40 it rises up to being 95% to 100%.
* Memory Controller Load goes from averaging 55% to 25%.
* Power Consumption drops from average 240W to 160W.
* PerfCap Reason oscillates between 2 and 16 in the first part, but is consistently 4 in the second part. I.e. it goes from power and thermal caps to voltage reliability cap super consistently.
* VDDC switches up from wiggling around ~0.81V to flat 1.04V
* CPU Temp drops from ~56'C to ~52'C
* Memory usage doesn't change.
All of this points to thermal problems: the card hits the 90C wall, at which point it underclocks severely -> power and temperature go down.

The cooling is screwed up, there is no question about it. Check the case ventilation and/or take the card apart and reapply TIM. 1700 RPM isn't really low for a triple fan cooler of this caliber and should keep the card nowhere near 90C.

There are also other things to consider, for example the data samples and the processing done on them, and what the neural network is doing from the moment performance drops. Maybe performance drops because the operations are getting more complicated and the GPU simply doesn't have enough performance to keep up. I'm not deep into ML, so these are just the thoughts of someone who knows only the absolute basics, a tiny bit beyond the definition of ML.
Compute workloads usually result in less heat output because the silicon that's dedicated to the fixed function graphics stuff isn't in use.
 
Joined
Jul 18, 2007
Messages
2,631 (0.60/day)
System Name panda
Processor 6700k
Motherboard sabertooth s
Cooling raystorm block<black ice stealth 240 rad<ek dcc 18w 140 xres
Memory 32gb ripjaw v
Video Card(s) 290x gamer<ntzx g10<antec 920
Storage 950 pro 250gb boot 850 evo pr0n
Display(s) QX2710LED@110hz lg 27ud68p
Case 540 Air
Audio Device(s) nope
Power Supply 750w superflower
Mouse g502
Keyboard shine 3 with grey, black and red caps
Software win 10
Benchmark Scores http://hwbot.org/user/marsey99/
how hot is the air coming out the psu?
 
Joined
Sep 17, 2014
Messages
9,443 (5.35/day)
Location
Too Long to fit in a single line here.
Processor i7 8700k 4.7Ghz @ 1.26v
Motherboard AsRock Fatal1ty K6 Z370
Cooling beQuiet! Dark Rock Pro 3
Memory 16GB Corsair Vengeance LPX 3200/C16
Video Card(s) MSI GTX 1080 Gaming X @ 2100/5500
Storage Samsung 850 EVO 1TB + Samsung 830 256GB + Crucial BX100 250GB + Toshiba 1TB HDD
Display(s) Eizo Foris FG2421
Case Fractal Design Define C TG
Power Supply EVGA G2 750w
Mouse Logitech G502 Protheus Spectrum
Keyboard Sharkoon MK80 (Brown)
Software W10 x64
Thanks guys! So much good insight :)



There is an element of stepping up to higher resolutions when training StyleGAN, yes. I'm not sure whether the GPU load increases as such, but what does happen is that the size of the image batches gets reduced at the very least, so as to remain inside the VRAM budget. However, I can see from the logs when those step downs happen and in this case I know I'm comparing like-for-like - it's just crunching through thousands of the same operations over and over again, the GPU usage is super predictable. Moreover, I've got the logs from the previous time I trained the same architecture with the same image sizes and parameters, so I can compare timings for each tick between how fast that training was taking at the same stage, and how fast this training is going. Before the GPU weirdness happens, timings are pretty much identical between themselves, and compared to the older logs. After the weirdness, everything slows by almost exactly one half.



If heat is the problem, then what might be happening is a problem with air flow through the room, rather than through the case. If the PC is generating more heat than the room's ventilation is dissipating, perhaps those several hours - plus the effect of time of day - is enough to raise the ambient temperature to a point where it becomes a problem.



That's an interesting theory. Do you mean a chip on the card, or on the mobo? Or either one?

I've opened up the case, opened windows, air-sprayed some of the dust/cobwebs away, checked that all three fans spin up correctly under load. It seems like it's all in decent working order, but operating at its peak it still levels out at about 90'C... maybe a little closer to 89'C now that the room has cooled. Since it's late on this end, I'll leave it on overnight with the improved airflow and see if that makes any difference in the long term, even while temperature remains high.

Some thoughts before I leave it overnight:
* 1080 Tis do start thermal throttling at 91'C from what I can tell, so it makes sense to me that it would level out at 90'C - 91'C. But this level of throttling seems weird on its own and it never gets anywhere close to exceeding even 100'C, let alone the 105'C cap for a thermal shutdown.
* The fact that opening the case up hasn't made any big difference to the operating temperature is a little concerning, but I can't spot any physical faults with the cooling on the card itself. It doesn't appear congested, and all three fans are running well - the airflow is there. I don't have logs of temperature from when it was working correctly, but I wonder if that might just be due to higher GPU utilisation under the neural network training load, which pushes it up toward the thermal limit because of the type of work it's doing. I should see what the temperatures look like while running a demanding game tomorrow for comparison - if that reaches the 90s, then it does suggest a cooling issue I guess.
* Would it be an option to underclock the card, or reduce the limit for thermal throttling, to see if that improves stability?
* A few sites like this mention that the max RPM of the fans is 1600, but at full load mine is actually hitting 1700 RPM most of the time. I wonder if that's accurate, and if there are any sensor readings for the RPM of the three individual fans.
* What's up with the GPU clock going up when this happens? The fact that clock is higher, VDDC is higher, GPU load is the same but GPU temp goes down and computation is slower really doesn't make any sense in my head. Surely higher clock speeds and more voltage should result in higher temperatures and more compute?
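One way to pin down exactly when the switch in the last bullet happens is to look for the largest step in each sensor column. The sketch below assumes the sensor log is available as comma-separated text. The column names (`time`, `gpu_clock_mhz`, `gpu_temp_c`) and the sample rows are invented for illustration, loosely modelled on the figures quoted in the thread, so a real log would need its own headers substituted.

```python
import csv
import io
from statistics import mean


def largest_step(values, min_side=2):
    """Return (index, step) for the split point where the mean of the
    values from `index` onward differs most from the mean before it.
    `min_side` is the minimum number of samples on each side."""
    best_index, best_step = None, 0.0
    for i in range(min_side, len(values) - min_side + 1):
        step = mean(values[i:]) - mean(values[:i])
        if abs(step) > abs(best_step):
            best_index, best_step = i, step
    return best_index, best_step


# Invented sample data, NOT the attached sensor log.
SAMPLE = """time,gpu_clock_mhz,gpu_temp_c
0:00,1600,90
0:40,1620,90
1:20,1590,89
2:00,1610,90
2:40,2000,66
3:20,2000,64
4:00,2000,64
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE)))
temps = [float(r["gpu_temp_c"]) for r in rows]
clocks = [float(r["gpu_clock_mhz"]) for r in rows]

t_idx, t_step = largest_step(temps)
c_idx, c_step = largest_step(clocks)
print("temperature step at", rows[t_idx]["time"], round(t_step, 1))
print("clock step at", rows[c_idx]["time"], round(c_step, 1))
```

If both columns report their largest step at the same timestamp, with opposite signs, that at least confirms the clock increase and the temperature drop are one event rather than two.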
Few things:

- Did you ever flex the cooling solution OR apply pressure to it while it was installed? You will notice that if you 'push' or squeeze the back end of the card while it's running, temps will probably skyrocket to 90C+. There are sensors around that area that will report differently at that point. This would explain the erratic behavior if the card has a bend or sag. A badly placed anti-sag support for a GPU can have the same effect, ironically ;) This is a pretty rare thing, but reading the topic I think this is the sort of odd stuff you should look for. It also possibly explains why things stabilize later on: when everything is warmed up, some minor expansion happens and the problem goes away.

- It's clear the card can run at max load without problems, because your sensor log shows normal behavior when it stops switching power states. You could try forcing a power state OR go into NVCP and play around with the power management options; 'Prefer maximum performance' isn't always the best solution, for example, because Pascal is temperature limited. There might be a weird conflict as well between the chosen power state, the Windows power plan and the GPU BIOS. But that does not explain too well why the power state does get stable later on.

- Are you running at stock power target/core volts?

Voltage reliability? With GPU at 90C, what does VRM do, could it simply be VRM overheating?
That seems most likely at this point, but it still does not explain too well why things stabilize later on.
 
Joined
Jul 11, 2015
Messages
27 (0.02/day)
System Name Harm's Rig
Processor AMD 8370e 4500
Motherboard Msi GD80 990fxa
Cooling Air
Memory EVGA 16Gb
Video Card(s) 1080 Ti / 290x/290a cfx
Storage 1SSD / 1HHD
Display(s) 40"TV / 24" Asus
Case NZXT Tempest custom
Audio Device(s) Sharp Aquos
Power Supply EVGA 1300G2
Mouse G502
Keyboard G413
Can you run a short GPU-bound benchmark, like 2 or 3 min, log it, and compare it to my log?
Log: benchmark, 4K optimized.
 

Attachments


9of9

New Member
Joined
Jun 19, 2019
Messages
8 (0.27/day)
1700 rpm is low; try a third-party tool, e.g. MSI Afterburner.
So, I'm still testing the results, but I think you've hit the nail on the head - at least as far as my thermal situation goes. Tweaking the fan curve in MSI Afterburner has made a huge difference - the way it was set up:
[Attached image: the original fan curve in MSI Afterburner]


Meant that literally the moment temperature dropped below 90'C, fan speed would drop sharply as well. Adjusting a custom curve closer to yours has made a massive difference - the GPU at full load runs around 65'C now.

Opening the case up and lowering room temperature overnight didn't make any difference - GPU temp stayed at 90'C. But I set up the new fan profile in the morning and after that it seemed to merrily keep working away at 65'C all day without any issues.
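To illustrate why the shape of the fan curve matters so much here, below is a rough sketch of a fan curve as a piecewise-linear lookup from temperature to fan speed. Both curves are hypothetical: the real points from the screenshot and from the replacement profile are not reproduced in the thread, and this is not how MSI Afterburner itself is implemented. `fan_speed` is just an illustrative helper name.

```python
def fan_speed(temp_c, curve):
    """Linearly interpolate fan speed (%) for a temperature, given a
    list of (temp_c, percent) points sorted by temperature. Values
    outside the curve are clamped to the first or last point."""
    if temp_c <= curve[0][0]:
        return curve[0][1]
    if temp_c >= curve[-1][0]:
        return curve[-1][1]
    for (t0, s0), (t1, s1) in zip(curve, curve[1:]):
        if t0 <= temp_c <= t1:
            return s0 + (s1 - s0) * (temp_c - t0) / (t1 - t0)


# Hypothetical curves, NOT the actual profiles from the screenshots.
# "steep" only ramps up right at the thermal limit; "gradual" ramps early.
steep_curve = [(30, 30), (85, 40), (90, 100)]
gradual_curve = [(30, 30), (50, 50), (65, 80), (75, 100)]

for temp in (65, 85, 90):
    print(temp, fan_speed(temp, steep_curve), fan_speed(temp, gradual_curve))
```

With a curve like the steep one, the fans only work hard at the limit and back off as soon as the temperature dips, so the card tends to sit right at the limit. A gradual curve commits more fan speed well before that point.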

That said, I'm not entirely sure I'm out of the woods yet. Putting it through its paces this evening, I've had a couple of driver crashes, each after running the training for just ten minutes or so - even without hitting the thermal cap. It's worth pointing out that this past week I've had two different things happen when I run training: sometimes I'd get what looks like a driver crash some minutes into training, and that would cause the training to error out and need to be manually resumed. In that case I'd normally keep trying, sometimes needing to reboot, until it would run without crashing... at which point I'd normally get this other problem, where after a few hours it starts to go slower.

I'm not sure whether or how these issues are linked, though. I've gotten one good run for about nine hours without any issues at all, which is very promising. But subsequently, it has been crashy. If the other issue doesn't return though, I guess I'll classify them as different problems and consider this case closed for now :)

Can you run a short GPU-bound benchmark, like 2 or 3 min, log it, and compare it to my log?
Attached. It looks like I can get my fan speed probably a bit more aggressive still, compared to yours, but I think over longer-term usage the GPU temperature does cap out lower for me now.
 

Attachments

Joined
Jul 11, 2015
Messages
27 (0.02/day)
System Name Harm's Rig
Processor AMD 8370e 4500
Motherboard Msi GD80 990fxa
Cooling Air
Memory EVGA 16Gb
Video Card(s) 1080 Ti / 290x/290a cfx
Storage 1SSD / 1HHD
Display(s) 40"TV / 24" Asus
Case NZXT Tempest custom
Audio Device(s) Sharp Aquos
Power Supply EVGA 1300G2
Mouse G502
Keyboard G413

Attachments

9of9

New Member
Joined
Jun 19, 2019
Messages
8 (0.27/day)
Which Windows version are you using? And have you used DDU?
https://nvidia.custhelp.com/app/answers/detail/a_id/4808 (hotfix driver 430.97)
I'm on Windows 10 version 1903. Haven't resorted to DDU yet - just going to try out the driver hotfix you've linked.

MSI Overclocking is being a bit weird. When I start testing, it'll peak my GPU usage for a few minutes, but it won't stop the test of its own accord, and when I hit 'Stop' it gives me a C++ Runtime Error and then just crashes. Will have to have a fiddle with it.
 
Joined
Jul 11, 2015
Messages
27 (0.02/day)
System Name Harm's Rig
Processor AMD 8370e 4500
Motherboard Msi GD80 990fxa
Cooling Air
Memory EVGA 16Gb
Video Card(s) 1080 Ti / 290x/290a cfx
Storage 1SSD / 1HHD
Display(s) 40"TV / 24" Asus
Case NZXT Tempest custom
Audio Device(s) Sharp Aquos
Power Supply EVGA 1300G2
Mouse G502
Keyboard G413
GeForce Hotfix Driver Version 431.18
GeForce Hotfix display driver version 431.18 is based on our latest Game Ready Driver 430.86. This Hotfix driver addresses the following:
  • Fixes BSOD after waking ASUS GL703GS/Asus GL502VML notebook from hibernation
  • Shadow of the Tomb Raider may experience a game crash or TDR when launching game on Pascal GPU
  • Shadow of the Tomb Raider: Benchmark quits when running with ray tracing enabled
  • Grand Theft Auto V may experience flickering when MSAA is used
This driver also includes the fixes that were released as part of the GeForce Hotfix 430.97 display driver.

 