
Tesla T4 Problems

sozkan

New Member
Joined
Mar 9, 2020
Messages
20 (0.01/day)
Hello, we have a Threadripper TRX40 workstation with a Tesla T4 that is overheating.
I had some trouble during the installation and thought it had been resolved.
But the problem still seems to be unsolved. The GPU temperature is at least 57 °C, and on simple benchmarks (not even a stress test) it overheats up to 91 °C and the GPU shuts itself down.
GPU memory is always in use, even at idle.
Teslat4-1.gif
Teslat4-0.gif

Configuration video:

Any idea what is going on?
Best
Serkan
 
Joined
Jun 29, 2009
Messages
1,871 (0.35/day)
Location
Heart of Eutopia!
System Name ibuytheusedstuff
Processor 5960x
Motherboard x99 sabertooth
Cooling old socket775 cooler
Memory 32 Viper
Video Card(s) 1080ti on morpheus 1
Storage raptors+ssd
Display(s) acer 120hz
Case open bench
Audio Device(s) onb
Power Supply antec 1200 moar power
Mouse mx 518
Keyboard roccat arvo
i cannot find a 441.08 driver that officially supports tesla cards.
this is the latest with cuda 10.2 for tesla t series


maybe it's worth a try

sorry, i misread that: you are already using the latest drivers!
 

sozkan

New Member
Joined
Mar 9, 2020
Messages
20 (0.01/day)
Thank you for the quick response.
I have actually installed the same driver, but the installation reported an error and suggested installing the DCH version. Later it suggested the standard version. I have no clue which one is actually running. It works, but with faults: the memory is at full load with no workload on it! I guess the heat comes from the memory.
I have made a new video about the problem.
 
Joined
Jun 29, 2009
Messages
1,871 (0.35/day)
Location
Heart of Eutopia!
System Name ibuytheusedstuff
Processor 5960x
Motherboard x99 sabertooth
Cooling old socket775 cooler
Memory 32 Viper
Video Card(s) 1080ti on morpheus 1
Storage raptors+ssd
Display(s) acer 120hz
Case open bench
Audio Device(s) onb
Power Supply antec 1200 moar power
Mouse mx 518
Keyboard roccat arvo
it was my fault misreading info on gpu-z sorry
never saw this memory usage myself-the card seems to downclock okay

is this a new card?
newest bios on your motherboard?

maybe ya could post all your specs for easier helping? thx

did ya try to swap the cards to another slot?
are all pci-e slots occupied?
can ya switch the tesla card to pci-e 3.0 in bios?


for others who want to help: looks like everything was bought new:
asus lc360 aio
msi trx40 creator\changed to Gigabyte Aorus TRX40 Extreme with newest bios
g.skill neo F4-3600c16-19-19-39 \ 32gtznc \ x2
corsair hx1200i
corsair mp510 nvme x2
nvidia tesla T4m low profile

video gets interesting from 23.00min with msi mainboard

with the new gigabyte mainboard, the problems with the tesla start from 36.00 min, with error D4 = pci resource allocation error \ out of resources.
 

sozkan

New Member
Joined
Mar 9, 2020
Messages
20 (0.01/day)
The "MSI TRX40 Creator" is gone, because it gave no display signal at all and the "Nvidia Tesla T4" overheated until it went offline.
I have replaced it with the Gigabyte Aorus TRX40 Extreme.
Yes, everything is new and it is a fresh installation.
I have just installed a new BIOS that came from Gigabyte support, but the card runs even warmer. The Tesla T4 does not even heat above room temperature in my other Intel PC!
These motherboards have four x16 PCIe slots, but on both boards two of them run at x8 and two at x16, so I do not have much choice. GPUs need to be in a full-speed slot, but I will try the Tesla in an x8 slot; it might be supported. Nvidia claims the Tesla T4 will not lose any performance at x8, but I doubt it.
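
For a rough sense of what x8 versus x16 means, the theoretical link bandwidth can be worked out from the PCIe generation and the lane count. This is only a sketch: it assumes the T4 links at PCIe 3.0, the table and function names are made up for illustration, and real throughput is lower because of protocol overhead.

```python
# Per-lane transfer rate (GT/s) and line-encoding efficiency per PCIe generation.
PCIE_GENERATIONS = {
    1: (2.5, 8 / 10),      # 8b/10b encoding
    2: (5.0, 8 / 10),      # 8b/10b encoding
    3: (8.0, 128 / 130),   # 128b/130b encoding
    4: (16.0, 128 / 130),  # 128b/130b encoding
}


def link_bandwidth_gb_s(generation, lanes):
    """Theoretical bandwidth in GB/s per direction for a PCIe link."""
    rate_gt_s, efficiency = PCIE_GENERATIONS[generation]
    # One transfer carries one bit per lane; divide by 8 to get bytes.
    return rate_gt_s * efficiency / 8 * lanes


for lanes in (8, 16):
    print(f"PCIe 3.0 x{lanes}: {link_bandwidth_gb_s(3, lanes):.2f} GB/s per direction")
```

Whether halving the link matters depends on how much data the workload moves between the host and the card. The link that was actually negotiated can be read with nvidia-smi's `pcie.link.gen.current` and `pcie.link.width.current` query fields.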
 

Joined
Jun 29, 2009
Messages
1,871 (0.35/day)
Location
Heart of Eutopia!
System Name ibuytheusedstuff
Processor 5960x
Motherboard x99 sabertooth
Cooling old socket775 cooler
Memory 32 Viper
Video Card(s) 1080ti on morpheus 1
Storage raptors+ssd
Display(s) acer 120hz
Case open bench
Audio Device(s) onb
Power Supply antec 1200 moar power
Mouse mx 518
Keyboard roccat arvo
and just for testing i would place a fan on the tesla. you are not the only one with an overheating tesla card

maybe it's just dead on arrival

Tesla T4 actually not even heating above room temperature at my other intel PC!
so you are saying the tesla card works normally in another pc without heat + memory problems?
 

sozkan

New Member
Joined
Mar 9, 2020
Messages
20 (0.01/day)
Gigabyte support sent a BIOS update, and the result is more overheating:

I have removed the Tesla T4 from the AMD motherboard and installed it in an Intel i9-based motherboard: there seems to be no overheating issue there, but the full memory use remains.
 
Joined
Aug 22, 2010
Messages
748 (0.15/day)
Location
Germany
System Name Acer Nitro 5 (AN515-45-R715)
Processor AMD Ryzen 9 5900HX
Motherboard AMD Promontory / Bixby FCH
Cooling Acer Nitro Sense
Memory 32 GB
Video Card(s) AMD Radeon Graphics (Cezanne) / NVIDIA RTX 3080 Laptop GPU
Storage WDC PC SN530 SDBPNPZ
Display(s) BOE CQ NE156QHM-NY3
Software Windows 11 beta channel
no ... issue Except full memory use remain.


Double-check VRAM usage in a CLI with this command:

Code:
"%ProgramFiles%\NVIDIA Corporation\NVSMI\nvidia-smi.exe"
 

sozkan

New Member
Joined
Mar 9, 2020
Messages
20 (0.01/day)
Nvidia.png


The test setup here is the Intel-based system. Both readings were captured at the same time. According to "nvidia-smi.exe", memory usage is low (86/15205), but the TechPowerUp app shows 15359MB (100%).
Our main computer, however, is the AMD Threadripper 3970X. The main problems there are the heat and the memory issue. I am really curious to see a different Tesla T4 in a similar system, to find out whether this is a conflict between the new-generation AMD platform and the Nvidia Tesla GPU!
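
To take screenshots out of the comparison, the same memory figures can be pulled from nvidia-smi in machine-readable form. A minimal Python sketch, assuming nvidia-smi is on the PATH (on Windows it may need the full path quoted earlier in the thread); the function names are made up for illustration:

```python
import subprocess


def parse_memory_csv(text):
    """Parse 'used, total' rows (MiB, no units) into a list of (used, total) ints."""
    rows = []
    for line in text.strip().splitlines():
        used, total = (int(field.strip()) for field in line.split(","))
        rows.append((used, total))
    return rows


def percent_used(used, total):
    """Share of memory in use, as a percentage."""
    return 100.0 * used / total


def query_memory(exe="nvidia-smi"):
    """Ask nvidia-smi for used and total memory of every GPU it can see."""
    result = subprocess.run(
        [exe, "--query-gpu=memory.used,memory.total", "--format=csv,noheader,nounits"],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_memory_csv(result.stdout)
```

With the reading posted above, `percent_used(86, 15205)` comes out at roughly 0.57, which is nowhere near the 100% shown by the other tool. Calling `query_memory()` on the affected machine returns one `(used, total)` pair per GPU.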
 
Joined
Jul 18, 2016
Messages
506 (0.18/day)
System Name Gaming PC / I7 XEON
Processor I7 4790K @stock / XEON W3680 @ stock
Motherboard Asus Z97 MAXIMUS VII FORMULA / GIGABYTE X58 UD7
Cooling X61 Kraken / X61 Kraken
Memory 32gb Vengeance 2133 Mhz / 24b Corsair XMS3 1600 Mhz
Video Card(s) Gainward GLH 1080 / MSI Gaming X Radeon RX480 8 GB
Storage Samsung EVO 850 500gb ,3 tb seagate, 2 samsung 1tb in raid 0 / Kingdian 240 gb, megaraid SAS 9341-8
Display(s) 2 BENQ 27" GL2706PQ / Dell UP2716D LCD Monitor 27 "
Case Corsair Graphite Series 780T / Corsair Obsidian 750 D
Audio Device(s) ON BOARD / ON BOARD
Power Supply Sapphire Pure 950w / Corsair RMI 750w
Mouse Steelseries Sesnsei / Steelseries Sensei raw
Keyboard Razer BlackWidow Chroma / Razer BlackWidow Chroma
Software Windows 1064bit PRO / Windows 1064bit PRO
i would not trust gpu-z much
 
Joined
Jan 8, 2017
Messages
8,860 (3.36/day)
System Name Good enough
Processor AMD Ryzen R9 7900 - Alphacool Eisblock XPX Aurora Edge
Motherboard ASRock B650 Pro RS
Cooling 2x 360mm NexXxoS ST30 X-Flow, 1x 360mm NexXxoS ST30, 1x 240mm NexXxoS ST30
Memory 32GB - FURY Beast RGB 5600 Mhz
Video Card(s) Sapphire RX 7900 XT - Alphacool Eisblock Aurora
Storage 1x Kingston KC3000 1TB 1x Kingston A2000 1TB, 1x Samsung 850 EVO 250GB , 1x Samsung 860 EVO 500GB
Display(s) LG UltraGear 32GN650-B + 4K Samsung TV
Case Phanteks NV7
Power Supply GPS-750C
TESLAS DO NOT HAVE ACTIVE COOLING.

They are passively cooled and without modifying the cooling it's always going to overheat in a normal system, these things are designed for server casings with forced air intakes.

I can't believe no one pointed this out.

Take the shroud off and try and find a way to mount a fan on it, unless it's under warranty and you don't want to do that. Otherwise you have to come up with something else to try and force the air through the heatsink somehow. I've seen people try and make a "funnel" out of tape and put a fan at the end of it.
 
Joined
Sep 28, 2005
Messages
3,148 (0.47/day)
Location
Canada
System Name PCGR
Processor 12400f
Motherboard Asus ROG STRIX B660-I
Cooling Stock Intel Cooler
Memory 2x16GB DDR5 5600 Corsair
Video Card(s) Dell RTX 3080
Storage 1x 512GB Mmoment PCIe 3 NVME 1x 2TB Corsair S70
Display(s) LG 32" 1440p
Case Phanteks Evolve itx
Audio Device(s) Onboard
Power Supply 750W Cooler Master sfx
Software Windows 11
The level 20 has the front glass panel, right? That thing doesn't have good airflow and unfortunately, as the gentleman above me stated, the T4 is a fanless gpu. That is also noticeable in the video.

So the poor thing is cooking as the airflow isn't the greatest.
 
Joined
Jun 2, 2017
Messages
7,790 (3.13/day)
System Name Best AMD Computer
Processor AMD 7900X3D
Motherboard Asus X670E E Strix
Cooling In Win SR36
Memory GSKILL DDR5 32GB 5200 30
Video Card(s) Sapphire Pulse 7900XT (Watercooled)
Storage Corsair MP 700, Seagate 530 2Tb, Adata SX8200 2TBx2, Kingston 2 TBx2, Micron 8 TB, WD AN 1500
Display(s) GIGABYTE FV43U
Case Corsair 7000D Airflow
Audio Device(s) Corsair Void Pro, Logitch Z523 5.1
Power Supply Deepcool 1000M
Mouse Logitech g7 gaming mouse
Keyboard Logitech G510
Software Windows 11 Pro 64 Steam. GOG, Uplay, Origin
Benchmark Scores Firestrike: 46183 Time Spy: 25121

If the OP has a Level 20 he may want to change that to something like the CM 500 Mesh so that the components can get proper airflow. It would have been better (if they were still available) to use the Core X series.
 
Joined
Jan 8, 2017
Messages
8,860 (3.36/day)
System Name Good enough
Processor AMD Ryzen R9 7900 - Alphacool Eisblock XPX Aurora Edge
Motherboard ASRock B650 Pro RS
Cooling 2x 360mm NexXxoS ST30 X-Flow, 1x 360mm NexXxoS ST30, 1x 240mm NexXxoS ST30
Memory 32GB - FURY Beast RGB 5600 Mhz
Video Card(s) Sapphire RX 7900 XT - Alphacool Eisblock Aurora
Storage 1x Kingston KC3000 1TB 1x Kingston A2000 1TB, 1x Samsung 850 EVO 250GB , 1x Samsung 860 EVO 500GB
Display(s) LG UltraGear 32GN650-B + 4K Samsung TV
Case Phanteks NV7
Power Supply GPS-750C
That card won't be able to be cooled properly no matter how much airflow you throw at it; the air never goes through the heatsink like it should due to low pressure.
 
Joined
Sep 28, 2005
Messages
3,148 (0.47/day)
Location
Canada
System Name PCGR
Processor 12400f
Motherboard Asus ROG STRIX B660-I
Cooling Stock Intel Cooler
Memory 2x16GB DDR5 5600 Corsair
Video Card(s) Dell RTX 3080
Storage 1x 512GB Mmoment PCIe 3 NVME 1x 2TB Corsair S70
Display(s) LG 32" 1440p
Case Phanteks Evolve itx
Audio Device(s) Onboard
Power Supply 750W Cooler Master sfx
Software Windows 11

Well, I guess the user could try to somehow attach a fan to blow directly through the fins from the back end blowing out towards the back plate. If that makes sense.

like this:



This here is a thread on the P4 which had the overheating issue:


If OP has a 3d printer, the link provides the gcode file the printer needs. If you can find someone who has one, that could also work. The P4 and T4 look to be the same size so it should work, no?
 
Joined
Jul 18, 2016
Messages
506 (0.18/day)
System Name Gaming PC / I7 XEON
Processor I7 4790K @stock / XEON W3680 @ stock
Motherboard Asus Z97 MAXIMUS VII FORMULA / GIGABYTE X58 UD7
Cooling X61 Kraken / X61 Kraken
Memory 32gb Vengeance 2133 Mhz / 24b Corsair XMS3 1600 Mhz
Video Card(s) Gainward GLH 1080 / MSI Gaming X Radeon RX480 8 GB
Storage Samsung EVO 850 500gb ,3 tb seagate, 2 samsung 1tb in raid 0 / Kingdian 240 gb, megaraid SAS 9341-8
Display(s) 2 BENQ 27" GL2706PQ / Dell UP2716D LCD Monitor 27 "
Case Corsair Graphite Series 780T / Corsair Obsidian 750 D
Audio Device(s) ON BOARD / ON BOARD
Power Supply Sapphire Pure 950w / Corsair RMI 750w
Mouse Steelseries Sesnsei / Steelseries Sensei raw
Keyboard Razer BlackWidow Chroma / Razer BlackWidow Chroma
Software Windows 1064bit PRO / Windows 1064bit PRO
nice solution
 

bug

Joined
May 22, 2015
Messages
13,163 (4.07/day)
Processor Intel i5-12600k
Motherboard Asus H670 TUF
Cooling Arctic Freezer 34
Memory 2x16GB DDR4 3600 G.Skill Ripjaws V
Video Card(s) EVGA GTX 1060 SC
Storage 500GB Samsung 970 EVO, 500GB Samsung 850 EVO, 1TB Crucial MX300 and 2TB Crucial MX500
Display(s) Dell U3219Q + HP ZR24w
Case Raijintek Thetis
Audio Device(s) Audioquest Dragonfly Red :D
Power Supply Seasonic 620W M12
Mouse Logitech G502 Proteus Core
Keyboard G.Skill KM780R
Software Arch Linux + Win10
Submerging it in water should solve any cooling issues :D
 
Joined
Aug 22, 2010
Messages
748 (0.15/day)
Location
Germany
System Name Acer Nitro 5 (AN515-45-R715)
Processor AMD Ryzen 9 5900HX
Motherboard AMD Promontory / Bixby FCH
Cooling Acer Nitro Sense
Memory 32 GB
Video Card(s) AMD Radeon Graphics (Cezanne) / NVIDIA RTX 3080 Laptop GPU
Storage WDC PC SN530 SDBPNPZ
Display(s) BOE CQ NE156QHM-NY3
Software Windows 11 beta channel
...According to "nvidia-smi.exe" Memory usage (86/15205) not much. But TechPowerUP app shows 15359MB (%100)...

I guess it's something like a buffer overflow in GPU-Z.
@W1zzard would have to take a look at that issue.

btw
Tesla driver has been updated today to version 442.50.
 

sozkan

New Member
Joined
Mar 9, 2020
Messages
20 (0.01/day)
Thank you very much for the support.
I partially agree about the passive cooling being a bad design (the cooling problem) and about the solutions.
But I have tested the card in one Intel and two AMD motherboards. The towers have similar cooling capabilities, and there is no workload!
- The Intel system does not overheat at idle, and after several hours of running it is still at no more than 45 Celsius.
- Both AMD motherboards overheat the card a lot. Within several seconds it climbs to 90 Celsius and the GPU turns off. The first one, the MSI board, did not even give a display signal!
I have contacted the GPU manufacturer; they have seen the material I shared and agreed to replace the faulty card. I will try to get a different model instead, in case it is an incompatibility issue!
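
The difference between the two behaviours (settling around 45 Celsius versus reaching 90 Celsius within seconds) is easy to quantify from a temperature log. A minimal Python sketch; the function name and the sample readings are made up for illustration and are not measurements from this thread:

```python
def seconds_to_limit(samples, limit_c, interval_s=1.0):
    """Return the seconds elapsed until a reading first reaches limit_c, or None.

    samples is a list of temperatures in Celsius taken every interval_s seconds.
    """
    for index, temp_c in enumerate(samples):
        if temp_c >= limit_c:
            return index * interval_s
    return None


# Hypothetical readings, one per second, for illustration only.
idle_like = [41, 42, 43, 44, 45, 45, 45]
runaway_like = [57, 63, 70, 78, 85, 91]

print(seconds_to_limit(idle_like, 90))     # None: never reaches the limit
print(seconds_to_limit(runaway_like, 90))  # 5.0: sixth sample is the first at or above 90
```

Once Windows and the driver are running, real samples could be collected with something like `nvidia-smi --query-gpu=temperature.gpu --format=csv,noheader,nounits -l 1`. That does not cover the case where the card heats up before any operating system is installed.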

I am coming to a conclusion with these possibilities:
- It might be a faulty card that needs replacement. After the replacement, a cooling upgrade might be a good idea.
- The VRAM issue: "GPU-z" shows the memory fully occupied, while "nvidia-smi.exe" shows it is not used! Whatever causes the full memory use in "GPU-z", if it is not a memory issue, then it might be something else!
- The AMD TRX40 Threadripper CPU versus the Nvidia GPU: both are high tech and from competing companies! There may be unmentioned conflicts and hidden or unknown incompatibility issues!

Thank you. Does it (buffer overflow) mean a defect? But the card was overheating on the AMD motherboard even when no Windows was installed.
 
Joined
Sep 28, 2005
Messages
3,148 (0.47/day)
Location
Canada
System Name PCGR
Processor 12400f
Motherboard Asus ROG STRIX B660-I
Cooling Stock Intel Cooler
Memory 2x16GB DDR5 5600 Corsair
Video Card(s) Dell RTX 3080
Storage 1x 512GB Mmoment PCIe 3 NVME 1x 2TB Corsair S70
Display(s) LG 32" 1440p
Case Phanteks Evolve itx
Audio Device(s) Onboard
Power Supply 750W Cooler Master sfx
Software Windows 11
Well, give that a shot! If it works afterwards, then good! If not, then it is something else. Out of curiosity, when you remove drivers, you are running DDU right? Or try the NVidia driver program that is on here.

This one is more extreme, but did you try on a fully clean drive? Like a fresh install of windows?

Other than that, if you do end up using it, you may end up with heat issues anyway later on.
 
 
Joined
Aug 22, 2010
Messages
748 (0.15/day)
Location
Germany
System Name Acer Nitro 5 (AN515-45-R715)
Processor AMD Ryzen 9 5900HX
Motherboard AMD Promontory / Bixby FCH
Cooling Acer Nitro Sense
Memory 32 GB
Video Card(s) AMD Radeon Graphics (Cezanne) / NVIDIA RTX 3080 Laptop GPU
Storage WDC PC SN530 SDBPNPZ
Display(s) BOE CQ NE156QHM-NY3
Software Windows 11 beta channel
Does it (buffer overflow) means defect?

No, "buffer overflow" is just a common bug in software, in this case GPU-Z (if i guessed right).
Your hardware is fine.
 

sozkan

New Member
Joined
Mar 9, 2020
Messages
20 (0.01/day)
By the way, I am not sure exactly how long, but the AMD TRX40 Threadripper motherboard was taking noticeably longer than usual to start up into Windows with the "Nvidia Tesla T4" installed. When I removed it, startup was faster. I will try it and share again.

Windows is just 3-4 days old. Before that, I had been trying to find the problem for 2 weeks. The heating issue was there before Windows, because there was no display signal until the new gaming GPU (GTX 1660 Super) arrived!
 
Joined
Sep 28, 2005
Messages
3,148 (0.47/day)
Location
Canada
System Name PCGR
Processor 12400f
Motherboard Asus ROG STRIX B660-I
Cooling Stock Intel Cooler
Memory 2x16GB DDR5 5600 Corsair
Video Card(s) Dell RTX 3080
Storage 1x 512GB Mmoment PCIe 3 NVME 1x 2TB Corsair S70
Display(s) LG 32" 1440p
Case Phanteks Evolve itx
Audio Device(s) Onboard
Power Supply 750W Cooler Master sfx
Software Windows 11
I am not sure how two separate GPUs operate at the same time on this system and how the drivers were installed (sorry, I did not watch the whole video) so I don't know what you did there. There clearly is a conflict going on if the GPU is in full use at idle, which makes it overheat.

I am trying to do research on this but can't seem to find other examples of the same issue.


Well, give the GPU RMA a try. If the system works fine without the Tesla installed and with the other GPU left in, then who knows. If the RMA works then great! If not, then there is a conflict going on. As you said, PNY is offering RMA. But I truly think there is a conflict going on between the TRX40 motherboard and the two GPUs together. I could be entirely wrong but this is what I think.
 

sozkan

New Member
Joined
Mar 9, 2020
Messages
20 (0.01/day)

I will try the RMA. I am familiar with high-grade gaming GPUs from before. I have even run AMD and Nvidia GPUs at the same time on the same motherboard. But these new things are a first for me too.
At first, we intended to use only the Tesla T4. I thought the Tesla T4 would drive a display through the Thunderbolt port (that is what the reseller told us). But I finally understood that it does not! So we got a cheaper second GPU for display purposes, and we use the Tesla as a processor in our CFD simulation.
However, the Tesla T4 heats up on the AMD motherboard even when it was the only card, on the first attempt!
 