
@Devs What is "GPU Temperature (Hot Spot)" on RX Vega?

Status
Not open for further replies.
Joined
Dec 6, 2005
Messages
10,881 (1.62/day)
Location
Manchester, NH
System Name Senile
Processor I7-4790K@4.8 GHz 24/7
Motherboard MSI Z97-G45 Gaming
Cooling Be Quiet Pure Rock Air
Memory 16GB 4x4 G.Skill CAS9 2133 Sniper
Video Card(s) GIGABYTE Vega 64
Storage Samsung EVO 500GB / 8 Different WDs / QNAP TS-253 8GB NAS with 2x10Tb WD Blue
Display(s) 34" LG 34CB88-P 21:9 Curved UltraWide QHD (3440*1440) *FREE_SYNC*
Case Rosewill
Audio Device(s) Onboard + HD HDMI
Power Supply Corsair HX750
Mouse Logitech G5
Keyboard Corsair Strafe RGB & G610 Orion Red
Software Win 10
You can't have a sensor on just one HBM stack; both should be connected. The stacks are separate dies, and for safety and monitoring each HBM die has its own thermal features. Take a glance at the JEDEC PDF docs; that's what I did.

If one HBM stack is overheating, how would you know if the reading is being taken from the other one? Your CPU has a thermal reading for each core, and HBM is no different. The ability to monitor each die is important. The Fiji chip has four HBM stacks, so you should be seeing Thermal 0 to Thermal 3.

You can't just have one thermal readout for all the HBM dies connected to the main Vega/Fiji die; that's not how things are done.

I don't think you can say for sure if AMD is following JEDEC, and it's possible that the HBM has only one sensor, with the assumption that they are all going to heat up about the same. Who knows? Without documentation from AMD, it's all speculation.
 
Joined
May 12, 2017
Messages
2,207 (0.87/day)
I find it hard to believe AMD would design something and not fully implement it in Fiji/Vega. Another way is to look at other products that use HBM and check whether the thermal readout for each die is active. Volta is a good example, as it has four stacks of HBM.

Who's to say Fiji/Vega owners' throttling issues aren't related to something you can't see?
 
Joined
Dec 6, 2005
Messages
10,881 (1.62/day)
From playing with Wattman for hours, it's clear to me that power limits are the number one throttling factor if you have good cooling. If you don't have good cooling, then the 85 °C core or 85 °C memory temperature limits will hold it back. I got my core stable up to 1733 MHz with the voltage at 1050 mV (default is 1200 mV). That was with the power limit at its max (50%), and the card was pulling 330 W according to GPU-Z. I could have fried an egg on the back of the card, but it kept going. Fans were at full blast. That's when I started to wonder about the mysterious "hot spot" temp sensor.
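A minimal sketch of the check described above, for anyone logging their own runs. It only encodes the three limits named in this post (power cap, 85 core, 85 memory); the `Sample` type, the function name, and the 330 W cap used in the example are made up for illustration and are not taken from Wattman or GPU-Z.

```python
from dataclasses import dataclass

# Temperature limits as quoted in the post; constant names are hypothetical.
CORE_TEMP_LIMIT_C = 85.0
MEM_TEMP_LIMIT_C = 85.0


@dataclass
class Sample:
    """One logged reading (field names are assumptions for this sketch)."""
    core_temp_c: float
    mem_temp_c: float
    power_w: float


def limiting_factors(sample: Sample, power_cap_w: float) -> list[str]:
    """Return which of the three limits this sample has reached."""
    factors = []
    if sample.power_w >= power_cap_w:
        factors.append("power")
    if sample.core_temp_c >= CORE_TEMP_LIMIT_C:
        factors.append("core temp")
    if sample.mem_temp_c >= MEM_TEMP_LIMIT_C:
        factors.append("mem temp")
    return factors


if __name__ == "__main__":
    # Well cooled but at the (assumed) cap: only the power limit is reported.
    print(limiting_factors(Sample(70.0, 75.0, 330.0), power_cap_w=330.0))
```

If a run throttles while this returns an empty list, something other than these three limits is holding the card back, which is where the hot spot question comes in.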

What's interesting is that the GTX 10 series has a hard-wired power limit. For instance, most 1070 cards can't get over 2100 MHz on the core, no matter what you do.
 
Joined
Dec 31, 2009
Messages
19,366 (3.70/day)
Benchmark Scores Faster than yours... I'd bet on it. :)
But it isn't really due to power limits. Many can't hit 2100 MHz regardless of whether there is power limit headroom. ;)
 
Joined
Dec 6, 2005
Messages
10,881 (1.62/day)
It isn't an exact science, obviously. Some silicon just won't do it, for many reasons including cooling. And I should qualify my statement: that's specifically with a 1070 Ti. There are pencil mod guides out there to raise the power limit too. But out of the box, 2100 MHz was the ceiling for just about every 1070 Ti review I saw and for the card I played with.
 
Joined
Dec 31, 2009
Messages
19,366 (3.70/day)
Benchmark Scores Faster than yours... I'd bet on it. :)
Correct. I was just saying it isn't only the power limit that is doing it (that's how I understood that post; if I was mistaken, apologies). It seems to be a combination of the silicon, the inability to add significant voltage, power limits, and temps. Anyway... that Vega... :)
 
Joined
Dec 6, 2005
Messages
10,881 (1.62/day)
Yeah, assuming good silicon, good voltage and current regulation, and good cooling, the power limit is the final wall on the 1070 Ti PCB (and on a lot of other GTX 10 cards, I suspect). NVIDIA did that to the 1070 Ti so it wouldn't cannibalize 1080 sales, or so I understand.

Back to the mysterious hot spot on Vega... and power limits. All else being equal, when I set the power limit to 25%, the core tops out at 1650+ MHz; set it to 50% and it peaks at 1730+ MHz and pulls one hell of a load for a GPU. Again, that's undervolted to 1050 mV on the core. If I bump that up to 1100 mV, both peak core speeds go down (indicating a power limit hit).
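A rough way to see why more voltage costs clock speed under a fixed power cap is the textbook first-order model where dynamic power scales with voltage squared times frequency. The sketch below uses only that approximation and the voltages and clock from this post; it ignores leakage and everything the card's firmware actually does, and the function names are made up for illustration.

```python
def dynamic_power_ratio(v_old_mv: float, v_new_mv: float,
                        f_old_mhz: float, f_new_mhz: float) -> float:
    """Relative change in dynamic power, assuming P is proportional to V^2 * f."""
    return (v_new_mv / v_old_mv) ** 2 * (f_new_mhz / f_old_mhz)


def clock_at_same_power(v_old_mv: float, v_new_mv: float,
                        f_old_mhz: float) -> float:
    """Clock that keeps V^2 * f constant after a voltage change."""
    return f_old_mhz * (v_old_mv / v_new_mv) ** 2


if __name__ == "__main__":
    # Same 1730 MHz clock, 1050 mV raised to 1100 mV: roughly 10% more power.
    print(round(dynamic_power_ratio(1050, 1100, 1730, 1730), 3))
    # Clock the same power budget would allow at 1100 mV under this model.
    print(round(clock_at_same_power(1050, 1100, 1730)))
```

Under this idealized model, a card already sitting at its power cap has to give up clock speed when the voltage goes up, which matches the direction of the drop described above. The exact size of the drop on a real card will differ.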
 
Joined
May 12, 2017
Messages
2,207 (0.87/day)
Putting power limits to one side, you need to track thermal throttling and understand what's causing it. I own an R9 Nano which I will claim is the fastest R9 Nano card. It does not have the highest clock speed, but it throttles less than any other R9 Nano.

Keeping the R9 Nano's main VRMs cooler reduces its thermal throttling, but the main VRMs are not the real problem.
At the other end of the card are two minor VRMs, which are rated for 85 °C. The main VRMs heat the internal baseplate so much that it trips either the VRMs at the other end of the card or the two ICs directly behind the inductor/main VRMs on the other side of the card. Those two ICs are also rated for 85 °C. I believe this is what trips the card when it overheats.

The short story for Vega cards is that you need to look at the location of the other ICs, especially ICs mounted on the other side of the card directly behind or near the inductor/VRMs, and check their thermal limits in their documentation. I do not own a Vega card, so I can't check the location of any ICs on the back of the card (if there are any).
 
Joined
May 12, 2017
Messages
2,207 (0.87/day)
I believe the hardware integration of the HBM thermal sensors on Fiji and Vega is not at fault. My guess, and I'm only guessing here, is that it's a firmware problem.
I just can't see the hardware group getting this wrong.

This is why I want to see a firmware update feature built into the AMD Adrenalin software, so that we get firmware updates direct from the manufacturer.
 
Joined
Dec 6, 2005
Messages
10,881 (1.62/day)
Well, here's one for you (I don't have a screenshot ATM)... when I overclock to a certain point, GPU-Z shows the HBM temp at 2100 degrees, but the card is humming along just fine. That tells me that the reported/exposed temperature sensor is not the same as the one responsible for throttling. Or there's something else entirely going on.
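For anyone graphing logged temperatures, a reading like 2100 degrees is worth separating out before drawing conclusions. Below is a small sketch that splits readings into plausible and suspect values. The 0 to 150 range is an arbitrary assumption chosen for illustration, not a documented sensor limit, and the function name is made up.

```python
def split_plausible(readings_c, low_c=0.0, high_c=150.0):
    """Split temperature readings into (plausible, suspect) lists.

    The default bounds are an assumption for this sketch, not a spec value.
    """
    plausible, suspect = [], []
    for value in readings_c:
        if low_c <= value <= high_c:
            plausible.append(value)
        else:
            suspect.append(value)
    return plausible, suspect


if __name__ == "__main__":
    ok, odd = split_plausible([62.0, 71.0, 2100.0, 74.0])
    print("plausible:", ok)
    print("suspect:", odd)
```

A suspect value while the card keeps running normally points at a reporting glitch rather than a real temperature, but it does not say which sensor the throttling logic actually uses.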
 