@Devs What is "GPU Temperature (Hot Spot)" on RX Vega?

Sasqui · Mar 28, 2018

delshay said:
You can't have a sensor on just one HBM stack; both should be connected. You have separate dies, and for safety/monitoring each HBM die has its own thermal features. Take a glance over at the JEDEC PDF docs; that's what I did.

If one HBM stack is overheating, how would you know this is happening if it is taking a reading from the other? Your CPU has a thermal reading for each core, and HBM is no different. The ability to monitor each die is important. The Fiji chip has four HBM stacks, so you should be seeing Thermal 0 to Thermal 3.

You can't just have one thermal read-out for all HBM dies when connected to the main Vega/Fiji die; that's not how things are done.

I don't think you can say for sure if AMD is following JEDEC, and it's possible that the HBM has only one sensor, with the assumption that they are all going to heat up about the same. Who knows? Without documentation from AMD, it's all speculation.

delshay · Mar 28, 2018

Sasqui said:
I don't think you can say for sure if AMD is following JEDEC, and it's possible that the HBM has only one sensor, with the assumption that they are all going to heat up about the same. Who knows? Without documentation from AMD, it's all speculation.

I find it hard to believe AMD would design something & not fully implement it in Fiji/Vega. Another way is to look at other products that use HBM & check if the thermal readout for each die is active. Volta is a good example, as it has four stacks of HBM.

Who's to say Fiji/Vega owners' throttling issues aren't related to something you can't see?

Sasqui · Mar 28, 2018

delshay said:
Who's to say Fiji/Vega owners' throttling issues aren't related to something you can't see?

From playing with Wattman for hours, it's clear to me that power limits are the number one throttling factor, if you have good cooling. If you don't have good cooling, then the 85°C core or 85°C memory temp limits will hold it back. I got my core stable up to 1733 MHz with voltage at 1050 mV (default is 1200 mV). That was with the power limit at its max (50%), and the card was pulling 330 W according to GPU-Z. Could have fried an egg on the back of the card, but it kept going. Fans were at full blast. That's when I started to wonder about the mysterious "hot spot" temp sensor.
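
A rough way to picture that order of limits, as a minimal Python sketch. The two 85°C limits are the numbers from this post; the function name, the power-cap argument, and the sample readings are made up for illustration, and this is not a claim about how AMD's firmware actually decides.

```python
CORE_LIMIT_C = 85.0  # core temp limit mentioned in the post
MEM_LIMIT_C = 85.0   # HBM temp limit mentioned in the post


def limiting_factor(core_c, mem_c, power_w, power_cap_w):
    """Guess which limit a card is running into (hypothetical helper)."""
    # With poor cooling, one of the temperature limits is reached first.
    if core_c >= CORE_LIMIT_C:
        return "core temperature"
    if mem_c >= MEM_LIMIT_C:
        return "memory temperature"
    # With good cooling, the board power cap is what is left.
    if power_w >= power_cap_w:
        return "power limit"
    return "none"
```

With made-up readings of 70°C core, 75°C memory, and 330 W against an assumed 330 W cap, `limiting_factor(70, 75, 330, 330)` returns `"power limit"`.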

What's interesting is the GTX 10 series has a hard-wired power limit. For instance, most 1070 cards can't get over 2100 core, no matter what you do.

EarthDog · Mar 28, 2018

Sasqui said:
What's interesting is the GTX 10 series has a hard-wired power limit. For instance, most 1070 cards can't get over 2100 core, no matter what you do.

But it isn't really due to power limits. Many can't hit 2100 MHz regardless of whether there is power limit headroom.

Sasqui · Mar 28, 2018

EarthDog said:
But it isn't really due to power limits. Many can't hit 2100 MHz regardless of whether there is power limit headroom.

It isn't an exact science, obviously. Some silicon just won't do it for many reasons including cooling. And I should qualify my statement, that's specifically with a 1070 Ti. There are pencil mod guides out there to up the power limit too. But out of the box, 2100 was the ceiling for just about every 1070 Ti review I saw and the card I played with.

EarthDog · Mar 28, 2018

Correct. I was just saying it isn't the power limit only that is doing it (that's how I understood that post; if I was mistaken, apologies). It seems like it's a silicon thing plus the lack of ability to add significant voltage to it, power limits, and temps. Anyway... that Vega......

Sasqui · Mar 28, 2018

EarthDog said:
Correct. I was just saying it isn't the power limit only that is doing it

Yea, assuming good silicon, good voltage, current regulation and good cooling, the power limit is the final wall in the 1070 Ti PCB (and a lot of other GTX 10 cards I suspect). They (NVidia) did that to the 1070 Ti so it wouldn't cannibalize the 1080 sales, or so I understand.

Back to the mysterious hot spot on Vega... and power limits. All things equal, when I set the power limit to 25%, the core tops out at 1650+ MHz; set it to 50% and it peaks at 1730+ MHz ...and pulls one hell of a load for a GPU. Again, that's undervolted to 1050 mV on the core. If I bump that up to 1100 mV, both of the peak core speeds go down (indicating a power limit hit).
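
Why more voltage means lower peak clocks under a fixed power cap can be sketched with the usual first-order rule that dynamic power scales roughly with V² × f. This is a textbook approximation that ignores leakage, temperature, and how the firmware picks power states, so treat the result as direction only; the function name is made up for illustration.

```python
def clock_at_new_voltage(f_mhz, v_old_mv, v_new_mv):
    """Clock that draws the same dynamic power at a new voltage.

    Assumes power is proportional to V^2 * f and ignores leakage.
    """
    return f_mhz * (v_old_mv / v_new_mv) ** 2


# Same power budget, core voltage raised from 1050 mV to 1100 mV.
print(round(clock_at_new_voltage(1730, 1050, 1100)))  # 1576
```

The idealized model overstates the drop for a real card, but it points the same way as the observation: at the same power cap, a higher voltage leaves less room for clock speed.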

delshay · Mar 30, 2018

Sasqui said:
From playing with Wattman for hours, it's clear to me that power limits are the number one throttling factor, if you have good cooling. If you don't have good cooling, then the 85°C core or 85°C memory temp limits will hold it back. I got my core stable up to 1733 MHz with voltage at 1050 mV (default is 1200 mV). That was with the power limit at its max (50%), and the card was pulling 330 W according to GPU-Z. Could have fried an egg on the back of the card, but it kept going. Fans were at full blast. That's when I started to wonder about the mysterious "hot spot" temp sensor.

What's interesting is the GTX 10 series has a hard-wired power limit. For instance, most 1070 cards can't get over 2100 core, no matter what you do.

Putting power limits to one side, you need to track down thermal throttling and understand what's causing it. I own an R9 Nano which I will claim is the fastest R9 Nano card. It does not have the highest clock speed, but it throttles less than any other R9 Nano.

Keeping the R9 Nano's main VRMs cooler reduces its thermal throttling, but the main VRMs are not the real problem.
At the other end of the card are two minor VRMs, which are rated for 85°C. The main VRMs heat the internal baseplate so much that it trips either the VRMs at the other end of the card or the two ICs directly behind the inductor/main VRMs on the other side of the card. Those two ICs are also rated at 85°C. I believe this is what trips the card when it overheats.

The short story for Vega cards: you need to look at the location of other ICs, especially ICs that are mounted on the other side of the card directly behind or near the inductor/VRMs, & check their thermal limits by looking at their documentation. I do not own a Vega card, so I can't check the location of any ICs on the back of the card (if any).

delshay · Apr 4, 2018

Sasqui said:
I don't think you can say for sure if AMD is following JEDEC, and it's possible that the HBM has only one sensor, with the assumption that they are all going to heat up about the same. Who knows? Without documentation from AMD, it's all speculation.

I believe the hardware integration of HBM thermal sensors on Fiji & Vega is not at fault. My guess is, & I'm only guessing here, it's a firmware problem.
I just can't see the hardware group getting this wrong.

This is why I want to see firmware updates built into AMD Adrenalin Software, so that we get firmware updates direct from the manufacturer.

Sasqui · Apr 4, 2018

delshay said:
I believe the hardware integration of HBM thermal sensors on Fiji & Vega is not at fault. My guess is, & I'm only guessing here, it's a firmware problem.
I just can't see the hardware group getting this wrong.

This is why I want to see firmware updates built into AMD Adrenalin Software, so that we get firmware updates direct from the manufacturer.

Well, here's one for you (I don't have a screenshot ATM)... when I overclock to a certain point, GPU-Z shows the HBM temp at 2100 degrees, but the card is humming along just fine. That tells me that the reported / exposed temperature sensor is not the same as the one responsible for throttling. Or there's something else entirely going on.
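
For anyone logging sensor data, a reading like that is easy to flag. A minimal Python sketch, assuming the temperatures have already been pulled out of a log into a list; the 0 to 150°C plausibility window and the sample values are my own assumptions, not limits published by AMD or GPU-Z.

```python
def implausible_readings(samples_c, low_c=0.0, high_c=150.0):
    """Return (index, value) pairs that fall outside a plausible range."""
    return [(i, t) for i, t in enumerate(samples_c) if not low_c <= t <= high_c]


log = [62.0, 64.0, 2100.0, 65.0]  # made-up samples, one obvious glitch
print(implausible_readings(log))  # [(2, 2100.0)]
```

A flagged value with no throttling at the same moment would fit the idea that the exposed reading is not the one the card acts on.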

System Name	Senile
Processor	I7-4790K@4.8 GHz 24/7
Motherboard	MSI Z97-G45 Gaming
Cooling	Be Quiet Pure Rock Air
Memory	16GB 4x4 G.Skill CAS9 2133 Sniper
Video Card(s)	GIGABYTE Vega 64
Storage	Samsung EVO 500GB / 8 Different WDs / QNAP TS-253 8GB NAS with 2x10Tb WD Blue
Display(s)	34" LG 34CB88-P 21:9 Curved UltraWide QHD (3440x1440) FREE_SYNC*
Case	Rosewill
Audio Device(s)	Onboard + HD HDMI
Power Supply	Corsair HX750
Mouse	Logitech G5
Keyboard	Corsair Strafe RGB & G610 Orion Red
Software	Win 10