
Bug affecting all Nvidia GPUs - Nvidia won't respond - we need your help!

All that I want to know is, what was the impetus that compelled you to look into this? Was it noticeable stuttering in games? Buggy driver behaviour elsewhere?

If games do not reflect the same behaviour, and it was just because you happened to be benching Heaven one day and saw what you perceived to be irregularities in Afterburner frametime, then I honestly don't know if I would blame Nvidia for handling it the way they have. I sympathize with your curiosity and desire to get to the bottom of this issue, but it's a bit of a stretch to expect others to take it as seriously if there are no real, practical ramifications outside of Heaven 1600x900. "Could have an impact" is different from "*does* have an impact".

I would go through the testing procedure (and just might tonight out of curiosity), but I doubt we'll get anywhere since you've established that a 3090 + 1600x900 Heaven + 100ms polling + 300fps are apparently equal parts necessary to reproduce the symptoms.

I won't get into why those other aspects might be unrealistic/problematic, but 100ms rate is a hell of a lot of polling. Since you are a fellow 5900X owner I trust you are well acquainted with the performance hit/adverse behaviour/stuttering associated with unoptimized or overly frequent monitoring on modern systems. HWInfo default is 2000ms, and that still *can* have a slight impact on performance in some benchmarks and games. I understand the need for a polling rate as fast as 100ms when testing or reviewing hardware, but personally I wouldn't feel confident drawing any kind of conclusions from occasional spikes that I see from results at that polling rate.
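
If I wanted to sanity-check how much the polling itself costs, I'd probably do something like this rough Python sketch (it uses the pynvml bindings for NVML, assumes an Nvidia GPU and the nvidia-ml-py package, and is emphatically not the OP's procedure):

Code:
# Poll GPU power in a background thread at a chosen interval, and time a dummy
# "frame" loop on the main thread to see whether the worst-case iteration time
# grows as the polling interval gets tighter.
import threading
import time

import pynvml

POLL_INTERVAL_S = 0.1   # 100 ms, like the Afterburner setting discussed above
RUN_SECONDS = 10

def poll_gpu(stop_event, interval_s):
    pynvml.nvmlInit()
    handle = pynvml.nvmlDeviceGetHandleByIndex(0)
    try:
        while not stop_event.is_set():
            pynvml.nvmlDeviceGetPowerUsage(handle)  # the kind of sensor read in question
            time.sleep(interval_s)
    finally:
        pynvml.nvmlShutdown()

stop = threading.Event()
poller = threading.Thread(target=poll_gpu, args=(stop, POLL_INTERVAL_S), daemon=True)
poller.start()

worst = 0.0
end = time.perf_counter() + RUN_SECONDS
while time.perf_counter() < end:
    start = time.perf_counter()
    sum(i * i for i in range(20000))  # stand-in for per-frame CPU work
    worst = max(worst, time.perf_counter() - start)

stop.set()
poller.join()
print(f"worst iteration with {POLL_INTERVAL_S * 1000:.0f} ms polling: {worst * 1000:.2f} ms")

Run it once at a relaxed interval and once at 100ms and compare the worst-case numbers; it won't tell you where a stall comes from, only whether polling alone moves the needle.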
 
1) I would expect that almost nobody else, if anybody, has. It is very difficult to see, because your system is specifically designed to hide these kinds of problems (details earlier in the thread if you want). But if you follow the instructions I provided, the computer will see it for you and put it on a graph that you can read very easily (provided, of course, that your computer can do it; most high-end rigs will. It gets harder to replicate as processing power drops, because the spike ends up 'hidden' behind the higher frametimes).
2) It occurs during a function provided by the Nvidia driver (sorry, I edited that into my reply earlier, as I realised it was another small but critical detail missing). I have not tested any AMD hardware and I don't expect it to display the issue, but you're welcome to try it.
To be honest, I'd be happy to test the issue, but
1. I don't currently own an AMD graphics card, and I think testing only Nvidia would give me somewhat irrelevant data, and
2. Like others, I fail to see how this is a problem. If I understand the situation correctly, it only happens when you have Afterburner and HWinfo open at the same time, which doesn't make much sense. Your frametime shows HWinfo's polling intervals, which could be related to the nvidia driver, but more so to the communication between HWinfo and the driver. You might as well just close HWinfo and game on. Also, if your frametime spikes are so small that you have to rely on monitoring data to make them visible, then you might as well just ignore the graphs altogether.

Honestly, is your gaming experience affected, regardless of all the data you collected?
 
i do believe you, but why do you need to be monitoring in the first place? also, most people would assume monitoring might trigger some spikes, especially something like hwinfo that refreshes a huge list of data per unit of time

seems likely that the officially noted VR-with-logging-open issue is related, if not the same; i've forgotten if amd had a similar one

how can we be sure that a fix is even possible? what if it's just the nature of the type of api call, such as some sort of serial/synchronous ping that awaits a reply from the hardware? that might not be a limitation of the driver or nv hardware at all, but of bus protocols or potentially even the OS's management of them
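
to illustrate the idea (toy python sketch only, nothing to do with nvidia's actual driver internals): if the sensor read is synchronous and serialises on a shared resource, anything else that needs that resource waits out the whole read

Code:
# Toy sketch: a lock stands in for the contended bus/driver path.
import threading
import time

shared_resource = threading.Lock()

def slow_sensor_read():
    # Pretend the hardware takes ~5 ms to answer and the call blocks
    # while holding the shared resource.
    with shared_resource:
        time.sleep(0.005)

def render_frame():
    # The render path also needs the resource; normally this is near-instant.
    with shared_resource:
        pass

threading.Thread(target=slow_sensor_read, daemon=True).start()
time.sleep(0.001)  # let the sensor read grab the lock first

start = time.perf_counter()
render_frame()
print(f"frame blocked for {(time.perf_counter() - start) * 1000:.1f} ms")
# At a 240 FPS target (~4.2 ms per frame), a stall of a few milliseconds
# is a visibly dropped frame.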

do you know if any games with their own in-game on-screen displays for temp/vram/ram/cpu/fps are also using the bad api call? (overwatch, cod, etc)

why would first-tier support 1) be able to comprehend technical instructions or graphs, or 2) be able to directly ask the relevant driver developers? i have no idea why you ever bothered to message them... the first thing i'd do is maybe get manuel's attention or various devs/employees on twitter, then the same for technically competent & scientific review personalities (GN, AHOC, derbauer, computerbase, digitalfoundry, maybe the aussie steve i forgot the site/channel name of), what about the hwinfo dev
 
@xcasxcursex I thought it only fair to follow your test procedure before I pass more judgment. This 5900X is bone stock again today because it turns out that my CO per-core settings still need work.
  • Run #1 and #2 are with all sensors enabled. Run #3 is with GPU sensors disabled.
  • Afterburner at 100ms polling rate. Heaven lowest settings, 1600x900.
  • FPS during Heaven at these settings is about 400-600fps, but generally in the 400-500fps range.
  • Priority cores on my 5900X are conveniently right at the top, Core 0 and Core 1.
[Screenshots: HWiNFO with all sensors enabled / HWiNFO with GPU sensors disabled]

Run #1 and #2 (all sensors):

[Screenshots: Heaven frametime graphs, runs #1 and #2]

Run #3 (no GPU sensors):

[Screenshot: Heaven frametime graph, run #3]

I'm sorry, but I just don't see much of a difference here to support any sort of conclusion. And no real patterns as to the frequency of the frametime spikes either - if you take some liberties with "patterns" you can kinda see a little bit of it in run #1, but run #2 was done right after at the exact same settings and does not confirm that phenomenon.

Also, Heaven is kind of a crappy benchmark, I only ever use it for preliminary vetting of a GPU undervolt curve (and even then it sucks because the real stress test for my 2060 Super undervolt is days/weeks of COD MW19). It's from 2009, for pete's sake, and we're artificially running it at 1600x900 low. I don't even use it to verify my Vega 7 iGPU OCs on my 4650G because of how much better of a benchmark Valley is, being less CPU-reliant than Heaven.

Trying to expect game-level FPS consistency and smoothness out of something like Heaven - and drawing immediate connections to actual game performance based on observations in Heaven - is a bit ambitious, don't you think?

----

So yeah, no dice. To put it politely, I feel like you might be making a mountain out of a molehill. If you aren't actually suffering any major adverse impact on your in-game experience, why bother trying to get through to Nvidia based on something you think you see in the Heaven benchmark?

Like, I'm sure you bought that 3090 to game, not to run Heaven 2009, right?

Whichever way you decide to go with this, best of luck. I know it's not always easy to convey a less-than-obvious observation or issue to brain-dead customer service reps at any company, be it Nvidia or AMD, Ford or GM.
 
100ms rate is a hell of a lot of polling. Since you are a fellow 5900X owner I trust you are well acquainted with the performance hit/adverse behaviour/stuttering associated with unoptimized or overly frequent monitoring on modern systems. HWInfo default is 2000ms, and that still *can* have a slight impact on performance in some benchmarks and games. I understand the need for a polling rate as fast as 100ms when testing or reviewing hardware, but personally I wouldn't feel confident drawing any kind of conclusions from occasional spikes that I see from results at that polling rate.
Just a reminder that the 100ms poll rate is just for your eyes. We can achieve the exact same results from a 10000ms (ten seconds) poll rate (or any other). The high poll rate is not required to reproduce the issue. You're right to mention it though, I'm just clearing it up.


Actually a thing, just sayin':
[Screenshot: kernel trace showing a gap in processing, captured in several tools]

For the uninitiated, this big hole in the middle of our processing, shown here in a few different tools, is the PC waiting for the GPU monitoring to let go of the CPU. If you set up your graphics just right, that hole causes a spike, like in the graph. I really don't want to get into this, but for those who feel like there's no proof of this and no details given... you've not got even a tenth of it, and nor do you need it. You have all you need to see it and go 'yeh, that's a thing'. No need to make things overly complex.
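
If you'd rather have the computer find that hole for you than squint at graphs, here's a minimal sketch (Python; "MsBetweenPresents" is an assumed column name borrowed from PresentMon-style captures, and "capture.csv" is just a placeholder filename; adjust both to whatever your tool exports):

Code:
# Flag frames that take several times the median frametime, and report roughly
# when in the run they happened.
import csv
import statistics

SPIKE_FACTOR = 3.0  # a frame 3x slower than the median counts as a spike

def find_spikes(path, column="MsBetweenPresents"):
    with open(path, newline="") as f:
        frametimes = [float(row[column]) for row in csv.DictReader(f)]
    median = statistics.median(frametimes)
    elapsed_ms = 0.0
    spikes = []
    for ft in frametimes:
        elapsed_ms += ft
        if ft > SPIKE_FACTOR * median:
            spikes.append((elapsed_ms / 1000.0, ft))
    return median, spikes

median, spikes = find_spikes("capture.csv")
print(f"median frametime: {median:.2f} ms, spikes: {len(spikes)}")
for when, ft in spikes[:10]:
    print(f"  ~{when:7.2f} s into the run: {ft:.2f} ms")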


All that I want to know is, what was the impetus that compelled you to look into this? Was it noticeable stuttering in games? Buggy driver behaviour elsewhere?

If games do not reflect the same behaviour, and it was just because you happened to be benching Heaven one day and saw what you perceived to be irregularities in Afterburner frametime, then I honestly don't know if I would blame Nvidia for handling it the way they have.

I would go through the testing procedure (and just might tonight out of curiosity), but I doubt we'll get anywhere since you've established that a 3090 + 1600x900 Heaven + 100ms polling + 300fps are apparently equal parts necessary to reproduce the symptoms.

I won't get into why those other aspects might be unrealistic/problematic, but 100ms rate is a hell of a lot of polling. Since you are a fellow 5900X owner I trust you are well acquainted with the performance hit/adverse behaviour/stuttering associated with unoptimized or overly frequent monitoring on modern systems. HWInfo default is 2000ms, and that still *can* have a slight impact on performance in some benchmarks and games. I understand the need for a polling rate as fast as 100ms when testing or reviewing hardware, but personally I wouldn't feel confident drawing any kind of conclusions from occasional spikes that I see from results at that polling rate.
Noticeable stutter in game... but that isn't really anything to worry about in the context of this. Just because it is an actual thing in a real-life situation doesn't make it more critical, and just because you can only see it in a forced scenario in a measurement tool doesn't mean it's not significant.

I have glossed over this, but while we're waiting for people to go read the thread, I'll tell you a story about how I came to notice this incredibly elusive thing. You may be aware of the monitor technology ULMB, which comes with G-Sync monitors (the module ones). It's a backlight strobe, basically. Great for fast motion, but there's a catch: despite being on a VRR monitor, it works at a fixed refresh rate. So, to avoid tearing you'd have to use vsync and you'd lose any kind of competitive edge you'd have gained... but, not long ago, I discovered scanline sync. This is a feature of RTSS which allows you to synchronise to the vsync signals in a manner that does not cause double-buffering and the input lag that comes with it. Buuuut there's a catch: in order to keep it in that state, you need a MINIMUM framerate of AT LEAST your refresh rate. So for 120Hz I need 120FPS minimum. And it's gotta be stable. Instability in frametimes causes the tearline to roll into view and things get super ugly super fast. TLDR it's mega demanding on the GPU and CPU. The old PC wasn't cutting it. And along rolls the new one, praise the silicon gods, and I can now hit that FPS and it is freakin' awesome. But it gets better, because I'm pretty sure I can double it (which comes with all the usual benefits of doubling framerate, but just drops half of the frames from displaying, because refresh rate). I always wanted to try this, so off I went aiming for 240 FPS minimum. And this bloody thing raised its ugly head.

And yeh, I get it, that's a super edge case. But like I've said above, this edge case just exposed a problem that is always there. The real-world problems for me did not start when I tried for 240FPS minimum. Framerates aren't even the bulk of the real-world concern. The problems started long ago and have been covered up by the fact that the old card couldn't realistically hit the frametimes needed to see it. Trying for 240FPS didn't cause the problem, it just uncovered it.


@xcasxcursex I thought it only fair to follow your test procedure before I pass more judgment. This 5900X is bone stock again today because it turns out that my CO per-core settings still need work.
  • Run #1 and #2 are with all sensors enabled. Run #3 is with GPU sensors disabled.
  • Afterburner at 100ms polling rate. Heaven lowest settings, 1600x900.
  • FPS during Heaven at these settings is about 400-600fps, but generally in the 400-500fps range.
  • Priority cores on my 5900X are conveniently right at the top, Core 0 and Core 1.
[Screenshots: HWiNFO with all sensors enabled / HWiNFO with GPU sensors disabled]

Run #1 and #2 (all sensors):

[Screenshots: Heaven frametime graphs, runs #1 and #2]

Run #3 (no GPU sensors):

[Screenshot: Heaven frametime graph, run #3]

I'm sorry, but I just don't see much of a difference here to support any sort of conclusion. And no real patterns as to the frequency of the frametime spikes either - if you take some liberties with "patterns" you can kinda see a little bit of it in run #1, but run #2 was done right after at the exact same settings and does not confirm that phenomenon.

Also, Heaven is kind of a crappy benchmark, I only ever use it for preliminary vetting of a GPU undervolt curve (and even then it sucks because the real stress test for my 2060 Super undervolt is days/weeks of COD MW19). It's from 2009, for pete's sake, and we're artificially running it at 1600x900 low. I don't even use it to verify my Vega 7 iGPU OCs on my 4650G because of how much better of a benchmark Valley is, being less CPU-reliant than Heaven.

Trying to expect game-level FPS consistency and smoothness out of something like Heaven - and drawing immediate connections to actual game performance based on observations in Heaven - is a bit ambitious, don't you think?

----

So yeah, no dice. To put it politely, I feel like you might be making a mountain out of a molehill. If you aren't actually suffering any major adverse impact on your in-game experience, why bother trying to get through to Nvidia based on something you think you see in the Heaven benchmark?

Like, I'm sure you bought that 3090 to game, not to run Heaven 2009, right?

Whichever way you decide to go with this, best of luck. I know it's not always easy to convey a less-than-obvious observation or issue to brain-dead customer service reps at any company, be it Nvidia or AMD, Ford or GM.
I appreciate your running the tests! Thanks!! Unfortunately I can't really see anything in those screenshots. They look like they cover an entire run of the benchmark, which is some minutes long, and we're looking for an event that lasts milliseconds... it's just lost at this resolution, I'm afraid. That being said, it is possible to infer a lot from them, because constant spikes over a long period have the effect of raising the averages, and I don't see anything like that here.

Yes, heaven is a crappy tool for lots of things, but not for this. If you have issues with heaven's performance, try this: When you run it, click camera and then free or walk. It'll sit still and give you a solid frametime to use for testing. Remember, this is not a graphics test. We're just putting a load on the system in a convenient way to make a general performance problem stand out. We happen to be doing it with graphics tools, so we can see the result graphically, but it's not a graphics test.

Even if you don't know how to analyse kernel tracing as pictured above, I'm sure you can see that the computer is doing stuff, then there's a big hole where it's doing nothing, and then it goes back to doing stuff. This is why it's not a molehill. That's not a hole in the graphics processing, that's a hole in *all* processing. You can't fake that, and you can't get it from a problem with your PC. That's code-based. It's waiting for something because it is programmed to wait.

This is why a teensy stutter that people are keen to fob off as no biggie, is actually a biggie. The teensy stutter isn't the problem. The big hole is. It's just that if we make that big hole occur during graphics processing, we can see a spike. That spike, that stutter, is not the problem. It's just a visualisation of it.

I bought the 3090 to do all kinds of things, gaming being one, but that branches out to things like simulation and software development. This issue affects all of that... as well as preventing me from using the GPU as intended for gaming. Again, this isn't a gaming thing and it's not a graphics thing. This is a bug impacting performance, that's all. It's not at all unreasonable to expect it to behave properly... And even if it's entirely fiction, it's not unreasonable to have a competent engineer provide service to my new card... and let us remember THAT is the purpose of this thread. Not even to get it fixed. I'm just trying to get a competent staffer at nvidia to look at it, that's all. I think that's a fair ask. I did pay for it. I'm legit surprised the community doesn't agree...
 
tbh. i would contact NVidia for that.
TPU has a lot of members who turn edgy, insulting and very aggressive as soon as you don't treat their every (mostly useless) word as gospel.
 
Thanks to @tabascosauz actually being constructive unlike most of this thread, thank you sir, we have new info. HWiNFO64 v7.x is not affected in the same way as 6.x:
[Screenshot: graph comparing behaviour with HWiNFO 6.x vs 7.x]


Dirty image whipped up in 2 seconds, if you know how to read this you see it already, if you don't, don't worry about it. Point is, now we know one reason why it's not working for some people. Thanks tabasco!


tbh. i would contact NVidia for that.
I did though. They refuse to investigate. I posted one of the agent's replies in this thread where he said so directly. I'm here because I'm trying to garner community support for an issue affecting the community, because nvidia's refusal to provide service to me amounts to a refusal to provide service to everyone.

I honestly thought if the community found out nvidia had shipped a bug that's costing them time and money and frames, they'd give a pretty big darn about it and there'd be a big fuss. I was so sure of this I went to some lengths to keep this issue quiet as I didn't want to create problems for nvidia. When they initially refused to respond, I warned them that I'd take it to reddit and they'd probably get 10000 cases and maybe it'd be easier to just deal with me. As it turns out, reddit admins are useless (no surprise there) and most deleted the thread immediately. The few threads that survived were downvoted by non-technical types who don't understand this matter's significance and its relevance to them.

Trouble is, because this is suuuper technical, nobody gets it. So they all argue it doesn't even exist, and nvidia get away with screwing those people over. I didn't see that coming. I honestly figured there were more technically minded people in this community. We can see a few in this thread, like those who immediately recognised this as a race condition, most likely while waiting for I2C bus negotiation, or immediately noticed the great similarity between this and a known issue presently attributed to VR (but probably not VR-specific at all; this bug probably IS that bug). But as nice as it is to see some people 'get' this, we don't need a few people, we need enough of a community reaction to actually force nvidia to pull my case out of the queue and do their damned job.
 
Would a 9900K paired with a 3080 be strong enough to reproduce the stutters? I'd like to test it myself, even if I don't game at those really high framerates.
 
just because you can only see it in a forced scenario in a measurement tool, doesn't mean it's not significant.
I don't get it. In what way is an issue significant if it needs an extremely CPU-limited environment with fast-polled monitoring to manifest, and doesn't affect the gameplay experience whatsoever?

Thanks to @tabascosauz actually being constructive unlike most of this thread, thank you sir, we have new info. HWiNFO64 v7.x is not affected in the same way as 6.x: [Screenshot: graph comparing behaviour with HWiNFO 6.x vs 7.x]

Dirty image whipped up in 2 seconds, if you know how to read this you see it already, if you don't, don't worry about it. Point is, now we know one reason why it's not working for some people. Thanks tabasco!



I did though. They refuse to investigate. I posted one of the agent's replies in this thread where he said so directly. I'm here because I'm trying to garner community support for an issue affecting the community, because nvidia's refusal to provide service to me amounts to a refusal to provide service to everyone.

I honestly thought if the community found out nvidia had shipped a bug that's costing them time and money and frames, they'd give a pretty big darn about it and there'd be a big fuss. I was so sure of this I went to some lengths to keep this issue quiet as I didn't want to create problems for nvidia. When they initially refused to respond, I warned them that I'd take it to reddit and they'd probably get 10000 cases and maybe it'd be easier to just deal with me. As it turns out, reddit admins are useless (no surprise there) and most deleted the thread immediately. The few threads that survived were downvoted by non-technical types who don't understand this matter's significance and its relevance to them.

Trouble is, because this is suuuper technical, nobody gets it. So they all argue it doesn't even exist, and nvidia get away with screwing those people over. I didn't see that coming. I honestly figured there were more technically minded people in this community. We can see a few in this thread, like those who immediately recognised this as a race condition, most likely while waiting for I2C bus negotiation, or immediately noticed the great similarity between this and a known issue presently attributed to VR (but probably not VR-specific at all; this bug probably IS that bug). But as nice as it is to see some people 'get' this, we don't need a few people, we need enough of a community reaction to actually force nvidia to pull my case out of the queue and do their damned job.
So it is an issue with HWinfo v6, and not the nvidia driver?
 
I don't get it. In what way is an issue significant if it needs an extremely CPU-limited environment with fast-polled monitoring to manifest, and doesn't affect the gameplay experience whatsoever?


So it is an issue with HWinfo v6, and not the nvidia driver?
The key to getting this is to understand that you're not providing this strange, specific environment in order to manifest the bug. You only need to use the power monitoring to manifest it... But you probably won't be able to SEE it, because the system is intentionally designed to mask any inconsistencies so that they aren't visible on screen. That's why we need that strange environment: to make it possible to see it. It's happening anyway, though.

Think of it like this: it's like a flat tyre on your car. If you are driving on the beach and have your windows up, you might not even know. But if you look out the window and drive on a road, it will be very obvious. The flat tyre is there either way, though. I'm too tired for a good analogy now but I think that gets the idea across :)

The issue still occurs when the nvidia api is used; it's just that some apps use it in different ways. It would appear that one particular way of using it has a problem, which is why different apps behave in different ways. In this case, we can even measure a difference between two versions of the same app. I don't entirely rule out contacting the individual application developers, but if something is broken, it's best practice to deal with it from the top down.
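
In the meantime, if someone wants to poke at which query is the slow one, a rough way (sketch only, using the pynvml bindings; this is not a claim about which calls HWiNFO or Afterburner actually make) is to time each NVML query individually and watch for the occasional outlier:

Code:
# Time a handful of NVML queries individually and record the worst case for each.
import time
import pynvml

pynvml.nvmlInit()
h = pynvml.nvmlDeviceGetHandleByIndex(0)

queries = {
    "power": lambda: pynvml.nvmlDeviceGetPowerUsage(h),
    "temp": lambda: pynvml.nvmlDeviceGetTemperature(h, pynvml.NVML_TEMPERATURE_GPU),
    "clocks": lambda: pynvml.nvmlDeviceGetClockInfo(h, pynvml.NVML_CLOCK_GRAPHICS),
    "util": lambda: pynvml.nvmlDeviceGetUtilizationRates(h),
}

worst = {name: 0.0 for name in queries}
for _ in range(600):                      # roughly a minute at 100 ms
    for name, query in queries.items():
        start = time.perf_counter()
        query()
        worst[name] = max(worst[name], time.perf_counter() - start)
    time.sleep(0.1)

pynvml.nvmlShutdown()
for name, seconds in worst.items():
    print(f"{name:7s} worst call: {seconds * 1000:.2f} ms")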



Would a 9900K paired with a 3080 be strong enough to reproduce the stutters? I'd like to test it myself, even if I don't game at those really high framerates.
For sure. I've done it on far lesser hardware than that - although it takes a little 'massage' to find that place where the framerate is high enough to expose it and not so high as to entirely overpower the CPU - but with your system I expect it'll basically do it the same way as mine.
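
As a rough illustration of why higher framerates expose it (made-up numbers, not measurements): the same fixed-length stall is a far bigger multiple of the frametime at 400 FPS than at 60 FPS, which is why it stays 'hidden' on slower rigs.

Code:
# Hypothetical 5 ms stall injected into a single frame at different base framerates.
STALL_MS = 5.0
for fps in (60, 144, 240, 400):
    frame_ms = 1000.0 / fps
    spiked_ms = frame_ms + STALL_MS
    print(f"{fps:3d} FPS: normal frame {frame_ms:5.2f} ms -> spiked frame "
          f"{spiked_ms:5.2f} ms ({spiked_ms / frame_ms:.1f}x)")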
 
@xcasxcursex at this point I won't pretend to understand the technical aspects of how this all works :laugh:

A few observations perhaps of some or no relevance:
  • Regrettably the 2060 Super is the most capable card I have - the FPS in Heaven is obviously there but I have no idea if an Ampere or higher end card would behave differently. It's sufficient for my needs and that's all I can ask for in this shortage.
  • I moved off pre-v7.0 versions of HWInfo because Core Clock and Effective Clock reporting and C6 Residency are completely broken for systems running AGESA 1200 or later, as long as Snapshot Polling is on.
  • I use Snapshot Polling in HWInfo because the Core Clock numbers are a little more sane, and usually much more in line with Effective Clock unlike the otherwise pie-in-the-sky boost numbers AMD wants us to believe. The stated aim is to reduce the "observer effect", I'm not sure if it has any influence on the results. I just know that Snapshot Polling doesn't mitigate or eliminate HWInfo's slight performance impact on long benchmarks (AIDA and Cinebench ST).
  • There's also the whole Nvidia software scheduling shebang, would be interesting if someone could test AMD cards, albeit a 5900X is pretty high end to be holding back any of these GPUs.
I'll admit, I don't have too much of a horse in this race as my games run pretty smooth, without inexplicable stutters. Nothing like that which is seen in Heaven. HWInfo is always running and open on my second monitor.

However, I can easily introduce some occasional minor stutters to the mix, by doing things like watching Youtube, or Discord voice chat, or lots of chrome tabs on my 2nd monitor. Not nearly severe or frequent enough to affect gameplay, but definitely noticeable for me, I tend to nitpick on these things. Otherwise, games like MW are buttery smooth on DLSS 117fps capped or uncapped, with HWInfo.
 
First time I joined I just had my account deleted such was the level here; but is it me, or have things got better recently?
Don't derail an already 'intense' thread with gossip. Comment on OP's issue or move along.
 
The key to getting this is to understand that you're not providing this strange, specific environment in order to manifest the bug. You only need to use the power monitoring to manifest it... But you probably won't be able to SEE it, because the system is intentionally designed to mask any inconsistencies so that they aren't visible on screen. That's why we need that strange environment: to make it possible to see it. It's happening anyway, though.

Think of it like this: it's like a flat tyre on your car. If you are driving on the beach and have your windows up, you might not even know. But if you look out the window and drive on a road, it will be very obvious. The flat tyre is there either way, though. I'm too tired for a good analogy now but I think that gets the idea across :)

The issue still occurs when the nvidia api is used; it's just that some apps use it in different ways. It would appear that one particular way of using it has a problem, which is why different apps behave in different ways. In this case, we can even measure a difference between two versions of the same app. I don't entirely rule out contacting the individual application developers, but if something is broken, it's best practice to deal with it from the top down.
I think we'll have to agree to disagree. If I can't see any inconsistency on screen, i.e. my game looks smooth to my eyes, then there is no issue to report.

With the car analogy, it sounds more like turning the A/C on. You might be able to convince yourself that there is a slight performance loss, but if you're focused on driving, you won't notice a thing.

To say something practical: I have a small secondary screen where I run Windows task manager and GPU-Z to constantly monitor my PC. It doesn't have any visually detectable influence on framerates or on frametime consistency. Same if I open HWinfo 7.04. Though if you look at my specs, you'll see that it's quite difficult to put my system into a CPU limited situation - not that you should intentionally aim for a CPU bottleneck on any PC anyway. ;)
 
Have you tried any other tools? This is regular increases in frametime, correct? CapFrameX would seem to be the appropriate tool for that, with far fewer extra variables compared to Afterburner.

Tried Heaven running at about 430 FPS over 20 seconds of monitoring: without any hardware monitoring software running, with HWiNFO, and with LibreHardwareMonitor. Monitoring software running did cause more frametime spikes, but not at the regular 1s interval, and the same spikes were there even without monitoring the GPU (simple to do in LibreHardwareMonitor). I would still suspect this has much more to do with simple CPU load.
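
For what it's worth, one way to separate "regular interval" from "plain CPU load noise" is to check whether the spike timestamps actually line up with the polling period. A quick Python sketch (arbitrary thresholds, spike times in seconds taken from whatever log you exported; purely illustrative):

Code:
# Score how many consecutive spike-to-spike gaps fall near a multiple of the
# suspected polling period. Near 1.0 = spikes track the poll interval;
# much lower = they look random.
def periodicity_score(spike_times, period_s=1.0, tolerance_s=0.05):
    gaps = [b - a for a, b in zip(spike_times, spike_times[1:])]
    if not gaps:
        return 0.0
    near_period = sum(
        1 for g in gaps
        if abs(g - period_s * round(g / period_s)) <= tolerance_s
    )
    return near_period / len(gaps)

# Spikes almost exactly 1 s apart score ~1.0; scattered spikes score much lower.
print(periodicity_score([2.01, 3.00, 4.02, 5.01, 6.99]))
print(periodicity_score([2.01, 2.47, 3.90, 5.33, 6.12]))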
 
Please don't use Heaven to test this thing.
Heaven has strange stutters at many places in the benchmark, always at the exact same spots each loop, and those exact same stutters happen on both of my AMD cards as well as my GTX 1070.
Exactly in the very same locations.
And they happen with or without monitoring software open.
The only thing that reduces the stutters is lowering the framerate (at 60 FPS you don't see anything except maybe that one huge hitch on the circling flyby).
 
Hmmmn. He might be on to something. This might be related to the year-long VR bug that happens when you have monitoring software up.

I just have no time to test it though.

Have you tried CSGO or Valorant? I get around 1000fps at 1080p in Valorant tutorial.
 