
Bug affecting all Nvidia GPUs - Nvidia won't respond - we need your help!

xcasxcursex

New Member
Joined
Jun 19, 2021
Messages
23 (0.02/day)
So, I've found a bug, I've reported it to nvidia, and it landed with a helpdesk noob who didn't understand it. It's now stuck in his queue because he's gotten butthurt and refuses to look at it. Yes, seriously.

We're going to need the community to force nvidia to pay attention to this intentionally 'lost' case. Sadly, the community doesn't seem to actually care... How about techpowerup? Little help?

Here's a link to an illustration of the bug in effect and steps to (visibly) reproduce it yourself:
*EDIT: There is one very important detail missing from this link: the card performing the attached tests is a 3090. This matters when combined with the suggested resolution for observing the issue, because the intention is to produce an extreme framerate (>250 FPS), so if you run this test with a different card, a lower resolution will be required.

You can create your own case with nvidia, or if you like you can tell them to look at mine. Same name as here, they can find it.

Thanks in advance for your help.
 
Sorry man, I’m not having any trouble and I definitely don’t see legions of people here complaining about mysterious Nvidia problems you don’t truly identify.

If you are too lazy to identify the problem in writing then I’m too lazy to decipher your image.
 
You'll get a better response if you post this in the Reddit/Nvidia forums.
Went to the nvidia forums after two weeks of no response from tech support; no response there either. Went to reddit a week later; posts deleted. A week later, I'm here.

Sorry man, I’m not having any trouble and I definitely don’t see legions of people here complaining about mysterious Nvidia problems you don’t truly identify.

If you are too lazy to identify the problem in writing then I’m too lazy to decipher your image.
Follow the link by clicking the image. The one I labelled "Here's a link to an illustration of the bug in effect and steps to (visibly) reproduce it yourself: "
 
Went to the nvidia forums after two weeks of no response from tech support; no response there either. Went to reddit a week later; posts deleted. A week later, I'm here.


Follow the link by clicking the image. The one I labelled "Here's a link to an illustration of the bug in effect and steps to (visibly) reproduce it yourself: "

I did, and I found no problems at all. On two cards, the 3070 and the 1050ti in my laptop.
Though as it mentions, it only occurs with some monitoring software, so the question is whether the issue is down to their implementation being less than optimal, rather than a horrifying bug.
Either way, no problems here.

The picture itself is pretty much worthless either way without better resolution on the scale; there are always variances in framerate and frametimes. But I found no variances occurring regularly, as they would in that case.
 
The picture itself is pretty much worthless either way without better resolution on the scale; there are always variances in framerate and frametimes. But I found no variances occurring regularly, as they would in that case.
And polling the GPU will cause data to be sent over the PCI-E bus, which can cause a very minor frametime spike. Sometimes this can't be avoided, and it's usually so small it won't be noticeable (I know I've never noticed it).
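If you want to see how big that per-poll cost actually is on your own machine, you can time each monitoring call individually rather than averaging, so outliers stand out. A minimal sketch; `poll_gpu_sensor` here is a hypothetical stand-in, and you would swap in a real NVML/NVAPI query to measure for real:

```python
import time

def poll_gpu_sensor():
    """Stand-in for a real monitoring call (e.g. an NVML/NVAPI query).
    The 0.2 ms sleep simulates a driver round-trip; replace this body
    with the actual library call to measure the real thing."""
    time.sleep(0.0002)

def measure_poll_latency(n_samples=50):
    """Time each poll individually so spikes are visible as outliers,
    not smeared into an average."""
    latencies_ms = []
    for _ in range(n_samples):
        t0 = time.perf_counter()
        poll_gpu_sensor()
        latencies_ms.append((time.perf_counter() - t0) * 1000.0)
    return latencies_ms

lat = measure_poll_latency()
print(f"median ~{sorted(lat)[len(lat) // 2]:.3f} ms, worst {max(lat):.3f} ms")
```

A call that should cost microseconds but shows multi-millisecond outliers in this kind of trace is exactly the symptom being described in this thread.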
 
I'll call this close to flamebait, but maybe he really believes it. I wouldn't put much credence in this.
 
And polling the GPU will cause data to be sent over the PCI-E bus

Further, polling most things increases load in some way. Try spamming the shit out of, say, a thermistor on an I2C bus.

You'll measure 0, or the temperature of the sun.
 
Sorry man, I’m not having any trouble and I definitely don’t see legions of people here complaining about mysterious Nvidia problems you don’t truly identify.

If you are too lazy to identify the problem in writing then I’m too lazy to decipher your image.
Regarding this: It's not something you'll notice unless you're seriously digging deep to tune your performance, or running really strange loads in really unusual ways (see the process to reproduce the bug: who plays games at 900p on a 3090? These kinds of strange conditions are what's required to make it visible to the naked eye). Default settings such as pre-render queue depths ensure this bug is hidden from view, but it is still impacting your performance. Instead of stuttering in a visual way you'd see on screen or in a graph, it stutters elsewhere in your system: keyboard inputs, network traffic, something fun like that.

This is why, even though it's affecting literally every card that's been tested, I'm the only one (as far as I know) who's noticed it. It's not obvious, to put it lightly; at least, not under normal conditions. I personally noticed it because I was messing with some frame synchronisation that required millisecond-accurate, extremely low frametimes with a single-frame pre-render. I've given instructions that will reproduce it reliably in a way that's easy to see on a frametime plot.


I did, and I found no problems at all. On two cards, the 3070 and the 1050ti in my laptop.
Though as it mentions, it only occurs with some monitoring software, so the question is whether the issue is down to their implementation being less than optimal, rather than a horrifying bug.
Either way, no problems here.

The picture itself is pretty much worthless either way without better resolution on the scale; there are always variances in framerate and frametimes. But I found no variances occurring regularly, as they would in that case.
You're the first of 20 PCs not to see any issue, but the rest of your post makes me wonder whether your test platform is valid. You say "there are always variances in frametimes", but take a look at my graph on the right. As explained in the text there, I used a frametime limiter to accentuate this; maybe you will want to as well, but it isn't needed to observe the fault (you'll need a sharper eye, though). And of course this demonstration assumes you can maintain stable frametimes in the first place; obviously we can't test a frametime-related issue otherwise.

The graph I've shown is more than enough to illustrate the issue even with that resolution - because the issue is so blatantly apparent. I can grab you higher res images if you like though.

The question regarding the monitoring apps is valid. I can see in traces that a specific Nvidia API call is the one taking an exceedingly long time, and because this is not unique to a specific app, I'm going upstream to the first common point. If there's a faulty API implementation then nvidia will want to issue an advisory to developers as such.
And polling the GPU will cause data to be sent over the PCI-E bus, which can cause a very minor frametime spike. Sometimes this can't be avoided and it's usually so small it won't be noticeable(I know I've never noticed it).
Which this isn't, as traces will show. The Nvidia techs will get all that, just as soon as they actually look at this.

I'll call this close to flamebait, but maybe he really believes it. I wouldn't put much credence in this.
Test it as described and you will believe it too. You really think I've spent the past month having people call me a liar because they wouldn't even look, for my benefit? The only person getting flamed over this, is me.

Edit: Your signature applies here.
further polling most things increases load in some way. try spamming the shit out of like a thermistor on an I2C bus.
I can slow polling to every 10 seconds and it will still spike. I can copy every frame down the PCI-E bus and back up again every time and not generate enough load to reach even 1/10th of this spike. This isn't excessive bus traffic or normal polling behaviour.
 
frametime.PNG


Hwinfo running and monitoring everything, horrible frametimes for sure.

Spike was a printscreen, which I later realized was only for Heaven, so I had to snip it.
Smaller variances are of no concern considering all the stuff I've got running in the background, but would you look at that: no regular issues at all.

Point is, the problem might exist, but I get the feeling you're overstating its severity.
 
Easy solution to an easy problem: just set a max FPS. That solves 99% of all frametime issues.
When the GPU pipeline is getting 100% hammered, any polling will cause slight stutter, even moving your mouse. Nvidia already knew about this; that's why they created the Reflex API, which basically limits the GPU pipeline to 98% load, leaving the last 2% for mouse-input latency reduction or hardware polling.
Another solution is using "Prefer Maximum Performance" for Power Management Mode in NVCP, which keeps GPU clocks high so the pipeline stays free.
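Setting a max FPS amounts to pacing the render loop against a fixed per-frame time budget. A minimal sketch of such a limiter, where the `render_frame` callback is just a placeholder for real per-frame work:

```python
import time

def run_capped(render_frame, target_fps, n_frames):
    """Run n_frames iterations, sleeping out the remainder of each
    frame's time budget so the loop never exceeds target_fps."""
    budget = 1.0 / target_fps
    start = time.perf_counter()
    deadline = start + budget
    for _ in range(n_frames):
        render_frame()
        remaining = deadline - time.perf_counter()
        if remaining > 0:
            time.sleep(remaining)  # give back the unused part of the budget
        deadline += budget
    return time.perf_counter() - start

elapsed = run_capped(lambda: None, target_fps=200, n_frames=20)
print(f"20 frames at a 200 FPS cap took {elapsed * 1000:.0f} ms")
```

Real limiters (in-driver or in-game) work on the same principle but with much finer-grained waits than `time.sleep` provides; this is only the scheduling idea.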
 
View attachment 204532

Hwinfo running and monitoring everything, horrible frametimes for sure.

Spike was a printscreen, which I later realized was only for Heaven, so I had to snip it.
Smaller variances are of no concern considering all the stuff I've got running in the background, but would you look at that: no regular issues at all.

Point is, the problem might exist, but I get the feeling you're overstating its severity.
If you want to reproduce the fault in a manner you can easily view in a frametime graph, please follow my instructions. If you follow some other process, as you have, I can't guarantee that it will work.

Easy solution to an easy problem: just set a max FPS. That solves 99% of all frametime issues.
When the GPU pipeline is getting 100% hammered, any polling will cause slight stutter, even moving your mouse. Nvidia already knew about this; that's why they created the Reflex API, which basically limits the GPU pipeline to 98% load, leaving the last 2% for mouse-input latency reduction or hardware polling.
Another solution is using "Prefer Maximum Performance" for Power Management Mode in NVCP, which keeps GPU clocks high so the pipeline stays free.
The frametimes are just a symptom of the issue. The aim here is not to achieve stable frametimes, it is to fix the driver. I don't have any desire to sweep this under the rug.

The GPU pipeline is not getting 100% hammered. In my process you will find it at 17%. Polling the GPU does not cause stutter in other scenarios. This isn't a utilisation issue.
 
If you want to reproduce the fault in a manner you can easily view in a frametime graph, please follow my instructions. If you follow some other process, as you have, I can't guarantee that it will work.
I did follow your process, but at this point I'm starting to think it won't matter what anyone does, because you will refuse to believe any of it.
 
Guys, please. I've been trying to fix YOUR GPU for the past month. Do me a favour: if you don't want to perform my test, don't. If you don't want to contact nvidia, don't. But please, pretty please, I am SO tired of explaining the same things over and over. I've been through all of this on several forums now and it's the same every time... If you're not going to do my test as instructed, and if you're not a developer who would understand it anyway, and if you're not willing to call nvidia regardless... please, just step away. That's all I ask of you. Thanks.

Edit: Don't get me wrong, I'm down to spend all day explaining it to people who want to understand, I just have zero inclination toward arguments.
I did follow your process, but at this point I'm starting to think it won't matter what anyone does, because you will refuse to believe any of it.
No, you didn't. Your screenshot proves it.
 
GPU-Z not affected?
 
No, you didn't. Your screenshot proves it.
Yes, I did.
First you complain I didn't limit it, even though your post says that isn't needed, just makes it easier. Even then I had no issues.
So I did it again, this time with a limiter as you suggested, and that's what I got, and that's apparently still not correct:
a nice smooth, flat frametime graph, even when running HWiNFO, which according to you should show frametime spikes.

As you say in the post yourself
Here, Heaven is configured to run lowest settings and 1600x900 resolution, to ensure a very high frame rate/low frame times, which will ensure that the fault is easily visible in the graph. The fault continues at any resolution or framerate. In this case, I have applied a frametime limiter, in order to get a flat line which will accentuate the frametime spike from the glitch. You do not need to use a frametime limiter, but it will make the bug more apparent in the graph.

Which is exactly what I did, in this case with HWiNFO64 running on all sensors, which according to you should cause these regular spikes in frametimes.

Otherwise it's you who has not explained it properly, because I see no issues at all running HWiNFO64 while I run Heaven.

Else, please confirm that HWiNFO64 + Heaven + limiter should not produce spikes? Because your post says otherwise, claiming that the spikes will occur no matter the framerate or resolution.
 
No, you didn't. Your screenshot proves it.
F***, my apologies: there is a minor omission in that doc (I pasted the wrong draft like an idiot) that has a major impact. I mentioned the resolution but not that these tests were done on a 3090 (I mention it earlier in the thread, but not on that page). If your card is weaker than that, and since the process doesn't specify it like I thought it did, it's possible you followed it and still ended up outside the conditions. Mea culpa. I blew it.

I did, however, mention that you're going to need extremely high frame rates, and 120 ain't that. Try to double or triple it by reducing the resolution.
Otherwise it's you who has not explained it properly

It's exactly that, and I'm sorry. I've had SO many people fail to follow the process (usually followed by abusing me, which is great fun) that I tarred you with their brush. Jerk move. My bad.

Else, please confirm that hwinfo64 + Heaven + Limiter should not produce spikes? Because your post says otherwise claiming that no matter what framerate or what resolution the spikes will occur.
The delays will occur, but you may not see them. Since I've wasted a ton of your time, I owe you at least a proper explanation. I'll try to keep it plain-English-y.

What happens here is that the monitoring call, which should complete in microseconds, takes several milliseconds. This is CPU time, not GPU time. FWIW, an earlier experiment showed me that this is extremely memory-speed critical: taking my memory down from the usual 3800 with flat 16s to 2133 with stock timings made the issue drastically more noticeable. Delays in the memory pipeline appear as delays in the CPU pipeline at a higher level of monitoring (because it's the CPU that's waiting on the data from RAM). So, given that the frame is being prepared by the CPU in order to take its place at the end of a queue, behind two other frames which have to be processed and displayed before the delayed one, that delay is eaten up by the buffer and you don't see it. BUT IT STILL ATE YOUR CPU IN THERE. IT JUST HID THE EVIDENCE. < Caps because this is super important; otherwise it would be a non-issue, right?

So, we need to force a CPU-limited scenario in order to see the CPU's behaviour, and we want to avoid a pre-render queue hiding the mess, right? So how? Frames, all of the frames. By reaching a massive framerate we ensure the GPU is lightly loaded (or it wouldn't be able to produce those frames; we're not trying to induce a GPU-limited scenario here, so not too high!) and will attempt to render frames at full tilt, thus loading up the CPU by emptying the prerender queue quickly. The loaded CPU exacerbates the issue, and the short queue exposes it.

SOOOO you need 250+ FPS to see it; 300+ is recommended. The more, the better, so long as the system can realistically handle the load. This is why I went to the trouble of specifying 1600x900 in Heaven: at that res, with the 3090, you'll just be able to see it. So, aside from the fact that I'd typed that stuff a dozen times and didn't realise that this time I forgot to mention what card it was... now you understand why nobody sees it. It's buried and hidden by mechanisms that are supposed to do exactly that, to give us smooth frametimes and high framerates. And this is why I'm being a stickler about following the process (which I screwed up, and I apologise again): if one does not (which sadly appears to be almost everyone, almost all of the time), then you very easily end up outside the parameters where the bug is visible. For example, at 120 FPS your frametimes are too high to see it and the load too low to get the CPU mad; or you could run at too low a resolution, choke your system entirely, and see nothing.
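The "queue eats the spike" mechanism described above can be illustrated with a toy model. Nothing here is the actual driver; the numbers are invented and only the scheduling idea matters: a CPU that hiccups on one frame, presented through a queue of depth 1 versus depth 3.

```python
def worst_present_gap(costs_ms, queue_depth, target_ms):
    """Toy frame pipeline: the CPU 'produces' frames (one cost each),
    the display presents one every target_ms, and the producer may run
    at most queue_depth frames ahead. Returns the worst gap between
    consecutive presents: a visible stutter if it exceeds target_ms."""
    ready = 0.0
    presents = []
    for i, cost in enumerate(costs_ms):
        # Backpressure: frame i can't start until frame i-depth is shown.
        gate = presents[i - queue_depth] if i >= queue_depth else 0.0
        ready = max(ready, gate) + cost
        prev = presents[-1] if presents else 0.0
        presents.append(max(prev + target_ms, ready))
    return max(b - a for a, b in zip(presents, presents[1:]))

costs = [2.0] * 5 + [10.0] + [2.0] * 5  # one 10 ms hiccup among 2 ms frames
print(worst_present_gap(costs, queue_depth=1, target_ms=4.0))  # 10.0: visible
print(worst_present_gap(costs, queue_depth=3, target_ms=4.0))  # 4.0: hidden
```

With a deep queue the hiccup is fully absorbed and the present cadence stays flat, even though the CPU still paid the full 10 ms; that is exactly the "it still ate your CPU, it just hid the evidence" point.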

BTW, nvidia's got the right info, including the card type and much, much more than I've shared here, so that's not why they can't see it.
 
That makes a lot more sense and you should have started with that information!

My 8700k is, however, unable to push that framerate in Heaven, so I can't look into it at those framerates. What CPU did you use when you tested this? And what other hardware in general?
 
Adding to the above, because this has come up before: not EVERY load makes this happen. I don't know why; I'd like to ask nvidia. I can tell you a real-world load that does: Battlefield 1. That's the game that made me notice this. But trying some other 300 FPS load won't prove "your bug doesn't exist", because I chose Heaven knowing it's repeatable there; some other load may not be. Even using BF1 as an example: it causes the problem, but the frametimes are too unstable (other than in the menus, but there's a reason I didn't suggest that), so it's really not useful. Myself and a handful of friends have tested this across a bunch of systems and it always works. That's why I'm specifying Heaven: because it's repeatable. I did try other loads (mostly benchmarks, because this is for reproduction at the lab and they might not have <insert game here>) and Heaven was the best one.
That makes a lot more sense and you should have started with that information!

My 8700k is, however, unable to push that framerate in Heaven, so I can't look into it at those framerates. What CPU did you use when you tested this? And what other hardware in general?
I should have! I thought I did! I'm honestly so sorry, man. I typed it so many times, having my posts deleted and trying different places, that I just assumed it was in there like usual, and clearly it isn't. I blew it.

Yeah, again, it's a hard bug to reproduce for many reasons, and one is that the hardware requirements are rough. I did manage to get a 1070+5820k to do it, but that thing is tuned to the nines (it's my old gaming rig). A mate did it on a 2070 (sorry, I don't know what CPU it was; I want to say 9900k)... so other GPUs can do it. CPU is probably the tricky one, because it's a matter of getting it loaded but not too loaded (as described above). Your 8700k can probably hit some lower-than-250 framerate that will successfully expose the spikes, but there's a fair amount of work in finding that (not-so-)sweet spot. This one is a 5900x with chart-topping benchmarks, pushing 3800 16-16-16-34 RAM and the elusive 3090, and even then it took me months to pin it down.

It really is hard to spot. Honestly, that's part of the reason it's a concerning bug: it's the kind of thing that gets missed and stays in the drivers forever, making a tiny but entirely unnecessary dent in performance. You need such high-end hardware to see it that it's super easy for it to hide under the performance of the thing; or you don't have that hardware and you never see it... This bug is trying hard to last forever.

Awaiting approval before being displayed publicly.
Uhh what?

1624084553571.png


Since it's been suggested a higher resolution might be useful, here it is: 100ms sample rate, vertical scale set to 16.7ms aka 60 FPS, which means every horizontal grey line is 1.6ms. Heaven is set to free-roam mode, so the frametimes should be more stable than a usual Heaven run; however, I'm doing this with my browser, discord, etc. open, so there are a few spikes. The first section is just standing in Heaven. Then I alt-tab out and start HWiNFO64. Look at those spikes: that's dipping from 450+ FPS to 120. Then I exit HWiNFO. Nice and flat again (except for those two spikes; that's discord doing something and beeping at me. I don't think that requires a re-do; after all, it's pretty obviously not the same as the middle section).
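One way to back the "spike every polling interval" claim with numbers instead of eyeballing a graph is to scan the frametime log for outliers and check whether they land at a regular spacing. A small sketch over a synthetic trace (real data would come from a frametime log export; the numbers below are invented to mimic the scenario in this thread):

```python
def find_periodic_spikes(frametimes_ms, threshold_ms):
    """Return the indices of frames slower than threshold, plus the
    frame counts between consecutive spikes. Near-constant spacing
    points at a fixed-interval cause (like a 1 s sensor poll) rather
    than random background noise."""
    spikes = [i for i, ft in enumerate(frametimes_ms) if ft > threshold_ms]
    gaps = [b - a for a, b in zip(spikes, spikes[1:])]
    return spikes, gaps

# Synthetic trace: ~450 FPS baseline with an 8 ms stall every 450 frames,
# i.e. once per second -- matching a 1 s polling interval.
trace = [2.2] * 2000
for i in range(449, 2000, 450):
    trace[i] = 8.0

spikes, gaps = find_periodic_spikes(trace, threshold_ms=5.0)
print(spikes)  # [449, 899, 1349, 1799]
print(gaps)    # [450, 450, 450]
```

A constant gap like this is the signature the OP describes; random load spikes would produce irregular gaps instead.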

Gpuz not affected?
Awaiting approval before being displayed publicly.
Normally I'd assume it's because I'm new, that it was automated, and that I'd have to wait for an admin to see it, but since you've been by already and this was public already, I'm wondering if the thread was hidden manually?
 
What happens here is that the monitoring call which should be done in microseconds, takes several milliseconds. This is CPU time, not GPU time.
It's probably waiting for some kind of lock. Not uncommon, especially if the I2C bus is involved. Only NVIDIA can fix it; maybe they already have the fix ready and are just waiting for verification, or the right driver release window. Or they're too busy with higher-priority issues.
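The lock theory is easy to demonstrate in miniature: if a slow sensor transaction holds a shared driver-side lock, an otherwise microsecond-fast call on another thread stalls for the whole transaction. A toy sketch; the 5 ms hold time is invented, and nothing here touches a real driver:

```python
import threading
import time

driver_lock = threading.Lock()
holding = threading.Event()

def slow_sensor_read():
    # Monitoring thread: holds the shared lock for the whole sensor
    # transaction, the way a slow I2C read might inside a driver.
    with driver_lock:
        holding.set()
        time.sleep(0.005)  # 5 ms bus transaction (invented number)

def fast_driver_call():
    # Render thread's view: a call that should take microseconds,
    # but must acquire the same lock.
    t0 = time.perf_counter()
    with driver_lock:
        pass
    return (time.perf_counter() - t0) * 1000.0

mon = threading.Thread(target=slow_sensor_read)
mon.start()
holding.wait()                 # ensure the monitor owns the lock first
stall_ms = fast_driver_call()  # now blocks ~5 ms behind the sensor read
mon.join()
print(f"fast call stalled for ~{stall_ms:.1f} ms")
```

This also matches the observation elsewhere in the thread that slowing the poll rate doesn't shrink the spike: contention cost depends on how long the lock is held, not how often it is taken.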

I'm wondering if the thread was hidden manually?
You made some changes to your post which triggered the spam detection for new users, so the thread went to an "approval queue"
 
Hi,
Is this thread's title accurate (all GPUs affected), or is it just the 30 series?
The OP didn't list all the GPUs tested; only a 3090 is mentioned.
 
From the description and details: is this a GPU problem or an API/driver problem? Seems to be the latter, maybe. This is admittedly a CPU-limited scenario, the CPU is causing the bump, and RAM speed greatly affects things. Also, regular checks that cause CPU load for monitoring purposes are by themselves unavoidable, and a 100ms polling frequency is rather intensive as well.

Configuration of the monitoring software surely plays into this as well, if this is CPU load. Did you disable all other meters from monitoring, and does it still happen? Are you graphing on screen?
You mention that this does not happen with all monitoring software; you name Libre Hardware Monitor as one that does not. Is that really the case?
By the way, is the same thing replicable on some other OS, Linux for example?
 
Hi,
Is this thread's title accurate (all GPUs affected), or is it just the 30 series?
The OP didn't list all the GPUs tested; only a 3090 is mentioned.
It does it on all cards, but because it's difficult to observe, it's a lot easier to see on the faster cards. As above, the CPU actually ends up being important too, since the GPU can just have the resolution lowered to reach high framerates, but the CPU may not be able to keep up.

From the description and details: is this a GPU problem or an API/driver problem? Seems to be the latter, maybe. This is admittedly a CPU-limited scenario, the CPU is causing the bump, and RAM speed greatly affects things. Also, regular checks that cause CPU load for monitoring purposes are by themselves unavoidable, and a 100ms polling frequency is rather intensive as well.

Configuration of the monitoring software surely plays into this as well, if this is CPU load. Did you disable all other meters from monitoring, and does it still happen? Are you graphing on screen?
You mention that this does not happen with all monitoring software; you name Libre Hardware Monitor as one that does not. Is that really the case?
By the way, is the same thing replicable on some other OS, Linux for example?
Well analysed; yes, this is an NVAPI issue as best I can tell. It's tough because the apps which are affected are closed source, so there's a limit to what I can see. There's inevitably a point where my only answer is "I don't know, and I'd like to ask nvidia". You're right, the 100ms is extreme; I only do that to record the images for the purpose of proving this is a thing. Normally it's at the default 1000. Note that this is MSI Afterburner's poll rate, but the app causing the spikes is HWiNFO, and the poll rate on that is 1 second (as visible from the giant spike every 1 second in the graphs). I actually tested it at 2, 5 and 10 seconds to see if the spikes disappeared. They didn't change a bit, other than arriving every 2, 5, or 10 seconds instead of every 1. I still have plenty of CPU, GPU and memory bandwidth available... and even with other apps polling at far higher rates, there are no issues. It really doesn't suggest any kind of excessive load is to blame here. It does seem like there's some kind of scheduling/handling issue; as wizard said, it's probably waiting for a lock... Honestly, if I dug into the traces far enough I might even be able to get that specific, but that kind of work is way into "that's nvidia's job" territory ;)

It's probably waiting for some kind of lock. Not uncommon, especially if the I2C bus is involved. Only NVIDIA can fix it; maybe they already have the fix ready and are just waiting for verification, or the right driver release window. Or they're too busy with higher-priority issues.


You made some changes to your post which triggered the spam detection for new users, so the thread went to an "approval queue"
Thanks, man. I thought I'd been shadowbanned right from the drop; appreciate you explaining what not to do next time, haha :)

Sadly, the response from nvidia, after some weeks of explaining all that has been said above and much, much more, was the following:


So I tested an in-house PC that has a Win10 X64 + RTX 3080. Ran multiple games at 1080P and 1440P without any issues.

I was getting good FPS as well.

I found no reason to test any 3rd party benchmark tests since all games were running perfectly. We use benchmark tools if the PC or GPU has performance issues during normal usage or while gaming. So it indicates that yours is a singular case and there is a possibility that it's a hardware issue.

If you've read the above, you already understand why his test methodology was entirely inadequate and his conclusions entirely illogical. But the consequence of his inability to cope with this is that we're all trapped in helpdesk limbo. He would reply but never actually do anything related to the issue, just treating it like a normal stuttering complaint. Now they don't even respond for weeks.
 
I own the whole Ampere lineup except the new Ti cards, and I have zero problems.
 