Thursday, November 28th 2019

AMD Radeon "Navi" OpenCL Bug Makes it Unfit for SETI@Home

A bug in the OpenCL compute API ICD (installable client driver) for the Radeon RX 5700-series "Navi" GPUs is causing them to crunch incorrect results for the distributed computing project SETI@Home. Since there are "many" Navi GPUs crunching the project and cross-validating each other's incorrect results, the large volume of incorrect results is able to beat the platform's algorithm and pass statistical validation, "polluting" the SETI@Home database. Some volunteers on the SETI@Home forums, where the issue is being discussed, advocate banning or limiting results from contributors using these GPUs until AMD comes out with a fix for its OpenCL driver. SETI@Home is a distributed computing project run by SETI (Search for Extraterrestrial Intelligence) that taps into volunteers' compute power to make sense of radio waves from space.
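To make the failure mode concrete, here is a minimal sketch of quorum-style validation. It assumes a simplified rule in which a work unit is accepted as soon as two returned results agree within a tolerance. The function name `validate_quorum`, the host labels, the tolerance, and the sample values are all illustrative assumptions, not SETI@Home's actual validator.

```python
from itertools import combinations

def validate_quorum(results, tol=1e-6, quorum=2):
    """Return the accepted value if `quorum` results agree within tol, else None.

    results: list of (host_id, value) pairs. Hypothetical simplification.
    """
    for group in combinations(results, quorum):
        values = [value for _, value in group]
        if max(values) - min(values) <= tol:
            return values[0]
    return None

# One correct host paired with one faulty host: no agreement, nothing is accepted.
mixed = [("other-gpu-host", 3.14159), ("navi-host-a", 2.71828)]

# Two faulty hosts that share the same bug: they agree, so the wrong value passes.
both_faulty = [("navi-host-a", 2.71828), ("navi-host-b", 2.71828)]

print(validate_quorum(mixed))        # None
print(validate_quorum(both_faulty))  # 2.71828
```

Under this simplified rule, agreement is the only thing checked, so two hosts that are wrong in the same way are indistinguishable from two hosts that are right.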
Sources: SETI@Home Forums, AMD Community, TH1813254617 (Reddit)

32 Comments on AMD Radeon "Navi" OpenCL Bug Makes it Unfit for SETI@Home

#1
lynx29
Thank goodness for the Scientific Method. :roll:
Posted on Reply
#2
notb
LOL
SETI@Home is fun and all, but this is a general problem in OpenCL. There's a suggestion that Navi has a bad FFT implementation.
So as of this moment Navi cards are unfit for almost all computing production systems... and rather pointless for development (even students).

And this shows up basically a week after W5700 launch.

Fun stuff.
Posted on Reply
#5
ToxicTears
Fluffmeister
Navi gonna find aliens this way.
yeah, and that's the mistake no one wants :eek:
Posted on Reply
#6
_Flare
That's a feature to stop non-gaming misuse.
Posted on Reply
#7
Parn
notb
LOL
SETI@Home is fun and all, but this is a general problem in OpenCL. There's a suggestion that Navi has a bad FFT implementation.
So as of this moment Navi cards are unfit for almost all computing production systems... and rather pointless for development (even students).

And this shows up basically a week after W5700 launch.

Fun stuff.
Really?

Fourier Transform is one of the fundamentals for compute work. If AMD indeed screwed up its implementation at the hardware level, they would need a recall.
Posted on Reply
#8
laszlo
is a special bug inserted by aliens so we can't find them :roll:
Posted on Reply
#9
looniam
as usual amd is late to the party anyhow.

2080TIs already found space invaders.
Posted on Reply
#10
Vya Domus
Since there are "many" Navi GPUs crunching the project and cross-validating each other's incorrect results, the large volume of incorrect results is able to beat the platform's algorithm and pass statistical validation, "polluting" the SETI@Home database.
What ? Why ? That is by far the shittiest validation method I have ever heard of.
Posted on Reply
#11
notb
Vya Domus
What ? Why ? That is by far the shittiest validation method I have ever heard of.
?
That's how science works. If most people on Earth do an experiment incorrectly, the bad result becomes statistically relevant (as in: not an obvious outlier).
There's no way to test this other than to perform a different experiment on the same phenomenon.

In fact, that's why we're able to notice these issues in computational science.
There are different libraries that do equivalent math. And there are different CPUs and GPUs that we can compare.

If Navi was doing some computation incorrectly, but no other hardware was used, there would be no way to test for this error.
Posted on Reply
#12
PanicLake
Vya Domus
What ? Why ? That is by far the shittiest validation method I have ever heard of.
Yep, it is like saying Trump is a "nice person" because many people voted for him.
Posted on Reply
#13
DeathtoGnomes
notb
?
That's how science works. If most people on Earth do an experiment incorrectly, the bad result becomes statistically relevant (as in: not an obvious outlier).
There's no way to test this other than to perform a different experiment on the same phenomenon.

In fact, that's why we're able to notice these issues in computational science.
There are different libraries that do equivalent math. And there are different CPUs and GPUs that we can compare.

If Navi was doing some computation incorrectly, but no other hardware was used, there would be no way to test for this error.
hold the phone, something i can agree with you on? nahhh :p

the only way to test correctly is to use other hardware; iirc they try not to send the validation to a similar system. They won't just discard the data, they'll save it to send it out again. I do agree they should suspend the 5700s for the time being.
Posted on Reply
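The idea of not validating a result against a similar system can be sketched as follows. This is a minimal illustration, assuming each result is tagged with a device family and that agreement only counts when the two results come from different families. The function `validate_across_families` and the family labels are hypothetical, not part of any real BOINC or SETI@Home interface.

```python
def validate_across_families(results, tol=1e-6):
    """Accept a value only if two results from different device families agree.

    results: list of (device_family, value) pairs. Hypothetical simplification.
    """
    for i, (family_a, value_a) in enumerate(results):
        for family_b, value_b in results[i + 1:]:
            if family_a != family_b and abs(value_a - value_b) <= tol:
                return value_a
    return None

# Two results from the same family agree, but that agreement is not trusted.
same_family = [("navi", 2.7), ("navi", 2.7)]

# Once results from two other families agree, their value is accepted instead.
mixed_families = [("navi", 2.7), ("navi", 2.7), ("cpu", 3.1), ("vega", 3.1)]

print(validate_across_families(same_family))     # None
print(validate_across_families(mixed_families))  # 3.1
```

The trade-off in a scheme like this is throughput: work units from a rare device family may wait longer for a partner from a different family.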
#14
Vya Domus
PanicLake
Yep, it is like saying Trump is a "nice person" because many people voted for him.
The way it works is that you collect experimental and validation data within the same experiment. Afterwards, once you have a model, you test it against the validation data, not against the output of another model, which is pretty much what SETI is doing with these results.
Posted on Reply
#15
mstenholm
They don't work at F@H either....
Posted on Reply
#16
xkm1948
No surprise. OpenCL has been losing developer interest for a long time. Small community, few resources, buggy GPU drivers, etc.

This is the case for almost all “Open Standard” compute acceleration frameworks. Not a lot of researchers like to invest their money and human resources into such things due to fear of being ripped off by bigger fish since everything published will be fair game to use. It is a damn shame though. OpenCL would have been a great alternative to CUDA.
Posted on Reply
#17
Xuper
xkm1948
No surprise. OpenCL has been losing developer interest for a long time. Small community, few resources, buggy GPU drivers, etc.

This is the case for almost all “Open Standard” compute acceleration frameworks. Not a lot of researchers like to invest their money and human resources into such things due to fear of being ripped off by bigger fish since everything published will be fair game to use. It is a damn shame though. OpenCL would have been a great alternative to CUDA.
The issue is Navi, not other AMD cards. This problem has nothing to do with the OpenCL driver or anything else, only Navi. Hold your breath, man, and read all the comments!
I run a rx5700 and have noticed this issue. The task runs to completion and returns blatantly incorrect results. The only times when my rx5700 GPU gets a valid result is when it is validated against another AMD rx5700 series GPU (both get the wrong result). I've currently stopped my computer from accepting GPU work units (it took me way too long to realize something was wrong, sorry). I believe this is an issue with the Navi architecture and not necessarily solely with AMD's OpenCL driver, as I see older AMD GPUs still returning "correct" results.

Someone has to redo all the work units where the results came from Navi AMD GPUs (RX5700, RX 5700XT, RX 5500M, RX 5500), and ban all AMD Navi GPUs until a fix is found.

Interestingly, my RX5700 has not been causing issues with other projects, like Einstein@home, Milkyway@home, Collatz, etc. Something about Navi and OpenCL really does not like Seti@home.

If any of you need any testing or logs on an AMD RX5700, hit me up.

edit: Corrected OpenGl to OpenCl, thanks Keith Myers
Posted on Reply
#18
Cheeseball
Hopefully the Adrenalin Pro drivers for the new Radeon Pro W5700 aren't affected by this, because this would be bad for its launch.

They probably prioritized fixing the random crashes in the drivers first before concentrating on GPGPU stuff.
Posted on Reply
#19
notb
mstenholm
They don't work at F@H either....
Until this is solved, we can safely assume Navi doesn't work in most popular computation scenarios.
Of course this can be fixed in software. Let's hope there will not be any performance penalty, because what would that mean for all the Navi supercomputers ordered? :D
Posted on Reply
#20
prtskg
notb
Until this is solved, we can safely assume Navi doesn't work in most popular computation scenarios.
Of course this can be fixed in software. Let's hope there will not be any performance penalty, because what would that mean for all the Navi supercomputers ordered? :D
It's working fine with projects like Einstein@home, Milkyway@home, Collatz, etc. I know Seti@home isn't working fine. I'm not sure about F@H. And which supercomputers have ordered navi?
Vega is AMD's compute card atm. Arcturus is the upcoming compute card, which is more similar to Vega than Navi.
Posted on Reply
#21
Assimilator
LMAO ouch:

Keith Myers from SETI@home forums
The new Navi 5700 and 5700XT are useless for compute currently. The drivers are not ready for compute. All projects that rely on AMD OpenCL drivers are producing nothing but garbage results and invalids. The AMD developers and the Khronos group are aware of the problem but not a peep from either of them about what the real problem is or when to expect a fix. In the meantime

and:

Keith Myers from SETI@home forums
Phoronix did testing and reviews of the RX 5700XT and could not get the card and drivers to pass the OpenCL parts of their standardized test suite.
Posted on Reply
#22
Jism
Could be pretty much due to running maths on a consumer graphics card instead of a pro version. Many of the Vega chips were initially designed as a PRO card but failed certain quality guidelines.
Posted on Reply
#23
notb
prtskg
It's working fine with projects like Einstein@home, Milkyway@home, Collatz, etc. I know Seti@home isn't working fine. I'm not sure about F@H. And which supercomputers have ordered navi?
Vega is AMD's compute card atm. Arcturus is the upcoming compute card, which is more similar to Vega than Navi.
Computing is not about funky distributed projects.
This problem was noticed in one of them because gamers already started using Navi (card for scientists/engineers was just announced and isn't used yet).

A GPU doesn't have a "calculate Seti@home" instruction that is broken while a "calculate Einstein@home" instruction works.
It makes errors in some math operation that Einstein@home may not use. That's it.

As mentioned earlier: there's a possibility that FFT results are incorrect. FFT (Fast Fourier Transform) is a fundamental algorithm used for many problems. So the card is already almost useless for computing.
And another thing is about being reliable. It's obvious that AMD haven't properly tested this card, so there's really no reason to believe in other results. Everything will have to be tested by the clients... and there goes the "value".
Posted on Reply
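The kind of cross-check that exposes a bad FFT can be sketched in a few lines: compute the same transform two independent ways and compare. The sketch below uses a direct O(n²) DFT as the reference and a radix-2 FFT as the candidate, both in pure Python. In a real test the candidate would come from the device or library under suspicion. The function names, the test signal, and the 1e-9 tolerance are assumptions chosen for illustration.

```python
import cmath

def naive_dft(x):
    """Reference: direct O(n^2) evaluation of the discrete Fourier transform."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]

def fft(x):
    """Candidate: recursive radix-2 FFT. Length must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2])
    odd = fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        twiddled = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddled
        out[k + n // 2] = even[k] - twiddled
    return out

def max_abs_error(a, b):
    """Largest element-wise difference between two spectra."""
    return max(abs(p - q) for p, q in zip(a, b))

signal = [complex(i % 4, 0) for i in range(16)]
reference = naive_dft(signal)
candidate = fft(signal)
error = max_abs_error(reference, candidate)

# A correct implementation differs from the reference only by rounding noise.
print(error < 1e-9)
```

A check like this only catches errors in the code path it exercises, which is consistent with the point above that one project can fail while others appear fine.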
#24
Steevo
Unacceptable, these cards are sold with a feature set as advertised. Failing the standards set forth as advertised is false advertising, and consumers of all types should receive the product they pay for.
Posted on Reply