Monday, October 10th 2022

AMD-Powered Frontier Supercomputer Faces Difficulties, Can't Operate a Day without Issues

Oct 10th, 2022 03:00

When AMD announced that it would deliver the world's fastest supercomputer, Frontier, the company also took on the massive task of providing a machine capable of one exaFLOP of sustained compute performance. While the system is finally up and running, making a machine of that size run properly is challenging. In the world of High-Performance Computing (HPC), acquiring the hardware is only one part of running an HPC center. In an interview with InsideHPC, Justin Whitt, program director for the Oak Ridge Leadership Computing Facility (OLCF), provided insight into what it is like to run the world's fastest supercomputer and what kinds of issues it is facing.

The Frontier system is powered by AMD EPYC 7A53 "Trento" 64-core 2.0 GHz CPUs and Instinct MI250X GPUs. Interconnecting everything is the HPE (Cray) Slingshot 64-port switch, which is responsible for moving data in and out of the compute blades. The recent interview points to a rather interesting finding: it is precisely the AMD Instinct MI250X GPUs and the Slingshot interconnect that cause hardware troubles for Frontier. "It's mostly issues of scale coupled with the breadth of applications, so the issues we're encountering mostly relate to running very, very large jobs using the entire system … and getting all the hardware to work in concert to do that," says Justin Whitt. In addition to the limits of scale, "The issues span lots of different categories, the GPUs are just one. A lot of challenges are focused around those, but that's not the majority of the challenges that we're seeing," he said. "It's a pretty good spread among common culprits of parts failures that have been a big part of it. I don't think that at this point that we have a lot of concern over the AMD products. We're dealing with a lot of the early-life kind of things we've seen with other machines that we've deployed, so it's nothing too out of the ordinary."

Many applications cannot run on hardware of that size without modification, so unique tuning is needed. With the hardware issues the AMD GPUs present, it is somewhat harder to get the system operational on time. However, the Oak Ridge team is confident in its expertise and has no trouble meeting deadlines. For more information, read the InsideHPC interview.

Source: InsideHPC

48 Comments on AMD-Powered Frontier Supercomputer Faces Difficulties, Can't Operate a Day without Issues

#26

Oberon

ThomasK: AMD is the one providing the solution to the customer, doesn't matter whose switch is being used, AMD is taking the blame.

Tell me you have no experience outside of consumer hardware without telling me you have no experience outside of consumer hardware.

#27

mechtech

Punkenjoy: I have built clusters, render farms and other types of supercomputers in one of my previous jobs.

Like they said, this is indeed expected. You have all kinds of issues: bad cables, bad memory, etc. If you have a 1% defect rate and you build a 1000-node system, that means 10 systems will have a defect.

After that the fun starts: trying to find the source of the problem, trying to isolate it. It takes time and effort, and the larger the cluster is, the harder it can be.

Render farms are most of the time easier since they just use the network and will crash by themselves. A cluster also has the interconnect that can fail. You run code on multiple nodes and it's not always clear where it fails. Sometimes one node will crash because it received corrupted data from another node. Sometimes it's the switch, the storage, etc. Way more parts to fail than a regular PC, and trying to pinpoint a failure can sometimes be really a pain in the ass and take days.

So to me, this article is more something to please the AMD bashing communities than anything else. I built both AMD and Intel systems and it was not really the CPU vendor that affected defect rates. Larger clusters required more time to settle.

Must be a PITA troubleshooting and correcting that lol

#28

Oberon

Punkenjoy: So to me, this article is more something to please the AMD bashing communities than anything else.

Lots of that going on here lately.

#29

R-T-B

Dirt Chip: Sucks to be an early adopter on a multi 100s million dollar product
:)

Sir that's always what you are on a multimillion dollar build. You think they poop these out daily?

Chomiq: Nothing burger:

Basically, yeah.

Oberon: Lots of that going on here lately.

Clickbait gets clicks, sadly.

#30

Has to do with dual sided dimms :D

#31

Vayra86

Punkenjoy: I have built clusters, render farms and other types of supercomputers in one of my previous jobs.

Like they said, this is indeed expected. You have all kinds of issues: bad cables, bad memory, etc. If you have a 1% defect rate and you build a 1000-node system, that means 10 systems will have a defect.

After that the fun starts: trying to find the source of the problem, trying to isolate it. It takes time and effort, and the larger the cluster is, the harder it can be.

Render farms are most of the time easier since they just use the network and will crash by themselves. A cluster also has the interconnect that can fail. You run code on multiple nodes and it's not always clear where it fails. Sometimes one node will crash because it received corrupted data from another node. Sometimes it's the switch, the storage, etc. Way more parts to fail than a regular PC, and trying to pinpoint a failure can sometimes be really a pain in the ass and take days.

So to me, this article is more something to please the AMD bashing communities than anything else. I built both AMD and Intel systems and it was not really the CPU vendor that affected defect rates. Larger clusters required more time to settle.

www.merriam-webster.com/dictionary/clickbait

You can use a white pitchfork for this one instead of a red one.

#32

From "Can't Operate a Day without Issues" to
"We're dealing with a lot of the early-life kind of things we've seen with other machines that we've deployed, so it's nothing too out of the ordinary."
It doesn't get any more clickbait than that.

#33

thesmokingman

Lmao, TPU added that trollish bit. That's really poor form man.

#34

thesmokingman: Lmao, TPU added that trollish bit. That's really poor form man.

News title: "Politician caught wearing women's clothes in public"
Inside: picture of Hillary Clinton.

#35

Space Lynx

Astronaut

Dirt Chip: Sucks to be an early adopter on a multi 100s million dollar product
:)

if you actually read the original article. it states these kind of obstacles are in the norm for something of this size.

this is just clickbait garbage. humans bore me. i guess i need to start drinking now

#36

N3utro

Did they try to turn it off and on again?

#37

PapaTaipei

All that compute power to spy EVERYTHING and EVERYONE, EVERYWHERE.

#38

Mussels

Freshwater Moderator

Crackong: 60 million parts...
Even a 0.001% chance of malfunction would mean 100% at this scale
There is always more than one component malfunctioning at any given time of operation.

This.

Exascale, Exaproblems.
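
To put rough numbers on the two estimates quoted in this thread (a 1% defect rate across 1000 nodes, and a 0.001% malfunction chance across 60 million parts), here is a minimal Python sketch. It assumes every unit fails independently and with the same probability, which is a simplification neither poster stated; the function names are illustrative only.

```python
import math


def expected_failures(units: int, p_fail: float) -> float:
    """Expected number of failing units, assuming identical failure odds."""
    return units * p_fail


def prob_at_least_one(units: int, p_fail: float) -> float:
    """Probability that at least one unit fails: 1 - (1 - p)^n.

    Assumes independent failures (a simplification). Uses log1p/expm1
    so that very small p and very large n do not lose precision.
    """
    if p_fail >= 1.0:
        return 1.0
    return -math.expm1(units * math.log1p(-p_fail))


if __name__ == "__main__":
    cases = [
        ("1000 nodes at 1% defect rate", 1_000, 0.01),
        ("60 million parts at 0.001% malfunction chance", 60_000_000, 0.00001),
    ]
    for label, units, p in cases:
        print(
            f"{label}: expected failures = {expected_failures(units, p):.1f}, "
            f"P(at least one) = {prob_at_least_one(units, p):.6f}"
        )
```

Under these assumptions, the expected count is what scales linearly with system size, while the chance of seeing at least one failure saturates toward certainty very quickly.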

CallandorWoT: if you actually read the original article. it states these kind of obstacles are in the norm for something of this size.

this is just clickbait garbage. humans bore me. i guess i need to start drinking now

Well yeah, but it's also how you spot the people who lack the ability to think and leap on answers that fit an existing worldview

#39

AlwaysHope

They cheaped out on the cables... suck it! :laugh:

#40

R-T-B

PapaTaipei: All that compute power to spy EVERYTHING and EVERYONE, EVERYWHERE.

Is it conspiracy hour already?

#41

Count von Schwalbe

R-T-B: Is it conspiracy hour already?

#popcorn

#42

mkppo

Terrible title..

The reality is that this is absolutely normal.

#43

Dirt Chip

CallandorWoT: if you actually read the original article. it states these kind of obstacles are in the norm for something of this size.

this is just clickbait garbage. humans bore me. i guess i need to start drinking now

You are right. It is garbage, and the pull is not helping and should be killed.
But this is my escapism, so please don't judge me for worse ;)

#44

bug

P4-630: Did we ever hear from an intel/nvidia supercomputer that had startup issues?.... Not that I know...

It's still not clear what kind of issues we are talking about here. If this is about the acceptance phase, yes, there will be issues galore, nothing to write home about. If it's post-acceptance issues, that could be a problem. We also don't know how much is hardware and how much is software related (it's possible this is what they're trying to figure out right now).

#45

HenrySomeone

Not surprising, you always get what you pay for...

#46

ThomasK

All of a sudden, lots of people here in the comment section seem to have a vast experience with supercomputers...

Gathered in another forum's comment section.

#47

thesmokingman

ThomasK: All of a sudden, lots of people here in the comment section seem to have a vast experience with supercomputers...

Gathered in another forum's comment section.

Nah, clickbaity titles like this are troll and shill magnets, as Mussels alluded to. You can tell which is which.

#48

Mussels

Freshwater Moderator

thesmokingman: Nah, clickbaity titles like this are troll and shill magnets, as Mussels alluded to. You can tell which is which.

It's pretty funny.

It does explain some people's purchasing and hardware/brand preferences, if they can't get past the headlines to actually read the content
