Wednesday, August 26th 2009

AMD Demos 48-core ''Magny-Cours'' System, Details Architecture

Earlier slated loosely for 2010, AMD has fine-tuned the expected release time-frame of its 12-core "Magny-Cours" Opteron processors to Q1 2010. The company seems to be ready with the processors, and has demonstrated a 4-socket, 48-core machine based on them. Magny-Cours carries some symbolism as one of the last processor designs AMD will ship before it moves over to "Bulldozer", its next processor design built from the ground up. Its release will provide competition to Intel's multi-core processors available at that point.

AMD's Pat Conway presented the Magny-Cours design at the IEEE Hot Chips 21 conference, detailing several key design changes that boost parallelism and efficiency in high-density computing environments. Key features include: a move to socket G34 (from Socket-F), 12 cores, a multi-chip module (MCM) package housing two 6-core dies (nodes), a quad-channel DDR3 memory interface, and HyperTransport 3 at 6.4 GT/s with redesigned multi-node topologies. Let's put some of these under the magnifying glass.
Socket and Package
Loading 12 cores onto a single package while maintaining sufficient system and memory bandwidth would have been a challenge. With the six-core monolithic Istanbul die already measuring 346 mm² and carrying 904 million transistors, something monolithic twice the size is inconceivable, at least on the existing 45 nm SOI process. The company finally broke the contemptuous stance on multi-chip modules it took back in the days of the Pentium D, and designed one of its own. Since each die is a little more than a CPU (it has its own dual-channel memory controller), AMD chooses to call it a "node": a cluster of six processing cores that connects to its neighbour on the same package using one of its four 16-bit HyperTransport links. The rest are available to connect to neighbouring sockets and the system in 2P and 4P multi-socket topologies.

The socket itself gets a revamp, from the existing 1,207-pin Socket-F to the 1,974-pin Socket G34. The high pin-count accommodates the HyperTransport links, four DDR3 memory channels, and other low-level I/O.

Multi-Socket Topologies
A Magny-Cours Opteron processor can work in 2P and 4P systems, for up to 48 physical processing cores. The multi-socket topologies AMD devised ensure high inter-core and inter-node bandwidth without depending on the system chipset's I/O for the task. In the 2P topology, one node from each socket uses one of its 16-bit HyperTransport links to connect to the system, another to the neighbouring node on the package, and the remaining links to connect to the nodes of the neighbouring socket. AMD indicates it will use 6.4 GT/s links (probably HyperTransport 3.1). In 4P systems, it uses 8-bit links instead to connect to the three other sockets, but ensures each node is connected to every other node, directly or indirectly over the MCM. With a total of 16 DDR3 DCTs in a 4P system, a staggering 170.4 GB/s of cumulative memory bandwidth is achieved.
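As a sanity check, the cumulative figure follows from per-channel DDR3 numbers. A minimal back-of-the-envelope sketch, assuming DDR3-1333 (1333 MT/s on a 64-bit channel); the quoted 170.4 GB/s reflects slightly different rounding:

```python
# Back-of-the-envelope check of the 4P cumulative memory bandwidth figure.
# Assumption: DDR3-1333, i.e. 1333 MT/s on a 64-bit (8-byte) channel.
transfers_per_sec = 1333e6        # 1333 mega-transfers per second
bytes_per_transfer = 8            # 64-bit channel width
per_channel_gbs = transfers_per_sec * bytes_per_transfer / 1e9

channels = 16                     # 16 DCTs across four G34 sockets
total_gbs = per_channel_gbs * channels

print(round(per_channel_gbs, 2))  # ~10.66 GB/s per channel
print(round(total_gbs, 1))        # ~170.6 GB/s cumulative
```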

Finally, AMD projects up to 100% scaling with Magny-Cours compared to Istanbul. Its "future silicon" slated for 2011 is projected to nearly double that again.

Source: INPAI
Add your own comment

104 Comments on AMD Demos 48-core ''Magny-Cours'' System, Details Architecture

#1
jmcslob
Mussels said:
compress in H264, decompress with hardware acceleration.

you seem to think its hard, but its a feature built into the latest windows 7 - it can recode HD video and stream it to other PC's (or consoles/extenders) on the fly. you're seeing it as starting from nothing, i'm seeing it as an application for an existing tech.
:toast:
Posted on Reply
#2
FordGT90Concept
"I go fast!1!11!1!"
Mussels said:
compress in H264, decompress with hardware acceleration.

you seem to think its hard, but its a feature built into the latest windows 7 - it can recode HD video and stream it to other PC's (or consoles/extenders) on the fly. you're seeing it as starting from nothing, i'm seeing it as an application for an existing tech.
It takes almost two minutes for the fastest of mobile processors (can't find a similar comparison for desktop/server processors) to convert 24 seconds of video in h.264:
http://www.tomshardware.com/charts/mobile-cpu-charts/Mainconcept-H.264-Encoder,473.html

It would take approximately 7 hours to do a feature length (90 minutes) film.
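Quick back-of-the-envelope on that, assuming the encode time simply scales linearly with clip length:

```python
# Scaling the benchmark result (~2 minutes to encode a 24-second clip)
# up to a 90-minute feature film, assuming linear scaling.
encode_min_per_clip = 2.0    # ~2 minutes of CPU time per benchmark run
clip_len_sec = 24.0          # the benchmark clip is 24 seconds of video

film_len_sec = 90.0 * 60     # 90-minute feature length, in seconds

total_encode_min = film_len_sec / clip_len_sec * encode_min_per_clip
print(total_encode_min / 60)  # 7.5 hours
```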
Posted on Reply
#3
jmcslob
FordGT90Concept said:
It takes almost two minutes for the fastest of mobile processors (can't find a similar comparison for desktop/server processors) to convert 24 seconds of video in h.264:
http://www.tomshardware.com/charts/mobile-cpu-charts/Mainconcept-H.264-Encoder,473.html

It would take approximately 7 hours to do a feature length (90 minutes) film.
What was the OS used for that benchmark?
here is where the best mobile cpu is http://www.cpubenchmark.net/cpu_lookup.php?cpu=Intel+Core2+Quad+Q9000+%40+2.00GHz
the best on toms that have been benchmarked are only 1/3 the power
Posted on Reply
#4
FordGT90Concept
"I go fast!1!11!1!"
Most likely Vista but it could be XP too. It doesn't really matter. Working with video is always a heavy task for processors because of the sheer amount of data.
Posted on Reply
#5
jmcslob
FordGT90Concept said:
Most likely Vista but it could be XP too. It doesn't really matter. Working with video is always a heavy task for processors because of the sheer amount of data.
and if your os would send more to your gpu would that help?
Posted on Reply
#6
FordGT90Concept
"I go fast!1!11!1!"
Only if the encoder is designed to and the GPU isn't already burdened. Assuming you do send it to the GPU, that also defeats the purpose of having 48 cores.
Posted on Reply
#7
jmcslob
I'm sorry i thought you meant recode with a device such as a laptop, sent from a pc
Posted on Reply
#8
FordGT90Concept
"I go fast!1!11!1!"
I think we're getting at putting the disk in a server and having the server send it to a laptop or screen of some sort to be viewed. A centralized computing system for the home instead of having multiple slower processors throughout.
Posted on Reply
#9
jmcslob
FordGT90Concept said:
I think we're getting at putting the disk in a server and having the server send it to a laptop or screen of some sort to be viewed. A centralized computing system for the home instead of having multiple slower processors throughout.
Right ok, with simple satellite controllers to access on demand
Posted on Reply
#10
FordGT90Concept
"I go fast!1!11!1!"
Oh, we can't forget that Hollywood would explode if that were made possible. :shadedshu

Bah. :(
Posted on Reply
#11
jmcslob
cant you do that now with a decent quad core a PC setup up with 2 video cards running independently and a KVM switch on 2 separate desktops or screens...
you don't need the kvm switch my bad a blue tooth setup works great through the house and you can run more than one channel at once for multiple keyboards etc...
and these http://www.newegg.com/Product/Product.aspx?Item=N82E16815158122 bad example here is one that can do HD http://www.newegg.com/Product/Product.aspx?Item=N82E16817707107 better yet and cheaper http://www.newegg.com/Product/Product.aspx?Item=N82E16882754006
Posted on Reply
#12
Mussels
Moderprator
i think he missed the original discussion on it.

We were talking about one large, powerful system to do the encoding - say, a 48 core magny cours system (or a weaker system with GPU encoding), sending the data over the network and then weaker systems doing the DEcoding (with GPU acceleration)

the weak systems dont have to do squat but playback a 'video' with hardware acceleration.
Posted on Reply
#13
FordGT90Concept
"I go fast!1!11!1!"
jmcslob said:
cant you do that now with a decent quad core a PC setup up with 2 video cards running independently and a KVM switch on 2 separate desktops or screens...
Yes, so long as HDCP isn't involved. Hollywood tried to mandate HDCP on pretty much everything but luckily it failed.

jmcslob said:
http://www.newegg.com/Product/Product.aspx?Item=N82E16882754006
That would work. You'd need two cables for 1080p though (125 MB/s each, 1080p is over 150 MB/s). What that does is split the bandwidth of HDMI and sends half the packets on one cable and half on the other. At the other end, it sticks the two sets of packets back together and puts it back into HDMI format. It isn't encoding or decoding, just changing the medium. I'm sure there is some degree of latency associated with it though.
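For reference, the raw pixel rate behind that 150 MB/s figure, assuming uncompressed 24-bit colour at 30 fps (one plausible reading of the claim):

```python
# Raw uncompressed bandwidth of a 1080p stream.
# Assumptions: 24-bit colour (3 bytes per pixel), 30 frames per second.
width, height = 1920, 1080
bytes_per_pixel = 3
fps = 30

mb_per_sec = width * height * bytes_per_pixel * fps / 1e6
print(round(mb_per_sec, 1))  # ~186.6 MB/s, more than one cable's 125 MB/s
```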
Posted on Reply
#14
hat
Enthusiast
Oh my... how much would one of these systems cost? I could see this in the basement of some hardcore WCG junkie...
Posted on Reply
#15
FordGT90Concept
"I go fast!1!11!1!"
Mussels said:
i think he missed the original discussion on it.

We were talking about one large, powerful system to do the encoding - say, a 48 core magny cours system (or a weaker system with GPU encoding), sending the data over the network and then weaker systems doing the DEcoding (with GPU acceleration)

the weak systems dont have to do squat but playback a 'video' with hardware acceleration.
Exactly but why not just put the disk in the weaker system and decode straight from disk there?

In any case, my point is that CPUs need to get the power of GPUs on a single core instead of multiple cores just to get a fraction of the power of a GPU. Maybe this is a fault with x86. I don't know. Regardless, we need processors with higher IPS, not more cores. Even applications coded back in the 1980s can benefit from higher IPS--they can't benefit from more cores.
Posted on Reply
#16
Mussels
Moderprator
FordGT90Concept said:
Exactly but why not just put the disk in the weaker system and decode straight from disk there?

In any case, my point is that CPUs need to get the power of GPUs on a single core instead of multiple cores just to get a fraction of the power of a GPU. Maybe this is a fault with x86. I don't know. In any case, we need processors with higher IPS, not more cores. Even applications coded back in the 1980s can benefit from higher IPS--they can't benefit from more cores.
yeah you definitely missed the original discussion.

we arent talking about playing movies here. we're talking about one main system doing everything - games, movies, the whole lot, then encoding it and streaming it to multiple cheap ass systems around the house.
Posted on Reply
#17
FordGT90Concept
"I go fast!1!11!1!"
Which could still be done better with one huge IPS processor versus 48 cores.
Posted on Reply
#18
Mussels
Moderprator
FordGT90Concept said:
Which could still be done better with one huge IPS processor versus 48 cores.
possibly. but that seems rather hard to make, whereas multi core systems arent.
Posted on Reply
#19
jmcslob
Mussels said:
i think he missed the original discussion on it.

We were talking about one large, powerful system to do the encoding - say, a 48 core magny cours system (or a weaker system with GPU encoding), sending the data over the network and then weaker systems doing the DEcoding (with GPU acceleration)

the weak systems dont have to do squat but playback a 'video' with hardware acceleration.
Right ok, cant you do that now with a decent home server, I'm sure a 48 core system could serve say an entire hotel
Posted on Reply
#20
Mussels
Moderprator
jmcslob said:
Right ok, cant you do that now with a decent home server, I'm sure a 48 core system could serve say an entire hotel
not with gaming involved.
Posted on Reply
#21
jmcslob
Mussels said:
not with gaming involved.
ok gotcha yeah but wouldn't that have more to do with storage transfer rates
Posted on Reply
#22
FordGT90Concept
"I go fast!1!11!1!"
If you got a huge IPS processor, you could MCM them to get your multiple cores. The point is, GPU IPS has been steadily rising since their invention. CPU IPS has barely changed since 2005 when the first multi-core processors debuted. Now instead of focusing on IPS, they're just throwing as many low IPS cores as they can reasonably power/cool on a chip.

DirectX 11 doesn't help. Instead of GPUs continuing their trend of higher IPS, it encourages them to do the same thing as CPUs: multiple cores...


I guess what I am getting at is that AMD and Intel are being lazy and all the programmers are having to work twice as hard to get the same goal accomplished as they would have had to if there was higher IPS and fewer cores. I mean, there's nothing revolutionary about multiple cores but there is in increasing the IPS (e.g. the huge jump between Pentium D and Core 2).


Instead of putting 12 cores in a single processor, they should be focusing on putting the power of 12 cores into a single core.
Posted on Reply
#23
jmcslob
widen the bridge so traffic can flow more efficiently,ok but until then keep loading up the cores
Posted on Reply
#24
btarunr
Editor & Senior Moderator
The bridges are wide enough. Each 16-bit link is HyperTransport 3.1, 6.4 GT/s. On par with Intel's QPI 6.4 GT/s.
Posted on Reply
#25
mdm-adph
FordGT90Concept said:
You have to understand how programs work to understand that multi-core, most of the time, is not a good thing. It adds many layers of complexity which a single, faster core gets the same performance with pure simplicity on the coding end.

Simply put, if you got a nail and you need to hammer it in, would you rather have one really big hammer or 48 tiny hammers?


There's a few occasions where multiple cores are good, but those few times are exactly that, a few (no more than four). Faster cores are preferred over SMT.
Irrelevant -- like I said, there's an upper limit to how fast a single-core can go, therefore multi-core is the only way.

Unless you think you'll be happy with 4GHz 10 years from now.
Posted on Reply
Add your own comment