Tuesday, August 24th 2010

AMD Details Bulldozer Processor Architecture

AMD is finally going to embrace a truly next-generation x86 processor architecture that is built from the ground up. AMD's current architecture, the K10(.5) "Stars", is an evolution of the K8 architecture, but it did not enjoy the same market success as K8 because it was overshadowed by competing Intel architectures. AMD codenamed its latest design "Bulldozer", and it features an x86 core design that is radically different from anything we've seen from either processor giant. With this design, AMD thinks it can outdo both the HyperThreading and multi-core approaches to parallelism in one shot, as well as "bulldoze" through serial workloads with a broad set of eight integer pipelines per core (compared to 3 on K10, and 4 on Westmere). Two almost-individual blocks of integer processing units share a common floating point unit with two 128-bit FMACs.

AMD is also working on a multi-threading technology of its own to rival Intel's HyperThreading. It exploits Bulldozer's branched integer processing backed by a shared floating point design, which AMD believes to be so efficient that each SMT worker thread can be deemed a core in its own right, and further be backed by competing threads per "core". AMD is working on another micro-architecture codenamed "Bobcat", which is a downscaled implementation of Bulldozer, with which it will take on low-power and high performance-per-watt segments that extend from all-in-one PCs all the way down to hand-held devices and 8-inch tablets. We will explore the Bulldozer architecture in some detail.

Bulldozer: The Turbo Diesel Engine
In many respects, the Bulldozer architecture is comparable to a diesel engine: lower RPM (clock speeds), high torque (instructions per clock). When implemented, Bulldozer-based processors could outperform competing processor architectures at much lower clock speeds, due to one critical area AMD seems to have finally addressed: instructions per clock (IPC). Unlike the 65 nm "Barcelona" or 45 nm "Shanghai" architectures, which upped IPC synthetically by other means (such as backing the cores up with a level-3 cache and upping the uncore/northbridge clock speeds), the 32 nm Bulldozer actually features a broad integer unit with eight integer pipelines split into two portions, each portion having its own scheduler and L1 data cache.



Parallelism: A Radical Approach?
Back when analysts were pinning high hopes on the Barcelona architecture, their hopes were fueled by early reports suggesting that AMD was using wide 128-bit floating point units, leading analysts to believe that AMD may have conquered its biggest nemesis, floating point performance, and in turn improved its pure math-crunching abilities. However, that wasn't exactly to be, because the processor's overall number-crunching abilities were pegged to its floating point performance, ignoring the integer units.



AMD split the eight integer pipelines per core into two blocks, each block having four integer pipelines, an integer scheduler for those, and an L1 data cache. These constitute the lowest level of "dedicated components", dedicated to processor threads. There is a shared floating point unit between the two, with two 128-bit FMACs, arbitrated by a floating point scheduler. The fetch/decode unit, an L2 cache, and the FPU constitute the "shared" components.



AMD is implementing a form of simultaneous multithreading (SMT): it can split each of the "dedicated" components (in this case, the integer unit) to deal with a thread of its own, while sharing certain components with the other integer unit, and effectively make each set of dedicated components a "core" in its own right. This way, the actual core of the Bulldozer die is deemed a "module", a superset of two cores, and the Bulldozer die (chip) features n modules depending on the model.
So now you have a chip with eight cores with a much lower die size and transistor count compared to a hypothetical 32 nm K10 8-core processor. It is unclear whether AMD wants to further push SMT down to the "core" level and run two threads simultaneously over dedicated components, but one thing is for sure: AMD has embraced SMT in some form or another. In all this, the chip-level parallelism is transparent to the operating system; it will only see a fixed number of logical processors, without any special software or driver requirement.
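
To make the "fixed number of logical processors" point concrete, here is a minimal sketch in Python. The helper name logical_processors and the two-cores-per-module default are illustrative assumptions drawn from the description above, not anything AMD has published; os.cpu_count() is the standard library call that reports what the operating system enumerates on whatever machine runs the script.

```python
import os


def logical_processors(modules: int, cores_per_module: int = 2) -> int:
    """Logical processors the OS would enumerate for a hypothetical chip.

    Assumes each module exposes two cores, as described in the article.
    """
    return modules * cores_per_module


# Hypothetical four-module part, matching the article's eight-core example.
expected = logical_processors(modules=4)
print("Hypothetical 4-module chip:", expected, "logical processors")

# What the OS reports on the machine running this script (may be None).
print("This machine:", os.cpu_count())
```

Software that sizes its thread pool from the reported count would need no changes, which is what "transparent to the operating system" amounts to in practice.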

So in one go, AMD shot up its integer performance. Either a thread makes use of one integer unit with its four pipelines, or deals with both the integer units arbitrated by the fetch/decode, and the shared FPU.
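
The dedicated-versus-shared split described in this section can be sketched as a small data model. This is only an illustration of the block diagram as the article describes it; the class and field names (IntegerCore, SharedFPU, Module, build_chip) are made up for the example and are not AMD terminology beyond "module" and "core".

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class IntegerCore:
    """Dedicated per-thread resources: one integer block."""
    pipelines: int = 4              # four integer pipelines per block
    has_scheduler: bool = True      # its own integer scheduler
    has_l1_data_cache: bool = True  # its own L1 data cache


@dataclass
class SharedFPU:
    """Floating point unit shared by both integer blocks."""
    fmac_units: int = 2
    fmac_width_bits: int = 128


@dataclass
class Module:
    """One module: two integer cores plus the shared components."""
    cores: List[IntegerCore] = field(
        default_factory=lambda: [IntegerCore(), IntegerCore()]
    )
    fpu: SharedFPU = field(default_factory=SharedFPU)
    shared: Tuple[str, ...] = ("fetch/decode", "L2 cache", "FPU")

    def integer_pipelines(self) -> int:
        return sum(core.pipelines for core in self.cores)


def build_chip(module_count: int) -> List[Module]:
    """A chip is n modules; the count depends on the model."""
    return [Module() for _ in range(module_count)]


chip = build_chip(4)
total_cores = sum(len(m.cores) for m in chip)
print(total_cores, "cores,", chip[0].integer_pipelines(), "integer pipelines per module")
```

Modeling the FPU once per module rather than once per core is the whole point of the design: the floating point hardware is paid for once and arbitrated between two integer blocks.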

Outside the modules
At the chip level, there's a large L3 cache, a northbridge that integrates the PCI-Express root complex, and an integrated memory controller. Since the northbridge is completely on the chip, the processor does not need to connect to the rest of the system over a HyperTransport link. It connects to the chipset (which is now relegated to a southbridge, much like Intel's Ibex Peak) using A-Link Express, which, like DMI, is essentially a PCI-Express link. It is important to note that all modules and extra-modular components are present on the same piece of silicon. Because of this design change, Bulldozer processors will come in totally new packages that are not backwards compatible with older AMD sockets such as AM3 or AM2(+).
Expectations
Not surprisingly, AMD isn't talking about Bulldozer as the next big thing since dual-core processors (something it did with Barcelona). AMD currently has 8-core and 12-core processors codenamed "Magny-Cours", which are multi-chip modules of Shanghai (4-core) and Istanbul (6-core) dies. AMD expects an 8-core Bulldozer implementation (built with four modules) to have 50% higher performance-per-watt compared to Magny-Cours.



Market Segments
As mentioned in the graphic before, AMD's modular design allows it to create different products by simply controlling the number of modules on the die (by whichever method). With this, AMD will have processors ready for most PC and server market segments, from desktop PCs, enthusiast-grade PCs, and notebooks to servers. AMD expects to have a full-fledged lineup in 2011. The first Bulldozer CPUs will be sold to the server market.


Hot Chips 22 Presentation by AMD on the Bobcat Architecture
Below are as-is slides from AMD's Hot Chips presentation on the Bobcat architecture.

283 Comments on AMD Details Bulldozer Processor Architecture

#1
nt300
So Bulldozer CPUs will not work with old AM3 motherboards but old AM3 cpus will work in new AM3+ motherboards. I hope AMD does not mess up the DDR3 scaling because Dual-channel is not enough to feed 8 bulldozer cores.
Desktop Bulldozer Processors Will Require New Platforms - AMD.
AMD Zambezi to Use AM3+ Platforms

http://www.xbitlabs.com/news/cpu/display/20100826225852_Desktop_Bulldozer_Processors_Will_Require_New_Platforms_AMD.html

Advanced Micro Devices said that its next-generation desktop processors code-named Zambezi will use socket AM3+ platforms, which will be backwards compatible with the firm's existing AM3 products. While the latter is an advantage for the platform, it may be a disadvantage for eight-core processors based on Bulldozer micro-architecture...............
Posted on Reply
#2
cadaveca
My name is Dave
I'm not buying any of it. Let's wait for some motherboards to surface before deciding who's got the right story...I think these guys aren't all talking to the same people @ AMD, and the guys they are talking to, aren't exactly up to date on all the pertinent info. Idiots.
Posted on Reply
#3
jmcslob
I'm thinking the first round will be like the PhenomII 920 and 940 but after that they will all be AM3r2 only Cpu's
Posted on Reply
#4
Super XP
New platform, no problem I am looking forward to buying a new mobo.
Posted on Reply
#5
Neo4
by: nt300
So Bulldozer CPUs will not work with old AM3 motherboards but old AM3 cpus will work in new AM3+ motherboards. I hope AMD does not mess up the DDR3 scaling because Dual-channel is not enough to feed 8 bulldozer cores.
Dual channel memory is more than enough and Intel proved it with socket 1366 and triple channel designs being an unnecessary expense. Why do you think they went back to dual channel? Read the reviews; it wasn't just for the expense. (By the way, read the reviews on the real world impact of RAM speed as well.) And how can current AM3 designs support a radical and completely new design never before tried by ANY CPU manufacturer? One that doesn't require a Northbridge chipset because it's built into the CPU itself? If current boards supported "Bulldozer" then it would just be a rehash of "Stars" and little faster than what AMD has now, despite the die shrink to 32 nm, which will certainly allow higher clocks and lower TDPs. It certainly wouldn't have a chance against Intel's current and future processors. Allowing current CPUs to work in the Bulldozer boards to come is far more generous than anybody should expect and far more than the Intel camp would ever allow. AMD, I strongly suspect, has a major new performance boost coming with Bulldozer and it's going to strike with even more impact because they will downplay it right up to the day it's released to the server market next April or so. Remember when AMD shocked everybody by how much faster the 4000 video series was than the 3000 series by keeping a low profile up until the day they went on sale? By next August, regular peeps like us will be able to purchase hardware from NewEgg probably no more expensive than current AMD hardware and all we'll need to upgrade our boxes will be a new board and CPU. Next year at this time TechPowerup, HardOCP, Anandtech and all the other hardware review sites will be gushing their enthusiasm for what AMD will have accomplished. Exciting times, my friends, when you think that you can just buy a new board that supports Bulldozer, use your current Phenom II and buy a Bulldozer CPU later when you have the cash. That's a pretty painless and inexpensive upgrade path compared to ChipZilla.
;)
Posted on Reply
#6
JF-AMD
AMD Rep (Server)
People seem to be really caught up in how many channels of memory there are, and not necessarily how efficient those channels perform.

What if you had 2 channels that could perform the same as 3? Would you still demand 3 or would you be ok with 2?

It's the same thing with thermals on servers. Intel is at 32nm but their best 2P power score (@ 100% utilization) is 174W. Ours is 126W (on a 45nm process). I have people try to convince me that 32nm is an advantage because you have lower power consumption.

It's not about the technology, it's about the output.
Posted on Reply
#7
cheezburger
by: JF-AMD
People seem to be really caught up in how many channels of memory there are, and not necessarily how efficient those channels perform.

What if you had 2 channels that could perform the same as 3? Would you still demand 3 or would you be ok with 2?

It's the same thing with thermals on servers. Intel is at 32nm but their best 2P power score (@ 100% utilization) is 174W. Ours is 126W (on a 45nm process). I have people try to convince me that 32nm is an advantage because you have lower power consumption.

It's not about the technology, it's about the output.
That's because in the computer world everything is accelerated by pure brutal force, not efficiency. If you can match the performance of an Intel system that consumes 174W while only using 126W, why not increase to 174W and crush Intel? I don't understand your logic at all.
Posted on Reply
#8
bear jesus
by: JF-AMD
People seem to be really caught up in how many channels of memory there are, and not necessarily how efficient those channels perform.
I think that some people, myself included just assumed that each channel is limited more by the ram than anything else thus assumed that the only way to get more performance is to add more channels.

I'm still interested in the idea of a quad memory channel Bulldozer (preferably Interlagos) for a home server, partly because I assume that with so many cores and with running multiple virtual machines it would benefit from the extra channels, although really I don't have a clue what would be needed memory bandwidth wise or if I would have a need for so many channels.
Posted on Reply
#9
btarunr
Editor & Senior Moderator
by: cheezburger
that's because in computer world everything are accelerate by pure brutal force. not efficiency.
That's exactly what JF is talking about. "Pure brutal force" counts, not what goes into creating that. So If Bulldozer's client SKU uses say dual-channel DDR3-1866 MHz as its memory standard (since 1866 MHz 1.5V bulk DIMMs are a reality), it's making up for memory bandwidth that triple-channel DDR3-1066 MHz (Core i7 official memory standard) has with its third channel. It's the same as 256-bit high-speed GDDR5 vs. 384-bit low-speed GDDR5 AMD vs. NVIDIA point.
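
As a rough back-of-the-envelope check on that comparison, here is a small Python sketch. It assumes the standard 64-bit DDR3 channel width and uses the nominal transfer rates named in the post; the function name peak_bandwidth_gb_s is made up for the example, and the result is theoretical peak bandwidth only, not measured throughput.

```python
def peak_bandwidth_gb_s(channels: int, mega_transfers_per_s: float,
                        channel_width_bits: int = 64) -> float:
    """Theoretical peak bandwidth in GB/s (decimal) for a DDR memory setup.

    Assumes every channel is channel_width_bits wide and fully utilized.
    """
    bytes_per_transfer = channel_width_bits / 8
    return channels * mega_transfers_per_s * bytes_per_transfer / 1000.0


dual_1866 = peak_bandwidth_gb_s(channels=2, mega_transfers_per_s=1866)
triple_1066 = peak_bandwidth_gb_s(channels=3, mega_transfers_per_s=1066)

print(f"Dual-channel DDR3-1866:   {dual_1866:.1f} GB/s")
print(f"Triple-channel DDR3-1066: {triple_1066:.1f} GB/s")
```

Under these assumptions the two faster channels come out ahead of the three slower ones on paper, which is the same trade as a narrow fast GDDR5 bus against a wide slow one.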

And you're wrong, efficiency is God in the server world.
Posted on Reply
#10
JF-AMD
AMD Rep (Server)
Because there are large companies that buy tens of thousands of servers and all they care about is the absolute lowest power possible so that they can have the largest number of threads with the lowest watts per thread. Think of very large cloud companies.

As a matter of fact, these customers routinely underclock their processors because the proportional drop in power is greater than the drop in performance, leading to better performance per watt.

Not every application requires performance. As a matter of fact, because only ~5% of the processors bought are top bin (ours and intel's), you can actually say that 95% of the customers want something other than raw performance (either price/performance or performance/watt.) It is pretty simplistic to think that performance is the only vector that matters. It's akin to asking a hybrid car owner what the 0-60mph time is or asking a sports car owner what the gas mileage is.

There are plenty of different usage models in the market and the "raw performance at all costs" is ~5% of the market. At best.
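
The underclocking argument can be illustrated with the textbook approximation that dynamic CPU power scales with voltage squared times frequency. The sketch below is only that approximation: it ignores static leakage, assumes performance scales linearly with clock speed, and the 0.8 and 0.9 scale factors are made-up example values, not figures from AMD or Intel.

```python
def relative_power(freq_scale: float, voltage_scale: float) -> float:
    """Dynamic power relative to stock, using P ~ C * V^2 * f."""
    return voltage_scale ** 2 * freq_scale


def perf_per_watt_gain(freq_scale: float, voltage_scale: float) -> float:
    """Performance-per-watt relative to stock, assuming perf ~ frequency."""
    return freq_scale / relative_power(freq_scale, voltage_scale)


# Hypothetical underclock: 80% of stock clock at 90% of stock voltage.
freq, volt = 0.8, 0.9
print(f"Relative performance: {freq:.2f}")
print(f"Relative power:       {relative_power(freq, volt):.3f}")
print(f"Perf/watt vs stock:   {perf_per_watt_gain(freq, volt):.2f}x")
```

With these example numbers the chip gives up 20% of its performance for roughly 35% less dynamic power, so performance per watt goes up even though raw performance goes down.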
Posted on Reply
#11
JF-AMD
AMD Rep (Server)
by: bear jesus
I think that some people, myself included just assumed that each channel is limited more by the ram than anything else thus assumed that the only way to get more performance is to add more channels.

I'm still interested in the idea of a quad memory channel Bulldozer (preferably Interlagos) for a home server, partly because I assume that with so many cores and with running multiple virtual machines it would benefit from the extra channels, although really I don't have a clue what would be needed memory bandwidth wise or if I would have a need for so many channels.
Actually, you find that 3 channels is in reality less efficient. I could get into the long math of it, but let me cut to the chase: Everything in the computer world is based on even numbers. 3 channels of memory is the odd man out and is not handled the same way. Plus you don't get to do some things on the server side like advanced ECC unless you have even numbers of channels.
Posted on Reply
#12
cadaveca
My name is Dave
by: btarunr
And you're wrong, efficiency is God in the server world.
AMD's process uses less current than Intel's, and this is a huge advantage for AMD(not like I haven't said that before). I think they have the efficiency thing down pat already...and hopefully Bulldozer brings that brute force. The two things together = 1 killer chip.
Posted on Reply
#13
bear jesus
by: JF-AMD
Actually, you find that 3 channels is in reality less efficient. I could get into the long math of it, but let me cut to the chase: Everything in the computer world is based on even numbers. 3 channels of memory is the odd man out and is not handled the same way. Plus you don't get to do some things on the server side like advanced ECC unless you have even numbers of channels.
Really, I'm expecting to have to choose between 2 or 4 channels for the server, mainly depending on performance, along with either 8 or 16 cores. But it is good to know that a triple channel based server would not be a good idea for my wants/needs.
Posted on Reply
#14
DigitalUK
by: JF-AMD
People seem to be really caught up in how many channels of memory there are, and not necessarily how efficient those channels perform.

What if you had 2 channels that could perform the same as 3? Would you still demand 3 or would you be ok with 2?
i think JF knows something hes not telling us...lol blink twice if its dual channel
Posted on Reply
#15
bear jesus
by: DigitalUK
i think JF knows something hes not telling us...lol blink twice if its dual channel
I'm pretty sure he knows a lot that he can't tell us :p we are just lucky he is doing a good job at kind of telling us things without telling us certain things... if that makes any sense lol.
Posted on Reply
#16
CDdude55
Crazy 4 TPU!!!
by: DigitalUK
i think JF knows something hes not telling us...lol blink twice if its dual channel
lol, there's a lot he probably can't tell us, even if he's only the server guy.

I think if they can make it efficient and get near or more memory bandwidth while only using two channels, then I'm all fine with that. As said, I think on the server side of things efficiency is very important, but I think client wise, triple channel is more than enough even if it's not as efficient.
Posted on Reply
#17
DigitalUK
Yeah, I've complete faith in AMD to deliver the goods, that's why I'm saving now. It's just that JF has hinted at least 3 times or more that 2 channels could be less efficient. His blog is also very interesting, memory noted there as well.
AMD hit man knocking my door soon.
Posted on Reply
#18
mastrdrver
by: JF-AMD
People seem to be really caught up in how many channels of memory there are, and not necessarily how efficient those channels perform.

What if you had 2 channels that could perform the same as 3? Would you still demand 3 or would you be ok with 2?

It's the same thing with thermals on servers. Intel is at 32nm but their best 2P power score (@ 100% utilization) is 174W. Ours is 126W (on a 45nm process). I have people try to convince me that 32nm is an advantage because you have lower power consumption.

It's not about the technology, it's about the output.
You sir are crazy! Everyone knows that 3 channels pwns all and 4 is teh win! :rockout:

:laugh: j/k

I'm glad we could get some clarity on this straight from the horse's mouth (per se).
Posted on Reply
#19
Wile E
Power User
All I care about is how much overclocked performance it can achieve within the heat output my cooling setup is able to manage. I don't care how it's achieved, only that it is.

I just want to know how it performs and how it overclocks.

If it's better than Intel, my next rig is AMD. If not, I stick with Intel. That's that.
Posted on Reply
#20
bear jesus
by: Wile E
All I care about is how much performance it can achieve within the heat output my cooling setup is able to manage. I don't care how it's achieved, only that it is.

I just want to know how it performs and how it overclocks.


If it's better than Intel, my next rig is AMD. If not, I stick with Intel. That's that.
I have to agree, i really want to go with some water cooling with my next cpu upgrade so i am really hoping that bulldozer will oc well under water.
Posted on Reply
#21
Super XP
by: Neo4
Dual channel memory is more than enough and Intel proved it with socket 1366 and triple channel designs being an unnecessary expense. Why do you think they went back to dual channel? Read the reviews; it wasn't just for the expense. (By the way, read the reviews on the real world impact of RAM speed as well.) And how can current AM3 designs support a radical and completely new design never before tried by ANY CPU manufacturer? One that doesn't require a Northbridge chipset because it's built into the CPU itself? If current boards supported "Bulldozer" then it would just be a rehash of "Stars" and little faster than what AMD has now, despite the die shrink to 32 nm, which will certainly allow higher clocks and lower TDPs. It certainly wouldn't have a chance against Intel's current and future processors. Allowing current CPUs to work in the Bulldozer boards to come is far more generous than anybody should expect and far more than the Intel camp would ever allow. AMD, I strongly suspect, has a major new performance boost coming with Bulldozer and it's going to strike with even more impact because they will downplay it right up to the day it's released to the server market next April or so. Remember when AMD shocked everybody by how much faster the 4000 video series was than the 3000 series by keeping a low profile up until the day they went on sale? By next August, regular peeps like us will be able to purchase hardware from NewEgg probably no more expensive than current AMD hardware and all we'll need to upgrade our boxes will be a new board and CPU. Next year at this time TechPowerup, HardOCP, Anandtech and all the other hardware review sites will be gushing their enthusiasm for what AMD will have accomplished. Exciting times, my friends, when you think that you can just buy a new board that supports Bulldozer, use your current Phenom II and buy a Bulldozer CPU later when you have the cash. That's a pretty painless and inexpensive upgrade path compared to ChipZilla.
;)
Is that a shared NB?
Posted on Reply
#23
Super XP
The NB is still integrated into Bulldozer was what I was trying to point out. Just like the IMC.
Posted on Reply
#24
TheMailMan78
Big Member
by: cadaveca
I think the real point there is that once again, AMD isn't exactly forthcoming with PRECISE information, ever. Or maybe it's those reporting...I am unsure since everyone in those circles is so "buddy-buddy" at this point.
I think thats the case with any major company. You never show all the goods. It can give you an edge or hide your flaw.
Posted on Reply
#25
cadaveca
My name is Dave
by: TheMailMan78
I think thats the case with any major company. You never show all the goods. It can give you an edge or hide your flaw.
Well, since then, JF-AMD is posting here now, so maybe he'll end that confusion. ;)
Posted on Reply