Tuesday, August 24th 2010

AMD Details Bulldozer Processor Architecture

AMD is finally going to embrace a truly next-generation x86 processor architecture that is built from the ground up. AMD's current architecture, the K10(.5) "Stars", is an evolution of the more market-successful K8 architecture, but it didn't see the same kind of success, as it was overshadowed by competing Intel architectures. AMD codenamed its latest design "Bulldozer", and it features an x86 core design that is radically different from anything we've seen from either processor giant. With this design, AMD thinks it can outdo both HyperThreading and multi-core approaches to parallelism in one shot, as well as "bulldoze" through serial workloads with a broad eight integer pipelines per core (compared to 3 on K10, and 4 on Westmere). Two almost-individual blocks of integer processing units share a common floating point unit with two 128-bit FMACs.

AMD is also working on a multi-threading technology of its own to rival Intel's HyperThreading. It exploits Bulldozer's branched integer processing backed by a shared floating point design, which AMD believes to be so efficient that each SMT worker thread can be deemed a core in its own right, and can further be backed by competing threads per "core". AMD is working on another micro-architecture codenamed "Bobcat", which is a downscaled implementation of Bulldozer, with which it will take on low-power and high performance-per-watt segments that extend from all-in-one PCs all the way down to hand-held devices and 8-inch tablets. We will explore the Bulldozer architecture in some detail.
Bulldozer: The Turbo Diesel Engine
In many respects, the Bulldozer architecture is comparable to a diesel engine: lower RPM (clock speeds), higher torque (instructions per clock). When implemented, Bulldozer-based processors could outperform competing processor architectures at much lower clock speeds, due to one critical area AMD seems to have finally addressed: instructions per clock (IPC). Unlike the 65 nm "Barcelona" or 45 nm "Shanghai" architectures, which upped IPC synthetically by other means (such as backing the cores with a level-3 cache and raising the uncore/northbridge clock speeds), the 32 nm Bulldozer actually features a broad integer unit with eight integer pipelines split into two portions, each portion having its own scheduler and L1 data cache.



Parallelism: A Radical Approach?
Back when analysts were pinning high hopes on the Barcelona architecture, their hopes were fueled by early reports suggesting that AMD was using 128-bit wide floating point units, leading analysts to believe that AMD may have conquered its biggest nemesis - floating point performance, and with it pure math crunching ability. However, that wasn't exactly to be. That's because the processor's overall number crunching abilities were pegged to its floating point performance, ignoring the integer units.



AMD split the eight integer pipelines per core into two blocks, each block having four integer pipelines, an integer scheduler for those, and an L1 data cache. These constitute the lowest level of "dedicated components", dedicated to processor threads. There is a shared floating point unit between the two, with two 128-bit FMACs, arbitrated by a floating point scheduler. The fetch/decode unit, an L2 cache, and the FPU constitute the "shared" components.
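
The split between dedicated and shared components described above can be written down as a small data model. This is only an illustrative sketch: the class and field names are hypothetical, and the throughput figure assumes each 128-bit FMAC completes one fused multiply-add per lane per cycle, which the article does not state.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class IntegerBlock:
    """Dedicated, per-thread resources as described in the article."""
    pipelines: int = 4           # four integer pipelines per block
    has_scheduler: bool = True   # its own integer scheduler
    has_l1_data: bool = True     # its own L1 data cache


@dataclass(frozen=True)
class SharedFpu:
    """Floating point unit shared by both integer blocks."""
    fmacs: int = 2               # two FMAC units
    fmac_width_bits: int = 128   # each 128 bits wide

    def peak_flops_per_cycle(self, element_bits: int) -> int:
        # Assumption (not from the article): one fused multiply-add,
        # counted as 2 FLOPs, per lane per cycle.
        lanes = self.fmac_width_bits // element_bits
        return self.fmacs * lanes * 2


@dataclass(frozen=True)
class Module:
    """Two integer blocks plus the components they share."""
    blocks: tuple = field(
        default_factory=lambda: (IntegerBlock(), IntegerBlock())
    )
    fpu: SharedFpu = field(default_factory=SharedFpu)

    @property
    def integer_pipelines(self) -> int:
        return sum(block.pipelines for block in self.blocks)


module = Module()
print("integer pipelines:", module.integer_pipelines)
print("peak double-precision FLOPs/cycle:", module.fpu.peak_flops_per_cycle(64))
print("peak single-precision FLOPs/cycle:", module.fpu.peak_flops_per_cycle(32))
```

Under those assumptions the two blocks add up to the eight integer pipelines quoted in the article, while the floating point peak is a property of the shared FPU rather than of either block.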



AMD is implementing a simultaneous multithreading (SMT) technology that can split each of the "dedicated" components (in this case, the integer unit) to deal with a thread of its own, while sharing certain components with the other integer unit, effectively making each set of dedicated components a "core" in its own right. This way, the actual core of the Bulldozer die is deemed a "module", a superset of two cores, and the Bulldozer die (chip) features n-number of modules depending on the model.
So now you have a chip with eight cores with much lower die sizes and transistor counts compared to a hypothetical 32 nm K10 8-core processor. It is unclear whether AMD wants to push SMT further down to the "core" level and run two threads simultaneously over dedicated components, but one thing is for sure: AMD has embraced SMT in some form or another. In all this, the chip-level parallelism is transparent to the operating system. It will only see a fixed number of logical processors, without any special software or driver requirement.
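
As a rough illustration of the module-to-core arithmetic, the snippet below counts the logical processors an operating system would see for a given number of modules. The function name is hypothetical, the two- and eight-module counts are made-up examples, and it assumes no additional SMT inside each core, which the article leaves open.

```python
CORES_PER_MODULE = 2  # each module pairs two integer "cores", per the article


def logical_processors(modules: int) -> int:
    """Cores exposed to the OS, assuming one thread per integer core."""
    if modules < 1:
        raise ValueError("a chip needs at least one module")
    return modules * CORES_PER_MODULE


# Only the four-module, eight-core case is mentioned in the article;
# the other module counts are hypothetical.
for modules in (2, 4, 8):
    print(f"{modules} modules -> {logical_processors(modules)} cores")
```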

So in one go, AMD shot up its integer performance. Either a thread makes use of one integer unit with its four pipelines, or deals with both the integer units arbitrated by the fetch/decode, and the shared FPU.

Outside the modules
At the chip level, there's a large L3 cache, a northbridge that integrates the PCI-Express root complex, and an integrated memory controller. Since the northbridge is completely on the chip, the processor does not need a HyperTransport link to talk to the rest of the system. It connects to the chipset (which is now relegated to a southbridge, much like Intel's Ibex Peak) using A-Link Express, which, like DMI, is essentially a PCI-Express link. It is important to note that all modules and extra-modular components are present on the same piece of silicon die. Because of this design change, Bulldozer processors will come in totally new packages that are not backwards compatible with older AMD sockets such as AM3 or AM2(+).
Expectations
Not surprisingly, AMD isn't talking about Bulldozer as the next big thing since dual-core processors (something it did with Barcelona). AMD currently does have 8-core and 12-core processors codenamed "Magny-Cours", which are multichip modules of Shanghai (4-core) and Istanbul (6-core) dies. AMD expects an 8-core Bulldozer implementation (built with four modules) to have 50% higher performance-per-watt compared to Magny-Cours.



Market Segments
As mentioned in the graphic before, AMD's modular design allows it to create different products by simply controlling the number of modules on the die (by whichever method). With this, AMD will have processors ready for most PC and server market segments, all the way from desktop PCs, enthusiast-grade PCs, and notebooks to servers. AMD expects to have a full-fledged lineup in 2011. The first Bulldozer CPUs will be sold to the server market.


Hotchips 22 Presentation by AMD on the Bobcat Architecture
Below are as-is slides from AMD's Hotchips presentation on the Bobcat architecture.

283 Comments on AMD Details Bulldozer Processor Architecture

#1
JF-AMD
AMD Rep (Server)
RAM transitions are generally overestimated by the manufacturers. There is the "introduction" timeframe and the "mass acceptance" timeframe.

The earliest availability date typically means huge price premiums, spotty supply and less than stellar capabilities until the companies get their processes in line.
Posted on Reply
#2
cheezburger
bear jesus said:
:laugh: unfortunately it should be around 2015 we see DDR4, would be nice to have some DDR4 next year but i would be happy with some DDR3 if i can push it over 2000mhz on an amd board.
why bother go DDR4 when you realize it's going to be 20+ in cycle latency......

now all we need is to fix these latency per clock on the current ram technology. not just more clock
Posted on Reply
#3
Wile E
Power User
trt740 said:
That's not how it worked. Palit guy posted for a long time, but during the USA financial meltdown Palit closed up shop in the USA and he lost his job. So that's not really a good comparison, but we get your point.
I miss Palit Guy and my free hardware. :(

cheezburger said:
why bother go DDR4 when you realize it's going to be 20+ in cycle latency......

now all we need is to fix these latency per clock on the current ram technology. not just more clock
Latency per clock on DDR3 is already better than both DDR2 and DDR. To get the same latency per clock as my ram on DDR2, you would have to run it at CAS4 1066Mhz, or CAS5 1333Mhz. Not too many ram kits could do that stable and live for very long, and none were sold with those as their stock speeds.
Posted on Reply
#4
Neo4
Wile E said:
I miss Palit Guy and my free hardware. :(

Must have been nice that..

Latency per clock on DDR3 is already better than both DDR2 and DDR. To get the same latency per clock as my ram on DDR2, you would have to run it at CAS4 1066Mhz, or CAS5 1333Mhz. Not too many ram kits could do that stable and live for very long, and none were sold with those as their stock speeds.
Timing and speed matter little in real world applications and games. All the reviews show only a very few frames per second difference. It takes fast L1, L2 and L3 cache to keep our hungry processors fed. Compared to that, system memory is dog slow and only hard drives and optical drives are slower. Thank goodness solid state drives are slowly taking over both those antique mechanical technologies..
Posted on Reply
#5
largon
DDR4 can't come soon enough.

Miniaturization of CPUs and as such, IMCs, has brought problems regarding memory DDQ voltages; remember i7 doesn't run safe with RAM vDD > 1.65v? You can't have huge voltage differences between IMC and the rest of the core or things between them two parts of the die go *poof*. Even ULV DDR3 running a 1.35V vDDQ will cause a conflict with CPU core vDDs, sooner or later. And since CPU vDDs are continuously going down... Anyways, more aggregate bandwidth never hurts, and considering GPUs are getting more and more integrated in the CPU, so in near future the industry will be screaming for faster RAM.
Posted on Reply
#6
Wile E
Power User
Neo4 said:
Timing and speed matter little in real world applications and games. All the reviews show only a very few frames per second difference. It takes fast L1, L2 and L3 cache to keep our hungry processors fed. Compared to that, system memory is dog slow and only hard drives and optical drives are slower. Thank goodness solid state drives are slowly taking over both those antique mechanical technologies..
I know it makes little difference. Just commenting on his apparent misunderstanding of ram performance. Good DDR3 has both lower real world latency and higher bandwidth than both DDR 1 and 2. I was just speaking in terms of the hardware's raw abilities, not the effect it has on our apps.

And yes, getting free hardware to OC to death was a blast. lol.

largon said:
DDR4 can't come soon enough.

Miniaturization of CPUs and as such, IMCs, has brought problems regarding memory DDQ voltages; remember i7 doesn't run safe with RAM vDD > 1.65v? You can't have huge voltage differences between IMC and the rest of the core or things between them two parts of the die go *poof*. Even ULV DDR3 running a 1.35V vDDQ will cause a conflict with CPU core vDDs, sooner or later. And since CPU vDDs are continuously going down... Anyways, more aggregate bandwidth never hurts, and considering GPUs are getting more and more integrated in the CPU, so in near future the industry will be screaming for faster RAM.
I also wouldn't mind seeing ram and core speeds match, bet that would help latency nicely. Having both ram and cpu running locked at 4Ghz (for example) has to have some sort of positive benefits in overall performance.
Posted on Reply
#7
JF-AMD
AMD Rep (Server)
largon said:
DDR4 can't come soon enough.
What if it is slower, higher latency and more expensive? Will you make the jump then?

You don't need the newest technology, you need the best technology. I haven't seen enough on DDR4 to make me wish it was here any time sooner. And, it's quite a ways off.
Posted on Reply
#8
Super XP
DDR3 is more than enough. I think a good set of DDR3-1866 is perfect with ultra low timings. That should be enough for at least 2+ years for solid gaming with a nice 4GB x 4 = 16GB.
Posted on Reply
#9
bear jesus
I'm still on 1066mhz DDR2 at cas5 and it's still serving me very well, i will be happy if i can get around 2000mhz DDR3 on my next board and would be happy to wait the few years until DDR4 is in mass production.
Posted on Reply
#10
JF-AMD
AMD Rep (Server)
Anyone that raced out to get DDR-3 when it came out was treated to a pretty significant price premium and the first rounds were at 800MHz, maybe 1066MHz, but definitely no 1333MHz. It took until the first process node change to get prices and speeds in line.

Memory is one area where being an early adopter rarely has a benefit.
Posted on Reply
#11
largon
As was the case with at least DDR2 and DDR3 one can reasonably expect that DDR4 will be worse than DDR3 at start but that doesn't invalidate my statement. It didn't take that long for DDR2, DDR3 to clearly overcome DDR1, DDR2 respectively.

I'm not going to be among the first adopters... Hell, I'm still using DDR2 and personally I don't see any compelling reason to go DDR3 until, of necessity, when I'll do a platform overhaul sometime in 2011.
Posted on Reply
#12
JF-AMD
AMD Rep (Server)
Then your statement should have been "volume second generation DDR4 can't come soon enough"

;)
Posted on Reply
#13
Neo4
largon said:
As was the case with at least DDR2 and DDR3 one can reasonably expect that DDR4 will be worse than DDR3 at start but that doesn't invalidate my statement. It didn't take that long for DDR2, DDR3 to clearly overcome DDR1, DDR2 respectively.

I'm not going to be among the first adopters... Hell, I'm still using DDR2 and personally I don't see any compelling reason to go DDR3 until, of necessity, when I'll do a platform overhaul sometime in 2011.
Seriously, DDR4 is pie in the sky as far as I'm concerned. When I was younger I'd have been fired up about it but not any more. The hardware performance numbers don't lie and each and every memory architectural advancement has been a great big yawn. CPU's, GPU's and the new SSD's are where all the performance action is and I can't wait for a year from now when I'll be able to jump on the Bulldozer platform. :)
Posted on Reply
#14
HalfAHertz
Interesting discussion. What I'd like to ask JF is if AMD is going to introduce any Llano based APUs for the server market. I'd guess OpenCL programmers would be pretty interested in those if they are competitive in GFLOPS/watt.
Posted on Reply
#15
JF-AMD
AMD Rep (Server)
I actually cover that in my blog. APUs for the server market might happen, but not in the near term. There is a lot of work that has to happen on the software side first before we start embedding GPUs into CPUs.

Today customers want threads. There is a definite need for GPGPU technology, but for now the speeds/sizes that customers want make them difficult to integrate into a CPU package. There is also the issue of CPU:GPU ratio, which is different by application.
Posted on Reply
#16
Super XP
I agree the software needs to catch up but that is not stopping Intel.
Posted on Reply
#17
JF-AMD
AMD Rep (Server)
That was a server statement that I made, not a client statement.
Posted on Reply
#18
Neo4
Super XP said:
DDR3 is more than enough. I think a good set of DDR3-1866 is perfect with ultra low timings. That should be enough for at least 2+ years for solid gaming with a nice 4GB x 4 = 16GB.
http://www.xbitlabs.com/articles/memory/display/phenom-ii-x6-ddr3-2000.html

Conclusion

The main thing we have discovered in our today's tests is that DDR3-2000 SDRAM is indeed possible on Socket AM3 systems. We now know the prerequisites for that: 1) any Phenom II X6 processor, 2) any of the many mainboards based on AMD’s 800 series chipsets, and 3) specially optimized memory modules.

As you can see, the most difficult requirement is to get such optimized memory. We were lucky to have a dual-channel 4GB kit from G.Skill (F3-16000CL7D-4GBFLS) which proved to be capable of working as DDR3-2000 on our Socket AM3 testbed. This memory kit is not without downsides, of course. For example, the modules are rather large because of the cooling elements, but we don't want to find fault with them as there are almost no alternatives available on the market. If you want high-speed DDR3 for your overclocked Phenom II X6-based computer, we do recommend you this memory kit from G.Skill.

Well, you shouldn’t be disappointed if you don’t find DDR3-2000 modules compatible with the Phenom II X6 as the performance benefits of such memory over DDR3-1600 only amount to 1-2% while memory kits like the G.Skill F3-16000CL7D-4GBFLS are some 50% more expensive. So, we are prone to regard the use of DDR3-2000 modules in an overclocked Socket AM3 system as a luxury rather than a necessity.

Although the optimized modules have no problems working with Phenom II X6 processors as DDR3-2000, there are obvious problems with AMD's memory controller in general. The highest memory frequency this controller permits is much lower than what you can get with Intel processors.

Hopefully, AMD will revise its memory controller so that the company’s upcoming Bulldozer and other architectures will work with high-speed memory without any limitations and reservations, especially as JEDEC-approved speeds of DDR3 SDRAM modules may go as high as 2000 and more megahertz in the very near future.
Posted on Reply
#20
Wile E
Power User
But they are right in claiming that running 2000Mhz ram provides little, if any benefits, regardless of platform. 1600 CAS6 is better than 2000 CAS8, for example. So the true value of running 2000Mhz depends on both timings, and price. The gains will be small with significant increases in money at these levels.

But of course, CAS7 @ 2000 is better still, and some darn good sticks.
Posted on Reply
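
The comparison in the comment above can be checked with a little arithmetic. The snippet below is a minimal sketch that converts a quoted speed and CAS figure into nanoseconds; the function name is hypothetical, and it looks at CAS latency only, ignoring the other timings and bandwidth.

```python
def cas_latency_ns(data_rate: float, cas_cycles: int) -> float:
    """First-word CAS latency in nanoseconds.

    DDR memory transfers data twice per clock, so the real clock in MHz
    is half the quoted speed (the "MHz" figure used in the thread).
    """
    clock_mhz = data_rate / 2
    return cas_cycles / clock_mhz * 1000


# Speed/CAS pairs taken from the comment above.
kits = {
    "1600 CAS6": (1600, 6),
    "2000 CAS8": (2000, 8),
    "2000 CAS7": (2000, 7),
}

# Print from lowest to highest CAS latency.
for name, (rate, cas) in sorted(kits.items(), key=lambda kv: cas_latency_ns(*kv[1])):
    print(f"{name}: {cas_latency_ns(rate, cas):.2f} ns")
```

On CAS latency alone, this ordering puts 2000 at CAS7 ahead of 1600 at CAS6, which in turn is ahead of 2000 at CAS8, in line with the comment.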
#21
hat
Enthusiast
largon said:
DDR4 can't come soon enough.

Miniaturization of CPUs and as such, IMCs, has brought problems regarding memory DDQ voltages; remember i7 doesn't run safe with RAM vDD > 1.65v? You can't have huge voltage differences between IMC and the rest of the core or things between them two parts of the die go *poof*. Even ULV DDR3 running a 1.35V vDDQ will cause a conflict with CPU core vDDs, sooner or later. And since CPU vDDs are continuously going down... Anyways, more aggregate bandwidth never hurts, and considering GPUs are getting more and more integrated in the CPU, so in near future the industry will be screaming for faster RAM.
I thought it was the QPI voltage that had to be in line with the RAM voltage, not CPU core voltage.
Posted on Reply
#22
Nick89
WhiteLotus said:
Lower clock speeds but better math crunching abilities... interesting.

And am I alone in thinking these will be big chips? What with everything on them...
I'm hoping they will be either 32nm or 28nm.
Posted on Reply
#23
Steevo
Burn the Intel infidels.


I am not upgrading again until either the bulldozer is real, and really competitive for price and performance, or it fails and I go back to Intel.


I need more video processing power m2ts, 1080P with effects in Adobe, pixela, and ATI has failed me on that front too. So. Green and blue might be my new colors if they don't pull their shit together by next spring.
Posted on Reply
#24
TheMailMan78
Big Member
Honestly I'm fine with DDR2 and low timings. My board can run DDR2 @1333. But currently I run at 1067. But look at my timings. :)
Posted on Reply
#25
bear jesus
TheMailMan78 said:
Honestly I'm fine with DDR2 and low timings. My board can run DDR2 @1333. But currently I run at 1067. But look at my timings. :)
I kind of agree, one of the things making it easy for me to wait to go with ddr3 is the timings. When i move to ddr3 i want 8gb across 2 modules, and the best that's easily available to me is 2000mhz at 9-10-9-27. I'm sure running it slower than 2ghz would possibly let me lower the timings, but going from 5-5-5-15 ddr2 i would kind of want at least cas 6 or 7 with ddr3.

I hope in the coming months more memory will be released with lower timings.
Posted on Reply