Friday, September 24th 2010

AMD Orochi "Bulldozer" Die Holds 16 MB Cache

Documents related to AMD's 8-core "Orochi" processor, based on its next-generation Bulldozer architecture, reveal a cache hierarchy that comes as a bit of a surprise. Earlier this month, at a GlobalFoundries-hosted conference, AMD displayed the first die shot of Orochi, which legibly showed key features including the four Bulldozer modules, which hold two cores each, and large L2 caches. On coarse visual inspection, the L2 cache of each module seems to cover 35% of its area. The L3 cache is located along the center of the die. The documents seen by X-bit Labs reveal that each Bulldozer module has its own 2 MB L2 cache shared between its two cores, and that an 8 MB L3 cache is shared between all four modules (8 cores).

This takes the total cache of Orochi all the way up to 16 MB (four 2 MB L2 caches plus the 8 MB L3). This hierarchy suggests that AMD wants to give individual cores access to a large amount of faster cache (that's a whopping 2048 KB, compared to 512 KB per core on Phenom and 256 KB per core on Core i7), which facilitates faster inter-core, intra-module communication. Inter-module communication is enhanced by the 8 MB L3 cache. Compared to the current "Istanbul" six-core K10-based die, that's a 77% increase in cache amount for a 33% core count increase, and a 300% increase in the L2 cache available to each core. Orochi is built on a 32 nm GlobalFoundries process, and it is sure to have a very high transistor count.

Source: X-bit Labs
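
The percentages above can be checked with a few lines of arithmetic. The sketch below is only a sanity check, not anything from AMD or X-bit Labs. The Orochi numbers come from the article; the 6 MB L3 figure for "Istanbul" is an assumption added here, since the article does not state it, and all variable names are illustrative.

```python
# Figures quoted in the article (sizes in KB).
OROCHI_MODULES = 4
OROCHI_CORES = 8
OROCHI_L2_PER_MODULE_KB = 2048
OROCHI_L3_KB = 8 * 1024

# "Istanbul" figures: 512 KB of L2 per core is the K10/Phenom value the
# article quotes; the 6 MB L3 is an ASSUMPTION, not stated in the article.
ISTANBUL_CORES = 6
ISTANBUL_L2_PER_CORE_KB = 512
ISTANBUL_L3_KB = 6 * 1024


def percent_increase(new: float, old: float) -> float:
    """Return the increase from old to new, as a percentage of old."""
    return (new - old) / old * 100


# Total cache = all L2 caches plus the shared L3.
orochi_total_kb = OROCHI_MODULES * OROCHI_L2_PER_MODULE_KB + OROCHI_L3_KB
istanbul_total_kb = ISTANBUL_CORES * ISTANBUL_L2_PER_CORE_KB + ISTANBUL_L3_KB

cache_increase = percent_increase(orochi_total_kb, istanbul_total_kb)
core_increase = percent_increase(OROCHI_CORES, ISTANBUL_CORES)
# L2 reachable by one core: the whole 2048 KB module cache vs. a private 512 KB.
l2_per_core_increase = percent_increase(
    OROCHI_L2_PER_MODULE_KB, ISTANBUL_L2_PER_CORE_KB
)

print(f"Orochi total cache:   {orochi_total_kb // 1024} MB")
print(f"Istanbul total cache: {istanbul_total_kb // 1024} MB")
print(f"Cache increase:       {cache_increase:.1f}%")
print(f"Core count increase:  {core_increase:.1f}%")
print(f"L2 per core increase: {l2_per_core_increase:.1f}%")
```

Under these assumptions the cache increase works out to just under 78%, which the article gives as 77%. The 300% figure counts the full 2 MB module cache as available to a single core; if the L2 were split evenly between the two cores of a module, the per-core increase would be 100% instead.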

152 Comments on AMD Orochi "Bulldozer" Die Holds 16 MB Cache

#1
b82rez
BL GG Intel Fanboys. AMD is back! :nutkick:
Posted on Reply
#2
bpgt64
I'll believe it's a performance gain when I see the benchmarks. Regardless of which side you take, competition is always good for the consumer.
Posted on Reply
#3
KainXS
wait for benchmarks before you start that, we've been through that before with amd
Posted on Reply
#4
wolf
Performance Enthusiast
by: b82rez
BL GG Intel Fanboys. AMD is back! :nutkick:
silly :slap:

cache isn't everything, reviews pretty much are.
Posted on Reply
#5
ebolamonkey3
2011 is shaping up to be quite an interesting year :)
Posted on Reply
#6
Completely Bonkers
I remember the "massive cache" Gallatin P4's over Northwood. Didn't make more than 5% difference clock for clock except in very special circumstances.

So let's wait for benchmarks.

I would have thought there would be better gains by rethinking cache and memory entirely, possibly producing a separate socket for L3 cache just like in the old days. It would be so much cheaper to do it that way, you could easily pack 256MB cache. Yes, the latency would be worse than current on-die L3 cache, but with the space, heat and transistors saved, you could bump up L1 and L2 cache and win back any performance losses. Plus you could build your L3 cache to order.
Posted on Reply
#7
DaMulta
My stars went supernova
That's it????? I wait for the day with 16 cores with 64MB of Cache
Posted on Reply
#8
dir_d
Well it seems Bulldozer is going to be faster when communicating with memory and other cores. I think if AMD just did that to a Phenom II chip it would speed it up significantly. I really can't wait to see Bulldozer in action.
Posted on Reply
#9
bear jesus
I would hope more, faster cache could be a good thing, but the main thing I'm interested in is how each module performs. I'm really thinking about getting a high-end Sandy Bridge or Bulldozer to last me a couple of years or so, and that means I want as many and as fast cores as possible, as I would hope that over the next few years more software will use more cores.
Posted on Reply
#10
Rebelstar
I'm a total noob in CPU technologies, but I think a 16 MB cache is freaking cool, right?
Posted on Reply
#11
xaira
by: btarunr
it is sure to have a very high transistor count.
so does fermi, i hope amd has the tdp under control, otherwise sandy will kick butt
Posted on Reply
#12
bear jesus
by: Rebelstar
I'm a total noob in CPU technologies, but I think a 16 MB cache is freaking cool, right?
It could be if put to use well, but the cores are really important; either way we won't know until the reviews, really.
Posted on Reply
#13
devguy
One design win I really commend AMD for is their use of dynamic cache allocation between the "cores" on a module. While many assume the sharing of cache (and other items like the FPU) will hurt single threaded performance, that really isn't the case. When only one core is active per module, it has complete control over all the resources; thus a single core will have 2mb L2 cache at its disposal! Also, when both cores on a module are active, they can inequitably share the resources (ie one core with .5mb L2 and another with 1.5mb L2 is possible). Very cool technology.

For Bulldozer, there will be the option to have the OS prefer loading one core per module (like cores 1, 3, 5, 7) rather than just filling them up by modules (1, 2, 3, 4). Both have benefits and faults: the first route has higher performance, but also higher power consumption; the second would be the exact opposite.

As far as the sharing of the FPU, in servers it will make hardly any difference. In the desktop segment, AMD argues that should you be doing something that takes up so much FPU performance to slow down our modules, then you should be doing it on the GPU instead.
Posted on Reply
#14
cadaveca
My name is Dave
I like this news. I have been saying for a couple of years now that AMD's cache design needed to change, and here they are doing something about it. That makes me even more interested in Bulldozer tech.
Posted on Reply
#15
bear jesus
by: devguy
One design win I really commend AMD for is their use of dynamic cache allocation between the "cores" on a module. While many assume the sharing of cache (and other items like the FPU) will hurt single threaded performance, that really isn't the case. When only one core is active per module, it has complete control over all the resources; thus a single core will have 2mb L2 cache at its disposal! Also, when both cores on a module are active, they can inequitably share the resources (ie one core with .5mb L2 and another with 1.5mb L2 is possible). Very cool technology.

For Bulldozer, there will be the option to have the OS prefer loading one core per module (like cores 1, 3, 5, 7) rather than just filling them up by modules (1, 2, 3, 4). Both have benefits and faults: the first route has higher performance, but also higher power consumption; the second would be the exact opposite.

As far as the sharing of the FPU, in servers it will make hardly any difference. In the desktop segment, AMD argues that should you be doing something that takes up so much FPU performance to slow down our modules, then you should be doing it on the GPU instead.
I never knew it would be set up like that, kind of makes me even more sure i want to wait for bulldozer for my next full upgrade so that if it is a good cpu at a good price i can go for one or if not then i can get something from sandy bridge a little cheaper (hoping price drops will come over the time waited and if the consumer is lucky price drops that come with/after bulldozer).
Posted on Reply
#16
cheezburger
No surprise. They are trying to fix the single-thread performance hit due to the smaller L1 data/instruction caches. Each core "only" has 8 KB of L1 data cache, while the instruction cache is shared by the module and is just 64 KB "2-way" (could be less... I think...), which is roughly 40 KB per core compared to Core's 64 KB per core. Big disadvantage. So all they can do is add more L3 cache to increase performance, or hope not to drop performance, without tweaking too much on the existing architecture that has been taped out and is going to be released in 3 months. Intel did the same thing when it realized Northwood's poor L1 cache would drag down performance: it increased the L2 cache from 256 KB to 512 KB. However, Orochi is an 8-module, 16-core processor, so featuring 16 MB of L3 means each core can use up to 1 MB of L3, still way below Nehalem's 2 MB per core. Also, unlike Intel's architecture, AMD's cache is heavily determined by the pipeline stages; a shorter pipeline won't take advantage of a bigger cache. But since Bulldozer will feature 4+ GHz, I doubt this will be at least a 20+ stage pipeline in this processor. But despite all these features, as long as Intel decides to increase Ivy Bridge's L2 cache from 256 KB per core to 512 KB per core, AMD will experience the same horror they faced when Core 2 came out.
Posted on Reply
#17
HTC
I wonder how hot these CPUs will get ...
Posted on Reply
#18
ROad86
by: cheezburger
No surprise. They are trying to fix the single-thread performance hit due to the smaller L1 data/instruction caches. Each core "only" has 8 KB of L1 data cache, while the instruction cache is shared by the module and is just 64 KB "2-way" (could be less... I think...), which is roughly 40 KB per core compared to Core's 64 KB per core. Big disadvantage. So all they can do is add more L3 cache to increase performance, or hope not to drop performance, without tweaking too much on the existing architecture that has been taped out and is going to be released in 3 months. Intel did the same thing when it realized Northwood's poor L1 cache would drag down performance: it increased the L2 cache from 256 KB to 512 KB. However, Orochi is an 8-module, 16-core processor, so featuring 16 MB of L3 means each core can use up to 1 MB of L3, still way below Nehalem's 2 MB per core. Also, unlike Intel's architecture, AMD's cache is heavily determined by the pipeline stages; a shorter pipeline won't take advantage of a bigger cache. But since Bulldozer will feature 4+ GHz, I doubt this will be at least a 20+ stage pipeline in this processor. But despite all these features, as long as Intel decides to increase Ivy Bridge's L2 cache from 256 KB per core to 512 KB per core, AMD will experience the same horror they faced when Core 2 came out.
First, Orochi is a 4-module, 8-core design. Second, what matters is not only the size of the cache but also how fast it is. Third, it is very important how instruction prediction works; if the design is good, then you don't need a big L1 cache, which increases cost and die size. And yes, 2 MB per module, 1 MB per core, is the amount that Bulldozer will have.
Posted on Reply
#19
mechtech
I want one, a server version with 8 or 16 GB of ecc ram :D I don't know why though since I don't even work 1 core on my 955BE
Posted on Reply
#20
cadaveca
My name is Dave
by: HTC
I wonder how hot these CPUs will get ...
Very hot...apparently we'll see a clockspeed decrease (which I assume is due to the high levels of cache), but IPC will increase. I'm kinda expecting 2.4 GHz or so...maybe lower...for launch chips.
Posted on Reply
#21
bear jesus
by: cadaveca
Very hot...apparently we'll see a clockspeed decrease (which I assume is due to the high levels of cache), but IPC will increase. I'm kinda expecting 2.4 GHz or so...maybe lower...for launch chips.
Just a good reason for me to get my first real water cooling setup :D (assuming i am happy with the reviews of bulldozer)
Posted on Reply
#22
cadaveca
My name is Dave
I don't know anything about it, really. However, there is mention of the clockspeed decrease on the AMD blog site. Now that we have the info on cache size...1+1=2. Of course, there's lots of time between now and launch...seems to me they are refining the process, and a few bugs, at this point.
Posted on Reply
#23
ROad86
by: mechtech
I want one, a server version with 8 or 16 GB of ecc ram :D I don't know why though since I don't even work 1 core on my 955BE
Haha me too!!! :laugh:
Posted on Reply
#24
bear jesus
by: cadaveca
I don't know anything about it, really. However, there is mention of the clockspeed decrease on the AMD blog site. Now that we have the info on cache size...1+1=2. Of course, there's lots of time between now and launch...seems to me they are refining the process, and a few bugs, at this point.
Hmm, I wonder if they will follow Intel's lead (referring to the cooler that comes with the top-end i7s) by using a better cooler for the high-end CPUs if they run hot. It would be nice to see a better cooler than the current ones, as I am not really a fan of them.
Posted on Reply
#25
afw
by: cadaveca
I don't know anything about it, really. However, there is mention of the clockspeed decrease on the AMD blog site. Now that we have the info on cache size...1+1=2. Of course, there's lots of time between now and launch...seems to me they are refining the process, and a few bugs, at this point.
Well, I read that Bulldozer will do more instructions per clock ... so it will be interesting to see what it's capable of
Bulldozer: The Turbo Diesel Engine
In many respects, the Bulldozer architecture is comparable to a diesel engine. Lower RPM (clock-speeds), high torque (instructions per second). When implemented, Bulldozer-based processors could outperform competing processor architectures at much lower clock speeds, due to one critical area AMD seems to have finally addressed: instructions per clock (IPC), unlike with the 65 nm "Barcelona" or 45 nm "Shanghai" architectures that upped IPC synthetically by using other means (such as backing the cores up with a level-3 cache, upping the uncore/northbridge clock speeds), the 32 nm Bulldozer actually features a broad integer unit with eight integer pipelines split into two portions, each portion having its own scheduler and L1 Data cache.
source ---> http://www.techpowerup.com/129392/AMD_Details_Bulldozer_Processor_Architecture.html
Posted on Reply