Friday, September 24th 2010

AMD Orochi "Bulldozer" Die Holds 16 MB Cache

Documents related to the "Orochi" 8-core processor by AMD, based on its next-generation Bulldozer architecture, reveal a cache hierarchy that comes as a bit of a surprise. Earlier this month, at a GlobalFoundries-hosted conference, AMD displayed the first die-shot of the Orochi die, which legibly showed key features, including the four Bulldozer modules that hold two cores each, and large L2 caches. On coarse visual inspection, the L2 cache of each module seems to cover 35% of its area, while the L3 cache runs along the center of the die. The documents seen by X-bit Labs reveal that each Bulldozer module has its own 2 MB L2 cache shared between its two cores, plus an 8 MB L3 cache shared between all four modules (8 cores).

This takes the total cache amount of Orochi all the way up to 16 MB. The hierarchy suggests that AMD wants to give individual cores access to a large amount of faster cache (a whopping 2048 KB, compared to 512 KB per core on Phenom and 256 KB per core on Core i7), which facilitates faster inter-core, intra-module communication. Inter-module communication is served by the 8 MB L3 cache. Compared to the current "Istanbul" six-core K10-based die, that is a 77% increase in total cache for a 33% increase in core count, and a 300% increase in L2 cache accessible per core. Built on a 32 nm GlobalFoundries process, Orochi is sure to have a very high transistor count.

Source: Xbit Labs
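The arithmetic behind those percentages is easy to verify; a quick sketch (using Istanbul's published K10 figures of 6 × 512 KB L2 plus 6 MB L3, and the Orochi figures above):

```python
# Cache totals: AMD "Istanbul" (6-core K10) vs "Orochi" (8-core Bulldozer).
istanbul_l2_kb = 6 * 512                 # 512 KB L2 per core
istanbul_l3_kb = 6 * 1024                # 6 MB shared L3
istanbul_total = istanbul_l2_kb + istanbul_l3_kb   # 9216 KB (9 MB)

orochi_l2_kb = 4 * 2048                  # 2 MB L2 per two-core module
orochi_l3_kb = 8 * 1024                  # 8 MB shared L3
orochi_total = orochi_l2_kb + orochi_l3_kb         # 16384 KB (16 MB)

cache_increase = (orochi_total / istanbul_total - 1) * 100   # ~77.8%
core_increase = (8 / 6 - 1) * 100                            # ~33.3%
# Each Orochi core can reach its module's full 2048 KB L2 vs 512 KB on K10.
l2_reach_increase = (2048 / 512 - 1) * 100                   # 300%

print(f"{cache_increase:.1f}% more cache, {core_increase:.1f}% more cores, "
      f"{l2_reach_increase:.0f}% more L2 reachable per core")
```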

152 Comments on AMD Orochi "Bulldozer" Die Holds 16 MB Cache

#1
JF-AMD
AMD Rep (Server)
bear jesus said:
To be honest i kind of wish bulldozer would come in a g34 socket (1974 pins) so that it came with quad-channel ram... although i doubt it would make much difference apart from benchmarks and maybe virtual machines.
What if you got much greater throughput without having to increase memory channels?
#2
bear jesus
JF-AMD said:
What if you got much greater throughput without having to increase memory channels?
Then that would be perfect, even more so with high-density modules becoming more normal, as 8gb over dual channel with higher bandwidth/throughput would be great imo.
#3
cheezburger
Completely Bonkers said:
I remember the "massive cache" Gallatin P4's over Northwood. Didn't make more than 5% difference clock for clock except in very special circumstances.

So let's wait for benchmarks.
gallatin has 2mb of l3, but its l2 was cut in half to only 256kb compared to northwood's 512kb, so per-clock performance actually decreased rather than increased. gallatin's only advantage was that it clocked very high compared to northwood's 3.06ghz limit.


Completely Bonkers said:
I would have thought there would be better gains by rethinking cache and memory entirely, possibly producing a separate socket for L3 cache just like in the old days. It would be so much cheaper to do it that way, you could easily pack 256MB cache. Yes, the latency would be worse than current on-die L3 cache, but with the space, heat and transistors saved, you could bump up L1 and L2 cache and win back any performance losses. Plus you could build your L3 cache to order.
it would be the worst scenario to do so. i still remember how an 850mhz slot 1 pentium iii couldn't even keep pace with a 533mhz coppermine because of its 1/3-speed cache and extremely high latency. going back to a slot would be as stupid as going back from the core to the netburst architecture. a cheap price doesn't mean anything when you don't even have basic performance... plus, why would we need external low-performance cache when we already have high-speed ram available?

ROad86 said:
First orochi is 4 module - 8 core design. Second not only the size but how fast is the cache. Third it is very important how the prediction of instructions will work, if the design is good then you dont need big L1 cache which increase cost and die size. And yes 2mb per module 1 mb per core is the amount that bulldozer will have.
the problem is that the 64kb l1 instruction cache and the l2 cache are uncore. that is a huge difference. it would leave each bulldozer core with theoretically only 8kb of l1 cache and no built-in l2 cache. that makes bulldozer quite different from its counterpart, as intel wraps everything inside each core except the pcie controller, memory controller and l3 cache. they need a larger l1 cache because their l1 cache is way slower than intel's, and now their l1 cache per core is only 8kb. it's hard to imagine they can outperform any intel line...

instruction prediction is the same thing intel did a long time ago, back in the netburst days. such a feature only works when you have a ridiculous number of pipeline stages and a trace cache, and despite everything they did with it, netburst still ended up performing pathetically in every bench.
#4
1Kurgan1
The Knife in your Back
Very cool to see this, can't wait to see what Bulldozer can really do. Loving my 6 core; hopefully they will have some lower-priced ones, that's why I've always been a fan of AMD.
#5
CDdude55
Crazy 4 TPU!!!
Shaping up to be an awesome architecture, hopefully it can actually walk above i7 while maintaining a decent price tag. If that's the case, and i actually have a job by then, i will definitely be considering moving up to this.:)
#6
cheezburger
CDdude55 said:
Shaping up to be an awesome architecture, hopefully it can actually walk above i7 while maintaining a decent price tag. If that's the case, and i actually have a job by then, i will definitely be considering moving up to this.:)
wait until benches come up first. but i doubt an 8kb l1 cache on each core can do much of shit.......
#7
DigitalUK
JF-AMD said:
What if you got much greater throughput without having to increase memory channels?
is that a hint that bulldozer could be dual channel?
#8
CDdude55
Crazy 4 TPU!!!
cheezburger said:
wait until benches come up first. but i doubt an 8kb l1 cache on each core can do much of shit.......
I am waiting for the benchmarks most definitely.
#9
ROad86
cheezburger said:
wait until benches come up first. but i doubt an 8kb l1 cache on each core can do much of shit.......
http://techreport.com/r.x/bulldozer-uarch/bulldozer-frontend.jpg

The module's front end includes a prediction pipeline, which predicts what instructions will be used next. A separate fetch pipeline then populates the two instruction queues—one for each thread—with those instructions. The decoders convert complex x86 instructions into the CPU's simpler internal instructions. Bulldozer has four of these, like Nehalem, while Barcelona has three.

Each module has a trio of schedulers, one for each integer core and one for the FPU.

This is from techreport and explains just fine. There is no 8kb L1 cache per core. If i am making a mistake please correct me.

And since we have JF-AMD at the forum please explain this clearly!
#10
cheezburger
ROad86 said:
http://techreport.com/r.x/bulldozer-uarch/bulldozer-frontend.jpg

The module's front end includes a prediction pipeline, which predicts what instructions will be used next. A separate fetch pipeline then populates the two instruction queues—one for each thread—with those instructions. The decoders convert complex x86 instructions into the CPU's simpler internal instructions. Bulldozer has four of these, like Nehalem, while Barcelona has three.

Each module has a trio of schedulers, one for each integer core and one for the FPU.

This is from techreport and explains just fine. There is no 8kb L1 cache per core. If i am making a mistake please correct me.

And since we have JF-AMD at the forum please explain this clearly!
it was confirmed that there would be either 8~16kb of in-core l1 data cache, while the instruction cache is uncore. very unlikely for a typical x86 design. if we all know how slow intel's l3 cache is because it's uncore, then why would bulldozer put everything outside the core and leave each core with only basic functions? correct me about the pipeline stages in bulldozer, but it seems unlike a typical x86 design... from what i know, each prediction pipeline controls two instructions, which theoretically makes it 4 pipelines per core. but is it really powerful enough to use fewer pipelines like this? won't that cost clockrate per core? and how is it even possible to separate the pipeline stages from the core and make them uncore?
#11
JF-AMD
AMD Rep (Server)
The L1 cache is not 8k. Check my blog in a week or so for the answer. There is an L1 instruction cache shared between two cores, an L1 data cache per core, and an L2 shared between two cores. The L3 is shared at the die level.
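JF-AMD's description maps to a simple per-module sharing scheme. A minimal sketch, recording only the sharing scope (plus the L2/L3 sizes from the article; L1 sizes were not disclosed at this point):

```python
# Cache sharing in a Bulldozer module, per JF-AMD's description above.
# L1 sizes were not disclosed, so only the sharing scope is recorded.
cache_scope = {
    "L1 instruction": "shared between the module's two cores",
    "L1 data":        "private, one per core",
    "L2":             "shared between the module's two cores (2 MB)",
    "L3":             "shared by all modules on the die (8 MB)",
}

for level, scope in cache_scope.items():
    print(f"{level}: {scope}")
```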
#12
JF-AMD
AMD Rep (Server)
DigitalUK said:
is that a hint that bulldozer could be dual channel?
If am3+ socket also supports am3 chips, what do you think?
#13
wahdangun
JF-AMD said:
If am3+ socket also supports am3 chips, what do you think?
wow, if that's true and doesn't cripple bulldozer performance then that's great,

do you know if AMD will be releasing a 980G chipset?
#14
JF-AMD
AMD Rep (Server)
I am a server guy, I don't know about client stuff.
#15
bear jesus
JF-AMD said:
I am a server guy, I don't know about client stuff.
I must admit i love the fact that you are active here and on other forums i visit. the personal touch, along with the fact that a company has people willing to talk to the bottom-end customers, really makes a difference when it comes to answering questions and cutting through the "marketing talk", so i just wanted to thank you for taking the time to talk to us across multiple forums, even outside office hours.
#16
JF-AMD
AMD Rep (Server)
There is no such thing as a "bottom end customer". There are either customers or people who will be customers. And both are the people that pay my salary.
#17
ERazer
JF needs a title ;) have u contacted the mods?
#18
cadaveca
My name is Dave
I think he's legit. ;)
#19
cheezburger
JF-AMD said:
The L1 cache is not 8k. Check my blog in a week or so for the answer. There is an L1 instruction cache shared between two cores, an L1 data cache per core, and an L2 shared between two cores. The L3 is shared at the die level.
incorrect..... the l1 instruction cache is shared by one module (2 cores), the l2 is shared by two modules, and the l3 is shared by all modules...

and about the 8k l1 data... i remember seeing the spec from anandtech three months ago... however i found wiki listing a 16k l1 cache... i'd rather believe anandtech's source...
#20
Techtu
bear jesus said:
Hmm i wonder if they will follow intel's lead (referring to the cooler that comes with the top-end i7's) by using a better cooler for the high-end cpus if they run hot. would be nice to see a better cooler than the current ones as i am not really a fan of them.
Meh... the current stock cooler/fan comes with heat pipes; 10 years ago that was unheard of... be grateful :toast:
#21
bear jesus
JF-AMD said:
There is no such thing as a "bottom end customer". There are either customers or people who will be customers. And both are the people that pay my salary.
Thank you for correcting me, you are right, and once again i'm just thankful amd employs people like you who are willing to put the effort in with the community.
#22
bear jesus
Techtu said:
Meh... the current stock cooler/fan comes with heat pipes; 10 years ago that was unheard of... be grateful :toast:
Ok i admit i am grateful for the copper-based hsf with copper heatpipes... even if i did just put it on a cpu that's 2 generations old and used my corsair h50 on the cpu the original hsf came with :p
#23
Techtu
bear jesus said:
Thank you for correcting me, you are right, and once again i'm just thankful amd employs people like you who are willing to put the effort in with the community.
bear jesus said:
Ok i admit i am grateful for the copper-based hsf with copper heatpipes... even if i did just put it on a cpu that's 2 generations old and used my corsair h50 on the cpu the original hsf came with :p
Tut tut... double posting, that's a no-no :p

I understand where you're coming from, but for any enthusiast the stock coolers are just not enough; then again, if they were we wouldn't be very good enthusiasts, would we :D
#24
bear jesus
Techtu said:
Tut tut... double posting, that's a no-no :p

I understand where you're coming from, but for any enthusiast the stock coolers are just not enough; then again, if they were we wouldn't be very good enthusiasts, would we :D
Sorry, too much vodka and it being past 5am made me get confused to the fact that i was posting within the same thread, not 2 separate ones lol.

but i suppose i can add this: i still have the old all-aluminum heatsink that came with the athlon x2, which i am currently readying to be chopped up for other uses, so i am grateful for the heatsinks that come with amd processors and am glad they are now heatpipe coolers, as even after chopping them up i can make good use of them.

(no more drunken double posting... at least tonight lol :p)
#25
JF-AMD
AMD Rep (Server)
cheezburger said:
incorrect..... the l1 instruction cache is shared by one module (2 cores), the l2 is shared by two modules, and the l3 is shared by all modules...

and about the 8k l1 data... i remember seeing the spec from anandtech three months ago... however i found wiki listing a 16k l1 cache... i'd rather believe anandtech's source...
You have 2 opinions to choose from:

1. A reporter who has never touched the product

or

2. The director of product marketing for servers at AMD


Choose carefully, there will be a test at the end of the class.