Friday, September 24th 2010

AMD Orochi "Bulldozer" Die Holds 16 MB Cache

Documents related to the "Orochi" 8-core processor by AMD, based on its next-generation Bulldozer architecture, reveal a cache hierarchy that comes as a bit of a surprise. Earlier this month, at a GlobalFoundries-hosted conference, AMD displayed the first die-shot of Orochi, which legibly showed key features, including the four Bulldozer modules that hold two cores each, and large L2 caches. On a coarse visual inspection, the L2 cache of each module appears to cover about 35% of its area. The L3 cache is located along the center of the die. The documents seen by X-bit Labs reveal that each Bulldozer module has its own 2 MB L2 cache shared between its two cores, and that an 8 MB L3 cache is shared between all four modules (8 cores).

This takes the total cache amount of Orochi all the way up to 16 MB. The hierarchy suggests that AMD wants to give individual cores access to a large amount of faster cache (a whopping 2048 KB, compared to 512 KB per core on Phenom and 256 KB per core on Core i7), which facilitates faster inter-core, intra-module communication. Inter-module communication is served by the 8 MB L3 cache. Compared to the current "Istanbul" six-core K10-based die, that is a 77% increase in cache amount for a 33% increase in core count, and a 300% increase in L2 cache per core. Built on a 32 nm GlobalFoundries process, Orochi is sure to have a very high transistor count.

Source: X-bit Labs
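The percentage figures quoted above can be sanity-checked in a few lines. This is a minimal sketch assuming the rumored Orochi cache sizes and Istanbul's known 6 × 512 KB L2 + 6 MB L3 configuration; all variable names are illustrative:

```python
# Sanity-checking the article's cache arithmetic (sizes in KB).
# Orochi figures are the rumored ones, not AMD-confirmed.

MB = 1024

# Orochi: 4 modules x 2 MB L2, plus an 8 MB shared L3
orochi_total = 4 * 2 * MB + 8 * MB      # 16 MB
orochi_cores = 8

# Istanbul (K10): 6 cores x 512 KB L2, plus a 6 MB shared L3
istanbul_total = 6 * 512 + 6 * MB       # 9 MB
istanbul_cores = 6

cache_increase = orochi_total / istanbul_total - 1   # ~0.78, i.e. the ~77% quoted
core_increase = orochi_cores / istanbul_cores - 1    # ~0.33
l2_increase = (2 * MB) / 512 - 1                     # 3.0, i.e. 300%

print(f"total cache: +{cache_increase:.0%}, cores: +{core_increase:.0%}, "
      f"L2 reachable per core: +{l2_increase:.0%}")
```

Note the rounding: 16 MB over Istanbul's 9 MB is strictly a 77.8% increase, which the article rounds down to 77%.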

152 Comments on AMD Orochi "Bulldozer" Die Holds 16 MB Cache

#1
Techtu
JF-AMD said:
You have 2 opinions to choose from:

1. A reporter who has never touched the product

or

2. The director of product marketing for servers at AMD


Choose carefully, there will be a test at the end of the class.
Well put :toast:
#2
enaher
JF-AMD said:
L1 cache is not 8k. Check my blog in a week or so for the answer. There is l1 instruction shared between two cores, l1 data per core and l2 shared between 2 cores. L3 is shared at the die level
Wow, that's some aggressive increase and management of cache. I hope performance increases; AMD's success is good for us, it might bring prices down and performance up.
#3
bear jesus
enaher said:
I hope performance increases; AMD's success is good for us, it might bring prices down and performance up.
I agree. Luckily I can wait until after Bulldozer is out to make my next big upgrade, so if it is a great CPU/good value I will of course go for it, and if not, then I would hope Intel's current gen will have had a price drop by then.

Although I admit I would prefer to go with AMD, as I had an AMD K6 back in the day and was so happy with it, and was lucky enough to go through 3 AMD CPUs on my current motherboard. I would love to keep supporting them and, if I am lucky, go through another 2-3 CPUs on my next motherboard, as I hate the idea of having to change motherboards every time I upgrade my CPU.
#4
btarunr
Editor & Senior Moderator
cheezburger said:
incorrect.....the l1 instruction share by one module(2 cores) and l2 is share by two modules and l3 is share by all modules...
L2 is shared between two cores within one module.

And yes, JF-AMD indeed is director of product marketing for servers at AMD. Waiting for W1zzard to give him his title. He may have known details about Orochi months before anyone else did.
#5
wahdangun
btarunr said:
L2 is shared between two cores within one module.

And yes, JF-AMD indeed is director of product marketing for servers at AMD. Waiting for W1zzard to give him his title. He may have known details about Orochi months before anyone else did.
So it's basically a Core 2 Duo?
#6
Wile E
Power User
I can't wait to see the numbers. I also hope that the new architecture still overclocks well. Competition at the top end would be killer. I want my $1000 chips to either become $500 chips, or become twice as fast for the $1000.
#7
largon
Oh boy... Here we go.
cheezburger said:
they are try to fix the single thread performance hit due to the smaller l1 data/instruction.
As if they would have had any problems slapping in L1s equally sized or larger than Hammer's... It's not like this is AMD's first CPU architecture ever, or that adding such an amount would be of any die-area concern. And for comparison, Nehalem has 32 kB per core, 16 kB per thread, AND a tiny 256 kB L2 - I bet Intel must be struggling with a similar performance hit.
cheezburger said:
each core "only" had 8kb l1 data
Err... No.
Each Bulldozer module has two sets of integer pipelines, and both of them have a dedicated 16 kB L1D. 16+16 kB in total per module, 16 kB per thread.
cheezburger said:
while the instruction cache is share by module which just only 64kb "2 way" in cache(could have be less...i think...)
Bulldozer's L1I is 64 kB; that's been public for some time now. About the bracketed comment: do you think it could have been smaller, or are you not sure what size it is?
cheezburger said:
which is roughly 40kb per core compare to core's 64kb per core. big disadvantage.
If you say so...
cheezburger said:
so all they can do is add more l3 cache to increase the performance (...) same thing intel did when realized northwood its poor l1 cache will drag down performance they increase l2 cache from 256kb to 512kb.
And by coincidence, Intel is doing the same. "Obviously" they too must be patching Core m-arch's "poor L1s and L2s" by adding cache levels and continuously increasing their size.
cheezburger said:
however orochi is 8 module 16 core processor
No. Orochi is a 4-module, 8-core processor.
cheezburger said:
so featuring 16mb l3 meant each core can use up to 1mb l3. still way below nehalem's 2mb per core.
Durrr...
Bulldozer does not have a 16 MB L3; even reading the thread title should give away that the L3 is 8 MB. 2 MB L2 + 2 MB L3 per module, that is. Thus, per module, Orochi has 8× as much L2 as Nehalem and an equal L3 ratio.
cheezburger said:
also unlike intel's architecture amd's cache heavily determine by the stage pipeline.
A strange conclusion, considering the public (that includes me and you) doesn't know Bulldozer's exact pipeline length yet.
cheezburger said:
lower stage pipeline won't take advantage on bigger cache. but since bulldozer will featuring 4+ghz i doubt this will be at least 20+ stage pipeline in this processor.
Broken sentence. What are you trying to say?
Do you believe it is a 20+ stage pipeline, or do you not?
Also, the clock rates are completely unknown to the public.
cheezburger said:
but despite all these feature as long as intel decide to increase ivy bridge's l2 cache from 256k per core to 512k per core amd will experience same horror they faced when core 2 came out.
Oh really? Now one can only wonder why Intel didn't see such a shortcoming of their L2 before taping out Nehalem and Sandy Bridge... They must have missed the fact that their chips' L2 had shrunk to a fraction of its size compared to Conroe and Penryn.

PS.
In case you find some parts of my reply sarcastic, it is highly likely you are right.

Abstract for those with "TL;DR" syndrome:
Burger, please get your facts straight. The factual errors I've pointed out are public knowledge; go read up. And please do pay attention to writing proper English; often it is impossible to figure out what you're trying to say, as many of your sentences are missing words, and the words that are there are often misspelled.
#8
Wyverex
largon, save your breath, he even argued with the AMD guy and called his info false, lol
JF-AMD, thank you for your contribution to the thread.

I'm really looking forward to Bulldozer and I hope it succeeds, both in Server and Desktop markets :)
#9
btarunr
Editor & Senior Moderator
largon said:
Durrr...
Bulldozer does not have a 16 MB L3; even reading the thread title should give away that the L3 is 8 MB. 2 MB L2 + 2 MB L3 per module, that is. Thus, per module, Orochi has 8× as much L2 as Nehalem and an equal L3 ratio.
Sorry largon, but it's 2 MB L2 per module, 8 MB L3 shared between all four modules. There is no L3 cache at the sub-modular level. Hence the total cache is 16 MB (AMD denotes total L2 + L3 as "total cache").
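For clarity, the layout described here can be written out as a tiny sketch. The sizes are the rumored figures, not AMD-confirmed, and the dict layout is purely illustrative:

```python
# Rumored Orochi cache hierarchy (sizes in KB) - not AMD-confirmed.
hierarchy = {
    "modules": 4,
    "cores_per_module": 2,
    "l2_per_module_kb": 2048,  # shared by the 2 cores of each module
    "l3_shared_kb": 8192,      # shared by all 4 modules
}

# AMD's "total cache" figure counts L2 + L3 together:
total_kb = (hierarchy["modules"] * hierarchy["l2_per_module_kb"]
            + hierarchy["l3_shared_kb"])
print(total_kb // 1024, "MB")  # prints: 16 MB
```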
#10
largon
btarunr said:
Sorry largon, but it's 2 MB L2 per module, 8 MB L3 shared between all four modules. There is no L3 cache at the sub-modular level. Hence the total cache is 16 MB (AMD denotes total L2 + L3 as "total cache").
You're misinterpreting me. My "2 MB L3 per module" is only a way to state a ratio, not the actual configuration.


cheezburger said:
the problem is that 64kb l1 instruction cache and l2 cache are uncore. that is a huge difference. it will make each of bulldozer core have theoretically only 8kb l1 cache while no l2 cache built in.
What?
That's just not true. Bulldozer's L1I and L2 are fully integrated parts of the BD module, and they run at core frequency, no less.
cheezburger said:
they need larger l1 cache because their l1 cache is way slower than intel's cache.
Bulldozer has 4T L1 latency, same as Nehalem's.
cheezburger said:
and now their l1 cache on each core only 8kb. it will be hard to imagine they can outperform any intel line...
Especially if the one "imagining things" is using incorrect numbers...
cheezburger said:
instruction prediction, same thing that intel had done long time ago when back to netburst time. such feature only work when you have ridiculous number of pipeline and a trace cache.
What can I say, once again you astound (but not surprise) by posting utter nonsense.
cheezburger said:
but despite everything they had done with it they still end up performing pathetic in every benches
Feeling particularly "blue", perhaps? And by saying that I'm not referring to mood.

But what can you do, a troll is a troll is a troll.
#12
ROad86
I have a question that maybe JF-AMD may not know the answer to, since he is in the server section, but I want to ask: will AMD present in the future 6-core (3-module) or 4-core (2-module) products with a lower price?

Or will it be variations in the clock rate of the Orochi design?
#13
Imsochobo
cadaveca said:
I think he's legit. ;)
He answers stuff like I answer stuff about my company: as short as possible :p
#14
Imsochobo
Will AMD improve the southbridge, hard drive performance, and such? Your NBs are quite good.

2nd, these will be so different compared to K8/K10/K10.5 that vMotion won't work from K10.5 -> Bulldozer?

If we still can, I'll be praising AMD for my servers for a few more years! :P
#15
btarunr
Editor & Senior Moderator
Imsochobo said:
Will amd improve the southbridge, harddrive performance, and such ?
There's nothing particularly bad about AMD's storage performance with a proper mode (AHCI or RAID) and a proper driver (AMD over Microsoft) installed. The RAID controller sucked only up to the SB600 southbridge (which had Silicon Image logic that wasn't implemented so well). SB700/SB710/SB750 are on par with ICH10/R, and SB850 has no match (SATA 6 Gb/s).
#16
Imsochobo
btarunr said:
There's nothing particularly bad about AMD's storage performance with a proper mode (AHCI or RAID) and a proper driver (AMD over Microsoft) installed. The RAID controller sucked only up to the SB600 southbridge (which had Silicon Image logic that wasn't implemented so well). SB700/SB710/SB750 are on par with ICH10/R, and SB850 has no match (SATA 6 Gb/s).
Still not up there. I wonder why an SSD scores 7.3 with my SB750 while with my ICH10/R it does 7.5 in Windows, and why it gets about 10 MB/s more sequential, better 4K, 512, and so on. It's not by much, but it's getting beaten by both NVIDIA and Intel.

http://www.tomshardware.com/reviews/ich10r-sb750-780a,2374-10.html
I just googled a bit to find some review. Never trusted Tom's too much, but yeah :p

It's not like I'm banging my head against the wall over my SSD performance; it's just that there is more to get here!
#17
btarunr
Editor & Senior Moderator
Those are access times (in the URL you posted); the lower, the better. You can see how SB750 and ICH10R are on par in most access-time tests. Anyway, 7.3 to 7.5 is a big deviation in WPI, but maybe other factors were at play (such as the ICH10R system being tested on a clean(er) installation than the SB750 system).
#18
ROad86
Windows numbers are inaccurate. In my build a Western Digital 640 GB on a Gigabyte 790XT-UD4P was scoring 5.9 on the IDE interface. After the format I changed IDE to AHCI and it now scores 7.5. I don't know why; I ran the test many times and still got the same result. (By the way, SB750 is the southbridge.)

Now I want to ask JF-AMD another question, related to the previous one. In his blog he mentions that from 33% more cores we get 50% more performance. The test was between Magny-Cours (12-core) and Interlagos (16-core, Bulldozer architecture). Will we get the same amount of performance increase on the client processors? Because the increase from 6 to 8 cores equals nearly 33%, should we expect a 50% performance jump from Phenom II? If this happens, will it come with an equal increase in price?
#19
btarunr
Editor & Senior Moderator
Again, the client processors will perform differently. Client systems use fewer DIMMs and usually lower-latency memory (DDR3 servers use failsafe 1066 MHz @ 9-9-9-24T settings as a standard). Client processors have 3/4 HT links disabled, etc., etc. So a server-to-client comparison isn't apples-to-apples.
#20
ROad86
btarunr said:
Again, the client processors will perform different. Client systems will use lower number of DIMMs, usually lower latency memory (servers use failsafe 1066 MHz @ 9-9-9-24T). Client processors have 3/4 HT links disabled, etc., etc. So server to client comparison isn't apples-to-apples.
I totally agree with you :). But imagine (and that is speculation) a performance jump of up to 40%. It would match or even outperform Sandy Bridge. If that happens, what will the prices be? I wish they won't increase prices the way Intel does.
#21
JF-AMD
AMD Rep (Server)
Folks, all we have disclosed in public about cache is the L1 size (that I posted earlier.)

We have not disclosed L2 or L3 sizes, so whatever you quote is not confirmed, only speculation.

L1 is within the core. L2 is within the module. L3 is within the die.
#22
btarunr
Editor & Senior Moderator
ROad86 said:
I totally agree with you :). But imagine (and that is speculation) a performance jump of up to 40%. It would match or even outperform Sandy Bridge. If that happens, what will the prices be?
If AMD has a faster processor architecture, it will ask whatever it wants to. It's a corporation.

Just as Intel asks $999 for its Extreme Edition SKUs, AMD used to ask for the same $999 for its FX SKUs (back when K8 was the best client CPU architecture out there). Even today AMD can try to ask for more than $275, if it wants to develop the QuadFX platform. Enthusiasts always have $999 to spend on one Core i7 or two DSDC-capable Phenom II chips in the s1207 package. It's just that AMD's client CPU team has to wake up to that realization. Power and board costs are lame excuses.
#23
JF-AMD
AMD Rep (Server)
ROad86 said:
Windows numbers are inaccurate. In my build a Western Digital 640 GB on a Gigabyte 790XT-UD4P was scoring 5.9 on the IDE interface. After the format I changed IDE to AHCI and it now scores 7.5. I don't know why; I ran the test many times and still got the same result. (By the way, SB750 is the southbridge.)

Now I want to ask JF-AMD another question, related to the previous one. In his blog he mentions that from 33% more cores we get 50% more performance. The test was between Magny-Cours (12-core) and Interlagos (16-core, Bulldozer architecture). Will we get the same amount of performance increase on the client processors? Because the increase from 6 to 8 cores equals nearly 33%, should we expect a 50% performance jump from Phenom II? If this happens, will it come with an equal increase in price?
You can't do the math that way, but there will be a very good performance gain.

With servers you are measuring throughput, which is how much stuff you can jam through a pipe at full utilization. Client loads are more bursty, so throughput is a less relevant measure.
#24
ROad86
JF-AMD said:
You can't do the math that way, but there will be a very good performance gain.

With servers you are measuring throughput, which is how much stuff you can jam through a pipe at full utilization. Client loads are more bursty, so throughput is a less relevant measure.
Thanks JF!
#25
largon
JF-AMD said:
Folks, all we have disclosed in public about cache is the L1 size (that I posted earlier.)

We have not disclosed L2 or L3 sizes, so whatever you quote is not confirmed, only speculation.
Interesting that you commented at all...
:D

Now that I took another look at the released (heavily pixelated & manipulated) die shot, it might just be that Bulldozer's L3 is either just 6 MB or a whopping 12 MB. That, or the cells in the L3 are actually less dense than those in the L2.