• Welcome to TechPowerUp Forums, Guest! Please check out our forum guidelines for info related to our community.

AMD Orochi ''Bulldozer'' Die Holds 16 MB Cache

btarunr

Editor & Senior Moderator
Staff member
Joined
Oct 9, 2007
Messages
47,677 (7.43/day)
Location
Dublin, Ireland
System Name RBMK-1000
Processor AMD Ryzen 7 5700G
Motherboard Gigabyte B550 AORUS Elite V2
Cooling DeepCool Gammax L240 V2
Memory 2x 16GB DDR4-3200
Video Card(s) Galax RTX 4070 Ti EX
Storage Samsung 990 1TB
Display(s) BenQ 1440p 60 Hz 27-inch
Case Corsair Carbide 100R
Audio Device(s) ASUS SupremeFX S1220A
Power Supply Cooler Master MWE Gold 650W
Mouse ASUS ROG Strix Impact
Keyboard Gamdias Hermes E2
Software Windows 11 Pro
Documents related to the "Orochi" 8-core processor by AMD based on its next-generation Bulldozer architecture reveal its cache hierarchy that comes as a bit of a surprise. Earlier this month, at a GlobalFoundries hosted conference, AMD displayed the first die-shot of the Orochi die, which legibly showed key features including the four Bulldozer modules which hold two cores each, and large L2 caches. In coarse visual inspection, the L2 cache of each module seems to cover 35% of its area. L3 cache is located along the center of the die. The documents seen by X-bit Labs reveal that each Bulldozer module has its own 2 MB L2 cache shared between two cores, and an L3 cache shared between all four modules (8 cores) of 8 MB.

This takes the total cache count of Orochi all the way up to 16 MB. This hierarchy suggests that AMD wants to give individual cores access to a large amount of faster cache (that's a whopping 2048 KB compared to 512 KB per core on Phenom, and 256 KB per core on Core i7), which facilitates faster inter-core, intra-module communication. Inter-module communication is enhanced by the 8 MB L3 cache. Compared to the current "Istanbul" six-core K10-based die, that's a 77% increase in cache amount for a 33% core count increase, 300% increase in L2 cache per core. Orochi is built on a 32 nm GlobalFoundries process, it is sure to have a very high transistor count.

View at TechPowerUp Main Site
 
BL GG Intel Fanboys. AMD is back! :nutkick:
 
I'll believe it's a performance gain when I see the benchmarks. Regardless of which side you take, competition is always good for the consumer.
 
wait for benchmarks before you start that, we've been through that before with amd
 
2011 is shaping up to be quite an interesting year :)
 
I remember the "massive cache" Gallatin P4's over Northwood. Didn't make more than 5% difference clock for clock except in very special circumstances.

So let's wait for benchmarks.

I would have thought there would be better gains by rethinking cache and memory entirely, possibly producing a separate socket for L3 cache just like in the old days. It would be so much cheaper to do it that way, you could easily pack 256MB cache. Yes, the latency would be worse than current on-die L3 cache, but with the space, heat and transistors saved, you could bump up L1 and L2 cache and win back any performance losses. Plus you could build your L3 cache to order.
 
That's it????? I wait for the day with 16 cores with 64MB of Cache
 
Well it seems Bulldozer is going to be faster when communicating with memory and other cores. I think if AMD just did that to a phenom 2 chip it would speed it up significantly. I really cant wait to see bulldozer in action.
 
I would hope more faster cache could be a good thing but the main thing im interested in is how each modual performs, i'm really thinking about getting a high end sandy bridge or bulldozer to last me a couple years or so and that means i want as many and as fast a cores as possible as i would hope over the next few years more software will use more cores.
 
I'm totally noob in CPU technologies but I think 16MB cache it's a freaking cool, right?
 
Last edited:
it is sure to have a very high transistor count.

so does fermi, i hope amd has the tdp under control, otherwise sandy will kick butt
 
I'm totally noob in CPU technologies but I think 16MB cache it's a freaking cool, right?

It could be if put to use well but the core's are really importaint, either way we won't know untill the reviews really.
 
One design win I really commend AMD for is their use of dynamic cache allocation between the "cores" on a module. While many assume the sharing of cache (and other items like the FPU) will hurt single threaded performance, that really isn't the case. When only one core is active per module, it has complete control over all the resources; thus a single core will have 2mb L2 cache at its disposal! Also, when both cores on a module are active, they can inequitably share the resources (ie one core with .5mb L2 and another with 1.5mb L2 is possible). Very cool technology.

For Bulldozer, there will be the option to have the OS prefer loading one core per module (like cores 1, 3, 5, 7) rather than just filling them up by modules (1, 2, 3, 4). Both have benefits and faults: the first route has higher performance, but also higher power consumption; the second would be the exact opposite.

As far as the sharing of the FPU, in servers it will make hardly any difference. In the desktop segment, AMD argues that should you be doing something that takes up so much FPU performance to slow down our modules, then you should be doing it on the GPU instead.
 
I like this news. I ahve been saying for a couple of years now that AMD's cache design needed to cahnge, and here, they are doing something about it. That makes me even more interested in Bulldozer tech.
 
One design win I really commend AMD for is their use of dynamic cache allocation between the "cores" on a module. While many assume the sharing of cache (and other items like the FPU) will hurt single threaded performance, that really isn't the case. When only one core is active per module, it has complete control over all the resources; thus a single core will have 2mb L2 cache at its disposal! Also, when both cores on a module are active, they can inequitably share the resources (ie one core with .5mb L2 and another with 1.5mb L2 is possible). Very cool technology.

For Bulldozer, there will be the option to have the OS prefer loading one core per module (like cores 1, 3, 5, 7) rather than just filling them up by modules (1, 2, 3, 4). Both have benefits and faults: the first route has higher performance, but also higher power consumption; the second would be the exact opposite.

As far as the sharing of the FPU, in servers it will make hardly any difference. In the desktop segment, AMD argues that should you be doing something that takes up so much FPU performance to slow down our modules, then you should be doing it on the GPU instead.

I never knew it would be set up like that, kind of makes me even more sure i want to wait for bulldozer for my next full upgrade so that if it is a good cpu at a good price i can go for one or if not then i can get somethign from sandy bridge a little cheaper (hoping price drops will come over the time waited and if the consumer is lucky price drops that come with/after bulldozer).
 
no surprise. they are try to fix the single thread performance hit due to the smaller l1 data/instruction. each core "only" had 8kb l1 data while the instruction cache is share by module which just only 64kb "2 way" in cache(could have be less...i think...) which is roughly 40kb per core compare to core's 64kb per core. big disadvantage. so all they can do is add more l3 cache to increase the performance or hoping not drop performance without tweak too much on the exist architecture that had been tape out and going to be release in 3 months. same thing intel did when realized northwood its poor l1 cache will drag down performance they increase l2 cache from 256kb to 512kb. however orochi is 8 module 16 core processor so featuring 16mb l3 meant each core can use up to 1mb l3. still way below nehalem's 2mb per core. also unlike intel's architecture amd's cache heavily determine by the stage pipeline. lower stage pipeline won't take advantage on bigger cache. but since bulldozer will featuring 4+ghz i doubt this will be at least 20+ stage pipeline in this processor. but despite all these feature as long as intel decide to increase ivy bridge's l2 cache from 256k per core to 512k per core amd will experience same horror they faced when core 2 came out.
 
I wonder how hot these CPUs will get ...
 
no surprise. they are try to fix the single thread performance hit due to the smaller l1 data/instruction. each core "only" had 8kb l1 data while the instruction cache is share by module which just only 64kb "2 way" in cache(could have be less...i think...) which is roughly 40kb per core compare to core's 64kb per core. big disadvantage. so all they can do is add more l3 cache to increase the performance or hoping not drop performance without tweak too much on the exist architecture that had been tape out and going to be release in 3 months. same thing intel did when realized northwood its poor l1 cache will drag down performance they increase l2 cache from 256kb to 512kb. however orochi is 8 module 16 core processor so featuring 16mb l3 meant each core can use up to 1mb l3. still way below nehalem's 2mb per core. also unlike intel's architecture amd's cache heavily determine by the stage pipeline. lower stage pipeline won't take advantage on bigger cache. but since bulldozer will featuring 4+ghz i doubt this will be at least 20+ stage pipeline in this processor. but despite all these feature as long as intel decide to increase ivy bridge's l2 cache from 256k per core to 512k per core amd will experience same horror they faced when core 2 came out.



First orochi is 4 module - 8 core design. Second not only the size but how fast is the cache. Third it is very important how the prediction of instructions will work, if the design is good then you dont need big L1 cache which increase cost and die size. And yes 2mb per module 1 mb per core is the amount that bulldozer will have.
 
I want one, a server version with 8 or 16 GB of ecc ram :D I don't know why though since I don't even work 1 core on my 955BE
 
I wonder how hot these CPUs will get ...

Very hot...apparantly we'll see a clockspeed decrease(which I assume is due to the high levels of cache), but IPC will increase. I'm kinda expecting 2.4ghz or so...maybe lower...for launch chips.
 
Very hot...apparantly we'll see a clockspeed decrease(which I assume is due to the high levels of cache), but IPC will increase. I'm kinda expecting 2.4ghz or so...maybe lower...for launch chips.

Just a good reason for me to get my first real water cooling setup :D (assuming i am happy with the reviews of bulldozer)
 
I don't know anything about it, really. However, there is mention of the clockspeed decrease on the AMD blog site. NOw that we have the info on cache size...1+1=2. Of course, there's lots of time between now and launch..seems to me they are refining the process, and a few bugs, at this point.
 
I want one, a server version with 8 or 16 GB of ecc ram :D I don't know why though since I don't even work 1 core on my 955BE


Haha me too!!! :laugh:
 
I don't know anything about it, really. However, there is mention of the clockspeed decrease on the AMD blog site. NOw that we have the info on cache size...1+1=2. Of course, there's lots of time between now and launch..seems to me they are refining the process, and a few bugs, at this point.

Hmm i wonder if they will follow intel's lead (refering to the cooler that comes with the top end i7's) by using a better cooler for the high end cpu's if they run hot, would be nice to see a better cooler than the current one's as i am not really a fan of them.
 
Back
Top