Monday, December 19th 2011

AMD FX 8150 with Microsoft KB2592546 Put Through 'Before and After' Patch Tests

To the surprise of many, Microsoft last week rolled out a patch (KB2592546) for Windows that it claimed would improve the performance of systems running AMD processors based on the "Bulldozer" architecture. The patch works by making the OS aware of the way Bulldozer cores are structured, so that it can make effective use of the parallelism at its disposal. Sadly, a couple of days later, Microsoft pulled the patch. In the meantime, SweClockers had enough time to run a "before and after" performance test of the AMD FX-8150 processor using the patch.
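To make the idea of a module-aware scheduler concrete, here is a minimal sketch of one policy such awareness makes possible: spread threads across empty modules before doubling up inside one. This is not Microsoft's implementation. The CPU numbering (logical CPUs 2k and 2k+1 sharing a module) and the names `module_of` and `place_threads` are assumptions made purely for illustration.

```python
from collections import defaultdict

# Assumption: logical CPUs 2k and 2k+1 belong to the same Bulldozer module.
CORES_PER_MODULE = 2


def module_of(cpu: int) -> int:
    """Return the (hypothetical) module index of a logical CPU."""
    return cpu // CORES_PER_MODULE


def place_threads(n_threads: int, n_cpus: int = 8) -> list[int]:
    """Pick a CPU for each thread, filling empty modules before doubling up."""
    load = defaultdict(int)  # number of threads already placed on each module
    used = set()             # logical CPUs that already run a thread
    placement = []
    for _ in range(n_threads):
        free = [c for c in range(n_cpus) if c not in used]
        if not free:
            raise ValueError("more threads than logical CPUs")
        # Prefer the least-loaded module; break ties by lowest CPU number.
        cpu = min(free, key=lambda c: (load[module_of(c)], c))
        used.add(cpu)
        load[module_of(cpu)] += 1
        placement.append(cpu)
    return placement


if __name__ == "__main__":
    print(place_threads(4))  # four threads land on four different modules
    print(place_threads(8))  # modules only double up once all are occupied
```

With four threads, each one gets a module to itself and does not have to share that module's front end with a sibling. A real scheduler has to weigh this against other goals, such as keeping modules idle for power or turbo reasons, which this sketch ignores.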

The results of SweClockers' tests are tabulated below. "Tidigare" is before, "nytt" is after, and "skillnad" is change. The reviewer put the chip through a wide range of tests, including synthetic CPU-intensive tests (both single and multi-threaded) and real-world gaming performance tests. The results are less than impressive. Perhaps that's why the patch was withdrawn.

Source: SweClockers

96 Comments on AMD FX 8150 with Microsoft KB2592546 Put Through 'Before and After' Patch Tests

#1
cdawall
where the hell are my stars
by: trickson
Can you say this in English please? AMD screwed up, plain and simple. Sucks too! We waited 4 years for this? A launch filled with fluff, and when it got into the hands of consumers it was a total mess. We find out all kinds of things, like they did not even know the transistor count, and that it is outperformed by its older line, the Phenom! REALLY??? Really, you need another hot fix for another CPU line AGAIN, AMD??? WTH? They put more work and more R&D into their GPU line than their CPU line!
Just throwing it out there: Intel has released just as many if not more chips with issues. They had their own TLB bug, P4, bad chipsets, not to mention Itanium's fiasco. There is nothing wrong with a hot fix; again, Intel has had its own batches of them. As of right now Bulldozer is the best-selling CPU in AMD's lineup. If they were that shitty everyone would save their couple of bucks and get a Thuban. Get off your Intel high horse and look at the big picture. P4's hyperthreading sucked, but not a single person out there is complaining about it on current Intel chips. This is AMD's first new design since K8, which was still heavily K7 based. You might remember Intel's current design is still an end result of a P3. Give AMD a generation to work some kinks out. Hell, look at the APU performance jump clock for clock; they are doing something right.
Posted on Reply
#2
seronx
by: TheMailMan78
It's not a true 8 core.
If you are talking about Bulldozer it is a true eight core...

Stop defending it... It has 8 weak cores... that have 33% less execution throughput than the competition's core

Sandy Bridge has 3 ALUs and 3 AGUs per core (threads compete for those 3 ALUs and 3 AGUs in Hyperthreading)
Bulldozer has 2 ALUs and 2 AGUs per core (threads don't compete because there are TWO CORES!)

It's not really hard to notice that it has fewer execution resources, not a longer pipeline

It takes 3 cycles to do six 64-bit executions for Bulldozer, where it takes 2 cycles to do six 64-bit executions for Sandy Bridge
Bulldozer, though, with all cores can do sixteen 64-bit ALU calcs and sixteen 64-bit AGU calcs, while Sandy Bridge can do twelve 64-bit ALU calcs and twelve 64-bit AGU calcs (with hyperthreading, the same old twelve 64-bit ALU calcs and twelve 64-bit AGU calcs; no increase, sillies)

Bulldozer is meant for Servers that need scalability with thread count...and Bulldozer does scale with thread count
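For anyone who wants to check the arithmetic in this post, here is a rough sketch. It only counts peak ALU issue slots using the per-core figures quoted above and ignores everything else that affects real performance (front end, caches, clocks). The four-core figure for Sandy Bridge is an assumption inferred from the "twelve" in the post, and the function names are made up for illustration.

```python
import math

# Per-core ALU counts as stated in the post (the AGU counts are the same).
ALUS_PER_CORE = {"Bulldozer": 2, "Sandy Bridge": 3}
# Core counts: 8 for the FX-8150; 4 for Sandy Bridge is inferred from "twelve".
CORES = {"Bulldozer": 8, "Sandy Bridge": 4}


def cycles_needed(ops: int, units: int) -> int:
    """Cycles to issue `ops` independent operations on `units` parallel units."""
    return math.ceil(ops / units)


def chip_ops_per_cycle(chip: str) -> int:
    """Peak ALU operations per cycle across the whole chip."""
    return CORES[chip] * ALUS_PER_CORE[chip]


def per_core_deficit(chip: str, rival: str) -> float:
    """Fraction by which one core of `chip` trails one core of `rival`."""
    return 1 - ALUS_PER_CORE[chip] / ALUS_PER_CORE[rival]


if __name__ == "__main__":
    for chip in ALUS_PER_CORE:
        print(chip,
              "cycles for six ops on one core:", cycles_needed(6, ALUS_PER_CORE[chip]),
              "| peak ALU ops per cycle, whole chip:", chip_ops_per_cycle(chip))
    print(f"per-core deficit: {per_core_deficit('Bulldozer', 'Sandy Bridge'):.0%}")
```

Under these assumptions the numbers line up with the post: 3 cycles versus 2 for six operations on one core, 16 versus 12 across the whole chip, and a per-core gap of about a third.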
Posted on Reply
#3
cdawall
where the hell are my stars
by: seronx
If you are talking about Bulldozer it is a true eight core...


I count 4 modules (cores)
by: seronx

Stop defending it... It has 8 weak cores... that have 33% less execution throughput than the competition's core
It also performs similarly to the previous generation, which had the same 3 ALU/3 AGU as Intel. It has 8 differently structured cores.
by: seronx

Sandy Bridge has 3 ALUs and 3 AGUs per core (threads compete for those 3 ALUs and 3 AGUs in Hyperthreading)
Bulldozer has 2 ALUs and 2 AGUs per core (threads don't compete because there are TWO CORES!)
The threads don't compete; all hyperthreading does is allow another set of instructions to be sent down the pipeline. It was originally a band-aid for Intel's long-pipelined NetBurst-based chips. AMD's new design gave you 2 separate threads, something Hyperthreading can never do.
by: seronx

It's not really hard to notice that it has fewer execution resources, not a longer pipeline
What does either of those have to do with anything? It is still a "short" pipeline CPU in comparison to P4. Due to its design it is not comparable to Intel in execution resources.
by: seronx

It takes 3 cycles to do six 64-bit executions for Bulldozer, where it takes 2 cycles to do six 64-bit executions for Sandy Bridge
Bulldozer, though, with all cores can do sixteen 64-bit ALU calcs and sixteen 64-bit AGU calcs, while Sandy Bridge can do twelve 64-bit ALU calcs and twelve 64-bit AGU calcs (with hyperthreading, the same old twelve 64-bit ALU calcs and twelve 64-bit AGU calcs; no increase, sillies)
Again, AMD's K7-K10h chips all offered the same 3/3 setup of calcs and did not offer an improvement except with K7/K8 vs NetBurst. Core 2 Duo and up, when Intel went back to 3/3, were the first competitive offerings. The main reason NetBurst failed in Intel's eyes was a lack of clock scaling. The original design was said to scale to 8 GHz, and at that speed its long pipelines and 2/2 design would have held a performance edge.
by: seronx

Bulldozer is meant for Servers that need scalability with thread count...and Bulldozer does scale with thread count
Yup, Bulldozer does what it was designed for, and in heavily multithreaded apps it holds its own. With future chips offering a more refined design it will likely smoke some multithreading benchmarks, especially since they have already proven it clocks higher.
Posted on Reply
#5
cdawall
where the hell are my stars
by: xenocide
http://arstechnica.com/business/news/2011/11/bulldozer-server-benchmarks-are-here-and-theyre-a-catastrophe.ars

:rolleyes:
According to your source Bulldozer does hold its own. Nowhere in my statement did I call it the best. I said it did as it was designed: heavy multithreading is Bulldozer's bread and butter. Price for performance makes no difference to the vast majority of the companies running these style chips. The AMD box at the time of that writing outperformed the Intel and K10h based boxes. It doesn't matter if it had more RAM, better hard drives or more cores. The point is the system was designed to do exactly that in a server environment and it succeeded in industry standard benchmarks. CEOs don't look at AnandTech; they look at the sheet of paper HP hands them that says quite clearly that, while at a higher cost, performance per unit is higher. Fewer units at higher performance means less space.
Posted on Reply
#6
Completely Bonkers
Looking at this picture



BD doesn't look like a smart design. Really, why would you have L3 cache the same size as L2? L3 is slower than L2... but if it is the same size... what benefit does it add? Only prefetching algorithms, aka "netburst"ing opcode and data. It isn't acting as a cache, but as a prefetcher. In which case, it doesn't need to be 2MB... it might as well just be 64K.

Redesign BD right away! A quick win would be to take L3 down to 64K... saving die space and power and making fab cost and end price much cheaper. I bet performance would be within the 3% mark. Double L1, if not quadruple it, and performance would be up 10% and still on a lower die footprint and power consumption.

And get the processor to operate symmetrically rather than asymmetrically. All this nonsense about affinity locking 2 threads and getting a "turbo boost" effect. Kill it. Separate those cores with a little space saved from cutting L3. And kill turbo boost but raise all clocks to their max. Cooling will be better now that they are spaced and there isn't heat from L3.
Posted on Reply
#7
cdawall
where the hell are my stars
by: Completely Bonkers
Looking at this picture

http://img.techpowerup.org/111220/bulldozer-overlay.jpg

BD doesn't look like a smart design. Really, why would you have L3 cache the same size as L2? L3 is slower than L2... but if it is the same size... what benefit does it add? Only prefetching algorithms, aka "netburst"ing opcode and data. It isn't acting as a cache, but as a prefetcher. In which case, it doesn't need to be 2MB... it might as well just be 64K.

Redesign BD right away! A quick win would be to take L3 down to 64K... saving die space and power and making fab cost and end price much cheaper. I bet performance would be within the 3% mark. Double L1, if not quadruple it, and performance would be up 10% and still on a lower die footprint and power consumption.

And get the processor to operate symmetrically rather than asymmetrically. All this nonsense about affinity locking 2 threads and getting a "turbo boost" effect. Kill it. Separate those cores with a little space saved from cutting L3. And kill turbo boost but raise all clocks to their max. Cooling will be better now that they are spaced and there isn't heat from L3.
Each module can only use its own 2MB L2 cache; however, a module could use the entire 8MB L3 if it needed to.


As for the earlier argument: the Bulldozer die, when analyzed the way AMD designed it, has 4 ALUs and 4 AGUs per module. You would consider each module a core. You cannot consider the individual "cores" within the modules to be cores, since they share the early pipelines. They are called integer cores. Each integer core carries a 4-way 16kB L1 data cache and a 64kB instruction cache. In a nutshell, it's two halves of a single brain, independent and codependent at the same time.
Posted on Reply
#8
Completely Bonkers
I wonder what the latency is between the different banks of L3. With decent memory controllers and DDR3, the relative performance gain of L3 cache is getting lower and lower... perhaps time to drop L3 and beef up L1/L2 and separate those pipelines.
Posted on Reply
#9
seronx
by: cdawall

I count 4 modules (cores)
Those are 8 cores; just looking at that you wouldn't notice the repeated ALU/AGU subsets and the dedicated datapaths each one has

by: abinstein

You CLAIM a core is but unfortunately what you claimed is not true.

What has always been the meaning of a "core" is the circuit used for the management of a thread and its memory context. This usually includes the datapath

by: cdawall
The threads don't compete; all hyperthreading does is allow another set of instructions to be sent down the pipeline. It was originally a band-aid for Intel's long-pipelined NetBurst-based chips. AMD's new design gave you 2 separate threads, something Hyperthreading can never do.
Hyperthreading competes for the execution resources....

by: cdawall
Each module can only use its own 2MB L2 cache; however, a module could use the entire 8MB L3 if it needed to.
The 8MB of L3 is used mostly for big prefetches and it is used by all modules and by all cores

by: cdawall

As for the earlier argument: the Bulldozer die, when analyzed the way AMD designed it, has 4 ALUs and 4 AGUs per module. You would consider each module a core. You cannot consider the individual "cores" within the modules to be cores, since they share the early pipelines. They are called integer cores. Each integer core carries a 4-way 16kB L1 data cache and a 64kB instruction cache. In a nutshell, it's two halves of a single brain, independent and codependent at the same time.
No, the design was that the 2 AGLUs were able to execute non-memory workloads (with later versions being able to have all EX/AGLUs be AGLUs that can output 4 adds, 4 subtracts, 4 multiplies, 4 divides, 4 memory ops per cycle in any order as long as it outputs four, and this is per core)... Each module has two cores. You can consider the individual cores in the module cores since they have dedicated datapaths, instruction buses, data buses, and control units...

Don't impose your definition of what a core is if you are 100% wrong!
Posted on Reply
#10
ensabrenoir
Simply mind blowing how a single cjop can cause such a fusd
Posted on Reply
#11
erocker
by: ensabrenoir
Simply mind blowing how a single cjop can cause such a fusd
I think you keyboard is broken. :ohwell:
Posted on Reply
#12
devguy
by: erocker
I think you keyboard is broken. :ohwell:
No, I'm pretty sure he's gyt it rufht. Why make a fusd?
Posted on Reply
#13
ensabrenoir
by: erocker
I think you keyboard is broken. :ohwell:
Speed posting on new fangled smart phone screen needs calibration
Posted on Reply
#14
cdawall
where the hell are my stars
by: seronx
Those are 8 cores; just looking at that you wouldn't notice the repeated ALU/AGU subsets and the dedicated datapaths each one has
They don't have entirely separate datapaths. The initial pipelines are shared between the integer cores.




by: seronx

Hyperthreading competes for the execution resources....
These share the instruction set per module not core, and share all of the other resources.


by: seronx

The 8MB of L3 is used mostly for big prefetches and it is used by all modules and by all cores
Which is what was already said.


by: seronx

No, the design was that the 2 AGLUs were able to execute non-memory workloads (with later versions being able to have all EX/AGLUs be AGLUs that can output 4 adds, 4 subtracts, 4 multiplies, 4 divides, 4 memory ops per cycle in any order as long as it outputs four, and this is per core)... Each module has two cores. You can consider the individual cores in the module cores since they have dedicated datapaths, instruction buses, data buses, and control units...
So each module acts as one core, giving 4 ALU/4 AGU per cycle? That's exactly what I just said. Each module has dedicated datapaths, instruction buses, data buses and control units.



All shared within the module, not within the integer core. The integer cores are not independent of the modules; if they were, it would be a true 8 core unit, no different than a Phenom X8 of sorts. This is not that. The integer cores share everything except a 16kB L1.

by: seronx

Don't impose your definition of what a core is if you are 100% wrong!
There are two definitions of a core and Bulldozer fits neither.
Posted on Reply
#15
seronx
by: cdawall

So each module acts as one core? giving 4 ALU/4 AGU per cycle. Thats exactly what I just said. Each module has dedicated datapaths, instruction buses, data buses and control units.
Each core has 2 EXALUs and 2 AGLUs. The original specification was that there were going to be 4 AGLUs, but that was a rumour made by Dresdenboy

Again each core has dedicated datapaths, instruction buses, data buses, and control units..

2 DATAPATHS, 2 IBUSES, 2DBUSES, 2ConUNITS => 2 CORES NOTHING IS SHARED

IT IS EIGHT CORES!

TECHNICAL DEFINITIONS PLACE BULLDOZER of the OROCHI DIE AT EIGHT CORES



by: abinstein

You CLAIM a core is but unfortunately what you claimed is not true.

What has always been the meaning of a "core" is the circuit used for the management of a thread and its memory context. This usually includes the datapath, control, and bus. This usually excludes the caches and accelerators (incl. FPU).

To be more precise, a processor can be partitioned into the following functional units: data cache, instruction cache, { control unit, instruction bus, data bus, (integer) datapath } and (floating point) accelerator datapath. Those inside the {} above form a "core". You may ask: why is the integer datapath special? Because any process (thread + memory context) is *always* managed by the integer datapath. Any branch instruction, ILP, OOO, speculation, is performed by the integer datapath.

So the question to ask is how many sets of the {} above does a Bulldozer module have? The answer is 2. There are two cores. This has nothing to do with marketing. It's a technical definition.

Now, you don't need to like this definition. You can be boneheaded enough to insist on your own definition of core. That is fine. Just like you can insist 1+1=1. Perhaps you are right in an alternative naming convention (if `+' means the logic-or to you), but you should at least understand that a Bulldozer module is said to have "two cores" for very sound technical reasons.
There are TWO Integer DATAPATHS, 2 CONTROL UNITS, 2 INSTRUCTION BUSES, 2 DATA BUSES GET IT IN YOUR GODDAMN BONEHEAD OF YOURS THAT THIS IS TWO CORES

Accept the facts and move on cdawall I am tired of your idiocy
Posted on Reply
#16
brandonwh64
Addicted to Bacon and StarCrunches!!!
*Reads through this thread and shakes head*

You guys are relentless on debating...

Posted on Reply
#17
cdawall
where the hell are my stars
by: seronx
Each core has 2 EXALUs and 2 AGLUs. The original specification was that there were going to be 4 AGLUs, but that was a rumour made by Dresdenboy

Again each core has dedicated datapaths, instruction buses, data buses, and control units..

2 DATAPATHS, 2 IBUSES, 2DBUSES, 2ConUNITS => 2 CORES NOTHING IS SHARED

IT IS EIGHT CORES!

TECHNICAL DEFINITIONS PLACE BULLDOZER of the OROCHI DIE AT EIGHT CORES
Let's go through your image specifically. Having separate datapaths means nothing when there is still only a single unit. 2 roads to the same place, if you will.



The module is not actually split into 2 cores; that is the idea behind Bulldozer: fit more into the package. In the image I split it for simplicity. The only sections physically separate for the cores are the actual integer calculation sections with their cache. Everything else is shared. Again, separate paths to the same place don't make the place any more split. The cores would still have to share. Any communications outside of the module go core->module->IO, not core->IO, once again making them dependent on the module itself, which further makes them not true cores as is normal for a K10 or SB style CPU. This is a new design with separate integer cores within modules. They are not the same cores as anything else has used to this point. While an 8150 has 8 integer cores, it does not have 8 separate processing modules like a Phenom X8 would.
Posted on Reply
#18
xenocide
I've been making that argument since before it launched and nobody seemed to care. Thank you for perfectly detailing what I couldn't.
Posted on Reply
#19
cdawall
where the hell are my stars
by: xenocide
I've been making that argument since before it launched and nobody seemed to care. Thank you for perfectly detailing what I couldn't.
It took me about 3 hours of reading and looking at different architectural designs to figure out how to finally phrase it. :laugh: Thanks AMD for making shit more difficult again :slap:
Posted on Reply
#20
TheMailMan78
Big Member
Um......like I said. It's not an 8 core.

BTW thanks for detailing it, cdawall. Really. I honestly didn't have the time and you did a better job than I could have. (Internet high five!) Bulldozer is only a fail to people who rested all of their childhood expectations on a piece of silicon to enhance their mortality OR Intel fanboys who have small manhoods. Anyone with a brain can see what its design is for. Sometimes you don't get what you want, you get what you need.
Posted on Reply
#21
Super XP
by: xenocide
I think you're deluding yourself by expecting 10% on average. I saw some Linux benchmarks and W8 benchmarks with a "fixed" scheduler, and it was MAYBE a 5% gain in some situations, and in others upwards of a 5% loss in performance. This patch may offer a slight performance gain, but I don't expect it to really change much.
Think about this, you have a scheduling issue, something that obviously needs to get fixed. I don't think 10% on average is unreasonable.
The same can be said in a busy doctor's office or in a hospital, if you don't schedule appointments properly, you end up running into patient bottlenecks.
We will know the facts soon enough in Q1 2012, and hopefully this 2 part patch will help efficiency within the Bulldozer and make it close to the way it was meant to run and perform.
Posted on Reply