Saturday, December 17th 2011

AMD Bulldozer Threading Hotfix Pulled

Since we reported on the AMD Bulldozer hotfix, The Tech Report, in an updated post, reports that the Bulldozer threading hotfix said to improve the processor's performance has been pulled:
We've spoken with an industry source familiar with this situation, and it appears the release of this hotfix was either inadvertent, premature, or both. There is indeed a Bulldozer threading patch for Windows in the works, but it should come in two parts, not just one. The patch that was briefly released is only one portion of the total solution, and it may very well reduce performance if used on its own. We're hearing the full Windows update for Bulldozer performance optimization is scheduled for release in Q1 of 2012. For now, Bulldozer owners, the best thing to do is to sit tight and wait.
It will be very interesting indeed to see how this much-maligned processor benchmarks once the fully developed patch is released. Sure enough, attempting to download the hotfix and agreeing to the licence terms at the moment leads to a page that shows it as unavailable.

90 Comments on AMD Bulldozer Threading Hotfix Pulled

#1
seronx
cadaveca said:
LuLz. The problem is the shared L2 of a module is not fast enough to feed two threads. /end story. There's no big mystery as to why BD is slow. I knew it before the CPU was out.
The L2 doesn't feed the two cores... It stores results from the Floating Point Unit and provides instructions to the L1i. There is no problem there. It can be improved, but I wouldn't fix what isn't broken.
Posted on Reply
#2
cadaveca
My name is Dave
OK. Please explain to me, then, how moving two threads from running on two cores within a module to one thread per module is faster on workloads that don't require that the shared resources be used exclusively per thread?


In an Intel CPU, cache gets slower going from L1 to L2 to L3. Then RAM is a bit slower still. The speed differences are offset by having a larger data store at each level.

In Bulldozer, the L2 cache is a fraction of the speed of both the L1 and L3. Why? What benefit does this serve?
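For anyone who wants to poke at this themselves, a pointer-chasing microbenchmark gives a rough feel for how latency climbs as the working set spills from one cache level into the next. This is an illustrative Python sketch only: the sizes are assumptions based on Bulldozer's published cache capacities, and interpreter overhead mutes the differences a compiled test would show clearly.

```python
import time
import numpy as np

def access_time_ns(size_bytes, iters=200_000):
    """Average latency of dependent reads over a working set of the
    given size; bigger sets spill into slower cache levels (or RAM)."""
    n = max(size_bytes // 8, 2)
    # Random cyclic permutation: each read's index comes from the
    # previous read, which defeats hardware prefetching.
    perm = np.random.permutation(n)
    chain = np.empty(n, dtype=np.int64)
    chain[perm] = np.roll(perm, 1)
    idx = 0
    start = time.perf_counter_ns()
    for _ in range(iters):
        idx = chain[idx]
    return (time.perf_counter_ns() - start) / iters

# Working sets sized to roughly match Bulldozer's L1d (16 KB),
# shared L2 (2 MB per module), and L3 (8 MB) -- sizes are assumptions:
for label, size in [("L1-sized", 16 * 1024),
                    ("L2-sized", 2 * 2**20),
                    ("L3-sized", 8 * 2**20)]:
    print(f"{label} working set: ~{access_time_ns(size):.0f} ns/read")
```

Each read depends on the previous one, so prefetching can't hide the latency; that is the standard trick behind tools like lmbench's lat_mem_rd.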
Posted on Reply
#3
seronx
cadaveca said:
OK. Please explain to me, then, how moving two threads from running on two cores within a module to one thread per module is faster?
In what benchmark?



If I had to make a guess without knowing the benchmark, it would be the dispatch, not the L2.
Posted on Reply
#4
qubit
Overclocked quantum bit
cadaveca said:
LuLz. The problem is the shared L2 of a module is not fast enough to feed two threads. /end story. There's no big mystery as to why BD is slow. I knew it before the CPU was out.
Isn't it also slow because there are only 4 FPUs in the 8-core model? I was aghast when I first saw this.
Posted on Reply
#5
cadaveca
My name is Dave
qubit said:
Isn't it also slow because there are only 4 FPUs in the 8-core model? I was aghast when I first saw this.
There are not just 4 FPUs. There are 4 256-bit FPUs, each of which can handle dual 128-bit operations. Nearly nothing currently uses the 256-bit capability.
Posted on Reply
#6
seronx
qubit said:
Isn't it also slow because there are only 4 FPUs in the 8-core model? I was aghast when I first saw this.
It's 4 Floating Point Units, but you have two units for each core. If it was 256-bit units, you wouldn't be complaining.
Posted on Reply
#7
qubit
Overclocked quantum bit
cadaveca said:
There are not just 4 FPUs. There are 4 256-bit FPUs, each of which can handle dual 128-bit operations. Nearly nothing currently uses the 256-bit capability.
seronx said:
It's 4 Floating Point Units, but you have two units for each core. If it was 256-bit units, you wouldn't be complaining.
In other words, it's four double-width FPUs, making it equivalent to 8 single-width ones? And each can be logically split into two? If so, that would make it just fine, yes.

Now that I think about it, what is the word size of an FPU on previous 64-bit processors? Should be the same for AMD and Intel, I'd expect. (Yes, I know I could google it, but I'd rather you guys just explain it to me. :p )
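The equivalence being described here can be shown with ordinary arithmetic: an operation on one 256-bit register's worth of floats gives exactly the same result as two 128-bit halves done back to back. A small NumPy sketch (purely illustrative; NumPy doesn't expose the hardware pipes):

```python
import numpy as np

# Eight 32-bit floats = one 256-bit (AVX-width) register's worth of data.
a = np.arange(8, dtype=np.float32)
b = 2 * np.arange(8, dtype=np.float32)

# One "256-bit" packed add...
wide = a + b

# ...equals two "128-bit" adds on the low and high halves, which is how
# each module's FPU can instead serve its two cores with independent
# 128-bit work.
narrow = np.concatenate([a[:4] + b[:4], a[4:] + b[4:]])

assert np.array_equal(wide, narrow)
print(wide)  # [ 0.  3.  6.  9. 12. 15. 18. 21.]
```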
Posted on Reply
#8
cadaveca
My name is Dave
Is this better?

Posted on Reply
#9
seronx
qubit said:

Now that I think about it, what is the word size of an FPU on previous 64-bit processors? Should be the same for AMD and Intel, I'd expect. (Yes, I know I could google it, but I'd rather you guys just explain it to me. :p )
Mostly 128 bits.

ARM is only just getting 128-bit SIMD with the Cortex-A15.
Posted on Reply
#11
cadaveca
My name is Dave
That's how AMD explains it.


The whole thing about a single scheduler revolves around the FP scheduler being shared between the separate 128-bit "pipes" (as is plain in the image I posted above), but it seems to me that even workloads that have no floating point and are integer-based benefit from moving dual threads to individual cores.

And to me, the figure of 10% performance increases seems to fit with the L2 cache being slow, rather than with the FP scheduler not being wide enough.
Posted on Reply
#12
qubit
Overclocked quantum bit
It sounds like the damned thing just wants some hand-tuned optimisation, doesn't it? Perhaps that would make it fly? We really need Intel to be lifted bodily out of its comfort zone.
Posted on Reply
#13
cadaveca
My name is Dave
As far as I can tell, it's about more than having workloads balanced.


Now, there's a difference between Windows 8 and Windows 7 in how workloads are managed on a CPU, due to Windows 8 allowing what is called "core parking". This is basically fully shutting off a core when it's not in use, for power savings. Naturally, such control needs to be finely tuned so that threads do not stall, and bringing similar functionality to Windows 7 is what this patch is supposed to be all about. The ability to dynamically move threads from one core to the next without stalling the thread is not really a big thing, and if it really was an issue with the FP scheduler, there'd be much more than just a 10% boost possible... sometimes it would be a doubling of speed.

That said, no, I do not think there is any "saving grace" for BD in this. I really feel the L2 cache is too slow, and the numbers seem to agree. When someone can tell us why the L2 cache seems to be slow, it might be clearer why BD "sucks".

Price the 8150 @ $200, and it's killer. There's really nothing wrong with BD's design. The only thing that makes it look wrong is the pricing, and that's because everyone considers BD to compete with SB (rightly so).
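The thread-placement idea behind the hotfix can be made concrete with a toy policy: fill one core per module before doubling up inside a module. This is a hypothetical Python sketch of the idea, not the actual logic of the Windows patch; the function name and the module/core numbering are made up for illustration.

```python
def assign_threads(n_threads, n_modules=4, cores_per_module=2):
    """Give each thread a (module, core) slot, spreading threads one
    per module first so each gets a module's shared front end and L2
    to itself; only double up once every module already has work."""
    if n_threads > n_modules * cores_per_module:
        raise ValueError("more threads than cores")
    return [(t % n_modules, t // n_modules) for t in range(n_threads)]

# 4 threads on an FX-8150-style 4-module part: one per module.
print(assign_threads(4))  # [(0, 0), (1, 0), (2, 0), (3, 0)]

# 6 threads: modules 0 and 1 now carry a second thread each.
print(assign_threads(6))  # last two entries are (0, 1) and (1, 1)
```

A scheduler that instead packed both cores of module 0 first would force two threads to contend for one module's shared resources while other modules sat idle, which is the behaviour the patch is meant to avoid.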
Posted on Reply
#14
seronx
cadaveca said:
When someone can tell us why the L2 cache seems to be slow, it might be more clear why BD "sucks".
The L2 has to handle writes from two L1Ds, and that is handled by the WCC unit (the WCC can combine 4 x 8192 Kb, or send 4 KB to the L2, write-through).

The problem with multithreading, once again, won't be there.

The memory subsystem isn't really the problem, other than the L3, and the L3 problem only starts with more than one module being used.
Posted on Reply
#15
NC37
OneCool said:
We all know AMD's in Microsoft's pocket. They will do whatever they can, software/driver-wise, to help them.
Because Windows is clearly optimized for AMD :rolleyes:

M$ may like AMD some because they are tired of Intel's monopoly, but ultimately they make more money thanks to Intel than not. Money talks. But the patch for BD was a given: there were features of it that were not being implemented, or at least not as well as they could be. M$ would do the same for Intel if they came out with a new tech.

They aren't directly in M$'s pocket. More likely they are in Intel's pocket, because without them, Intel faces antitrust.
Posted on Reply
#16
cadaveca
My name is Dave
seronx said:
The L2 has to handle writes from two L1Ds, and that is handled by the WCC unit (the WCC can combine 4 x 8192 Kb, or send 4 KB to the L2, write-through).

The problem with multithreading, once again, won't be there.

The memory subsystem isn't really the problem, other than the L3, and the L3 problem only starts with more than one module being used.
The L3 is shared between ALL cores. The problem isn't multithreaded workloads. The problem with BD is that single-threaded performance is lower than even Thuban's. The most obvious difference between the two, to me, is cache design and speed.

Nobody cares about BD's multi-threaded performance. I'm not sure we're on the same topic here.
Posted on Reply
#17
seronx
cadaveca said:
The L3 is shared between ALL cores. The problem isn't multithreaded workloads. The problem with BD is that single-threaded performance is lower than even Thuban's. The most obvious difference between the two, to me, is cache design and speed.

Nobody cares about BD's multi-threaded performance. I'm not sure we're on the same topic here.
Cache design isn't at fault, and speed isn't at fault.

Single-threaded performance -> dispatch
Scroll up, I said dispatch already.

Dispatch is shared between the two cores and the shared FP. It is divided into 4 dispatches per clock (2 macro-ops per unit: Core A, Core B, FPU x 2), unless you disable a cluster, in which case it is 2 dispatches per clock (Core A x 2, FPU x 2): 4 macro-ops to Core A and 4 macro-ops to the FPU, and the FPU will only need to use Core A's work, making a ~17-stage pipeline effectively a ~14-stage pipeline (each core only needs 2 macro-ops, and to complete core commands the FPU only really needs 4 macro-ops; the decoder can do 8 macro-ops).
Posted on Reply
#18
pantherx12
cadaveca said:
OK. Please explain to me, then, how moving two threads from running on two cores within a module to one thread per module is faster on workloads that don't require that the shared resources be used exclusively per thread?
On my system, running a 4-threaded program on 4 cores (2 modules) is the same speed as running a 4-threaded program on 4 cores (4 modules).


On Cinebench anyway, not sure about anything else as I've not tested it.

But Cinebench should be a program that would highlight this, right?
Posted on Reply
#19
xenocide
I choose to believe shared resources AND a slow L2 cache are both problems.
Posted on Reply
#20
nt300
There are 3 problems with Bulldozer, in order. If AMD can fix these in time for Piledriver, then they would have the ability to compete with Intel much better.

1 - It lacks hand-tuned optimisation (somebody already mentioned this)
2 - The dispatch unit needs major tweaking
3 - The L1 and L2 caches need a speed boost.
pantherx12 said:
On my system, running a 4-threaded program on 4 cores (2 modules) is the same speed as running a 4-threaded program on 4 cores (4 modules).

On Cinebench anyway, not sure about anything else as I've not tested it.

But Cinebench should be a program that would highlight this, right?
Something's wrong there :confused: I've seen tests that show a 4C4M setup beats out a 4C2M setup in almost all tests, and the higher you scale the CPU clock, the better the 4C4M becomes versus the 4C2M. This sharing within the Bulldozer design needs some real fine-tuning, IMO.
Posted on Reply
#21
EastCoasthandle
This is exciting stuff. Can't wait to see what kind of performance people can expect once the patch is released.
Posted on Reply
#22
pantherx12
nt300 said:
There are 3 problems with Bulldozer, in order. If AMD can fix these in time for Piledriver, then they would have the ability to compete with Intel much better.

1 - It lacks hand-tuned optimisation (somebody already mentioned this)
2 - The dispatch unit needs major tweaking
3 - The L1 and L2 caches need a speed boost.

Something's wrong there :confused: I've seen tests that show a 4C4M setup beats out a 4C2M setup in almost all tests, and the higher you scale the CPU clock, the better the 4C4M becomes versus the 4C2M. This sharing within the Bulldozer design needs some real fine-tuning, IMO.
What bios and boards were used?
Posted on Reply
#23
Paulieg
The Mad Moderator
cadaveca said:
As far as I can tell, it's about more than having workloads balanced.


Now, there's a difference between Windows 8 and Windows 7 in how workloads are managed on a CPU, due to Windows 8 allowing what is called "core parking". This is basically fully shutting off a core when it's not in use, for power savings. Naturally, such control needs to be finely tuned so that threads do not stall, and bringing similar functionality to Windows 7 is what this patch is supposed to be all about. The ability to dynamically move threads from one core to the next without stalling the thread is not really a big thing, and if it really was an issue with the FP scheduler, there'd be much more than just a 10% boost possible... sometimes it would be a doubling of speed.

That said, no, I do not think there is any "saving grace" for BD in this. I really feel the L2 cache is too slow, and the numbers seem to agree. When someone can tell us why the L2 cache seems to be slow, it might be clearer why BD "sucks".

Price the 8150 @ $200, and it's killer. There's really nothing wrong with BD's design. The only thing that makes it look wrong is the pricing, and that's because everyone considers BD to compete with SB (rightly so).
Agreed. At the end of the day, through all of the technical jargon, this is all that matters to the majority of users. The pricing just doesn't parallel its performance. If AMD adjusts this, and makes it clear that the chip is not really designed to compete with SB, then the chip goes from POS to a good budget, mid-level enthusiast chip. I paid $195 for mine, and for that (even though I haven't received it yet), it feels like a bargain based on benchmarks.
Posted on Reply
#24
theoneandonlymrk
IMHO the scheduling patch may well fix it. As for the issues with its L2 cache, it's quite simple: if one program runs 2 threads on different modules and one thread requires data from the other to proceed, there's a halt while the data is pulled from one module's L2 to the other, slowing things down. In this instance the threads would be better scheduled on the same module.

However, if one program runs two threads that don't share data, or two programs run a thread each, then the scheduler needs to run one thread per module to optimise performance. None of this is presently being done by Windows correctly, hence lower performance and higher heat and watts, so a patch should reap rewards if it works right.
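The two cases described above, co-scheduling communicating threads on one module versus spreading independent threads across modules, can be sketched as a toy policy. Hypothetical Python for illustration only; place_threads and its inputs are invented names, not any real scheduler API.

```python
def place_threads(sharing_pairs, independent, n_modules=4):
    """Map threads to (module, core) slots: pairs that exchange data
    share one module (and thus its L2, avoiding a cross-module pull),
    while independent threads each get a whole module to themselves."""
    placement = {}
    module = 0
    for a, b in sharing_pairs:
        placement[a] = (module, 0)
        placement[b] = (module, 1)  # same module, same L2
        module += 1
    for t in independent:
        placement[t] = (module % n_modules, 0)  # one thread per module
        module += 1
    return placement

# A producer/consumer pair lands together on module 0; two independent
# workers get modules 1 and 2 to themselves.
print(place_threads([("producer", "consumer")], ["worker1", "worker2"]))
```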
Posted on Reply