
AMD Dragged to Court over Core Count on "Bulldozer"

don't stop!

this thread has given me hours of entertainment :D

and some insight tbh :)
 
But again, what matters in the end is performance. AMD opted for this core design. Call them half cores or not-true cores all you want; they are cores presented to the system, and there are 8 of them. If they don't perform as expected, why the fuck are there 5 trillion review sites, then? Clueless people will get screwed (or shall we say they screw themselves) for not asking the right people or checking reviews. Technically speaking, if a CPU had just 1 core and companies advertised it as such, no one would buy it, even if that single core crushed every multi-core CPU on the market. Without looking at reviews, you can't possibly tell how well it performs. So how different is going to the other extreme, 8 cores that supposedly aren't "real" cores?

Intel's HT really can't be called a core, because it can't be called one on any level, even though I've seen some really weird names for i7 CPUs with HT on the very popular German webshop Computer Universe. AMD can't just call it a quad core with 6.5 threads; that would confuse the fuck out of users. So they opted for counting cores the way they are presented to the system.

Also, look at the task manager...

FX8320.jpg


It's not exactly a tightly kept secret that required rocket scientists to figure out. 1 processor, 4 cores, 8 logical units. The difference is, those are actually cores, even though the design differs from the one Intel uses. HT, on the other hand, doesn't have any kind of core appearance. It's just side logic that tricks the OS into thinking there's another core and gives the CPU the ability to stack more computation on the same physical core. It's confusing to casual users, but I wouldn't call it cheating on AMD's end...
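The logical-vs-physical split the task manager shows can also be checked programmatically. A minimal sketch using only the standard library (the physical count is Linux-only, since it parses /proc/cpuinfo; on other platforms it just returns None):

```python
import os

# Logical processors: what the OS actually schedules threads on.
logical = os.cpu_count()
print(f"logical processors: {logical}")

def physical_cores():
    """Count unique (physical id, core id) pairs in /proc/cpuinfo (Linux only)."""
    cores = set()
    physical_id = None
    try:
        with open("/proc/cpuinfo") as f:
            for line in f:
                if line.startswith("physical id"):
                    physical_id = line.split(":")[1].strip()
                elif line.startswith("core id"):
                    cores.add((physical_id, line.split(":")[1].strip()))
    except OSError:
        return None  # not Linux, or /proc unavailable
    return len(cores) or None

print(f"physical cores: {physical_cores()}")
```

On a chip where the OS groups logical processors into shared cores (HT, or an FX-8350 under Windows 8+), the two numbers differ; on a CPU with no multithreading they match.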
 
Intel's HT really can't be called a core, because it can't be called one on any level, even though I've seen some really weird names for i7 CPUs with HT on the very popular German webshop Computer Universe. AMD can't just call it a quad core with 6.5 threads; that would confuse the fuck out of users. So they opted for counting cores the way they are presented to the system.

This has nothing to do with the topic, but stores sold the first-generation i3/i5/i7 CPUs as CPUs with three, five and seven cores.
 
It has everything to do with the topic, because what people understand as a 4-core/8-thread Intel CPU cannot be applied to AMD CPUs. If it says 8 cores, it actually has that many cores. Whether they are really as effective as Intel's cores, number for number, is debatable. And that's why reviews exist. In the end, it doesn't matter if the number of cores is the same, or how effective they are per core or in a multi-core arrangement. You have to see benchmarks in either case.
 
phenom2-bulldozer-block-diagram-compare034.png

Microsoft would call them cores if they fit the definition of a core.
 
L2 is part of the core, huh Ford? I'm pretty sure that Core 2 Duos, having a shared L2, were still individual cores. Might want to work on that diagram a bit instead of posting it incessantly. Just like control logic is part of the core too, huh? Let's stick with facts and less home-made bullshit.
 
The reason why they prefer split dedicated caches is to avoid cache thrashing. L3 is so far away it's almost like RAM, so it's not that important anymore.
 
L2 is part of the core, huh Ford? I'm pretty sure that Core 2 Duos, having a shared L2, were still individual cores. Might want to work on that diagram a bit instead of posting it incessantly. Just like control logic is part of the core too, huh? Let's stick with facts and less home-made bullshit.
A core doesn't share any resources with another core. If an L2 cache is shared between two or more cores, none of the cores can claim it as theirs.

In the case of Bulldozer, the L2 cache is shared between the FPU and the two integer clusters. It is not shared with another core so, as the diagram shows, it is correct. One bulldozer core (containing two integer clusters) includes the L2 cache.


In the case of Core 2 Duo, the L2 cache is shared between two cores, so the L2 cache is not part of either core. The two discrete cores (purple background) packaged together with the L2 cache form a module (green square):
conroe_block.jpg


Core 2 Quad was created by combining two dual-core modules producing a multi-chip module (MCM) quad-core CPU:
intel_diagram.jpg



The reason why they prefer split dedicated caches is to avoid cache thrashing. L3 is so far away it's almost like RAM, so it's not that important anymore.
L3 was added because of the massive performance drop between L2 and RAM. Some processors are getting an L4 cache because of the massive performance drop between L3 and RAM.
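Those latency cliffs between cache levels and RAM are visible even from user code: time random accesses over working sets of increasing size and watch the per-access cost jump each time the set spills out of a cache level. A rough sketch, standard library only (Python's interpreter overhead dominates the absolute numbers, so only the relative jumps are meaningful, and the sizes are illustrative):

```python
import random
import time

def chase(n, steps=200_000):
    """Time a random pointer-chase over a single n-element cycle."""
    order = list(range(n))
    random.shuffle(order)
    nxt = [0] * n
    for k in range(n):
        nxt[order[k]] = order[(k + 1) % n]  # link into one random cycle
    i, t0 = 0, time.perf_counter()
    for _ in range(steps):
        i = nxt[i]  # each hop is a dependent, cache-unfriendly load
    return (time.perf_counter() - t0) / steps

# Working sets chosen to straddle typical L1/L2/L3 boundaries
# (list elements are roughly pointer-sized, ~8 bytes each).
for n in (1 << 10, 1 << 13, 1 << 16, 1 << 20):
    print(f"{n:>8} elements: {chase(n) * 1e9:6.1f} ns/access")
```

On most machines the ns/access figure rises noticeably at each boundary, which is exactly the gap that each added cache tier (and eventually L4) is there to paper over.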
 
The reason why they prefer split dedicated caches is to avoid cache thrashing. L3 is so far away it's almost like RAM, so it's not that important anymore.
Yessir. The hit rates on CPU cache nowadays are nutty high, north of 85-90% in a lot of cases, which probably explains why faster memory doesn't do a whole lot of good.
Some processors are getting an L4 cache because of the massive performance drop between L3 and RAM.
You mean the eDRAM cache? That's strictly for the iGPU if I recall correctly because the only chips that sport it are ones with Iris Pro.
 
I think that depends on what version of Windows you are using. Under Win 7, FX-8s show up as 8 CPUs, and under Win 10 they show up as 4 CPUs with 8 threads. I think this was done to help the performance of AMD processors, but I'm not totally sure about that.
 
I think that depends on what version of Windows you are using. Under Win 7, FX-8s show up as 8 CPUs, and under Win 10 they show up as 4 CPUs with 8 threads. I think this was done to help the performance of AMD processors, but I'm not totally sure about that.
There is a minor performance hit when using the second core in the module. Probably, as @FordGT90Concept described, the decoder was getting overwhelmed, which is why they added a second one in Steamroller.
 
You mean the eDRAM cache? That's strictly for the iGPU if I recall correctly because the only chips that sport it are ones with Iris Pro.
The eDRAM can be used by Iris Pro and the CPU:
http://www.anandtech.com/show/6993/intel-iris-pro-5200-graphics-review-core-i74950hq-tested/3
AnandTech said:
Unlike previous eDRAM implementations in game consoles, Crystalwell is true 4th level cache in the memory hierarchy. It acts as a victim buffer to the L3 cache, meaning anything evicted from L3 cache immediately goes into the L4 cache. Both CPU and GPU requests are cached. The cache can dynamically allocate its partitioning between CPU and GPU use. If you don’t use the GPU at all (e.g. discrete GPU installed), Crystalwell will still work on caching CPU requests. That’s right, Haswell CPUs equipped with Crystalwell effectively have a 128MB L4 cache.
It does not act as a frame buffer for the Iris Pro. Intel hinted that a separate 16-32 MiB ESRAM could be used exclusively for Iris Pro's frame buffer in the future. Skylake-H will likely be getting the same Crystalwell L4 cache as Broadwell. We could see the same Crystalwell cache spring up on even more chips in the future (Kaby Lake, maybe even Cannonlake).


There is a minor performance hit when using the second core in the module. Probably, as @FordGT90Concept described, the decoder was getting overwhelmed, which is why they added a second one in Steamroller.
Even in Excavator, the prefetcher and the FPU are still shared. There's going to be a performance hit from those too. A legitimate dual-core doesn't share those things, as demonstrated by the Core 2 Duo and Phenom II block diagrams.


I did some more digging on Core 2 Duo and it appears that neither core can be disabled. Conroe-L (single-core) appears to be a different chip altogether. This makes Core 2 Duo a true module because it has two of everything except the L2 and control logic, which makes the cores inseparable. Bulldozer is not a module because it doesn't have two of everything--it has only one of some things. This is why the FX-8350 should be considered a quad-core. What defines a module (complete but inseparable cores) is absent here (it would need two prefetchers at minimum).
 
I was hoping Skylake would get L4 by default (current i7 6700k for example), but after I've seen it's basically just a smaller i7 5000 series, I just didn't bother and opted for more cores instead on 5820K.
 
Scaling would show that an FX 8-core has more than 4 cores. Math would say it is physically impossible to claim otherwise.
 
That was Windows XP, and XP only has two states: uniprocessor (one thread at a time) and multiprocessor (two or more threads at a time). Multiprocessor could mean two physical sockets with one core each, one socket with two cores, or one physical plus one logical processor. It was updated to better handle the three variations.

Bulldozer did the same thing with Vista. Vista (I believe 7 too) called it an eight-core because it was incapable of distinguishing them, but that apparently caused problems, because updates were released to fix core-parking issues. Come Windows 8 and newer, Microsoft updated the operating system to definitively account for sockets, cores, and logical processors, which is where we see 4 cores and 8 logical processors.


CPU-Z doesn't need to schedule threads. Windows does. Microsoft did what they did deliberately so the scheduler best utilizes the processor's resources.


Caches have always been tiered. The closer the tier is to the ALUs and FPUs, the faster it is. Caches completely lack logic, and there are numerous advantages, and virtually no disadvantages, to sharing caches (the scheduler will allot the cache evenly when the load is even).

There are only a handful of shared FPUs in the computing world outside of Bulldozer (and derivatives), and all of them are set up in a way that resembles a co-processor. That is, it has its own scheduler and all of the cores can queue work to it--effectively its own core. They don't market it as having an extra core, though, because that would be misleading.
Still on 7 myself.
Yes I got all updates plus core unparker tool. The FX8350 does more than I could imagine.
 
In the case of Core 2 Duo, the L2 cache is shared between two cores, so the L2 cache is not part of either core. The two discrete cores (purple background) packaged together with the L2 cache form a module (green square):

Let me show you the Intel Silvermont C2000 eight-core architecture.
All the new Atoms have modules, with 2 cores in each, sharing the same L2 cache. Are they liars too?
4_17.jpg



The graphic you made is also completely wrong. It shows that you don't understand OoO, PRF, branch prediction, or resource monitoring. http://www.anandtech.com/show/3922/intels-sandy-bridge-architecture-exposed/3

In short, you don't understand how their microarchitecture works. 95% of the time, the module will work just the same as 2 cores, because both can share the resources at the SAME TIME. In most circumstances, it will use both integer cores, and each one will have a 128-bit FMAC along with 128-bit integer execution. So they can simultaneously execute most instructions independently, without having to wait for their turn like with Hyper-Threading. A totally different microarchitecture. Where things begin to degrade is when both floating-point pipelines have to come together for a single integer core, to execute a single 256-bit AVX instruction or two symmetrical SSE instructions. Then the entire FPU is taken and leaves no resources for the other integer core. In theory, the dispatch controller should give that integer core some instructions that don't need any FPU interaction, by looking in the instruction fetch buffer, keeping it busy while the other completes the cycles that need the whole FPU. On paper it looks awesome, but it's a very, very complex operation, and sadly it didn't bring much success. Luckily, those instructions are not used very often. Still, it's a major problem AMD tried to improve in Piledriver, Steamroller and finally Excavator. It was their way to deal with new instructions too, and to stay in the competition.

It's a good technology, but a little too audacious for today's market. Instead of focusing on better IPC, they mostly developed ways to better dispatch the instructions. That's why they decided to come back to more traditional microarchitectures and be more competitive IPC-wise. It doesn't change the fact that a module behaves like 2 cores and is in fact 2 cores in a single module. Even Intel agrees with that and is using modules for their Atoms. Maybe we should drag them into court too, no?
 
Let me show you the Intel Silvermont C2000 eight-core architecture.
All the new Atoms have modules, with 2 cores in each, sharing the same L2 cache. Are they liars too?
That's an octo-core so Intel is not lying. The compute cores aren't broken up at all--nothing is shared except L2 cache.

A "core" only requires data + instruction cache. Additional caches are added for boosting performance (decreasing the gaps in latency between core and system RAM).
latency.png

up to 32k = L1
up to 256k = L2
up to 4M = L3
up to 64M = L4 eDRAM in 4950HQ, system RAM otherwise.

As I specified above, if a quad-core processor has 4 L2 caches, then those L2 caches are part of the core because it is not a shared resource. If the resource is shared (as is the case with Silvermont) then the resource doesn't belong to a core--it's part of the CPU package (like L3, QPI, HyperTransport, memory controller, etc. usually are).


Then the entire FPU is taken and leave no resources to the other integer core.
This blocking situation is never encountered on Silvermont nor Core 2 Duo. If a blocking situation is possible, I'd argue (and have argued) the whole of it is a multithreaded core, not multi-core.

A core can take an instruction and execute the whole of it without sharing any parts with any other processor. Bulldozer and sons, when executing a floating point unit task, do not fit that definition. Silvermont will happily execute eight 256-bit AVX instructions simultaneously across all cores, unlike an FX-8350. It'll do that with ANY instruction because none of the execution hardware is shared.
 
A core can take an instruction and execute the whole of it without sharing any parts with any other processor. Bulldozer and sons, when executing a floating point unit task, do not fit that definition. Silvermont will happily execute eight 256-bit AVX instructions simultaneously across all cores, unlike an FX-8350. It'll do that with ANY instruction because none of the execution hardware is shared.
...but the FPU isn't what did Bulldozer in; it was the reduction in the number of uOps per clock that could be accomplished by either the FPU or the integer cores. Fewer uOps per cycle means that, if the bandwidth resources aren't available, full instructions can take more clock cycles to complete, which can further harm performance by essentially stalling the pipeline due to those limited resources on each integer core. The net result is relatively garbage performance.

If you look at Intel, all they've been doing is beefing up their cores when it comes to uOp bandwidth.
 
If you look at Intel, all they've been doing is beefing up their cores when it comes to uOp bandwidth.
Additionally, let's not forget how late AMD is in introducing a uOp cache with Zen now, almost 6 years after Intel's Sandy Bridge ... I don't know by how much, but the absence of a uOp cache in Bulldozer should also contribute to a lower total net uOps/cycle.
 
...but the FPU isn't what did Bulldozer in, it was the reduction in the number of uOps per clock that could be accomplished by either the FPU or the integer cores. Fewer uOps per cycle means that if the bandwidth resources aren't available, full instructions could take more clock cycles to complete which could further harm performance by essentially stalling the pipeline due to these limited resources on each integer core. The net result is relatively garbage performance.

If you look at Intel, all they've been doing is beefing up their cores when it comes to uOp bandwidth.
That's irrelevant. What is relevant is that if the FX-8350 had 8 FPUs (one to go with each integer core, like a traditional core), its multithreaded FPU performance would be better because there would no longer be any chance of blocking. The lawsuit is about AMD calling it an "8 core" processor when it is an "8 integer core" processor. AMD does not make that distinction on the box or in marketing material. It has misled the public by selling 4 multithreaded cores as 8. It would be akin to Intel calling the i7-6700 an "8 core" processor. It doesn't matter that AMD shored up the simultaneous multithreading in Bulldozer and sons with extra hardware for a performance boost. It's still a quad-core when you throw heavy FPU loads at it, and they sold it as an eight-core.
 
That's irrelevant. What is relevant is that if the FX-8350 had 8 FPUs (one to go with each integer core, like a traditional core), its multithreaded FPU performance would be better because there would no longer be any chance of blocking. The lawsuit is about AMD calling it an "8 core" processor when it is an "8 integer core" processor. AMD does not make that distinction on the box or in marketing material. It has misled the public by selling 4 multithreaded cores as 8. It would be akin to Intel calling the i7-6700 an "8 core" processor. It doesn't matter that AMD shored up the simultaneous multithreading in Bulldozer and sons with extra hardware for a performance boost. It's still a quad-core when you throw heavy FPU loads at it, and they sold it as an eight-core.

Is there a universal definition of an x86 core though? They could have handled it better, but I wouldn't say they were lying.
 
AMD pretty much established it with Athlon 64 X2 and Intel followed suit with Pentium D: two processors, one die. The only anomaly is Bulldozer and sons.

The only other modern exception, which I believe @Aquinus pointed out earlier, was SPARC processors for databases. In that case, the FPU is practically a separate core (8:1 ratio) unto itself, because databases usually don't have to deal with floating-point operations. If the cores encountered floating-point work, they'd farm it out to the floating-point core and wait for a response.
 
Is there a universal definition of an x86 core though? They could have handled it better, but I wouldn't say they were lying.

Simply put, if it can execute all the instructions in the x86, or in this case x86_64, instruction set, then it is an x86_64 core. You don't need an FPU to execute any of the instructions in the basic x86_64 instruction set; it just helps performance greatly for some of them.
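The "you don't need an FPU" point is easy to illustrate with fixed-point arithmetic, the classic way software did real-number math on FPU-less chips: store values as scaled integers and use only integer ops. A hedged sketch (Q16.16 format chosen for illustration; not how any particular soft-float library actually works):

```python
# Q16.16 fixed-point: a value v is stored as round(v * 2**16) in an int.
# Everything below is pure integer arithmetic -- the kind of work an
# integer core can do alone; an FPU just does the equivalent faster.
SHIFT = 16
ONE = 1 << SHIFT

def to_fixed(x):   # float -> Q16.16 (only used to build test inputs)
    return int(round(x * ONE))

def to_float(f):   # Q16.16 -> float (for display only)
    return f / ONE

def fx_mul(a, b):  # (a/2^16)*(b/2^16) = a*b/2^32, so shift back once
    return (a * b) >> SHIFT

def fx_div(a, b):  # pre-shift the dividend to keep precision
    return (a << SHIFT) // b

a, b = to_fixed(3.25), to_fixed(1.5)
print(to_float(fx_mul(a, b)))   # 3.25 * 1.5 = 4.875 exactly
print(to_float(fx_div(a, b)))   # 3.25 / 1.5 ≈ 2.1666
```

Soft-float fallbacks in compilers work on the same principle, just emulating IEEE 754 semantics bit-for-bit, which is why FPU-less execution is correct but slow.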

Intel followed suit with Pentium D: two processors, one die.

Yeah, the Pentium D wasn't two processors on one die... oh, and the Core 2 Quad wasn't 4 processors on 1 die either.
 
I think that when push comes to shove, the core count isn't really what people are pissed off about. This is all about the lackluster performance of these CPUs, and I think this is just a facade for that. No one ever said 8 cores had to be fast. :laugh:
 