Friday, November 6th 2015

AMD Dragged to Court over Core Count on "Bulldozer"

This had to happen eventually. AMD has been dragged to court over misrepresentation of its CPU core count in its "Bulldozer" architecture. Tony Dickey, representing himself in the U.S. District Court for the Northern District of California, accused AMD of falsely advertising the core count in its latest CPUs, and contended that because of the way they're physically structured, AMD's 8-core "Bulldozer" chips really only have four cores.

The lawsuit alleges that Bulldozer processors were designed by stripping away components from two cores and combining what was left to make a single "module." In doing so, however, the cores no longer work independently. Due to this, AMD Bulldozer cannot perform eight instructions simultaneously and independently as claimed, or the way a true 8-core CPU would. Dickey is suing for damages, including statutory and punitive damages, litigation expenses, pre- and post-judgment interest, as well as other injunctive and declaratory relief as is deemed reasonable.
Source: LegalNewsOnline

511 Comments on AMD Dragged to Court over Core Count on "Bulldozer"

#426
FordGT90Concept
"I go fast!1!11!1!"
cdawall said: Not to nit pick, but isn't this the exact opposite of what you said earlier? I thought AMD was the only CPU to ever attempt this...
Didn't know about POWER7/8 previously.
cdawall said: Doubtful. AMD can create words to describe things just as well as the next guy. If AMD can't call what they consider a module a module, I guess Intel will have to ditch HyperThreading in favor of SMT. That is literally what you are saying needs to happen.
You can't call donkeys elephants and sell them as elephants without getting sued. AMD did as much, and got sued.
cdawall said: Difference is those only have ONE integer and ONE FPU, not TWO and ONE.
Sure doesn't look like it in the diagram. It actually looks like there are two (one is just math, the other is math + load/store) and each one is two-wide. The only difference is that IBM didn't draw a box around it and say "herp, derp, dis is a 'core'"
cdawall said: I was very specific with the workloads that would show near 100% scaling, I would wager you cannot prove me wrong.
I don't have an FX-8350 to test. I've written a lot of programs that get near 100% scaling. Random Password Generator would actually be a pretty good test for this.

10 million attempts
uncheck special characters
check require special characters (creates an unsolvable situation)
minimum characters 32 Edit: Added this one because it can massively impact time if it randomly does a lot of short ones
5.9142104
Disabled even number cores in Task Manager (it still spawns 8 threads)
13.2191610

123% faster

I'm gonna add a thread limiter to make it easier to test...
#427
64K
FordGT90Concept said: You can't call donkeys elephants and sell them as elephants without getting sued. AMD did as much, and got sued.
That's true and this madness needs to end right now. I'm fed up with Door To Door Donkey Salesmen trying to swindle me out of my hard earned money.

When this trial ends I wager that there will be a legal definition of a core if nothing else. It will be interesting to watch AMD backpedal at that time.
#428
cdawall
where the hell are my stars
FordGT90Concept said: Didn't know about POWER7/8 previously.
Obviously...
FordGT90Concept said: You can't call donkeys elephants and sell them as elephants without getting sued. AMD did as much, and got sued.
Thing is, they didn't call a donkey an elephant; they stepped away from what you believe is the status quo and produced something that was scalable in a way no manufacturer had done before.
FordGT90Concept said: Sure doesn't look like it in the diagram. It actually looks like there are two (one is just math, the other is math + load/store) and each one is two-wide. The only difference is that IBM didn't draw a box around it and say "herp, derp, dis is a 'core'"
You do know why IBM doesn't have to draw boxes around things and explain what cores are, correct? The thing about enterprise-level equipment is that function matters, not the nonsense this lawsuit is about. I say it again: this is literally an argument over a definition that doesn't exist.
FordGT90Concept said: I don't have an FX-8350 to test. I've written a lot of programs that get near 100% scaling. Random Password Generator would actually be a pretty good test for this.
I will give it a shot. I am slightly curious how much of a difference task schedulers make in this situation as well.
64K said: That's true and this madness needs to end right now. I'm fed up with Door To Door Donkey Salesmen trying to swindle me out of my hard earned money.

When this trial ends I wager that there will be a legal definition of a core if nothing else. It will be interesting to watch AMD backpedal at that time.
Thing is, all AMD has to do is stand strong instead of backpedaling. If they strong-arm the lawsuit they will win; if they backpedal, it will be assumed to be due to known guilt.
#429
FordGT90Concept
"I go fast!1!11!1!"
cdawall said: You do know why IBM doesn't have to draw boxes around things and explain what cores are, correct? The thing about enterprise-level equipment is that function matters, not the nonsense this lawsuit is about. I say it again: this is literally an argument over a definition that doesn't exist.
No, it's because IBM knew they couldn't get away with selling the chip as a 16 or 32 "core" processor when it clearly only has 8 cores. You know, like how the FX-8350 clearly only has 4 cores.
cdawall said: Thing is, all AMD has to do is stand strong instead of backpedaling. If they strong-arm the lawsuit they will win; if they backpedal, it will be assumed to be due to known guilt.
You don't think Seagate tried to do the same when sued over HDD capacity? There is no path for AMD to win here.

This is debugging data...I'll upload updated program shortly...
1: 24.6987115
2: 13.1477996
3: 9.1914374
4: 7.4688438
5: 6.8086950
6: 6.2363480
7: 5.8927118
8: 5.7746498
...the application is working correctly. Big jumps between 1-4 where there's an actual core to do the work. Small jumps between 5-8 where HTT is kicking in. Beyond that, performance is expected to fall because the threads are fighting each other for time.

...once W1z lets me edit it that is...


1.1.4, 6700K, final...
8: 5.6961283
7: 5.7390397
6: 6.2014922
5: 6.7108575
4: 7.1342991
3: 8.8729954
2: 12.6389990
1: 24.1833987
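
As a rough check on the scaling in the 6700K run above, the timings can be converted into speedup and parallel efficiency relative to the single-thread result. This is only a sketch: the `speedup` and `efficiency` helpers are illustrative names, not part of the Random Password Generator program, and the figures are simply copied from the list above.

```python
# Wall-clock seconds per thread count, copied from the 6700K run above.
timings = {
    1: 24.1833987,
    2: 12.6389990,
    3: 8.8729954,
    4: 7.1342991,
    5: 6.7108575,
    6: 6.2014922,
    7: 5.7390397,
    8: 5.6961283,
}


def speedup(timings, threads):
    """How many times faster `threads` workers ran than a single worker."""
    return timings[1] / timings[threads]


def efficiency(timings, threads):
    """Speedup divided by thread count; 1.0 would be perfect linear scaling."""
    return speedup(timings, threads) / threads


for n in sorted(timings):
    print(f"{n} threads: {speedup(timings, n):.2f}x, "
          f"{efficiency(timings, n):.0%} efficiency")

# Extra throughput from the second logical processor on each core:
# going from 4 threads to 8 on a 4-core, 8-thread chip.
htt_gain = timings[4] / timings[8] - 1
print(f"4 -> 8 threads: {htt_gain:.0%} more throughput")
```

On these numbers the four-thread run comes out a bit under 3.4x, and the step from four to eight threads adds roughly a quarter more throughput.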
#430
cdawall
where the hell are my stars
How do you edit minimum number of characters?
#432
Aquinus
Resident Wat-man
FordGT90Concept said: He also said blocking was possible. Cores never block other cores ergo not a dual core.
DMA is blocking and memory writes are usually write-back to cache, so does a shared L2 negate the possibility of being a core?
FordGT90Concept said: Except that those "cores" don't understand x86 instructions. They understand opcodes given to them by the instruction decoder and fetcher. On the other hand, a real core (even the POWER7 and POWER8 behemoths) has the hardware to interpret an instruction to a result without leaving the core. So either AMD's definition is wrong or Intel, IBM, ARM Holdings, and Sun are wrong. Considering IBM produces chips that are nearly identical to Bulldozer with four integer clusters and they don't call that a quad-core, I'd say AMD is definitively wrong.
POWER7 is only a behemoth in the sense that it has a strangely large number of discrete FPUs, but the smallest constant unit is the fixed point unit or combo of ALUs and AGUs. IBM produces CPUs that actually have a pretty large amount of floating point hardware given the fact that it's a general purpose CPU.
FordGT90Concept said: All modern operating systems call FX-8350 a quad-core with 8 logical processors, not just Windows. When *nix has to work on POWER7 and Bulldozer, are they really going to use AMD's marketing terms to describe what is actually there? I'd hope not.
You say that like it's because of the definition of a core and not for the sake of how processes are scheduled in the kernel.
FordGT90Concept said: Asynchronous multithreading is always capable of loading systems to 100% so long as it can spawn enough threads and those threads are sufficiently heavy. Overhead is only encountered at the start in the main thread and at the end of the worker thread (well under 1% of compute time).
That depends on how the application is architected. Most applications don't have 100% independent threads and even if they do, they usually require getting joined by a control thread that completes the calculation or whatever is going on. That one thread is going to wait for all the other dispatched ones to complete. Purely functional workloads are going to benefit the most from multiple cores because they have properties that allow for good memory locality (data will primarily reside in cache.) I've been writing multithreaded applications for several years now and I can tell you that in most cases, these workloads aren't purely async. More often than not, there are contested resources that limit overall throughput. Applications that can be made to be purely functional are prime examples of things that should be run on the GPU because of the lack of data dependencies on calculated values.
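
The dispatch-then-join pattern described here can be sketched in a few lines. This is a minimal illustration, not code from any application in this thread: `partial_sum` and `sum_of_squares` are made-up names, and because CPython threads do not run CPU-bound work in parallel, it shows the structure of the join rather than any real speedup.

```python
from concurrent.futures import ThreadPoolExecutor


def partial_sum(chunk):
    """Independent worker: touches only its own slice of the data."""
    return sum(x * x for x in chunk)


def sum_of_squares(data, workers=4):
    # Split the input into one roughly equal, non-overlapping chunk per worker.
    size = max(1, -(-len(data) // workers))  # ceiling division
    chunks = [data[i:i + size] for i in range(0, len(data), size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(partial_sum, c) for c in chunks]
        # Join point: the control thread blocks here until every worker
        # is done, so the slowest worker sets the total time.
        partials = [f.result() for f in futures]
    # Serial tail: only the control thread runs this final combine step.
    return sum(partials)


print(sum_of_squares(list(range(1000))))
```

The final `sum(partials)` is the part that never scales with more threads, however independent the workers are.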

As a developer, if I have a thread that is not limited in most situations and will almost always give me a speedup of over 50% versus another thread, I consider it a core. It's tangible bandwidth that can be had and to me, that's all that matters. HTT only helps in select cases, but more often than not, I can't get a speedup beyond one or two threads over 4 on a quad-core Intel setup using hyperthreading, where I can with Bulldozer integer cores.

I'll agree with you that the integer core isn't what we've traditionally recognized as a core, but it has far too many dedicated resources to call it SMT. So while it might not be a traditional core, it's a lot more like a traditional core than like SMT.
#433
FordGT90Concept
"I go fast!1!11!1!"
Aquinus said: DMA is blocking and memory writes are usually write-back to cache, so does a shared L2 negate the possibility of being a core?
No, because that can happen with any pool of memory with multiple threads accessing it.
Aquinus said: POWER7 is only a behemoth in the sense that it has a strangely large number of discrete FPUs, but the smallest constant unit is the fixed point unit or combo of ALUs and AGUs. IBM produces CPUs that actually have a pretty large amount of floating point hardware given the fact that it's a general purpose CPU.
It has a lot of hardware on both accounts. Unlike UltraSPARC T1, it was designed to do well at everything...so long as it could be broken into a lot of threads.
Aquinus said: You say that like it's because of the definition of a core and not for the sake of how processes are scheduled in the kernel.
They go hand-in-hand. Because a "module" really represents a "core", operating systems need to issue heavy threads to each core before scheduling a second heavy thread to the same cores. Windows XP (or was it 7?) got a patch to fix that on Bulldozer because the order of the cores reported to the operating system differed from HTT's. The OS needs to treat the two technologies similarly to maximize performance.
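
The placement rule described here can be sketched as a toy ordering function. It is not how any real kernel scheduler is implemented; `placement_order` and the processor numbering are assumptions made purely for illustration.

```python
def placement_order(cores):
    """Order logical processors so every core gets one thread before any gets two.

    `cores` is a list of lists: each inner list holds the logical processor
    IDs that share one core (HTT siblings) or one module (Bulldozer).
    """
    order = []
    depth = max((len(siblings) for siblings in cores), default=0)
    for slot in range(depth):
        for siblings in cores:
            if slot < len(siblings):
                order.append(siblings[slot])
    return order


# Hypothetical numbering: four cores/modules with two logical processors each.
topology = [[0, 1], [2, 3], [4, 5], [6, 7]]
print(placement_order(topology))
```

With this numbering the first four heavy threads land on processors 0, 2, 4 and 6, and only the fifth thread has to share a core or module with another.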
Aquinus said: Most applications don't have 100% independent threads and even if they do, they usually require getting joined by a control thread that completes the calculation or whatever is going on.
Virtually every application I multithread does so asynchronously. The only interrupt is updating on progress (worker thread invokes main thread with data which the main thread grabs and carries out). That is probably why there is up to a 20% hit. I suppose I could reduce the number of notifications but...meh. 10 million 32 character passwords generated in <6 seconds is good enough. :roll:
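
The cost of progress notifications can be illustrated with a small sketch of a worker that reports to the main thread through a queue. The program under discussion is not shown in this thread, so everything here (`worker`, `run`, the `notify_every` parameter) is a hypothetical stand-in for the general pattern, not its actual code.

```python
import queue
import threading


def worker(total, notify_every, progress):
    """Do `total` units of work, reporting progress every `notify_every` units."""
    for done in range(1, total + 1):
        # ... one unit of real work would go here ...
        if done % notify_every == 0 or done == total:
            progress.put(done)  # hand-off to the main thread
    progress.put(None)  # sentinel: worker finished


def run(total, notify_every):
    """Run one worker and count how often the main thread was woken up."""
    progress = queue.Queue()
    t = threading.Thread(target=worker, args=(total, notify_every, progress))
    t.start()
    notifications = 0
    while True:
        item = progress.get()  # main thread wakes up once per notification
        if item is None:
            break
        notifications += 1
    t.join()
    return notifications


print(run(10_000, 1))      # one notification per unit of work
print(run(10_000, 1_000))  # one notification per thousand units
```

Raising `notify_every` cuts the number of times the worker and the main thread have to synchronize, which is the overhead being weighed against "good enough" above.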

SMT is a concept, not a hardware design. SMT is what Bulldozer does (two threads in one core).
#434
Aquinus
Resident Wat-man
FordGT90Concept said: No, because that can happen with any pool of memory with multiple threads accessing it.
That doesn't mean that latency is going to be consistent between cores without shared cache. Common L2 makes it advantageous to call two integer cores as a pair of logical cores because they share a local cache. Context switching between those two cores will result in better cache hit rates because data is likely to already reside in L2 if it was used on the other integer core. That improves performance because accessing memory is always slower than hitting cache. It improves latency because you're preserving memory locality, not because you don't understand that an integer core is that thing that does most of the heavy lifting in a general purpose CPU.
FordGT90Concept said: It has a lot of hardware on both accounts. Unlike UltraSPARC T1, it was designed to do well at everything...so long as it could be broken into a lot of threads.
Except it doesn't do everything well. It has a huge emphasis on floating point performance, where the POWER7 might be able to keep up with Haswell, but when it comes to integer performance, it gets smacked down just like AMD does with floating point performance. Intel is successful because it beefs the hell out of its cores. It doesn't mean that what AMD provided is not a core, it just means that its ability to compute per clock cycle is less than Intel's due to the difference inside the cores themselves, not because AMD hasn't produced something that can operate independently. If I really need to dig it out, I can pull up the table of dispatch rates per clock cycle for the most common x86 instructions on several x86-based CPUs. Haswell straight down dominates everything because it can do the most per clock cycle which makes sense because the FPU is gigantic and Intel has just been adding ALUs and AGUs.
FordGT90Concept said: They go hand-in-hand. Because a "module" really represents a "core", operating systems need to issue heavy threads to each core before scheduling a second heavy thread to the same cores. Windows XP (or was it 7?) got a patch to fix that on Bulldozer because the order of the cores reported to the operating system differed from HTT's. The OS needs to treat the two technologies similarly to maximize performance.
As I stated earlier, it's due to memory locality. The Core 2 Quad, being two C2D dies on one chip, could have had improved performance by using this tactic as well because context switching on cores with a shared cache is faster than where there isn't. This isn't because they're not real cores, it's for scheduling purposes but, I'm sure you work with kernels and think about process scheduling all the time and would know this so, I'm just preaching to the choir.
FordGT90Concept said: Virtually every application I multithread does so asynchronously. The only interrupt is updating on progress (worker thread invokes main thread with data). That is probably why there is up to a 20% hit. I suppose I could reduce the number of notifications but...meh. 10 million 32 character passwords generated in <6 seconds is good enough. :roll:
Sounds like a real world situation to me and by no means a theoretical one. :slap:
FordGT90Concept said: SMT is a concept, not a hardware design. SMT is what Bulldozer does (two threads in one core).
SMT is most definitely a hardware design and to say otherwise is insane. Are you telling me that Intel didn't make any changes to their CPUs to support hyper-threading? That is a boatload of garbage. Bulldozer is two cores with shared hardware, most definitely not SMT. SMT is making your core wider to allow for instruction level parallelism which SMT can take advantage of during operations that don't utilize the entire core or eat up the entire part of a single stage of the pipeline. Bulldozer has dedicated hardware and registers. SMT implementations most definitely don't have a dedicated set of registers, ALUs, and AGUs. They utilize the extra hardware already in the core to squeeze in more throughput which is why hyperthreading gets you anywhere between 0 and 40% performance of a full core.
#435
FordGT90Concept
"I go fast!1!11!1!"
Aquinus said: That doesn't mean that latency is going to be consistent between cores without shared cache. Common L2 makes it advantageous to call two integer cores as a pair of logical cores because they share a local cache. Context switching between those two cores will result in better cache hit rates because data is likely to already reside in L2 if it was used on the other integer core. That improves performance because accessing memory is always slower than hitting cache. It improves latency because you're preserving memory locality, not because you don't understand that an integer core is that thing that does most of the heavy lifting in a general purpose CPU.
All caches exist to improve performance. The more caches, with progressively longer response times, the better overall performance will be. L2 can be made part of a core's design but it doesn't have to be. A core generally only needs L1 caches. An example of cores that have dedicated L1 and L2 is much of the Core I# family. Here's Sandy Bridge:

Again, the distinguishing feature of a core is that nothing is shared. L2 can't be considered part of Core 2 Duo's core because it is shared with the neighboring core.
Aquinus said: Except it doesn't do everything well. It has a huge emphasis on floating point performance, where the POWER7 might be able to keep up with Haswell, but when it comes to integer performance, it gets smacked down just like AMD does with floating point performance. Intel is successful because it beefs the hell out of its cores. It doesn't mean that what AMD provided is not a core, it just means that its ability to compute per clock cycle is less than Intel's due to the difference inside the cores themselves, not because AMD hasn't produced something that can operate independently. If I really need to dig it out, I can pull up the table of dispatch rates per clock cycle for the most common x86 instructions on several x86-based CPUs. Haswell straight down dominates everything because it can do the most per clock cycle which makes sense because the FPU is gigantic and Intel has just been adding ALUs and AGUs.
Which AMD did too, but stupidly required a separate thread to access them.
Aquinus said: Sounds like a real world situation to me and by no means a theoretical one. :slap:
Well, you made me test it (everything else the same):
8: 5.6961283 sec -> 5.2848560 sec
1: 24.1833987 -> 24.1773771 sec

It stands to reason that 7 threads wouldn't see that difference because only where the main thread lies does it compete with the worker thread. 8 is still faster than 7 so...it's just a boost cutting out the UI updates.
Aquinus said: SMT is most definitely a hardware design and to say otherwise is insane. Are you telling me that Intel didn't make any changes to their CPUs to support hyper-threading? That is a boatload of garbage. Bulldozer is two cores with shared hardware, most definitely not SMT. SMT is making your core wider to allow for instruction level parallelism which SMT can take advantage of during operations that don't utilize the entire core or eat up the entire part of a single stage of the pipeline. Bulldozer has dedicated hardware and registers. SMT implementations most definitely don't have a dedicated set of registers, ALUs, and AGUs. They utilize the extra hardware already in the core to squeeze in more throughput which is why hyperthreading gets you anywhere between 0 and 40% performance.
SMT isn't defined by one implementation. It does describe HTT and Bulldozer well. Bulldozer takes away from single-threaded performance to boost multi-threaded performance where Intel does the opposite. At the end of the day, they are different means to the same end (more throughput without adding additional cores).
#436
cdawall
where the hell are my stars
FordGT90Concept said: SMT is a concept, not a hardware design. SMT is what Bulldozer does (two threads in one core).
but it has two integer clusters that not only behave, but look like cores that merely lack an FPU which isn't used for 90% of instruction sets?

And again it can process two threads per core or 4 per module.
#437
Aquinus
Resident Wat-man
FordGT90Concept said: All caches exist to improve performance. The more caches, with progressively longer response times, the better overall performance will be. L2 can be made part of a core's design but it doesn't have to be. A core generally only needs L1 caches. An example of cores that have dedicated L1 and L2 is much of the Core I# family. Here's Sandy Bridge:
That's not the point. There are benefits to scheduling processes on cores with a shared cache. It doesn't really matter if you consider it to be part of the core or not. Where it is and how it operates is all that matters and what matters is that calling two cores logical pairs has the benefit of using local cache improving hit rates which improves overall performance. Your pretty picture doesn't really add anything to the discussion, it just shows that you know how to use Google.
FordGT90Concept said: Which AMD did too, but stupidly required a separate thread to access them.
What are you talking about? AMD did the exact opposite by sharing an FPU and doubling the number of dedicated integer cores. IBM put an emphasis on doing pseudo-GPGPU-like floating point parallelism on the CPU where AMD put an emphasis on independent integer operation. You're comparing these two like they're the same but they're almost as different as a CPU versus a GPU.
FordGT90Concept said: Well, you made me test it (everything else the same):
8: 5.6961283 sec -> 5.2848560 sec
1: 24.1833987 -> 24.1773771 sec
Feels pretty theoretical to me. I'm sure that's serving some purpose in the real world that's earning someone money.
FordGT90Concept said: SMT isn't defined by one implementation. It does describe HTT and Bulldozer well. Bulldozer takes away from single-threaded performance to boost multi-threaded performance where Intel does the opposite. At the end of the day, they are different means to the same end (more throughput without adding additional cores).
SMT is defined by the implementation just as discrete computational core is. This isn't software we're talking about. The bold part is exactly what happened but, that doesn't mean they're not cores.
#438
FordGT90Concept
"I go fast!1!11!1!"
cdawall said: but it has two integer clusters that not only behave, but look like cores that merely lack an FPU which isn't used for 90% of instruction sets?
They do not behave like nor look like cores and FPU is a major exclusion.
cdawall said: And again it can process two threads per core or 4 per module.
It cannot.
Aquinus said: That's not the point. There are benefits to scheduling processes on cores with a shared cache. It doesn't really matter if you consider it to be part of the core or not. Where it is and how it operates is all that matters and what matters is that calling two cores logical pairs has the benefit of using local cache improving hit rates which improves overall performance. Your pretty picture doesn't really add anything to the discussion, it just shows that you know how to use Google.
You're talking code that would have to be written for specific processors. That's not something that generally happens in the x86 world. I doubt even Intel's compiler (which is generally considered the best) exploits the shared L2 of Core 2 Duo in the way you are claiming.
Aquinus said: What are you talking about? AMD did the exact opposite by sharing an FPU and doubling the number of dedicated integer cores. IBM put an emphasis on doing pseudo-GPGPU-like floating point parallelism on the CPU where AMD put an emphasis on independent integer operation. You're comparing these two like they're the same but they're almost as different as a CPU versus a GPU.
POWER7 pretty clearly has at least two integer clusters. The only difference between Bulldozer and POWER7 is that POWER7 has a "Unified Issue Queue" where Bulldozer had three separate schedulers (two integer and one floating). That said, each unit could have its own scheduler (not finding anything that details the inner workings of the units).
There are a total of 12 execution units within each core: two fixed-point units, two load-store units, four double-precision floating-point unit (FPU) pipelines, one vector, one branch execution unit (BRU), one condition register logic unit (CRU), and one decimal floating-point unit pipeline. The two load-store pipes can also execute simple fixed-point operations. The four FPU pipelines can each execute double-precision multiply-add operations, accounting for 8 flops/cycle per core.
www.ece.cmu.edu/~ece740/f13/lib/exe/fetch.php?media=kalla_power7.pdf
It has quite the mix of hardware accelerating pretty much every conceivable task.
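
The two totals in the quoted passage can be checked with a few lines of arithmetic. The unit counts are taken straight from the quote; the only assumption added here is the usual convention that one multiply-add counts as two floating-point operations.

```python
# Execution units per POWER7 core, as listed in the quoted passage.
units = {
    "fixed-point": 2,
    "load-store": 2,
    "double-precision FPU pipelines": 4,
    "vector": 1,
    "branch (BRU)": 1,
    "condition register (CRU)": 1,
    "decimal floating-point": 1,
}
total_units = sum(units.values())

# Assumption: a multiply-add counts as two flops (one multiply, one add).
FLOPS_PER_MULTIPLY_ADD = 2
flops_per_cycle = units["double-precision FPU pipelines"] * FLOPS_PER_MULTIPLY_ADD

print(total_units, flops_per_cycle)
```

Both figures match the passage: 12 execution units and 8 flops per cycle per core.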
Aquinus said: SMT is defined by the implementation just as discrete computational core is. This isn't software we're talking about. The bold part is exactly what happened but, that doesn't mean they're not cores.
I've yet to see any evidence that proves the module isn't a core and much to the contrary.
#439
BiggieShady
FordGT90Concept said: I've yet to see any evidence that proves the module isn't a core and much to the contrary.
While you are at it you should also seek evidence if a big rock and the small rock are both rocks, when clearly you can fit several small rocks inside of a big rock. (almost went with the car analogy :laugh: because rocks have no inner workings but hey, they are silicon and also monolithic albeit not by design).
Nobody expects a small car to do the same as a big car, but when it comes to cores people are suddenly acting like they are dealing with SI units. People valuing CPUs by the core count might as well pay for CPUs by the kilogram ... btw, I'm gladly trading one kilogram of Celerons for the same mass of i7s.
Maybe good automotive analogy would be 8 cylinder engine using one spark plug for pair of cylinders twice as often :laugh:

Anyway, operating systems are always dealing with pairs of logical processors to accommodate all possible (existing and not yet existing) physical organizations of execution units in a modern superscalar CPU where both thread data dependency and pure thread parallelism are exploited for optimal gains in all scenarios. This setup is too generic and way too flexible to use logical processor from the OS as an argument in this case. AMD half a module core is a core albeit less potent and less scalable, it's not hyper threading - it's more scalable. Only in terms of scaling you could argue AMD core is less of a core than what is the norm (hence my market share tangent - intel is the norm). So the underdog in the duopoly is not putting enough asterisks and fine print on the marketing material = slap on the wrist (symbolic restitution and obligatory asterisk with fine print for the future*)

* may scale differently with different types of workload
#440
Aquinus
Resident Wat-man
FordGT90Concept said: You're talking code that would have to be written for specific processors. That's not something that generally happens in the x86 world. I doubt even Intel's compiler (which is generally considered the best) exploits the shared L2 of Core 2 Duo in the way you are claiming.
You have absolutely no idea what you're talking about, Ford. The application doesn't need to know anything about cache because it's used automatically. When a memory access occurs, cache is usually hit first because latency to check it is relatively fast. A thread moving from one core to another core on the same L2 is likely to have better hit rates at lower latencies because it's using write-back data from when it was executing on the other core. There is no code that has to be written to do this, it just happens because when the memory address is looked up, is in a cached range, and exists, it will use it. I find it laughable you think the compiler is responsible for this. It's not like software is recompiled to handle different cache configurations.
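
Which logical processors share an L2 is something the operating system exposes rather than something the application has to be compiled for. As a rough sketch, assuming a Linux system that publishes cache topology under sysfs, the sharing set can be read like this; `parse_cpu_list` and `l2_sharers` are illustrative helper names, and the function simply returns `None` where that information is not available.

```python
from pathlib import Path


def parse_cpu_list(text):
    """Parse the kernel's CPU list format, e.g. "0-1", "0,4" or "0-3,8"."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return cpus


def l2_sharers(cpu=0):
    """Return the logical CPUs sharing an L2 cache with `cpu`, or None if unknown."""
    base = Path(f"/sys/devices/system/cpu/cpu{cpu}/cache")
    if not base.is_dir():
        return None  # not Linux, or cache info is not exposed
    for index in sorted(base.glob("index*")):
        try:
            level = (index / "level").read_text().strip()
            if level == "2":
                return parse_cpu_list((index / "shared_cpu_list").read_text())
        except OSError:
            continue
    return None


print(l2_sharers(0))
```

A scheduler that prefers to migrate a thread within that set keeps it close to data already sitting in the shared cache, with no change to the program being run.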
FordGT90Concept said: They do not behave like nor look like cores and FPU is a major exclusion.
Actually for general purpose computation, it's not a major execution unit because the core can run without it. Just because you think it's necessary doesn't mean everyone agrees with you. The FPU also has never been treated as a core, always as an addition to it and additions can be removed.
FordGT90Concept said: It cannot.
I'm pretty sure he meant a thread per core and two threads per module and yes, it can. Just because speed up isn't perfect doesn't mean that it isn't but, speed up is a hell of a lot better than just about every SMT implementation.
FordGT90Concept said: POWER7 pretty clearly has at least two integer clusters.
Two fixed point units and two load store units is another way of saying two ALUs and two AGUs.
FordGT90Concept said: www.ece.cmu.edu/~ece740/f13/lib/exe/fetch.php?media=kalla_power7.pdf
It has quite the mix of hardware accelerating pretty much every conceivable task.
Interesting, other than being able to partition cores into virtual CPUs for the purpose of dispatching along with an actual SMT implementation, it sounds exactly like x86. You do realize this is exactly how just about every implementation of a super scalar architecture begins but, I'm sure you'll Google that in no time.
FordGT90Concept said: I've yet to see any evidence that proves the module isn't a core and much to the contrary.
That's because you're, a: drawing hard lines on something that's a bit arm wavy and a bit vague and b: using Google to help you make that case. That doesn't mean that you understand what you're reading even if you think you do. Just because you read it on the internet doesn't instantly make you an expert on the subject, it means you know how to use Google.
#441
cdawall
where the hell are my stars
Aquinus said: I'm pretty sure he meant a thread per core and two threads per module and yes, it can. Just because speed up isn't perfect doesn't mean that it isn't but, speed up is a hell of a lot better than just about every SMT implementation.
Nope 2 and 4 is what the scheduler can handle. With many many more in the queue.
#442
FordGT90Concept
"I go fast!1!11!1!"
AquinusYou have absolutely no idea what you're talking about, Ford. The application doesn't need to know anything about cache because it's used automatically. When a memory access occurs, cache is usually hit first because latency to check it is relatively fast. A thread moving from one core to another core on the same L2 is likely to have better hit rates at lower latencies because it's using write-back data from when it was executing on the other core. There is no code that has to be written to do this, it just happens because when the memory address is looked up, is in a cached range, and exists, it will use it. I find it laughable you think the compiler is responsible for this. It's not like software is recompiled to handle different cache configurations.
Oh, you're talking context switching. Most of the data is going to be on in L3 which virtually all desktop processors have now. This is why the Core I# series has a small L1, small L2 and big L3. Only the L3 is shared across all of the cores. That said, some architectures let cores make requests of other core's L2 for this very purpose.
AquinusInteresting, other than being able to partition cores into virtual CPUs for the purpose of dispatching along with an actual SMT implementation...
Except that IBM consistently calls the whole monolithic block a "core" accepting 8 threads. You know, like sane people do. :laugh:
AquinusThat's because you're, a: drawing hard lines on something that's a bit arm wavy and a bit vague and b: using Google to help you make that case.
a) Only AMD is "wavy and a big vague" because they see profit in lying to the public.
b) I haven't used Google in a long time.
cdawallNope, 2 and 4 is what the scheduler can handle, with many, many more in the queue.
I don't know whether that claim is true or not, but I do know that if there are more than two threads in the core (as in "module"), those threads will have to be in a wait state. Bulldozer can't execute more than two at a time.
Posted on Reply
#443
BiggieShady
FordGT90ConceptI don't know whether that claim is true or not, but I do know that if there are more than two threads in the core (as in "module"), those threads will have to be in a wait state. Bulldozer can't execute more than two at a time.
The confusion comes from the fact that it can dispatch 16 instructions per clock, which means nothing core-count-wise for a superscalar processor ... other than the whole superscalar aspect additionally muddying the definition of a core ... add to that list also using the word "thread" for both hardware threads and software threads
Posted on Reply
#444
Aquinus
Resident Wat-man
FordGT90ConceptOh, you're talking context switching. Most of the data is going to be in L3, which virtually all desktop processors have now. This is why the Core i# series has a small L1, small L2 and big L3. Only the L3 is shared across all of the cores. That said, some architectures let cores make requests of other cores' L2 for this very purpose.
Or maybe a smaller L2 is faster, takes up less room, and has better latency characteristics than a large one. When L2 is large, you want hit rates to be high because going down to L3 or memory is going to be extra costly given the initial added latency for accessing a larger SRAM array. Switching contexts to a core with a common cache improves performance more than you would think because the further away you get from it, the more time it's going to take to get data in that context. It's the same reason why you have the kernel aware of "cores" and "processors": generally speaking, switching between processors is more costly than switching between cores within a processor, which is more costly than switching between logical cores. It's just exploiting how a kernel scheduler works.
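A rough way to picture that "further away costs more" ordering, as a sketch only: describe each core by the package and the shared-L2 group it sits in, then ask what the nearest thing two cores have in common is. The `Core` fields, the level names and the assumption that every core in a package shares one L3 are made up for illustration; no real latencies are modelled.

```python
from dataclasses import dataclass

# Nearest level two cores can have in common, closest first.
# Hypothetical labels for illustration, not measured latencies.
LEVELS = ["same core", "shared L2", "shared L3", "main memory"]


@dataclass(frozen=True)
class Core:
    package: int   # physical processor (socket)
    l2_group: int  # cores sharing one L2, e.g. the two cores of a module
    index: int     # core number inside that L2 group


def nearest_shared_level(src: Core, dst: Core) -> str:
    """Return the closest level of the hierarchy that both cores can reach."""
    if src == dst:
        return "same core"
    if src.package != dst.package:
        return "main memory"   # nothing on-die in common
    if src.l2_group != dst.l2_group:
        return "shared L3"     # assumes one L3 per package
    return "shared L2"


def migration_distance(src: Core, dst: Core) -> int:
    """Bigger number means the thread's cached data ends up further away."""
    return LEVELS.index(nearest_shared_level(src, dst))


if __name__ == "__main__":
    start = Core(package=0, l2_group=0, index=0)
    for dst in (Core(0, 0, 1), Core(0, 1, 0), Core(1, 0, 0)):
        print(dst, nearest_shared_level(start, dst), migration_distance(start, dst))
```

A scheduler that prefers the smallest `migration_distance` when it has to move a thread is doing roughly what is described above; real kernels weigh load and power as well.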
FordGT90ConceptExcept that IBM consistently calls the whole monolithic block a "core" accepting 8 threads. You know, like sane people do. :laugh:
That's because any less integer hardware and it couldn't do much of anything at all by itself. :laugh:
FordGT90Concepta) Only AMD is "wavy and a bit vague" because they see profit in lying to the public.
Slimming out their cores to get more of them isn't misleading the public. The public in general simply doesn't understand what more cores means, and it doesn't always mean better performance. That's not AMD's fault. Maybe Intel should be sued for NetBurst being shit despite having high clocks. "But it runs at 3.6GHz!" Give me a freaking break and tell people to stop being so damn lazy and learn about what they're using.
FordGT90Conceptb) I haven't used Google in a long time.
I'm sure you have all of those images stored on your hard drive, ready to go at a moment's notice. :roll:
FordGT90ConceptI don't know whether that claim is true or not, but I do know that if there are more than two threads in the core (as in "module"), those threads will have to be in a wait state. Bulldozer can't execute more than two at a time.
Being a superscalar CPU, it can execute several instructions at once, but it depends on which instructions they are and how they're ordered, and that's per integer core. They have their own dedicated hardware that can do multiple instructions at once depending on which part of the pipeline is going to be utilized. Two different mechanisms to handle superscalar instruction-level parallelism, to me, says core. Each integer core having its own L1-d cache also seems to indicate a core to me, since a core cares about its own data and not the other cores on the calculation at hand.
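To show why "how they're ordered" matters, here is a toy model, not a description of Bulldozer itself (which schedules out of order and is far more involved). It assumes a strictly in-order core that issues up to `width` instructions per cycle, where a result only becomes usable in the following cycle. The `cycles_in_order` function and the register names are invented for the example.

```python
from typing import Sequence, Tuple

# One instruction: (destination register, source registers).
Instr = Tuple[str, Tuple[str, ...]]


def cycles_in_order(program: Sequence[Instr], width: int = 2) -> int:
    """Count cycles for a toy in-order core issuing up to `width` per cycle.

    Assumption: an instruction cannot issue in the same cycle as an earlier
    instruction that writes one of its sources. Other hazards are ignored.
    """
    if width < 1:
        raise ValueError("width must be at least 1")
    cycles = 0
    i = 0
    while i < len(program):
        written = set()  # registers written during this cycle
        issued = 0
        while i < len(program) and issued < width:
            dest, sources = program[i]
            if any(src in written for src in sources):
                break  # depends on this cycle's result, wait for next cycle
            written.add(dest)
            issued += 1
            i += 1
        cycles += 1
    return cycles


if __name__ == "__main__":
    # Same four instructions (two independent a->b and x->y pairs), two orders.
    grouped = [("a", ()), ("b", ("a",)), ("x", ()), ("y", ("x",))]
    interleaved = [("a", ()), ("x", ()), ("b", ("a",)), ("y", ("x",))]
    print(cycles_in_order(grouped))      # 3
    print(cycles_in_order(interleaved))  # 2
```

Same work, same hardware width, different cycle count purely because of ordering. A fully dependent chain of four instructions takes four cycles in this model no matter how wide the core is.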
Posted on Reply
#445
FordGT90Concept
"I go fast!1!11!1!"
Still waiting on FX-8### data. Chart really doesn't prove anything without it.
Posted on Reply
#446
cdawall
where the hell are my stars
FordGT90ConceptStill waiting on FX-8### data. Chart really doesn't prove anything without it.
I just set my FX9370 back up at home, haven't had time to test it yet.
Posted on Reply
#447
BiggieShady
I'm also interested in benchmark numbers and scaling @cdawall, although I'm not sure about the effect of @FordGT90Concept's application being .NET-based. Instructions are in Common Intermediate Language, executed on a stack-based "virtual machine" process running on a register-based CPU. We must assume the .NET runtime is well optimized for the Bulldozer arch (or maybe someone knows :laugh:).
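For anyone unsure what "stack based" means here, a minimal sketch: operands are pushed onto a stack and each operation pops its inputs and pushes its result, instead of naming registers. The opcodes below are hypothetical and only loosely in the spirit of CIL; the real .NET runtime JIT-compiles CIL to native register code rather than interpreting it like this.

```python
def run_stack_program(program):
    """Evaluate a list of (opcode, operand) pairs on an operand stack.

    Hypothetical three-opcode machine for illustration only.
    """
    stack = []
    for opcode, operand in program:
        if opcode == "push":
            stack.append(operand)
        elif opcode == "add":
            right = stack.pop()
            left = stack.pop()
            stack.append(left + right)
        elif opcode == "mul":
            right = stack.pop()
            left = stack.pop()
            stack.append(left * right)
        else:
            raise ValueError(f"unknown opcode: {opcode}")
    return stack.pop()


if __name__ == "__main__":
    # (2 + 3) * 4 expressed without naming a single register.
    program = [
        ("push", 2),
        ("push", 3),
        ("add", None),
        ("push", 4),
        ("mul", None),
    ]
    print(run_stack_program(program))  # 20
```

Mapping those implicit stack slots onto a particular CPU's registers and pipelines is the runtime's job, which is why how well it is tuned for a given architecture matters.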

Sadly, there are not many compiler flags usable for Bulldozer on Windows, even when building directly to machine code with the MS compiler. All we have is generic optimizations, the more generic /favor:AMD64 and the even more generic /arch:[IA32|SSE|SSE2|AVX|AVX2]

Linux folks like Bulldozer a bit more because they have GCC with the "magical" -march=bdver1 compiler option, and AMD's own continuation of the Open64 compiler ... also, all libraries are easily rebuilt in the appropriate "flavor"
Posted on Reply
#448
FordGT90Concept
"I go fast!1!11!1!"
This uses WPF so the only way it would work on Linux is emulated.
Posted on Reply
#449
cdawall
where the hell are my stars
I used it on my 5960x and couldn't get consistent results...
Posted on Reply
#450
BiggieShady
FordGT90ConceptThis uses WPF so the only way it would work on Linux is emulated.
Of course, the best you can do is port the long-running thread job part of your app to C or C++, build a Win32 DLL using MinGW with GCC 4.7 and Bulldozer compile flags, then use [DllImport] in your .NET app :laugh: and I wouldn't wish that on anyone.

Also WinForms is not WPF.

Coincidentally, since I felt like I had seen this before, I managed to find an almost year-old similar case: www.leagle.com/decision/In FDCO 20160408M22/DICKEY v. ADVANCED MICRO DEVICES, INC.
and it was dismissed:
...the court GRANTS defendant's motion to dismiss with leave to amend.
Posted on Reply