AMD Dragged to Court over Core Count on "Bulldozer"

cdawall

where the hell are my stars
Joined
Jul 23, 2006
Messages
27,680 (4.24/day)
Location
Houston
System Name All the cores
Processor 2990WX
Motherboard Asrock X399M
Cooling CPU-XSPC RayStorm Neo, 2x240mm+360mm, D5PWM+140mL, GPU-2x360mm, 2xbyski, D4+D5+100mL
Memory 4x16GB G.Skill 3600
Video Card(s) (2) EVGA SC BLACK 1080Ti's
Storage 2x Samsung SM951 512GB, Samsung PM961 512GB
Display(s) Dell UP2414Q 3840X2160@60hz
Case Caselabs Mercury S5+pedestal
Audio Device(s) Fischer HA-02->Fischer FA-002W High edition/FA-003/Jubilate/FA-011 depending on my mood
Power Supply Seasonic Prime 1200w
Mouse Thermaltake Theron, Steam controller
Keyboard Keychron K8
Software W10P

Aquinus

Resident Wat-man
Joined
Jan 28, 2012
Messages
13,148 (2.92/day)
Location
Concord, NH, USA
System Name Apollo
Processor Intel Core i9 9880H
Motherboard Some proprietary Apple thing.
Memory 64GB DDR4-2667
Video Card(s) AMD Radeon Pro 5600M, 8GB HBM2
Storage 1TB Apple NVMe, 4TB External
Display(s) Laptop @ 3072x1920 + 2x LG 5k Ultrafine TB3 displays
Case MacBook Pro (16", 2019)
Audio Device(s) AirPods Pro, Sennheiser HD 380s w/ FIIO Alpen 2, or Logitech 2.1 Speakers
Power Supply 96w Power Adapter
Mouse Logitech MX Master 3
Keyboard Logitech G915, GL Clicky
Software MacOS 12.1
Even though the FPU is separate in SPARC, it behaves like an internal coprocessor.
Hint: It acts like an internal co-processor when it's dedicated per core as well. There is a very fine line between where the FPU starts and ends, and it isn't fully coupled to the integer core like you claim. Yes, results generated by the FPU can flow back to the integer core, but that's usually so the AGU can figure out where to put them in memory after the calculation is complete.

An FPU cannot function as a processor of any kind by itself. Integer math is a requirement for any modern-day machine, whether used personally or in servers. Even GPUs, which are designed to do massively parallel floating point computations, must have the ability to do integer math because floating point means nothing without it. Is it really so hard to comprehend that a CPU can exist without a FPU but a CPU can't exist without integer logic?

Also, IBM's POWER7 has four DP FPUs per core and can do SMT with up to 4 threads per core. The dedicated FPUs didn't make it a core, but the singular pairs of ALUs and AGUs did. How is that any different from the reverse case? If I recall correctly, multi-core POWER CPUs have shared instruction decode logic that puts instructions onto queues for each core. So not only does it have dedicated FPUs contained within a single "core", it has shared logic for all of the cores to dispatch instructions. By your logic, the POWER7 is a one-core CPU because it shares resources between all of the cores, but it could also have 4 times as many cores because of the number of FPUs.

Either way, even if BD had more FPUs or a beefier FPU, I think people would have still called foul on the terrible integer performance, which begins with single-threaded applications running alone. AMD hoped that more cores were going to offset the degradation of IPC, but they were wrong. Haswell's integer core has twice as many ALUs as BD and one more AGU. That alone should tell you something.

Simple fact is that AMD told the public that Bulldozer was going to have a 256-bit FMA FPU per module. There was no deception. The problem is that most people don't know what the hell that means. People probably also don't know that their Intel CPU probably has dual dispatch 256-bit FPUs per integer core. Different CPUs with different goals. That's it.
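For readers unsure what a "256-bit FMA FPU" implies, here is a rough sketch of the arithmetic. It assumes 64-bit doubles, one vector operation issued per unit per cycle, and a fused multiply-add counted as two floating point operations. The function names and the unit counts are illustrative assumptions, not vendor specifications.

```python
def lanes(vector_bits, element_bits):
    """Number of elements that fit in one vector register."""
    return vector_bits // element_bits


def peak_flops_per_cycle(vector_bits, element_bits, fma_units):
    """Theoretical peak, counting each fused multiply-add as 2 operations."""
    return lanes(vector_bits, element_bits) * 2 * fma_units


# Illustrative configurations only (assumed, not measured):
one_unit = peak_flops_per_cycle(256, 64, fma_units=1)   # one 256-bit FMA unit
two_units = peak_flops_per_cycle(256, 64, fma_units=2)  # two 256-bit FMA units
print(one_unit, two_units)
```

Under these assumptions a single shared 256-bit unit peaks at half of what two such units reach. Real throughput also depends on scheduling, memory, and the instruction mix, so this is only the paper number.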
 

FordGT90Concept

"I go fast!1!11!1!"
Joined
Oct 13, 2008
Messages
26,259 (4.60/day)
Location
IA, USA
System Name BY-2021
Processor AMD Ryzen 7 5800X (65w eco profile)
Motherboard MSI B550 Gaming Plus
Cooling Scythe Mugen (rev 5)
Memory 2 x Kingston HyperX DDR4-3200 32 GiB
Video Card(s) AMD Radeon RX 7900 XT
Storage Samsung 980 Pro, Seagate Exos X20 TB 7200 RPM
Display(s) Nixeus NX-EDG274K (3840x2160@144 DP) + Samsung SyncMaster 906BW (1440x900@60 HDMI-DVI)
Case Coolermaster HAF 932 w/ USB 3.0 5.25" bay + USB 3.2 (A+C) 3.5" bay
Audio Device(s) Realtek ALC1150, Micca OriGen+
Power Supply Enermax Platimax 850w
Mouse Nixeus REVEL-X
Keyboard Tesoro Excalibur
Software Windows 10 Home 64-bit
Benchmark Scores Faster than the tortoise; slower than the hare.
If it was established why can't you find a definition?
I already gave one from Webopedia. You can look at most architectures and see it matches Webopedia's definition. Example UltraSPARC T2 (UltraSPARC T1 had the FPU connected to the crossbar):


Hint: It acts like an internal co-processor when it's dedicated per core as well.
The FPU is like x87 where it is connected to a system bus (crossbar in UltraSPARC T1). It's a discrete processor that handles its own instructions with its own caches. It shares nothing with any core. In Bulldozer, one instruction decoder handles three components (FPU + two integer clusters). No processor exists before or since with that kind of layout.

Is it really so hard to comprehend that a CPU can exist without a FPU but a CPU can't exist without integer logic?
I never said it couldn't, but in recent history, every time it was done, it was considered an error in hindsight. Examples: UltraSPARC T1 had one FPU to 8 cores; UltraSPARC T2 moved the FPU into the 8 cores so there's a total of 8. Bulldozer and sons had one FPU per two integer clusters; Zen is moving to one FPU per core. Gimping the FPU is a great way to lose processor sales to the competition. So technically it can be done, but in application, it's foolish.
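The ratios listed above can be tabulated with a short sketch. The counts are the ones stated in this post; the names `DESIGNS` and `fpu_ratio` are made up for illustration.

```python
from fractions import Fraction

# (design, FPUs, cores or integer clusters), as stated in the post above
DESIGNS = [
    ("UltraSPARC T1", 1, 8),
    ("UltraSPARC T2", 8, 8),
    ("Bulldozer module", 1, 2),
    ("Zen core", 1, 1),
]


def fpu_ratio(fpus, integer_units):
    """FPUs available per core or integer cluster, as an exact fraction."""
    return Fraction(fpus, integer_units)


for name, fpus, units in DESIGNS:
    print(f"{name}: {fpu_ratio(fpus, units)} FPU per integer unit")
```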

Also, IBM's POWER7 has four DP FPUs per core and can do SMT with up to 4 threads per core. The dedicated FPUs didn't make it a core, but the singular pairs of ALUs and AGUs did. How is that any different from the reverse case? If I recall correctly, multi-core POWER CPUs have shared instruction decode logic that puts instructions onto queues for each core. So not only does it have dedicated FPUs, it has shared logic for all of the cores. By your logic, the POWER7 is a one-core CPU because it shares resources between all of the cores.
Oh look, it's all packed into each core like expected:

Seriously, stop thinking so hard. It is very simple.
 

cdawall

I already gave one from Webopedia. You can look at most architectures and see it matches Webopedia's definition. Example UltraSPARC T2 (UltraSPARC T1 had the FPU connected to the crossbar):

So where in that image does it say every core has to be set up in this exact configuration to qualify as a core? That isn't even an x86-64 CPU, so the design on that end alone would allow differences.

The FPU is like x87 where it is connected to a system bus (crossbar in UltraSPARC T1). It's a discrete processor that handles its own instructions with its own caches. It shares nothing with any core. In Bulldozer, one instruction decoder handles three components (FPU + two integer clusters). No processor exists before or since with that kind of layout.

It took 3 generations of CPUs for Intel to implement HT again after the fiasco that was NetBurst. Remember that before playing the "never existed" card.
 

FordGT90Concept

So where in that image does it say every core has to be set up in this exact configuration to qualify as a core? That isn't even an x86-64 CPU, so the design on that end alone would allow differences.
Each core is fully autonomous. That is the defining feature of a core. Nothing is shared. Bulldozer shares a lot; UltraSPARC T1 shares nothing (work has to leave the core to reach the FPU, making the FPU a coprocessor).

It took 3 generations of CPUs for Intel to implement HT again after the fiasco that was NetBurst. Remember that before playing the "never existed" card.
They're separate lineages:
Long pipelines: Pentium 4 --USA -> Core I#
Short pipelines: Pentium M --Israel-> Core/Core 2 (I think it lives on today as Atom)

HTT was never technically gone--they just weren't launching new processors of its design because Netburst was a clusterfuck that took years to clean up. That said, I really don't get your line of thought with this comment.
 

Aquinus

Seriously, stop thinking so hard. It is very simple.
Take your own advice. A core is something that can, by itself, execute instructions independently.
Oh look, it's all packed into each core like expected:

Seriously, stop thinking so hard. It is very simple.
You do realize that each one of those POWER7 cores has the same integer hardware as Bulldozer's integer core, and even has shared dispatch hardware not shown on that diagram, which only describes the memory hierarchy?
 

FordGT90Concept

Take your own advice. A core is something that can, by itself, execute instructions independently.
Except that the integer cluster gets instructions decoded by separate hardware that it does not possess. It is dependent on the hardware around it--completely useless without it.

You do realize that each one of those POWER7 cores has the same integer hardware as Bulldozer's integer core, and even has shared dispatch hardware not shown on that diagram, which only describes the memory hierarchy?
I can't find anything to support this claim. All I could find is POWER8, which does have "predecode," but look further down the pipeline and each core still has a dedicated decoder:

It almost appears that it has at least two ALUs and two FPUs. And why not? With 8 threads in the core, it can certainly keep them busy. I got no problem with multiple integer clusters and floating point clusters inside a core. The point is, each one does not constitute a core--the whole of it does. Instruction to result, it never leaves the core. The same should be said of Bulldozer's "module."
 
Joined
Feb 8, 2012
Messages
3,013 (0.67/day)
Location
Zagreb, Croatia
System Name Windows 10 64-bit Core i7 6700
Processor Intel Core i7 6700
Motherboard Asus Z170M-PLUS
Cooling Corsair AIO
Memory 2 x 8 GB Kingston DDR4 2666
Video Card(s) Gigabyte NVIDIA GeForce GTX 1060 6GB
Storage Western Digital Caviar Blue 1 TB, Seagate Baracuda 1 TB
Display(s) Dell P2414H
Case Corsair Carbide Air 540
Audio Device(s) Realtek HD Audio
Power Supply Corsair TX v2 650W
Mouse Steelseries Sensei
Keyboard CM Storm Quickfire Pro, Cherry MX Reds
Software MS Windows 10 Pro 64-bit
I can't find anything to support this claim.
pwr7.jpg

It looks to me like the instruction dispatcher is shared between 4 fixed point units, and it's all inside the core boundary. Since it's already shared, isn't what really matters how wide it is, i.e. how many instructions per clock it can dispatch? How is this different from having a single double-wide dispatcher outside the core boundaries, shared between two cores?
The answer is that it doesn't matter. This POWER7 core could be split into 2 weaker cores that would be less superscalar on their own; each would need more cycles for wider instructions, but they would be truly two independent, weaker cores.
 

FordGT90Concept

Like POWER8, it appears to be a complete processor with lots of extra hardware to increase throughput. "Core boundary" is right.

I do see the similarities between that and Bulldozer, yet IBM calls it what it is: a core. AMD does not. Like I said, all data points to AMD lying to make the processors look better next to Intel.

To be very clear: I have no issue with Bulldozer's design. I have an issue with AMD doubling the "core" count.
 
Joined
Feb 8, 2012
Messages
3,013 (0.67/day)
Location
Zagreb, Croatia
System Name Windows 10 64-bit Core i7 6700
Processor Intel Core i7 6700
Motherboard Asus Z170M-PLUS
Cooling Corsair AIO
Memory 2 x 8 GB Kingston DDR4 2666
Video Card(s) Gigabyte NVIDIA GeForce GTX 1060 6GB
Storage Western Digital Caviar Blue 1 TB, Seagate Baracuda 1 TB
Display(s) Dell P2414H
Case Corsair Carbide Air 540
Audio Device(s) Realtek HD Audio
Power Supply Corsair TX v2 650W
Mouse Steelseries Sensei
Keyboard CM Storm Quickfire Pro, Cherry MX Reds
Software MS Windows 10 Pro 64-bit
To be very clear: I have no issue with Bulldozer's design. I have an issue with AMD doubling the "core" count.
It is clear: you have an issue with code made of pure 256-bit AVX instructions not scaling beyond 4 threads, yet you are completely fine with the bad cache hit rates and gimped uop scheduler. IMO it should be the other way round.
 

FordGT90Concept

Look at the FX-8350 from the perspective of being a quad-core. AVX 256-bit becomes a non-issue.

Single-threaded performance is peripheral to the lawsuit. Yeah, it isn't the best but there's really nothing misleading about that part. AMD struggled in that department since Intel has prioritized it.

It looks to me like the instruction dispatcher is shared between 4 fixed point units, and it's all inside the core boundary. Since it's already shared, isn't what really matters how wide it is, i.e. how many instructions per clock it can dispatch? How is this different from having a single double-wide dispatcher outside the core boundaries, shared between two cores?
Because the whole of it is one core--not a component inside. If IBM called those two "Fixed Point Units" "cores," I'd be as up in arms over that as I am over Bulldozer. But they didn't because sense. If only AMD had sense.
 

Aquinus

AVX 256-bit becomes a non-issue.
AVX 256-bit is already a non-issue because hardly any software relies on 256-bit wide vector floating point math.
If IBM called those two "Fixed Point Units" "cores," I'd be as up in arms over that as I am over Bulldozer.
The other name for those "fixed point units" is ALUs. Remember when I said POWER7 has the same integer hardware as a single BD core? That's two ALUs and two AGUs.
 

FordGT90Concept

The other name for those "fixed point units" is ALUs. Remember when I said POWER7 has the same integer hardware as a single BD core? That's two ALUs and two AGUs.
Yet, nothing is shared with a neighboring "core."

Zen is going to have 4 ALUs and 2 AGUs. Does that redefine what a core is? Nope, it just increases the amount of parallelism the processor is capable of. Adding a second integer cluster does the same damn thing (not a "core").
 

Aquinus

Yet, nothing is shared with a neighboring "core."

Zen is going to have 4 ALUs and 2 AGUs. Does that redefine what a core is? Nope, it just increases the amount of parallelism the processor is capable of. Adding a second integer cluster does the same damn thing (not a "core").
I see the same gimped FMA FPU though. Weren't you complaining about FP throughput?
 
Joined
Feb 8, 2012
Messages
3,013 (0.67/day)
Location
Zagreb, Croatia
System Name Windows 10 64-bit Core i7 6700
Processor Intel Core i7 6700
Motherboard Asus Z170M-PLUS
Cooling Corsair AIO
Memory 2 x 8 GB Kingston DDR4 2666
Video Card(s) Gigabyte NVIDIA GeForce GTX 1060 6GB
Storage Western Digital Caviar Blue 1 TB, Seagate Baracuda 1 TB
Display(s) Dell P2414H
Case Corsair Carbide Air 540
Audio Device(s) Realtek HD Audio
Power Supply Corsair TX v2 650W
Mouse Steelseries Sensei
Keyboard CM Storm Quickfire Pro, Cherry MX Reds
Software MS Windows 10 Pro 64-bit
What would you say if Zen was presented as a 2 cores per module cpu like this? :laugh:
AMD-Zen-CPU-Architecture-7.png
 

cdawall

Each core is fully autonomous. That is the defining feature of a core. Nothing is shared. Bulldozer shares a lot; UltraSPARC T1 shares nothing (work has to leave the core to reach the FPU, making the FPU a coprocessor).

So by that logic sharing an L2 is not a core.

They're separate lineages:
Long pipelines: Pentium 4 --USA -> Core I#
Short pipelines: Pentium M --Israel-> Core/Core 2 (I think it lives on today as Atom)

HTT was never technically gone--they just weren't launching new processors of its design because Netburst was a clusterfuck that took years to clean up. That said, I really don't get your line of thought with this comment.

Simple: HT showed a performance degradation in a lot of scenarios back when it first came out. Software and hardware evolved and now SMT is the status quo. So the idea that sharing resources and an FPU is the devil and "isn't a real core" might be an issue right now, but this shit will come back. These chips were meant for an HPC cluster and performed better than Intel's offerings at the time, and they did so for a reason. As you said yourself, size-wise the modules look more like a traditional core than the cores do, yet in a massively multithreaded, non-biased environment you were seeing scaling near 100% per core. Something Intel wasn't able to emulate until Haswell was released.
 

FordGT90Concept

I see the same gimped FMA FPU though. Weren't you complaining about FP throughput?
There is one per core. It is not gimped because it is not shared. 8 cores = 8 FPUs. In Bulldozer, not only were there 4 FPUs, but each one was only adequate for one core.

What would you say if Zen was presented as a 2 cores per module cpu like this? :laugh:
View attachment 79610
If they called the combined object a "module" and not a "core," throw Zen into the lawsuit.

So by that logic sharing an L2 is not a core.
L2 has always been optional. The same goes for L3 and L4 (eDRAM). They only exist to reduce memory latency. They are not critical to the function of a core. That said, L1 -> system memory would be painfully slow.

Simple: HT showed a performance degradation in a lot of scenarios back when it first came out. Software and hardware evolved and now SMT is the status quo. So the idea that sharing resources and an FPU is the devil and "isn't a real core" might be an issue right now, but this shit will come back. These chips were meant for an HPC cluster and performed better than Intel's offerings at the time, and they did so for a reason. As you said yourself, size-wise the modules look more like a traditional core than the cores do, yet in a massively multithreaded, non-biased environment you were seeing scaling near 100% per core. Something Intel wasn't able to emulate until Haswell was released.
Pentium 4 didn't originally come with HTT. Intel saw all of the cache misses with Pentium 4 and thought a solution to minimize performance loss when that happens is to give it a second thread to work on while the first thread was retrieving data. This was when most software was coded for a single processor. It was also something added in hindsight--not a very good implementation. When they went to design Nehalem, they started designing the architecture from the perspective of having HTT. That's why its implementation was much better.

Remember that Bulldozer was AMD's first attempt at simultaneous multithreading. The first try was pretty bad (Bulldozer), and they improved it with each iteration, but they couldn't fundamentally fix the blocking problems and poor single-threaded performance. Zen throws out Bulldozer's ideas and replaces them with HTT-like simultaneous multithreading. I'm not expecting AMD's Zen SMT performance to match HTT because Intel has a lot of practice. At least it is a step in the right direction.

8 Intel cores is going to beat 8 Bulldozer "cores." Intel is going to charge you a lot more for the privilege though.

Diagrams above showed 75% gain at best, 25% at worst, not "near 100%" (that would be a real dual core, not a hybrid like Bulldozer is). AMD sacrificed single-threaded performance for that, though, whereas Intel did not for its 0-50% gain.
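To make the percentages concrete, here is a minimal sketch of how such a scaling gain is usually computed from two benchmark scores. The scores below are hypothetical placeholders chosen to reproduce the 75% and 25% figures; they are not measurements taken from the diagrams.

```python
def scaling_gain(one_thread_score, two_thread_score):
    """Percent throughput gained by adding a second thread to the same module or core."""
    return (two_thread_score - one_thread_score) / one_thread_score * 100.0


# Hypothetical scores (assumed for illustration only):
best = scaling_gain(100.0, 175.0)    # the "75% at best" case
worst = scaling_gain(100.0, 125.0)   # the "25% at worst" case
ideal = scaling_gain(100.0, 200.0)   # what two fully independent cores would approach
print(best, worst, ideal)
```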
 

cdawall

L2 has always been optional. The same goes for L3 and L4 (eDRAM). They only exist to reduce memory latency. They are not critical to the function of a core. That said, L1 -> system memory would be painfully slow.

FPU is optional as well. Hence the lack of its existence, obviously.


Pentium 4 didn't originally come with HTT. Intel saw all of the cache misses with Pentium 4 and thought a solution to minimize performance loss when that happens is to give it a second thread to work on while the first thread was retrieving data. This was when most software was coded for a single processor. It was also something added in hindsight--not a very good implementation. When they went to design Nehalem, they started designing the architecture from the perspective of having HTT. That's why its implementation was much better.

Bad argument, my point stands: Intel released a hunk of shit, then took something that worked in theory and applied it to a later CPU. There is no reason why we won't see the module ideology expand and continue. The design was ahead of its time and not targeted at peasant workloads. It is and always will be an HPC chip.

Remember that Bulldozer was AMD's first attempt at simultaneous multithreading. The first try was pretty bad (Bulldozer), and they improved it with each iteration, but they couldn't fundamentally fix the blocking problems and poor single-threaded performance. Zen throws out Bulldozer's ideas and replaces them with HTT-like simultaneous multithreading. I'm not expecting AMD's Zen SMT performance to match HTT because Intel has a lot of practice. At least it is a step in the right direction.

Technically bulldozer could handle 2 threads per core or 4 per module on top of the whole two core idea, so where in the Windows task manager did that fall?

8 Intel cores is going to beat 8 Bulldozer "cores." Intel is going to charge you a lot more for the privilege though.

Which generation? Massively multithreaded environments outside of windows tell a tale...

Diagrams above showed 75% gain at best, 25% at worst, not "near 100%" (that would be a real dual core, not a hybrid like Bulldozer is). AMD sacrificed single-threaded performance for that, though, whereas Intel did not for its 0-50% gain.

Cool, I can make diagrams where it shows nearly 100% scaling, depending hugely on the OS it runs under. Even using your numbers, what scaling does HT show? It sure isn't 75%. Another proof that these are "real" cores.
 

Frick

Fishfaced Nincompoop
Joined
Feb 27, 2006
Messages
18,968 (2.84/day)
Location
Piteå
System Name Black MC in Tokyo
Processor Ryzen 5 5600
Motherboard Asrock B450M-HDV
Cooling Be Quiet! Pure Rock 2
Memory 2 x 16GB Kingston Fury 3400mhz
Video Card(s) XFX 6950XT Speedster MERC 319
Storage Kingston A400 240GB | WD Black SN750 2TB |WD Blue 1TB x 2 | Toshiba P300 2TB | Seagate Expansion 8TB
Display(s) Samsung U32J590U 4K + BenQ GL2450HT 1080p
Case Fractal Design Define R4
Audio Device(s) Line6 UX1 + some headphones, Nektar SE61 keyboard
Power Supply Corsair RM850x v3
Mouse Logitech G602
Keyboard Cherry MX Board 1.0 TKL Brown
Software Windows 10 Pro
Benchmark Scores Rimworld 4K ready!
One day I'll read this thread and dole out thanks whenever I learn something. Should be good. :D
 
Joined
Sep 15, 2011
Messages
6,520 (1.40/day)
Processor Intel® Core™ i7-13700K
Motherboard Gigabyte Z790 Aorus Elite AX
Cooling Noctua NH-D15
Memory 32GB(2x16) DDR5@6600MHz G-Skill Trident Z5
Video Card(s) ZOTAC GAMING GeForce RTX 3080 AMP Holo
Storage 2TB SK Platinum P41 SSD + 4TB SanDisk Ultra SSD + 500GB Samsung 840 EVO SSD
Display(s) Acer Predator X34 3440x1440@100Hz G-Sync
Case NZXT PHANTOM410-BK
Audio Device(s) Creative X-Fi Titanium PCIe
Power Supply Corsair 850W
Mouse Logitech Hero G502 SE
Software Windows 11 Pro - 64bit
Benchmark Scores 30FPS in NFS:Rivals

FordGT90Concept

FPU is optional as well. Hence the lack of its existence, obviously.
In theory, not in practice.

Bad argument, my point stands, Intel released a hunk of shit. Took something that worked in theory and applied it to a later CPU. There is no reason why we won't see the module ideology expand and continue. The design was ahead of its time and not targeted at peasant workloads. It is and always will be an HPC chip.
They are wide cores. This lawsuit will likely force AMD to call them cores too.

Technically Bulldozer could handle 2 threads per core or 4 per module on top of the whole two core idea, so where in the Windows task manager did that fall?
It would still be a 4-threaded core. A lot of enterprise RISC processors already handle 8-threads per core (many FPUs and ALUs in each) so that isn't exactly new.

Which generation? Massively multithreaded environments outside of Windows tell a tale...
Sandy Bridge/Ivy Bridge, which were out at about the same time as Bulldozer.

Cool, I can make diagrams showing nearly 100% scaling, depending hugely on the OS it runs under. Even using your numbers, what scaling does HT show? It sure isn't 75%. More proof that these are "real" cores.
Go ahead and run your benchmarks then. I'm waiting. Here's the post, by the way. Spoiler: it will never reach the 95%+ scaling that an actual dual core would.
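For anyone who does run the benchmark, here is a minimal sketch of how a scaling percentage like the 75% and 95% figures above could be computed from two wall-clock timings of the same fixed workload. The function name `scaling_percent` and the timings are made up for illustration; they are not measurements from any chip discussed in this thread.

```python
def scaling_percent(t_one: float, t_two: float) -> float:
    """Extra throughput gained from the second thread, as a percentage.

    t_one: seconds to finish the workload on one thread.
    t_two: seconds to finish the same workload on two threads.
    100.0 means the second thread doubled throughput; 0.0 means it added nothing.
    """
    if t_one <= 0 or t_two <= 0:
        raise ValueError("timings must be positive")
    return (t_one / t_two - 1.0) * 100.0


# Hypothetical timings in seconds, chosen only to show the arithmetic.
print(round(scaling_percent(100.0, 50.0), 1))  # 100.0 (perfect scaling)
print(round(scaling_percent(100.0, 80.0), 1))  # 25.0
```

The same formula applies whether the second thread lands on a second core, the other half of a module, or an SMT sibling, so the result depends entirely on where the scheduler places the threads.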

Still haven't got my answer, if you can oc each of the 8 cores independently?
@MalakiLab claims it is possible to change the clock speeds on the integer clusters, which raises the question: what speed are the FPU, instruction decoder, and so on running at? Also note in the picture how Linux calls the FX-6350 a tri-core.
 
Joined
Sep 15, 2011
Messages
6,520 (1.40/day)
Processor Intel® Core™ i7-13700K
Motherboard Gigabyte Z790 Aorus Elite AX
Cooling Noctua NH-D15
Memory 32GB(2x16) DDR5@6600MHz G-Skill Trident Z5
Video Card(s) ZOTAC GAMING GeForce RTX 3080 AMP Holo
Storage 2TB SK Platinum P41 SSD + 4TB SanDisk Ultra SSD + 500GB Samsung 840 EVO SSD
Display(s) Acer Predator X34 3440x1440@100Hz G-Sync
Case NZXT PHANTOM410-BK
Audio Device(s) Creative X-Fi Titanium PCIe
Power Supply Corsair 850W
Mouse Logitech Hero G502 SE
Software Windows 11 Pro - 64bit
Benchmark Scores 30FPS in NFS:Rivals

FordGT90Concept

"I go fast!1!11!1!"
Joined
Oct 13, 2008
Messages
26,259 (4.60/day)
Location
IA, USA
System Name BY-2021
Processor AMD Ryzen 7 5800X (65w eco profile)
Motherboard MSI B550 Gaming Plus
Cooling Scythe Mugen (rev 5)
Memory 2 x Kingston HyperX DDR4-3200 32 GiB
Video Card(s) AMD Radeon RX 7900 XT
Storage Samsung 980 Pro, Seagate Exos X20 TB 7200 RPM
Display(s) Nixeus NX-EDG274K (3840x2160@144 DP) + Samsung SyncMaster 906BW (1440x900@60 HDMI-DVI)
Case Coolermaster HAF 932 w/ USB 3.0 5.25" bay + USB 3.2 (A+C) 3.5" bay
Audio Device(s) Realtek ALC1150, Micca OriGen+
Power Supply Enermax Platimax 850w
Mouse Nixeus REVEL-X
Keyboard Tesoro Excalibur
Software Windows 10 Home 64-bit
Benchmark Scores Faster than the tortoise; slower than the hare.
Power management circuits can be added pretty much anywhere in a processor to shut parts of it off. It only proves that Bulldozer has those circuits.
 

Aquinus

Resident Wat-man
Joined
Jan 28, 2012
Messages
13,148 (2.92/day)
Location
Concord, NH, USA
System Name Apollo
Processor Intel Core i9 9880H
Motherboard Some proprietary Apple thing.
Memory 64GB DDR4-2667
Video Card(s) AMD Radeon Pro 5600M, 8GB HBM2
Storage 1TB Apple NVMe, 4TB External
Display(s) Laptop @ 3072x1920 + 2x LG 5k Ultrafine TB3 displays
Case MacBook Pro (16", 2019)
Audio Device(s) AirPods Pro, Sennheiser HD 380s w/ FIIO Alpen 2, or Logitech 2.1 Speakers
Power Supply 96w Power Adapter
Mouse Logitech MX Master 3
Keyboard Logitech G915, GL Clicky
Software MacOS 12.1
In theory, not in practice.
Theory is having only a FPU and no integer cores. Every x86 CPU from the architecture's inception to date has had an integer pipeline. Every single one. Whereas not every one has had an integrated FPU. In modern times the benefit of having a FPU is enough to include it all the time, but there is absolutely nothing to suggest that the FPU is required for the definition of a core, because cores used to ship without one. It was done once before, it can be done again. Once again, as the guy from AMD said in an interview, 90% of the work CPUs handle is integer in nature (and my work as a software engineer aligns with this statement.) It only makes sense to beef up a CPU to accommodate that kind of workload if die space is at a premium.
They are wide cores. This lawsuit will likely force AMD to call them cores too.
2 ALUs and 2 AGUs makes them skinny cores, just as the single-issue 256-bit FMA FPU (which can be split into dual-issue 128-bit) is a skinny FPU. They're also independent ALUs and AGUs which can receive their own instructions, which feels a whole lot like a core. They have their own registers, their own control lines, and even their own instruction cache. Even the way that they scale feels, smells, and tastes like cores and not SMT. They're also not wide cores if you're comparing the integer pipeline against Haswell's 4 ALUs and 3 AGUs or the FPU against Intel's double-wide FPU that can quad-issue 128-bit ops and dual-issue 256-bit AVX.
It would still be a 4-threaded core. A lot of enterprise RISC processors already handle 8-threads per core (many FPUs and ALUs in each) so that isn't exactly new.
So now we're letting Microsoft define a core? Are you ever going to make up your mind or are you going to keep changing it to suit your argument?
Go ahead and run your benchmarks then. I'm waiting. Here's the post, by the way. Spoiler: it will never reach 95%+ that an actual dual core would.
Spoiler: Most multi-threaded workloads that aren't purely parallel in nature will never have 100% speed up indefinitely. More cores means more overhead.
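One common way to put numbers on this is Amdahl's law. It models a fixed serial fraction, not per-core overhead that grows with core count, so treat the following as a simplified sketch of the point above, not a model of any particular chip. The 95% parallel fraction is an arbitrary assumption.

```python
def amdahl_speedup(parallel_fraction: float, workers: int) -> float:
    """Ideal speedup under Amdahl's law: 1 / ((1 - p) + p / n)."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if workers < 1:
        raise ValueError("workers must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / workers)


# Assumed: 95% of the work parallelizes. Efficiency is speedup per worker.
for n in (1, 2, 4, 8):
    s = amdahl_speedup(0.95, n)
    print(f"{n} workers: {s:.2f}x speedup, {s / n:.0%} efficiency")
```

Even with no scheduling or synchronization cost modeled at all, efficiency per worker drops as workers are added whenever any part of the job is serial.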
 

FordGT90Concept

"I go fast!1!11!1!"
Joined
Oct 13, 2008
Messages
26,259 (4.60/day)
Location
IA, USA
System Name BY-2021
Processor AMD Ryzen 7 5800X (65w eco profile)
Motherboard MSI B550 Gaming Plus
Cooling Scythe Mugen (rev 5)
Memory 2 x Kingston HyperX DDR4-3200 32 GiB
Video Card(s) AMD Radeon RX 7900 XT
Storage Samsung 980 Pro, Seagate Exos X20 TB 7200 RPM
Display(s) Nixeus NX-EDG274K (3840x2160@144 DP) + Samsung SyncMaster 906BW (1440x900@60 HDMI-DVI)
Case Coolermaster HAF 932 w/ USB 3.0 5.25" bay + USB 3.2 (A+C) 3.5" bay
Audio Device(s) Realtek ALC1150, Micca OriGen+
Power Supply Enermax Platimax 850w
Mouse Nixeus REVEL-X
Keyboard Tesoro Excalibur
Software Windows 10 Home 64-bit
Benchmark Scores Faster than the tortoise; slower than the hare.
Once again, as the guy from AMD said in an interview, 90% of the work CPUs handle is integer in nature (and my work as a software engineer aligns with this statement.)
He also said blocking was possible. Cores never block other cores, ergo not a dual core.

2 ALUs and 2 AGUs makes them skinny cores, just as the single-issue 256-bit FMA FPU (which can be split into dual-issue 128-bit) is a skinny FPU. They're also independent ALUs and AGUs which can receive their own instructions, which feels a whole lot like a core. They have their own registers, their own control lines, and even their own instruction cache. Even the way that they scale feels, smells, and tastes like cores and not SMT. They're also not wide cores if you're comparing the integer pipeline against Haswell's 4 ALUs and 3 AGUs or the FPU against Intel's double-wide FPU that can quad-issue 128-bit ops and dual-issue 256-bit AVX.
Except that those "cores" don't understand x86 instructions. They understand opcodes given to them by the instruction fetcher and decoder. On the other hand, a real core (even the POWER7 and POWER8 behemoths) has the hardware to take an instruction all the way to a result without leaving the core. So either AMD's definition is wrong or Intel, IBM, ARM Holdings, and Sun are wrong. Considering IBM produces chips that are nearly identical to Bulldozer with four integer clusters and they don't call that a quad-core, I'd say AMD is definitively wrong.

So now we're letting Microsoft define a core? Are you ever going to make up your mind or are you going to keep changing it to suit your argument?
All modern operating systems call the FX-8350 a quad-core with 8 logical processors, not just Windows. When *nix has to work on POWER7 and Bulldozer, are they really going to use AMD's marketing terms to describe what is actually there? I'd hope not.
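To check what Linux itself reports, a rough sketch like the one below counts logical processors and distinct core IDs from `/proc/cpuinfo`-style text. The `processor`, `physical id`, and `core id` fields are real x86 Linux fields, but `count_cpus` and the `SAMPLE` text are made up for illustration. The sample is not output from an FX chip, and this sketch makes no claim about what a given kernel version reports for Bulldozer.

```python
def count_cpus(cpuinfo_text: str) -> tuple:
    """Return (logical_processors, distinct_cores) from /proc/cpuinfo-style text."""
    logical = 0
    cores = set()
    # Each logical processor is a block of "key : value" lines; blocks are
    # separated by a blank line.
    for block in cpuinfo_text.strip().split("\n\n"):
        fields = {}
        for line in block.splitlines():
            if ":" in line:
                key, _, value = line.partition(":")
                fields[key.strip()] = value.strip()
        if "processor" not in fields:
            continue
        logical += 1
        # Fall back to the processor number if the kernel omits topology fields.
        core = (fields.get("physical id", "0"),
                fields.get("core id", fields["processor"]))
        cores.add(core)
    return logical, len(cores)


# Hypothetical sample: 4 logical processors sharing 2 core IDs.
SAMPLE = """\
processor   : 0
physical id : 0
core id     : 0

processor   : 1
physical id : 0
core id     : 0

processor   : 2
physical id : 0
core id     : 1

processor   : 3
physical id : 0
core id     : 1
"""

print(count_cpus(SAMPLE))  # (4, 2)
```

On a real Linux machine the same function can be fed `open("/proc/cpuinfo").read()`.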

Spoiler: Most multi-threaded workloads that aren't purely parallel in nature will never have 100% speed up indefinitely. More cores means more overhead.
Asynchronous multithreading is always capable of loading systems to 100% so long as it can spawn enough threads and those threads are sufficiently heavy. Overhead is only encountered at the start in the main thread and at the end of the worker thread (well under 1% of compute time).
 