The 8 and 9 series FX have 4 cores (modules, as they were called), so they are octa cores in an ALU sense, but quad cores in an FPU sense.
This sentence is self-contradictory. Here are the premises it contains:
1) The 8 and 9 series FX have "4 cores (modules, as they were called)"
2) they are "octa cores in an ALU sense"
3) they are "quad cores in an FPU sense"
Firstly, a module is not the same thing as a core. Secondly, as Vya Domus said (and as I was about to), a CPU cannot be an 8-core and a 4-core at the same time.
bridgmanAMD said:
there is a single FPU per module but it has two independent 128-bit FMAC pipes to allow executing two instructions (one from each thread) in parallel. So arguably each module has two FPUs when running 128-bit instructions and one FPU when running AVX-256 instructions (or MMX instructions).
AMD_James said:
The FPU is able to process two 128-bit FP threads simultaneously. It combines into a single unit to process 256-bit operations. Either core in the module can dispatch instructions to the FP unit, be it 2 x 128 or 1 x 256 (and even 4x 64). Contention will occur when both cores in the module need to process 256-bit FP at the same time.
deleted said:
the "only has 4 fpus" situation only really applies on 256 bit operations. In 128 bit and lower operations it is a split unit and can do two separate instructions at the same time. Modules also have a single L2 cache, and that is part of AMD betting on CMT instead of SMT
Kromaatikse said:
It's not hyperthreading. That's Intel's trademarked brand name for SMT (Simultaneous Multithreading). It's not generic SMT, either. The identifying characteristic of SMT is that all the core's resources are shared between two (or more) threads. That's not the case in Bulldozer's design.
There are two distinct cores, but they share significant resources in pairs in an attempt to improve efficiency. The resources shared in the FX-8350 (in which the cores are "Piledrivers") are: L1 I-cache, prefetch, decode, and FPU. The resources shared in the 28nm APUs (in which the cores are "Steamrollers" or "Excavators") are: L1 I-cache, prefetch, and FPU. Hence these cores have independent decoders for each core.
Joel Hruska said:
Since Bulldozer shipped, AMD has made modest improvements to the CPU’s overall efficiency and performance. Kaveri cut the penalty for multi-threading in half, from ~20% to 10% compared with typical core scaling. If AMD hadn’t been forced to lower clock speeds to compensate for its 28nm manufacturing process, Kaveri would’ve outperformed Richland across the board. Bulldozer is absolutely capable of executing eight threads simultaneously, and executing eight threads on an eight-core FX-8150 is faster than running that same chip in a four-thread, four-module mode. Bulldozer can decode 16 instructions per clock (not eight) and it can keep far more than eight instructions in flight simultaneously.
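Hruska's scaling numbers are straightforward to sanity-check. The pthreads sketch below (my own rough illustration, not from his article) times the same FP-heavy workload on four threads twice: once spread one-per-module and once packed two-per-module; the gap between the two runs is the module-sharing penalty he's quantifying. The CPU numbering is an assumption (that Linux exposes the two cores of module n as logical CPUs 2n and 2n+1), so check the /sys topology first; build with something like gcc -O2 -pthread.

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>
#include <time.h>

#define NTHREADS 4
#define ITERS 200000000UL

/* FP-heavy loop with four independent accumulator chains, so each thread keeps
   its module's floating-point unit busy rather than stalling on one long chain. */
static void *spin_fp(void *arg) {
    double a0 = 1.0, a1 = 1.1, a2 = 1.2, a3 = 1.3;
    for (unsigned long i = 0; i < ITERS; i++) {
        a0 = a0 * 1.0000001 + 0.0000001;
        a1 = a1 * 1.0000002 + 0.0000002;
        a2 = a2 * 1.0000003 + 0.0000003;
        a3 = a3 * 1.0000004 + 0.0000004;
    }
    *(double *)arg = a0 + a1 + a2 + a3;   /* keep the work from being optimized away */
    return NULL;
}

static double run_pinned(const int cpus[NTHREADS]) {
    pthread_t tid[NTHREADS];
    double sink[NTHREADS];
    struct timespec t0, t1;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < NTHREADS; i++) {
        cpu_set_t set;
        pthread_attr_t attr;
        CPU_ZERO(&set);
        CPU_SET(cpus[i], &set);
        pthread_attr_init(&attr);
        pthread_attr_setaffinity_np(&attr, sizeof(set), &set);
        sink[i] = 1.0;
        pthread_create(&tid[i], &attr, spin_fp, &sink[i]);
        pthread_attr_destroy(&attr);
    }
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(tid[i], NULL);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
}

int main(void) {
    /* Assumed FX topology under Linux: logical CPUs 2n and 2n+1 are the two cores
       of module n. Verify with /sys/devices/system/cpu/cpu*/topology before trusting this. */
    const int one_per_module[NTHREADS] = {0, 2, 4, 6};  /* 4 threads spread over 4 modules */
    const int packed_modules[NTHREADS] = {0, 1, 2, 3};  /* 4 threads packed into 2 modules */

    printf("one thread per module : %.2f s\n", run_pinned(one_per_module));
    printf("two threads per module: %.2f s\n", run_pinned(packed_modules));
    return 0;
}

On a Piledriver FX the packed run should come out measurably slower, and the size of that gap is roughly the penalty Kaveri later cut in half.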
Hey, not my fault AMD decided to create a Frankenstein's monster with the FX. Check the performance for yourself.
If you want a much more Frankensteinian design, take a look at Broadwell-C: an integrated GPU, a separate eDRAM chip serving as an L4 cache, and SMT.
Bulldozer/Piledriver are much simpler designs. Elegance and simplicity don't speak to performance. The Frankensteinian L4 on Broadwell-C certainly didn't hurt its performance, nor did the inclusion of an IGP or SMT. Broadwell-C was quite impressive considering its low clocks and power consumption. In fact, Peter Bright of Ars complained that Intel was robbing people of performance by refusing to put that L4 on Skylake.
Bulldozer/Piledriver had poor IPC for two reasons: they were forced to last far longer in the market than their Intel counterparts, and they carried design inefficiencies that strong further development might have overcome. One of those was AMD's bet on a very deep pipeline; deep-and-narrow had already failed with the P4, yet AMD tried an adjusted version of that strategy again. Had AMD poured the kind of resources into improving its CMT design that Intel poured into creating Sandy Bridge, performance would have been better: better cache performance (Piledriver's L3, as far as I recall, isn't much faster than fast DDR3), better instruction caching and prediction, possibly a shallower and wider core, possibly SMT added to supplement the CMT, and possibly better Windows scheduling and optimization. And had Intel shipped a weaker CPU instead of Sandy Bridge, perhaps AMD would have built a proper successor to Piledriver, which could have been considerably improved.
How many years have we seen people brag in forums about still using their Sandy Bridge chip because it was such a great value? There are three reasons for that. First, AMD didn't compete well. Second, Intel slowed down its IPC gains. Third, Intel stopped soldering the heat spreader. How much code optimization has to do with performance, though, remains an open question: Deserts of Kharak showed Piledriver hanging in there quite nicely despite being well out of date, while many other games showed terrible FX performance.
What is the special sauce in Deserts of Kharak that enabled the FX to hang in there with Haswell? We never found out, because it was basically an outlier. It would be interesting for someone to interview the developers to discover why their code ran so well on FX. The most obvious guess is that, unlike other games of the time, it did a much better job of spreading work across the 8 cores; other factors are most likely also in play, such as not being bottlenecked by DRAM speed.
Personally, I think it's much more interesting to know what is possible with CMT than it is to complain about AMD's failure with it.