- Nov 6, 2016
- 681 (0.42/day)
- NH, USA
|Processor||Ryzen 7 2700X|
|Motherboard||Asus ROG Strix X470-F Gaming|
|Cooling||Enermax Liqmax Iii 360mm AIO|
|Memory||G.Skill Trident Z RGB 32GB (8GBx4) 3200Mhz CL 14|
|Video Card(s)||Sapphire RX 5700XT Nitro+|
|Storage||Hp EX950 2TB NVMe M.2, HP EX950 1TB NVMe M.2, Samsung 860 EVO 2TB|
|Display(s)||LG 34BK95U-W 34" 5120 x 2160|
|Case||Lian Li PC-O11 Dynamic (White)|
|Power Supply||BeQuiet Straight Power 11 850w Gold Rated PSU|
|Mouse||Glorious Model O (Matte White)|
|Keyboard||Royal Kludge RK71|
Actually, high frequencies have made there way back into the server market due to expensive per core licensing for softwareThe server market is the obvious hit for Intel's 10nm technology.
It seems like Intel's main problem with 10nm is the low clock rates, which leads to worse performance compared to their 14nm+++++++++++ process clocked at well over 5GHz (boost). Servers don't want to boost so high. Low and slow benefits servers, as these computers are designed for maximum compute in a given datacenter (Datacenters are limited by two things: either power, or cooling... and cooling is itself limited by power).
The cache game is very interesting now. AMD offers more L3 cache than anyone else, but I've discussed the downsides of AMD's "split L3 cache" before. In short: AMD's L3 cache is only useful to a "CCX" (for Zen3: a cluster of 8 processors / 16 threads max), while L3 cache from traditional processors (Various ARM chips or Intel) are unified. Consider if 1MB from address 0x80008000 were being read-shared on a 64-core AMD chip with its 8x "split cache", there would be 8-copies of that one address (!!!), leading to 8MBs of read-data across AMD's 8x split cache (of size 256MBs): 3% of its cache. In contrast, Intel would only have 1MB in its unified 60MB L3 cache, or 1.6% of Intel's cache.
As such, large singular datasets (like Databases) will likely perform better on Intel's "smaller" 60MB L3 cache, despite AMD having 256MBs of L3 cache available. On the other hand, any workload with "split data", such as a 8-core / 16thread cloud-servers, will benefit grossly from AMD's architecture. So Azure / AWS / Google Compute are likely leaning towards AMD, while database engineers at Oracle / IBM are thinking Intel still. Or from a LAMP stack perspective: your PHP / Apache processes are most efficient on AMD, but your MySQL processes might be more efficient on Intel.
In the past, I've talked about how AMD's L3 cache has latency penalties. Lets consider that 1MB region again. If one core were to change the data, it'd have to send a MESI message to Exclusive control of the data it wishes to change. The 63 other processors then need to communicate that their L1 and L2 caches remain unchanged (Total Store Ordering: all writes must be written in order on x86 architecture). Then the core changes the data, and broadcasts the new data to all 63 other processors.
However, as latency intensive as that operation is: there's a huge benefit to AMD's architecture I never thought of till now. AMD's "chiplet + I/O" architecture, starting with Zen2 (and later), means that this kind of MESI-like negotiation of data takes place over IF without affecting DDR4 bandwidth. (Zen1 on the other hand: because DDR4 was "behind" the IFOPs, the necessary MESI messages would reduce DDR4 bandwidth). So core#63, if it was NOT reading/writing to the changed region of memory, would have full memory bandwidth to DDR4 while the cores #0 through #62 were playing with MESI messages.
Of course, Intel's unified L3 and "flatter" NUMA profile (40-cores per NUMA zone instead of AMD's 8-cores per NUMA). But it should be noted that AMD's 4x NUMA (64-core single-socket EPYC) is "more efficient" than Intel's 4x NUMA (160-core quad-socket Xeon Gold) MESI + DDR4 bandwidth perspective. Each of Intel's chips has a memory controller attached (similar to Zen1), instead of a networked I/O chip that can more efficiently allocate bandwidth (like Zen2+). That is to say: Intel's UPI has to share both DDR4 bandwidth AND MESI messages. While AMD's on-package CCXes probably don't suffer much of a DDR4 bandwidth penalty.