AMD EPYC Architecture & Technical Overview

Core Design & Cache System

The new "Zen" architecture was designed from the ground up with datacenters in mind, balancing performance and power, and has four key points we will be exploring:
  • New high-performance core design
  • New high bandwidth, low latency cache system
  • Simultaneous Multithreading (SMT)
  • Energy efficient 14 nm FinFET process

There are similarities with the Zen microarchitecture used in Ryzen, but also notable differences. As a quick summary, this new core design can fetch and decode four x86 instructions per cycle. As on all modern CPU designs, the x86 instructions are broken down into sequences of micro-ops, which can be processed more efficiently. To accelerate this process, Zen includes an op cache capable of storing 2K instructions.

The left side of the diagram shows the integer machinery: four integer units backed by a 168-entry register file, capable of tracking 192 instructions in flight, plus two load/store units with support for up to 72 out-of-order loads and smart prefetch.

The right side of the diagram contains the floating point machinery: two 128-bit FMAC (fused multiply-accumulate) units for compute, organized as four pipes with two FADD and two FMUL operation units. Dual AES (Advanced Encryption Standard) units aid fast single-threaded performance or increase SMT-based throughput, and SHA (Secure Hash Algorithm) support accelerates the SHA-1 and SHA-256 algorithms.
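For context, these are the hash algorithms the SHA units accelerate. A minimal Python illustration follows; the code itself is ordinary software, and the hardware speedup is transparent to it when the library uses the SHA instruction extensions:

```python
import hashlib

# SHA-1 and SHA-256 are the two algorithms Zen's SHA hardware accelerates.
# The input string here is purely illustrative.
data = b"EPYC"
sha1 = hashlib.sha1(data).hexdigest()      # 160-bit digest, 40 hex chars
sha256 = hashlib.sha256(data).hexdigest()  # 256-bit digest, 64 hex chars
print(sha1)
print(sha256)
```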


The cache system here has four sub-components:
  • 4-way, 64 KB I-cache (Instruction cache)
  • 8-way, 32 KB D-cache (Data cache)
  • 8-way, 512 KB L2 cache
  • Large shared 2 MB/core L3 cache
As expected, there are separate I-cache and D-cache to avoid pipelining issues and to support parallel operation for more bandwidth. As seen above, there is a fast private L2 cache and a fast shared L3 cache. The large shared L3 cache acts as a victim cache for L2, with L2 tags duplicated in L3 for fast cache transfers and probe filtering. Because the cache design mixes shared and private levels, cache entries can be promoted and end up closer to a different core than the one currently needing that data; AMD uses shadow tags so that these entries can be found efficiently by any core. EPYC supports up to 50 victims/outstanding misses from L2 to L3 per core and a further 96 outstanding misses from L3 to system memory. This provides multiple large, high-bandwidth queues for datacenter operations.
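The victim-cache relationship described above can be sketched as follows. This is a toy model with simple LRU replacement standing in for the real policy; the class name and capacities are illustrative, not AMD's implementation:

```python
from collections import OrderedDict

class VictimL3:
    """Toy model of an exclusive victim L3: lines enter L3 only when
    evicted ("victimized") from L2, never on the initial fill from memory."""
    def __init__(self, l2_capacity, l3_capacity):
        self.l2 = OrderedDict()  # address -> line, oldest first (LRU order)
        self.l3 = OrderedDict()
        self.l2_capacity = l2_capacity
        self.l3_capacity = l3_capacity

    def access(self, addr):
        if addr in self.l2:          # L2 hit: refresh LRU position
            self.l2.move_to_end(addr)
            return "L2 hit"
        if addr in self.l3:          # L3 hit: promote the line back into L2
            self.l3.pop(addr)
            self._fill_l2(addr)
            return "L3 hit"
        self._fill_l2(addr)          # miss: fill L2 directly, bypassing L3
        return "miss"

    def _fill_l2(self, addr):
        if len(self.l2) >= self.l2_capacity:
            victim, _ = self.l2.popitem(last=False)  # evict LRU line from L2...
            self.l3[victim] = None                   # ...into L3 (the "victim")
            if len(self.l3) > self.l3_capacity:
                self.l3.popitem(last=False)          # L3 eviction goes to memory
        self.l2[addr] = None

cache = VictimL3(l2_capacity=2, l3_capacity=4)
print(cache.access(0x100))  # miss
print(cache.access(0x140))  # miss
print(cache.access(0x180))  # miss; 0x100 is victimized into L3
print(cache.access(0x100))  # L3 hit: promoted back into L2
```

The key property the sketch shows is exclusivity: a line lives in L2 or L3, not both, which is what makes the large L3 effective extra capacity rather than a copy.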

The L3 cache is shared as part of a CPU complex (CCX) comprising four cores connected to a shared L3 cache totaling 8 MB per CCX. This L3 cache is built from four slices, is 16-way set associative, and is mostly exclusive of L2. The arrangement works such that each core can access every slice with the same average latency.
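The geometry of one slice falls out of the stated figures. Assuming a typical 64-byte x86 cache line, a 2 MB, 16-way slice has 2048 sets; the address-to-slice hash below is a simple modulo split for illustration only, since AMD does not disclose the real mapping:

```python
LINE_SIZE = 64                 # bytes per cache line (typical for x86)
WAYS = 16                      # 16-way set associative, per AMD's figures
SLICE_SIZE = 2 * 1024 * 1024   # 2 MB per slice
SLICES = 4                     # four slices per CCX (8 MB total)

SETS_PER_SLICE = SLICE_SIZE // (LINE_SIZE * WAYS)  # 2 MB / (64 B * 16) = 2048

def locate(addr):
    """Hypothetical decomposition of a physical address into (slice, set).
    Illustrative only; the actual hash function is undisclosed."""
    line = addr // LINE_SIZE
    slice_id = line % SLICES
    set_id = (line // SLICES) % SETS_PER_SLICE
    return slice_id, set_id

print(SETS_PER_SLICE)       # 2048
print(locate(0x12345678))
```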


As an example, AMD has provided details on how its neural-net-based branch prediction works with this new core and cache design. We see here the prediction of two branches per cycle; a large L1/L2 branch target buffer, a 32-entry return stack (large, which is good for VMs), and a 512-entry ITA (indirect target array) are all capable of predicting multiple targets from the same branch.
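The return stack mentioned above can be sketched as a small model: CALL pushes the return address, RET pops the prediction. The depth of 32 matches AMD's figure; the overwrite-on-overflow behavior is an assumption for illustration:

```python
class ReturnStack:
    """Toy model of a return-address stack predictor. The predictor keeps
    the most recent return addresses so a RET's target is known before the
    actual address is fetched from memory. Illustrative only."""
    def __init__(self, depth=32):
        self.depth = depth
        self.stack = []

    def on_call(self, return_addr):
        if len(self.stack) == self.depth:
            self.stack.pop(0)      # deepest entry is lost on overflow (assumed)
        self.stack.append(return_addr)

    def on_ret(self):
        # Predicted target; None models an empty stack (no prediction).
        return self.stack.pop() if self.stack else None

rs = ReturnStack(depth=32)
rs.on_call(0x400100)       # outer call pushes its return address
rs.on_call(0x400200)       # nested call
print(hex(rs.on_ret()))    # 0x400200 -- innermost return predicted first
print(hex(rs.on_ret()))    # 0x400100
```

A deep return stack matters for virtualization because VM exits and deeply nested hypervisor/guest call chains would quickly exhaust a shallow one.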


Here's a quick comparison of the core design and cache systems for AMD's "Bulldozer"- and "Zen"-based microarchitectures and Intel's Broadwell-E/EP families. Note that Intel has yet to unveil its 2017 Skylake "Purley" platform, so what we are seeing from AMD today will have to compete against that as well.

Simultaneous Multithreading (SMT) and Instructions


Simultaneous Multithreading (SMT) was introduced with Ryzen and carries on with the EPYC platform as well. The structures seen in the SMT overview are all valid in single-threaded mode too, with front-end queues arbitrated round-robin with a priority override. SMT, as with Ryzen, allows for increased workload throughput, which will be especially handy for datacenter applications. As a color code, the light blue structures in the overview above are competitively shared, the darker blue structures are competitively shared and SMT tagged, the green structures are competitively shared with algorithmic priority, and the white structures are statically partitioned.
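The round-robin arbitration described above can be sketched as follows. This is purely an illustration of the policy, not AMD's arbitration logic: the two hardware threads take turns each cycle, and a thread with an empty queue is skipped, so a single active thread gets every slot (matching single-threaded mode):

```python
def schedule(thread_queues):
    """Round-robin pick across per-thread micro-op queues: alternate
    between hardware threads, skipping any thread whose queue is empty.
    Returns the issue order as (thread_id, op) pairs. Sketch only."""
    issued = []
    tid = 0
    n = len(thread_queues)
    while any(thread_queues):
        for _ in range(n):                 # find the next non-empty queue
            if thread_queues[tid]:
                issued.append((tid, thread_queues[tid].pop(0)))
                tid = (tid + 1) % n
                break
            tid = (tid + 1) % n
    return issued

# Thread 0 has three micro-ops, thread 1 has two: slots interleave until
# thread 1 drains, then thread 0 gets every remaining slot.
print(schedule([["a", "b", "c"], ["x", "y"]]))
# [(0, 'a'), (1, 'x'), (0, 'b'), (1, 'y'), (0, 'c')]
```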


Another area where "Zen" improves over "Bulldozer" is the instruction set, including some instructions exclusive to AMD. CLZERO clears a cache line, which is useful when you have to fill a block of memory with zeros and don't want those zeros taking up precious cache space. PTE coalescing optimizes the TLB: adjacent 4K memory pages can be grouped to occupy a single 32K page table entry.
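The PTE coalescing idea can be sketched in a few lines. The grouping of eight 4K pages follows from the 32K figure above (8 × 4 KB = 32 KB); the data layout and function below are illustrative only, since the real coalescing happens in hardware inside the TLB:

```python
PAGE = 4 * 1024   # 4 KB base page
GROUP = 8         # 8 contiguous 4K pages = one 32K entry (derived from 32K/4K)

def coalesce(mappings):
    """Toy model of PTE coalescing: if 8 virtually-contiguous, 32K-aligned
    4K pages map to 8 physically-contiguous frames, represent them with a
    single 32K entry. `mappings` maps virtual page number -> physical frame
    number. Illustrative sketch, not the hardware algorithm."""
    entries = []
    vpns = sorted(mappings)
    i = 0
    while i < len(vpns):
        v = vpns[i]
        aligned = v % GROUP == 0
        if aligned and all(v + k in mappings for k in range(GROUP)) and \
           all(mappings[v + k] == mappings[v] + k for k in range(GROUP)):
            entries.append(("32K", v, mappings[v]))   # one entry covers 8 pages
            i += GROUP
        else:
            entries.append(("4K", v, mappings[v]))    # falls back to a 4K entry
            i += 1
    return entries

# Eight contiguous, aligned pages coalesce; an isolated page does not.
m = {vpn: 100 + vpn for vpn in range(8)}
m[20] = 999
print(coalesce(m))
# [('32K', 0, 100), ('4K', 20, 999)]
```

The payoff is TLB reach: one entry now covers eight pages, so fewer TLB misses for workloads with contiguous mappings.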

We also see key improvements in virtualization. For example, the addition of the virtual APIC avoids passing execution to the hypervisor for interrupt handling, which carries a slight performance penalty. According to AMD, these improvements result in up to 50% lower latency overall in VM workloads relative to "Bulldozer" (the Piledriver microarchitecture).