AMD EPYC Architecture & Technical Overview

Core Design & Cache System

The new "Zen" architecture was designed from the ground up with datacenters in mind, balancing performance and power, and has four key points we will be exploring:
  • New high-performance core design
  • New high bandwidth, low latency cache system
  • Simultaneous Multithreading (SMT)
  • Energy efficient 14 nm FinFET process

There are similarities with the Zen microarchitecture used in Ryzen, but also notable differences. As a quick summary, this new core design can fetch and decode four x86 instructions per cycle. As on all modern CPU designs, the x86 instructions are broken down into sequences of micro-ops, which can be processed more efficiently. To accelerate this process, Zen includes an op cache capable of storing 2K instructions.

The left side of the diagram shows the integer machinery: four integer units backed by a 168-entry register file, capable of tracking 192 instructions in flight, plus two load/store units with support for up to 72 out-of-order loads and smart prefetch.

The right side of the diagram contains the floating point machinery: two 128-bit FMAC (fused multiply-accumulate) units for compute, organized as four pipes with two FADD and two FMUL operation units. Dual AES (Advanced Encryption Standard) units aid fast single-threaded performance or increase SMT-based throughput, and SHA (Secure Hash Algorithm) support accelerates the SHA-1 and SHA-256 algorithms.
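For context, these are the hash algorithms the SHA units accelerate. A minimal Python illustration follows; the code itself is ordinary software, and the hardware speedup is transparent to it when the library uses the SHA instruction extensions:

```python
import hashlib

# SHA-1 and SHA-256 are the two algorithms Zen's SHA hardware accelerates.
# The input string here is purely illustrative.
data = b"EPYC"
sha1 = hashlib.sha1(data).hexdigest()      # 160-bit digest, 40 hex chars
sha256 = hashlib.sha256(data).hexdigest()  # 256-bit digest, 64 hex chars
print(sha1)
print(sha256)
```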


The cache system here has four sub-components:
  • 4-way, 64 KB I-cache (Instruction cache)
  • 8-way, 32 KB D-cache (Data cache)
  • 8-way, 512 KB L2 cache
  • Large shared 2 MB/core L3 cache
As expected, there are separate I-cache and D-cache to avoid pipelining issues and to support parallel operation for more bandwidth. As seen above, there is a fast private L2 cache and a fast shared L3 cache. The large shared L3 cache acts as a victim cache for L2, with L2 tags duplicated in L3 for fast cache transfers and probe filtering. Because the cache design mixes shared and private levels, cache entries can be promoted and end up closer to a different core than the one currently needing that data; AMD uses shadow tags so that these entries can be found efficiently by any core. EPYC supports up to 50 victims/outstanding misses from L2 to L3 per core and a further 96 outstanding misses from L3 to system memory. This provides multiple large, high-bandwidth queues for datacenter operations.
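The victim-cache relationship described above can be sketched as follows. This is a toy model with simple LRU replacement standing in for the real policy; the class name and capacities are illustrative, not AMD's implementation:

```python
from collections import OrderedDict

class VictimL3:
    """Toy model of an exclusive victim L3: lines enter L3 only when
    evicted ("victimized") from L2, never on the initial fill from memory."""
    def __init__(self, l2_capacity, l3_capacity):
        self.l2 = OrderedDict()  # address -> line, oldest first (LRU order)
        self.l3 = OrderedDict()
        self.l2_capacity = l2_capacity
        self.l3_capacity = l3_capacity

    def access(self, addr):
        if addr in self.l2:          # L2 hit: refresh LRU position
            self.l2.move_to_end(addr)
            return "L2 hit"
        if addr in self.l3:          # L3 hit: promote the line back into L2
            self.l3.pop(addr)
            self._fill_l2(addr)
            return "L3 hit"
        self._fill_l2(addr)          # miss: fill L2 directly, bypassing L3
        return "miss"

    def _fill_l2(self, addr):
        if len(self.l2) >= self.l2_capacity:
            victim, _ = self.l2.popitem(last=False)  # evict LRU line from L2...
            self.l3[victim] = None                   # ...into L3 (the "victim")
            if len(self.l3) > self.l3_capacity:
                self.l3.popitem(last=False)          # L3 eviction goes to memory
        self.l2[addr] = None

cache = VictimL3(l2_capacity=2, l3_capacity=4)
print(cache.access(0x100))  # miss
print(cache.access(0x140))  # miss
print(cache.access(0x180))  # miss; 0x100 is victimized into L3
print(cache.access(0x100))  # L3 hit: promoted back into L2
```

The key property the sketch shows is exclusivity: a line lives in L2 or L3, not both, which is what makes the large L3 effective extra capacity rather than a copy.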

The L3 cache is shared as part of a CPU complex (CCX) comprising four cores connected to a shared L3 cache totaling 8 MB per CCX. This L3 cache is built from four slices, is 16-way set associative, and is mostly exclusive of L2. The arrangement works such that each core can access every slice with the same average latency.
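The geometry of one slice falls out of the stated figures. Assuming a typical 64-byte x86 cache line, a 2 MB, 16-way slice has 2048 sets; the address-to-slice hash below is a simple modulo split for illustration only, since AMD does not disclose the real mapping:

```python
LINE_SIZE = 64                 # bytes per cache line (typical for x86)
WAYS = 16                      # 16-way set associative, per AMD's figures
SLICE_SIZE = 2 * 1024 * 1024   # 2 MB per slice
SLICES = 4                     # four slices per CCX (8 MB total)

SETS_PER_SLICE = SLICE_SIZE // (LINE_SIZE * WAYS)  # 2 MB / (64 B * 16) = 2048

def locate(addr):
    """Hypothetical decomposition of a physical address into (slice, set).
    Illustrative only; the actual hash function is undisclosed."""
    line = addr // LINE_SIZE
    slice_id = line % SLICES
    set_id = (line // SLICES) % SETS_PER_SLICE
    return slice_id, set_id

print(SETS_PER_SLICE)       # 2048
print(locate(0x12345678))
```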


As an example, AMD has provided details on how its neural-net-based branch prediction works with this new core and cache design. We see here the prediction of two branches per cycle; a large L1/L2 branch target buffer, a 32-entry return stack (large, which is good for VMs), and a 512-entry ITA (indirect target array) are all capable of predicting multiple targets from the same branch.
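The return stack mentioned above can be sketched as a small model: CALL pushes the return address, RET pops the prediction. The depth of 32 matches AMD's figure; the overwrite-on-overflow behavior is an assumption for illustration:

```python
class ReturnStack:
    """Toy model of a return-address stack predictor. The predictor keeps
    the most recent return addresses so a RET's target is known before the
    actual address is fetched from memory. Illustrative only."""
    def __init__(self, depth=32):
        self.depth = depth
        self.stack = []

    def on_call(self, return_addr):
        if len(self.stack) == self.depth:
            self.stack.pop(0)      # deepest entry is lost on overflow (assumed)
        self.stack.append(return_addr)

    def on_ret(self):
        # Predicted target; None models an empty stack (no prediction).
        return self.stack.pop() if self.stack else None

rs = ReturnStack(depth=32)
rs.on_call(0x400100)       # outer call pushes its return address
rs.on_call(0x400200)       # nested call
print(hex(rs.on_ret()))    # 0x400200 -- innermost return predicted first
print(hex(rs.on_ret()))    # 0x400100
```

A deep return stack matters for virtualization because VM exits and deeply nested hypervisor/guest call chains would quickly exhaust a shallow one.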


Here's a quick comparison of the core design and cache systems for AMD's "Bulldozer"- and "Zen"-based microarchitectures and Intel's Broadwell-E/EP families. Note that Intel has yet to unveil its 2017 Skylake "Purley" platform, so what we are seeing from AMD today will have to compete against that as well.

Simultaneous Multithreading (SMT) and Instructions


Simultaneous Multithreading (SMT) was introduced with Ryzen and carries on with the EPYC platform as well. The structures seen in the SMT overview are all valid in single-threaded mode too, with front-end queues arbitrated round-robin with a priority override. SMT, as with Ryzen, allows for increased workload throughput, which will be especially handy for datacenter applications. As a color code, the light blue structures in the overview above are competitively shared, the darker blue structures are competitively shared and SMT tagged, the green structures are competitively shared with algorithmic priority, and the white structures are statically partitioned.
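The round-robin arbitration described above can be sketched as follows. This is purely an illustration of the policy, not AMD's arbitration logic: the two hardware threads take turns each cycle, and a thread with an empty queue is skipped, so a single active thread gets every slot (matching single-threaded mode):

```python
def schedule(thread_queues):
    """Round-robin pick across per-thread micro-op queues: alternate
    between hardware threads, skipping any thread whose queue is empty.
    Returns the issue order as (thread_id, op) pairs. Sketch only."""
    issued = []
    tid = 0
    n = len(thread_queues)
    while any(thread_queues):
        for _ in range(n):                 # find the next non-empty queue
            if thread_queues[tid]:
                issued.append((tid, thread_queues[tid].pop(0)))
                tid = (tid + 1) % n
                break
            tid = (tid + 1) % n
    return issued

# Thread 0 has three micro-ops, thread 1 has two: slots interleave until
# thread 1 drains, then thread 0 gets every remaining slot.
print(schedule([["a", "b", "c"], ["x", "y"]]))
# [(0, 'a'), (1, 'x'), (0, 'b'), (1, 'y'), (0, 'c')]
```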


Another area where "Zen" improves over "Bulldozer" is the instruction set, including some instructions exclusive to AMD. CLZERO clears a cache line, which is useful when you have to fill a block of memory with zeros and don't want those zeros taking up precious cache space. PTE coalescing optimizes the TLB: adjacent 4K memory pages can be grouped to occupy a single 32K page table entry.
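The PTE coalescing idea can be sketched in a few lines. The grouping of eight 4K pages follows from the 32K figure above (8 × 4 KB = 32 KB); the data layout and function below are illustrative only, since the real coalescing happens in hardware inside the TLB:

```python
PAGE = 4 * 1024   # 4 KB base page
GROUP = 8         # 8 contiguous 4K pages = one 32K entry (derived from 32K/4K)

def coalesce(mappings):
    """Toy model of PTE coalescing: if 8 virtually-contiguous, 32K-aligned
    4K pages map to 8 physically-contiguous frames, represent them with a
    single 32K entry. `mappings` maps virtual page number -> physical frame
    number. Illustrative sketch, not the hardware algorithm."""
    entries = []
    vpns = sorted(mappings)
    i = 0
    while i < len(vpns):
        v = vpns[i]
        aligned = v % GROUP == 0
        if aligned and all(v + k in mappings for k in range(GROUP)) and \
           all(mappings[v + k] == mappings[v] + k for k in range(GROUP)):
            entries.append(("32K", v, mappings[v]))   # one entry covers 8 pages
            i += GROUP
        else:
            entries.append(("4K", v, mappings[v]))    # falls back to a 4K entry
            i += 1
    return entries

# Eight contiguous, aligned pages coalesce; an isolated page does not.
m = {vpn: 100 + vpn for vpn in range(8)}
m[20] = 999
print(coalesce(m))
# [('32K', 0, 100), ('4K', 20, 999)]
```

The payoff is TLB reach: one entry now covers eight pages, so fewer TLB misses for workloads with contiguous mappings.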

We also see key improvements in virtualization. For example, the addition of the virtual APIC avoids passing execution to the hypervisor for interrupt handling, which carries a slight performance penalty. According to AMD, these improvements result in up to 50% lower latency overall in VM workloads relative to "Bulldozer" (the Piledriver microarchitecture).