Copy and Paste
Multi-Core Processors said:
AMD Family 15h processors have multiple compute units, each containing its own L2 cache and two
cores. The cores share their compute unit’s L2 cache. Each core incorporates the complete x86
instruction set logic and L1 data cache. Compute units share the processor’s L3 cache and
Northbridge.
Internal Instruction Formats said:
AMD Family 15h processors perform four types of primitive operations:
• Integer (arithmetic or logic)
• Floating-point (arithmetic)
• Load
• Store
The AMD64 instruction set is complex. Instructions have variable-length encoding and many
perform multiple primitive operations. AMD Family 15h processors do not execute these complex
instructions directly, but, instead, decode them internally into simpler fixed-length instructions called
macro-ops. Processor schedulers subsequently break down macro-ops into sequences of even simpler
instructions called micro-ops, each of which specifies a single primitive operation.
A macro-op is a fixed-length instruction that:
• Expresses, at most, one integer or floating-point operation and one load and/or store operation.
• Is the primary unit of work managed (that is, dispatched and retired) by the processor.
A micro-op is a fixed-length instruction that:
• Expresses one and only one of the primitive operations that the processor can perform (for example, a load).
Comparing | AMD64 instructions | Macro-ops | Micro-ops
Complexity | Complex | Average | Simple
Operations | A single instruction may specify one or more of each of the following: integer or floating-point, load, store | A single macro-op may specify, at most, one integer or floating-point operation and one of the following: load, store, or load and store to the same address | A single micro-op specifies only one of the following primitive operations: integer or floating-point, load, or store
Encoded length | Variable (instructions are different lengths) | Fixed (all macro-ops are the same length) | Fixed (all micro-ops are the same length)
Regularized instruction fields | No (field locations and definitions vary among instructions) | Yes (field locations and definitions are the same for all macro-ops) | Yes (field locations and definitions are the same for all micro-ops)
Instruction Type | Description
FastPath Single | Decodes directly into one macro-op in microprocessor hardware.
FastPath Double | Decodes directly into two macro-ops in microprocessor hardware.
Microcode | Decodes into one or more (usually three or more) macro-ops using the on-chip microcode-engine ROM (MROM).
AMD Instruction Set Enhancements said:
The AMD Family 15h processor has been enhanced with the following new instructions:
• XOP and AVX support—Extended Advanced Vector Extensions provide enhanced instruction
encodings and non-destructive operands with an extended set of 128-bit (XMM) and 256-bit
(YMM) media registers
• FMA instructions—support for floating-point fused multiply accumulate instructions
• Fractional extract instructions—extract the fractional portion of vector and scalar single-precision
and double-precision floating-point operands
• Support for new vector conditional move instructions
• VPERMILx instructions—allow selective permutation of packed double- and single-precision
floating point operands
• VPHADDx/VPSUBx—support for packed horizontal add and subtract instructions
• Support for packed multiply, add and accumulate instructions
• Support for new vector shift and rotate instructions
Floating-Point Improvements said:
AMD Family 15h processors add support for 128-bit floating-point execution units. As a result, the
throughput of both single-precision and double-precision floating-point SIMD vector operations has
improved by 2X over the previous generation of AMD processors.
Users may notice differences in program results when using the fused multiply-accumulate (FMAC) instructions. These differences do not mean the new results are less accurate than those produced by separate ADD and MUL instructions; they result from combining the multiply and the add into a single instruction. Executed as separate instructions, ADD and MUL each round their result to within ½ unit in the last place for the given precision, so the two-instruction sequence accumulates two rounding steps and its combined result is not accurate to within ½ unit in the last place.
By fusing the two operations into a "single" instruction, a fused multiply-accumulate (FMAC) applies only one final rounding and delivers a result accurate to within ½ unit in the last place. The difference between performing "separate" ADDs and MULs and doing a "single" FMAC is therefore the cause of differences in the least significant bit of program results.
Instruction Fetching Improvements said:
While previous AMD64 processors had a single 32-byte fetch window, AMD Family 15h processors
have two 32-byte fetch windows, from which four μops can be selected. These fetch windows, when
combined with the 128-bit floating-point execution unit, allow the processor to sustain a
fetch/dispatch/retire sequence of four instructions per cycle. Most instructions decode to a single μop,
but fastpath double instructions decode to two μops. ALU instructions can also issue four μops per
cycle, and microcoded instructions should be considered single-issue. Thus, there is not necessarily a one-to-one correspondence between the decode size of assembler instructions and the capacity of the 32-byte fetch window, and producing optimal assembler code requires considerable attention to the details of the underlying programming constraints. Assembly language programmers can now group more instructions together but must still concern
themselves with the possibility that an instruction may span a 32-byte fetch window. In this regard, it
is also advisable to align hot loops to 32 bytes instead of 16 bytes, especially in the case of loops for
large SIMD instructions.
Instruction Decode and Floating-Point Pipe Improvements said:
Several integer and floating-point instructions have improved latencies and decode types on
AMD Family 15h processors. Furthermore, the FPU pipes utilized by several floating-point
instructions have changed. These changes can influence instruction choice and scheduling for
compilers and hand-written assembly code.
Notable Performance Improvements said:
Several enhancements to the AMD64 architecture have resulted in significant performance
improvements in AMD Family 15h processors, including:
• Improved performance of shuffle instructions
• Improved data transfer between floating-point registers and general purpose registers
• Improved floating-point register to floating-point register moves
• Optimization of repeated move instructions
• More efficient PUSH/POP stack operations
• 1-Gbyte paging
Improved Bandwidth Decode Type for Shuffle Instructions said:
The floating-point logic in AMD Family 15h processors uses three separate execution positions or
pipes called FADD, FMUL and FSTORE. This is illustrated in Figure 1 on page 32 in Appendix A.
Current AMD Family 15h processors support two SIMD logical/shuffle units, one in the FMUL pipe
and another in the FADD pipe, while previous AMD64 processors have only one SIMD
logical/shuffle unit in the FMUL pipe. As a result, the SIMD shuffle instructions can be processed at
twice the previous bandwidth on AMD Family 15h processors. Furthermore, the PSHUFD and
SHUFPx shuffle instructions are now DirectPath instructions instead of VectorPath instructions on
AMD Family 15h processors and take advantage of the 128-bit floating point execution units. Hence,
these instructions get a further 2X boost in bandwidth, resulting in an overall improvement of 4X in
bandwidth compared to the previous generation of AMD processors.
It is more efficient to use SHUFPx and PSHUFD instructions rather than combinations of more than one MOVLHPS/MOVHLPS/UNPCKx/PUNPCKx instruction to perform shuffle operations.
Floating-Point Register-to-Register Moves said:
On previous AMD processors, floating-point register-to-register moves could only go through the
FADD and FMUL pipes. On AMD Family 15h processors, floating-point register-to-register moves
can also go through the FSTORE pipe, thereby improving overall throughput.
Large Page Support said:
AMD Family 15h processors now have better large-page support, incorporating new 1-GB paging along with improvements to 2-MB and 4-KB paging.
The L1 data TLB and L2 data TLB now support 1GB pages, a benefit to applications making large
data-set random accesses.
The L1 instruction TLB, L1 data TLB and L2 data TLB have increased the number of entries for
2MB pages. This improves the performance of software that uses 2MB code or data or code mixed
with data virtual pages.
The L1 data TLB has also increased the number of entries for 4KB pages.
Key Features said:
AMD Family 15h processors include many features designed to improve software performance. The
internal design, or microarchitecture, of these processors provides the following key features:
• Up to 8 Compute Units (CUs) with 2 cores per CU
• Integrated DDR3 memory controller (two on some models) with memory prefetcher
• 64-Kbyte L1 instruction cache per CU
• 16-Kbyte L1 data cache per core
• Unified L2 cache shared between cores of CU
• Shared L3 cache on chip (for supported platforms)
• 32-byte instruction fetch
• Instruction predecode and branch prediction during cache-line fills
• Decoupled prediction and instruction fetch pipelines
• Four-way instruction decoding
• Dynamic scheduling and speculative execution
• Two-way integer execution
• Two-way address generation
• Two-way 128-bit wide floating-point execution
• Legacy single-instruction multiple-data (SIMD) instruction extensions, as well as support for
XOP, FMA4, VPERMILx, and Advanced Vector Extensions (AVX).
• Superforwarding
• Prefetch into L2 or L1 data cache
• Deep out-of-order integer and floating-point execution
• HyperTransport™ technology
Microarchitecture of AMD Family 15h Processors said:
AMD Family 15h processors implement the AMD64 instruction set by means of macro-ops (the
primary units of work managed by the processor) and micro-ops (the primitive operations executed in
the processor's execution units). These are simple fixed-length operations designed to include direct
support for AMD64 instructions and adhere to the high-performance principles of fixed-length
encoding, regularized instruction fields, and a large register set. This enhanced microarchitecture
enables higher processor core performance and promotes straightforward extensibility for future
designs.
Superscalar Processor said:
The AMD Family 15h processor is an aggressive, out-of-order, superscalar processor. It can fetch, decode, and issue up to four instructions per cycle using decoupled fetch and branch prediction units and three independent instruction schedulers, consisting of two integer schedulers and one floating-point scheduler.
These processors can fetch 32 bytes per cycle and can scan two 16-byte instruction windows for up to
four micro-ops, which can be dispatched together in a single cycle. The actual number of micro-ops dispatched may be lower, depending on a number of factors, such as decode limits (for example, the number of loads and stores that can issue together) and whether instructions can be split across the 16-byte windows. The processors move integer instructions through the replicated integer clusters
and floating point instructions through the shared floating point unit (FPU).
L1 Instruction Cache said:
The out-of-order execution engine of AMD Family 15h processors contains a 64-Kbyte, 2-way set-associative L1 instruction cache. Each line in this cache is 64 bytes long. However, only 32 bytes are
fetched in every cycle. Functions associated with the L1 instruction cache are instruction loads,
instruction prefetching, instruction predecoding, and branch prediction. Requests that miss in the L1
instruction cache are fetched from the L2 cache or, subsequently, from the L3 cache or system
memory.
On misses, the L1 instruction cache generates fill requests to a naturally aligned 64-byte line
containing the instructions and the next sequential line of bytes (a prefetch). Because code typically
exhibits spatial locality, prefetching is an effective technique for avoiding decode stalls. Cache-line
replacement is based on a least-recently-used replacement algorithm.
Predecoding begins as the L1 instruction cache is filled. Predecode information is generated and
stored alongside the instruction cache. This information is used to help efficiently identify the
boundaries between variable length AMD64 instructions.
L1 Data Cache said:
The AMD Family 15h processor contains a 16-Kbyte, 4-way predicted L1 data cache with two 128-bit ports. This is a write-through cache that supports up to two 128-bit loads per cycle. It is divided into 16 banks, each 16 bytes wide. In addition, the L1 cache is protected from single-bit errors through
the use of parity. There is a hardware prefetcher that brings data into the L1 data cache to avoid
misses. The L1 data cache has a 4-cycle load-to-use latency. Only one load can be performed from a
given bank of the L1 cache in a single cycle.
L2 Cache said:
The AMD Family 15h processor has one shared L2 cache per compute unit. This full-speed on-die L2
cache is mostly inclusive relative to the L1 cache. The L2 is a write-through cache. Every time a store
is performed in a core, that address is written into both the L1 data cache of the core the store belongs
to and the L2 cache (which is shared between the two cores). The L2 cache has an 18- to 20-cycle load-to-use latency.
L3 Cache said:
The AMD Family 15h processor supports a maximum of 8MB of L3 cache per die, distributed among
four L3 sub-caches which can each be up to 2MB in size. The L3 cache is considered a non-inclusive
victim cache architecture optimized for multi-core AMD processors. Only L2 evictions cause
allocations into the L3 cache. Requests that hit in the L3 cache can either leave the data in the L3
cache—if it is likely the data is being accessed by multiple cores—or remove the data from the L3
cache (and place it solely in the L1 cache, creating space for other L2 victim/copy-backs), if it is likely the data is only being accessed by a single core. Furthermore, the L3 cache of the AMD Family
15h processor also features a number of micro-architectural improvements that enable higher
bandwidth.
Branch-Prediction said:
To predict and accelerate branches, AMD Family 15h processors employ a combination of next-address logic, a 2-level branch target buffer (BTB) for branch identification and direct target
prediction, a return address stack used for predicting return addresses, an indirect target predictor for
predicting indirect jump and call addresses, a hybrid branch predictor for predicting conditional
branch directions, and a fetch window tracking structure (BSR). Predicted-taken branches incur a 1-
cycle bubble in the branch prediction pipeline when they are predicted by the L1 BTB, and a 4-cycle
bubble in the case where they are predicted by the L2 BTB. The minimum branch misprediction
penalty is 20 cycles in the case of conditional and indirect branches and 15 cycles for unconditional
direct branches and returns.
The BTB is a tagged two-level set associative structure accessed using the fetch address of the current
window. Each BTB entry includes information about a branch and its target. The L1 BTB contains
128 sets of 4 ways for a total of 512 entries, while the L2 BTB has 1024 sets of 5 ways for a total of
5120 entries.
The hybrid branch predictor is used for predicting conditional branches. It consists of a global
predictor, a local predictor and a selector that tracks whether each branch is correlating better with the
global or local predictor. The selector and local predictor are indexed with a linear address hash. The
global predictor is accessed via a 2-bit address hash and a 12-bit global history.
AMD Family 15h processors implement a separate 512-entry indirect target array used to predict
indirect branches with multiple dynamic targets.
In addition, the processors implement a 24-entry return address stack to predict return addresses from
a near or far call. Most of the time, as calls are fetched, the next return address is pushed onto the
return stack and subsequent returns pop a predicted return address off the top of the stack. However,
mispredictions sometimes arise during speculative execution. Mechanisms exist to restore the stack to
a consistent state after these mispredictions.
Instruction Fetch and Decode said:
AMD Family 15h processors can theoretically fetch 32B of instructions per cycle and send these
instructions to the Decode Unit (DE) in 16B windows through the 16-entry (per-thread) Instruction
Byte Buffer (IBB). The Decode Unit can only scan two of these 16B windows in a given cycle for up
to four instructions. If four instructions partially or wholly exist in more than two of these windows,
only those instructions within the first and second windows will be decoded. Aligning to 16B
boundaries is important to achieve full decode performance.
Integer Execution said:
The integer execution unit for the AMD Family 15h processor consists of two components:
• the integer datapath
• the instruction scheduler and retirement control
These two components are responsible for all integer execution (including address generation) as well
as coordination of all instruction retirement and exception handling. The instruction scheduler and retirement control tracks instruction progress through dispatch, issue, execution, and eventual retirement. Scheduling for integer operations is fully data-dependency driven, proceeding out-of-order based on the validity of source operands and the availability of execution resources.
Since the Bulldozer core implements a floating point co-processor model of operation, most
scheduling and execution decisions of floating-point operations are handled by the floating point unit.
However, the scheduler does track the completion status of all outstanding operations and is the final
arbiter for exception processing and recovery.
Translation-Lookaside Buffer said:
A translation-lookaside buffer (TLB) holds the most-recently-used page mapping information. It
assists and accelerates the translation of virtual addresses to physical addresses.
The AMD Family 15h processors utilize a two-level TLB structure.
L1 Instruction TLB Specifications said:
The AMD Family 15h processor contains a fully-associative L1 instruction TLB with 48 4-Kbyte
page entries and 24 2-Mbyte or 1-Gbyte page entries. 4-Mbyte pages require two 2-Mbyte entries;
thus, the number of entries available for 4-Mbyte pages is one half the number of 2-Mbyte page
entries.
L1 Data TLB Specifications said:
The AMD Family 15h processor contains a fully-associative L1 data TLB with 32 entries for 4-
Kbyte, 2-Mbyte, and 1-Gbyte pages. 4-Mbyte pages require two 2-Mbyte entries; thus, the number of
entries available for 4-Mbyte pages is one half the number of 2-Mbyte page entries.
L2 Instruction TLB Specifications said:
The AMD Family 15h processor contains a 4-way set-associative L2 instruction TLB with 512 4-
Kbyte page entries.
L2 Data TLB Specifications said:
The AMD Family 15h processor contains an L2 data TLB and page walk cache (PWC) with 1024 4-
Kbyte, 2-Mbyte or 1-Gbyte page entries (8-way set-associative). 4-Mbyte pages require two 2-Mbyte
entries; thus, the number of entries available for 4-Mbyte pages is one half the number of 2-Mbyte
page entries.
Integer Unit said:
The integer unit consists of two components, the integer scheduler, which feeds the integer execution
pipes, and the integer execution unit, which carries out several types of operations discussed below.
The integer unit is duplicated for each thread pair.
Integer Scheduler said:
The scheduler can receive and schedule up to four micro-ops (μops) in a dispatch group per cycle.
The scheduler tracks operand availability and dependency information as part of its task of issuing
μops to be executed. It also assures that older μops which have been waiting for operands are
executed in a timely manner. The scheduler also manages register mapping and renaming.
*Might be an error: the four micro-ops are actually four macro-ops, because the next section says "Macro-ops are broken down into micro-ops in the schedulers."
Integer Execution Unit said:
There are four integer execution units per core: two units that handle all arithmetic, logical, and shift operations (EX), and two that handle address generation and simple ALU operations (AGLU). Together these form an integer cluster, and there are two such clusters per compute unit.
Macro-ops are broken down into micro-ops in the schedulers. Micro-ops are executed when their
operands are available, either from the register file or result buses. Micro-ops from a single operation
can execute out-of-order. In addition, a particular integer pipe can execute two micro-ops from
different macro-ops (one in the ALU and one in the AGLU) at the same time. The scheduler can receive up to four macro-ops per cycle. This group of macro-ops is
called a dispatch group.
EX0 contains a variable latency non-pipelined integer divider. EX1 contains a pipelined integer
multiplier. The AGLUs contain a simple ALU to execute arithmetic and logical operations and
generate effective addresses. A load and store unit (LSU) reads and writes data to and from the L1 data cache. The integer scheduler sends a completion status to the ICU when the outstanding micro-ops for a given macro-op are executed.
The LZCNT and POPCNT operations are handled in a pipelined unit attached to EX0.
Floating-Point Unit said:
The AMD Family 15h processor floating point unit (FPU) was designed to provide four times the raw
FADD and FMUL bandwidth as the original AMD Opteron and Athlon 64 processors. It achieves this
by means of two 128-bit fused multiply-accumulate (FMAC) units which are supported by a 128-bit
high-bandwidth load-store system. The FPU is a coprocessor model that is shared between the two
cores of one AMD Family 15h compute unit. As such it contains its own scheduler, register files and
renamers and does not share them with the integer units. This decoupling provides optimal
performance of both the integer units and the FPU. In addition to the two FMACs, the FPU also
contains two 128-bit integer units which perform arithmetic and logical operations on AVX, MMX
and SSE packed integer data.
A 128-bit integer multiply accumulate (IMAC) unit is incorporated into FPU pipe 0. The IMAC
performs integer fused multiply and accumulate, and similar arithmetic operations on AVX, MMX
and SSE data. A crossbar (XBAR) unit is integrated into FPU pipe 1 to execute the permute
instruction along with shifts, packs/unpacks and shuffles. There is an FPU load-store unit which
supports up to two 128-bit loads and one 128-bit store per cycle.
FPU Features Summary and Specifications:
• The FPU can receive up to four ops per cycle. These ops can only be from one thread, but the
thread may change every cycle. Likewise the FPU is four wide, capable of issue, execution and
completion of four ops each cycle. Once received by the FPU, ops from multiple threads can be
executed.
• Within the FPU, up to two loads per cycle can be accepted, possibly from different threads.
• There are four logical pipes: two FMAC and two packed integer. For example, two 128-bit
FMAC and two 128-bit integer ALU ops can be issued and executed per cycle.
• Two 128-bit FMAC units. Each FMAC supports four single precision or two double-precision
ops.
• FADDs and FMULs are implemented within the FMACs.
• x87 FADDs and FMULs are also handled by the FMAC.
• Each FMAC contains a variable latency divide/square root machine.
• Only one 256-bit operation can issue per cycle; however, an extra cycle can be incurred, as in the case of a FastPath Double, if both micro-ops cannot issue together.
*Might be another error: where it says micro-ops, they may actually mean macro-ops.
Load-Store Unit said:
The AMD Family 15h processor load-store (LS) unit handles data accesses. There are two LS units per compute unit, or one per core. The LS unit supports two 128-bit loads per cycle and one 128-bit store per cycle. There is a 24-entry store queue, which buffers stored data until it can be written to the data cache. The load queue has 40 entries and holds load operations until after the load has been
completed and delivered to the integer unit or the FPU. The LS unit is composed of two largely independent
pipelines enabling the execution of two memory operations per cycle.
Finally, the LS unit helps ensure that the architectural load and store ordering rules are preserved (a
requirement for AMD64 architecture compatibility).
Adding more with edits
I hate this Table tool!!!
Also, there might be errors in this, since some lines are copied and pasted (this has been out since April).
I am posting this here because this is a "Bulldozer" information thread.