News Posts matching #Chiplet

Return to Keyword Browsing

AMD Patents Chiplet-based GPU Design With Active Cache Bridge

AMD on April 1st published a new patent application that seems to show where its chiplet GPU design is heading. Before you say it: it's a patent application, so there's no possibility of an April Fool's joke with this sort of move. The new patent builds on AMD's previous one, which only featured a passive bridge connecting the different GPU chiplets and their processing resources. If you want to read a slightly deeper dive on what chiplets are and why they are important for the future of graphics (and computing in general), look to this article here on TPU.

The new design treats the active bridge connecting the chiplets as a last-level cache - think of it as L3, a unifying highway of data that is readily exposed to all the chiplets (in this patent, a three-chiplet design). It's essentially AMD's RDNA 2 Infinity Cache, though it's not only used as a cache here (and to good effect, if the Infinity Cache design on RDNA 2 and its performance uplift is anything to go by); it also serves as an active interconnect between the GPU chiplets that allows for the exchange and synchronization of information, whenever and however required. This also allows the registers and cache to be exposed to developers as a unified block, sparing them from having to program for a system with a three-way cache design. There are of course also yield benefits to be had here, as there are with AMD's Zen chiplet designs, along with the ability to scale up performance without monolithic designs that are heavy in power requirements. The integrated, active cache bridge would also help reduce latency and maintain coherency across the chiplets.
[Images: AMD Chiplet Design Patent with Active Cache Hierarchy]

AMD Files Patent for Chiplet Machine Learning Accelerator to be Paired With GPU, Cache Chiplets

AMD has filed a patent describing an MLA (Machine Learning Accelerator) chiplet design that can be paired with a GPU unit (such as RDNA 3) and a cache unit (likely a GPU-excised version of the Infinity Cache design AMD debuted with RDNA 2) to create what AMD is calling an "APD" (Accelerated Processing Device). The design would thus enable AMD to create a chiplet-based machine learning accelerator whose sole function would be to accelerate machine learning - specifically, matrix multiplication. This would enable capabilities not unlike those available through NVIDIA's Tensor cores.

This could give AMD a modular way to add machine-learning capabilities to several of its designs through the inclusion of such a chiplet, and might be AMD's way of achieving hardware acceleration of a DLSS-like feature. This would avoid the shortcomings associated with implementing it in the GPU die itself - an increase in overall die area, and with it increased cost and reduced yields - while at the same time enabling AMD to deploy it in products other than GPU packages. The patent describes the possibility of different manufacturing technologies being employed in the chiplet-based design - harkening back to the I/O modules in Ryzen CPUs, which are manufactured on a 12 nm process rather than the 7 nm one used for the core chiplets. The patent also describes acceleration of cache requests from the GPU die to the cache chiplet, and on-the-fly usage of the chiplet either as actual cache or as directly addressable memory.

AMD is Allegedly Preparing Navi 31 GPU with Dual 80 CU Chiplet Design

AMD is about to enter the world of chiplets with its upcoming GPUs, just as it has done with the Zen generation of processors. Having launched a Radeon RX 6000 series lineup based on Navi 21 and Navi 22, the company is seemingly not stopping there. To remain competitive, it needs to keep innovating, and it is reportedly doing just that. According to current rumors, AMD is working on an RDNA 3 GPU design based on chiplets. The chiplet design is supposed to feature two 80 Compute Unit (CU) dies, each just like the one found inside the Radeon RX 6900 XT graphics card.

Having two 80 CU dies would bring the total core count to exactly 10240 cores (two times the 5120 cores on the Navi 21 die). Combined with the RDNA 3 architecture, which is said to bring better perf-per-watt than the last-generation uArch, the Navi 31 GPU is shaping up to be a compute monster. It isn't exactly clear when we are supposed to get this graphics card; however, it may arrive at the end of this year or the beginning of 2022.
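The core count quoted above follows from simple arithmetic, sketched here in Python. The per-CU figure is not stated in the article; it is derived from the 5120 cores and 80 CUs given for Navi 21, and the variable names are purely illustrative.

```python
# Figures taken from the article for a Navi 21-class die.
CUS_PER_DIE = 80
CORES_PER_DIE = 5120
DIES = 2  # rumored dual-die Navi 31 configuration

# Derived, not quoted: cores per Compute Unit implied by the figures above.
cores_per_cu = CORES_PER_DIE // CUS_PER_DIE

total_cus = DIES * CUS_PER_DIE
total_cores = total_cus * cores_per_cu

print(f"{cores_per_cu} cores per CU, {total_cus} CUs, {total_cores} cores in total")
```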

AMD Patents Chiplet Architecture for Radeon GPUs

On December 31st, AMD's Radeon group filed a patent for a chiplet GPU architecture, showing its vision for the future of Radeon GPUs. Currently, all of the GPUs available on the market use the monolithic approach, meaning that the graphics processing units are located on a single die. However, the current approach has its limitations. As dies get bigger for high-performance GPU configurations, they become more expensive to manufacture and do not scale well. With modern semiconductor nodes in particular, die costs are rising. For example, it would be more economically viable to have two dies that are 100 mm² in size each than to have one at 200 mm². AMD has realized this as well and has thus worked on a chiplet approach to the design.
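One way to see why two 100 mm² dies can be cheaper than a single 200 mm² die is a simple Poisson yield model, sketched below in Python. This is an illustration under stated assumptions, not anything taken from AMD's patent: the defect density `D0` is a made-up value, and the model ignores packaging cost, wafer-edge losses, and the possibility of salvaging partially defective dies.

```python
import math

def poisson_yield(area_mm2, defects_per_cm2):
    """Fraction of dies expected to be defect-free under a Poisson model."""
    area_cm2 = area_mm2 / 100.0
    return math.exp(-area_cm2 * defects_per_cm2)

def silicon_per_good_product(die_area_mm2, dies_per_product, defects_per_cm2):
    """Average wafer area (mm^2) consumed per working product."""
    good_fraction = poisson_yield(die_area_mm2, defects_per_cm2)
    return dies_per_product * die_area_mm2 / good_fraction

D0 = 0.5  # ASSUMED defect density in defects per cm^2, for illustration only

monolithic = silicon_per_good_product(200, 1, D0)  # one 200 mm^2 die
chiplets = silicon_per_good_product(100, 2, D0)    # two 100 mm^2 dies

print(f"monolithic: {monolithic:.0f} mm^2 of wafer per good GPU")
print(f"chiplets:   {chiplets:.0f} mm^2 of wafer per good GPU")
print(f"ratio:      {monolithic / chiplets:.2f}x")
```

Under this model the advantage grows with both die area and defect density, which is consistent with the article's point that the problem is most acute on modern nodes.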

AMD reports that multi-GPU configurations are inefficient due to limited software support, which is why GPUs have been kept monolithic for years. However, it seems the company has found a way past these limitations and can implement a workable solution. AMD believes that by using its new high-bandwidth passive crosslinks, it can achieve ideal chiplet-to-chiplet communication, where each GPU in the chiplet array would be coupled to the first GPU in the array. All communication would go through an active interposer containing many layers of wires that form the high-bandwidth passive crosslinks. The company envisions that the first GPU in the array would be communicably coupled to the CPU, meaning that the GPU array may have to use the CPU as a communication bridge. Such an arrangement would incur a large latency hit, so it is unclear what this would mean in practice.

CEA-Leti Makes a 96 core CPU from Six Chiplets

Chiplet-based processor design is becoming more popular thanks to the many improvements and opportunities it offers. Some of the benefits include lower costs, as the dies are smaller than a single monolithic design, while you are theoretically able to stitch together as many chiplets as you want. During the ISSCC 2020 conference, CEA-Leti, a French research institute, presented a 96-core CPU made from six 3D-stacked 16-core chiplets. The chip was created as a demonstration of what this modular approach offers and of the capabilities of chiplet-based CPU design.

The chiplets are manufactured on the 28 nm FD-SOI process from STMicroelectronics, while the active interposer die below them, which connects everything and provides external I/O support, is made on a 65 nm process. Each of the six dies houses 16 cores based on the MIPS instruction set architecture, arranged as four 4-core clusters. The core itself is a scalar MIPS32v1 design equipped with 16 KiB L1 instruction and L1 data caches. For L2 cache, there is 256 KiB per cluster, while the L3 cache is split into four 1 MiB tiles per chiplet.
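The topology described above can be tallied with a few lines of Python. This sketch only uses figures quoted in the article (six chiplets, four 4-core clusters each, 256 KiB of L2 per cluster); the constant names are illustrative.

```python
# Topology figures quoted in the article.
CHIPLETS = 6
CLUSTERS_PER_CHIPLET = 4
CORES_PER_CLUSTER = 4
L2_KIB_PER_CLUSTER = 256

cores_per_chiplet = CLUSTERS_PER_CHIPLET * CORES_PER_CLUSTER
total_clusters = CHIPLETS * CLUSTERS_PER_CHIPLET
total_cores = CHIPLETS * cores_per_chiplet

# Aggregate L2 across the whole package, converted from KiB to MiB.
total_l2_mib = total_clusters * L2_KIB_PER_CLUSTER / 1024

print(f"{cores_per_chiplet} cores per chiplet, {total_cores} cores in total")
print(f"{total_clusters} clusters, {total_l2_mib:.0f} MiB of L2 in total")
```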

AMD Gives Itself Massive Cost-cutting Headroom with the Chiplet Design

At its 2020 IEEE ISSCC keynote, AMD presented two slides that detail the extent of the cost savings yielded by its bold decision to embrace the MCM (multi-chip module) approach for not just its enterprise and HEDT processors, but also its mainstream desktop ones. By reserving cutting-edge silicon fabrication for only those components that tangibly benefit from it, namely the CPU cores, while letting other components sit on relatively inexpensive 12 nm, AMD is able to maximize its 7 nm foundry allocation by having it produce small 8-core CCDs (CPU complex dies), which add up to AMD's target core counts. With this approach, AMD is able to cram up to 16 cores onto its AM4 desktop socket using two chiplets, and up to 64 cores using eight chiplets on its SP3r3 and sTRX4 sockets.

In the slides below, AMD compares the cost of its current 7 nm + 12 nm MCM approach to a hypothetical monolithic die it would have had to build on 7 nm (including the I/O components). The slides suggest that the cost of a single-chiplet "Matisse" MCM (e.g. Ryzen 7 3700X) is about 40% less than that of the double-chiplet "Matisse" (e.g. Ryzen 9 3950X). Had AMD opted to build a monolithic 7 nm die with 8 cores and all the components of the I/O die, such a die would cost roughly 50% more than the current 1x CCD + IOD solution. A monolithic 7 nm die with 16 cores and the I/O components, on the other hand, would cost 125% more. AMD hence enjoys massive headroom for cost-cutting. The price of the flagship 3950X could be close to halved (from its current $749 MSRP), and AMD can turn up the heat on Intel's upcoming Core i9-10900K by significantly lowering the price of its 12-core 3900X from its current $499 MSRP. The company will also enjoy more price-cutting headroom for its 6-core Ryzen 5 SKUs than it did with previous-generation Ryzen 5 parts based on monolithic dies.
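The percentages above can be put on a common scale with a short Python sketch. It normalizes the double-chiplet "Matisse" to a cost of 100 and applies the article's figures to it. One assumption is made and flagged in the code: the "125% more" figure for the 16-core monolithic die is taken to be relative to the double-chiplet MCM, which the article implies but does not state outright. These are relative manufacturing costs, not prices.

```python
def apply_percent(base, percent_change):
    """Return base adjusted by a signed percentage (e.g. -40 or +125)."""
    return base * (100 + percent_change) / 100

# Baseline: double-chiplet "Matisse" (2x CCD + IOD), normalized to 100.
dual_chiplet = 100

# Article: single-chiplet MCM costs about 40% less than the double-chiplet one.
single_chiplet = apply_percent(dual_chiplet, -40)

# Article: monolithic 8-core die costs roughly 50% more than 1x CCD + IOD.
monolithic_8c = apply_percent(single_chiplet, +50)

# ASSUMPTION: "125% more" is measured against the 2x CCD + IOD solution.
monolithic_16c = apply_percent(dual_chiplet, +125)

for name, cost in [
    ("1x CCD + IOD", single_chiplet),
    ("2x CCD + IOD", dual_chiplet),
    ("monolithic 8-core", monolithic_8c),
    ("monolithic 16-core", monolithic_16c),
]:
    print(f"{name:<20} {cost:6.1f}")
```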

AMD Doubles L3 Cache Per CCX with Zen 2 "Rome"

A SiSoft SANDRA results database entry for a 2P AMD "Rome" EPYC machine sheds light on the lower cache hierarchy. Each 64-core EPYC "Rome" processor is made up of eight 7 nm 8-core "Zen 2" CPU chiplets, which converge at a 14 nm I/O controller die that handles the processor's memory and PCIe connectivity. The result lists the cache hierarchy as 512 KB of dedicated L2 cache per core and "16 x 16 MB L3." Like CPU-Z, SANDRA is able to report L3 cache by arrangement. For the Ryzen 7 2700X, it reads the L3 cache as "2 x 8 MB L3," corresponding to the per-CCX L3 cache amount of 8 MB.

Each 64-core "Rome" processor has a total of 8 chiplets. With SANDRA detecting "16 x 16 MB L3" for 64-core "Rome," it becomes highly likely that each of the 8-core chiplets features two 16 MB L3 cache slices, and that its 8 cores are split into two quad-core CCX units, each with 16 MB of L3 cache. This doubling of L3 cache per CCX could help the processors better cushion data transfers between the chiplets and the I/O die. This becomes particularly important since the I/O die controls memory with its monolithic 8-channel DDR4 memory controller.
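The inference above can be retraced step by step in Python. The sketch starts from the figures SANDRA reports and assumes, as the article does, that each L3 slice corresponds to exactly one CCX; the variable names are illustrative.

```python
# Figures reported for a 64-core "Rome" processor.
ROME_CORES = 64
ROME_CHIPLETS = 8
ROME_L3_SLICES = 16
ROME_L3_SLICE_MB = 16

# Ryzen 7 2700X reference point from the article: "2 x 8 MB L3".
ZEN_PLUS_L3_PER_CCX_MB = 8

cores_per_chiplet = ROME_CORES // ROME_CHIPLETS
slices_per_chiplet = ROME_L3_SLICES // ROME_CHIPLETS

# ASSUMPTION (as in the article): one L3 slice per CCX.
cores_per_ccx = cores_per_chiplet // slices_per_chiplet

total_l3_mb = ROME_L3_SLICES * ROME_L3_SLICE_MB
l3_per_chiplet_mb = slices_per_chiplet * ROME_L3_SLICE_MB
l3_growth_per_ccx = ROME_L3_SLICE_MB / ZEN_PLUS_L3_PER_CCX_MB

print(f"{slices_per_chiplet} CCX per chiplet, {cores_per_ccx} cores per CCX")
print(f"{l3_per_chiplet_mb} MB L3 per chiplet, {total_l3_mb} MB L3 per processor")
print(f"L3 per CCX grew {l3_growth_per_ccx:.0f}x over the 2700X")
```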

On The Coming Chiplet Revolution and AMD's MCM Promise

With Moore's Law being pronounced to be in its death throes, traditional monolithic die designs are becoming increasingly expensive to manufacture. It's no secret that both AMD and NVIDIA have been exploring an MCM (Multi-Chip-Module) approach as a way of moving from monolithic dies to a much more manageable "chiplet" design. Essentially, AMD has already achieved this in different ways with its Zen line of CPUs (two CPU modules of four cores each, linked via the company's Infinity Fabric interconnect) and with its own R9 and Vega graphics cards, which take another approach by packaging memory and the graphics processing die on the same silicon base - an interposer.