Improved Memory Management
Take a good look at the schematic diagram above. It is not a die block diagram, nor is it the layout of the multi-chip module that some of the first "Vega" GPUs will be. Rather, it illustrates a completely revamped memory architecture that makes sure data moves smoothly in and out of the GPU, and that precious resources aren't wasted fetching data from the host machine. AMD GPUs have traditionally been endowed with vast amounts of memory bandwidth through wide memory buses; however, AMD thinks there is room for improvement in the way the GPU juggles data between the host and its local video memory, and that it can no longer simply throw brute memory bandwidth at some fundamental problems.
AMD feels there is a disparity between the memory an app allocates and the memory it actually accesses. An app could load into memory all the resources it finds relevant to the 3D scene being rendered, but not actually access all of them all the time. This disparity eats up precious memory, hogs memory bandwidth, and wastes clock cycles on moving data. The graphics driver development team normally collaborates with game developers to minimize this phenomenon and rectify it through both game patches and driver updates. AMD feels something like this can be corrected at the hardware level, with what it calls "adaptive fine-grained data movement." It is a comprehensive memory allocation pipeline that senses the relevance of data, and either preemptively moves it to the relevant physical memory or defers access to it.
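To put a number on the gap between allocation and access described above, here is a minimal Python sketch. The resource names and sizes are entirely made up for illustration, and `accessed_fraction` is a hypothetical helper, not anything from AMD's driver or hardware.

```python
def accessed_fraction(allocated_mb, touched_names):
    """Return the share of allocated memory that was actually touched."""
    total = sum(allocated_mb.values())
    touched = sum(size for name, size in allocated_mb.items()
                  if name in touched_names)
    return touched / total


# Hypothetical resources an app has loaded (name -> size in MB).
# These numbers are invented purely to illustrate the idea.
allocated = {"terrain": 1024, "characters": 512, "cutscene": 1536, "ui": 128}

# Hypothetical set of resources the current frame actually reads.
touched_this_frame = {"terrain", "characters", "ui"}

fraction = accessed_fraction(allocated, touched_this_frame)
print(f"{fraction:.0%} of allocated memory was accessed this frame")
```

In this invented case, nearly half of the resident data sits idle for the frame. That idle share is the kind of waste a finer-grained, access-aware data movement scheme would aim to keep out of fast local memory.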
Pulling something like this off requires new hardware components never before found on AMD GPUs. It begins with a fast cache memory that sits at a level above the traditional L2 cache, one that is sufficiently large and has extremely low latency. This cache is a separate silicon die that sits on the interposer, the silicon substrate that connects the GPU die to the memory stacks. AMD is calling this the High Bandwidth Memory Cache (HBMC). The GPU's conventional memory controllers won't interface with this cache, since a dedicated High Bandwidth Cache Controller (HBCC) on the main GPU die handles it. The High Bandwidth Memory Cache isn't the same as the HBM2 memory stacks.
The HBCC has direct access to the other memory along the memory pipeline, including the video memory, system memory, and so on. It has its own 512 TB virtual address space that's isolated and abstracted from the machine's general address space. The GPU uses the HBMC to cushion and smooth out data movement between the host machine and the GPU. This approach ensures the GPU spends fewer resources on fetching irrelevant data, which greatly improves memory bandwidth utilization. The reason for such a large virtual address space is the same as on the CPU: addresses can be allocated more efficiently, with the memory-management unit in the GPU managing the virtual-to-physical mapping and also being able to move memory pages between storage layers, similar to how the Windows paging file works. Also, you'll notice the little box named "NVRAM." This means the GPU has the ability to directly interface with NAND flash or 3D XPoint SSDs over a localized PCIe connection, which gives it a fast scratchpad to help with processing gargantuan data sets. The "Network" port lets graphics card makers add network PHYs directly onto the card, which would help with rendering farms. This way, AMD is prepping a common piece of silicon for various applications (consumer graphics, professional graphics, and rendering farms).
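Two ideas from the paragraph above are easy to sketch in Python: how many address bits a 512 TB virtual address space implies, and how pages can be shuffled between a small fast tier and a larger backing tier on demand. The example assumes binary terabytes (2^40 bytes). `TinyPager` and its least-recently-used eviction policy are hypothetical, chosen only to mirror the paging-file analogy; they are not a description of how the HBCC actually decides what to move.

```python
from collections import OrderedDict

TIB = 2 ** 40  # assumption: "TB" is read as a binary terabyte

# A power-of-two size 2**n needs n address bits.
address_bits = (512 * TIB).bit_length() - 1
print(f"512 TB of virtual address space needs {address_bits} address bits")


class TinyPager:
    """Toy pager: a small fast tier backed by a larger, slower tier.

    Hypothetical illustration only; uses plain LRU eviction.
    """

    def __init__(self, local_capacity):
        self.local_capacity = local_capacity
        self.local = OrderedDict()  # resident page ids, oldest first
        self.faults = 0

    def access(self, page):
        if page in self.local:
            # Page already resident: mark it as most recently used.
            self.local.move_to_end(page)
            return "hit"
        # Page not resident: count a fault and bring it in,
        # evicting the least recently used page if the tier is full.
        self.faults += 1
        if len(self.local) >= self.local_capacity:
            self.local.popitem(last=False)
        self.local[page] = True
        return "fault"


pager = TinyPager(local_capacity=2)
results = [pager.access(p) for p in [1, 2, 1, 3, 2]]
print(results, "faults:", pager.faults)
```

The point of the toy is that only pages actually touched ever occupy the fast tier, while the much larger virtual space costs nothing until it is accessed.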
As reported in the news, "Vega" takes advantage of HBM2 memory, which offers eight times the maximum density per stack and double the bandwidth of HBM1 memory, which debuted with the Radeon R9 Fury X. In theory, you can deploy up to 32 GB of memory across four stacks, doing away with the crippling 4 GB limitation of HBM1.
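The capacity figures above follow from simple arithmetic, sketched here in Python. The 1 GB-per-stack figure for HBM1 is inferred from the 4 GB limit spread over four stacks; the constant names are just illustrative.

```python
STACKS = 4                 # four stacks on the package
HBM1_GB_PER_STACK = 1      # inferred: 4 GB limit / 4 stacks
HBM2_DENSITY_FACTOR = 8    # "eight times the maximum density per stack"

hbm1_total_gb = STACKS * HBM1_GB_PER_STACK
hbm2_total_gb = STACKS * HBM1_GB_PER_STACK * HBM2_DENSITY_FACTOR

print(f"HBM1: {hbm1_total_gb} GB, HBM2: {hbm2_total_gb} GB")
```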