I'm just wondering how they're doing it: how they prioritize what goes into on-board memory and what goes into system RAM. Is it game dependent or fully on-the-fly? Do they have to make game profiles? This is the stuff I'm wondering about most. Because if they can achieve this fully on-the-fly without any profiles, with just really intelligent algorithms (maybe assisted by driver-updatable algorithms at the software level), that could be really sweet.
Resource streaming has to be implemented in the game engine.
Prefetching in CPUs works by finding access patterns, e.g. access of the block at address x, then x + k, then x + 2k (a minimal sketch of such a stride detector follows this list), but it has three requirements:
- The data needs to be laid out in memory in the same order it's going to be accessed.
- There have to be several accesses before a pattern can emerge, which means several cache misses, which in turn means stutter or missing resources.
- The patterns have to occur over a relatively short time, and there is no way to look for patterns across hundreds of thousands of memory accesses. A CPU, for comparison, looks through an instruction window of up to 224 instructions; for GPUs we have queues of up to several thousand commands, and the driver is not going to analyze queues spanning several frames to find patterns, maintain a giant lookup table, and resolve all of that instantly.
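For illustration, here's a minimal software sketch of the stride-detection idea. The table structure, stream keying, and confidence threshold are illustrative assumptions, not any real prefetcher's design:

```cpp
// Minimal sketch of a stride-detecting prefetcher table, as found
// (in hardware form) in CPU prefetch units. Illustrative only.
#include <cstdint>
#include <cstdio>
#include <unordered_map>

struct StrideEntry {
    uint64_t last_addr = 0;  // previous address seen for this stream
    int64_t  stride    = 0;  // last observed delta
    int      hits      = 0;  // consecutive repeats of the same stride
};

class StridePrefetcher {
    std::unordered_map<uint64_t, StrideEntry> table_;  // keyed by stream id
public:
    // Feed one access; returns a predicted next address once the same
    // stride has repeated, otherwise 0.
    uint64_t on_access(uint64_t stream_id, uint64_t addr) {
        StrideEntry& e = table_[stream_id];
        int64_t delta = static_cast<int64_t>(addr - e.last_addr);
        if (e.last_addr != 0 && delta == e.stride) {
            ++e.hits;
        } else {
            e.stride = delta;  // new candidate stride; confidence resets
            e.hits = 0;
        }
        e.last_addr = addr;
        // Note the cost: at least two misses (x, x + k) must happen
        // before the first useful prediction (x + 2k) can be issued.
        return e.hits >= 1 ? addr + e.stride : 0;
    }
};

int main() {
    StridePrefetcher p;
    for (uint64_t a : {0x1000ull, 0x1040ull, 0x1080ull, 0x10C0ull}) {
        uint64_t next = p.on_access(/*stream_id=*/1, a);
        std::printf("access %#llx -> prefetch %#llx\n",
                    (unsigned long long)a, (unsigned long long)next);
    }
}
```

Notice the detector stays silent for the first two accesses: that's the second requirement above showing up directly in code.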
The only game data that would benefit from this is landscape data, and even that still needs to be laid out in a specific pattern, which is something developers usually don't control. Also, this kind of caching would only work as long as the camera keeps moving in straight lines over time.
Resource streaming can be very successful when it's implemented properly in the rendering engine itself.
RX 480 and RX 580 might have 8 GB, but that's all they have. RX Vega with HBC can address all the memory in the system; in my case that would be 16+ GB of always-free RAM on top of the 8 GB on-board. Not even a GTX 1080 Ti or Titan X Pascal has that.
FYI: sharing of memory between CPU and GPU has been available in CUDA for years, so the idea is not new. It does, however, have very limited use cases.
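For reference, CUDA's unified memory (cudaMallocManaged) is one form of this: a single allocation visible to both CPU and GPU, with pages migrated on demand. A minimal sketch:

```cpp
// Minimal CUDA unified-memory sketch: one pointer usable from host
// and device code. Pages migrate on demand, which is also where the
// cost hides: a GPU touch of a CPU-resident page stalls on migration.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void scale(float* data, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

int main() {
    const int n = 1 << 20;
    float* data = nullptr;
    cudaMallocManaged(&data, n * sizeof(float));  // shared allocation

    for (int i = 0; i < n; ++i) data[i] = 1.0f;   // CPU writes

    scale<<<(n + 255) / 256, 256>>>(data, n, 2.0f);  // GPU reads/writes
    cudaDeviceSynchronize();                          // wait for kernel

    std::printf("data[0] = %f\n", data[0]);       // CPU reads result
    cudaFree(data);
}
```

The convenience is real, but the migration stalls are exactly why this only pays off in limited cases.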
The problem with texture streaming is that you're essentially doing VRAM+HDD or VRAM+SSD instead of something a lot faster. Vega's HBC with VRAM+RAM (+SSD) could address that far better, in the same way a CPU addresses its memory hierarchically: VRAM acts as the L1 cache, RAM as L2, and an SSD can be L3. Texture streaming the way current game engines do it (VRAM+HDD) still causes hitching, stuttering and frame-rate lag, because it's pulling textures from a really slow medium. A sketch of that hierarchical fall-through is below.
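Here's a runnable toy version of that VRAM/RAM/SSD analogy: serve a tile from the fastest tier that holds it, then promote it one tier up so repeated accesses get cheaper. The tier contents and promotion policy are illustrative assumptions, not how Vega's HBC actually works:

```cpp
// Toy three-tier residency hierarchy mirroring the VRAM/RAM/SSD analogy.
#include <array>
#include <cstdio>
#include <unordered_set>

enum Tier { VRAM = 0, RAM = 1, SSD = 2, TIERS = 3 };
static const char* kTierName[TIERS] = {"VRAM", "RAM", "SSD"};

struct Hierarchy {
    std::array<std::unordered_set<int>, TIERS> resident;  // tile ids per tier

    Tier access(int tile) {
        for (int t = 0; t < TIERS; ++t) {
            if (resident[t].count(tile)) {
                if (t > 0) {                   // promote one tier toward VRAM
                    resident[t].erase(tile);
                    resident[t - 1].insert(tile);
                }
                return static_cast<Tier>(t);
            }
        }
        resident[SSD].insert(tile);            // fault in at the slowest tier
        return SSD;
    }
};

int main() {
    Hierarchy h;
    h.resident[SSD].insert(42);
    for (int i = 0; i < 3; ++i)
        std::printf("access tile 42 -> served from %s\n",
                    kTierName[h.access(42)]);
    // Prints SSD, then RAM, then VRAM: each hit moves the tile up.
}
```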
As someone who has implemented texture streaming with a three-level hierarchy, I can tell you the problem is prediction. With HBC, each access to RAM is still going to be very slow, so the data has to be prefetched; HBC is not going to make the accesses "better" on its own.
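To make the prediction problem concrete, here's a hedged sketch of how a streamer might guess what to prefetch: extrapolate the camera ahead and request tiles at the mip level the predicted position will need. The camera model, lead time, and mip formula are illustrative assumptions, not anyone's shipping engine:

```cpp
// Sketch of prediction-driven prefetch in a streamed-texture system.
#include <cmath>
#include <cstdio>

struct Vec3 { float x, y, z; };

// Extrapolate the camera a fixed lead time ahead so the tile request
// can be issued early enough to hide the RAM (or SSD) latency.
Vec3 predict_camera(Vec3 pos, Vec3 vel, float lead_seconds) {
    return { pos.x + vel.x * lead_seconds,
             pos.y + vel.y * lead_seconds,
             pos.z + vel.z * lead_seconds };
}

// Pick a mip level from distance: nearby tiles need full resolution
// (mip 0), distant tiles can stay coarse and cheap.
int needed_mip(Vec3 cam, Vec3 tile, float base_distance) {
    float dx = cam.x - tile.x, dy = cam.y - tile.y, dz = cam.z - tile.z;
    float d = std::sqrt(dx * dx + dy * dy + dz * dz);
    return static_cast<int>(std::log2(std::fmax(d / base_distance, 1.0f)));
}

int main() {
    Vec3 pos{0, 0, 0}, vel{10, 0, 0};  // camera moving along +x
    Vec3 tile{64, 0, 0};
    // Request against the *predicted* position; if the player turns,
    // the prediction is wrong and the miss stalls the frame anyway --
    // which is why prediction, not raw capacity, is the hard part.
    Vec3 future = predict_camera(pos, vel, 0.5f);
    std::printf("prefetch tile at mip %d (now it would be mip %d)\n",
                needed_mip(future, tile, 16.0f),
                needed_mip(pos, tile, 16.0f));
}
```

However big the addressable pool is, this guess is the part that decides whether you hitch or not.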