
Panmnesia Uses CXL Protocol to Expand GPU Memory with Add-in DRAM Card or Even SSD

AleksandarK

News Editor
Staff member
South Korean startup Panmnesia has unveiled an interesting solution to address the memory limitations of modern GPUs. The company has developed a low-latency Compute Express Link (CXL) IP that could help expand GPU memory with an external add-in card. Current GPU-accelerated applications in AI and HPC are constrained by the fixed amount of memory built into GPUs. With data sizes growing by 3x yearly, GPU networks must keep getting larger just to fit the application in local memory, which benefits latency and token generation. Panmnesia's proposed fix leverages the CXL protocol to expand GPU memory capacity using PCIe-connected DRAM or even SSDs. The company has overcome significant technical hurdles, including the absence of CXL logic fabric in GPUs and the limitations of existing unified virtual memory (UVM) systems.

At the heart of Panmnesia's solution is a CXL 3.1-compliant root complex with multiple root ports and a host bridge featuring a host-managed device memory (HDM) decoder. This system effectively tricks the GPU's memory subsystem into treating PCIe-connected memory as native system memory. Extensive testing has demonstrated impressive results. Panmnesia's CXL solution, CXL-Opt, achieved two-digit nanosecond round-trip latency, significantly outperforming both UVM and earlier CXL prototypes. In GPU kernel execution tests, CXL-Opt showed execution times up to 3.22 times faster than UVM. Older CXL memory extenders recorded around 250 nanoseconds of round-trip latency, with CXL-Opt potentially achieving less than 80 nanoseconds. As is usual with CXL, the problem is that memory pools add latency and performance degrades, while these CXL extenders also tend to add cost. However, the Panmnesia CXL-Opt could find a use case, and we are waiting to see if anyone adopts this in their infrastructure.
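To make the role of an HDM decoder more concrete, here is a minimal sketch of range-based address decoding: a physical address is checked against a table of ranges, and a hit is routed to a root port together with an offset into the attached memory. This is only an illustration of the general idea, not Panmnesia's design. The `HdmRange` type, the `decode` function, and the memory layout are all hypothetical, and real HDM decoders also handle interleaving, which is left out here.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class HdmRange:
    base: int       # first physical address covered by this range
    size: int       # length of the range in bytes
    root_port: int  # hypothetical index of the root port the memory sits behind


def decode(ranges: List[HdmRange], addr: int) -> Optional[Tuple[int, int]]:
    """Return (root_port, device_offset) for addr, or None if no range covers it."""
    for r in ranges:
        if r.base <= addr < r.base + r.size:
            return r.root_port, addr - r.base
    return None


GIB = 1 << 30

# Made-up layout: two expanders mapped directly after 16 GiB of local GPU memory.
ranges = [
    HdmRange(base=16 * GIB, size=64 * GIB, root_port=0),   # e.g. a DRAM add-in card
    HdmRange(base=80 * GIB, size=256 * GIB, root_port=1),  # e.g. an SSD-backed expander
]

print(decode(ranges, 1 * GIB))   # local memory, not routed to any root port
print(decode(ranges, 17 * GIB))  # falls inside the first expander
```

An address that misses every range stays in local memory, while a hit is forwarded to the matching root port. From the GPU's point of view both are plain loads and stores, which is what lets the expander appear as ordinary memory.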



Below are some benchmarks by Panmnesia, as well as the architecture of the CXL-Opt.



 
that's at most what, 128GB/s on 16x Gen 5 PCIe? really not much for a big GPU, that's even less than what the RX 6500 XT has.
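As a rough check of that figure, the arithmetic can be sketched as below. It assumes the published PCIe 5.0 signalling rate (32 GT/s per lane with 128b/130b encoding) and the RX 6500 XT's 64-bit GDDR6 bus at 18 Gbps per pin. It ignores protocol overhead, so real-world throughput would be somewhat lower. The function names are made up for this example.

```python
def pcie_gb_per_s_one_way(gt_per_s: float, lanes: int, encoding: float) -> float:
    """Raw GB/s in one direction: transfer rate x lanes x encoding efficiency / 8 bits."""
    return gt_per_s * lanes * encoding / 8


def gddr_gb_per_s(bus_width_bits: int, gbit_per_pin: float) -> float:
    """Peak GB/s of a GDDR interface: bus width x per-pin data rate / 8 bits."""
    return bus_width_bits * gbit_per_pin / 8


pcie5_x16 = pcie_gb_per_s_one_way(32.0, 16, 128 / 130)
rx6500xt = gddr_gb_per_s(64, 18.0)

print(f"PCIe 5.0 x16: {pcie5_x16:.1f} GB/s per direction, {2 * pcie5_x16:.1f} GB/s both ways")
print(f"RX 6500 XT:   {rx6500xt:.1f} GB/s")
```

Under these assumptions, the ~128 GB/s figure only holds when both directions are counted together. A one-way read stream gets roughly half of that.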
 
That is not a huge problem. When it comes to big data processing in HPC, AI, etc., if a GPU cluster doesn't support a Unified Memory Architecture (UMA), that is, when CPUs and GPUs do not share the system's RAM, developers try to move as big a chunk of data as possible to the GPU memory and then run processing that can take a very long time (seconds, minutes, etc.). This means that, to some degree, memory bandwidth is less important. It is very important to process as big a chunk of data as possible!

Of course, faster memory interfaces are always better.
 
I think the Phison AI100E / aiDAPTIV+ is more practical for most people, hope to see coverage / testing on that
 
Everything goes in circles.

It must be over two decades since I socketed additional RAM into my GPU. Not sure if it was Matrox or ATI.

But the idea of an L4-esque RAM pool for GPUs? Killing the premium margin on pro GPUs? It will not happen on a large scale. They will not allow it.
 
But that name ... If someone had asked me yesterday what "panmnesia" means, I'd answer that it's a situation where everyone forgets everything. (Or should that mean everyone except AI?)

That is not a huge problem. When it comes to big data processing in HPC, AI, etc., if a GPU cluster doesn't support a Unified Memory Architecture (UMA), that is, when CPUs and GPUs do not share the system's RAM, developers try to move as big a chunk of data as possible to the GPU memory and then run processing that can take a very long time (seconds, minutes, etc.). This means that, to some degree, memory bandwidth is less important. It is very important to process as big a chunk of data as possible!

Of course, faster memory interfaces are always better.
But this is exactly that, if I understand its purpose well. It's low-latency memory that's shared between nodes, and it becomes part of each GPU's memory space.

Also, do any modern GPU+CPU architectures exist that can actually share memory between nodes, the way a multi-socket CPU system does?
 
now I can fix this pos 8gb 3070ti and say eat sh!t jensen
 
Oh sweet summer children... this will never come to the consumer sector. :roll:
Sad, since this would almost be a good excuse on its own for Gen6 PCIe in the consumer market.

Besides adding resources to GPUs and CPUs, being able to address relatively large amounts of previous-gen 'surplus' RAM as NVMe-like (RAM drive) storage/cache would be useful, both in the enthusiast-consumer world and in industry.

Not to mention, if Intel hasn't completely abandoned Optane, they could easily reinvigorate interest.
Offering Intel-licensed PMem cards (optionally using the once platform-proprietary P-DIMMs) over CXL would greatly broaden the potential market, especially with the newfound interest in "AI-ing everything" :laugh:
 
Will never be supported for desktop dGPUs so forget it, and it's also not coming any time soon.

GPU makers should add DDR5 memory slots to consumer GPUs so we can expand memory and still have good latency compared to any on-motherboard solution.
 
That is not a huge problem. When it comes to big data processing in HPC, AI, etc., if a GPU cluster doesn't support a Unified Memory Architecture (UMA), that is, when CPUs and GPUs do not share the system's RAM, developers try to move as big a chunk of data as possible to the GPU memory and then run processing that can take a very long time (seconds, minutes, etc.). This means that, to some degree, memory bandwidth is less important. It is very important to process as big a chunk of data as possible!
There's a problem with your explanation. You say it's not a big problem to move one big chunk slowly once, because then the data is on the GPU to be processed there. This is different. This is one big chunk next to the GPU, which will then be processed in many small chunks over the slow bus. It's effectively moving the data around on the slow bus constantly, because this is a product designed to be used with GPUs which don't have enough onboard VRAM.
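The difference between the two scenarios can be put into a back-of-the-envelope model. This is a bandwidth-only sketch: it ignores latency, caching, and any overlap of transfers with compute. All numbers in it are assumptions chosen for illustration (a 64 GB working set, a link of about 63 GB/s, and 1000 GB/s of local VRAM bandwidth). The first function also assumes the data fits in VRAM, which is precisely what the expander scenario rules out.

```python
def one_time_copy_s(data_gb: float, link_gb_s: float, vram_gb_s: float, passes: int) -> float:
    """Copy the data over the link once, then read it from local VRAM on every pass."""
    return data_gb / link_gb_s + passes * data_gb / vram_gb_s


def streamed_s(data_gb: float, link_gb_s: float, passes: int) -> float:
    """Data stays in expansion memory, so every pass re-reads it over the link."""
    return passes * data_gb / link_gb_s


DATA_GB = 64.0     # assumed working-set size
LINK_GB_S = 63.0   # assumed one-way link bandwidth
VRAM_GB_S = 1000.0 # assumed local VRAM bandwidth

for passes in (1, 10, 100):
    copy_t = one_time_copy_s(DATA_GB, LINK_GB_S, VRAM_GB_S, passes)
    stream_t = streamed_s(DATA_GB, LINK_GB_S, passes)
    print(f"{passes:>3} passes: copy once {copy_t:.2f} s, stream every pass {stream_t:.2f} s")
```

With a single pass over the data, the two cases cost about the same. The gap grows with every additional pass, which is the point being made above: how much the link hurts depends on how often the kernel has to revisit data that lives outside VRAM.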
 
Will never be supported for desktop dGPUs so forget it, and it's also not coming any time soon.

GPU makers should add DDR5 memory slots to consumer GPUs so we can expand memory and still have good latency compared to any on-motherboard solution.
I don't think this will ever happen because:
1. The likes of Nvidia will never allow it, and they have an iron grip on these AIBs.
2. Such an option would deprive them of higher revenue and profit margins, since it would let you buy a lower-end model and increase the RAM.
 
I don't think this will ever happen because:
1. The likes of Nvidia will never allow it, and they have an iron grip on these AIBs.
2. Such an option would deprive them of higher revenue and profit margins, since it would let you buy a lower-end model and increase the RAM.
Oh indeed, but I can dream, and it would be a simple option for consumer GPUs. This CXL stuff is for workstation-class GPUs and up.

Nvidia could of course stop gimping their GPUs and pretending L2 cache is the answer.
 
Everything goes in circles.

It must be over two decades since I socketed additional RAM into my GPU. Not sure if it was Matrox or ATI.

But the idea of an L4-esque RAM pool for GPUs? Killing the premium margin on pro GPUs? It will not happen on a large scale. They will not allow it.
That was in 1997 (in my case) when I added RAM to my ATI GPU back then.
 
This is great, we're finally going to be able to play Crysis at over 30fps in 1080p.
 
Everything goes in circles.

It must be over two decades since I socketed additional RAM into my GPU. Not sure if it was Matrox or ATI.

But the idea of an L4-esque RAM pool for GPUs? Killing the premium margin on pro GPUs? It will not happen on a large scale. They will not allow it.

For a short period of time, AMD also experimented with their Radeon Pro SSG cards, which included a user-upgradable NVMe drive and provided up to 2TB worth of video card memory.

There were some niche use-cases for it, and there were also attempts by some hardcore enthusiasts to try and access it to install games onto.

It would be interesting if AMD could bring it back for newer datacenter accelerators, and even for top-level gaming cards, making full use of PCIe 4.0 or even PCIe 5.0 bandwidth to use the SSDs either as extra storage or internally to speed up memory use somehow.
 