Apple Wants to Store LLMs on Flash Memory to Bring AI to Smartphones and Laptops

AleksandarK · Dec 21, 2023

Apple has been experimenting with Large Language Models (LLMs) that power most of today's AI applications. The company wants these LLMs to serve the users best and deliver them efficiently, which is a difficult task as they require a lot of resources, including compute and memory. Traditionally, LLMs have required AI accelerators in combination with large DRAM capacity to store model weights. However, Apple has published a paper that aims to bring LLMs to devices with limited memory capacity. By storing LLMs on NAND flash memory (regular storage), the method involves constructing an inference cost model that harmonizes with the flash memory behavior, guiding optimization in two critical areas: reducing the volume of data transferred from flash and reading data in larger, more contiguous chunks. Instead of storing the model weights on DRAM, Apple wants to utilize flash memory to store weights and only pull them on-demand to DRAM once it is needed.

Two principal techniques are introduced within this flash memory-informed framework: "windowing" and "row-column bundling." These methods collectively enable running models up to twice the size of the available DRAM, with a 4-5x and 20-25x increase in inference speed compared to native loading approaches on CPU and GPU, respectively. Integrating sparsity awareness, context-adaptive loading, and a hardware-oriented design pave the way for practical inference of LLMs on devices with limited memory, such as SoCs with 8/16/32 GB of available DRAM. Especially with DRAM prices outweighing NAND Flash, setups such as smartphone configurations could easily store and inference LLMs with multi-billion parameters, even if the DRAM available isn't sufficient. For a more technical deep dive, read the paper on arXiv here.

View at TechPowerUp Main Site | Source

phanbuey · Dec 21, 2023

This is done by a company here locally that seems to work fairly well. The flash cells can hold various charge values and are good for storing pre-baked AI neurons.

Power-efficient analog compute for edge AI - Mythic

basically they're using the flash as an analog computer.

AleksandarK · Dec 21, 2023

phanbuey said:
This is done by a company here locally that seems to work fairly well. The flash cells can hold various charge values and are good for storing pre-baked AI neurons.

Power-efficient analog compute for edge AI - Mythic

basically they're using the flash as an analog computer.

This is processing in memory, Apple just uses flash to store weights and only pull the weights to DRAM if needed. Not processing in flash

phanbuey · Dec 21, 2023

AleksandarK said:
This is processing in memory, Apple just uses flash to store weights and only pull the weights to DRAM if needed. Not processing in flash

I think it's the same thing, they just call it "Analog computing" - but they still have a digital processor to interpret and makes the data from the model usable.

" Each tile has a large Analog Compute Engine (Mythic ACE™) to store bulky neural network weights, local SRAM memory for data being passed between the neural network nodes, a single-instruction multiple-data (SIMD) unit for processing operations not handled by the ACE"

AleksandarK · Dec 21, 2023

phanbuey said:
I think it's the same thing, they just call it "Analog computing" - but they still have a digital processor to interpret and makes the data from the model usable.

" Each tile has a large Analog Compute Engine (Mythic ACE™) to store bulky neural network weights, local SRAM memory for data being passed between the neural network nodes, a single-instruction multiple-data (SIMD) unit for processing operations not handled by the ACE"

It is not the same thing as ACE, which actually processes the data in memory. What Apple is proposing is still separate processing and memory elements.

phanbuey · Dec 21, 2023

AleksandarK said:
It is not the same thing as ACE, which actually processes the data in memory. What Apple is proposing is still separate processing and memory elements.

but if you read what they're saying it does... it actually doesn't process anything "Analog Compute Engine (Mythic ACE™) to store bulky neural network weights" then it passes to SRAM, then to the SIMD. I'm sure it's more complicated then that and there is data processing and in-memory computing functionality in the architecture... but seems like the concept of storing neural weights in nand is something that makes sense for AI at the edge.

AleksandarK · Dec 21, 2023

phanbuey said:
but if you read what they're saying it does... it actually doesn't process anything "Analog Compute Engine (Mythic ACE™) to store bulky neural network weights" then it passes to SRAM, then to the SIMD. I'm sure it's more complicated then that and there is data processing and in-memory computing functionality in the architecture... but seems like the concept of storing neural weights in nand is something that makes sense for AI at the edge.

"Our analog compute takes compute-in-memory to an extreme, where we compute directly inside the memory array itself. This is possible by using the memory elements as tunable resistors, supplying the inputs as voltages, and collecting the outputs as currents. We use analog computing for our core neural network matrix operations, where we are multiplying an input vector by a weight matrix"
It actually processes matrix multiply and accumulate (MAC) operations, which is integral to neural network operation flow. Such data is then pushed further.

phanbuey · Dec 21, 2023

AleksandarK said:
"Our analog compute takes compute-in-memory to an extreme, where we compute directly inside the memory array itself. This is possible by using the memory elements as tunable resistors, supplying the inputs as voltages, and collecting the outputs as currents. We use analog computing for our core neural network matrix operations, where we are multiplying an input vector by a weight matrix"
It actually processes matrix multiply and accumulate (MAC) operations, which is integral to neural network operation flow. Such data is then pushed further.

Fascinating :toast:

.

bug · Dec 21, 2023

Keep in mind it's training the LLMs that eats a ton of resources, not the LLMs themselves. It's no different from "ordinary" neural networks.

AleksandarK · Dec 21, 2023

bug said:
Keep in mind it's training the LLMs that eats a ton of resources, not the LLMs themselves. It's no different from "ordinary" neural networks.

Those are cloud resources; all we care about is local resources at our fingertips!

bug · Dec 21, 2023

AleksandarK said:
Those are cloud resources; all we care about is local resources at our fingertips!

I was just making the distinction, lest we get people excited about how Apple is able to put a whole cloud onto a magic Apple stick...

AnotherReader · Dec 21, 2023

phanbuey said:
but if you read what they're saying it does... it actually doesn't process anything "Analog Compute Engine (Mythic ACE™) to store bulky neural network weights" then it passes to SRAM, then to the SIMD. I'm sure it's more complicated then that and there is data processing and in-memory computing functionality in the architecture... but seems like the concept of storing neural weights in nand is something that makes sense for AI at the edge.

After reading the paper, this is a clever approach to minimize the amount of data transferred from NAND to DRAM by taking advantage of the sparsity of common LLMs such as GPT-3 and OPT.

bug said:
Keep in mind it's training the LLMs that eats a ton of resources, not the LLMs themselves. It's no different from "ordinary" neural networks.

While training is the most resource intensive part by definition, using the model for inference for hundred of millions or even billions of times will overcome the cost of training as that is a one time cost. Therefore, improving the cost of inference is beneficial too. It also allows more inference to be done on the client side.

Note: The cost for training GPT-3 is estimated to be 3.1 * 10^23 flops while the cost of a single inference operation is estimated to be 740 TFLOPs. After 420 million inferences, the cost of inference will surpass that of training.

bug · Dec 21, 2023

AnotherReader said:
While training is the most resource intensive part by definition, using the model for inference for hundred of millions or even billions of times will overcome the cost of training as that is a one time cost. Therefore, improving the cost of inference is beneficial too. It also allows more inference to be done on the client side.

I'm not sure about LLMs, but for NNs, inference is almost zero cost. A bunch of additions and multiplications that will barely register on any recent CPU.
The biggest part in this would be moving it from the cloud to the client. Even then, it's not that different from what JS did when it started offloading processing to the client (browser). It's a welcome option, but really nothing to rave about.

pk67 · Dec 21, 2023

with a 4-5x and 20-25x increase in inference speed compared to naive loading approaches on CPU and GPU

Naive is not the right term author want to use I guess

AnotherReader · Dec 21, 2023

bug said:
I'm not sure about LLMs, but for NNs, inference is almost zero cost. A bunch of additions and multiplications that will barely register on any recent CPU.
The biggest part in this would be moving it from the cloud to the client. Even then, it's not that different from what JS did when it started offloading processing to the client (browser). It's a welcome option, but really nothing to rave about.

In isolation, the cost of one single inference is miniscule compared to the training cost. However, the volume of queries per day (10 million for ChatGPT in early 2023) is enough to ensure that inference cost for a GPT-3 trained model will have surpassed training cost in 42 days.

FrostWolf · Dec 21, 2023

This is where I insert the (kidding) comment that this is the first time Apple wanted to sell devices with more memory. ;^)

(Disclaimer: I use an iPhone, and I have no personal issues with Macs).

SOAREVERSOR · Dec 21, 2023

FrostWolf said:
This is where I insert the (kidding) comment that this is the first time Apple wanted to sell devices with more memory. ;^)

(Disclaimer: I use an iPhone, and I have no personal issues with Macs).

Oh they want to sell devices with more memory! It's what they want to charge for it that's been the issue

bug · Dec 21, 2023

AnotherReader said:
In isolation, the cost of one single inference is miniscule compared to the training cost. However, the volume of queries per day (10 million for ChatGPT in early 2023) is enough to ensure that inference cost for a GPT-3 trained model will have surpassed training cost in 42 days.

Right. And doubling the inference speed would net you what? 84 days before you surpass it? That's why I said the big thing here is moving inference to client. In my mind that takes precedence over the speed of the inference itself.

pk67 said:
Naive is not the right term author want to use I guess

It really is. That the common term in software engineering for an implementation that aims for nothing but just proving some concept works.
Designing data structures around a particular set of problems is nothing new either.

Wirko · Dec 21, 2023

The most surprising part of this story is how much technical information Apple is willing to share with the world.

Tomorrow · Dec 22, 2023

The company who uses soldered low grade NAND that is not properly parallelized, wants to put write intensive stuff on NAND? Yeah im sure that will work out well...

kondamin · Dec 22, 2023

Tomorrow said:
The company who uses soldered low grade NAND that is not properly parallelized, wants to put write intensive stuff on NAND? Yeah im sure that will work out well...

Should have bought in to intel and micron 3dxpoint

Shihab · Dec 22, 2023

Apple discovers asset streaming...

bug · Dec 22, 2023

Tomorrow said:
The company who uses soldered low grade NAND that is not properly parallelized, wants to put write intensive stuff on NAND? Yeah im sure that will work out well...

LLMs are not write intensive. At least not past their training stage. They may add your conversations to their existing knowledge base, but other than that, they're pretty much read-only.

But again, offloading server processing to clients and finding more efficient ways to represent data is something done as a routine throughout the industry. It only becomes remarkable when Apple does it... And they're not even the first to do it, afaik Mozilla was the first to announce they want to let you use AI on your local machine (among other things).

Wirko · Dec 22, 2023

Apple is actually right, SSDs are thrice cheaper per terabyte than even the cheapest peasant DDR4-2400. I think it's that A.666 form factor to blame for cheapness.

pk67 · Jan 2, 2024

bug said:
It really is. That the common term in software engineering for an implementation that aims for nothing but just proving some concept works.
Designing data structures around a particular set of problems is nothing new either.

I'm not sure you are commenting the right term really. Native fit to the context imho but naive not.

System Name	Arrow in the Knee
Processor	265KF -50mv, 32 NGU 34 D2D 40 ring
Motherboard	ASUS PRIME Z890-M
Cooling	Thermalright Phantom Spirit EVO (Intake)
Memory	64GB DDR5 7200 CL34-44-44-44-88 TREFI 65535
Video Card(s)	RTX 4090 FE
Storage	2TB WD SN850, 4TB WD SN850X
Display(s)	Alienware 32" 4k 240hz OLED
Case	Jonsbo Z20
Audio Device(s)	Yes
Power Supply	Corsair SF750
Mouse	DeathadderV2 X Hyperspeed
Keyboard	Aula F75 cream switches
Software	Windows 11
Benchmark Scores	They're pretty good, nothing crazy.

System Name	Arrow in the Knee
Processor	265KF -50mv, 32 NGU 34 D2D 40 ring
Motherboard	ASUS PRIME Z890-M
Cooling	Thermalright Phantom Spirit EVO (Intake)
Memory	64GB DDR5 7200 CL34-44-44-44-88 TREFI 65535
Video Card(s)	RTX 4090 FE
Storage	2TB WD SN850, 4TB WD SN850X
Display(s)	Alienware 32" 4k 240hz OLED
Case	Jonsbo Z20
Audio Device(s)	Yes
Power Supply	Corsair SF750
Mouse	DeathadderV2 X Hyperspeed
Keyboard	Aula F75 cream switches
Software	Windows 11
Benchmark Scores	They're pretty good, nothing crazy.

System Name	Arrow in the Knee
Processor	265KF -50mv, 32 NGU 34 D2D 40 ring
Motherboard	ASUS PRIME Z890-M
Cooling	Thermalright Phantom Spirit EVO (Intake)
Memory	64GB DDR5 7200 CL34-44-44-44-88 TREFI 65535
Video Card(s)	RTX 4090 FE
Storage	2TB WD SN850, 4TB WD SN850X
Display(s)	Alienware 32" 4k 240hz OLED
Case	Jonsbo Z20
Audio Device(s)	Yes
Power Supply	Corsair SF750
Mouse	DeathadderV2 X Hyperspeed
Keyboard	Aula F75 cream switches
Software	Windows 11
Benchmark Scores	They're pretty good, nothing crazy.

System Name	Arrow in the Knee
Processor	265KF -50mv, 32 NGU 34 D2D 40 ring
Motherboard	ASUS PRIME Z890-M
Cooling	Thermalright Phantom Spirit EVO (Intake)
Memory	64GB DDR5 7200 CL34-44-44-44-88 TREFI 65535
Video Card(s)	RTX 4090 FE
Storage	2TB WD SN850, 4TB WD SN850X
Display(s)	Alienware 32" 4k 240hz OLED
Case	Jonsbo Z20
Audio Device(s)	Yes
Power Supply	Corsair SF750
Mouse	DeathadderV2 X Hyperspeed
Keyboard	Aula F75 cream switches
Software	Windows 11
Benchmark Scores	They're pretty good, nothing crazy.

Processor	Intel i5-12600k
Motherboard	Asus H670 TUF
Cooling	Arctic Freezer 34
Memory	2x16GB DDR4 3600 G.Skill Ripjaws V
Video Card(s)	EVGA GTX 1060 SC
Storage	500GB Samsung 970 EVO, 500GB Samsung 850 EVO, 1TB Crucial MX300 and 2TB Crucial MX500
Display(s)	Dell U3219Q + HP ZR24w
Case	Raijintek Thetis
Audio Device(s)	Audioquest Dragonfly Red :D
Power Supply	Seasonic 620W M12
Mouse	Logitech G502 Proteus Core
Keyboard	G.Skill KM780R
Software	Arch Linux + Win10

Apple Wants to Store LLMs on Flash Memory to Bring AI to Smartphones and Laptops

AleksandarK

News Editor

phanbuey

AleksandarK

News Editor

phanbuey

AleksandarK

News Editor

phanbuey

AleksandarK

News Editor

phanbuey

bug

AleksandarK

News Editor

bug

AnotherReader

bug

pk67

AnotherReader

FrostWolf

SOAREVERSOR

bug

Wirko

Tomorrow

kondamin

Shihab

bug

Wirko

pk67

Processor	Ryzen 7 5700X
Motherboard	ASUS TUF Gaming X570-PRO (WiFi 6)
Cooling	Noctua NH-C14S (two fans)
Memory	2x16GB DDR4 3200
Video Card(s)	Reference Vega 64
Storage	Intel 665p 1TB, WD Black SN850X 2TB, Crucial MX300 1TB SATA, Samsung 830 256 GB SATA
Display(s)	Nixeus NX-EDG27, and Samsung S23A700
Case	Fractal Design R5
Power Supply	Seasonic PRIME TITANIUM 850W
Mouse	Logitech
VR HMD	Oculus Rift
Software	Windows 11 Pro, and Ubuntu 20.04

Processor	i5-6600K
Motherboard	Asus Z170A
Cooling	some cheap Cooler Master Hyper 103 or similar
Memory	16GB DDR4-2400
Video Card(s)	IGP
Storage	Samsung 850 EVO 250GB
Display(s)	2x Oldell 24" 1920x1200
Case	Bitfenix Nova white windowless non-mesh
Audio Device(s)	E-mu 1212m PCI
Power Supply	Seasonic G-360
Mouse	Logitech Marble trackball, never had a mouse
Keyboard	Key Tronic KT2000, no Win key because 1994
Software	Oldwin

System Name	DarkStar
Processor	AMD Ryzen 7 5800X3D
Motherboard	Gigabyte X570 Aorus Master 1.0 (BIOS F39g)
Cooling	Arctic Liquid Freezer II 420mm AIO (rev4)
Memory	4x8GB Patriot Viper DDR4 4400C19 @ 3733Mhz 14-14-13-27 1T
Video Card(s)	Gigabyte Radeon RX 9070 XT Gaming OC 16GB GDDR6 @ 3400Mhz Core/22Gbps Mem
Storage	1TB Samsung 990 Pro (OS);2TB Samsung PM9A1;4TB XPG S70 Blade (Games);14TB WD UltraStar HC530 (Video)
Display(s)	27" LG UltraGear 27GS85Q-B @ 2560x1440 @ 200Hz, Nano-IPS
Case	be quiet! Dark Base Pro 900 Rev.2
Audio Device(s)	SteelSeries Arctis Nova Pro Wireless
Power Supply	1000W Seasonic PRIME Ultra Titanium;600W APC SMT750i UPS
Mouse	Logitech G604
Keyboard	Logitech G910 Orion Spark
Software	Windows 11 Pro x64 24H2 (Build 26100.4351)

System Name	192.168.1.1~192.168.1.100
Processor	AMD Ryzen5 5600G.
Motherboard	Gigabyte B550m DS3H.
Cooling	AMD Wraith Stealth.
Memory	16GB Crucial DDR4.
Video Card(s)	Gigabyte GTX 1080 OC (Underclocked, underpowered).
Storage	Samsung 980 NVME 500GB && Assortment of SSDs.
Display(s)	ViewSonic VA2406-MH 75Hz
Case	Bitfenix Nova Midi
Audio Device(s)	On-Board.
Power Supply	SeaSonic CORE GM-650.
Mouse	Logitech G300s
Keyboard	Kingston HyperX Alloy FPS.
VR HMD	A pair of OP spectacles.
Software	Ubuntu 24.04 LTS.
Benchmark Scores	Me no know English. What bench mean? Bench like one sit on?