
IBM Artificial Intelligence Unit (AIU) Arrives with 23 Billion Transistors

AleksandarK

News Editor
Staff member
Joined
Aug 19, 2017
Messages
2,241 (0.91/day)
IBM Research has published information about the company's latest processor for accelerating Artificial Intelligence (AI). The latest IBM processor, called the Artificial Intelligence Unit (AIU), tackles the problem of creating an enterprise solution for AI deployment that fits in a PCIe slot. The IBM AIU is a half-height PCIe card carrying a processor with 23 billion transistors manufactured on a 5 nm node (presumably TSMC's). While IBM has not provided many details yet, we know that the AIU builds on the AI accelerator found in the Telum chip, the processor at the heart of the IBM z16 mainframe. The AIU takes Telum's AI engine and scales it up to 32 cores to achieve high efficiency.

The company has highlighted two main paths for enterprise AI adoption. The first is to embrace lower precision and use approximate computing to drop from 32-bit formats to smaller formats that hold a quarter as much precision yet still deliver similar results. The other, as IBM puts it, is that an "AI chip should be laid out to streamline AI workflows. Because most AI calculations involve matrix and vector multiplication, our chip architecture features a simpler layout than a multi-purpose CPU. The IBM AIU has also been designed to send data directly from one compute engine to the next, creating enormous energy savings."
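To put the lower-precision idea in concrete terms, here is a minimal sketch of symmetric INT8 quantization, assuming NumPy and purely illustrative names. It is not IBM's actual scheme, just the general technique of mapping FP32 values onto an 8-bit grid and back:

```python
import numpy as np

def quantize_int8(x: np.ndarray):
    """Map FP32 values onto a signed 8-bit grid with one per-tensor scale (illustrative)."""
    scale = float(np.abs(x).max()) / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Map the INT8 codes back to approximate FP32 values."""
    return q.astype(np.float32) * scale

weights = np.random.randn(1024).astype(np.float32)
q, s = quantize_int8(weights)
restored = dequantize(q, s)
print("worst-case rounding error:", float(np.abs(weights - restored).max()))  # small, not zero
```

The storage cost drops to a quarter of FP32, and hardware can do the multiply-accumulates on cheap integer units, which is where the density and energy savings come from.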




In the sea of AI accelerators, IBM hopes to differentiate its offering with an enterprise chip aimed at more complex problems than current AI chips target. "Deploying AI to classify cats and dogs in photos is a fun academic exercise. But it won't solve the pressing problems we face today. For AI to tackle the complexities of the real world—things like predicting the next Hurricane Ian, or whether we're heading into a recession—we need enterprise-quality, industrial-scale hardware. Our AIU takes us one step closer. We hope to soon share news about its release," says the official IBM release.

View at TechPowerUp Main Site | Source
 
Joined
Apr 16, 2013
Messages
535 (0.13/day)
Location
Bulgaria
System Name Black Knight | White Queen
Processor Intel Core i9-10940X | Intel Core i7-5775C
Motherboard ASUS ROG Rampage VI Extreme Encore X299G | ASUS Sabertooth Z97 Mark S (White)
Cooling Noctua NH-D15 chromax.black | Xigmatek Dark Knight SD-1283 Night Hawk (White)
Memory G.SKILL Trident Z RGB 4x8GB DDR4 3600MHz CL16 | Corsair Vengeance LP 4x4GB DDR3L 1600MHz CL9 (White)
Video Card(s) ASUS ROG Strix GeForce RTX 4090 OC | KFA2/Galax GeForce GTX 1080 Ti Hall of Fame Edition
Storage Samsung 990 Pro 2TB, 980 Pro 1TB, 850 Pro 256GB, 840 Pro 256GB, WD 10TB+ (incl. VelociRaptors)
Display(s) Dell Alienware AW2721D 240Hz, ASUS ROG Strix XG279Q 170Hz, ASUS PA246Q 60Hz| Samsung JU7500 48'' TV
Case Corsair 7000D AIRFLOW (Black) | NZXT ??? w/ ASUS DRW-24B1ST
Audio Device(s) ASUS Xonar Essence STX | Realtek ALC1150
Power Supply Enermax Revolution 1250W 85+ | Super Flower Leadex Gold 650W (White)
Mouse Razer Basilisk Ultimate, Razer Naga Trinity | Razer Mamba 16000
Keyboard Razer Blackwidow Chroma V2 (Orange switch) | Razer Ornata Chroma
Software Windows 10 Pro 64bit
Doesn't sound that impressive to me, when you consider that the fingernail-sized A16 has 16 billion.
 
Joined
Feb 27, 2013
Messages
445 (0.11/day)
Location
Lithuania
Doesn't sound that impressive to me, when you consider that the fingernail-sized A16 has 16 billion.
This is probably a similar size; it just has a heat spreader because they can push the chip instead of power-limiting it.
 
Joined
Aug 30, 2006
Messages
7,198 (1.11/day)
System Name ICE-QUAD // ICE-CRUNCH
Processor Q6600 // 2x Xeon 5472
Memory 2GB DDR // 8GB FB-DIMM
Video Card(s) HD3850-AGP // FireGL 3400
Display(s) 2 x Samsung 204Ts = 3200x1200
Audio Device(s) Audigy 2
Software Windows Server 2003 R2 as a Workstation now migrated to W10 with regrets.
I don't like this new "AI" terminology in compute. As written here, >>The first one is to embrace lower precision and use approximate computing to drop from 32-bit formats to some odd-bit structures that hold a quarter as much precision and still deliver the same result.<<

So, instead of FP32, we are going to INT8. And >>still deliver the same result<< is just untrue, because there are only a few, very specific simulation/calculation scenarios where you would get the same results. You might get nearly the same results most of the time, but never the same all the time.

Crude, savage, approximate fast-math has its applications for certain jobs. But let's not pretend it is suitable for all "AI" computational tasks, nor that it will "deliver the same result."

The error set of fast INT8 math vs. slow FP32/64/80/128 math is like a fractal, or a Mandelbrot set. It is "beautiful" in its unpredictable deviation, and in some boundary areas that deviation can be ENORMOUS.
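For illustration only (a sketch of the general technique, not of how the AIU or any other specific chip computes), the following compares an FP64 reference dot product against the same computation on INT8-quantized inputs; the deviation is data-dependent, which is the point about unpredictability:

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.standard_normal(4096).astype(np.float32)
b = rng.standard_normal(4096).astype(np.float32)

def to_int8(x):
    """Symmetric quantization to INT8 codes plus a scale factor (sketch only)."""
    scale = float(np.abs(x).max()) / 127.0
    return np.clip(np.round(x / scale), -127, 127).astype(np.int32), scale

qa, sa = to_int8(a)
qb, sb = to_int8(b)

exact = float(a.astype(np.float64) @ b.astype(np.float64))   # FP64 reference
approx = int(qa @ qb) * sa * sb                               # integer accumulate, then rescale
print(f"exact={exact:.4f}  int8 approx={approx:.4f}  abs error={abs(approx - exact):.4f}")
```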
 
Joined
Nov 18, 2010
Messages
7,140 (1.45/day)
Location
Rīga, Latvia
System Name HELLSTAR
Processor AMD RYZEN 9 5950X
Motherboard ASUS Strix X570-E
Cooling 2x 360 + 280 rads. 3x Gentle Typhoons, 3x Phanteks T30, 2x TT T140 . EK-Quantum Momentum Monoblock.
Memory 4x8GB G.SKILL Trident Z RGB F4-4133C19D-16GTZR 14-16-12-30-44
Video Card(s) Sapphire Pulse RX 7900XTX + under waterblock through Kryosheet
Storage Optane 900P[W11] + WD BLACK SN850X 4TB + 750 EVO 500GB + 1TB 980PRO[FEDORA]
Display(s) Philips PHL BDM3270 + Acer XV242Y
Case Lian Li O11 Dynamic EVO
Audio Device(s) Sound Blaster ZxR
Power Supply Fractal Design Newton R3 1000W
Mouse Razer Basilisk
Keyboard Razer BlackWidow V3 - Yellow Switch
Software FEDORA 40
Joined
Aug 31, 2021
Messages
21 (0.02/day)
Processor Ryzen 9 5900X
Motherboard Gigabyte Aorus B550i pro ax
Cooling Noctua NH-D15 chromax.black (with original fans)
Memory G.skill 32GB 3200MHz
Video Card(s) 4060ti 16GB
Storage 1TB Samsung PM9A1, 256GB Toshiba pcie3 (from a laptop), 512GB crucial MX500, 2x 1TB Toshiba HDD 2.5
Display(s) Mateview GT 34''
Case Thermaltake the tower 100, 1 noctua NF-A14 ippc3000 on top, 2 x Arctic F14
Power Supply Seasonic focus-GX 750W
Software Windows 10, Ubuntu when needed.
I don't like this new "AI" terminology in compute. As written here, >>The first one is to embrace lower precision and use approximate computing to drop from 32-bit formats to some odd-bit structures that hold a quarter as much precision and still deliver the same result.<<

So, instead of FP32, we are going to INT8. And >>still deliver the same result<< is just untrue, because there are only a few, very specific simulation/calculation scenarios where you would get the same results. You might get nearly the same results most of the time, but never the same all the time.

Crude, savage, approximate fast-math has its applications for certain jobs. But let's not pretend it is suitable for all "AI" computational tasks, nor that it will "deliver the same result."

The error set of fast INT8 math vs. slow FP32/64/80/128 math is like a fractal, or a Mandelbrot set. It is "beautiful" in its unpredictable deviation, and in some boundary areas that deviation can be ENORMOUS.
I agree with this, but you cannot demand that TPU explain everything related to the pros and cons of quantizing neural networks; there is enough theory there for an entire degree course. Just saying that, in a classical NN, the space you can explore is still incredibly large even with INT8 (and INT4 as well) gives a glimpse of why such reduced precision is applicable.
 
Joined
Jan 3, 2021
Messages
2,710 (2.22/day)
Location
Slovenia
Processor i5-6600K
Motherboard Asus Z170A
Cooling some cheap Cooler Master Hyper 103 or similar
Memory 16GB DDR4-2400
Video Card(s) IGP
Storage Samsung 850 EVO 250GB
Display(s) 2x Oldell 24" 1920x1200
Case Bitfenix Nova white windowless non-mesh
Audio Device(s) E-mu 1212m PCI
Power Supply Seasonic G-360
Mouse Logitech Marble trackball, never had a mouse
Keyboard Key Tronic KT2000, no Win key because 1994
Software Oldwin
I don't like this new "AI" terminology in compute. As written here, >>The first one is to embrace lower precision and use approximate computing to drop from 32-bit formats to some odd-bit structures that hold a quarter as much precision and still deliver the same result.<<

So, instead of FP32, we are going to INT8. And >>still deliver the same result<< is just untrue, because there are only a few, very specific simulation/calculation scenarios where you would get the same results. You might get nearly the same results most of the time, but never the same all the time.

Crude, savage, approximate fast-math has its applications for certain jobs. But let's not pretend it is suitable for all "AI" computational tasks, nor that it will "deliver the same result."

The error set of fast INT8 math vs. slow FP32/64/80/128 math is like a fractal, or a Mandelbrot set. It is "beautiful" in its unpredictable deviation, and in some boundary areas that deviation can be ENORMOUS.
It's not just INT8; you also have FP8 (in several variants, maybe unsigned too), INT4 and more. All are usable as long as the input data and intermediate results are noisy/unreliable/biased enough, but sure, they can't replace 16- and 32-bit formats everywhere.

I'm also curious why it always has to be powers of 2. Formats like INT12 or FP12 would be usable in some cases too.

Edit: Nvidia claims that FP8 can replace FP16 with no loss of accuracy.
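As a hedged illustration of the non-power-of-two point (INT12 here is a hypothetical format, not something this chip is known to support): a symmetric quantizer only needs the bit width as a parameter, so the arithmetic works for any width; the obstacles are storage and alignment rather than math.

```python
import numpy as np

def quantize_intn(x: np.ndarray, bits: int) -> np.ndarray:
    """Symmetric quantize/dequantize at an arbitrary signed integer width (sketch only)."""
    qmax = 2 ** (bits - 1) - 1               # 7 for INT4, 127 for INT8, 2047 for INT12
    scale = float(np.abs(x).max()) / qmax
    q = np.clip(np.round(x / scale), -qmax, qmax)
    return (q * scale).astype(np.float32)    # dequantized approximation

x = np.random.randn(10_000).astype(np.float32)
for bits in (4, 8, 12, 16):
    err = float(np.abs(x - quantize_intn(x, bits)).max())
    print(f"INT{bits:<2}: worst-case error {err:.6f}")
```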
 
Joined
Jan 8, 2017
Messages
8,990 (3.36/day)
System Name Good enough
Processor AMD Ryzen R9 7900 - Alphacool Eisblock XPX Aurora Edge
Motherboard ASRock B650 Pro RS
Cooling 2x 360mm NexXxoS ST30 X-Flow, 1x 360mm NexXxoS ST30, 1x 240mm NexXxoS ST30
Memory 32GB - FURY Beast RGB 5600 Mhz
Video Card(s) Sapphire RX 7900 XT - Alphacool Eisblock Aurora
Storage 1x Kingston KC3000 1TB 1x Kingston A2000 1TB, 1x Samsung 850 EVO 250GB , 1x Samsung 860 EVO 500GB
Display(s) LG UltraGear 32GN650-B + 4K Samsung TV
Case Phanteks NV7
Power Supply GPS-750C
Doesn't sound that impressive to me, when you consider that the fingernail-sized A16 has 16 billion.
It isn't, not because of the size but because of the type of chip this is. ML chips are really not that complicated to build: you just fill a chip with as many ALUs as you can until you run into memory bandwidth constraints, and then you stop. They don't need much in the way of complex schedulers, branch prediction, or any of that kind of stuff; they're meant to compute dot products plus a couple of other basic operations, and that's about it.
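As a minimal sketch of what "mostly dot products" means (illustrative NumPy, not how any particular accelerator is programmed), a fully connected inference layer is one dot product per output neuron plus a cheap element-wise activation:

```python
import numpy as np

def dense_layer(x: np.ndarray, W: np.ndarray, b: np.ndarray) -> np.ndarray:
    """One inference layer: each output is a dot product of x with a row of W, plus a bias."""
    return np.maximum(W @ x + b, 0.0)                  # ReLU activation

x = np.random.randn(512).astype(np.float32)            # input activations
W = np.random.randn(256, 512).astype(np.float32)       # 256 output neurons = 256 dot products
b = np.zeros(256, dtype=np.float32)
y = dense_layer(x, W, b)                                # shape (256,)
```

Stacking layers like this is essentially all an accelerator has to run for this class of model; most of the remaining difficulty is feeding the ALUs with data fast enough.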

Silicon Valley is rife with startups that are making more or less the same kind of chips.

I'm also curious why it always has to be powers of 2. Formats like INT12 or FP12 would be usable in some cases too.

It's not that it has to be a power of 2; it has to be a multiple of 8 bits. It's very unusual and problematic to build processors that work on data that isn't a multiple of a byte. In every computing system, 1 byte is the basic unit of storage/processing, and anything that doesn't match that is problematic to integrate. INT4 still works fine because two INT4 values fit in one byte. Intel's original x87 extended floating-point format was 80 bits, which wasn't a power of 2 but was a multiple of 8.
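A small sketch of that last point (illustrative Python, not any library's actual API): two signed INT4 values can be packed into the low and high nibbles of a single byte and unpacked again, which is why INT4 coexists cleanly with byte-addressed memory:

```python
def pack_int4_pair(lo: int, hi: int) -> int:
    """Pack two signed 4-bit values (-8..7) into one byte."""
    assert -8 <= lo <= 7 and -8 <= hi <= 7
    return ((hi & 0xF) << 4) | (lo & 0xF)

def unpack_int4_pair(byte: int) -> tuple[int, int]:
    """Recover the two signed 4-bit values from one byte."""
    def sign_extend(n: int) -> int:
        return n - 16 if n >= 8 else n
    return sign_extend(byte & 0xF), sign_extend(byte >> 4)

b = pack_int4_pair(-3, 5)
print(unpack_int4_pair(b))   # (-3, 5)
```

An INT12 value, by contrast, would straddle byte boundaries, which is one reason such widths rarely show up in hardware formats.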
 
Joined
Dec 26, 2006
Messages
3,550 (0.56/day)
Location
Northern Ontario Canada
Processor Ryzen 5700x
Motherboard Gigabyte X570S Aero G R1.1 BiosF5g
Cooling Noctua NH-C12P SE14 w/ NF-A15 HS-PWM Fan 1500rpm
Memory Micron DDR4-3200 2x32GB D.S. D.R. (CT2K32G4DFD832A)
Video Card(s) AMD RX 6800 - Asus Tuf
Storage Kingston KC3000 1TB & 2TB & 4TB Corsair LPX
Display(s) LG 27UL550-W (27" 4k)
Case Be Quiet Pure Base 600 (no window)
Audio Device(s) Realtek ALC1220-VB
Power Supply SuperFlower Leadex V Gold Pro 850W ATX Ver2.52
Mouse Mionix Naos Pro
Keyboard Corsair Strafe with browns
Software W10 22H2 Pro x64
About 3 times as many transistors as there are people on Earth. How much intelligence does this give it? ;)
 
Joined
Jan 3, 2021
Messages
2,710 (2.22/day)
Location
Slovenia
Processor i5-6600K
Motherboard Asus Z170A
Cooling some cheap Cooler Master Hyper 103 or similar
Memory 16GB DDR4-2400
Video Card(s) IGP
Storage Samsung 850 EVO 250GB
Display(s) 2x Oldell 24" 1920x1200
Case Bitfenix Nova white windowless non-mesh
Audio Device(s) E-mu 1212m PCI
Power Supply Seasonic G-360
Mouse Logitech Marble trackball, never had a mouse
Keyboard Key Tronic KT2000, no Win key because 1994
Software Oldwin
Joined
Aug 31, 2021
Messages
21 (0.02/day)
Processor Ryzen 9 5900X
Motherboard Gigabyte Aorus B550i pro ax
Cooling Noctua NH-D15 chromax.black (with original fans)
Memory G.skill 32GB 3200MHz
Video Card(s) 4060ti 16GB
Storage 1TB Samsung PM9A1, 256GB Toshiba pcie3 (from a laptop), 512GB crucial MX500, 2x 1TB Toshiba HDD 2.5
Display(s) Mateview GT 34''
Case Thermaltake the tower 100, 1 noctua NF-A14 ippc3000 on top, 2 x Arctic F14
Power Supply Seasonic focus-GX 750W
Software Windows 10, Ubuntu when needed.
It isn't, not because of the size but because of the type of chip this is. ML chips are really not that complicated to build: you just fill a chip with as many ALUs as you can until you run into memory bandwidth constraints, and then you stop. They don't need much in the way of complex schedulers, branch prediction, or any of that kind of stuff; they're meant to compute dot products plus a couple of other basic operations, and that's about it.
BTW, that could also be said of a GPU, but in the end I don't see anyone saying that's a simple architecture to design.
 
Joined
Jun 12, 2017
Messages
136 (0.05/day)
BTW, that could also be said of a GPU, but in the end I don't see anyone saying that's a simple architecture to design.
He practically described how to design a "usable" ML inference chip. Being usable and being bleeding-edge are different things, and the GPU market has already matured to the point where only the top player survives.
 
Joined
Jan 3, 2021
Messages
2,710 (2.22/day)
Location
Slovenia
Processor i5-6600K
Motherboard Asus Z170A
Cooling some cheap Cooler Master Hyper 103 or similar
Memory 16GB DDR4-2400
Video Card(s) IGP
Storage Samsung 850 EVO 250GB
Display(s) 2x Oldell 24" 1920x1200
Case Bitfenix Nova white windowless non-mesh
Audio Device(s) E-mu 1212m PCI
Power Supply Seasonic G-360
Mouse Logitech Marble trackball, never had a mouse
Keyboard Key Tronic KT2000, no Win key because 1994
Software Oldwin
It isn't, not because of the size but because of the type of chip this is. ML chips are really not that complicated to build: you just fill a chip with as many ALUs as you can until you run into memory bandwidth constraints, and then you stop. They don't need much in the way of complex schedulers, branch prediction, or any of that kind of stuff; they're meant to compute dot products plus a couple of other basic operations, and that's about it.
What ML chips need is an advanced, flexible/programmable, high-bandwidth internal network. I don't know what sort of execution units they have; I'm just guessing that they don't even execute user code like CPUs and GPUs do, and instead can be parameterized to define vector lengths, precision, data paths and so on.
 

AsRock

TPU addict
Joined
Jun 23, 2007
Messages
18,887 (3.07/day)
Location
UK\USA
Processor AMD 3900X \ AMD 7700X
Motherboard ASRock AM4 X570 Pro 4 \ ASUS X670Xe TUF
Cooling D15
Memory Patriot 2x16GB PVS432G320C6K \ G.Skill Flare X5 F5-6000J3238F 2x16GB
Video Card(s) eVga GTX1060 SSC \ XFX RX 6950XT RX-695XATBD9
Storage Sammy 860, MX500, Sabrent Rocket 4 Sammy Evo 980 \ 1xSabrent Rocket 4+, Sammy 2x990 Pro
Display(s) Samsung 1080P \ LG 43UN700
Case Fractal Design Pop Air 2x140mm fans from Torrent \ Fractal Design Torrent 2 SilverStone FHP141x2
Audio Device(s) Yamaha RX-V677 \ Yamaha CX-830+Yamaha MX-630 \Paradigm 7se MKII, Paradigm 5SE MK1 , Blue Yeti
Power Supply Seasonic Prime TX-750 \ Corsair RM1000X Shift
Mouse Steelseries Sensei wireless \ Steelseries Sensei wireless
Keyboard Logitech K120 \ Wooting Two HE
Benchmark Scores Meh benchmarks.
I don't like this new "AI" terminology in compute. As written here, >>The first one is to embrace lower precision and use approximate computing to drop from 32-bit formats to some odd-bit structures that hold a quarter as much precision and still deliver the same result.<<

So, instead of FP32 bit, we are going to INT8. And >>still deliver the same result.<< is just untrue. Because there are only a very few and specific number of simulation/calculation scenarios where you would get the same results. You might get nearly the same, most of the time, but never always the same all the time.

Crude savage approximate fast-math has it's applications for certain jobs. But let's not pretend this is suitable for all "AI" computational tasks, nor that it will "deliver the same result."

The Error Set of fast INT8 math vs. slow FP32/64/80/128 math is like a fractal - or mandelbrot set. It is "beautiful" in it's unpredictable deviation, and in some boundary areas, that deviation can be ENORMOUS.

Hehe, maybe it works like their desktop CPUs back in the day (chug chug, hot hot!)..

Either way, I'm not interested in anything AI-based, as you know it will get into the wrong hands; it's bad enough with what's available now.
 

Everton0x10

New Member
Joined
Oct 30, 2022
Messages
1 (0.00/day)
Hehe, maybe it works like their desktop CPUs back in the day (chug chug, hot hot!)..

Either way, I'm not interested in anything AI-based, as you know it will get into the wrong hands; it's bad enough with what's available now.
Hi, what exactly did you mean by "it's bad enough with what's available now"?

What bad thing is going on that I don't know about?

I am just starting out in this area of AI.
 