
IBM Artificial Intelligence Unit (AIU) Arrives with 23 Billion Transistors

AleksandarK

News Editor
Staff member
Joined
Aug 19, 2017
Messages
2,241 (0.91/day)
IBM Research has published information about the company's latest processor for accelerating Artificial Intelligence (AI). The latest IBM processor, called the Artificial Intelligence Unit (AIU), tackles the problem of creating an enterprise solution for AI deployment that fits in a PCIe slot. The IBM AIU is a half-height PCIe card carrying a processor with 23 billion transistors manufactured on a 5 nm node (presumably TSMC's). While IBM has not provided many details yet, we know that the AIU builds on the AI accelerator found in the Telum chip, the processor at the heart of the IBM z16 mainframe. The AIU takes Telum's AI engine and scales it up to 32 cores to achieve high efficiency.

The company has highlighted two main paths for enterprise AI adoption. The first is to embrace lower precision and use approximate computing to drop from 32-bit formats to smaller formats that hold a quarter as much precision yet still deliver similar results. The other, as IBM puts it, is that an "AI chip should be laid out to streamline AI workflows. Because most AI calculations involve matrix and vector multiplication, our chip architecture features a simpler layout than a multi-purpose CPU. The IBM AIU has also been designed to send data directly from one compute engine to the next, creating enormous energy savings."
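To put the lower-precision idea in concrete terms, here is a minimal sketch of symmetric INT8 quantization, assuming NumPy and purely illustrative names. It is not IBM's actual scheme, just the general technique of mapping FP32 values onto an 8-bit grid and back:

```python
import numpy as np

def quantize_int8(x: np.ndarray):
    """Map FP32 values onto a signed 8-bit grid with one per-tensor scale (illustrative)."""
    scale = float(np.abs(x).max()) / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Map the INT8 codes back to approximate FP32 values."""
    return q.astype(np.float32) * scale

weights = np.random.randn(1024).astype(np.float32)
q, s = quantize_int8(weights)
restored = dequantize(q, s)
print("worst-case rounding error:", float(np.abs(weights - restored).max()))  # small, not zero
```

The storage cost drops to a quarter of FP32, and hardware can do the multiply-accumulates on cheap integer units, which is where the density and energy savings come from.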




In the sea of AI accelerators, IBM hopes to differentiate its offering with an enterprise chip aimed at more complex problems than current AI chips target. "Deploying AI to classify cats and dogs in photos is a fun academic exercise. But it won't solve the pressing problems we face today. For AI to tackle the complexities of the real world—things like predicting the next Hurricane Ian, or whether we're heading into a recession—we need enterprise-quality, industrial-scale hardware. Our AIU takes us one step closer. We hope to soon share news about its release," says the official IBM release.

View at TechPowerUp Main Site | Source
 
Joined
Apr 16, 2013
Messages
535 (0.13/day)
Location
Bulgaria
System Name Black Knight | White Queen
Processor Intel Core i9-10940X | Intel Core i7-5775C
Motherboard ASUS ROG Rampage VI Extreme Encore X299G | ASUS Sabertooth Z97 Mark S (White)
Cooling Noctua NH-D15 chromax.black | Xigmatek Dark Knight SD-1283 Night Hawk (White)
Memory G.SKILL Trident Z RGB 4x8GB DDR4 3600MHz CL16 | Corsair Vengeance LP 4x4GB DDR3L 1600MHz CL9 (White)
Video Card(s) ASUS ROG Strix GeForce RTX 4090 OC | KFA2/Galax GeForce GTX 1080 Ti Hall of Fame Edition
Storage Samsung 990 Pro 2TB, 980 Pro 1TB, 850 Pro 256GB, 840 Pro 256GB, WD 10TB+ (incl. VelociRaptors)
Display(s) Dell Alienware AW2721D 240Hz, ASUS ROG Strix XG279Q 170Hz, ASUS PA246Q 60Hz| Samsung JU7500 48'' TV
Case Corsair 7000D AIRFLOW (Black) | NZXT ??? w/ ASUS DRW-24B1ST
Audio Device(s) ASUS Xonar Essence STX | Realtek ALC1150
Power Supply Enermax Revolution 1250W 85+ | Super Flower Leadex Gold 650W (White)
Mouse Razer Basilisk Ultimate, Razer Naga Trinity | Razer Mamba 16000
Keyboard Razer Blackwidow Chroma V2 (Orange switch) | Razer Ornata Chroma
Software Windows 10 Pro 64bit
Doesn't sound that impressive to me, when you consider that the fingernail-sized A16 has 16 billion.
 
Joined
Feb 27, 2013
Messages
445 (0.11/day)
Location
Lithuania
Doesn't sound that impressive to me, when you consider that the fingernail-sized A16 has 16 billion.
This is probably a similar size; it just has a heat spreader because they can push the chip instead of power-limiting it.
 
Joined
Aug 30, 2006
Messages
7,198 (1.11/day)
System Name ICE-QUAD // ICE-CRUNCH
Processor Q6600 // 2x Xeon 5472
Memory 2GB DDR // 8GB FB-DIMM
Video Card(s) HD3850-AGP // FireGL 3400
Display(s) 2 x Samsung 204Ts = 3200x1200
Audio Device(s) Audigy 2
Software Windows Server 2003 R2 as a Workstation now migrated to W10 with regrets.
I don't like this new "AI" terminology in compute. As written here, >>The first one is to embrace lower precision and use approximate computing to drop from 32-bit formats to some odd-bit structures that hold a quarter as much precision and still deliver the same result.<<

So, instead of FP32, we are going to INT8. And >>still deliver the same result<< is just untrue, because there are only a few, very specific simulation/calculation scenarios where you would get the same results. You might get nearly the same results most of the time, but never the same all the time.

Crude, savage, approximate fast-math has its applications for certain jobs. But let's not pretend it is suitable for all "AI" computational tasks, nor that it will "deliver the same result."

The error set of fast INT8 math vs. slow FP32/64/80/128 math is like a fractal, or a Mandelbrot set. It is "beautiful" in its unpredictable deviation, and in some boundary areas that deviation can be ENORMOUS.
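For illustration only (a sketch of the general technique, not of how the AIU or any other specific chip computes), the following compares an FP64 reference dot product against the same computation on INT8-quantized inputs; the deviation is data-dependent, which is the point about unpredictability:

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.standard_normal(4096).astype(np.float32)
b = rng.standard_normal(4096).astype(np.float32)

def to_int8(x):
    """Symmetric quantization to INT8 codes plus a scale factor (sketch only)."""
    scale = float(np.abs(x).max()) / 127.0
    return np.clip(np.round(x / scale), -127, 127).astype(np.int32), scale

qa, sa = to_int8(a)
qb, sb = to_int8(b)

exact = float(a.astype(np.float64) @ b.astype(np.float64))   # FP64 reference
approx = int(qa @ qb) * sa * sb                               # integer accumulate, then rescale
print(f"exact={exact:.4f}  int8 approx={approx:.4f}  abs error={abs(approx - exact):.4f}")
```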
 
Joined
Nov 18, 2010
Messages
7,140 (1.45/day)
Location
Rīga, Latvia
System Name HELLSTAR
Processor AMD RYZEN 9 5950X
Motherboard ASUS Strix X570-E
Cooling 2x 360 + 280 rads. 3x Gentle Typhoons, 3x Phanteks T30, 2x TT T140 . EK-Quantum Momentum Monoblock.
Memory 4x8GB G.SKILL Trident Z RGB F4-4133C19D-16GTZR 14-16-12-30-44
Video Card(s) Sapphire Pulse RX 7900XTX + under waterblock through Kryosheet
Storage Optane 900P[W11] + WD BLACK SN850X 4TB + 750 EVO 500GB + 1TB 980PRO[FEDORA]
Display(s) Philips PHL BDM3270 + Acer XV242Y
Case Lian Li O11 Dynamic EVO
Audio Device(s) Sound Blaster ZxR
Power Supply Fractal Design Newton R3 1000W
Mouse Razer Basilisk
Keyboard Razer BlackWidow V3 - Yellow Switch
Software FEDORA 40
Joined
Aug 31, 2021
Messages
21 (0.02/day)
Processor Ryzen 9 5900X
Motherboard Gigabyte Aorus B550i pro ax
Cooling Noctua NH-D15 chromax.black (with original fans)
Memory G.skill 32GB 3200MHz
Video Card(s) 4060ti 16GB
Storage 1TB Samsung PM9A1, 256GB Toshiba pcie3 (from a laptop), 512GB crucial MX500, 2x 1TB Toshiba HDD 2.5
Display(s) Mateview GT 34''
Case Thermaltake the tower 100, 1 noctua NF-A14 ippc3000 on top, 2 x Arctic F14
Power Supply Seasonic focus-GX 750W
Software Windows 10, Ubuntu when needed.
I don't like this new "AI" terminology in compute. As written here, >>The first one is to embrace lower precision and use approximate computing to drop from 32-bit formats to some odd-bit structures that hold a quarter as much precision and still deliver the same result.<<

So, instead of FP32, we are going to INT8. And >>still deliver the same result<< is just untrue, because there are only a few, very specific simulation/calculation scenarios where you would get the same results. You might get nearly the same results most of the time, but never the same all the time.

Crude, savage, approximate fast-math has its applications for certain jobs. But let's not pretend it is suitable for all "AI" computational tasks, nor that it will "deliver the same result."

The error set of fast INT8 math vs. slow FP32/64/80/128 math is like a fractal, or a Mandelbrot set. It is "beautiful" in its unpredictable deviation, and in some boundary areas that deviation can be ENORMOUS.
I agree with this, but you cannot demand that TPU explain everything related to the pros and cons of quantizing neural networks; there is enough theory there for an entire degree course. Just saying that, in a classical NN, the space you can explore is still incredibly large even with INT8 (and INT4 as well) gives a glimpse of why such reduced precision is applicable.
 
Joined
Jan 3, 2021
Messages
2,710 (2.22/day)
Location
Slovenia
Processor i5-6600K
Motherboard Asus Z170A
Cooling some cheap Cooler Master Hyper 103 or similar
Memory 16GB DDR4-2400
Video Card(s) IGP
Storage Samsung 850 EVO 250GB
Display(s) 2x Oldell 24" 1920x1200
Case Bitfenix Nova white windowless non-mesh
Audio Device(s) E-mu 1212m PCI
Power Supply Seasonic G-360
Mouse Logitech Marble trackball, never had a mouse
Keyboard Key Tronic KT2000, no Win key because 1994
Software Oldwin
I don't like this new "AI" terminology in compute. As written here, >>The first one is to embrace lower precision and use approximate computing to drop from 32-bit formats to some odd-bit structures that hold a quarter as much precision and still deliver the same result.<<

So, instead of FP32, we are going to INT8. And >>still deliver the same result<< is just untrue, because there are only a few, very specific simulation/calculation scenarios where you would get the same results. You might get nearly the same results most of the time, but never the same all the time.

Crude, savage, approximate fast-math has its applications for certain jobs. But let's not pretend it is suitable for all "AI" computational tasks, nor that it will "deliver the same result."

The error set of fast INT8 math vs. slow FP32/64/80/128 math is like a fractal, or a Mandelbrot set. It is "beautiful" in its unpredictable deviation, and in some boundary areas that deviation can be ENORMOUS.
It's not just INT8; you also have FP8 (in several variants, maybe unsigned too), INT4 and more. All are usable as long as the input data and intermediate results are noisy/unreliable/biased enough, but sure, they can't replace 16- and 32-bit formats everywhere.

I'm also curious why it always has to be powers of 2. Formats like INT12 or FP12 would be usable in some cases too.

Edit: Nvidia claims that FP8 can replace FP16 with no loss of accuracy.
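As a hedged illustration of the non-power-of-two point (INT12 here is a hypothetical format, not something this chip is known to support): a symmetric quantizer only needs the bit width as a parameter, so the arithmetic works for any width; the obstacles are storage and alignment rather than math.

```python
import numpy as np

def quantize_intn(x: np.ndarray, bits: int) -> np.ndarray:
    """Symmetric quantize/dequantize at an arbitrary signed integer width (sketch only)."""
    qmax = 2 ** (bits - 1) - 1               # 7 for INT4, 127 for INT8, 2047 for INT12
    scale = float(np.abs(x).max()) / qmax
    q = np.clip(np.round(x / scale), -qmax, qmax)
    return (q * scale).astype(np.float32)    # dequantized approximation

x = np.random.randn(10_000).astype(np.float32)
for bits in (4, 8, 12, 16):
    err = float(np.abs(x - quantize_intn(x, bits)).max())
    print(f"INT{bits:<2}: worst-case error {err:.6f}")
```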
 
Joined
Jan 8, 2017
Messages
8,990 (3.36/day)
System Name Good enough
Processor AMD Ryzen R9 7900 - Alphacool Eisblock XPX Aurora Edge
Motherboard ASRock B650 Pro RS
Cooling 2x 360mm NexXxoS ST30 X-Flow, 1x 360mm NexXxoS ST30, 1x 240mm NexXxoS ST30
Memory 32GB - FURY Beast RGB 5600 Mhz
Video Card(s) Sapphire RX 7900 XT - Alphacool Eisblock Aurora
Storage 1x Kingston KC3000 1TB 1x Kingston A2000 1TB, 1x Samsung 850 EVO 250GB , 1x Samsung 860 EVO 500GB
Display(s) LG UltraGear 32GN650-B + 4K Samsung TV
Case Phanteks NV7
Power Supply GPS-750C
Doesn't sound that impressive to me, when you consider that the fingernail-sized A16 has 16 billion.
It isn't, not because of the size but because of the type of chip this is. ML chips are really not that complicated to build: you just fill a chip with as many ALUs as you can until you run into memory bandwidth constraints, and then you stop. They don't need much in the way of complex schedulers, branch prediction, or any of that kind of stuff; they're meant to compute dot products plus a couple of other basic operations, and that's about it.
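As a minimal sketch of what "mostly dot products" means (illustrative NumPy, not how any particular accelerator is programmed), a fully connected inference layer is one dot product per output neuron plus a cheap element-wise activation:

```python
import numpy as np

def dense_layer(x: np.ndarray, W: np.ndarray, b: np.ndarray) -> np.ndarray:
    """One inference layer: each output is a dot product of x with a row of W, plus a bias."""
    return np.maximum(W @ x + b, 0.0)                  # ReLU activation

x = np.random.randn(512).astype(np.float32)            # input activations
W = np.random.randn(256, 512).astype(np.float32)       # 256 output neurons = 256 dot products
b = np.zeros(256, dtype=np.float32)
y = dense_layer(x, W, b)                                # shape (256,)
```

Stacking layers like this is essentially all an accelerator has to run for this class of model; most of the remaining difficulty is feeding the ALUs with data fast enough.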

Silicon Valley is rife with startups that are making more or less the same kind of chips.

I'm also curious why it always has to be powers of 2. Formats like INT12 or FP12 would be usable in some cases too.

It's not that it has to be a power of 2; it has to be a multiple of 8 bits. It's very unusual and problematic to build processors that work on data that isn't a multiple of a byte. In every computing system, 1 byte is the basic unit of storage/processing, and anything that doesn't match that is problematic to integrate. INT4 still works fine because two INT4 values fit in one byte. Intel's original x87 extended floating-point format was 80 bits, which wasn't a power of 2 but was a multiple of 8.
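A small sketch of that last point (illustrative Python, not any library's actual API): two signed INT4 values can be packed into the low and high nibbles of a single byte and unpacked again, which is why INT4 coexists cleanly with byte-addressed memory:

```python
def pack_int4_pair(lo: int, hi: int) -> int:
    """Pack two signed 4-bit values (-8..7) into one byte."""
    assert -8 <= lo <= 7 and -8 <= hi <= 7
    return ((hi & 0xF) << 4) | (lo & 0xF)

def unpack_int4_pair(byte: int) -> tuple[int, int]:
    """Recover the two signed 4-bit values from one byte."""
    def sign_extend(n: int) -> int:
        return n - 16 if n >= 8 else n
    return sign_extend(byte & 0xF), sign_extend(byte >> 4)

b = pack_int4_pair(-3, 5)
print(unpack_int4_pair(b))   # (-3, 5)
```

An INT12 value, by contrast, would straddle byte boundaries, which is one reason such widths rarely show up in hardware formats.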
 
Joined
Dec 26, 2006
Messages
3,550 (0.56/day)
Location
Northern Ontario Canada
Processor Ryzen 5700x
Motherboard Gigabyte X570S Aero G R1.1 BiosF5g
Cooling Noctua NH-C12P SE14 w/ NF-A15 HS-PWM Fan 1500rpm
Memory Micron DDR4-3200 2x32GB D.S. D.R. (CT2K32G4DFD832A)
Video Card(s) AMD RX 6800 - Asus Tuf
Storage Kingston KC3000 1TB & 2TB & 4TB Corsair LPX
Display(s) LG 27UL550-W (27" 4k)
Case Be Quiet Pure Base 600 (no window)
Audio Device(s) Realtek ALC1220-VB
Power Supply SuperFlower Leadex V Gold Pro 850W ATX Ver2.52
Mouse Mionix Naos Pro
Keyboard Corsair Strafe with browns
Software W10 22H2 Pro x64
About 3 times as many transistors as there are people on Earth. How much intelligence does this give it? ;)
 
Joined
Jan 3, 2021
Messages
2,710 (2.22/day)
Location
Slovenia
Processor i5-6600K
Motherboard Asus Z170A
Cooling some cheap Cooler Master Hyper 103 or similar
Memory 16GB DDR4-2400
Video Card(s) IGP
Storage Samsung 850 EVO 250GB
Display(s) 2x Oldell 24" 1920x1200
Case Bitfenix Nova white windowless non-mesh
Audio Device(s) E-mu 1212m PCI
Power Supply Seasonic G-360
Mouse Logitech Marble trackball, never had a mouse
Keyboard Key Tronic KT2000, no Win key because 1994
Software Oldwin
Joined
Aug 31, 2021
Messages
21 (0.02/day)
Processor Ryzen 9 5900X
Motherboard Gigabyte Aorus B550i pro ax
Cooling Noctua NH-D15 chromax.black (with original fans)
Memory G.skill 32GB 3200MHz
Video Card(s) 4060ti 16GB
Storage 1TB Samsung PM9A1, 256GB Toshiba pcie3 (from a laptop), 512GB crucial MX500, 2x 1TB Toshiba HDD 2.5
Display(s) Mateview GT 34''
Case Thermaltake the tower 100, 1 noctua NF-A14 ippc3000 on top, 2 x Arctic F14
Power Supply Seasonic focus-GX 750W
Software Windows 10, Ubuntu when needed.
It isn't, not because of the size but because of the type of chip this is. ML chips are really not that complicated to build: you just fill a chip with as many ALUs as you can until you run into memory bandwidth constraints, and then you stop. They don't need much in the way of complex schedulers, branch prediction, or any of that kind of stuff; they're meant to compute dot products plus a couple of other basic operations, and that's about it.
BTW, that could also be said of a GPU, but in the end I don't see anyone saying that's a simple architecture to design.
 
Joined
Jun 12, 2017
Messages
136 (0.05/day)
BTW, that could also be said of a GPU, but in the end I don't see anyone saying that's a simple architecture to design.
He practically described how to design a "usable" ML inference chip. Being usable and being bleeding-edge are different things, and the GPU market has already matured to the point where only the top player survives.
 
Joined
Jan 3, 2021
Messages
2,710 (2.22/day)
Location
Slovenia
Processor i5-6600K
Motherboard Asus Z170A
Cooling some cheap Cooler Master Hyper 103 or similar
Memory 16GB DDR4-2400
Video Card(s) IGP
Storage Samsung 850 EVO 250GB
Display(s) 2x Oldell 24" 1920x1200
Case Bitfenix Nova white windowless non-mesh
Audio Device(s) E-mu 1212m PCI
Power Supply Seasonic G-360
Mouse Logitech Marble trackball, never had a mouse
Keyboard Key Tronic KT2000, no Win key because 1994
Software Oldwin
It isn't, not because of the size but because of the type of chip this is. ML chips are really not that complicated to build: you just fill a chip with as many ALUs as you can until you run into memory bandwidth constraints, and then you stop. They don't need much in the way of complex schedulers, branch prediction, or any of that kind of stuff; they're meant to compute dot products plus a couple of other basic operations, and that's about it.
What ML chips need is an advanced, flexible/programmable, high-bandwidth internal network. I don't know what sort of execution units they have; I'm just guessing that they don't even execute user code like CPUs and GPUs do, and instead can be parameterized to define vector lengths, precision, data paths and so on.
 

AsRock

TPU addict
Joined
Jun 23, 2007
Messages
18,887 (3.07/day)
Location
UK\USA
Processor AMD 3900X \ AMD 7700X
Motherboard ASRock AM4 X570 Pro 4 \ ASUS X670Xe TUF
Cooling D15
Memory Patriot 2x16GB PVS432G320C6K \ G.Skill Flare X5 F5-6000J3238F 2x16GB
Video Card(s) eVga GTX1060 SSC \ XFX RX 6950XT RX-695XATBD9
Storage Sammy 860, MX500, Sabrent Rocket 4 Sammy Evo 980 \ 1xSabrent Rocket 4+, Sammy 2x990 Pro
Display(s) Samsung 1080P \ LG 43UN700
Case Fractal Design Pop Air 2x140mm fans from Torrent \ Fractal Design Torrent 2 SilverStone FHP141x2
Audio Device(s) Yamaha RX-V677 \ Yamaha CX-830+Yamaha MX-630 \Paradigm 7se MKII, Paradigm 5SE MK1 , Blue Yeti
Power Supply Seasonic Prime TX-750 \ Corsair RM1000X Shift
Mouse Steelseries Sensei wireless \ Steelseries Sensei wireless
Keyboard Logitech K120 \ Wooting Two HE
Benchmark Scores Meh benchmarks.
I don't like this new "AI" terminology in compute. As written here, >>The first one is to embrace lower precision and use approximate computing to drop from 32-bit formats to some odd-bit structures that hold a quarter as much precision and still deliver the same result.<<

So, instead of FP32 bit, we are going to INT8. And >>still deliver the same result.<< is just untrue. Because there are only a very few and specific number of simulation/calculation scenarios where you would get the same results. You might get nearly the same, most of the time, but never always the same all the time.

Crude savage approximate fast-math has it's applications for certain jobs. But let's not pretend this is suitable for all "AI" computational tasks, nor that it will "deliver the same result."

The Error Set of fast INT8 math vs. slow FP32/64/80/128 math is like a fractal - or mandelbrot set. It is "beautiful" in it's unpredictable deviation, and in some boundary areas, that deviation can be ENORMOUS.

Hehe, maybe it works like their desktop CPUs back in the day (chug chug, hot hot!)..

Either way, I'm not interested in anything AI-based, as you know it will get into the wrong hands; it's bad enough with what's available now.
 

Everton0x10

New Member
Joined
Oct 30, 2022
Messages
1 (0.00/day)
Hehe, maybe it works like their desktop CPUs back in the day (chug chug, hot hot!)..

Either way, I'm not interested in anything AI-based, as you know it will get into the wrong hands; it's bad enough with what's available now.
Hi, what exactly did you mean by "it's bad enough with what's available now"?

What bad thing is going on that I don't know about?

I am just starting out in this area of AI.
 