
AMD, Arm, Intel, Meta, Microsoft, NVIDIA, and Qualcomm Standardize Next-Generation Narrow Precision Data Formats for AI

AleksandarK

News Editor
Staff member
Joined
Aug 19, 2017
Messages
2,260 (0.92/day)
Realizing the full potential of next-generation deep learning requires highly efficient AI infrastructure. For a computing platform to be scalable and cost efficient, optimizing every layer of the AI stack, from algorithms to hardware, is essential. Advances in narrow-precision AI data formats and associated optimized algorithms have been pivotal to this journey, allowing the industry to transition from traditional 32-bit floating point precision to presently only 8 bits of precision (i.e. OCP FP8).

Narrower formats allow silicon to execute more AI calculations per clock cycle, which accelerates model training and inference times. AI models take up less space, which means they require fewer data fetches from memory, and can run with better performance and efficiency. Additionally, fewer bit transfers reduce data movement over the interconnect, which can enhance application performance or cut network costs.





Bringing Together Key Industry Leaders to Set the Standard
Earlier this year, AMD, Arm, Intel, Meta, Microsoft, NVIDIA, and Qualcomm Technologies, Inc. formed the Microscaling Formats (MX) Alliance with the goal of creating and standardizing next-generation 6- and 4-bit data types for AI training and inferencing. The key technology that makes sub-8-bit formats work, referred to as microscaling, builds on a foundation of years of design space exploration and research. MX enhances the robustness and ease of use of existing 8-bit formats such as FP8 and INT8, thus lowering the barrier for broader adoption of single-digit-bit training and inference.

The initial MX specification introduces four concrete floating-point and integer-based data formats (MXFP8, MXFP6, MXFP4, and MXINT8) that are compatible with current AI stacks, support implementation flexibility across both hardware and software, and enable fine-grain microscaling at the hardware level. Extensive studies demonstrate that MX formats can be easily deployed for many diverse real-world cases such as large language models, computer vision, and recommender systems. MX technology also enables LLM pre-training at 6- and 4-bit precisions without any modifications to conventional training recipes.

Democratizing AI Capabilities
In the evolving landscape of AI, open standards are critical to foster innovation, collaboration, and widespread adoption. These standards offer a unifying framework that enables consistent toolchains, model development, and interoperability across the AI ecosystem. This further empowers developers and organizations to harness the full potential of AI while mitigating the fragmentation and technology constraints that could otherwise stifle progress.

In this spirit, the MX Alliance has released the Microscaling Formats (MX) Specification v1.0 in an open, license-free format through the Open Compute Project Foundation (OCP) to enable and encourage broad industry adoption and provide the foundation for potential future narrow-format innovations. A white paper and emulation libraries have also been published to provide details on the data science approach and select results of MX in action. This inclusivity not only accelerates the pace of AI advancement but also promotes openness, accountability, and the responsible development of AI applications.

"AMD is pleased to be a founding member of the MX Alliance and has been a key contributor to the OCP MX Specification v1.0. This industry collaboration to standardize MX data formats provides an open and sustainable approach to continued AI innovations while providing the AI ecosystem time to prepare for the use of MX data formats in future hardware and software. AMD is committed to driving forward an open AI ecosystem and is happy to contribute our research results on MX data formats to the broader AI community." - Michael Schulte, Sr. Fellow, AMD

"As an industry we have a unique opportunity to collaborate and realize the benefits of AI technology, which will enable new use cases from cloud to edge to endpoint. This requires commitment to standardization for AI training and inference so that developers can focus on innovating where it really matters, and the release of the OCP MX specification is a significant milestone in this journey." - Ian Bratt, Fellow and Senior Director of Technology, Arm

"The OCP MX spec is the result of a fairly broad cross-industry collaboration and represents an important step forward in unifying and standardizing emerging sub-8bit data formats for AI applications. Portability and interoperability of AI models enabled by this should make AI developers very happy. Benefiting AI applications should see higher levels of performance and energy efficiency, with reduced memory needs." - Pradeep Dubey, Senior Fellow and Director of the Parallel Computing Lab, Intel

"To keep pace with the accelerating demands of AI, innovation must happen across every layer of the stack. The OCP MX effort is a significant leap forward in enabling more scalability and efficiency for the most advanced training and inferencing workloads. MX builds upon years of internal work, and now working together with our valued partners, has evolved into an open standard that will benefit the entire AI ecosystem and industry." - Brian Harry, Technical Fellow, Microsoft

"MX formats with a wide spectrum of sub-8-bit support provide efficient training and inference solutions that can be applied to AI models in various domains, from recommendation models with strict accuracy requirements, to the latest large language models that are latency-sensitive and compute intensive. We believe sharing these MX formats with the OCP and broader ML community will lead to more innovation in AI modeling." - Ajit Mathews, Senior Director of Engineering, Meta AI

"The OCP MX specification is a significant step towards accelerating AI training and inference workloads with sub-8-bit data formats. These formats accelerate applications by reducing memory footprint and bandwidth pressure, also allowing for innovation in math operation implementation. The open format specification enables platform interoperability, benefiting the entire industry." - Paulius Micikevicius, Senior Distinguished Engineer, NVIDIA

"The new OCP MX specification will help accelerate the transition to lower-cost, lower-power server-based forms of AI inference. We are passionate about democratizing AI through lower-cost inference and we are glad to join this effort." - Colin Verrilli, Senior Director, Qualcomm Technologies, Inc

About the Open Compute Project Foundation
The Open Compute Project (OCP) is a collaborative Community of hyperscale data center operators, telecom, colocation providers and enterprise IT users, working with the product and solution vendor ecosystem to develop open innovations deployable from the cloud to the edge. The OCP Foundation is responsible for fostering and serving the OCP Community to meet the market and shape the future, taking hyperscale-led innovations to everyone. Meeting the market is accomplished through addressing challenging market obstacles with open specifications, designs and emerging market programs that showcase OCP-recognized IT equipment and data center facility best practices. Shaping the future includes investing in strategic initiatives and programs that prepare the IT ecosystem for major technology changes, such as AI & ML, optics, advanced cooling techniques, composable memory and silicon. OCP Community-developed open innovations strive to benefit all, optimized through the lens of impact, efficiency, scale and sustainability.

Learn more at: www.opencompute.org.

 
Joined
Mar 18, 2023
Messages
610 (1.43/day)
System Name Never trust a socket with less than 2000 pins
4 bits?

I want to use useful calculations on that data type. Maybe I am not up-to-date with ML. Is this just for inference?
 
Joined
Jan 3, 2021
Messages
2,762 (2.25/day)
Location
Slovenia
Processor i5-6600K
Motherboard Asus Z170A
Cooling some cheap Cooler Master Hyper 103 or similar
Memory 16GB DDR4-2400
Video Card(s) IGP
Storage Samsung 850 EVO 250GB
Display(s) 2x Oldell 24" 1920x1200
Case Bitfenix Nova white windowless non-mesh
Audio Device(s) E-mu 1212m PCI
Power Supply Seasonic G-360
Mouse Logitech Marble trackball, never had a mouse
Keyboard Key Tronic KT2000, no Win key because 1994
Software Oldwin
4 bits?

I want to use useful calculations on that data type. Maybe I am not up-to-date with ML. Is this just for inference?
All these new formats are exponent-only if I'm reading this table right. That's also interesting.
1697643322926.png
 
Joined
Nov 26, 2021
Messages
1,372 (1.52/day)
Location
Mississauga, Canada
Processor Ryzen 7 5700X
Motherboard ASUS TUF Gaming X570-PRO (WiFi 6)
Cooling Noctua NH-C14S (two fans)
Memory 2x16GB DDR4 3200
Video Card(s) Reference Vega 64
Storage Intel 665p 1TB, WD Black SN850X 2TB, Crucial MX300 1TB SATA, Samsung 830 256 GB SATA
Display(s) Nixeus NX-EDG27, and Samsung S23A700
Case Fractal Design R5
Power Supply Seasonic PRIME TITANIUM 850W
Mouse Logitech
VR HMD Oculus Rift
Software Windows 11 Pro, and Ubuntu 20.04

buildbot

New Member
Joined
Oct 18, 2023
Messages
2 (0.01/day)
4 bits?

I want to use useful calculations on that data type. Maybe I am not up-to-date with ML. Is this just for inference?
This is for both training and inference! You end up with a small gap using MX4 compared to FP32, but that might be acceptable for your use case. MX6 is on par with FP32 training.

All these new formats are exponent-only if I'm reading this table right. That's also interesting.
Not exactly - the element data type for MXFP4, for example, has 1 sign bit, 2 exponent bits, and 1 mantissa bit. These elements are grouped into blocks of 32, and each block is scaled by a shared 8-bit exponent. So the effective storage for MXFP4 is 4 + 8/32 = 4.25 bits per element.
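To make the block layout concrete, here is a minimal Python sketch of quantizing one block to an MXFP4-style format. It is an illustration under assumptions, not the reference implementation: the function names are made up for this example, ties are rounded toward the smaller magnitude for simplicity, and the shared scale is chosen from the largest magnitude in the block so that it lands near the top of the element range.

```python
import math

# Magnitudes representable by a 4-bit element with 1 sign, 2 exponent
# and 1 mantissa bit (E2M1). The sign is handled separately.
E2M1_MAGNITUDES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
E2M1_EMAX = 2        # exponent of the largest power of two in the list (4.0)
BLOCK_SIZE = 32      # elements that share one scale
SCALE_BITS = 8       # the shared scale is an 8-bit exponent
ELEMENT_BITS = 4


def quantize_element(x):
    """Round x to the closest E2M1 value; anything beyond 6 saturates."""
    # Simplification: on a tie, min() keeps the smaller magnitude.
    mag = min(E2M1_MAGNITUDES, key=lambda m: abs(abs(x) - m))
    return math.copysign(mag, x)


def quantize_block(values):
    """Return (shared_exponent, elements) for one block of floats."""
    largest = max(abs(v) for v in values)
    if largest == 0.0:
        return 0, [0.0] * len(values)
    # frexp gives largest = m * 2**e with 0.5 <= m < 1, so floor(log2) = e - 1.
    shared_exp = (math.frexp(largest)[1] - 1) - E2M1_EMAX
    scale = 2.0 ** shared_exp
    return shared_exp, [quantize_element(v / scale) for v in values]


def dequantize_block(shared_exp, elements):
    """Recover approximate floats: shared power-of-two scale times element."""
    scale = 2.0 ** shared_exp
    return [scale * e for e in elements]


def bits_per_element(element_bits=ELEMENT_BITS, scale_bits=SCALE_BITS,
                     block_size=BLOCK_SIZE):
    """Storage cost per element once the shared scale is amortized."""
    return element_bits + scale_bits / block_size


if __name__ == "__main__":
    block = [0.1 * i for i in range(-16, 16)]   # 32 sample values
    exp, elems = quantize_block(block)
    print("shared exponent:", exp)
    print("bits per element:", bits_per_element())
```

With the defaults, `bits_per_element()` gives the 4 + 8/32 = 4.25 figure quoted above. The shared exponent is what lets such tiny elements cover a useful range: each block gets its own power-of-two scale, so only the values within a block have to fit the eight E2M1 magnitudes.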
 
Joined
Mar 10, 2010
Messages
11,878 (2.29/day)
Location
Manchester uk
System Name RyzenGtEvo/ Asus strix scar II
Processor Amd R5 5900X/ Intel 8750H
Motherboard Crosshair hero8 impact/Asus
Cooling 360EK extreme rad+ 360$EK slim all push, cpu ek suprim Gpu full cover all EK
Memory Corsair Vengeance Rgb pro 3600cas14 16Gb in four sticks./16Gb/16GB
Video Card(s) Powercolour RX7900XT Reference/Rtx 2060
Storage Silicon power 2TB nvme/8Tb external/1Tb samsung Evo nvme 2Tb sata ssd/1Tb nvme
Display(s) Samsung UAE28"850R 4k freesync.dell shiter
Case Lianli 011 dynamic/strix scar2
Audio Device(s) Xfi creative 7.1 on board ,Yamaha dts av setup, corsair void pro headset
Power Supply corsair 1200Hxi/Asus stock
Mouse Roccat Kova/ Logitech G wireless
Keyboard Roccat Aimo 120
VR HMD Oculus rift
Software Win 10 Pro
Benchmark Scores 8726 vega 3dmark timespy/ laptop Timespy 6506
In essence then, standard formats should equate to transferable Algebraic IP across many diverse brands, some of which will still do better than others.

CUDA would be the main loser here IMHO

We consumers can only win from this news.

But damn if AI hasn't become the next 3D TV, IMHO.
 
Joined
Oct 24, 2022
Messages
92 (0.16/day)
I really hope that people open their eyes and stop making apps in CUDA and only make them in OpenCL and in other open APIs so that the apps can run on any GPU or dedicated chip for AI.
 
Joined
Mar 18, 2023
Messages
610 (1.43/day)
System Name Never trust a socket with less than 2000 pins
I really hope that people open their eyes and stop making apps in CUDA and only make them in OpenCL and in other open APIs so that the apps can run on any GPU or dedicated chip for AI.

Unfortunately CUDA is much more convenient and approachable for programmers new to GPU computing.
 

buildbot

New Member
Joined
Oct 18, 2023
Messages
2 (0.01/day)
The text of this news is long-winded and confusing.
It's really technical, that is fair! If you have any questions I would be happy to try to explain!
In essence then, standard formats should equate to transferable Algebraic IP across many diverse brands, some of which will still do better than others.

CUDA would be the main loser here IMHO

We consumers can only win from this news.

But damn if AI hasn't become the next 3D TV, IMHO.
Exactly - standardize the datatypes so that everyone can use the same number format and build hardware that supports it.

CUDA/Nvidia don't lose at all! In my opinion, they gain as much as everyone else, since Nvidia will also support the new, more efficient datatypes and still have great hardware for those types, with all of the ease CUDA brings.
I really hope that people open their eyes and stop making apps in CUDA and only make them in OpenCL and in other open APIs so that the apps can run on any GPU or dedicated chip for AI.
CUDA is the default and has a huge amount of mindshare, but it is slowly happening - PyTorch is at least trying to support other backends with different levels of intermediate compilation to open up new GPUs and dedicated chips. They have quite a few already:
  • torch.backends.cpu
  • torch.backends.cuda
  • torch.backends.cudnn
  • torch.backends.mps
  • torch.backends.mkl
  • torch.backends.mkldnn
  • torch.backends.openmp
  • torch.backends.opt_einsum
  • torch.backends.xeon
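If you want to see which of these are usable on your own machine, a small Python sketch along these lines should work. The helper name `probe_torch_backends` is made up for this example, and it assumes each backend module exposes an `is_available()` function; any that do not are simply skipped. It returns an empty result if PyTorch is not installed.

```python
import importlib


def probe_torch_backends():
    """Return {backend_name: bool} for backends this PyTorch build can use.

    Returns an empty dict when PyTorch is not installed.
    """
    try:
        torch = importlib.import_module("torch")
    except ImportError:
        return {}

    # CUDA availability is reported by torch.cuda rather than torch.backends.
    report = {"cuda": bool(torch.cuda.is_available())}

    for name in ("cudnn", "mps", "mkl", "mkldnn", "openmp", "opt_einsum"):
        module = getattr(torch.backends, name, None)
        check = getattr(module, "is_available", None)
        # Skip backends that this PyTorch version does not expose.
        if callable(check):
            report[name] = bool(check())
    return report


if __name__ == "__main__":
    for backend, usable in probe_torch_backends().items():
        print(f"{backend}: {'yes' if usable else 'no'}")
```

On a machine without an Nvidia GPU you would typically still see some of the CPU-side entries reported as available, which is the point being made above: the framework is not tied to one vendor's hardware.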
Unfortunately CUDA is much more convenient and approachable for programmers new to GPU computing.
CUDA is somewhat pleasant to write compared to OpenCL which I have always really disliked, personally at least.
 