IBM --- By our estimates, a MAC unit that performs 4-way INT4×FP4 inner products to support 4-bit backpropagation occupies 55% of the area of an FP16 FPU while providing 4× the throughput, for a total compute density improvement of 7.3× (4 / 0.55 ≈ 7.3). Compared to FP16 FPUs, the 4-bit unit has simpler shift-based multipliers, because FP4 values are powers of 2. It also benefits from the absence of addend aligners, from narrower adders, and from a simpler normalizer.
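The arithmetic behind the density figure, and the reason a power-of-2 operand reduces a multiplier to a shifter, can be illustrated with a minimal Python sketch. This is not the hardware design described above: the function names, the `(sign, exponent)` representation of an FP4 value, and the restriction to non-negative exponents are all assumptions made for illustration only.

```python
def compute_density_gain(throughput_gain, relative_area):
    """Throughput per unit area, relative to the baseline unit."""
    return throughput_gain / relative_area


def shift_multiply(int4_value, sign, exponent):
    """Multiply a signed INT4 integer by (sign * 2**exponent) using a shift.

    Assumption: the FP4 operand is given as a sign (+1 or -1) and a
    non-negative integer exponent. Real encodings (bias, negative
    exponents, zero) are not modelled here.
    """
    if not -8 <= int4_value <= 7:
        raise ValueError("int4_value must fit in a signed 4-bit integer")
    if sign not in (1, -1):
        raise ValueError("sign must be +1 or -1")
    if exponent < 0:
        raise ValueError("this sketch only handles non-negative exponents")
    magnitude = abs(int4_value) << exponent  # the shift replaces a multiplier
    negative = (int4_value < 0) != (sign < 0)
    return -magnitude if negative else magnitude


def inner_product_4way(int4_values, fp4_terms):
    """Sum of four INT4 x power-of-2 products, as integer arithmetic."""
    if len(int4_values) != 4 or len(fp4_terms) != 4:
        raise ValueError("expected exactly four operand pairs")
    return sum(shift_multiply(a, s, e) for a, (s, e) in zip(int4_values, fp4_terms))


if __name__ == "__main__":
    # 4x throughput in 55% of the area, the figures quoted in the text.
    print(round(compute_density_gain(4, 0.55), 1))
    print(inner_product_4way([1, -2, 3, 7], [(1, 0), (1, 1), (-1, 2), (1, 3)]))
```

Because every product is an integer shifted by a small amount, the partial sums in this sketch can be added directly as integers, which is consistent with the text's point that the unit needs no addend aligners.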
Dedicated hardware accelerators for DNN training, including GPUs and TPUs, have powered machine learning research and model exploration over the past decade. These devices have enabled training of very large models on complex datasets (requiring 10s to 100s of ExaOps over the course of training). Reduced-precision innovations (16-bit) have recently improved the capability of these accelerators by 4-8× and have dramatically accelerated the pace of model innovation and development. The 4-bit training results presented in this work aim to push this front aggressively and can power faster and cheaper training systems for a wide spectrum of deep learning models and domains. To summarize, we believe that 4-bit training solutions can accelerate ML research broadly and provide *huge cost and energy savings* for corporations and research institutes, in addition to helping reduce the carbon and climate impact of AI training. With power efficiency improved by 4-7× over current FP16 designs (and by more than 20× over default FP32 designs), the _carbon footprint for training large DNN models can be significantly reduced_.