
Art Scott

12/08/2020, 3:01 PM
https://proceedings.neurips.cc/paper/2020/file/13b919438259814cd5be8cb45877d577-Paper.pdf

In our estimates, a MAC unit that performs 4-way INT4×FP4 inner products to support 4-bit backpropagation consumes 55% of the area of the FP16 FPU while providing 4× throughput, yielding a total compute density improvement of 7.3×. Compared to FP16 FPUs, the 4-bit unit has simpler shift-based multipliers thanks to the power-of-2 FP4 numbers. It also benefits from the absence of addend aligners, narrower adders, and a simpler normalizer.

Broader Impact: Dedicated hardware accelerators for DNN training, including GPUs and TPUs, have powered machine learning research and model exploration over the past decade. These devices have enabled training on very large models and complex datasets (necessitating 10-100's of ExaOps during the training process). Reduced-precision innovations (16-bit) have recently improved the capability of these accelerators by 4-8× and have dramatically accelerated the pace of model innovation and development. The 4-bit training results presented in this work aim to push this front aggressively and can power faster and cheaper training systems for a wide spectrum of deep learning models and domains.

To summarize, we believe that 4-bit training solutions can accelerate ML research ubiquitously and provide *huge cost and energy savings* for corporations and research institutes, in addition to helping reduce the carbon/climate impact of AI training. By improving power efficiency by 4-7× in comparison to current FP16 designs (and >20× vs. default FP32 designs), the _carbon footprint for training large DNN models can be significantly reduced_ [43].
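
A minimal sketch of the point about shift-based multipliers, assuming (as the excerpt describes) that the FP4 operand is constrained to signed powers of two; the function names and the (sign, exponent) encoding here are illustrative, not taken from the paper or its hardware:

```python
# Illustrative only: when one operand is (-1)^sign * 2^exp, an INT4 x FP4
# product needs no multiplier array, just a sign flip and a bit shift.

def int4_x_fp4(a_int4: int, fp4_sign: int, fp4_exp: int) -> int:
    """Multiply a signed INT4 value by an FP4 value of the form (-1)^sign * 2^exp."""
    assert -8 <= a_int4 <= 7
    product = a_int4 << fp4_exp          # power-of-two scaling is just a left shift
    return -product if fp4_sign else product

def dot4(acts, weights):
    """4-way inner product, as in the 4-way INT4xFP4 MAC mentioned above.

    acts    : four INT4 values
    weights : four (sign, exp) FP4 pairs
    """
    return sum(int4_x_fp4(a, s, e) for a, (s, e) in zip(acts, weights))

# Example: 3*2 + (-5)*(-1) + 7*4 + 2*(-8) = 23
print(dot4([3, -5, 7, 2], [(0, 1), (1, 0), (0, 2), (1, 3)]))
```

The narrower adders and missing addend aligners mentioned in the excerpt follow from the same constraint: partial products are short integers rather than aligned floating-point mantissas.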