Julian Kemmerer
08/07/2022, 4:50 PM
~32 GIOPs of matrix multiplication you could in theory implement in an ASIC :)
PipelineC was not designed with ASICs in mind (it is mostly used with FPGAs), but it recently got an easy avenue to some experimental ASIC support through PyRTL.
PyRTL allows you to load in a netlist and say something like 'give me timing analysis for `130nm tech`':
https://pyrtl.readthedocs.io/en/latest/analysis.html
max_freq(tech_in_nm=130, ffoverhead=None)
Estimates the max frequency of a block in MHz.
Parameters
tech_in_nm – the size of the circuit technology to be estimated (for example, 65 is 65nm and 250 is 0.25um)
ffoverhead – setup and ff propagation delay in picoseconds
Returns
a number representing an estimate of the max frequency in Mhz
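As a rough illustration of what an estimate like this boils down to, here is a minimal C sketch that turns a critical-path delay into a max frequency in MHz. The function name and the split into logic delay plus flip-flop overhead are assumptions made for illustration; this is not PyRTL's actual timing model, which also scales delays by technology node.

```c
/* Hypothetical helper, not part of PyRTL: estimate a max clock frequency
 * in MHz from delays given in picoseconds.
 * critical_path_ps: longest combinational path between registers
 * ff_overhead_ps:   flip-flop setup time plus propagation delay */
double est_max_freq_mhz(double critical_path_ps, double ff_overhead_ps)
{
    /* The clock period must cover the logic path and the flip-flop overhead */
    double period_ps = critical_path_ps + ff_overhead_ps;
    /* A period of 1e6 ps corresponds to 1 MHz */
    return 1e6 / period_ps;
}
```

For example, an 1800 ps logic path plus 200 ps of flip-flop overhead is a 2000 ps period, which works out to 500 MHz.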
For fun I ran that 8x8 matrix multiply systolic array example, trying to get to 500MHz. It uses the results of PyRTL's 130nm ASIC tech timing models to figure out where to pipeline. I had to use a custom pipeline-able multiplier since you don't have the hard DSPs you'd find in an FPGA - I'm sure there are improvements to be made all over - this is just the start of an experiment...
The tool is essentially pipelining one of the processing elements, and repeating that same module.
Each processing element consists of an INT16 multiply-accumulate and some muxing for read out/reset.
```c
processor_io_t processor(processor_io_t inputs)
{
  // Logic to get outputs from inputs
  processor_io_t outputs;
  // Typically pass inputs to outputs
  outputs = inputs;
  // First do multiply, then accumulate
  data_t increment = inputs.a * inputs.b;
  // Except when resetting and reading out results
  if(inputs.reset_and_read_result)
  {
    increment = 0; // Reset to 0
  }
  // The built in accumulate function
  data_t result = accum(increment, inputs.reset_and_read_result);
  // Read out of results through 'a' port
  if(inputs.reset_and_read_result)
  {
    outputs.a = result;
  }
  return outputs;
}
```
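To make the behaviour of one processing element easier to follow, here is a plain C software model of it (ordinary C, not PipelineC). The `processor_io_t` field layout, the `int16_t` width of `data_t`, and especially the semantics of the accumulate (return the running total, and on reset return the total so far and start over) are assumptions inferred from the snippet above, not the actual PipelineC built-in.

```c
#include <stdint.h>

typedef int16_t data_t; /* assumption: INT16 data, as described above */

typedef struct {
    data_t a;
    data_t b;
    uint8_t reset_and_read_result;
} processor_io_t; /* assumed field layout */

/* Assumed model of the accumulate: keeps a running total.
 * On reset it returns the total so far and restarts from `increment`.
 * The static variable means this models a single processing element. */
static data_t accum_model(data_t increment, uint8_t reset)
{
    static data_t total = 0;
    if (reset) {
        data_t result = total;
        total = increment;
        return result;
    }
    total = (data_t)(total + increment);
    return total;
}

processor_io_t processor_model(processor_io_t inputs)
{
    /* Typically pass inputs through to the neighbouring element */
    processor_io_t outputs = inputs;
    /* Multiply, unless this cycle is a reset/read-out */
    data_t increment = (data_t)(inputs.a * inputs.b);
    if (inputs.reset_and_read_result) {
        increment = 0;
    }
    data_t result = accum_model(increment, inputs.reset_and_read_result);
    /* Read out the accumulated result through the 'a' port */
    if (inputs.reset_and_read_result) {
        outputs.a = result;
    }
    return outputs;
}
```

Under these assumptions, feeding the pairs (2, 3) and (4, 5) and then asserting `reset_and_read_result` reads out 2*3 + 4*5 = 26 on `outputs.a`, while `b` keeps passing through unchanged.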
The tool tries several options trying to get to 500MHz without too much extra latency (dumb overpipelining is easy):
146.47 MHz 0 clks
336.62 MHz 3 clks
440.42 MHz 5 clks
441.74 MHz 7 clks
459.31 MHz 12 clks
485.44 MHz 19 clks
506.74 MHz 28 clks
Meaning that finally splitting the processing element into ~28 stages is what got to 500MHz: ~1 bit per stage for the multiply, ~2 bits per stage for the accumulate, plus a few stages for the muxing.
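The trade-off in the sweep above can be sketched in a few lines of C. The selection rule used here (take the lowest-latency point that meets the target) is an assumption about the goal, not a description of how the tool actually searches, and the peak-rate helper assumes every processing element completes one multiply-accumulate per clock.

```c
#include <stddef.h>

typedef struct {
    double fmax_mhz;  /* estimated max frequency */
    int latency_clks; /* pipeline depth in clock cycles */
} sweep_point_t;

/* The sweep results listed above */
static const sweep_point_t SWEEP[] = {
    {146.47, 0}, {336.62, 3}, {440.42, 5}, {441.74, 7},
    {459.31, 12}, {485.44, 19}, {506.74, 28},
};
static const size_t SWEEP_LEN = sizeof(SWEEP) / sizeof(SWEEP[0]);

/* Index of the lowest-latency point meeting the target, or -1 if none does */
int pick_point(const sweep_point_t *pts, size_t n, double target_mhz)
{
    int best = -1;
    for (size_t i = 0; i < n; i++) {
        if (pts[i].fmax_mhz >= target_mhz &&
            (best < 0 || pts[i].latency_clks < pts[best].latency_clks)) {
            best = (int)i;
        }
    }
    return best;
}

/* Latency in nanoseconds: cycles times period, where period_ns = 1000 / MHz */
double latency_ns(sweep_point_t p)
{
    return p.latency_clks * 1000.0 / p.fmax_mhz;
}

/* Peak multiply-accumulates per second, assuming one MAC per PE per clock */
double peak_macs_per_sec(sweep_point_t p, int num_pes)
{
    return num_pes * p.fmax_mhz * 1e6;
}
```

With a 500 MHz target this picks the 28 clk point, which is roughly 55 ns of latency through one processing element. With 64 elements at 506.74 MHz the peak rate is about 32.4 billion multiply-accumulates per second, which lines up with the ~32 GIOPs figure if each multiply-accumulate is counted as one operation.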
The yosys cell report for the entire 8x8 (64 element) systolic array:
Number of cells: 405952
$_ANDNOT_ 34816
$_AND_ 25664
$_DFF_P_ 244864
$_MUX_ 4096
$_NAND_ 16192
$_NOR_ 14144
$_NOT_ 2112
$_ORNOT_ 768
$_OR_ 14720
$_XNOR_ 30080
$_XOR_ 18496
(wow lots of flops :-o)
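To put a number on "lots of flops", here is a small C sketch that tallies the cell counts from the report above and computes the flip-flop share. The struct and function names are made up for illustration; only the counts come from the report.

```c
#include <stddef.h>
#include <string.h>

typedef struct {
    const char *name;
    long count;
} cell_count_t;

/* Cell counts copied from the yosys report above */
static const cell_count_t CELLS[] = {
    {"$_ANDNOT_", 34816}, {"$_AND_", 25664}, {"$_DFF_P_", 244864},
    {"$_MUX_", 4096},     {"$_NAND_", 16192}, {"$_NOR_", 14144},
    {"$_NOT_", 2112},     {"$_ORNOT_", 768},  {"$_OR_", 14720},
    {"$_XNOR_", 30080},   {"$_XOR_", 18496},
};
static const size_t CELLS_LEN = sizeof(CELLS) / sizeof(CELLS[0]);

/* Sum of all cell counts */
long total_cells(const cell_count_t *cells, size_t n)
{
    long total = 0;
    for (size_t i = 0; i < n; i++) {
        total += cells[i].count;
    }
    return total;
}

/* Count for one cell type, or 0 if the type is not in the list */
long count_of(const cell_count_t *cells, size_t n, const char *name)
{
    for (size_t i = 0; i < n; i++) {
        if (strcmp(cells[i].name, name) == 0) {
            return cells[i].count;
        }
    }
    return 0;
}

/* Fraction of all cells that are D flip-flops ($_DFF_P_) */
double flop_share(const cell_count_t *cells, size_t n)
{
    return (double)count_of(cells, n, "$_DFF_P_") / (double)total_cells(cells, n);
}
```

The per-type counts add up to the reported 405952 cells, and the flip-flops make up about 60% of them. Divided evenly over the 64 processing elements, that is 6343 cells per element, 3826 of which are flops.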
I don't have real ASIC PnR and timing analysis tooling - but if anyone wants to continue experimenting I am absolutely willing to help :)
Thanks again for your time folks, happy to chat
Arman Avetisyan
08/07/2022, 5:51 PM
Arman Avetisyan
08/07/2022, 5:55 PM
Julian Kemmerer
08/07/2022, 7:35 PM