Julian Kemmerer
08/07/2022, 4:50 PM
~32 GIOPs of matrix multiplication you could in theory implement in an ASIC :)
PipelineC was not designed with ASICs in mind (it is mostly used with FPGAs), but it recently gained an easy avenue to some experimental ASIC support through PyRTL.
PyRTL lets you load in a netlist and say something like 'give me timing analysis for `130nm tech`':
https://pyrtl.readthedocs.io/en/latest/analysis.html
max_freq(tech_in_nm=130, ffoverhead=None)
Estimates the max frequency of a block in MHz.
Parameters
tech_in_nm – the size of the circuit technology to be estimated (for example, 65 is 65nm and 250 is 0.25um)
ffoverhead – setup and ff propagation delay in picoseconds
Returns
a number representing an estimate of the max frequency in Mhz
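As a rough illustration of what such an estimate means (this is not PyRTL's actual timing model), a frequency in MHz is just the reciprocal of a clock period in picoseconds, scaled by 1e6. The sketch below assumes the period is the combinational critical path plus a fixed flip-flop overhead; both function names are hypothetical helpers, not PyRTL API.

```c
#include <assert.h>

/* Hypothetical helper, not part of PyRTL: convert a register-to-register
 * delay into a clock frequency. Assumes the clock period is simply the
 * combinational critical path plus a fixed flip-flop overhead, both in ps. */
double mhz_from_delay_ps(double critical_path_ps, double ff_overhead_ps)
{
    double period_ps = critical_path_ps + ff_overhead_ps;
    assert(period_ps > 0.0);
    /* A 1 MHz clock has a period of 1e6 ps. */
    return 1e6 / period_ps;
}

/* Inverse direction: the total period budget, in ps, for a target frequency. */
double period_ps_from_mhz(double freq_mhz)
{
    assert(freq_mhz > 0.0);
    return 1e6 / freq_mhz;
}
```

For example, `period_ps_from_mhz(500.0)` gives a 2000 ps budget per pipeline stage, flip-flop overhead included.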
For fun I ran that 8x8 matrix multiply systolic array example, trying to get to 500MHz. It uses the results of PyRTL's 130nm ASIC tech timing models to figure out where to pipeline. I had to use a custom pipeline-able multiplier since you don't have the hard DSPs you'd find in an FPGA. I'm sure there are improvements to be made all over - this is just the start of an experiment...
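To give a feel for what a "pipeline-able" multiplier can look like, here is a minimal shift-and-add sketch in plain C. It is an illustration only, not the multiplier actually used with PipelineC: it is unsigned for simplicity, and the name `shift_add_mul16` is made up. Each loop iteration adds at most one shifted copy of `a`, so a tool can place a register boundary between any two iterations.

```c
#include <stdint.h>

#define MUL_BITS 16

/* Illustrative only: unsigned 16x16 -> 32 bit shift-and-add multiply.
 * Iteration `stage` consumes bit `stage` of b and adds one partial product.
 * In hardware, each iteration could become its own pipeline stage. */
uint32_t shift_add_mul16(uint16_t a, uint16_t b)
{
    uint32_t product = 0;
    for (int stage = 0; stage < MUL_BITS; stage++) {
        if ((b >> stage) & 1u) {
            /* Partial product: a shifted left by the bit position. */
            product += (uint32_t)a << stage;
        }
    }
    return product;
}
```

Unrolled fully, this is 16 adder stages for a 16-bit operand, which is one way to end up with roughly one multiplier bit per pipeline stage.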
The tool is essentially pipelining one of the processing elements, and repeating that same module.
Each processing element consists of an INT16 multiply-accumulate and some muxing for read out/reset.
```c
processor_io_t processor(processor_io_t inputs)
{
  // Logic to get outputs from inputs
  processor_io_t outputs;
  // Typically pass inputs to outputs
  outputs = inputs;
  // First do multiply, then accumulate
  data_t increment = inputs.a * inputs.b;
  // Except when resetting and reading out results
  if(inputs.reset_and_read_result)
  {
    increment = 0; // Reset to 0
  }
  // The built in accumulate function
  data_t result = accum(increment, inputs.reset_and_read_result);
  // Read out of results through 'a' port
  if(inputs.reset_and_read_result)
  {
    outputs.a = result;
  }
  return outputs;
}
```
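The same behavior can be modeled in plain C for quick software checks. This is a sketch under assumptions: `data_t` is taken to be `int16_t`, the accumulator is kept in an explicit state struct instead of the built-in `accum`, and on a reset cycle the running total is assumed to be read out before the accumulator clears. The names `pe_io_t`, `pe_state_t` and `pe_step` are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef int16_t data_t; /* assumption: INT16 data path */

typedef struct {
    data_t a;
    data_t b;
    bool reset_and_read_result;
} pe_io_t;

typedef struct {
    data_t acc; /* stands in for the state held by the built-in accum */
} pe_state_t;

/* One clock of a behavioral processing element model. */
pe_io_t pe_step(pe_state_t *state, pe_io_t in)
{
    assert(state != NULL);
    pe_io_t out = in; /* typically pass inputs through to outputs */
    if (in.reset_and_read_result) {
        out.a = state->acc; /* read the result out through the 'a' port */
        state->acc = 0;     /* assumed: accumulator clears on this cycle */
    } else {
        /* Multiply, then accumulate; truncates back to 16 bits. */
        state->acc = (data_t)(state->acc + in.a * in.b);
    }
    return out;
}
```

Feeding (2, 3) and then (4, 5) leaves 26 in the accumulator, and a following reset cycle returns 26 on `a` while clearing the state.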
The tool tries several pipelining options to reach 500MHz without adding too much extra latency (dumb overpipelining is easy):
146.47 MHz 0 clks
336.62 MHz 3 clks
440.42 MHz 5 clks
441.74 MHz 7 clks
459.31 MHz 12 clks
485.44 MHz 19 clks
506.74 MHz 28 clks
In other words, splitting the processing element into ~28 stages is what finally got to 500MHz: ~1 bit per stage for the multiply, ~2 bits per stage for the accumulate, plus a few stages for the muxing.
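The sweep numbers can be post-processed with a few lines of C. This is only a sketch over the figures listed above, not how the tool itself searches, and it assumes each processing element completes one multiply-accumulate per clock. The type and function names are made up for the example.

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
    double fmax_mhz;
    int latency_clks;
} sweep_point_t;

/* Results copied from the sweep above. */
static const sweep_point_t SWEEP[] = {
    {146.47, 0},  {336.62, 3},  {440.42, 5},  {441.74, 7},
    {459.31, 12}, {485.44, 19}, {506.74, 28},
};
#define SWEEP_LEN (sizeof(SWEEP) / sizeof(SWEEP[0]))

/* Index of the lowest-latency point that meets the target, or -1 if none. */
int first_meeting_target(const sweep_point_t *pts, size_t n, double target_mhz)
{
    assert(pts != NULL || n == 0);
    int best = -1;
    for (size_t i = 0; i < n; i++) {
        if (pts[i].fmax_mhz >= target_mhz &&
            (best < 0 || pts[i].latency_clks < pts[best].latency_clks)) {
            best = (int)i;
        }
    }
    return best;
}

/* Peak multiply-accumulates per second, in units of 1e9,
 * assuming one MAC per processing element per clock. */
double peak_gmacs(int num_pes, double fmax_mhz)
{
    return num_pes * fmax_mhz / 1000.0;
}
```

With a 500MHz target the only qualifying point is the 28 clk one, and `peak_gmacs(64, 500.0)` gives 32.0, which lines up with the ~32 GIOPs figure if each MAC is counted as one operation.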
The yosys cell report for the entire 8x8 (64 element) systolic array:
Number of cells: 405952
$_ANDNOT_ 34816
$_AND_ 25664
$_DFF_P_ 244864
$_MUX_ 4096
$_NAND_ 16192
$_NOR_ 14144
$_NOT_ 2112
$_ORNOT_ 768
$_OR_ 14720
$_XNOR_ 30080
$_XOR_ 18496
(wow lots of flops :-o)
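A quick tally of the report in C shows how flop-heavy the design is. The counts are copied from the report above; the helper names are hypothetical, and the per-element split simply assumes all 64 processing elements are identical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef struct {
    const char *name;
    long count;
} cell_count_t;

/* Cell counts copied from the yosys report above. */
static const cell_count_t CELLS[] = {
    {"$_ANDNOT_", 34816}, {"$_AND_", 25664},  {"$_DFF_P_", 244864},
    {"$_MUX_", 4096},     {"$_NAND_", 16192}, {"$_NOR_", 14144},
    {"$_NOT_", 2112},     {"$_ORNOT_", 768},  {"$_OR_", 14720},
    {"$_XNOR_", 30080},   {"$_XOR_", 18496},
};
#define NUM_CELL_TYPES (sizeof(CELLS) / sizeof(CELLS[0]))

/* Sum of all per-type counts. */
long total_cells(const cell_count_t *cells, size_t n)
{
    assert(cells != NULL || n == 0);
    long total = 0;
    for (size_t i = 0; i < n; i++) {
        total += cells[i].count;
    }
    return total;
}

/* Count for one cell type, or -1 if the type is not listed. */
long count_of(const cell_count_t *cells, size_t n, const char *name)
{
    assert(name != NULL);
    for (size_t i = 0; i < n; i++) {
        if (strcmp(cells[i].name, name) == 0) {
            return cells[i].count;
        }
    }
    return -1;
}
```

The per-type counts sum to the reported 405952 cells. The 244864 `$_DFF_P_` cells are about 60% of the total, or 3826 flops per processing element when divided evenly by 64, which is what you would expect from pipelining each element this deeply.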
I don't have real ASIC PnR and timing analysis tooling - but if anyone wants to continue experimenting, I am absolutely willing to help :)
Thanks again for your time folks, happy to chat
Arman Avetisyan
08/07/2022, 5:51 PM
Arman Avetisyan
08/07/2022, 5:55 PM
Julian Kemmerer
08/07/2022, 7:35 PM