# general
j
Hi folks, hopefully you don't mind that I blab a little more to follow up on my original post that Art shared with y'all. Wanted to do something for you ASIC-minded people: some ~32 GIOPS of matrix multiplication you could in theory implement in ASIC :) PipelineC was not designed with ASICs in mind (it is mostly used with FPGAs), but it recently got an easy avenue to some experimental support through PyRTL. PyRTL allows you to load in a netlist and say something like 'give me timing analysis for `130nm tech`': https://pyrtl.readthedocs.io/en/latest/analysis.html
max_freq(tech_in_nm=130, ffoverhead=None)
Estimates the max frequency of a block in MHz.
Parameters
tech_in_nm – the size of the circuit technology to be estimated (for example, 65 is 65nm and 250 is 0.25um)
ffoverhead – setup and ff propagation delay in picoseconds
Returns
a number representing an estimate of the max frequency in Mhz
For fun I ran that 8x8 matrix multiply systolic array example, trying to get to 500MHz. It uses the results of PyRTL's 130nm ASIC tech timing models to figure out where to pipeline. Had to use a custom pipeline-able multiplier since you don't have the hard DSPs you'd find in an FPGA. I'm sure there are improvements to be made all over, this is just the start of an experiment... The tool is essentially pipelining one of the processing elements and repeating that same module. Each processing element consists of an INT16 multiply-accumulate and some muxing for read out/reset.
processor_io_t processor(processor_io_t inputs)
{  
  // Logic to get outputs from inputs
  processor_io_t outputs;
  // Typically pass inputs to outputs
  outputs = inputs;
  // First do multiply, then accumulate
  data_t increment = inputs.a * inputs.b;
  // Except when resetting and reading out results
  if(inputs.reset_and_read_result)
  {
    increment = 0; // Reset to 0
  }
  // The built in accumulate function
  data_t result = accum(increment, inputs.reset_and_read_result);
  // Read out of results through 'a' port
  if(inputs.reset_and_read_result)
  {
    outputs.a = result;
  }
  return outputs;
}
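To poke at the behaviour of one processing element in plain software, here is a minimal C sketch. It is not PipelineC: `processor_model`, the struct layout, `data_t` being `int16_t`, and especially the way the built-in `accum()` is modelled (return the running total, clear it on reset) are all assumptions made for illustration.

```c
#include <stdint.h>
#include <stdbool.h>

// Hypothetical stand-ins for the types used in the snippet above.
typedef int16_t data_t;  // assumed from "INT16 multiply-accumulate"

typedef struct {
  data_t a;
  data_t b;
  bool reset_and_read_result;
} processor_io_t;

// Software model of one processing element. `acc` plays the role of the
// register hidden inside accum(); its exact semantics are an assumption.
processor_io_t processor_model(data_t *acc, processor_io_t inputs)
{
  processor_io_t outputs = inputs;  // pass a/b through unchanged
  // Product truncated to 16 bits (assumption about data_t width)
  data_t increment = (data_t)(inputs.a * inputs.b);
  if (inputs.reset_and_read_result) {
    increment = 0;  // nothing new is added during readout
  }
  data_t result = (data_t)(*acc + increment);
  // Assumed accum() behaviour: keep the running total, clear it on reset.
  *acc = inputs.reset_and_read_result ? 0 : result;
  if (inputs.reset_and_read_result) {
    outputs.a = result;  // drain the total through the 'a' port
  }
  return outputs;
}
```

Feeding it (2, 3) and then (4, 5) leaves 26 in the accumulator, and a following call with `reset_and_read_result` set returns 26 on `a` and clears the accumulator.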
The tool tries several options trying to get to 500MHz without too much extra latency (dumb overpipelining is easy):
146.47 MHz 0 clks
336.62 MHz 3 clks
440.42 MHz 5 clks
441.74 MHz 7 clks
459.31 MHz 12 clks
485.44 MHz 19 clks
506.74 MHz 28 clks
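Treating the sweep above as data, the "fast enough without too much latency" choice can be sketched in a few lines of C. This is only an illustration of the selection criterion, not how PipelineC searches internally; `sweep_point_t`, `pick_point`, and `latency_ns` are hypothetical names.

```c
#include <stddef.h>

typedef struct {
  double fmax_mhz;   // estimated max frequency for this attempt
  int latency_clks;  // pipeline latency in clock cycles
} sweep_point_t;

// Numbers copied from the sweep output above.
static const sweep_point_t SWEEP[] = {
  {146.47, 0}, {336.62, 3}, {440.42, 5}, {441.74, 7},
  {459.31, 12}, {485.44, 19}, {506.74, 28},
};
static const size_t SWEEP_LEN = sizeof(SWEEP) / sizeof(SWEEP[0]);

// Return the index of the lowest-latency point meeting target_mhz, or -1.
int pick_point(const sweep_point_t *pts, size_t n, double target_mhz)
{
  int best = -1;
  for (size_t i = 0; i < n; i++) {
    if (pts[i].fmax_mhz < target_mhz) continue;  // too slow, skip
    if (best < 0 || pts[i].latency_clks < pts[best].latency_clks) {
      best = (int)i;
    }
  }
  return best;
}

// Latency in nanoseconds when clocked at the point's own fmax.
double latency_ns(sweep_point_t p)
{
  return p.latency_clks * 1000.0 / p.fmax_mhz;
}
```

With a 500 MHz target only the last row qualifies, and at its own fmax the 28 clocks work out to roughly 55 ns of latency.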
Meaning that finally splitting the processing element into ~28 stages is what got to 500MHz: ~1 bit per stage for the multiply, ~2 bits per stage for the accumulate, plus a few stages for the muxing. The yosys cell report for the entire 8x8 (64 element) systolic array:
Number of cells:             405952
  $_ANDNOT_                   34816
  $_AND_                      25664
  $_DFF_P_                   244864
  $_MUX_                       4096
  $_NAND_                     16192
  $_NOR_                      14144
  $_NOT_                       2112
  $_ORNOT_                      768
  $_OR_                       14720
  $_XNOR_                     30080
  $_XOR_                      18496
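A bit of arithmetic on the figures above, written as a small C sketch. The constants are copied from the cell report and the 28 clk sweep row; the helper names are hypothetical, and counting one multiply-accumulate per element per clock as a single op is an assumption about how the throughput figure is meant.

```c
// Figures copied from the cell report and the sweep above.
#define NUM_PES      64        // 8x8 array
#define TOTAL_CELLS  405952L
#define DFF_CELLS    244864L
#define FMAX_MHZ     506.74    // 28 clk pipelining result

// Share of all cells that are flip-flops.
double flop_fraction(void) { return (double)DFF_CELLS / TOTAL_CELLS; }

// Average cell and flop counts per processing element.
long cells_per_pe(void) { return TOTAL_CELLS / NUM_PES; }
long flops_per_pe(void) { return DFF_CELLS / NUM_PES; }

// Peak rate in giga-MACs per second, assuming every element completes
// one multiply-accumulate per clock (counted as one op).
double peak_gmacs(void) { return NUM_PES * FMAX_MHZ / 1000.0; }
```

That works out to 6343 cells and 3826 flops per processing element, about 60% of all cells being flip-flops, and a peak of roughly 32.4 G multiply-accumulates per second.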
(wow lots of flops :-o) I don't have real ASIC PnR and timing analysis tooling, but if anyone wants to continue experimenting I am absolutely willing to help :) Thanks again for your time folks, happy to chat
a
Definitely very useful tool. I would personally use it to make ALUs for my CPUs.
👍 1
If this can be ported to Chisel3, so that it can optimize the combinational modules, then it would be the most useful RTL tool ever.
j
Hmm, interesting idea. Have heard good things about Chisel. Wondering how that could look best? Maybe you mark some Chisel module that describes comb. logic as 'for pipelining', and somehow get that into some intermediate (FIRRTL, etc.) that PipelineC could be made to work with. I am also interested in following intermediates/compiler tooling like the CIRCT project and Google's XLS tool.