Hi folks Hopefully you dont mind that I blab a little more t open-source-silicon.dev #general

Julian Kemmerer

08/07/2022, 4:50 PM

Hi folks, hopefully you don't mind that I blab a little more to follow up on my original post that Art shared with y'all. Wanted to do something for you ASIC-minded people: some ~32GIOPs of matrix multiplication you could in theory implement in ASIC :) PipelineC was not designed with ASICs in mind (mostly used with FPGAs), but it recently got an easy avenue to some experimental support through PyRTL. PyRTL allows you to load in a netlist and say something like ~'give me timing analysis for `130nm tech`': https://pyrtl.readthedocs.io/en/latest/analysis.html

max_freq(tech_in_nm=130, ffoverhead=None)
Estimates the max frequency of a block in MHz.
Parameters
tech_in_nm – the size of the circuit technology to be estimated (for example, 65 is 65nm and 250 is 0.25um)
ffoverhead – setup and ff propagation delay in picoseconds
Returns
a number representing an estimate of the max frequency in MHz
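To make the units in that docstring concrete, here is a minimal sketch of how a critical-path delay in picoseconds maps to a frequency in MHz. It is plain Python and does not call PyRTL. The helper names `max_freq_mhz` and `period_ps` are hypothetical, and treating the flip-flop overhead as a simple additive term on the clock period is an assumption, not a description of PyRTL's internals.

```python
def max_freq_mhz(logic_delay_ps: float, ff_overhead_ps: float = 0.0) -> float:
    """Convert a critical-path delay (ps) into a clock frequency (MHz).

    Assumption: clock period = combinational delay + flip-flop overhead.
    """
    clock_period_ps = logic_delay_ps + ff_overhead_ps
    if clock_period_ps <= 0:
        raise ValueError("clock period must be positive")
    # A 1 MHz clock has a period of 1e6 ps.
    return 1e6 / clock_period_ps


def period_ps(freq_mhz: float) -> float:
    """Inverse conversion: the time budget per clock cycle, in ps."""
    if freq_mhz <= 0:
        raise ValueError("frequency must be positive")
    return 1e6 / freq_mhz


# A 500 MHz target leaves 2000 ps per cycle for logic plus flop overhead.
print(period_ps(500.0))
# Hypothetical split of that budget: 1800 ps of logic and 200 ps of overhead.
print(max_freq_mhz(1800.0, 200.0))
```

So a 500MHz goal means every pipeline stage, including flop setup and propagation, has to fit in roughly 2ns.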

For fun I ran that 8x8 matrix multiply systolic array example, trying to get to 500MHz. It uses the results of PyRTL's 130nm ASIC tech timing models to figure out where to pipeline. Had to use a custom pipeline-able multiplier since you don't have the hard DSPs you'd find in an FPGA - sure there are improvements to be made all over - just the start of an experiment... The tool is essentially pipelining one of the processing elements and repeating that same module. Each processing element consists of an INT16 multiply-accumulate and some muxing for read out/reset.

processor_io_t processor(processor_io_t inputs)
{  
  // Logic to get outputs from inputs
  processor_io_t outputs;
  // Typically pass inputs to outputs
  outputs = inputs;
  // First do multiply, then accumulate
  data_t increment = inputs.a * inputs.b;
  // Except when resetting and reading out results
  if(inputs.reset_and_read_result)
  {
    increment = 0; // Reset to 0
  }
  // The built in accumulate function
  data_t result = accum(increment, inputs.reset_and_read_result);
  // Read out of results through 'a' port
  if(inputs.reset_and_read_result)
  {
    outputs.a = result;
  }
  return outputs;
}
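For anyone who wants to poke at the behaviour without the toolchain, here is a rough Python model of that processing element. It is a sketch under assumptions: the `ProcessorIO` and `ProcessingElement` names are hypothetical, only the three fields used in the C code are modelled, and `accum()` is assumed to return the running total including the current increment and to clear its state when the reset flag is set. It ignores INT16 wraparound and pipeline latency entirely.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class ProcessorIO:
    """Hypothetical stand-in for processor_io_t (only the fields used above)."""
    a: int = 0
    b: int = 0
    reset_and_read_result: bool = False


class ProcessingElement:
    """Behavioural model of one multiply-accumulate element."""

    def __init__(self) -> None:
        self.total = 0  # accumulator state kept between calls

    def step(self, inputs: ProcessorIO) -> ProcessorIO:
        # Multiply first, but add nothing when resetting and reading out.
        increment = 0 if inputs.reset_and_read_result else inputs.a * inputs.b
        # Assumed accum() semantics: result includes this cycle's increment,
        # and the stored total is cleared when the reset flag is set.
        result = self.total + increment
        self.total = 0 if inputs.reset_and_read_result else result
        if inputs.reset_and_read_result:
            # Results are read out through the 'a' port.
            return replace(inputs, a=result)
        # Otherwise inputs pass straight through to the neighbouring element.
        return inputs


pe = ProcessingElement()
pe.step(ProcessorIO(a=2, b=3))   # accumulates 6
pe.step(ProcessorIO(a=4, b=5))   # accumulates 20 more
readout = pe.step(ProcessorIO(reset_and_read_result=True))
print(readout.a)
```

If the real `accum()` built-in behaves differently (for example returning the total from before the current increment), only the two lines computing `result` and `self.total` would need to change.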

The tool tries several options trying to get to 500MHz without too much extra latency (dumb overpipelining is easy):

146.47 MHz 0 clks
336.62 MHz 3 clks
440.42 MHz 5 clks
441.74 MHz 7 clks
459.31 MHz 12 clks
485.44 MHz 19 clks
506.74 MHz 28 clks
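One way to read that sweep, sketched in Python: keep the lowest-latency option that meets the target, then multiply by the 64 processing elements to get a throughput figure. The selection rule and the `pick_lowest_latency` helper are illustrative assumptions, not PipelineC's actual search. Counting one multiply-accumulate per element per clock as one operation is likewise just one plausible reading of the ~32GIOPs number.

```python
# (estimated fmax in MHz, added latency in clocks), copied from the sweep above
sweep = [
    (146.47, 0),
    (336.62, 3),
    (440.42, 5),
    (441.74, 7),
    (459.31, 12),
    (485.44, 19),
    (506.74, 28),
]


def pick_lowest_latency(results, target_mhz):
    """Return the (fmax, clks) entry meeting target_mhz with the fewest clocks."""
    meeting = [entry for entry in results if entry[0] >= target_mhz]
    if not meeting:
        return None  # nothing in the sweep reached the target
    return min(meeting, key=lambda entry: entry[1])


best = pick_lowest_latency(sweep, 500.0)
print(best)

# Assumption: each of the 8x8 = 64 elements completes one MAC per clock.
NUM_ELEMENTS = 8 * 8
giga_macs_per_second = NUM_ELEMENTS * best[0] * 1e6 / 1e9
print(round(giga_macs_per_second, 2))
```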

Meaning that finally splitting the processing element into ~28 stages is what got to 500MHz: ~1 bit per stage for the multiply, ~2 bits per stage for the accumulate, plus a few stages for the muxing. The yosys cell report for the entire 8x8 (64 element) systolic array:

Number of cells:             405952
  $_ANDNOT_                   34816
  $_AND_                      25664
  $_DFF_P_                   244864
  $_MUX_                       4096
  $_NAND_                     16192
  $_NOR_                      14144
  $_NOT_                       2112
  $_ORNOT_                      768
  $_OR_                       14720
  $_XNOR_                     30080
  $_XOR_                      18496
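A quick Python tally of that report, as a sanity check on the numbers. Dividing by 64 assumes the array really is 64 identical copies of the one processing element, with no shared logic outside them; the variable names are just for illustration.

```python
# Cell counts copied from the yosys report above
cells = {
    "$_ANDNOT_": 34816,
    "$_AND_": 25664,
    "$_DFF_P_": 244864,
    "$_MUX_": 4096,
    "$_NAND_": 16192,
    "$_NOR_": 14144,
    "$_NOT_": 2112,
    "$_ORNOT_": 768,
    "$_OR_": 14720,
    "$_XNOR_": 30080,
    "$_XOR_": 18496,
}

total_cells = sum(cells.values())        # should match "Number of cells"
flop_cells = cells["$_DFF_P_"]           # positive-edge D flip-flops
flop_share = flop_cells / total_cells    # fraction of all cells that are flops

NUM_ELEMENTS = 64                        # assumption: 64 identical elements
cells_per_element = total_cells // NUM_ELEMENTS
flops_per_element = flop_cells // NUM_ELEMENTS

print(total_cells, f"{flop_share:.1%}", cells_per_element, flops_per_element)
```

By this count roughly 60% of all cells are pipeline registers, which is the price of cutting each element into ~28 stages.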

(wow, lots of flops :-o) I don't have real ASIC PnR and timing analysis tooling - but if anyone wants to continue experimenting I am absolutely willing to help :) Thanks again for your time folks, happy to chat.

Arman Avetisyan

08/07/2022, 5:51 PM

Definitely very useful tool. I would personally use it to make ALUs for my CPUs.

👍 1

Arman Avetisyan

08/07/2022, 5:55 PM

If this can be ported to Chisel3, so that it can optimize the combinational modules, then it would be the most useful RTL tool ever.

Julian Kemmerer

08/07/2022, 7:35 PM

Hmm, interesting idea. Have heard good things about Chisel. Wondering how that could look best? ~Maybe you mark some Chisel module that describes comb. logic as 'for pipelining', and somehow get that into some intermediate (FIRRTL/etc.) that PipelineC could be made to work with. I am also interested in following intermediates/compiler tooling like the CIRCT project and Google's XLS tool.
