Since you are digging into the idea of an optimum buffer count, a good test would be this: put set_driving_cell with the smallest buffer on an input port, wire that port straight to an output port, and put a set_load constraint on the output. You can then sweep the load value to see how the tool handles the buffering. The expectation is that it equalizes the delay across the buffers in the chain by sizing them so that each stage's fanout (cload / cin) is close to the geometric mean of the total load ratio, i.e. (total Cload / first-stage Cin)^(1/N) for a chain of N buffers.
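
To have something to compare the tool's result against, here is a rough Python sketch of the ideal numbers for each load in the sweep. It assumes the simple equal-fanout model only: no parasitic delay, no wire capacitance, and continuous cell sizes, whereas a real library only has discrete drive strengths. The function names are made up for illustration, and the target fanout of 4 is a common rule of thumb, not something the tool guarantees.

```python
import math


def stage_fanout(c_in, c_load, n_stages):
    """Per-stage fanout that equalizes stage effort: (c_load / c_in)^(1/N)."""
    return (c_load / c_in) ** (1.0 / n_stages)


def stage_input_caps(c_in, c_load, n_stages):
    """Ideal input capacitance of each buffer, growing by the same factor per stage."""
    f = stage_fanout(c_in, c_load, n_stages)
    return [c_in * f ** i for i in range(n_stages)]


def estimate_stage_count(c_in, c_load, target_fanout=4.0):
    """Stage count that puts the per-stage fanout near target_fanout (assumed ~4)."""
    total = c_load / c_in
    if total <= 1.0:
        return 1  # load is no bigger than the first buffer's input: one stage is enough
    return max(1, round(math.log(total) / math.log(target_fanout)))


if __name__ == "__main__":
    c_in = 1.0  # input cap of the smallest buffer, in arbitrary units (assumed)
    for c_load in (4.0, 16.0, 64.0, 256.0):  # hypothetical set_load sweep values
        n = estimate_stage_count(c_in, c_load)
        f = stage_fanout(c_in, c_load, n)
        caps = [round(c, 2) for c in stage_input_caps(c_in, c_load, n)]
        print(f"load={c_load:7.1f}  stages={n}  fanout/stage={f:.2f}  input caps={caps}")
```

If the synthesized chain for a given load has a noticeably different number of buffers, or the fanout per stage is far from uniform, that is the interesting case to look at (it usually comes from discrete cell sizes, max_transition or max_capacitance limits, or area recovery).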