How can I debug a segfault with re-placement? duri...
# openroad
m
How can I debug a segfault with re-placement?
during executing: "openroad -exit /openlane/scripts/openroad/or_replace.tcl |& tee >&@stdout /project/openlane/user_project_wrapper/runs/user_project_wrapper/logs/placement/16-replace.log"
Last 10 lines: child killed: segmentation violation
m
is it an out of memory situation in docker? child killed sounds like an external agent
large designs can trigger that
m
possibly.
m
OL has a DOCKER_MEMORY variable to allow a user defined amount
m
Hm, seems like it defaults to 64G
m
a big enough design could exceed the limit. What is the tail of 16-replace.log
m
This is 10 macros plus a few hundred gates
m
so probably not memory then.
log tail?
m
There's a bunch of infos that have always existed:
[INFO GRT-0209] Ignoring an obstruction on layer met5 outside the die area.
[INFO GRT-0209] Ignoring an obstruction on layer met5 outside the die area.
[INFO GRT-0209] Ignoring an obstruction on layer met5 outside the die area.
[INFO GRT-0209] Ignoring an obstruction on layer met5 outside the die area.
[INFO GRT-0209] Ignoring an obstruction on layer met5 outside the die area.
Lots of warnings about pins outside the area (from the caravel harness)
m
It could be in the global router (which placement calls) based on this. Can you try without PL_ROUTABILITY_DRIVEN ? That would narrow it down
Either way I think we'll need a testcase to debug if possible.
m
Yeah. I'm re-running my MPW2 design with the new timing updates, so I'll debug a bit more and provide something.
Same error when PL_ROUTABILITY_DRIVEN is set to false
m
what is the log tail then?
m
Nothing seems to have changed...
m
That doesn't make sense. You should get no GRT messages in that case
m
Just this in my config, right? set ::env(PL_ROUTABILITY_DRIVEN) 0
Oh the default is false anyways
m
or_replace.tcl is a mess. I think it doesn't turn off correctly.
Can you provide a test case for this and open an issue?
m
Unfortunately, it seems to not fail any longer and gets to stage 29-opendp but then also segfaults
m
what changed?
m
I modified some verilog tests. That is all
Completely unrelated
m
that sounds suspicious.... what is the opendp failure?
m
I also ran make clean?
The opendp failure is related to escaping some signal names. My clock is io_in[17]. I need to escape the [] in config.tcl but not in base.sdc or else STA won't recognize the clock. However, or_opendp.tcl complains if I don't have it escaped:
invalid command name "17"
    while executing
"17"
    ("uplevel" body line 1)
    invoked from within
"uplevel #0 ${cmd}"
    (procedure "set_log" line 3)
    invoked from within
"set_log ::env($index) $escaped_env_var $::env(GLB_CFG_FILE) 1"
    (procedure "save_state" line 9)
    invoked from within
"save_state"
    (procedure "flow_fail" line 6)
    invoked from within
"flow_fail"
    (procedure "try_catch" line 25)
    invoked from within
"try_catch $::env(OPENROAD_BIN) -exit $::env(SCRIPTS_DIR)/openroad/or_opendp.tcl |& tee $::env(TERMINAL_OUTPUT) [index_file $::env(opendp_log_file_tag)..."
This actually could be what changed for global routing too...
m
opendb doesn't do anything with timing.
opendp rather
m
Why did it read my SDC?
m
what is in the opendp log?
I don't see a read_sdc in or_opendp.tcl
m
Just a segfault with no real context
m
where is invalid command name "17" coming from?
m
That is the text in my signal name: io_in[17]
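For reference, Tcl reports this message whenever a string containing an unescaped [17] gets evaluated as a script, because square brackets trigger command substitution. A minimal sketch, assuming nothing about OpenLane beyond the `uplevel #0 ${cmd}` line in the trace (the variable names here are made up for illustration):

```tcl
# Hypothetical reproduction: build a command string from a value that
# contains brackets, then evaluate it the way the trace shows (uplevel #0).
set value {io_in[17]}                 ;# braces keep the brackets literal here
set cmd "set ::demo_var $value"       ;# cmd is now: set ::demo_var io_in[17]

# Evaluating the string re-parses it, so [17] becomes a command call.
if {[catch {uplevel #0 $cmd} msg]} {
    puts "error: $msg"                ;# error: invalid command name "17"
}

# With the brackets backslash-escaped, the same evaluation succeeds.
set escaped {io_in\[17\]}
uplevel #0 "set ::demo_var $escaped"
puts $::demo_var                      ;# io_in[17]
```

So the error itself only says that some script rebuilt and re-evaluated a command containing the port name; it does not say which tool did it.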
m
I understand but who is trying to execute that name?
m
or_opendp.tcl... Rerunning to get the remainder of the stack trace I didn't paste.
m
Do you see anything in that script that would access the name of your clock? I'm looking at the version in master and I don't see anything
m
invalid command name "17"
    while executing
"17"
    ("uplevel" body line 1)
    invoked from within
"uplevel #0 ${cmd}"
    (procedure "set_log" line 3)
    invoked from within
"set_log ::env($index) $escaped_env_var $::env(GLB_CFG_FILE) 1"
    (procedure "save_state" line 9)
    invoked from within
"save_state"
    (procedure "flow_fail" line 6)
    invoked from within
"flow_fail"
    (procedure "try_catch" line 25)
    invoked from within
"try_catch $::env(OPENROAD_BIN) -exit $::env(SCRIPTS_DIR)/openroad/or_opendp.tcl |& tee $::env(TERMINAL_OUTPUT) [index_file $::env(opendp_log_file_tag)..."
    (procedure "detailed_placement_or" line 6)
    invoked from within
"detailed_placement_or"
    (procedure "run_routing" line 32)
    invoked from within
"run_routing"
    (procedure "run_routing_step" line 10)
    invoked from within
"[lindex $step_exe 0] [lindex $step_exe 1] "
    (procedure "run_non_interactive_mode" line 43)
    invoked from within
"run_non_interactive_mode {*}$argv"
    invoked from within
"if { [info exists flags_map(-interactive)] || [info exists flags_map(-it)] } {
        puts_info "Running interactively"
        if { [info exists arg_values(-file)..."
    (file "/openlane/flow.tcl" line 356)
make[1]: *** [Makefile:43: user_project_wrapper] Error 1
make[1]: Leaving directory '/home/mrg/openram_testchip/openlane'
make: *** [Makefile:70: user_project_wrapper] Error 2
The log output is useless.
It's reporting overlaps then segfaults:
[WARNING DPL-0005] Overlap check failed (16972).
repeater448 overlaps ANTENNA_repeater448_A
repeater449 overlaps ANTENNA_repeater449_A
repeater451 overlaps ANTENNA_repeater451_A
[ERROR]: during executing: "openroad -exit /openlane/scripts/openroad/or_opendp.tcl |& tee >&@stdout /project/openlane/user_project_wrapper/runs/user_project_wrapper/logs/placement/29-opendp.log"
[ERROR]: Exit code: 1
[ERROR]: Last 10 lines: child process exited abnormally
[ERROR]: Please check openroad log file
[ERROR]: Dumping to /project/openlane/user_project_wrapper/runs/user_project_wrapper/error.log
m
I guess the 17 is a red herring. For the crash a test case is best as I can't guess from this what the problem is
you mentioned having macros - are there placement sites in the channels ?
I have seen a case recently where the channel was so narrow no instances could be placed there
m
It's big. This successfully routed for MPW2
One dumb question. There is now a config.json in addition to config.tcl with duplicate information. Why are they both there?
m
sorry but I guess I need to look at it
I am not much of an openlane expert, I mostly work on openroad. @User can you explain config.json vs config.tcl?
m
Thanks for your help. I'll add a test case and/or debug a bit more
m
np
d
The .json is there so users can be allowed to customize things on platforms where freely modifiable Tcl would constitute a security concern, for example the efabless platform and the OpenLane cloud runner. It’s just an alternative that you’re free to pick.
m
@User what happens if both are there like in the example?
d
tcl's prioritized
Do note that I mean only Tcl will be loaded. JSON will be ignored entirely. If the tcl config's missing, it will attempt to load a json config. If the json config's missing as well, flow.tcl will throw an error.
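The precedence described here can be sketched as a small helper. This is a hypothetical illustration of the rule as stated, not OpenLane's actual loading code, and the proc name pick_config is invented:

```tcl
# Hypothetical helper (not OpenLane's actual code): choose which config file
# to load, following the precedence described in the thread.
proc pick_config {design_dir} {
    set tcl_cfg  [file join $design_dir config.tcl]
    set json_cfg [file join $design_dir config.json]
    if {[file exists $tcl_cfg]} {
        return $tcl_cfg      ;# Tcl wins; any config.json is ignored entirely
    }
    if {[file exists $json_cfg]} {
        return $json_cfg     ;# fall back to JSON only when config.tcl is absent
    }
    error "no config.tcl or config.json found in $design_dir"
}
```

With both files present, as in the example project, only config.tcl would be returned.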
m
@User yeah, thanks for the clarification.
After wrestling with a fresh install of openlane/pdk, I can reproduce this with or_opendp again:
invalid command name "17"
    while executing
"17"
    ("uplevel" body line 1)
    invoked from within
"uplevel #0 ${cmd}"
    (procedure "set_log" line 3)
    invoked from within
"set_log ::env($index) $escaped_env_var $::env(GLB_CFG_FILE) 1"
    (procedure "save_state" line 9)
    invoked from within
"save_state"
    (procedure "flow_fail" line 6)
    invoked from within
"flow_fail"
    (procedure "try_catch" line 25)
    invoked from within
"try_catch $::env(OPENROAD_BIN) -exit $::env(SCRIPTS_DIR)/openroad/or_opendp.tcl |& tee $::env(TERMINAL_OUTPUT) [index_file $::env(opendp_log_file_tag)..."
    (procedure "detailed_placement_or" line 6)
    invoked from within
"detailed_placement_or"
    (procedure "run_routing" line 32)
    invoked from within
"run_routing"
    (procedure "run_routing_step" line 10)
    invoked from within
"[lindex $step_exe 0] [lindex $step_exe 1] "
    (procedure "run_non_interactive_mode" line 43)
    invoked from within
"run_non_interactive_mode {*}$argv"
    invoked from within
"if { [info exists flags_map(-interactive)] || [info exists flags_map(-it)] } {
        puts_info "Running interactively"
        if { [info exists arg_values(-file)..."
    (file "/openlane/flow.tcl" line 356)
make[1]: *** [Makefile:43: user_project_wrapper] Error 1
make[1]: Leaving directory '/home/mrg/openram_testchip/openlane'
make: *** [Makefile:70: user_project_wrapper] Error 2
The "17" is in the name of my clock in my base.sdc or my config.tcl file:
set ::env(CLOCK_PORT) {io_in[17]}
If I don't use the base.sdc, it still does the above so it must be something with the config.tcl. I have the name escaped there:
set ::env(CLOCK_PORT) {io_in\[17\]}
If I look at the generated SDC files, however, the name is unescaped:
mrg@diode ~/openram_testchip/openlane/user_project_wrapper/runs/user_project_wrapper (main)$ find . -name \*.sdc -exec grep create_clock {} \; -print
create_clock -name io_in[17] -period 30.0000 [get_ports {io_in[17]}]
./results/cts/user_project_wrapper.cts.sdc
create_clock -name io_in[17] -period 30.0000 [get_ports {io_in[17]}]
./tmp/floorplan/4-verilog2def.sdc
create_clock -name io_in[17] -period 30.0000 [get_ports {io_in[17]}]
./tmp/placement/23-resizer_timing.sdc
create_clock -name io_in[17] -period 30.0000 [get_ports {io_in[17]}]
./tmp/placement/21-resizer_timing.sdc
create_clock -name io_in[17] -period 30.0000 [get_ports {io_in[17]}]
./tmp/placement/16-resizer.sdc
So there are two questions:
1. why isn't write_sdc escaping the name properly?
2. why is or_opendp using the SDC at all?
@User ^^
m
@User
m
@User This may actually be a red herring like @User mentioned before. This looks like it is part of the "save_state" function, which is probably trying to write out the SDC (or config.tcl) after an error. opendp probably doesn't use the SDC (or the clock at all), but the failure triggers this saving. I'm running it now without any clock defined to see if I can identify and solve the real error.
OH, so this failure is actually during routing when it is trying to legalize the diodes. This is why timing is enabled...
GAH, and it can't legalize the diodes because they are "sprayed" all over the macros and can't be moved outside of the macros.
m
which DIODE_INSERTION_STRATEGY are you using?
m
I was using spray because the others caused problems during the MPW2 tool flow
But spray won't work if you have macros now
m
which one is spray?
m
1
I'm uncertain how the others would work. If they put a diode on a macro it won't work
m
1. "A diode is inserted for each PIN and connected to it."?
m
"Spray diodes"
Specifies the insertion strategy of diodes to be used in the flow.
0 = No diode insertion
1 = Spray diodes
2 = Insert fake diodes and replace them with real diodes if needed
3 = Use FastRoute Antenna Avoidance flow
4 = Use Sylvian's Custom Script for diode insertion on design pins and smartly inserting needed diodes inside the design
5 = A mix of strategies 2 and 4
(Default: 3)
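For reference, a config fragment selecting a strategy might look like the following. It assumes the variable is set in config.tcl the same way CLOCK_PORT is set elsewhere in this thread; the value 3 is simply the quoted default, not a recommendation.

```tcl
# In the design's config.tcl; value meanings are those quoted above.
set ::env(DIODE_INSERTION_STRATEGY) 3   ;# FastRoute antenna avoidance (quoted default)
```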
m
3 used to not work for some reason
m
maybe you have an older version?
Yes, I had an older version during MPW2 🙂
Looks like those are in conflict with each other. Maybe spray puts them randomly and then tries to connect them to each pin?
I think 3 used to not work because they were using a different router before, if I recall?
TritonRoute vs FastRoute?
m
I'm not sure of the history. @User would you clarify the doc discrepancy between the README and hardening_macros on DIODE_INSERTION_STRATEGY described above
from what I can see in routing.tcl it looks like 1 is closer to "A diode is inserted for each PIN and connected to it." versus spraying but I guess that doesn't match your experience
it looks like it is trying to put a diode on each pin directly and then let detailed placement legalize it
that's potentially a lot of diodes so it might not be possible to legalize them in a dense enough design area
however I would expect that would lead to diodes on the pins not deep inside the macros
m
I see. I removed the diodes entirely and it seems to still have issues legalizing clock buffers (and filler?). I need to figure out that red herring save_state bug too though. I'm finally digging more into the openlane/openroad flow so I have a better understanding of things under the hood now.
m
there should be no fillers when you are running detailed placement. They should happen after otherwise there will be no empty sites
m
I get lots of:
repeater432 overlaps FILLER_2_3245
repeater433 overlaps FILLER_406_3269
repeater435 overlaps FILLER_631_2825
repeater437 overlaps FILLER_2_4701
before it inelegantly gives up
This is all after routing
m
if you are inserting diodes after routing then you should delay filler insertion to after that
the design will be 100% full after filler insertion and nothing else will fit
m
That might be an openlane issue. This is the relevant stack:
invoked from within
"detailed_placement_or"
    (procedure "run_routing" line 32)
    invoked from within
"run_routing"
    (procedure "run_routing_step" line 10)
There are calls to ins_fill_cells after ins_diode_cells but before detailed_placement_or
So it may take up the space and not be able to legalize
Yeah, it runs ins_fill_cells before legalization of the diodes. That is the problem.
@User ^^
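The ordering problem described above can be sketched as follows. The three procedure names come from the thread and the stack trace, but the bodies are stand-in stubs that only record the call order. This is not OpenLane's routing.tcl, and the real procedures' arguments are not shown in the thread.

```tcl
# Stand-in stubs: each only records that it was called.
set ::steps {}
proc ins_diode_cells {}       { lappend ::steps diodes }
proc detailed_placement_or {} { lappend ::steps legalize }
proc ins_fill_cells {}        { lappend ::steps fill }

# Order described above: fill cells consume every free site
# before the diodes have been legalized.
proc order_as_reported {} { ins_diode_cells; ins_fill_cells; detailed_placement_or }

# Suggested order: legalize the diodes while sites are still free, then fill.
proc order_suggested {} { ins_diode_cells; detailed_placement_or; ins_fill_cells }

order_suggested
puts $::steps   ;# diodes legalize fill
```

Once fill cells are in, the rows are 100% occupied, so any legalization step has to run before them.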
m
that sounds worth an issue if you don't hear from @User
m
@User @User Maybe this has been resolved, but I was able to reproduce the
invalid command name "17"
    while executing
"17"
error and have a workaround. It occurs when there is an existing <design>/runs/<tag>/config.tcl file. Deleting this file works for me. When this file is created, the clock port is defined as below, but it looks like the routine that reads this and rewrites it can't handle ] or [.
set ::env(CLOCK_PORT) "io_in\[17\]"
I believe the permanent solution is to patch the save_state routine in scripts/tcl_commands/all.tcl with the following
set escaped_env_var [string map {\[ \\\[} $escaped_env_var]
set escaped_env_var [string map {\] \\\]} $escaped_env_var]
I'll submit a PR once I test it.
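A quick way to sanity-check the proposed mapping outside the flow is a standalone tclsh snippet. It assumes escaped_env_var holds the raw, unescaped value at that point in save_state, which the thread does not confirm. If the value were already escaped, the mapping would add a second layer of backslashes.

```tcl
# Standalone check of the two string map calls proposed above.
set escaped_env_var {io_in[17]}       ;# assumed raw value, brackets unescaped
set escaped_env_var [string map {\[ \\\[} $escaped_env_var]
set escaped_env_var [string map {\] \\\]} $escaped_env_var]
puts $escaped_env_var                 ;# io_in\[17\]

# Evaluating a command string built from it now round-trips to the raw name.
uplevel #0 "set ::demo_var $escaped_env_var"
puts $::demo_var                      ;# io_in[17]
```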
m
Hi @User I had gotten to that point as well and even had that same fix but I don't see it resolving the issue. Sometimes I feel like the scripts are cached somewhere and don't seem to update when I run though...
m
@User I found another place, proc prep in all.tcl, that looks like it's trying to write out the config.tcl file. However, the same type of fix doesn't work as expected. I'll dig deeper tomorrow. Incidentally, I've noticed that the config.tcl file has a lot of duplicate entries and sometimes the values don't match.
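One option that may be worth trying when writing values back out, offered as a general Tcl idiom and not as a verified fix for prep or save_state: build each line with list, which quotes special characters so that the line evaluates back to the original value. The array name demo_env below is a placeholder for illustration.

```tcl
# General Tcl idiom: let [list] do the quoting instead of escaping by hand.
set value {io_in[17]}
set line [list set ::demo_env(CLOCK_PORT) $value]   ;# a well-formed command

# Evaluating (or sourcing) the generated line restores the exact value,
# with no command substitution on the brackets.
uplevel #0 $line
puts $::demo_env(CLOCK_PORT)          ;# io_in[17]
```

This avoids having to track whether a value has already been escaped once, which manual string map calls cannot tell.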
m
@User Our day is starting so I'll let you know what I find today.