For some long-running CACE sims, I've seen this error popping up (open-source-silicon.dev, #chipalooza)

Brady Etz

03/29/2024, 6:55 PM

For some long-running CACE sims, I've seen this error popping up in the live log after a simulation test case completes, and data for that run is lost. Any idea what could be closing the pipes?

Tim Edwards

03/29/2024, 8:35 PM

It looks like something killed the process. Did you monitor memory use while it was running? It is possible that the process was killed by the kernel's out-of-memory (OOM) manager. This is possible for a long-running simulation if you have opted to save everything. Could always be something else, but that's my hunch.

Brady Etz

03/29/2024, 8:39 PM

Will monitor next time. Yeah, I think WSL limits available memory to 12 GB or so, so it could well be.
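One way to do the monitoring suggested above, as a minimal sketch: poll `/proc/meminfo` inside WSL while the simulations run and watch whether available memory approaches zero before the pipes close. The function names (`parse_meminfo`, `available_fraction`, `watch`) and the polling interval are illustrative assumptions, not part of CACE.

```python
import time


def parse_meminfo(text):
    """Parse the text of /proc/meminfo into a dict of integer values.

    Most fields are reported in kB; a few (e.g. HugePages_Total) are counts.
    """
    info = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        if not sep:
            continue
        fields = rest.split()
        if fields and fields[0].isdigit():
            info[key.strip()] = int(fields[0])
    return info


def available_fraction(info):
    """Fraction of total memory the kernel estimates is still available."""
    return info["MemAvailable"] / info["MemTotal"]


def watch(interval_s=5.0, samples=12, path="/proc/meminfo"):
    """Print available memory every interval_s seconds, samples times."""
    for _ in range(samples):
        with open(path) as f:
            info = parse_meminfo(f.read())
        print(f"MemAvailable: {info['MemAvailable'] / 1024:.0f} MiB "
              f"({available_fraction(info):.0%} of total)")
        time.sleep(interval_s)
```

Calling `watch()` in a second terminal during a batch of runs gives a rough timeline to compare against the moment the error appears in the live log.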

Tim Edwards

03/29/2024, 8:40 PM

If it's WSL, though, I'm even less sure how the OOM management works.

Brady Etz

03/29/2024, 11:06 PM

It does seem related to memory. On my machine, sets of ~50 long (400000 rows) sims in CACE can run out of memory. It helps if I exit the CACE GUI between runs, but makes it difficult to collect a full datasheet summary. There's a note in Microsoft's documentation about how WSL 2 can consume excess memory because it does not promptly release cached pages. https://github.com/microsoft/WSL/issues/4166 I can confirm idle memory usage is higher after use, even without actively running a CACE job. I'm trying some configuration settings with a `.wslconfig` file, described here: https://learn.microsoft.com/en-us/windows/wsl/wsl-config. Will report if this resolves my troubles.

Tim Edwards

03/30/2024, 12:12 AM

I would suppose that a low-speed crystal oscillator transient startup would be a resource hog, but is there any way to reduce the data after the simulation? Or even during? The `linearize` command in ngspice might be useful here (or not; just spouting off ideas here...).

Brady Etz

03/30/2024, 12:25 AM

If I understand the manual right, `linearize` acts on a finished vector but it doesn't accelerate or compress the data mid-sim. The post-simulation data handling is small enough. But the process of running the simultaneous transient sims occupies significant memory. I haven't tried running single-threaded, but I expect the total runtime will be longer.

I added a `.wslconfig` file to my Windows home directory with a couple lines that show improvements.

`memory=20GB` gives a more generous chunk to WSL than the default 16GB for my system.

`pageReporting=false` keeps Windows from nabbing memory from WSL whenever it can (I think).

`autoMemoryReclaim=dropcache` is an experimental feature that seems to help the most. WSL seems to free up memory more aggressively after a job completes and I'm not seeing it climb over 6GB.

EDIT: It does climb over, but only during a batch of runs. Once WSL 2 runs out of swap space and memory, the running processes fail or their pipes close. With these configuration settings, after 10-15 seconds, the swap and memory clears, and the last testbench run can be aborted from the CACE GUI. Resuming the run after memory clears at least gives a chance of success.
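For reference, the three settings above might sit in the file roughly as follows. The section headers are an assumption based on Microsoft's WSL configuration documentation (linked earlier in the thread), since the message does not say which section each key went under; the values are the ones quoted in the message.

```ini
# %UserProfile%\.wslconfig (sketch; section placement assumed from Microsoft's docs)
[wsl2]
# Upper limit on memory available to the WSL 2 VM
memory=20GB
# Stop Windows from reclaiming unused memory from the VM
pageReporting=false

[experimental]
# Drop cached pages once WSL is idle
autoMemoryReclaim=dropcache
```

Changes to `.wslconfig` generally only take effect after WSL is fully restarted, for example with `wsl --shutdown` from a Windows prompt.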

Tim Edwards

03/30/2024, 12:28 AM

Okay, so I guess the problem really was just memory management. I'm glad you figured it out (and hope that was pretty much the whole problem).

Brady Etz

03/31/2024, 3:54 AM

It still creeps up with the second or third testbench I run. Watching `htop`, there's a cace-gui process that climbs by 40-80 MB in resident memory for every completed ngspice run. I thought ngspice did the math (e.g. calculate the mean of a vector, or calculate a .meas time/value) and sent a few words over to cace-gui with the `.data` file output it makes using `wrdata`, but it seems like there's more going on than that.

When a testbench finishes, all the ngspice processes terminate. This releases 1-2G of memory. But mid-testbench, the memory clumping onto CACE is an issue. At the end of a testbench cycle, CACE does its thing and processes the data, then the resident memory in the cace-gui process with the highest priority hikes up, which carries forward into the next testbench. Eventually I'm all out of juice over the 45 or 135 corners.

Since adding more comprehensive corners, I haven't been able to generate results for every testbench in one sitting. I can still create an updated datasheet, but will do a piecewise data summary in the GitHub README. Is there a way for me to pare down the data CACE handles? I'm trying "Do not create plot files" (although I wasn't plotting anything). I am definitely wondering if there's anything else I should try.
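The per-run growth seen in `htop` could also be logged instead of watched by eye. The sketch below reads the resident set size (`VmRSS`) from `/proc/<pid>/status` for every process whose command line contains a given fragment. The helper names and the idea of matching on the string `"cace"` are assumptions for illustration; how the GUI process actually appears in the process list depends on how it was launched.

```python
import os


def parse_vmrss_kb(status_text):
    """Return VmRSS in kB from the text of /proc/<pid>/status, or None."""
    for line in status_text.splitlines():
        if line.startswith("VmRSS:"):
            return int(line.split()[1])
    return None  # kernel threads report no VmRSS


def rss_by_cmdline(fragment, proc_root="/proc"):
    """Map PID -> resident memory (kB) for processes whose command line
    contains `fragment`. Processes that exit mid-scan are skipped."""
    results = {}
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        base = os.path.join(proc_root, entry)
        try:
            with open(os.path.join(base, "cmdline"), "rb") as f:
                cmdline = f.read().replace(b"\0", b" ").decode(errors="replace")
            with open(os.path.join(base, "status")) as f:
                rss_kb = parse_vmrss_kb(f.read())
        except OSError:
            continue
        if fragment in cmdline and rss_kb is not None:
            results[int(entry)] = rss_kb
    return results
```

Printing `rss_by_cmdline("cace")` after each completed ngspice run would show whether the 40-80 MB step is steady or tied to particular testbenches.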

Tim Edwards

03/31/2024, 4:13 PM

I really don't know because CACE is not supposed to be handling much data at all; if you did 135 corners, then it should just be handling 135 values (or a small multiple of that, depending on how many values your simulation outputs). It sounds like python in WSL isn't doing proper garbage collection, like it's grabbing memory for the buffered output from ngspice and not releasing it. You might try axing the ngspice output in CACE by changing `stdout=subprocess.PIPE` and `stderr=subprocess.PIPE` (or `stderr=subprocess.STDOUT`) to `stdout=subprocess.DEVNULL` and `stderr=subprocess.DEVNULL`. This will have one negative impact: a simulation that hits an error and drops back to the ngspice interpreter prompt will cause CACE to hang. But if it works, then I can add it as an option setting.

Tim Edwards

03/31/2024, 4:17 PM

(The changes would be in `cace_simulate.py` lines 152 and 153; I don't think it's needed anywhere else.) I have a pretty low level of confidence that that will change anything. But I can't think of anywhere else that would be using so much memory. For that matter, the output of ngspice can't be using that much memory, either.
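To illustrate the kind of change being suggested, here is a minimal, generic sketch of launching a child process with its output discarded rather than buffered through pipes. This is not the actual code in `cace_simulate.py`; the function name `run_discarding_output` and the optional `timeout` (added here as one possible guard against the hang mentioned above) are assumptions.

```python
import subprocess
import sys


def run_discarding_output(cmd, timeout=None):
    """Run `cmd`, sending stdout and stderr to the null device.

    Nothing the child prints is held in the parent's memory. If `timeout`
    (seconds) is given and exceeded, subprocess.run kills the child and
    raises subprocess.TimeoutExpired.
    """
    completed = subprocess.run(
        cmd,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
        timeout=timeout,
    )
    return completed.returncode


if __name__ == "__main__":
    # Stand-in for a chatty simulator: prints a lot, then exits cleanly.
    code = run_discarding_output([sys.executable, "-c", "print('x' * 100000)"])
    print("exit code:", code)
```

The trade-off is the one already noted: with the output discarded, any error message from the simulator is lost, so a failed run has to be diagnosed from its exit code or output files.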
