# analog-design
s
I have done a comparison between ngspice and (serial) Xyce on a big design. Note: this test circuit does not use sky130 models, but a generic 180nm CMOS process at 1.5V. The design has only 1 LV pmos and 1 LV nmos, only two models, BSIM3 level=49. The example is in the standard (standalone) xschem test schematics, `..../share/xschem/xschem_library/rom8k/rom8k.sch`; it is a 16KByte ROM macro cell. Part of the ROM array is populated with actual transistors, so some read cycles are simulated and data is read out. Circuit statistics:
```
***** Device Count Summary ...
       C level 1 (Capacitor)                   1035
       M level 9 (BSIM3)                      14287
       V level 1 (Independent Voltage Source)    22
       --------------------------------------------
       Total Devices                          15344
***** Setting up matrix structure...
***** Number of Unknowns = 6677
```
Xyce (serial) report:
```
***** Total Simulation Solvers Run Time: 977.709 seconds
***** Total Elapsed Run Time:            980.18 seconds
*****
***** End of Xyce(TM) Simulation
*****
```
Ngspice report:
```
Total elapsed time (seconds) = 771.449
```
So there is not much difference in execution time (at least not orders of magnitude). Simulation results are absolutely correct for both. I have set up the schematic such that it simulates unchanged in both ngspice and Xyce. @Steven Bos it would be nice if you could test this on your parallel Xyce installation. You need to run xschem as standalone (so not in a directory with a sky130 xschemrc file), get the above models file and place it in the simulation directory as explained in the COMMANDS element of the schematic. Also you need to place the stimuli file from `.../share/doc/xschem/rom8k/stimuli.rom8k` into the simulation directory and follow the instructions in the COMMANDS element. I have made some updates to xschem for better Xyce integration and more error checks (for example against corrupted or malformed .raw files), so please update if you want to try this test. CC (@Eric Keiter @Harald Pretl @Tim Edwards)
s
Cool @Stefan Schippers! I will test this with my setup. Do you have the openmp version of ngspice installed (and thus using a parallel version of ngspice) or without this flag? If with, how many threads are you using?
s
In this test, due to the models (BSIM3 rev. 3.1), the ngspice simulation used only one thread, and so did Xyce, so they used the same computing power. I know ngspice with BSIM4 can use 2 threads, one for the matrix solver and one for the device equation calculation. But in the above test this was not the case. My ngspice is compiled with `--enable-openmp`.
From the ngspice manual: "To state it clearly: OpenMP is installed inside the model equations of a particular model. It is available in BSIM3 versions 3.3.0 and 3.2.4, but not in any other BSIM3 model, in BSIM4 versions 4.5, 4.6.5, 4.7 or 4.8, but not in any other BSIM4 model".
s
Will ngspice behave differently with or without openmp compilation? I see some differences between xyce serial and xyce parallel with 1 core. To be precise: I have not inspected any functional changes (e.g. precision), only some run time differences
I translated the stimuli file into a .cir file and downloaded the .7z file. Should I then extract the .7z file and rename a library file to models_rom8k.txt? I also pulled, configured, built and installed the latest version
s
For the models, yes that is the way to go. I don't know if I can package the models (due to license issues) inside xschem, so I provide the instructions to get them.
s
Ah good to know that in your test it forces single core compute for a fair comparison.
Which file should I rename to models_rom8k.txt?
s
I believe there are two transistors, NMOS and PMOS. Put these into a single file. Ensure the name of the NMOS model is NMOS and the pmos name is PMOS. Or get it here :-)
sorry, the names are cmosn and cmosp, respectively. But get the file I uploaded,
s
I get several messages in the info window. Is this correct?
s
yes no problem
s
Ok, I'll make the model file and see if the results are similar to yours
it is running. If it runs even close to yours this will take at least 2x 10 min
s
My laptop is a 13 years old core i3, so you will be faster 🙂
s
Indeed ngspice was a bit faster on my laptop:
image.png
correct waves (i think)
s
Did you see the launchers in the top schematic? There are 2 for launching the simulation and 2 for loading the waveforms. I have set different netlist/raw filenames to avoid overwriting. xschem is structured such that you can run more simulations in parallel even on the same schematic if you do that with different output files. This is easy with a tcl script if you look at the launchers' embedded tcl code. From the GUI, if you press the Simulate button it turns red and is disabled until the simulation is finished. This is because it would be terrible if a user clicked the button 10 times....
ok, just for test please read the last LDQ[15:0] pattern
Shift + mouse wheel inside the graph will zoom in/out the waves
while mouse wheel with mouse outside the graphs will zoom in/out the schematic
s
image.png
s
I meant the whole output data bus, LDQ[15:0] (LDQ in the graph), but I am sure the results are correct
s
Great.
I do get some errors when trying xyce
and yes I am using the launcher buttons for both running the sim and loading data
s
when I was a designer I often used C1A0 as a pass code, and F16A as a fail code. In Italian CIAO means "Hello", and FIGA means ... ehm... "pussy"
s
hahahaha
i like those insider tidbits 😄
s
better than using FFFF or 0000 which often occur regardless of good/bad working parts 🙂
let me know the Xyce error... I have Xyce rev 7.5 installed,
s
error.log
maybe because my folder is not empty?
I also use xyce 7.5
s
Oh yes, maybe I forgot to check that in. Please regenerate the stimuli cir file, but first edit the source file via Simulation -> Utile stimuli editor (GUI) and change in the window the line `voltage VCC` to `voltage 1.5`. Xyce does not accept parametrized PWL functions. Then press Translate.
s
ok
s
The Utile thing is a side project I used to create complex stimuli for spice. You can describe waveforms in a more convenient way, declare busses and use macros. The help button in the Utile window explains the language. When doing Translate, everything is translated into single-bit voltage sources with PWL functions. You can also set signals to Hi-Z state, so it is quite useful for complex designs.
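For illustration, the Translate step produces plain per-bit PWL voltage sources, something like this sketch (node names and times here are hypothetical, not actual rom8k output):
```
* hypothetical Utile 'Translate' output: one PWL source per bus bit
VA0 A0 0 PWL(0 0 10n 0 10.1n 1.5 20n 1.5 20.1n 0)
VA1 A1 0 PWL(0 0 20n 0 20.1n 1.5 40n 1.5 40.1n 0)
```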
s
It is now running xyce serial. Interesting to know about that function; my simulations so far have only required simple waveforms. I mostly use the Pulse function for that
Interesting! Xyce was a fair bit slower than ngspice
s
yes, however doing a read cycle on a memory requires setting a 13 bit address bus and a number of signals (clock, enable, output enable) that must have specific timings. Once the macros are set, a read cycle in the source stimuli file is just:
```
;     add en oe
;=====================
cycle 0000 1 1
```
s
The output of xyce
I wonder why in my circuits i get the complete opposite.
Let me run this test with xyce parallel
s
Anyway a few minutes for a circuit with 14000 MOS and 16k total devices is quite good
This is a transistor level simulation; however the circuit is quite 'digital', the only real analog part being the sense amplifier. Maybe the switching logic does not give Xyce a big advantage.
s
What name is the parallel version of xyce in simulator_commands?
s
while a deep analog circuit like an ADC, PLL etc. may give different results
You must give the complete path of the simulator. Open Simulation -> configure simulators and tools and set the right path/filename
s
Yes, that could be an interesting explanation. Maybe you can confirm that my DAC test was also performed correctly?
e
Hi @Stefan Schippers and @Steven Bos this is interesting! Glad Xyce is working for you.
s
you also must edit the launcher to set the 4th simulator: `set sim(spice,default) 3 ;# 4th simulator: Xyce`
which must refer to the parallel Xyce
Hi @Eric Keiter, you guys did great work!
e
BTW, if you run parallel Xyce, this circuit is large enough that Xyce will probably attempt to use a parallel linear solver. However, at this scale, it is probably not large enough to beat the serial linear solver. So, I suggest trying the “parallel load, serial solve” option.
@Stefan Schippers thanks!
s
Ah yes, i was editing the wrong launcher. Thanks, now running xyce parallel.
s
from 7.2 to 7.5 there are a lot of improvements, and now we can really use Xyce/ngspice on the same designs with minimal changes
e
To do “parallel load serial solve”, you can just add “.options LINSOL type=klu”
s
Oh @Eric Keiter interesting! How can I set that option
e
To the netlist
s
ok will do after this test
e
That will force the linear solver to be KLU, which is a serial direct solver.
s
yes add the line in the Xyce command symbol
e
If you do that, then only the device evaluations and the parser will be done in parallel.
So, of course, based on Amdahl’s law, that will limit the theoretical parallel speed up you can get. But it is generally very hard to get a linear solver to scale perfectly for circuits. So, for moderate sized circuits keeping the linear solve serial-direct while making everything else parallel is usually best.
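For reference, Amdahl's law: if a fraction $p$ of the runtime is parallelized over $N$ processors, the overall speedup is bounded by

$$S(N) = \frac{1}{(1-p) + p/N}$$

so keeping the linear solve serial puts a hard ceiling on $p$ and thus on $S(N)$.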
@Stefan Schippers that is great to hear that the improvements from 7.2 to 7.5 have helped. I’ve been trying to track the various issues that have been reported here, and that definitely helped prioritize some things.
s
This was significantly faster than Xyce serial
e
@Steven Bos neat!
s
(the same as ngspice 169 sec exactly) I will now run with “.options LINSOL type=klu”
s
@Eric Keiter how does Xyce partition the circuit? By looking at the matrix? I know from experience with other (commercial) simulators that splitting the circuit for parallel execution is the very hard part, and if not done well (and it is difficult by looking at a spice netlist) it can get slower than a brute force flat solver.
Anyway, great simulator!
And 3 minutes to simulate several read accesses on a 16KB ROM at transistor level is impressive.
e
@Stefan Schippers Xyce does 2 different partitions, one for device evaluation, and one for the linear solver (if using a parallel linear solver). If using KLU the linear solve is just done on proc 0. For the device eval, the default partition is simply a first-come-first-served strategy, based on the ordering of devices in the netlist. For the linear solver, there are various options, but they are all based on graph partitioners like ParMETIS, where the graph just comes from the matrix.
s
42 seconds
s
Aha, here comes the hot stepper.
42 sec is the time with Xyce serial and the .option @Eric Keiter mentioned?
e
We’ve intended to look at parallel partitioning based on the (known) circuit hierarchy, but at this point Xyce doesn’t do that. The matrix graph is from (essentially) a flattened netlist.
s
will add in my test
@Eric Keiter this ROM circuit test uses `.options SCALE=0.1`. I didn't see any mention of it in the manual but I believe it works. Simulation would not succeed with 10x sizes. BTW sky130 also uses a SCALE factor, so it definitely is accepted by Xyce.
e
Yes, option SCALE is supported. I had thought we documented it, but possibly it was overlooked.
s
Hmm the data is wrong though
s
Oh, yes found it in the ref man. 2.1.21.3. Length Scaling
s
image.png
42 sec was with parallel xyce. Should i run this with serial xyce?
I also get these errors:
s
these are warnings, and yes there are some transistors in the ROM array with drain not connected to the bitline; there are also unrecognized model parameters, but these warnings appear with ngspice as well.
If you see Completed at the very beginning it means the process returned EXIT_SUCCESS.
s
maybe I added the .options command wrongly? Now the simulation is running a bit longer. `.options SCALE=0.10 .options LINSOL type=klu` vs `.options SCALE=0.10 LINSOL type=klu`
s
I put it on a separate .options line just to be sure it is accepted
I don't see big differences (on serial Xyce, the only one I have installed)
s
Ok. The longer simulation was serial xyce with the linsol option
which seems to give correct output
image.png
s
looks good
e
.options SCALE and .options LINSOL type=klu need to be on separate lines.
s
so the question is why the 42 sec simulation produced wrong results.
maybe your LINSOL option on the same line as the SCALE also invalidated the SCALE factor, and boom, nothing works
s
i think so
I am now running the parallel version again and it is past 42 seconds
e
If SCALE got ignored, it would certainly cause the results to be wrong. I think if Xyce sees an unrecognized .options command, it will print a warning and run anyway.
s
This time we got good data and ...
s
yes, to my knowledge no design is robust enough to survive a 10x increase in L and W
s
image.png
s
Good! success🍺
s
173 seconds, meaning no real speed up compared to 169 but at least quality results
e
In general, all the “.options” commands go to specific parts of the code. The long way to specify “.options scale” is to actually have “.options parser scale=#”.
So “options parser” and “options LINSOL” are different metadata, and would conflict.
Anyway, glad the result matches now!
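Putting that together, a minimal sketch of the corrected option lines (values taken from this thread):
```
* each .options statement on its own line
.options SCALE=0.1
.options LINSOL type=klu
* long form of the first line, per the note above:
* .options PARSER SCALE=0.1
```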
s
177 sec is a great result anyway. I am curious what HSIM would be able to do , lol
e
If you were running parallel I am a little surprised it wasn’t faster than serial. How many processors were you using?
s
6
e
To run parallel, the command would be `mpirun -np N Xyce netlist.cir`
OK, 6.
I would have expected some speed up over serial in that case.
Actually, I’ve lost track of how long the serial Xyce calculation took. Was it the 260sec result?
s
TL;DR: ngspice 169s; parallel xyce w/o linsol 169s; parallel xyce w/ linsol 173s; serial xyce 267s
e
AH, ok. Thanks!
s
I think so, yes (@Steven Bos confirm?)
s
these are single measurements, I don't know how deterministic this process is
s
I have similar results for ngspice and Xyce serial (scaled to my laptop cpu)
e
Understood, makes sense.
s
single measurements but long simulations, so if the machine is not busy I think the results are repeatable
Thank you @steven for testing Xyce parallel
s
maybe you can test my 5bit DAC @Stefan Schippers? That one is weirdly much faster on xyce than ng for some reason
e
Interesting.
s
Sure, can you share the schematics? Or the netlist if you prefer
s
No problem, it is fun to test these things and I learn a lot from the interactions
s
yes, doing these tests helps a lot. I have also fixed a crashing bug when xschem attempts to read a raw file Xyce has not finished writing, so always improving
s
Also, @Eric Keiter, in another thread I tested a 7 bit DAC where parallel xyce was significantly outperforming serial xyce. But the 8 bit DAC was the reverse: xyce parallel found 24 singularities and the full simulation would have taken 9 days (I stopped after 2%).
Possibly the scale of the circuit? I don't think it has more devices than this circuit though
e
Weird. How large (# of unknowns and/or devices)?
s
Let me look up the thread
e
If the circuit is under about 100k unknowns, it is usually best to use the “.options LINSOL type=klu” (parallel device eval, serial solve) setting
s
@Eric Keiter I started working seriously on Xyce integration when I saw the m= parameter was implemented on subcircuit calls. This is soooo much used in big system simulations for ganging together idle subcircuits without reducing the loading capacitances. One subcircuit with m=256 simulates much faster than 256 instances of the same subcircuit!
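As a sketch of the idea (subcircuit and node names here are made up):
```
* one multiplied instance of a hypothetical 'cell' subcircuit...
X1 vdd vss bitline cell m=256
* ...simulates much faster than 256 separate instances:
* X1 vdd vss bitline cell
* X2 vdd vss bitline cell
* ... and so on up to X256
```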
e
Oh, yes. I can see that “M=“ would be very important! I should have implemented it a long time ago, but better late than never!
s
well done!
e
Back to the linear solver, etc.: I think that when the number of unknowns is 10,000 or less, parallel Xyce automatically will use KLU (the serial direct solver). But, in reality, the trade-off point for the parallel solver is more like 100k.
s
I will try that option
e
So, between 10k and 100k, it needs to be set manually, and it is the right thing to do. Above 100k, it is harder to say which is best. I've seen some circuits where KLU is still the better choice at 500,000 unknowns. But I wouldn't count on it at that point.
s
500k at transistor level is a big beast, lol
e
Yes!
s
When it finds singletons, does it mean that xyce can't partition?
e
Singletons are actually fine. If Xyce is reporting them, however, that means you are using the parallel solver.
For the parallel solver, Xyce does several pre-processing steps to the matrix to make it easier to solve. The first step is something we call “singleton removal”.
s
Hmm i wonder why the parallel solver is then having trouble with the circuit while the serial solver can solve it in 700s
e
Basically, what it is doing is identifying matrix structures that are associated with ideal voltage sources (like power supplies, clocks, etc). For ideal sources, they tend to be connected to a lot of things, which causes communication problems for a parallel solver. However, since they are ideal, we already know the solution for those nodes and there isn’t any point in including them in the matrix solve. So, the “singleton removal” function factors them out before moving to the next step
s
Ah thanks for sharing that!
e
It is generally difficult to get parallel matrix solvers to work well with circuit matrices. There are a handful of specialized pre conditioners that we’ve developed (that are applied after singleton removal), but they tend to only be effective on certain types of circuits.
So, the default preconditioner is a fairly simple one, that has a “one-size-fits-all” quality to it.
s
Right, so results with parallel xyce really depend on the circuit, but if suitable can lead to great speed up
s
@Steven Bos can you do a TL;DR of the sim times for your 10 bit DAC with ngspice, Xyce serial and Xyce parallel?
s
Yes that was my plan!
but so far xyce parallel was not running anything higher than 7 bits. I will now test it with the suggestion @Eric Keiter gave above using KLU
e
In general, even when the parallel matrix solver is working well, it can't beat a good direct solver. The parallel solvers are based on GMRES, and even serial GMRES will be slower than a direct solve for small circuits. But GMRES is easy to do in parallel, and direct solver methods are very hard to do in parallel. Also they scale very differently: direct solvers (even the best ones) scale much worse. So, eventually, there is a problem size where the direct solver loses.
s
Yeah for my circuit that was around 6 bit DAC where parallel came out on top
e
So, up in the 100s of thousands of unknowns, the direct solver just can’t hack it any more. After that, the iterative (GMRES based) solvers are the only option.
The holy grail would be direct solver that actually scales well in parallel. We’ve worked on this, but it is hard.
s
Great, they should teach this practical wisdom
I will share my 6 bit DAC netlist with you @Stefan Schippers, if you have time please give it a comparison between ng and xyce
s
sure I will. (I only have Xyce serial; maybe in the next days I will take a big breath and build the parallel version as well)
s
yes that was quite a struggle :D
No idea if this will be useful to you, but this was my installation log of all the dependencies incl. trilinos serial, parallel, regression-suite, xyce serial and xyce parallel. It is not a step-by-step guide, but all the important steps should be in there
s
Thanks i have saved the file for reference. 👍
I am looking at your DAC10 netlist. I would suggest increasing the rise/fall times of the voltage sources. Since the period varies between 10u and 5120u, using 1ps rise/fall is probably slowing down the simulator, requiring very small timesteps around the transitions. I suggest using 100ps or something, but I am not sure if increasing the rise/fall in the pulse voltage source lines without shifting the start time will keep the signals correctly aligned.
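As a sketch of that change (source name, levels and timing here are hypothetical; the point is only the 1ps -> 100ps edges):
```
* before: 1ps edges force tiny timesteps at each transition
* VD0 D0 0 PULSE(0 1.8 0 1p 1p 5u 10u)
* after: relaxed 100ps edges
VD0 D0 0 PULSE(0 1.8 0 100p 100p 5u 10u)
```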
s
Good one, the parallel xyce complained about too small time steps
I slightly redesigned the DAC; it was built around a digitally controlled analog switch, but the design I used seemed weird when doing a DC sweep
so it is taking a bit longer to propagate all the changes and create new netlists
The design constraint was to switch analog voltages between 3.3V and 0V using 1.8V digital logic.
s
The classical voltage shifter is this one:
Usually with a buffer 3.3V inverter on the output
s
What are the drawback/benefits of this design compared to the one above?
s
Your switch has nfets with the body connected to OUT. This can be done using a triple insulated well, but usually the nfet body terminal is at 0V.
on cmos processes with no insulated p-wells (which require buried nwells) that switch cannot be done
s
Ah. so this design would not work for the sky130 process?
s
It can be done, sky130 has insulated p-well, but surely takes more space due to the numerous tap rings
s
Right. good to know. I will be experimenting with layout and post layout simulation in the next weeks
s
1.png
s
Wow! Same curve but a much cleaner design. I have so much to learn about designing 😄
Does the output need a 3.3V inverting buffer like you mentioned? Or can I chain this to the next one without a buffer, like so
s
yes if there is a capacitive loading this is the right structure:
if output load (the capacitance attached to the voltage shifter output) is low you can avoid the buffering inverter.
s
I was thinking of 1pF load capacitance
s
Oh, I see there is a problem. Your VREFL is not always 0 and VREFH is not always 3.3. Then the above structure is not good. I thought you needed to switch between 2 fixed levels, 0 and 3.3
ok delete my last posts 🙂
s
no problem, i found it very useful anyway!
s
well, if you need a 1.8 --> 3.3 voltage domain shifter the above is the way to go
s
The design i already made?
s
no i mean the one with the cross-coupled p channels
s
ah, for pure switching of 1.8 to 3.3. Gotcha!
s
if you need an analog switch that uses arbitrary VREFL and VREFH levels your implementation is probably the way to go.
s
Great, I think I will continue to layout then, after a comparison plot of the 1 to 10 bit DACs.
I noticed that the .raw files blow up quite a bit; the 7 bit one is 450MB
Thanks for all the help @Stefan Schippers and @Eric Keiter. I will call it a night and will produce the plot tomorrow.
s
This switch implementation is probably safe; it will switch between arbitrary VREFL / VREFH levels anywhere between 0 and 3.3V. It is similar to your design but uses 3.3V signals (X and XB) to drive the transmission gates instead of your ininv / inbuf signals at 1.8V. Also, if the IN signal is driving 2 switches, the X-XB generator can be shared between the 2 switches. Also no insulated pwell is required, and no static consumption occurs, for any switch state and any vrefh/vrefl combination.
... Always plot the voltage supply currents, so you see if there are static leakage paths. In this case there is consumption only during switching.
@Steven Bos I did a comparison of Xyce serial vs ngspice on your DAC10 test netlist. I changed all voltage sources to be 0 at time 0 and ramp up to their values after 1ns. I also set the rise and fall times to 100ps and used `uic` in the `.tran` lines. Also, to speed things up, I simulated only the first 128usec. ngspice report:
```
Transient analysis time = 269.679
Total analysis time (seconds) = 271.359
```
Xyce serial report:
```
***** Total Simulation Solvers Run Time: 831.252 seconds
***** Total Elapsed Run Time:            1277.15 seconds
*****
***** End of Xyce(TM) Simulation
```
Results are correct for both (out voltage = 3.3 * decimal(D[9:0]) / 1024). Ngspice used 2 cores (max 140% cpu usage) for the solver and equation calculations. Xyce used only one core (max 100%). Simulation time on Xyce serial was significantly higher. (cc @Eric Keiter)
@Steven Bos the following (long!) expression in a graph (RPN notation):
```
"ERROR;
d0
d1 2 * +
d2 4 * +
d3 8 * +
d4 16 * +
d5 32 * +
d6 64 * +
d7 128 * +
d8 256 * +
d9 512 * +
1.8 /
1024 /
3.3 *
out -
"
```
calculates the DAC error: it computes the theoretical voltage by summing the weighted binary bits, normalizing to 1V by dividing by the VCC level (1.8) and by 1024, and then scaling to 3.3V. The `out` node is subtracted from the result. This error will be significant if you do mismatch simulations.
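In formula form, my reading of the expression above, with $d_i$ the 1.8V logic bit waveforms:

$$\mathrm{ERROR} = \frac{3.3}{1024 \cdot 1.8} \sum_{i=0}^{9} 2^i\, d_i \;-\; \mathrm{out}$$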
s
Wow @Stefan Schippers! Thank you again for sharing your design experience; I think more people from this channel and other analog beginners like myself should take notice. I will go with your design (with full credit, and I will share the github repo when submitted for tapeout). The error graph is something I was planning to do in Excel or Matlab. Doing it automatically after graph loading in xschem is of course the proper way of doing it.
I am still investigating why my DACs show the inverse result, with ngspice being slower. Could you run this 5-bit DAC netlist (still based on my switch design)? It should be small enough to run a full sim. Something must be very wrong for ngspice as these are 10x differences. My logs and netlists are attached.
- ngspice (default, using 2 threads): 220s
- ngspice (set threads to 12): 215s
- xyce serial (serial load, serial solve w/ KLU linear solver): 19s
- xyce parallel (parallel load, parallel solve): 28s
- xyce parallel (parallel load, serial solve w/ `linsol klu`): 26s
My ngspice sims are done with your .spiceinit file in the netlist directory, but without any of the sky130 lib and corner optimizations mentioned in an earlier thread.
This is the schematic
Could this be a hardware difference, like a new cpu instruction or more memory? I am running Ubuntu 20.04 LTS using the Windows Subsystem for Linux v2 (Windows 11).
Though, the rom_8k test gave the same relative results on both our platforms. Yours: 771s (ngspice) vs 980s (xyce serial) means ngspice is +27% faster. Mine: 169s (ngspice) vs 267s (xyce serial) means ngspice is +58% faster. And xyce parallel load and solve was also 169s, so ngspice and xyce parallel tied perfectly on this circuit.
@Stefan Schippers I managed to get plots similar to your 2:1 analog mux design. My titles are a bit wrong though; it has 1 output, so it should be named a 1-channel analog switch.
Though, the current variable in xyce is loaded as V1#branch while in ngspice it is i(v1), meaning I cannot refresh my data in the graphs
The transient analysis of the same device
~100ps delay
ngspice does it in 10.5s while xyce serial in 0.8s
s
@Steven Bos about current names saved in the raw file: I do not save all variables with the `-r file.raw` given on the command line. Without this, I specify with `.print tran format=raw ...` the variables I want to save (the xschem `spice_probe.sym` attached to the nets does exactly that). For currents I explicitly save the ones I need, as in: `.print tran format=raw i(vvcc) i(vsa) i(vl) i(vdec)`, and these currents retain the name given on the .print line. If you remember, the rom8k reloads all variables including currents, both from ngspice raw and from Xyce raw. However, another thing I will do is to look for V1#branch (either upper/lower case) if the search for i(v1) / I(V1) fails.
s
@Stefan Schippers Thanks! It is a minor nuisance; just duplicating them for both ngspice and xyce is a good workaround for me. Other question: when designing my analog switch I frequently run both DC and tran analyses. I have currently assigned ngspice for tran and xyce for DC and thus reuse my graphs for both of them. Maybe a graph could be paired to a certain simulator or file such that both graphs can exist at the same time when doing a refresh?
I have not run more run time comparisons as I first want to learn about your experience with the 5-bit DAC netlists sent earlier.
In the meantime I have been exploring your level shifter combined with the transmission gates. I noticed you used the thick oxide FETs for the level shifter and transmission gates. Will the 1.8V devices be damaged when supplied with 3.3V? How did you decide on the W/L?
Since we use a transmission gate, I thought it would be interesting to experiment with negative voltages such that the DAC can supply both. My current design uses two level shifters, one for the plus rail and one for the negative rail, each connecting to a transmission gate. To make it work I had to tweak the L/W of the negative level shifter. Is my approach with 2 level shifters a good one for this requirement, and is it acceptable (manufacturable) to have different L/W transistors in a cell? As you can see in the leftmost graph, I still need some more tweaking of the negative level shifter. ATM it is more trial and error than science.
s
@Steven Bos I did the DAC05 comparison, DAC05_simulation_ngpsice_2threads.spice vs DAC05_simulation_xyce_linsol.spice. Got the same results as you, Xyce being much faster.
```
***** Total Simulation Solvers Run Time: 51.5038 seconds
***** Total Elapsed Run Time:            89.2508 seconds
*****
***** End of Xyce(TM) Simulation
*****
```
However, in the ngspice file you did a '`save all`' while only selected variables are saved in Xyce. Moreover, the `.tran 1n 160u uic` causes Xyce to simulate and save 3164 timesteps in the raw file while ngspice saves 160000 points! (No. of Data Rows : 160819). I changed the ngspice commands to save the same variables as Xyce, set `tran 50n 160u uic`, and the simulation completed in 55 seconds:
```
Total analysis time (seconds) = 24.99
Total elapsed time (seconds) = 55.076
```
In the above run I also added `.option chgtol=4e-16 method=gear` to increase precision and use a better integration method. This way ngspice saved 5233 points in the raw file, a number that is comparable to Xyce. The simulators take different timestep decisions: ngspice tends to use the given time step in the saved file, while Xyce makes its own decision regardless of this parameter.
For currents that are saved either as i(vvcc) or VVCC#branch: before I fix xschem to look for either one or the other (ngspice itself in some cases uses one or the other naming, which is really annoying and needs to be fixed), you can simply add i(vcc) and vcc#branch in the graph. The one that is found will be shown.
For having both Xyce and ngspice graphs at the same time in the same schematic: this is currently not possible. For a given schematic only one raw file can be loaded. Loading another file will unload the previous one. You can, however, open the same schematic in another tab (xschem gives a warning, since opening the same schematic in 2 tabs can be dangerous if editing and saving both). In the new tab you can load another file.
s
@Stefan Schippers I noticed that for the rom_8k test xyce produces a 222MB raw file while ngspice produces a 23MB one; that could explain why ngspice is faster in this test. Can you confirm this on your laptop? As you can see, this was due to xyce recording many more variables.
As you can see, for all my circuits xyce was significantly faster. Especially the first circuit, a simple analog switch made with the help of @Stefan Schippers, runs in 0.8s, giving it an almost real-time vibe vs 10s with ngspice. In the video you can also see the clear differences in .raw files: xyce records more variables, but ngspice more samples
s
for short simulations ngspice sim time is dominated by model file parsing. @Steven Bos if you use `sky130_fd_pr/corner.sym`, which feeds ngspice only the desired corner, things speed up considerably. This is of course a dirty workaround for something ngspice should fix.
@Steven Bos I have now removed in xschem the -r command line option for Xyce in the default configuration, leaving it as a comment so users can easily enable it if they wish. You can anyway easily save all voltages in a given hierarchy with `.print tran format=raw file=... v(*)`
s
I have rerun the ROM_8k test such that both ngspice and xyce only sample `i(vvcc) i(vsa) i(vl) i(vdec)`, by removing all the `.print tran format=raw v(xctrl:LDCPB)` lines from the xyce spice file. The raw file sizes are now more or less equal (~23MB), and both report almost the same variables and datapoints. The runtime statistics reported earlier do not change, meaning Xyce serial is a fair bit slower than ngspice with 2 threads, while Xyce parallel is just as fast.
I will see if I can improve my standard simulations with ngspice using the corner hack mentioned earlier next
@Stefan Schippers thanks for helping me look into this. I hope others will benefit from it as well. Since many new users will start with small circuits using the sky130 lib, having a 10x improvement in simulation time is too good to ignore. Also, xyce parallel is a great addition to xyce serial, especially since ngspice compiled with openmp is also doing work in parallel by default. Of the circuits tested so far, ngspice came out on top in none of them (incl. ROM_8k), and only tied at best. I would really like to see a formal benchmark with standard circuits (big and small, out-of-the-box and simulator tuned) for a more conclusive answer. I hope the developers of ngspice will address the corner issue soon. For the developers of xyce (@Eric Keiter): much improvement can be made in the building of xyce serial and parallel. The current build process will discourage new users from trying it.
@Stefan Schippers this comment of yours is probably the most important: "_The simulators take different timestep decisions. ngspice tends to use the given time step in the saved file, while Xyce takes its own decision regardless of this parameter_". So for example `.tran 1ns 160us` will not result in 160k samples from xyce. This is quite magical. How is this dynamic time stepping done while, at first glance, there is no loss of data (does it ignore redundant data points)? Can it be switched on in ngspice as well? Changing the stepsize in the netlist such that ngspice and xyce have the same number of datapoints seems like cheating for the comparison. The ROM_8k test has identical .tran settings, which gave good results for both ngspice and xyce. Maybe that circuit has much more signal variation and thus xyce cannot do dynamic stepsizes?
s
Yes, I think the rom8k has no idle periods. I have seen Xyce is much more efficient at jumping the simulation forward when the circuit is idle, while ngspice takes longer. For the step size calculation, modern iterative solvers also have an error estimator. If the estimated error (which is in most cases a lower limit of the real error) is far lower than the allowed error, then take longer timesteps. On the other hand, if the error estimate is too high, reduce the step and repeat the calculation.
s
Thanks for confirming!
I am quite surprised that ngspice doesn't even attempt to do this; this is a huge feature. It is great that it follows the analysis command to the letter (e.g. if you want 100 samples, you get 100 samples), but there should be an option to let this be dynamic and to tweak the allowable error limits. In my circuit, 160k data points vs 3k data points is like no compression vs lossless compression: the result is identical
s
well, ngspice also speeds up considerably when the circuit is idle, but it seems Xyce does that faster
s
I wonder if we can get a report about the step size change decisions so we can measure the total error
Based on your experience, is it more common to simulate circuits that idle than not?
s
I was not 100% precise when I said the rom has no idle: after the read cycles are completed @400ns, the simulation goes all the way to 480ns with no circuit activity. The time points saved by Xyce (xschem raw_query values time) between 400ns and the end are:
4.0220587039e-07 4.0714201077e-07 4.1701429154e-07 4.3675882466e-07 4.7624794774e-07 4.7999998287e-07
the time points in the same interval saved by ngspice are:
4.0035197912e-07 4.0055198269e-07 4.0075198626e-07 4.0095196141e-07 4.0115196498e-07 4.0135196855e-07 4.0155197212e-07 4.0175197569e-07 4.0195197926e-07 4.0215198283e-07 4.0235198639e-07 4.0255196154e-07 4.0275196511e-07 4.0295196868e-07 4.0315197225e-07 4.0335197582e-07 4.0355197939e-07 4.0375198296e-07 4.0395198653e-07 4.0415196167e-07 4.0435196524e-07 4.0455196881e-07 4.0475197238e-07 4.0495197595e-07 4.0515197952e-07 4.0535198309e-07 4.0555198666e-07 4.0575196181e-07 4.0595196538e-07 4.0615196895e-07 4.0635197252e-07 4.0655197608e-07 4.0675197965e-07 4.0695198322e-07 4.0715198679e-07 4.0735196194e-07 4.0755196551e-07 4.0775196908e-07 4.0795197265e-07 4.0815197622e-07 4.0835197979e-07 4.0855198336e-07 4.0875198692e-07 4.0895196207e-07 4.0915196564e-07 4.0935196921e-07 4.0955197278e-07 4.0975197635e-07 4.0995197992e-07 4.1015198349e-07 4.1035198706e-07 4.1055196220e-07 4.1075196577e-07 4.1095196934e-07 4.1115197291e-07 4.1135197648e-07 4.1155198005e-07 4.1175198362e-07 4.1195198719e-07 4.1215196234e-07 4.1235196591e-07 4.1255196948e-07 4.1275197304e-07 4.1295197661e-07 4.1315198018e-07 4.1335198375e-07 4.1355195890e-07 4.1375196247e-07 4.1395196604e-07 4.1415196961e-07 4.1435197318e-07 4.1455197675e-07 4.1475198032e-07 4.1495198388e-07 4.1515195903e-07 4.1535196260e-07 4.1555196617e-07 4.1575196974e-07 4.1595197331e-07 4.1615197688e-07 4.1635198045e-07 4.1655198402e-07 4.1675195916e-07 4.1695196273e-07 4.1715196630e-07 4.1735196987e-07 4.1755197344e-07 4.1775197701e-07 4.1795198058e-07 4.1815198415e-07 4.1835195930e-07 4.1855196287e-07 4.1875196644e-07 4.1895197000e-07 4.1915197357e-07 4.1935197714e-07 4.1955198071e-07 4.1975198428e-07 4.1995195943e-07 4.2015196300e-07 4.2035196657e-07 4.2055197014e-07 4.2075197371e-07 4.2095197728e-07 4.2115198085e-07 4.2135198441e-07 4.2155195956e-07 4.2175196313e-07 4.2195196670e-07 4.2215197027e-07 4.2235197384e-07 4.2255197741e-07 4.2275198098e-07 4.2295198455e-07 4.2315195969e-07 4.2335196326e-07 4.2355196683e-07 4.2375197040e-07 4.2395197397e-07 4.2415197754e-07 4.2435198111e-07 4.2455198468e-07 4.2475195983e-07 4.2495196340e-07 4.2515196697e-07 4.2535197053e-07 4.2555197410e-07 4.2575197767e-07 4.2595198124e-07 4.2615198481e-07 4.2635195996e-07 4.2655196353e-07 4.2675196710e-07 4.2695197067e-07 4.2715197424e-07 4.2735197781e-07 4.2755198137e-07 4.2775198494e-07 4.2795196009e-07 4.2815196366e-07 4.2835196723e-07 4.2855197080e-07 4.2875197437e-07 4.2895197794e-07 4.2915198151e-07 4.2935198508e-07 4.2955196022e-07 4.2975196379e-07 4.2995196736e-07 4.3015197093e-07 4.3035197450e-07 4.3055197807e-07 4.3075198164e-07 4.3095198521e-07 4.3115196036e-07 4.3135196393e-07 4.3155196749e-07 4.3175197106e-07 4.3195197463e-07 4.3215197820e-07 4.3235198177e-07 4.3255198534e-07 4.3275196049e-07 4.3295196406e-07 4.3315196763e-07 4.3335197120e-07 4.3355197477e-07 4.3375197833e-07 4.3395198190e-07 4.3415198547e-07 4.3435196062e-07 4.3455196419e-07 4.3475196776e-07 4.3495197133e-07 4.3515197490e-07 4.3535197847e-07 4.3555198204e-07 4.3575198561e-07 4.3595196075e-07 4.3615196432e-07 4.3635196789e-07 4.3655197146e-07 4.3675197503e-07 4.3695197860e-07 4.3715198217e-07 4.3735198574e-07 4.3755196089e-07 4.3775196445e-07 4.3795196802e-07 4.3815197159e-07 4.3835197516e-07 4.3855197873e-07 4.3875198230e-07 4.3895198587e-07 4.3915196102e-07 4.3935196459e-07 4.3955196816e-07 4.3975197173e-07 4.3995197530e-07 4.4015197886e-07 4.4035198243e-07 4.4055198600e-07 4.4075196115e-07 4.4095196472e-07 4.4115196829e-07 4.4135197186e-07 4.4155197543e-07 4.4175197900e-07 4.4195198257e-07 
4.4215198614e-07 4.4235196128e-07 4.4255196485e-07 4.4275196842e-07 4.4295197199e-07 4.4315197556e-07 4.4335197913e-07 4.4355198270e-07 4.4375198627e-07 4.4395196142e-07 4.4415196498e-07 4.4435196855e-07 4.4455197212e-07 4.4475197569e-07 4.4495197926e-07 4.4515198283e-07 4.4535198640e-07 4.4555196155e-07 4.4575196512e-07 4.4595196869e-07 4.4615197226e-07 4.4635197582e-07 4.4655197939e-07 4.4675198296e-07 4.4695198653e-07 4.4715196168e-07 4.4735196525e-07 4.4755196882e-07 4.4775197239e-07 4.4795197596e-07 4.4815197953e-07 4.4835198310e-07 4.4855198666e-07 4.4875196181e-07 4.4895196538e-07 4.4915196895e-07 4.4935197252e-07 4.4955197609e-07 4.4975197966e-07 4.4995198323e-07 4.5015198680e-07 4.5035196194e-07 4.5055196551e-07 4.5075196908e-07 4.5095197265e-07 4.5115197622e-07 4.5135197979e-07 4.5155198336e-07 4.5175198693e-07 4.5195196208e-07 4.5215196565e-07 4.5235196922e-07 4.5255197278e-07 4.5275197635e-07 4.5295197992e-07 4.5315198349e-07 4.5335198706e-07 4.5355196221e-07 4.5375196578e-07 4.5395196935e-07 4.5415197292e-07 4.5435197649e-07 4.5455198006e-07 4.5475198363e-07 4.5495198719e-07 4.5515196234e-07 4.5535196591e-07 4.5555196948e-07 4.5575197305e-07 4.5595197662e-07 4.5615198019e-07 4.5635198376e-07 4.5655195891e-07 4.5675196247e-07 4.5695196604e-07 4.5715196961e-07 4.5735197318e-07 4.5755197675e-07 4.5775198032e-07 4.5795198389e-07 4.5815195904e-07 4.5835196261e-07 4.5855196618e-07 4.5875196975e-07 4.5895197331e-07 4.5915197688e-07 4.5935198045e-07 4.5955198402e-07 4.5975195917e-07 4.5995196274e-07 4.6015196631e-07 4.6035196988e-07 4.6055197345e-07 4.6075197702e-07 4.6095198059e-07 4.6115198415e-07 4.6135195930e-07 4.6155196287e-07 4.6175196644e-07 4.6195197001e-07 4.6215197358e-07 4.6235197715e-07 4.6255198072e-07 4.6275198429e-07 4.6295195943e-07 4.6315196300e-07 4.6335196657e-07 4.6355197014e-07 4.6375197371e-07 4.6395197728e-07 4.6415198085e-07 4.6435198442e-07 4.6455195957e-07 4.6475196314e-07 4.6495196671e-07 4.6515197027e-07 4.6535197384e-07 4.6555197741e-07 4.6575198098e-07 4.6595198455e-07 4.6615195970e-07 4.6635196327e-07 4.6655196684e-07 4.6675197041e-07 4.6695197398e-07 4.6715197755e-07 4.6735198111e-07 4.6755198468e-07 4.6775195983e-07 4.6795196340e-07 4.6815196697e-07 4.6835197054e-07 4.6855197411e-07 4.6875197768e-07 4.6895198125e-07 4.6915198482e-07 4.6935195996e-07 4.6955196353e-07 4.6975196710e-07 4.6995197067e-07 4.7015197424e-07 4.7035197781e-07 4.7055198138e-07 4.7075198495e-07 4.7095196010e-07 4.7115196367e-07 4.7135196724e-07 4.7155197080e-07 4.7175197437e-07 4.7195197794e-07 4.7215198151e-07 4.7235198508e-07 4.7255196023e-07 4.7275196380e-07 4.7295196737e-07 4.7315197094e-07 4.7335197451e-07 4.7355197808e-07 4.7375198164e-07 4.7395198521e-07 4.7415196036e-07 4.7435196393e-07 4.7455196750e-07 4.7475197107e-07 4.7495197464e-07 4.7515197821e-07 4.7535198178e-07 4.7555198535e-07 4.7575196049e-07 4.7595196406e-07 4.7615196763e-07 4.7635197120e-07 4.7655197477e-07 4.7675197834e-07 4.7695198191e-07 4.7715195706e-07 4.7735198905e-07 4.7755196420e-07 4.7775199619e-07 4.7795197133e-07 4.7815194648e-07 4.7835197847e-07 4.7855195362e-07 4.7875198561e-07 4.7895196076e-07 4.7915199275e-07 4.7935196790e-07 4.7955199989e-07 4.7975197504e-07 4.7995195018e-07 4.7999998287e-07
s
Have you done this test with the xyce netlist without `.print tran format=raw v(xctrl:LDCPB)`? Ngspice has so many datapoints in that period, while the final .raw files are near identical in terms of total data points. That means that in other parts of the simulation xyce must be sampling a lot more while ngspice doesn't
s
the rom8k circuit contains spice probes; these automatically create a `.print tran format=raw v(<netname>)` for the net they are attached to. This way a limited set of nodes is saved. The above time values represent the times at which all variables are saved in the raw file. Unfortunately the raw file format is extremely simple (and this makes it extremely easy to read), but if there is one node changing in a circuit with 100k nodes, all the other 99999 nodes are saved at that time point too.
s
I had to remove all these `.print tran format=raw v(xctrl:LDCPB)` lines, otherwise the xyce raw file exploded to 222 MB (10x), see earlier threads. Without these lines, both have near identical variables and datapoints
s
... And yes, I believe Xyce does much better sampling around fast transitions. ngspice also does that, but with less deviation from the user specified time step. At least this is my impression.
s
sorry, not data points, but only the variables exploded, which was 10x (700 variables vs 7000 with those extra prints)
s
that is strange; the spice_probe elements cause ~700 variables to be saved. This works if Xyce is launched without the -r option. The -r was included in the default Xyce launch command; I have now removed it by default. There is also one thing that needs to be fixed: ngspice understands a .save xxx placed inside a subcircuit. If the subcircuit is instantiated as X1 at the top, then x1.xxx is saved. If there is also another instance x2 of it, x2.xxx is saved as well. Xyce does not understand spice probes placed in lower hierarchies, so I had to manually write the .print tran format=raw ... for these lower level nodes at the top.
s
The test with 700 variables for both xyce and ngspice was done today after I pulled your latest version. I ran xyce without the -r option, but removed those print lines before I ran it
s
Anyway, thanks to your tests and comments I have done some commits to make the handling of different simulators easier. The terrible combination of upper/lower/mixed case of saved nodes was slowing down node lookups in graphs. I have now decided that regardless of the simulator, when xschem loads the raw file all nodes are converted to lower case (ngspice convention) and all hierarchy separators are converted to "." (ngspice convention; xyce uses ":").
@Steven Bos if you did not have -r and removed the .print lines, where did Xyce get the list of nodes to be saved?
s
it saved the .raw files in .xschem/simulations
I start up xschem from /home, load rom_8k.sch from /usr/local/share/doc/xschem/rom_8k and then hit the simulate launchers. I don't check 'use simulation dir under current schematic dir'.
Sorry, you meant something else I think. The only print line I left out was `.print tran format=raw i(vvcc) i(vsa) i(vl) i(vdec)`, which is similar to the `save tran i(vvcc) i(vsa) i(vl) i(vdec)` used in the ngspice netlist
s
ok, I understand. The above line just adds 4 currents to the ~700 nodes, so it should not increase the output file that much.
s
I like the probe symbol feature vs writing it using save. It is much less error prone. I will use it in the future
s
@Steven Bos there is another big example in the xschem_sky130 tests: `sky130_tests/test_carry_lookahead.sch`. This example does a comparison between 32 bit and 256 bit ripple carry adders vs 32 bit and 256 bit carry lookahead adders. The design has 27k devices and 80k unknowns:
```
***** Device Count Summary ...
       C level 1 (Capacitor)                   3763
       M level 14 (BSIM4)                     23010
       V level 1 (Independent Voltage Source)   515
       --------------------------------------------
       Total Devices                          27288
***** Setting up matrix structure...
***** Number of Unknowns = 80343
```
ngspice and Xyce take approximately the same time to simulate this. Ngspice:
```
Total analysis time (seconds) = 2710.63
Total elapsed time (seconds) = 4191.515
```
Xyce:
```
***** Total Simulation Solvers Run Time: 2103.34 seconds
***** Total Elapsed Run Time:            4147.71 seconds
*****
***** End of Xyce(TM) Simulation
```
s
Excellent! I will run it later today. Since this is the second non-trivial test with surely more to come, a simulator benchmark suite in the xschem repo could be interesting!
s
nice! Get the updated version at https://github.com/StefanSchippers/xschem_sky130/blob/main/sky130_tests/test_carry_lookahead.sch since I updated it to work with Xyce and ngspice. One big thing that I need to test is to verify if and how mismatch simulations can be done with Xyce, like the one done with ngspice on `sky130_tests/test_comparator.sch`
e
Hi @Stefan Schippers and @Steven Bos, it looks like I missed a lot of posts here over the weekend! I’ll try to answer your questions, but I’m not sure if I’ve read them all.
It looks like a lot of the questions pertain to dynamic time-stepping and/or output files? If so I can make some very general comments.
s
Hi @Eric Keiter, yes indeed since that seems to be a big differentiator between ngspice and xyce
e
One comment that might be relevant to the output file size: when using the command line “-r”, the resulting raw file will contain every solution variable, whether you want them or not. Alternatively, using `.print tran format=raw v(…)` in the netlist, you can reduce the number of outputs to only be the ones you want. Also, you can then include outputs that aren't in the solution vector (like a lot of lead currents).
Regarding dynamic time stepping, and also outputs: one thing to understand is that many codes (including, possibly, ngspice) do sampled output. So, if you are running Hspice for example, I think the points that appear in the output file are not the specific time steps used by the solver. They've been sampled/interpolated to reduce the size of the output file. By default, Xyce simply outputs the results for every time step used by the solver. But Xyce can be instructed to do sampled output.
I'm not sure if that is the nature of the difference you observed between Xyce and ngspice, but I wanted to mention it.
s
@Stefan Schippers updated xschem to run xyce without the -r flag for this reason, since we tried to create a fair comparison with equally sampled variables
By default Xyce outputs every time step? In our test of a .tran 1n 160u we expected to see 160k datapoints, which ngspice recorded, but Xyce recorded only 3k (causing a huge speed up). The quality / output signal was (seemingly) identical
e
Regarding sampled output: if you want Xyce to just output at certain intervals (to reduce output filesize), you can add this command to the netlist: `.OPTIONS OUTPUT INITIAL_INTERVAL=1e-3`. If you do this, then it will output every 1e-3 seconds. It is also possible to have different intervals for different windows of time (although I don't recall the precise command at the moment)
@Steven Bos Interesting. I can’t speak to what ngspice is necessarily doing, but if I had to guess, their time integrator is doing dynamic time stepping, but it is sampling the output at every 1ns. Pretty much all circuit simulators are doing dynamic time stepping under the hood.
Xyce is doing dynamic time stepping as well, but we definitely were not the first to do it.
FYI, if you are curious about how the dynamic time stepping is done, it is mostly based on local truncation error analysis. At each step, the integrator makes an explicit prediction, which serves as the initial guess for the step. Then it does an implicit solve for the step, which is the actual solution candidate. It then compares the prediction to this “corrector”, and based on this comparison can make an estimate as to how big the truncation error was for this step. If it is too big, then it rejects the step and takes a smaller one. If it is really small, it increases the stepsize for the next step.
There are other constraints to time step, beyond local truncation error (LTE). For example, if you have PWL or Pulse sources, they create known discontinuities in the signal. The time stepping has to land precisely on those discontinuities and restart.
Also, if the Newton solve fails (which is the process by which it computes the “corrector”), then the step fails and it cuts the stepsize by a fixed fraction. (In Xyce the next attempted step is 1/8 the size of the failed one.) The LTE analysis won't be performed in this case, as there isn't a valid corrector to use.
Anyway, at a high level this is what most codes do.
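For reference, a textbook form of this LTE-based step control (not necessarily Xyce's exact formula): with estimated truncation error $\epsilon_{est}$, tolerance $\epsilon_{tol}$ and integrator order $p$, the next step is chosen roughly as

$$\Delta t_{\mathrm{new}} = \Delta t \left( \frac{\epsilon_{tol}}{\epsilon_{est}} \right)^{1/(p+1)}$$

with the step rejected and retried when $\epsilon_{est} > \epsilon_{tol}$.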
But I think that Xyce is the only one that by default outputs every time step, rather than interpolating. Most codes (like Hspice) can be forced to output every step if that is desired. And Xyce (as noted) can be forced to use interpolated output. But the default behavior is different.
I would guess that ngspice can be forced to output every step rather than sampling. But I don't know that code well enough to know the command.
s
For my circuits these speedups are not isolated events. Every circuit so far has benefitted from the better dynamic time stepping in Xyce (see video above). Out-of-the-box experiences for small circuits are 10x in my case, almost real-time (although this is in big part due to the parsing of the sky130 lib, which Xyce handles faster than ngspice). @Stefan Schippers and I will be testing both ngspice and xyce serial/parallel with more digital- and analog-focussed small/mid/large circuit tests to see how both fare. One comment from Stefan mentioned above: "... And yes, I believe Xyce does much better sampling around fast transitions. ngspice also does that, but with less deviation from the user specified time step. At least this is my impression."
e
OK! I haven't been able to digest this whole thread yet. I'm certainly always happy to hear that Xyce is faster. 🙂
Are these test cases with Sky130 or with the simpler model cards? I’ve been told anecdotally that Xyce seems to parse the Sky130 files more quickly. But that faster parsing would only matter for a large number of PDK files.
s
We users are too, if it doesn't lead to inferior data of course.
Yes, the tests are with sky130. I wouldn't just say more quickly; 0.8s vs 12s for a simple inverter or simple analog switch is out of the park. And for my DAC circuit the speedup persists. We are very curious why.
e
OK. If there is a way to compare the parse/setup times, I’d be curious about that. For a really long simulation, the parse time won’t matter very much, of course.
I have to run to a meeting, but I’ll check in here later.
s
Sorry, I mashed my questions into one post. "_... if the Newton solve fails (which is the process by which it computes the “corrector”), then the step fails and it cuts the stepsize by a fixed fraction. (In xyce the next attempted step is 1/8 the size of the failed one). The LTE analysis won't be performed in this case, as there isn't a valid corrector to use..._" 1) I wonder how ngspice implements this part. If the fixed fraction is, say, 1/4 in ngspice, that would be huge. Can we change this fraction in xyce and see how it affects simulation time? Also, we mostly use PWL and Pulse sources. "_But I think that Xyce is the only one that by default outputs every time step, rather than interpolating. Most codes (like Hspice) can be forced to output every step if that is desired. And, Xyce (as noted) can be forced to use interpolated output. But the default behavior is different._" 2) Would interpolated output result in a speed up (at the cost of uncertainty in the data)? If ngspice is doing this we should set xyce to the same setting (or change ngspice), because equal quality output is important for a fair comparison. Although TBH, so far we couldn't visually detect any difference in output quality. 3) Can the LTE be logged at every time step change decision? That way we can trace the time steps in a simulation.
e
Regarding (1), I don’t expect for most digital circuits that you’ll get a ton of Newton solve failures. The stepsize is much more likely to be adjusted based on LTE. Newton solve failures are much more likely in highly nonlinear analog circuits. Or if you have model implementation problems (which would be my fault, if true … )
Interpolated output can result in faster run times if the total number of outputs is significantly different otherwise. If you have a circuit where there are long periods of time in which not much happens, the adaptive step size algorithm should take a relatively small number of steps. If you do interpolated output with a very small interval, you’ll wind up outputting a lot more, which would probably slow things down.
Or, alternatively, you could have a complex waveform (like a high-q oscillator taking a long time to settle) where there is a really small time step. In that case, the solver will use a LOT of steps, and if you do interpolated output at a much larger interval, you’ll get a much smaller output file and possibly a faster simulation.
Regarding (3) In the more verbose builds of Xyce the results of the LTE calculation are output to the screen (stdout/terminal output). In the default build, however, I don’t think there is currently a way to do this.
However, generally, your LTE algorithm is controlled by the time integrator tolerances, reltol and abstol, which are set on the .options timeint line.
s
Thanks for these answers @Eric Keiter! This gives a few knobs to turn, especially interpolated output vs non-interpolated. I have to check whether reltol and abstol are identical to ngspice.
@Stefan Schippers I tried to run the test but it is missing the `stimuli_test_carry_lookahead.cir` file. Could you commit that?
s
@Steven Bos you can generate this file yourself by going to Simulation -> Utile Stimuli editor (GUI) and pressing '`Translate`'
s
@Stefan Schippers Yes, but doesn't that need some input file as well? When I open Simulation -> Utile Stimuli editor (GUI) it is empty. I recall that rom_8k had a stimuli.rom8k file, but I cannot find such a file for this test
s
@Steven Bos ok, then I have to check.
ok, @Steven Bos I have checked in `sky130_tests/stimuli.test_carry_lookahead`. Update your repo, then copy this file into the simulation directory and do the '`Translate`' step. This copying into the simulation directory is boring and I need to fix that; this file should be looked up in the directory where the schematic is. For now accept this extra step.
👍 1
e
@Steven Bos Glad I could help. I should mention that in my experience the main reason users turn on the interpolated output in Xyce is to reduce the size of the output file. Some of our users run a lot of very long running analog circuits, and they can wind up having millions of time steps by the time they are complete. For plotting purposes etc. you often just don't need that many points, and users would rather not fill up their drives with huge files. As far as overall runtime goes, I usually only expect these IO differences to matter when the circuits aren't too large. Once things get big enough the solver time dominates. But I've never really done a systematic study of it.
s
We try to mention solve time next to runtime in our tests so that eventually we can see them side by side for several circuits. Is there a quick way to find all the default settings (e.g. what is the default RELTOL in Xyce)?
e
For the time integrator, the current defaults are ABSTOL=1.0E-6 and RELTOL=1.0E-3.
s
it seems that ngspice has the same RELTOL (1e-3) but a much lower default ABSTOL (1e-12)
e
It is possible that they aren’t using it in quite the same way.
s
Implementation differences, you mean?
e
Some simulators set different tolerances for different types of variables. So, for example, they’d have a different ABSTOL for charge than for current. We have not done that.
The relative tolerance inherently handles whatever the natural/typical magnitude of each variable is.
s
Ah OK. I reckon at some point in the comparison we just have to accept some implementation differences.
e
Yes, I think so.
Two settings we also have, which may be of interest to you are:
.options timeint NEWLTE
and
.options timeint ERROPTION
The first one (NEWLTE) allows you to set what "reference" is used by RELTOL. The other one (ERROPTION) allows you to completely disable local truncation error control and rely only on nonlinear solver behavior to set the time step. This second one (ERROPTION) is mostly a last resort when someone has a really difficult circuit that won't run otherwise. But it often runs very fast (albeit probably less accurately).
The `.options timeint NEWLTE` option can be set to 0, 1, 2 or 3; the meaning of each value is spelled out in the Xyce Reference Guide.
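A hedged sketch of how these options would appear in a netlist (values are illustrative; the inline comments paraphrase my reading of the Reference Guide):
Copy code
* two independent knobs, shown separately (ERROPTION=1 makes NEWLTE moot, since LTE is off)
.options timeint NEWLTE=1     ; select the reference magnitude used with RELTOL (0-3)
.options timeint ERROPTION=1  ; 1 = ignore LTE, let nonlinear-solver success drive the step size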
s
Interesting! I will try them. I am now running another test made by @Stefan Schippers, the carry lookahead test.
@Stefan Schippers I assume that the Xyce simulation should be run with the -r option for this carry lookahead test?
s
@Steven Bos add the following command in the Xyce command block:
Copy code
.print tran format=raw file=carry_lookahead_xyce.spice.raw v(*)
I am restructuring all examples to run without the -r command line option, this example was not completed yet.
@Steven Bos the `test_carry_lookahead` example has 80k unknowns; using -r will save all internal nodes, which are not interesting and will generate a huge raw file. The .print above just saves the top-level voltages, which is enough for the test.
@Steven Bos regarding ngspice getting killed: are you running inside a container? If so, check the process/memory limits in the setup/preferences. If not using containers/sandboxes, check your limits with `ulimit -a`, though these are usually unrestricted as far as I know. If using a virtual machine / WSL, see if there are restrictions there. I am not familiar with these, so I cannot help much. If everything is set up correctly then it might be an ngspice problem.
e
Hello @Steven Bos, it looks like the parallel simulation is using the parallel iterative linear solver rather than KLU. You can tell this from the log (the singleton warnings and the Hypergraph message; if running with KLU you won't ever see either of them). If that was the intent, that is fine. But the reason that the parallel linear solve needs 4:22.302 and the serial linear solve needs 53.103 is this choice: at this size, serial KLU is more efficient than parallel preconditioned GMRES.
s
@Steven Bos I saw the screenshot. Based on past experience, the (terse!) message '`Killed`' means something outside ngspice killed it the rude way, with signal 9 (SIGKILL). This is the exact message I get if I send a kill -9 to the ngspice process while it is running. On Linux, the usual culprit when memory runs out is the kernel's OOM killer.
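If the OOM killer is the suspect, it leaves a trace in the kernel log; a quick check (exact log wording varies by kernel version):
Copy code
dmesg | grep -i -E 'out of memory|killed process'
# or on systemd machines:
journalctl -k | grep -i oom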
e
Regarding the runtime vs. simtime question: that is a good question. I notice in the logs that the “distribute devices” phase of the setup is taking 11 minutes in serial and 18 minutes in parallel. That is a long time; I'm not sure I can explain why that is happening.
Is this a circuit that would be easy to share with us? There might be a weird bottleneck in the setup.
Usually, even for very large circuits and PDKs the setup time is faster than that.
s
@Eric Keiter @Steven Bos I did the tests on Xyce serial, and a very long time is spent parsing the netlist. ngspice had similar results: ~50% of the time is spent processing/parsing before the simulation starts.
@Eric Keiter the circuit is highly hierarchical; it contains (among other things) a 256-bit adder, which is implemented as 4x 64-bit adders, each implemented as 4x 16-bit adders, each implemented as 4x 4-bit adders, each implemented as 4 1-bit full adders... I don't know if this means something for the simulator's ability to parse the whole thing.
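For illustration, the nesting described above looks roughly like this in SPICE terms (a hypothetical sketch with invented names, not the actual test_carry_lookahead netlist):
Copy code
* 1-bit full adder (transistor-level contents omitted in this sketch)
.subckt FA1 A B CI S CO
* ... MOSFET instances ...
.ends
* 4-bit adder = 4 instances of FA1; the same 4x pattern repeats at every level
.subckt ADD4 A0 A1 A2 A3 B0 B1 B2 B3 CI S0 S1 S2 S3 CO
XFA0 A0 B0 CI S0 N1 FA1
XFA1 A1 B1 N1 S1 N2 FA1
XFA2 A2 B2 N2 S2 N3 FA1
XFA3 A3 B3 N3 S3 CO FA1
.ends
* ADD16 = 4x ADD4, ADD64 = 4x ADD16, ADD256 = 4x ADD64: a few dozen
* subcircuit definitions expand into ~27k devices when flattened.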
e
@Stefan Schippers @Steven Bos Interesting. I had previously been told anecdotally that Xyce handled the setup a lot faster than ngspice. But that probably wasn't a systematic comparison, and it may have been a very problem-dependent observation.
Usually, I expect hierarchical netlists to be a bit faster to parse, if only b/c the file IO is less than it would be for its flattened equivalent. But that is just me speculating.
s
@Eric Keiter the results of Xyce (serial) and ngspice on this design (27k devices, 80k unknowns), run by me, were posted above. Ngspice:
Copy code
Total analysis time (seconds) = 2710.63
Total elapsed time (seconds) = 4191.515
Xyce:
Copy code
***** Total Simulation Solvers Run Time: 2103.34 seconds
***** Total Elapsed Run Time:            4147.71 seconds
*****
***** End of Xyce(TM) Simulation
so not very different. Results were good for both and raw file sizes were comparable (Xyce saves fewer time points when the circuit is idle, ngspice somewhat more).
e
Interesting result. Thanks!
s
As you can see, both take a considerable time to get the netlist down their throats.
s
@Stefan Schippers Indeed, ngspice rapidly hogs all of the 15GB main + 4GB swap memory in about 3 minutes and at that point gets killed.
I tried running it in batch mode, hoping that it would offload some memory usage to disk, but apparently that only helps on the output side (writing results directly to the .raw file instead of keeping them in memory). It gets killed while doing the parsing, so before doing any .tran.
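For reference, a batch-mode invocation along the lines described (flags per the ngspice manual; file names invented):
Copy code
ngspice -b -r out.raw -o out.log netlist.spice   # -b batch mode, -r rawfile, -o log file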
I will ask around on the ngspice forums on SourceForge; too bad the devs don't have a Slack channel in this community.
@Eric Keiter I will test Xyce parallel again with `.options LINSOL type=klu` and see how it performs. BTW, all tests we do are open source and are shared either here or in the xschem repo. I am thinking about a simulator benchmark/test suite, probably as a separate repo where all the schematics are centralized and users can submit PRs with their own schematics. But before that a good test method is needed; I can imagine parse time, file I/O, solve time and total run time being good candidates.
👍 1
For me Xyce starts up really fast with sky130 compared to ngspice, regardless of the circuit, even with only
Copy code
.lib /usr/local/share/pdk/sky130A/libs.tech/ngspice/sky130.lib.spice tt
but this test does a bit more with
Copy code
.include /usr/local/share/pdk/sky130A/libs.ref/sky130_fd_sc_hd/spice/sky130_fd_sc_hd.spice
.include /usr/local/share/pdk/sky130A/libs.tech/ngspice/corners/tt.spice
.include /usr/local/share/pdk/sky130A/libs.tech/ngspice/r+c/res_typical__cap_typical.spice
.include /usr/local/share/pdk/sky130A/libs.tech/ngspice/r+c/res_typical__cap_typical__lin.spice
.include /usr/local/share/pdk/sky130A/libs.tech/ngspice/corners/tt/specialized_cells.spice
s
@Steven Bos try giving ngspice only the corner it needs, using the component `sky130_fd_pr/corner.sym` (if you are using my test it is probably already done). Also ensure you have the `.spiceinit` file in the simulation directory with the following content:
Copy code
set ngbehavior=hsa
set ng_nomodcheck
Maybe it is the 2nd line that helps; not sure.
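For what it's worth, my understanding of those two lines (treat the comments as assumptions based on my reading of the ngspice manual):
Copy code
* HSPICE-compatible parsing of the PDK model libraries ('hs' = HSPICE mode, 'a' = apply to all files)
set ngbehavior=hsa
* skip the model parameter check while loading the huge PDK model files
set ng_nomodcheck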
I have only 4GB of RAM and both simulators peaked around 1.2-1.5GB resident (not virtual) memory during the run.
s
Again, great call @Stefan Schippers. I ran this example from the .xschem folder, which didn't have a .spiceinit file. With this file (and probably the `set ng_nomodcheck` line) memory never exceeded 1.8 GB. The result: ngspice 2 threads 1458s (773s) and raw file 1381 (1351)
I am now running the parallel version with linsol and will then make another TL;DR post
TL;DR: @Stefan Schippers' laptop and my laptop both report identical rankings and file sizes, but different durations due to hardware differences. Xyce parallel was not yet tested by Stefan. Two mid-sized circuits were tested; figures are total runtime (solver time) and raw file #variables (#data points).
Copy code
ROM_8K
------------------
ngspice 2 threads   169s (168s)
xyce parallel       170s (169s)
xyce parallel klu   174s (173s)
xyce serial         268s (267s)

CARRY_LOOKAHEAD (256-bit adder)
-------------------
xyce serial         1218s (515s)   raw file 10799 (1448)
ngspice 2 threads   1458s (773s)   raw file  1381 (1351)
xyce parallel klu   1500s (451s)   raw file 10799 (1448)
xyce parallel       1608s (509s)   raw file 10799 (1448)
CC @Eric Keiter!
I am now finishing up my new 1-to-10-bit pulsed DACs (thanks to your help @Stefan Schippers and @Mitch Bailey) and will compare ngspice and Xyce again, but now for these smaller circuits. When I find some time I will use the hspice version of the sky130 lib to test that as well, for the circuits mentioned here.
e
@Steven Bos thanks for the info. I am still a bit surprised that KLU isn't faster. One other thought, looking at the xyce_parallel_klu.log file: the number of warnings is really excessive. The file is 25MB and contains 473778 lines! If I edit out the warnings, the log file is only 246 lines. We've talked about adding a setting so that Xyce will optionally throttle or suppress warnings like this, as this many is not useful.
I am curious if you see similar warnings from ngspice
Part of why I ask is that I want to double check if these warnings represent something we need to fix. I’ve seen these warnings before but haven’t had time to dig into them. They are coming from one of the BSIM models (probably BSIM4). I know that commercial simulators have often added extra geometrical parameters to the BSIM models, that were not part of the original model from Berkeley. I also think that ngspice has implemented some of these extra geometrical params. So, I’d like to check if these warnings are happening b/c the Sky130 PDK is using those parameters and Xyce is incorrectly not using them. Other than the mountain of warnings, they don’t seem to be causing the answers to be substantially different, so there is that.
s
Ha! Did I forget to add the ngspice log file? I will add it first thing tomorrow. Note that parallel with KLU has the fastest solve time of all the configurations, just not the fastest total runtime. I don't recall seeing that many errors. TBF, @Stefan Schippers mentioned the .spiceinit file for ngspice; that file has a 'nomodcheck' line that solved a memory issue in the parse phase. Not sure if Xyce can do something similar. I know from Blender, for example, that the OBJ parser was recently sped up hugely because nobody had touched & analyzed that piece of code in a long time.
Great that you are looking into this. The fewer warnings, the more comfortable people are when simulating. We definitely noticed the effects of heavy I/O slowing down total runtime. Surprisingly, the number of variables in the second test was 7x that of ngspice, yet Xyce serial was 240s faster overall, and that was almost pure solving time.
e
One of my colleagues pointed out to me that the version of the BSIM4 model you are probably running is 4.6.1, which has until recently been the only version in the Xyce source code. However, there are two newer versions commonly in use: 4.7 and 4.8. My colleague very recently (last week) added 4.7.0 to the Xyce source and added 4.8 to Xyce over the past few days. According to him, the function that is producing all the warnings got changed in 4.7.0. So, possibly if you run this with a build of Xyce that includes the 4.7 model these warnings will go away. I haven’t tried this yet; it is just a theory. But it would be convenient if that turns out to be the case.
s
@Eric Keiter I have not found any reference to BSIM 4.8; has that been merged to master yet? BSIM 4.7 has indeed been added 10 days ago.
s
Results are in line with mine (if scaled to a slower computer, lol): ~50% of the time is spent parsing the netlist and building internal data structures. About the `-r` flag: in ngspice it not only forces raw writing, it also saves all variables if no `.save` lines are present in the netlist. This is (I believe) different from Xyce, which saves everything if `-r` is specified, even if `.print` lines are present in the netlist.
@Steven Bos I found a very different behavior between ngspice and Xyce when doing a .dc analysis with capacitors that have an IC=... condition in the netlist. Ngspice removes all capacitors (and shorts all inductors) in DC analysis, which makes sense since frequency is 0. Xyce instead replaces capacitors that have an IC condition with voltage sources equal to the IC voltage in a .DC simulation. This is something we must be very careful about: IC conditions are typically used in transient analysis to define an initial state, and in my experience nobody cared about IC conditions when doing a DC analysis, but for Xyce they affect the DC results. This is the behavior described in the Xyce reference manual: "If one is doing a transient with DC operating point calculation or a DC operating point analysis, the initial condition is applied by inserting a voltage source across the capacitor to force the operating point to find a solution with the capacitor charged to the specific voltage. The resulting operating point will be one that is consistent with the capacitor having the given voltage in steady state".
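A minimal sketch of the difference (hypothetical netlist, node names invented):
Copy code
V1 in 0 DC 1.8
R1 in n1 1k
C1 n1 0 1p IC=0.5
.op
* ngspice: C1 is an open circuit at DC, no current flows in R1, so V(n1) = 1.8 (the IC is ignored)
* Xyce: the IC inserts a source across C1 for the DC solve, so the operating point has V(n1) = 0.5
.end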
🌍 1
s
Good catch @Stefan Schippers!
k
@skandha deepsita @akhendra kumar padavala
❤️ 1
@Dr. P Akhendra Kumar
❤️ 1