# xyce
e
Hi @User I will check it out. Hard to tell exactly but it looks like the nonlinear solver is stalling and failing to reduce the error norm enough to converge. The single terminal connection warnings are probably not a big deal, so I think your intuition is correct about that.
m
Thanks. It does work in ngspice and hspice.
e
OK, interestingly, it ran for me in the Xyce development branch just now. I was running in serial on my OSX laptop. And, when I ran it using a verbose build of Xyce, the solver didn’t appear to have any trouble near the 4.8e-9 time (which is where it is bogging down for you). Hmmm.
Are you running on Linux?
m
Yes, on Linux. Either serial or parallel
e
OK. Which version of Xyce? 7.2?
m
Xyce DEVELOPMENT-202105141522-g-b37dc80c-opensource
e
OK. I’ll log into a Linux machine and try it there. BTW, there was another “sync” with the external GitHub repository this morning. I can’t think of anything obvious in that update, however, that would make a difference. On my OSX laptop, my build of Xyce 7.2 also succeeded just now. Possibly there is a subtle compiler difference? On OSX I usually build using clang++.
I did notice that during the DCOP calculation the solvers complained a lot before finding the answer. But once it was running in transient it seemed to progress OK.
m
That makes sense. Lots of uninitialized memory nodes
I don't specify an initial voltage anywhere
It looks like it is using: g++ (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0
e
OK, that helps. I have to run at the moment, but I’ll dig into this more later this afternoon/evening.
m
No worries. Thanks for the help
e
Hi @Matthew Guthaus, here is a small update. So far, we’ve been unable to reproduce your issue. It ran fine for me on OSX using clang, as I mentioned before. Then I attempted Linux, RHEL7, using gcc 5.3. (I just happened to be set up to use that gcc variant.) Then one of my colleagues attempted BSD with clang. Finally, that same colleague used exactly the same config as yours: Ubuntu with g++ 9.3.0. All of them ran all the way through without failure. So, we have a mystery.
By the way, my colleague set up a GitHub repo, which he calls “XyceBundle”, to help with building Xyce. He used the recipe there to build Xyce. That repo is here: https://github.com/tvrusso/XyceBundle
The scripts in that repo are essentially just reproducing what is in our building guide.
I mention “XyceBundle” just in case there are any differences in how you built Xyce.
m
Actually, speaking of that, I didn't run any tests to confirm my build actually "works". I think I read somewhere that there are some regression tests.
I'll take a look at that repo
e
Cool
Yes, we do have a regression suite here: https://github.com/Xyce/Xyce_Regression
The test suite contains scripts for running it. I think the XyceBundle repo also includes scripts for invoking it.
m
I upgraded to 7.3 and re-ran this test. Interestingly, it fails with >1 threads but passes with a single thread. The error changes too. With 3 threads it does what I posted before, but with 2 or 4 it does this:
ZOLTAN Load balancing method = 10 (HYPERGRAPH)

        Step size reached minimum step size bound

        Step size reached minimum step size bound
DC Operating Point Failed.  Exiting transient loop
When I built Xyce, I used these options:
../configure CXXFLAGS="-O3 -std=c++11" ARCHDIR="/software/XyceLibs/Parallel" CPPFLAGS="-I/usr/include/suitesparse" --enable-mpi CXX=mpicxx CC=mpicc F77=mpif77 --prefix=/software/Xyce/Parallel
I was going to use that XyceBundle, but it only builds the serial version.
$ mpicxx --version
g++ (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0

$ mpicc --version
gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0

$ mpif77 --version
GNU Fortran (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0
e
Hi @Matthew Guthaus, one thing to try is to force parallel Xyce to use the “parallel load, serial solve” option. The output, above, which says
ZOLTAN Load balancing method = 10 (HYPERGRAPH)
indicates that Xyce is attempting to use an iterative solver. Iterative solvers are easier to set up in parallel but tend to be less robust. Direct solvers (which are the Xyce default in serial) are very robust, but hard to make scale well in parallel. Anyway, parallel Xyce looks at the size of the problem, and if the problem size is below a threshold (I can’t recall the exact value, but it might be 1000 unknowns), it will do the linear solver phase of the calculation using a serial direct solver on proc 0. The device evaluation will still be parallel, however. Direct solvers actually work pretty well at much larger sizes than 1000, so arguably our threshold for direct/iterative is too low. In any case, you can force Xyce to use a direct method by adding
.options LINSOL type=klu
to the netlist.
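For illustration, here is a minimal sketch of where that card would sit in a deck. The RC circuit, node names, and values below are made-up placeholders rather than your actual SRAM netlist; the only line taken from this conversation is the .options card:

* hypothetical minimal deck showing placement of the LINSOL card
.options LINSOL type=klu
* placeholder stimulus and RC load, purely illustrative
V1 in 0 PULSE(0 1 0 1n 1n 10n 20n)
R1 in out 1k
C1 out 0 1p
* placeholder analysis and output cards
.tran 1n 100n
.print tran v(in) v(out)
.end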
I am glad to hear this test passes in serial with Xyce 7.3. Hopefully by forcing it to use KLU it will work in parallel as well.
m
I think it might've passed in 7.2 as well in serial...
e
Ah. I had the vague impression that it had failed in all modes before. But I’m getting confused about which circuit I’m remembering now.
m
I didn't actually try serial. I didn't think it would be dependent on this...
e
Of course when you force KLU, that means part of the calculation is serial. So, that will limit the speedup you get, based on Amdahl’s law.
m
Bleh 😞
e
Well, at the size of problem you are running, most of the work is in the device evals, which are still parallel. So you will get speedup.
m
This is just a unit test!
e
How big?
m
Oh, 10000x bigger
e
Just checked the output from when I ran this a few days ago. Total Devices is 2605, Number of Unknowns = 11597. That is a size where the KLU option is probably the best choice, even though by default Xyce attempts the iterative solve.
10000x bigger than 11597? So number of unknowns = 100M? Am I interpreting that correctly?
BTW, the previous circuits where you reported speedup in parallel probably were using KLU.
m
Well, this example is a 64-bit SRAM, so going up to 64kbit is the goal. We do, however, do some pruning internally of bitcells, so it probably won't be exactly 10kx
This is a tiny tiny circuit
e
got it. I have to run to lunch right now but I’ll have more to say later …
m
I'll try the klu option
e
So, based on my own experiments, this will run with the KLU option. When I ran it in serial last week, about 80% of the calculation was the device evaluation and 20% was the linear solve phase. With the KLU option in parallel, only the device eval is parallelized, but it is a large enough fraction of the calculation that you should get meaningful speedup. Based on Amdahl’s law, the theoretical max speedup for this circuit would be (1/(1-0.8)) = 5x.
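Spelled out (a quick sketch, where p = 0.8 is the parallelizable device-evaluation fraction from the serial profile above and N is the number of processes):

\[
S(N) = \frac{1}{(1 - p) + p/N}, \qquad S(2) \approx 1.67, \quad S(4) = 2.5, \quad \lim_{N \to \infty} S(N) = \frac{1}{1 - p} = 5.
\]

So the 5x figure is the large-N limit; with only a handful of processes the expected speedup is correspondingly smaller.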
It will also run with the parallel linear iterative solver, if you specify a better preconditioner. If you replace
.options linsol type=klu
with
.options linsol use_ifpack_factory=1
then the iterative solver will work.
The good news about using IFPACK is that the iterative solver will work. The bad news is that for a circuit of this size, iterative solvers are generally much slower than the direct solver. So at this size, there isn’t much benefit to using it. But (back to the good news) as you move to larger circuits, direct solvers will eventually fail, as they don’t scale very well. The iterative solver will scale much better. So (hopefully) it will continue to work for larger problems. Iterative solvers will start to beat the direct solvers once you start getting into the 100k unknown range.
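Concretely, the swap is just the one card. In the same kind of hypothetical placeholder deck sketched earlier, it would look like this (only the two .options variants come from this thread; everything else is illustrative):

* hypothetical minimal deck using the Ifpack-preconditioned iterative solver
* .options LINSOL type=klu   (the direct-solver alternative, commented out here)
.options LINSOL use_ifpack_factory=1
V1 in 0 PULSE(0 1 0 1n 1n 10n 20n)
R1 in out 1k
C1 out 0 1p
.tran 1n 100n
.print tran v(out)
.end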
I’ll note that I experimented with two other preconditioners just now but they didn’t perform well. The BTF preconditioner was unable to find the BTF structure in the matrix. ShyLU (the hybrid-hybrid solver) also struggled, but I haven’t diagnosed why yet. I have hope that ShyLU can be made to work, but I’ll have to consult with someone about it.
The other comment I’ll make is that we have a threaded direct solver in the pipeline, called BASKER. It is almost always faster than KLU. It wasn’t quite “ready for prime time” in the 7.3 release, so it wasn’t officially there. But I was told pretty recently that BASKER is mature enough to pass our entire test suite, so it should become available soon.
👍 1
m
I can then change this in OpenRAM to select which one. Thanks!
It's very useful that Xyce prints these time summaries as well. I'm not sure why other tools don't.
👍 1