# xyce
e
Hi @User I will check it out. Hard to tell exactly but it looks like the nonlinear solver is stalling and failing to reduce the error norm enough to converge. The single terminal connection warnings are probably not a big deal, so I think your intuition is correct about that.
m
Thanks. It does work in ngspice and hspice.
e
OK, interestingly, it ran for me in the Xyce development branch just now. I was running in serial on my OSX laptop. And, when I ran it using a verbose build of Xyce, the solver didn’t appear to have any trouble near the 4.8e-9 time (which is where it is bogging down for you). Hmmm.
Are you running on Linux?
m
Yes, on Linux. Either serial or parallel
e
OK. Which version of Xyce? 7.2?
m
Xyce DEVELOPMENT-202105141522-g-b37dc80c-opensource
e
OK. I’ll log into a Linux machine and try it there. BTW, there was another “sync” with the external GitHub repository this morning. I can’t think of anything obvious in that update, however, that would make a difference. On my OSX laptop, my build of Xyce 7.2 also succeeded just now. Possibly there is a subtle compiler difference? On OSX I usually build using clang++.
I did notice that during the DCOP calculation the solvers complained a lot before finding the answer. But once it was running in transient it seemed to progress OK.
m
That makes sense. Lots of uninitialized memory nodes
I don't specify an initial voltage anywhere
It looks like it is using: g++ (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0
e
OK, that helps. I have to run at the moment, but I’ll dig into this more later this afternoon/evening.
m
No worries. Thanks for the help
e
Hi @Matthew Guthaus, here is a small update. So far, we’ve been unable to reproduce your issue. It ran fine for me on OSX using clang, as I mentioned before. Then I attempted Linux, RHEL7, using gcc 5.3. (I just happened to be set up to use that gcc variant.) Then one of my colleagues attempted BSD with clang. Finally, that same colleague used exactly the same config as yours: Ubuntu with g++ 9.3.0. All of them ran all the way through without failure. So, we have a mystery.
By the way, my colleague set up a GitHub repo, which he calls “XyceBundle”, to help with building Xyce. He used the recipe there to build Xyce. That repo is here: https://github.com/tvrusso/XyceBundle
The scripts in that repo are essentially just reproducing what is in our building guide.
I mention “XyceBundle” just in case there are any differences in how you built Xyce.
m
Actually, speaking of that, I didn't run any tests to confirm my build actually "works". I think I read somewhere that there are some regression tests.
I'll take a look at that repo
e
Cool
Yes, we do have a regression suite here: https://github.com/Xyce/Xyce_Regression
The test suite contains scripts for running it. I think the XyceBundle repo also includes scripts for invoking it.
m
I upgraded to 7.3 and re-ran this test. Interestingly, it fails with >1 threads but passes with a single thread. The error changes too. With 3 threads it does what I posted before, but with 2 or 4 it does this:
ZOLTAN Load balancing method = 10 (HYPERGRAPH)

        Step size reached minimum step size bound

        Step size reached minimum step size bound
DC Operating Point Failed.  Exiting transient loop
When I built Xyce, I used these options:
../configure CXXFLAGS="-O3 -std=c++11" ARCHDIR="/software/XyceLibs/Parallel" CPPFLAGS="-I/usr/include/suitesparse" --enable-mpi CXX=mpicxx CC=mpicc F77=mpif77 --prefix=/software/Xyce/Parallel
I was going to use that XyceBundle, but it only builds the serial version.
$ mpicxx --version
g++ (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0

$ mpicc --version
gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0

$ mpif77 --version
GNU Fortran (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0
e
Hi @Matthew Guthaus, one thing to try is to force parallel Xyce to use the “parallel load, serial solve” option. The output, above, which says
ZOLTAN Load balancing method = 10 (HYPERGRAPH)
indicates that Xyce is attempting to use an iterative solver. Iterative solvers are easier to set up in parallel but tend to be less robust. Direct solvers (which are the Xyce default in serial) are very robust, but hard to make scale well in parallel. Anyway, parallel Xyce looks at the size of the problem, and if the problem size is below a threshold (I can’t recall the exact value, but it might be 1000 unknowns), it will do the linear solver phase of the calculation using a serial direct solver on proc 0. The device evaluation will still be parallel, however. Direct solvers actually work pretty well at much larger sizes than 1000, so arguably our threshold for direct/iterative is too low. In any case, you can force Xyce to use a direct method by adding
.options LINSOL type=klu
to the netlist.
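For illustration, here is a minimal sketch of where that card would sit in a deck. The RC circuit, node names, and values below are made-up placeholders rather than your actual SRAM netlist; the only line taken from this conversation is the .options card:

* hypothetical minimal deck showing placement of the LINSOL card
.options LINSOL type=klu
* placeholder stimulus and RC load, purely illustrative
V1 in 0 PULSE(0 1 0 1n 1n 10n 20n)
R1 in out 1k
C1 out 0 1p
* placeholder analysis and output cards
.tran 1n 100n
.print tran v(in) v(out)
.end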
I am glad to hear this test passes in serial with Xyce 7.3. Hopefully by forcing it to use KLU it will work in parallel as well.
m
I think it might've passed in 7.2 as well in serial...
e
Ah. I had the vague impression that it had failed in all modes before. But I’m getting confused about which circuit I’m remembering now.
m
I didn't actually try serial. I didn't think it would be dependent on this...
e
Of course when you force KLU, that means part of the calculation is serial. So, that will limit the speedup you get, based on Amdahl’s law.
m
Bleh 😞
e
Well, at the size of problem you are running, most of the work is in the device evals, which are still parallel. So you will get speedup.
m
This is just a unit test!
e
How big?
m
Oh, 10000x bigger
e
Just checked the output from when I ran this a few days ago. Total Devices is 2605, Number of Unknowns = 11597. That is a size where the KLU option is probably the best choice, even though by default Xyce attempts the iterative solve.
10000x bigger than 11597? So number of unknowns = 100M? Am I interpreting that correctly?
BTW, the previous circuits where you reported speedup in parallel probably were using KLU.
m
Well, this example is a 64-bit SRAM, so going up to 64kbit is the goal. We do, however, do some pruning internally of bitcells, so it probably won't be exactly 10kx
This is a tiny tiny circuit
e
got it. I have to run to lunch right now but I’ll have more to say later …
m
I'll try the klu option
e
So, based on my own experiments, this will run with the KLU option. When I ran it in serial last week, about 80% of the calculation was the device evaluation and 20% was the linear solve phase. With the KLU option in parallel, only the device eval is parallelized, but it is a large enough fraction of the calculation that you should get meaningful speedup. Based on Amdahl’s law, the theoretical max speedup for this circuit would be (1/(1-0.8)) = 5x.
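Spelled out (a quick sketch, where p = 0.8 is the parallelizable device-evaluation fraction from the serial profile above and N is the number of processes):

\[
S(N) = \frac{1}{(1 - p) + p/N}, \qquad S(2) \approx 1.67, \quad S(4) = 2.5, \quad \lim_{N \to \infty} S(N) = \frac{1}{1 - p} = 5.
\]

So the 5x figure is the large-N limit; with only a handful of processes the expected speedup is correspondingly smaller.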
It will also run with the parallel linear iterative solver, if you specify a better preconditioner. If you replace
.options linsol type=klu
with
.options linsol use_ifpack_factory=1
then the iterative solver will work.
The good news about using IFPACK is that the iterative solver will work. The bad news is that for a circuit of this size, iterative solvers are generally much slower than the direct solver. So at this size, there isn’t much benefit to using it. But (back to the good news) as you move to larger circuits, direct solvers will eventually fail, as they don’t scale very well. The iterative solver will scale much better. So (hopefully) it will continue to work for larger problems. Iterative solvers will start to beat the direct solvers once you start getting into the 100k unknown range.
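Concretely, the swap is just the one card. In the same kind of hypothetical placeholder deck sketched earlier, it would look like this (only the two .options variants come from this thread; everything else is illustrative):

* hypothetical minimal deck using the Ifpack-preconditioned iterative solver
* .options LINSOL type=klu   (the direct-solver alternative, commented out here)
.options LINSOL use_ifpack_factory=1
V1 in 0 PULSE(0 1 0 1n 1n 10n 20n)
R1 in out 1k
C1 out 0 1p
.tran 1n 100n
.print tran v(out)
.end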
I’ll note that I experimented with two other preconditioners just now but they didn’t perform well. The BTF preconditioner was unable to find the BTF structure in the matrix. ShyLU (the hybrid-hybrid solver) also struggled, but I haven’t diagnosed why yet. I have hope that ShyLU can be made to work, but I’ll have to consult with someone about it.
The other comment I’ll make is that we have a threaded direct solver in the pipeline, called BASKER. It is almost always faster than KLU. It wasn’t quite “ready for prime time” in the 7.3 release, so it wasn’t officially there. But I was told pretty recently that BASKER is mature enough to pass our entire test suite, so it should become available soon.
👍 1
m
I can then change this in OpenRAM to select which one. Thanks!
It's very useful that Xyce prints these time summaries as well. I'm not sure why other tools don't.
👍 1