# xyce
r
Is there any extant guidance on mpirun options for optimizing Xyce performance?
m
This will really depend on the circuit... If you look up the Xyce papers, they give results with varying parallelism.
r
I've checked most of the documents on Sandia's documentation page and didn't see that kind of guidance. Do you happen to remember the paper title or some of the keywords?
m
I just ran my design with a few options and picked the best. Usually the returns diminish after a number of threads. There were a number of papers, but I don't recall which one specifically: "Eric Keiter - Google Scholar" https://scholar.google.com/citations?user=1oBZpMQAAAAJ&hl=en
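For reference, a rough sketch of that kind of sweep, assuming the usual `mpirun -np N Xyce netlist.cir` invocation and a hypothetical `mycircuit.cir` netlist (adjust if your installation launches Xyce differently):

```python
import subprocess
import time

NETLIST = "mycircuit.cir"  # hypothetical netlist name
PROCESS_COUNTS = [1, 2, 4, 8, 16]

# Time a full Xyce run at each MPI process count so the point of
# diminishing returns is easy to spot.
for n in PROCESS_COUNTS:
    cmd = ["mpirun", "-np", str(n), "Xyce", NETLIST]
    start = time.time()
    result = subprocess.run(cmd, capture_output=True, text=True)
    elapsed = time.time() - start
    status = "ok" if result.returncode == 0 else f"failed ({result.returncode})"
    print(f"np={n:3d}  wall time={elapsed:8.1f}s  {status}")
```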
e
@Robert Rogers There is some guidance in the users guide, chapter 10, entitled “Guidance for Running Xyce in Parallel”.
In general, the best options (and best number of MPI processes) depend a lot on the size of the problem and (for really large problems) the structure.
If the problem size is really small, like just a handful of devices, then you won't benefit from parallel.
m
But you will still benefit from the awesome parser 🙂
(over other sims)
e
Ah, yes, the parser is more efficient than spice3-based codes.
For modest-sized problems (up to maybe 50,000 unknowns), the "parallel load, serial solve" option works best. In that case the device evaluations are done in parallel and the solve is done with a serial direct solver. Since half the problem is still serial in this case, there are limits on how much speedup you can get, but you will get real speedup.
For truly large problems, it gets a bit more tricky. Direct solvers start to die once you get into the hundreds of thousands of unknowns, so the only option then is to use an iterative solver instead. Iterative solvers are much easier to make parallel because there is less communication, but they are much less robust.
We have a bunch of different solver options for that case, which are described in the guide.
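(In case it helps to see it concretely, here is a sketch of how one might prepare netlist variants for those two regimes. The `.OPTIONS LINSOL TYPE=...` syntax and the KLU / AZTECOO type names are my recollection of the users and reference guides, so verify them against your Xyce version; the netlist name is hypothetical.)

```python
from pathlib import Path

# Assumed option names (check the reference guide for your version):
#   TYPE=KLU      -> serial direct solve, i.e. the "parallel load, serial solve" mode
#   TYPE=AZTECOO  -> parallel iterative solve, for very large problems
def write_linsol_variant(netlist: str, solver_type: str) -> Path:
    """Write a copy of the netlist with an .OPTIONS LINSOL line inserted before .END."""
    src = Path(netlist)
    out = []
    for line in src.read_text().splitlines():
        if line.strip().upper() == ".END":
            out.append(f".OPTIONS LINSOL TYPE={solver_type}")
        out.append(line)
    dst = src.with_name(f"{src.stem}_{solver_type.lower()}.cir")
    dst.write_text("\n".join(out) + "\n")
    return dst

write_linsol_variant("mycircuit.cir", "KLU")      # modest problem sizes
write_linsol_variant("mycircuit.cir", "AZTECOO")  # hundreds of thousands of unknowns
```

Each variant is then launched with mpirun as usual and the timings compared.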
As to the optimal number of processors for your circuit, that is often just going to be determined via experiments. But remember these are MPI processes, not (for example) GPU threads. So they aren’t small.
r
I did see the section in the User Guide, but was specifically wondering about process-to-MPI-"slot" mapping and the like. I think the takeaway I'm getting is that it's very problem-specific
e
Ah. Yes, that issue is fairly problem and hardware dependent.
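For placement experiments specifically, a sketch using Open MPI's `--map-by` / `--bind-to` flags (these spellings are Open MPI-specific; MPICH and vendor launchers differ), reusing the hypothetical netlist from the earlier sweep:

```python
import subprocess
import time

def time_run(cmd):
    """Run a command and return wall-clock seconds, discarding its output."""
    start = time.time()
    subprocess.run(cmd, capture_output=True)
    return time.time() - start

NETLIST = "mycircuit.cir"  # hypothetical netlist name

# A few Open MPI placement policies worth comparing at a fixed process count.
placements = {
    "pack onto cores":     ["--map-by", "core", "--bind-to", "core"],
    "spread over sockets": ["--map-by", "socket", "--bind-to", "socket"],
    "no binding":          ["--bind-to", "none"],
}
for label, flags in placements.items():
    secs = time_run(["mpirun", "-np", "8", *flags, "Xyce", NETLIST])
    print(f"{label:22s} {secs:8.1f}s")
```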
r
I had an inkling. Coming from the world of field-domain solvers, I was curious about possible rules of thumb. For instance, in FEM-based field solvers you can break up physical regions to run roughly in parallel via domain decomposition.
e
We have completely separate parallel partitioning for the two phases I mentioned. For device evaluation the default behavior is to simply distribute devices in a "first come, first served" manner, i.e., in the order they appear in the input file. There are other device partitioning options, which attempt to group devices logically for load balance.
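A toy sketch (not Xyce internals, just an illustration with made-up device names) of why that choice matters for load balance:

```python
# Toy illustration: contrast handing devices to MPI ranks in netlist order
# with grouping them so expensive device evaluations are spread evenly.
devices = ["R1", "R2", "C1", "M1", "M2", "M3", "M4", "D1"]  # hypothetical netlist order
nranks = 2

# "First come, first served": contiguous chunks in input-file order.
chunk = (len(devices) + nranks - 1) // nranks
by_order = [devices[i * chunk:(i + 1) * chunk] for i in range(nranks)]

# A "logical" grouping might balance the costly devices (say, the MOSFETs)
# across ranks so no one rank does all the expensive model evaluations.
expensive = [d for d in devices if d.startswith("M")]
cheap = [d for d in devices if not d.startswith("M")]
balanced = [expensive[r::nranks] + cheap[r::nranks] for r in range(nranks)]

print("file order:", by_order)
print("balanced:  ", balanced)
```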
For the linear solve (when doing parallel linear solves) we use a hypergraph partitioning scheme. But how exactly it is used depends on exactly which preconditioner is used. The hypergraph partitioning may be applied to blocks of the matrix, or the individual elements.
r
Which parts of the Trilinos backend deal with preconditioners? I should probably start playing around with alternatives at some point.
e
It depends a bit on which solver is being used. If using the Aztec GMRES solver, then I believe it uses IFPACK under the covers.
In general, preconditioning strategies for circuits (in our experience) usually require an overall strategy that includes multiple stages. So, usually just plugging in a Trilinos preconditioner by itself isn't likely to be sufficient. Most of the common preconditioners developed in Trilinos are oriented towards PDE finite-element problems. Circuits don't have the nice spatial structure found in those problems.
Other crucial layers (besides the preconditioner itself) include singleton removal and various reordering routines.