# xyce
r
Is there any extant guidance on mpirun options for optimizing Xyce performance?
m
This will really depend on the circuit... If you look up the Xyce papers, they give results with varying parallelism.
r
I've checked most of the documents on Sandia's documentation page and didn't see that kind of guidance. Do you happen to remember the paper title or some of the keywords?
m
I just ran my design with a few options and picked the best. Usually the returns diminish after a number of threads. There were a number of papers, but I don't recall which one specifically: "Eric Keiter - Google Scholar" https://scholar.google.com/citations?user=1oBZpMQAAAAJ&hl=en
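For reference, a rough sketch of that kind of sweep, assuming the usual `mpirun -np N Xyce netlist.cir` invocation and a hypothetical `mycircuit.cir` netlist (adjust if your installation launches Xyce differently):

```python
import subprocess
import time

NETLIST = "mycircuit.cir"  # hypothetical netlist name
PROCESS_COUNTS = [1, 2, 4, 8, 16]

# Time a full Xyce run at each MPI process count so the point of
# diminishing returns is easy to spot.
for n in PROCESS_COUNTS:
    cmd = ["mpirun", "-np", str(n), "Xyce", NETLIST]
    start = time.time()
    result = subprocess.run(cmd, capture_output=True, text=True)
    elapsed = time.time() - start
    status = "ok" if result.returncode == 0 else f"failed ({result.returncode})"
    print(f"np={n:3d}  wall time={elapsed:8.1f}s  {status}")
```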
e
@Robert Rogers There is some guidance in the users guide, chapter 10, entitled “Guidance for Running Xyce in Parallel”.
In general, the best options (and best number of MPI processes) depend a lot on the size of the problem and (for really large problems) the structure.
If the problem size is really small, like just a handful of devices, then you won't benefit from parallel.
m
But you will still benefit from the awesome parser 🙂
(over other sims)
e
Ah, yes, the parser is more efficient than spice3-based codes.
For modest-sized problems (up to maybe 50,000 unknowns), the "parallel load, serial solve" option works best. In that case the device evaluations are done in parallel and the solve is done with a serial direct solver. Since half the problem is still serial in this case, there are limits on how much speedup you can get, but you will get real speedup.
For truly large problems, it gets a bit more tricky. Direct solvers start to die once you get into the hundreds of thousands of unknowns, so the only option then is to use an iterative solver instead. Iterative solvers are much easier to make parallel because there is less communication, but they are much less robust.
We have a bunch of different solver options for that case, which are described in the guide.
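(In case it helps to see it concretely, here is a sketch of how one might prepare netlist variants for those two regimes. The `.OPTIONS LINSOL TYPE=...` syntax and the KLU / AZTECOO type names are my recollection of the users and reference guides, so verify them against your Xyce version; the netlist name is hypothetical.)

```python
from pathlib import Path

# Assumed option names (check the reference guide for your version):
#   TYPE=KLU      -> serial direct solve, i.e. the "parallel load, serial solve" mode
#   TYPE=AZTECOO  -> parallel iterative solve, for very large problems
def write_linsol_variant(netlist: str, solver_type: str) -> Path:
    """Write a copy of the netlist with an .OPTIONS LINSOL line inserted before .END."""
    src = Path(netlist)
    out = []
    for line in src.read_text().splitlines():
        if line.strip().upper() == ".END":
            out.append(f".OPTIONS LINSOL TYPE={solver_type}")
        out.append(line)
    dst = src.with_name(f"{src.stem}_{solver_type.lower()}.cir")
    dst.write_text("\n".join(out) + "\n")
    return dst

write_linsol_variant("mycircuit.cir", "KLU")      # modest problem sizes
write_linsol_variant("mycircuit.cir", "AZTECOO")  # hundreds of thousands of unknowns
```

Each variant is then launched with mpirun as usual and the timings compared.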
As to the optimal number of processors for your circuit, that is often just going to be determined via experiments. But remember these are MPI processes, not (for example) GPU threads. So they aren’t small.
r
I did see the section in the User Guide, but was specifically wondering about process-to-MPI-"slot" mapping and the like. I think the takeaway I'm getting is that it's very problem-specific
e
Ah. Yes, that issue is fairly problem and hardware dependent.
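For placement experiments specifically, a sketch using Open MPI's `--map-by` / `--bind-to` flags (these spellings are Open MPI-specific; MPICH and vendor launchers differ), reusing the hypothetical netlist from the earlier sweep:

```python
import subprocess
import time

def time_run(cmd):
    """Run a command and return wall-clock seconds, discarding its output."""
    start = time.time()
    subprocess.run(cmd, capture_output=True)
    return time.time() - start

NETLIST = "mycircuit.cir"  # hypothetical netlist name

# A few Open MPI placement policies worth comparing at a fixed process count.
placements = {
    "pack onto cores":     ["--map-by", "core", "--bind-to", "core"],
    "spread over sockets": ["--map-by", "socket", "--bind-to", "socket"],
    "no binding":          ["--bind-to", "none"],
}
for label, flags in placements.items():
    secs = time_run(["mpirun", "-np", "8", *flags, "Xyce", NETLIST])
    print(f"{label:22s} {secs:8.1f}s")
```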
r
I had an inkling. Coming from the world of field-domain solvers, I was curious about possible rules of thumb. For instance, in FEM-based field solvers you can break up physical regions to run roughly in parallel via domain decomposition.
e
We have completely separate parallel partitioning for the two phases I mentioned. For device evaluation the default behavior is to simply distribute devices in a "first come, first served" manner, i.e., in the order they appear in the input file. There are other device partitioning options, which attempt to group devices logically for load balance.
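A toy sketch (not Xyce internals, just an illustration with made-up device names) of why that choice matters for load balance:

```python
# Toy illustration: contrast handing devices to MPI ranks in netlist order
# with grouping them so expensive device evaluations are spread evenly.
devices = ["R1", "R2", "C1", "M1", "M2", "M3", "M4", "D1"]  # hypothetical netlist order
nranks = 2

# "First come, first served": contiguous chunks in input-file order.
chunk = (len(devices) + nranks - 1) // nranks
by_order = [devices[i * chunk:(i + 1) * chunk] for i in range(nranks)]

# A "logical" grouping might balance the costly devices (say, the MOSFETs)
# across ranks so no one rank does all the expensive model evaluations.
expensive = [d for d in devices if d.startswith("M")]
cheap = [d for d in devices if not d.startswith("M")]
balanced = [expensive[r::nranks] + cheap[r::nranks] for r in range(nranks)]

print("file order:", by_order)
print("balanced:  ", balanced)
```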
For the linear solve (when doing parallel linear solves) we use a hypergraph partitioning scheme. But how exactly it is used depends on exactly which preconditioner is used. The hypergraph partitioning may be applied to blocks of the matrix, or the individual elements.
r
Which parts of the Trilinos backend deal with preconditioners? I should probably start playing around with alternatives at some point.
e
It depends a bit on which solver is being used. If using the Aztec GMRES solver, then I believe it uses IFPACK under the covers.
In general, preconditioning strategies for circuits (in our experience) usually require an overall strategy that includes multiple stages. So, usually just plugging in a Trilinos preconditioner by itself isn't likely to be sufficient. Most of the common preconditioners developed in Trilinos are oriented towards PDE finite-element problems. Circuits don't have the nice spatial structure found in those problems.
Other crucial layers (besides the preconditioner itself) include singleton removal and various reordering routines.