Workshop on Fast Direct Solvers, Purdue CCAM, 2016

Department of Mathematics, Purdue University, Nov 12-13, 2016

Program (abstracts below) - subject to change


Sat, Nov 12

7:50        Registration & Light Breakfast
8:20-8:30   Opening Remarks

Morning Session 1 (Chair: Jie Shen)
8:30-9:20   Yousef Saad: Low-Rank Correction Preconditioning Techniques
9:20-10:10  Gunnar Martinsson: Randomized Methods for Accelerating Direct Solvers
10:10-10:40 Coffee break

Morning Session 2 (Chair: Robert Skeel)
10:40-11:10 Eric Darve: Developing Preconditioners with Guaranteed Convergence Rate Using Hierarchical Matrices
11:10-11:40 Heather Wilber: Numerical Computing with Functions on the Sphere and Disk
11:40-12:10 Yuanzhe Xi: Fast Contour-Integral Preconditioner
12:10-1:30  Lunch

Afternoon Session 1 (Chair: Ahmed Sameh)
1:30-2:20   Tim Davis: GPU-Acceleration of Sparse QR and Cholesky Factorization
2:20-2:50   George Turkiyyah: Hierarchical Matrix Algorithms on GPUs
2:50-3:20   Sergey Solovyev: Two Versions of MPI-Based Direct Sparse Solver with Low Rank Approximation for Intermediate Data Compression
3:20-3:50   Coffee break

Afternoon Session 2 (Chair: Alex Pothen)
3:50-4:20   Adrianna Gillman: A Fast Direct Solver for Boundary Value Problems Involving Locally Perturbed Geometries
4:20-4:50   Bryan Quaife: An Efficient Preconditioner for the Fast Simulation of a 2D Stokes Flow in Porous Media
4:50-5:20   Yu-Hong Yeung: Updating Linear Systems of Equations with Augmented System Solvers: Surgery Simulations and Power Grid Contingency Analyses
5:20-5:50   Liang Zhao: Low-rank Approximation of a Matrix: Novel Insights and New Progress
5:50-6:20   Efstratios Gallopoulos: Classical Fast Poisson Solvers and Parallelism via Partial Fractions
6:30        Dinner

Sun, Nov 13

7:50        Registration & Light Breakfast

Morning Session 1 (Chair: Jianlin Xia)
8:30-9:20   David Keyes: Accelerating Cyclic Reduction with Hierarchical Matrix Operations on Distributed-Memory Machines
9:20-10:10  Victor Pan: Transformations of Matrix Structures and Fast Approximate Computations with Vandermonde and Cauchy Matrices
10:10-10:40 Coffee break

Morning Session 2 (Chair: Guang Lin)
10:40-11:10 Alex Townsend: Why Are So Many Matrices of Low Rank?
11:10-11:40 Grégoire Pichon: Sparse Supernodal Solver Using Hierarchical Compression
11:40-12:10 Daria Sushnikova: 'Compress and Eliminate' Solver for Sparse and Block Low-Rank Matrices
12:10-12:40 Yingwei Wang: Fast Structured Spectral Methods


Abstracts


Developing Preconditioners with Guaranteed Convergence Rate Using Hierarchical Matrices

Eric Darve, Stanford University


Hierarchical factorization methods can lead to efficient preconditioners with a prescribed tolerance $\epsilon$. However, to ensure that the preconditioned system has a bounded condition number and that the number of iterations of an iterative scheme remains small, we must in principle pick an $\epsilon$ smaller than the smallest eigenvalue of the matrix. As a result, for many problems $\epsilon$ may have to be chosen quite small, which generally makes these algorithms less efficient for large ill-conditioned problems. We will present a modified version of hierarchical factorization methods that does not have this limitation, and for which a guaranteed convergence rate can be obtained. This method only applies to certain types of PDEs. We will present results for elliptic PDEs.
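
As a rough illustration of why $\epsilon$ must undercut the smallest eigenvalue (a standard perturbation argument, not taken from the talk): if a symmetric positive definite $A$ is factored approximately as $M = A + E$ with $\|E\|_2 \le \epsilon$, then

$$M^{-1}A = I - M^{-1}E, \qquad \|M^{-1}E\|_2 \le \frac{\epsilon}{\lambda_{\min}(A) - \epsilon},$$

so the eigenvalues of the preconditioned matrix lie in an interval $[1-\delta,\,1+\delta]$ with $\delta = \epsilon/(\lambda_{\min}(A) - \epsilon)$, and the resulting bound $\kappa(M^{-1}A) \le (1+\delta)/(1-\delta)$ is useful only when $\epsilon$ is well below $\lambda_{\min}(A)$.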



GPU-Acceleration of Sparse QR and Cholesky Factorization

Tim Davis, Texas A&M University


GPUs provide excellent computational throughput for high performance computing, but they require us to rethink and redesign our algorithms. I will present two algorithms for sparse direct factorization that exploit one or more GPUs: sparse QR factorization and sparse Cholesky factorization.



A Fast Direct Solver for Boundary Value Problems Involving Locally Perturbed Geometries

Adrianna Gillman, Rice University


This talk presents a fast direct solver for boundary value problems where solutions are desired on geometries corresponding to local perturbations of an original geometry. Problems of this form arise in a variety of applications, such as optimal design and simulations of flagella-propelled movement. The proposed solution technique uses a block linear-system representation of the local perturbation which, thanks to a Woodbury formula, allows the inverse of the matrix for the original system to be reused for all local perturbations. The cost of building the direct solver for the locally perturbed geometries grows linearly with the number of unknowns. Since the original system matrix is amenable to HSS, HBS, $\mathcal{H}$-matrix, and similar solvers, the constant prefactor of the precomputation is much smaller than that of building a direct solver from scratch or updating the hierarchical data structure for each new geometry. Numerical results will illustrate the performance of the new direct solver.
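
For reference, the mechanism that allows the original factorization to be reused is the Woodbury identity (standard form; the block system in the talk is a variant of this): for a perturbation expressed through low-rank factors $U$, $C$, $V$,

$$(A + UCV)^{-1} = A^{-1} - A^{-1} U \left( C^{-1} + V A^{-1} U \right)^{-1} V A^{-1},$$

so each locally perturbed geometry requires only solves with the already-compressed $A$ plus the solution of a small dense system whose size is set by the extent of the perturbation.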



Accelerating Cyclic Reduction with Hierarchical Matrix Operations on Distributed-Memory Machines

David Keyes, Extreme Computing Research Center and Program in Applied Mathematics and Computational Science, KAUST


We present a fast direct solver for block tridiagonal linear systems arising from the discretization of 3D elliptic operators. The solver exploits algorithmic synergies between cyclic reduction and hierarchical matrix arithmetic to produce a method with $O(N \log^2 N)$ arithmetic complexity and $O(N \log N)$ memory footprint, with significant exploitable concurrency both in its outer loops and in its inner hierarchical compression operations. The algorithm can be seen as an alternative to multifrontal nested-dissection elimination: it uses regular, constant-sized fronts in multiple passes, eliminating half of the mesh nodes in each pass. This regularity and uniform structure allow the hierarchical compression to be done very efficiently and the computations to be distributed optimally in a distributed-memory setting. We provide a baseline for performance and applicability by comparing with an algebraic multigrid solver and with multifrontal solvers based on HSS matrices. Strong and weak scalability results show that the solver is competitive with state-of-the-art fast direct solvers on examples arising from the discretization of Poisson, Helmholtz, and convection-diffusion equations. [Joint work with Gustavo Chavez and George Turkiyyah]
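
Schematically, one step of block cyclic reduction applied to $E_i x_{i-1} + D_i x_i + F_i x_{i+1} = b_i$ eliminates the odd-indexed unknowns and leaves a half-size block tridiagonal system in the even-indexed ones, via the standard recurrences (in the solver described here, these block operations are carried out in hierarchical-matrix arithmetic, which is what keeps the cost and memory near-linear):

$$\begin{aligned} D_i' &= D_i - E_i D_{i-1}^{-1} F_{i-1} - F_i D_{i+1}^{-1} E_{i+1},\\ E_i' &= -E_i D_{i-1}^{-1} E_{i-1}, \qquad F_i' = -F_i D_{i+1}^{-1} F_{i+1},\\ b_i' &= b_i - E_i D_{i-1}^{-1} b_{i-1} - F_i D_{i+1}^{-1} b_{i+1}, \end{aligned}$$

for even $i$. After roughly $\log_2 k$ such passes, where $k$ is the number of block rows, a single block system remains.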



Randomized Methods for Accelerating Direct Solvers

Gunnar Martinsson, University of Colorado at Boulder


Methods based on randomized sampling have over the last several years proven to be powerful tools for computing low-rank approximations to matrices. In this talk, we describe how such techniques can be extended to certain "rank-structured" matrices, for which only certain off-diagonal blocks (not the matrix itself) admit accurate low-rank approximations.
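
As a concrete reference point, here is a minimal NumPy sketch of the basic randomized range finder that underlies these techniques (the function and parameter names are ours; a dense $A$ is used for illustration, whereas in the rank-structured setting the same sampling idea is applied to off-diagonal blocks through matrix-vector products with the whole matrix):

    import numpy as np

    def randomized_low_rank(A, k, p=10):
        """Rank-k approximation of A via randomized sampling.
        k: target rank; p: small oversampling parameter."""
        m, n = A.shape
        Omega = np.random.randn(n, k + p)   # Gaussian test matrix
        Y = A @ Omega                       # sample the range of A
        Q, _ = np.linalg.qr(Y)              # orthonormal basis for the samples
        B = Q.T @ A                         # project A onto the sampled range
        Ub, s, Vt = np.linalg.svd(B, full_matrices=False)
        return Q @ Ub[:, :k], s[:k], Vt[:k, :]

The key point is that $A$ is touched only through products such as $A\Omega$ and $Q^T A$, which is exactly the access pattern available when the matrix is rank-structured or given only implicitly.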

Time permitting, the talk will also describe some recent work on how machinery developed for solving elliptic PDEs can be used to build parallel-in-time integration techniques for certain hyperbolic PDEs.



Transformations of Matrix Structures and Fast Approximate Computations with Vandermonde and Cauchy Matrices

Victor Y. Pan, Lehman College of the City University of New York


Matrices with the structures of Toeplitz, Hankel, Vandermonde, and Cauchy types are omnipresent in modern computations in the sciences, engineering, and signal and image processing. These four matrix classes have distinct features, but in our 1990 paper in Mathematics of Computation we showed that Vandermonde and Hankel multipliers can be applied to transform each class into the others, and we then demonstrated how, by using these transforms, any successful matrix inversion algorithm can readily be extended from one of these classes to all the others.

The power of this approach was widely recognized later, when novel numerically stable algorithms solved nonsingular Toeplitz linear systems of equations in quadratic (versus classical cubic) arithmetic time by transforming the Toeplitz structure into a Cauchy structure. More recent papers combined the same transformation with a link between Cauchy matrices and the hierarchically semiseparable (HSS) matrix structure, a specialization of the matrix representations employed by the Fast Multipole Method. This produced numerically stable algorithms that approximate the solution of a nonsingular Toeplitz linear system of equations in nearly linear arithmetic time.

We extend the latter approximation algorithms from Toeplitz linear systems to the approximate multiplication of Vandermonde and Cauchy matrices by a vector, with potential extension to the approximate solution of Vandermonde and Cauchy linear systems of equations provided that they are nonsingular and well conditioned. The arithmetic cost of the known numerical approximation algorithms for these tasks decreases from quadratic to nearly linear, and similarly for computations with matrices whose structure generalizes the Vandermonde and Cauchy structures and for multipoint polynomial and rational evaluation. The details are in our papers: Transformations of Matrix Structures Work Again, Linear Algebra and Its Applications, 465:1–32, 2015, and Fast Approximate Computations with Cauchy Matrices and Polynomials, Mathematics of Computation, in press.
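
For readers less familiar with displacement structure, the transformations rest on Sylvester-type displacement equations (standard facts, recalled here in our notation): a Cauchy matrix $C = [1/(s_i - t_j)]_{i,j=1}^{n}$ satisfies

$$D_s C - C D_t = \mathbf{1}\mathbf{1}^{\top}, \qquad D_s = \operatorname{diag}(s_1,\dots,s_n), \quad D_t = \operatorname{diag}(t_1,\dots,t_n),$$

so its displacement rank is 1, while a Toeplitz matrix $T$ satisfies $Z_1 T - T Z_{-1} = G H^{\top}$ with $\operatorname{rank}(GH^{\top}) \le 2$, where $Z_f$ denotes the $f$-circulant down-shift matrix. Multiplication by DFT-type matrices maps one displacement equation into another, which is precisely the "transformation of matrix structures" of the title.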



Sparse Supernodal Solver Using Hierarchical Compression

Grégoire Pichon, LaBRI, INRIA Bordeaux-SO


In this talk, we present the PaStiX sparse supernodal solver, which uses hierarchical compression to reduce the burden of the large blocks appearing during the nested dissection process. To improve the efficiency of our sparse update kernel for both BLR (block low-rank) and HODLR (hierarchically off-diagonal low-rank) formats, we investigate the BDLR (boundary distance low-rank) method to preselect rows and columns in the low-rank approximation algorithm. We will also discuss ordering strategies to enhance data locality and compressibility.


An Efficient Preconditioner for the Fast Simulation of a 2D Stokes Flow in Porous Media

Bryan Quaife, Florida State University


We consider a boundary integral equation (BIE) formulation of the two-dimensional Stokes equations in a porous geometry. While BIEs are well suited for resolving the complex geometry, they lead to a dense linear system of equations that is computationally expensive to solve. A fast multipole accelerated iterative method is possible, but for complex geometries many iterations are required. Therefore, we apply the inverse fast multipole method (IFMM), which is based on the framework of $\mathcal{H}^2$ matrices, to precondition the linear system. We examine the effect of the preconditioner's tolerance and compare its efficacy with a block-diagonal preconditioner on several geometries. This is joint work with Eric Darve and Pieter Coulier.



Low-Rank Correction Preconditioning Techniques

Yousef Saad, University of Minnesota


This presentation will discuss a class of preconditioning methods for solving linear systems of equations that are based on exploiting low-rank approximations to certain matrices. These methods have a number of appealing features. Because they are essentially approximate inverse techniques, they handle indefiniteness quite well. Furthermore, they are amenable to SIMD computations such as those inherent to GPUs. The talk will first describe a few techniques geared toward a domain decomposition (DD) framework for a parallel computing environment. These techniques are all based on exploiting Schur complements and low-rank corrections. We will then see how to extend this DD approach by considering so-called "hierarchical interface decomposition" orderings, which are essentially algebraic generalizations of the "wirebasket" techniques used in domain decomposition methods. Finally, we will describe ongoing work to develop low-rank correction methods for highly indefinite systems such as those that arise from the Helmholtz or Maxwell equations.
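
Schematically, such DD-based preconditioners work with a $2 \times 2$ block splitting (a generic sketch in our notation; the particular corrections used in the talk may differ):

$$A = \begin{pmatrix} B & E \\ E^{\top} & C \end{pmatrix}, \qquad S = C - E^{\top} B^{-1} E,$$

where $B$ collects the interior unknowns of the subdomains and $S$ is the Schur complement on the interface. The low-rank correction idea is that $S^{-1} - C^{-1}$ is often numerically of low rank, so one preconditions with $S^{-1} \approx C^{-1} + U_k H_k U_k^{\top}$, where the rank-$k$ correction is computed from a few approximate eigenpairs.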



Two Versions of MPI-Based Direct Sparse Solver with Low Rank Approximation for Intermediate Data Compression

Sergey Solovyev, Victor Kostin, Institute of Petroleum Geology and Geophysics SB RAS, Russia
Hongwei Liu, EXPEC ARC, Geophysics Technology, Saudi Aramco


We compare left-looking and right-looking approaches to creating an MPI-based version of a multifrontal direct sparse solver with compression, which uses the HSS format for diagonal blocks and low-rank approximation of off-diagonal matrix blocks. The solver based on the left-looking approach consumes less memory but suffers from low scalability, a consequence of overloading the top nodes of the elimination tree. This issue is resolved in the right-looking version of the solver, which scales better, though it needs more memory.



'Compress and Eliminate' Solver for Sparse and Block Low-Rank Matrices

Daria Sushnikova, Russian Academy of Sciences


We propose a fast direct solver for sparse and hierarchical matrices [5], based on the idea that the inverse of a sparse or hierarchical matrix has an FMM structure [2,3].

Our goal is to build an approximate block-sparse factorization of a sparse or hierarchical matrix $A$. Under the assumption that the inverse matrix has $\mathcal{H}^2$ structure, we reduce fill-in in the factorization by an additional compression procedure. As a result, the algorithm produces the factorization $A = Q^{\top}LL^{\top}Q$, where $Q$ is an FMM-like unitary matrix and $L$ is a sparse block-triangular matrix.

A similar idea of using the block low-rank structure of the inverse matrix is implemented in the HLib package [4]; that algorithm has $\mathcal{O}(N)$ complexity, but with a large constant. The idea of compressing block rows was presented in [1], but those authors approximate the whole off-diagonal part of each block row, which leads to a large computational constant. Our approach, in contrast, is to compress only the ``far'' blocks.

The result is a robust, easy-to-implement $\mathcal{O}(N)$ sparse triangular factorization for sparse and hierarchical matrices.

[1] J. N. Chadwick and D. S. Bindel, An efficient solver for sparse linear systems based on rank-structured Cholesky factorization, arXiv preprint arXiv:1507.05593, 2015.
[2] L. Greengard and V. Rokhlin, A fast algorithm for particle simulations, J. Comput. Phys., 73(2):325–348, 1987.
[3] W. Hackbusch and S. Börm, H2-matrix approximation of integral operators by interpolation, Appl. Numer. Math., 43:129–143, 2002.
[4] W. Hackbusch, HLib package, http://www.hlib.org/.
[5] D. A. Sushnikova and I. V. Oseledets, "Compress and eliminate" solver for symmetric positive definite sparse matrices, arXiv preprint arXiv:1603.09133, 2016.



Why Are So Many Matrices of Low Rank?

Alex Townsend, Cornell University


In computational mathematics, the matrices and functions that appear in practice are surprisingly often of low rank. Since random ("average") matrices are almost surely of full rank, mathematics needs to explain the remarkable abundance of low-rank structures. In this talk, we will give a new characterization of low-rank matrices that relates singular values to Zolotarev's third problem. We use this and other techniques to explore why (1) droplets on a pond, (2) non-uniform sampling of functions, (3) elliptic PDEs, and (4) the world's flags all lead to low-rank objects. This talk is joint work with Bernhard Beckermann and Gil Strang.
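
The flavor of the characterization, stated loosely in our notation (this is the Beckermann-Townsend bound relating singular values to Zolotarev numbers; see the talk for precise hypotheses): if $X$ satisfies a Sylvester equation $AX - XB = F$ with $\operatorname{rank}(F) \le \nu$, where $A$ and $B$ are normal with spectra contained in disjoint sets $E$ and $G$, then

$$\sigma_{1+j\nu}(X) \le Z_j(E,G)\,\|X\|_2, \qquad j = 0, 1, 2, \dots,$$

where $Z_j(E,G)$ is the $j$-th Zolotarev number of the pair $(E,G)$. Whenever $Z_j(E,G)$ decays rapidly in $j$, the matrix $X$ has low numerical rank, and many matrices arising in practice satisfy such a Sylvester equation.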



Hierarchical Matrix Algorithms on GPUs

George Turkiyyah, Extreme Computing Research Center, KAUST and Department of Computer Science, American University of Beirut

The high throughput of GPUs makes them an attractive platform for executing expensive matrix computations, including operations on hierarchical matrices. However, the tree data structures used in hierarchical matrix representations and the recursive nature of the algorithms that operate on them are not a natural match for the architecture of manycore processors, and they make it difficult to obtain performance. In this talk, we describe how hierarchical matrix algorithms can be reformulated to expose more arithmetic intensity and more concurrency, features that are essential for obtaining performance on GPUs. In particular, we show how the trees representing the bases of the block rows and columns, and the quadtrees representing the coupling blocks, are flattened, and how batched linear algebra routines can be used level by level in the tree representations to exploit the GPU hardware optimally. The resulting algorithms, including matrix-vector multiplication, compression, hierarchical matrix generation from matrix-vector products, and matrix-matrix multiplication, show near-ideal execution performance relative to a roofline model.



Fast Structured Spectral Methods

Yingwei Wang, Purdue University

Spectral methods have been used extensively in numerical PDEs due to their higher order of accuracy compared to finite difference and finite element methods. However, while low-order methods usually lead to sparse linear systems, spectral methods often suffer from a large computational cost caused by dense matrices. In this talk, I will show that some of these dense matrices enjoy a hidden low-rank structure. This property can be exploited to dramatically reduce the computational cost and, thanks to structured matrices, yields direct solvers with nearly optimal complexity and memory. Furthermore, I will present a framework of fast structured spectral methods, including Jacobi transforms and spectral Galerkin and collocation methods. Both theoretical analysis and numerical experiments verify the efficiency and accuracy of the proposed methods.



Numerical Computing with Functions on the Sphere and Disk

Heather Wilber, Cornell University


By synthesizing a classic procedure known as the double Fourier sphere (DFS) method with a structure-preserving variant of Gaussian elimination, approximants to functions on the sphere and disk can be constructed that (1) preserve the bi-periodicity of the sphere, (2) are smooth over the poles of the sphere (and the origin of the disk), (3) allow for the use of FFT-based algorithms, and (4) are near-optimal in their underlying discretizations. This method is used to develop a suite of fast, scalable algorithms that exploit the low-rank form of the approximants to reduce many operations to essentially 1D procedures, including algorithms for differentiation, integration, and vector calculus. Combining these ideas with Fourier and ultraspherical spectral methods results in an optimal-complexity solver for Poisson's equation, which can be used to solve problems with $10^8$ degrees of freedom in just under a minute on a laptop computer. All of these algorithms have been implemented and are publicly available in the open-source computing system Chebfun.
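
For orientation, the DFS construction (recalled schematically; sign and shift conventions vary): a function $f(\lambda, \theta)$ on the sphere, with longitude $\lambda \in [-\pi, \pi]$ and colatitude $\theta \in [0, \pi]$, is extended to a bi-periodic function on $[-\pi, \pi]^2$ by "doubling" over the poles,

$$\tilde{f}(\lambda, \theta) = \begin{cases} f(\lambda, \theta), & \theta \in [0, \pi], \\ f(\lambda + \pi, -\theta), & \theta \in [-\pi, 0), \end{cases}$$

with $\lambda + \pi$ interpreted modulo $2\pi$. The doubled function can then be treated with 2D Fourier methods and the FFT; the structure-preserving Gaussian elimination mentioned above keeps the symmetry of $\tilde{f}$ intact, which is what guarantees smoothness over the poles.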



Fast Contour-Integral Preconditioner

Yuanzhe Xi, University of Minnesota


In this talk, we present a fast contour-integral preconditioner for linear systems with an indefinite sparse matrix $A$. By resorting to rational approximation techniques, the algorithm decomposes the spectrum of $A$ into two disjoint regions and approximates the restriction of $A^{-1}$ to each region separately. We show a systematic way to construct these rational functions so that they can be applied stably and inexpensively. An attractive feature of the proposed approach is that both the construction and the application of the preconditioner can exploit two levels of parallelism. Moreover, the preconditioner can be modified at negligible cost into a preconditioner for a nearby matrix of the form $A - cI$, which is useful in some applications. The efficiency and robustness of the proposed preconditioner are demonstrated on tests with challenging model problems, including problems arising from the Helmholtz equation in three dimensions. This is joint work with Yousef Saad.
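
The starting point is the standard contour-integral representation of the inverse (the specific rational approximations and spectrum splittings are the subject of the talk): if $\Gamma$ is a contour enclosing the spectrum of $A$ but not the origin, then

$$A^{-1} = \frac{1}{2\pi i} \oint_{\Gamma} z^{-1} (zI - A)^{-1}\, dz \;\approx\; \sum_{k} w_k (z_k I - A)^{-1},$$

where the quadrature nodes $z_k$ and weights $w_k$ define a rational approximation, so applying the preconditioner amounts to a short sum of shifted solves. For an indefinite $A$, the contour is split into pieces enclosing the positive and negative parts of the spectrum, which is the two-region decomposition mentioned above.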



Updating Linear Systems of Equations with Augmented System Solvers: Surgery Simulations and Power Grid Contingency Analyses

Yu-Hong Yeung and Alex Pothen, Department of Computer Science, Purdue University


Linear systems of equations that are incrementally updated arise in various real-world dynamic problems. Such updates may be needed in real time, as in surgery simulations, or there may be many updates to compute, as in power grid security analyses. In both cases a fast solver is needed for the modified systems so that accurate solutions can be obtained in a timely manner. Often the modifications reflected in the matrix are relatively small compared to the size of the original system.


We present fast algorithms for efficiently solving the modified systems without repeated factorizations of the matrix, while maintaining accuracy comparable to that of a direct solver. These algorithms are also faster and more robust than iterative solvers. We solve the systems by combining an augmented matrix approach, a hybrid linear equation solver (combining direct and Krylov subspace solvers), and sophisticated exploitation of sparsity in the matrices and vectors. Symmetry is also preserved if both the original and the modified systems are symmetric. Both the factors of the original matrix and the solutions to the original system are utilized in solving the modified systems efficiently. The time complexity of our algorithm is linear in the number of nonzeros in the factors of the initial matrix.
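
One common bordered formulation behind such augmented solvers (a generic sketch in our notation; the formulation in the talk handles more general modifications): for an update $A \to A + UV^{\top}$ with thin factors $U$ and $V$, solve

$$\begin{pmatrix} A & U \\ V^{\top} & -I \end{pmatrix} \begin{pmatrix} x \\ y \end{pmatrix} = \begin{pmatrix} b \\ 0 \end{pmatrix},$$

whose second block row gives $y = V^{\top} x$, so the first row yields $(A + UV^{\top})x = b$. The augmented system reuses the existing factors of $A$ instead of refactoring, and, unlike an explicit Woodbury update, it remains symmetric whenever $A$ is symmetric and $U = V$.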


We demonstrate the efficiency of these algorithms on two real-world problems: surgery simulations and power grid contingency analyses. Surgery simulators based on haptics and computer graphics have the potential to provide better training tools for surgery than the traditional approach of using animal models, which differ from human organs in their viscoelastic properties.


In astigmatism surgery, these simulators need to update images of the eye from ten to a hundred times per second to provide realistic visualization. We are able to provide tens to hundreds of updates per second for meshes with hundreds of thousands of elements. To the best of our knowledge, this is the first such result for meshes of this size.


The fast update problem also arises in the contingency analysis of electric power grids, when operators must pre-compute strategies for actions to take when a number of components in the grid fail. This problem is computationally intensive due to the large number of cases that must be analyzed. The augmented matrix approach makes it possible to successively update the solutions with increasing numbers of failed elements. Our experimental results show that our algorithms provide accurate solutions over a hundred times faster than a parallel direct solver for a synthetic power grid with over 700k buses. To the best of our knowledge, this is also the first such analysis for a power grid of this size.


This is joint work with Jessica Crouch (Old Dominion University), Mahantesh Halappanavar and Zhenyu Huang (Pacific Northwest National Laboratory).



Low-rank Approximation of a Matrix: Novel Insights and New Progress

Victor Y. Pan, Lehman College of the City University of New York
Liang Zhao, Graduate Center of the City University of New York


The celebrated algorithms for low-rank approximation of a matrix by means of random sampling have consistently performed well in numerous studies with various sparse and structured multipliers, but formal support for this empirical observation has so far been missing. Our new insight into this subject enables us to provide such elusive formal support.


Furthermore, our approach promises significant acceleration of the known algorithms by means of sampling with more efficient sparse and structured multipliers. It should also lead to enhanced performance of other fundamental matrix algorithms.