Fast Solvers 2016Purdue 
Program (with links to abstracts)  subject to change
Abstracts
Developing Preconditioners with Guaranteed Convergence Rate Using Hierarchical Matrices Eric Darve, Stanford University
Hierarchical factorization methods can lead to efficient preconditioners with a prescribed tolerance epsilon. However, if we want to ensure that the preconditioned system has a bounded condition number and that the number of iterations in an iterative scheme remains small, we need in principle to pick an epsilon that is smaller than the smallest eigenvalue of the matrix. As a result, for many problems, epsilon may have to be chosen quite small, and generally this makes these algorithms less efficient for large illconditioned problems. We will present a modified version of hierarchical factorization methods which does not have this limitation, and for which a guaranteed convergence can be obtained. This method only applies to certain types of PDEs. We will present results for elliptic PDEs.
GPUAcceleration of Sparse QR and Cholesky Factorization Tim Davis, Texas A&M University
GPUs provide excellent computational throughput for high performance computing, but they require us to rethink and redesign our algorithms. I will present two algorithms for sparse direct factorization that exploit one or multiple GPUs: sparse QR and Cholesky factorization. A Fast Direct Solver for Boundary Value Problems Involving Locally Perturbed Geometries Adrianna Gillman, Rice University This talk presents a fast direct solver for boundary value problems where solutions are desired for geometries corresponding to local perturbations of the original geometry. Problems of this form arise in a variety of applications such as optimal design and flagella propelled movement simulations. The proposed solution technique utilizes a block linear system representation of the local perturbation which, thanks to a Woodbury formula, allows for the inverse of the matrix for the original system to be reused for all local perturbations. The cost of building the direct solver for the locally perturbed geometries grows linearly with the number of unknowns. Since the original system matrix is amenable to HSS, HBS, Hmatrix, etc. solvers, the constant prefactor for the precomputation is much smaller than building as direct solver from scratch or updating the hierarchical data structure for each new geometry. Numerical results will illustrate the performance of the new direct solver. Accelerating Cyclic Reduction with Hierarchical Matrix Operations on DistributedMemory Machines David Keyes, Extreme Computing Research Center and Program in Applied Mathematics and Computational Science, KAUST
We present a fast direct solver for block tridiagonal linear systems arising from the discretization of 3D elliptic operators. The solver exploits algorithmic synergies between cyclic reduction and hierarchical matrix arithmetic operations to produce a solver with $O(N \log^2 N)$ arithmetic complexity and $O(N \log N)$ memory footprint, with significant exploitable concurrency both in its outer loops and inner hierarchical compression operations. The algorithm can be seen as an alternative to multifrontal nested dissection elimination, which uses regular constantsized fronts in multiple passes to eliminate half the mesh nodes in each pass. This regularity and uniform structure allow the hierarchical compression to be done very efficiently and the computations to be distributed optimally in a distributedmemory setting. We provide a baseline for performance and applicability by comparing with an algebraic multigrid solver and multifrontal solvers with HSS matrices. Strong and weak scalability results show that the solver is competitive with stateoftheart fast direct solvers on examples arising from the discretization of Poisson, Helmholtz, and convectiondiffusion equations. [Work with Gustavo Chavez and George Turkiyyah]
Randomized Methods for Accelerating Direct Solvers Gunnar Martinsson, University of Colorado at Boulder
Methods based on randomized sampling have over the last several years proven to be powerful tools for computing lowrank approximations to matrices. In this talk, we describe how such techniques can be extended to certain "rankstructured" matrices, for which only certain offdiagonal blocks (not the matrix itself) admit accurate lowrank approximations.
Transformations of Matrix Structures and Fast Approximate Computations with Vandermonde and Cauchy Matrices Victor Y. Pan, Lehman College of the City University of New York
Matrices with the structures of Toeplitz, Hankel, Vandermonde and Cauchy types are omnipresent in modern computations in Sciences, Engineering, and Signal and Image Processing. These four matrix classes have distinct features, but in our 1990 paper in Mathematics of Computation we showed that Vandermonde and Hankel multipliers can be applied to transform each class into the others, and then we demonstrated how by using these transforms we can readily extend any successful matrix inversion algorithm from one of these classes to all the others. The power of this approach was widely recognized later, when novel numerically stable algorithms solved nonsingular Toeplitz linear systems of equations in quadratic (versus classical cubic) arithmetic time based on transforming Toeplitz into Cauchy matrix structures. More recent papers combined the same transformation with a link of the Cauchy matrices to the Hierarchical Semiseparable matrix structure, which is a specialization of matrix representations employed by the Fast Multipole Method. This produced numerically stable algorithms that approximated the solution of a nonsingular Toeplitz linear system of equations in nearly linear arithmetic time. We extend the latter approximation algorithms for Toeplitz linear systems to approximate multiplication of Vandermonde and Cauchy matrices by a vector with potential extension to approximate solution of Vandermonde and Cauchy linear systems of equations provided that they are nonsingular and wellconditioned. The arithmetic cost of the known numerical approximation algorithms for these tasks decreases from quadratic to nearly linear, and similarly for computations with matrices whose structure generalizes the structures of Vandermonde and Cauchy matrices and for multipoint polynomial and rational evaluation. The details are in our papers Transformations of Matrix Structures Work Again, Linear Algebra and Its Applications, 465, 1–32, 2015, and Fast Approximate Computations with Cauchy Matrices and Polynomials, Math. of Computation, in print.
Sparse Supernodal Solver Using Hierarchical Compression Grégoire Pichon, LaBRI, INRIA BordeauxSO
In this talk, we present the PaStiX sparse supernodal solver, using hierarchical compression to reduce the burden on large blocks appearing during the nested dissection process. To improve the efficiency of our sparse update kernel for both BLR (block low rank) and HODLR (hierarchically offdiagonal lowrank), we investigate to BDLR (boundary distance lowrank) method to preselect rows and columns in the lowrank approximation algorithm. We will also discuss ordering strategies to enhance data locality and compressibility. An Efficient Preconditioner for the Fast Simulation of a 2D Stokes Flow in Porous Media Bryan Quaife, Florida State University
We consider a boundary integral equation (BIE) formulation of the twodimensional Stokes equations in a porous geometry. While BIEs are wellsuited for resolving the complex geometry, they lead to a dense linear system of equations that is computationally expensive to solve. A fast multipole accelerated iterative method is possible, but for complex geometries, many iterations are required. Therefore, we apply the inverse fast multipole method (IFMM), which is based on the framework of $\mathcal{H}^2$ matrices, to precondition the linear system. We examine the effect of the preconditioners tolerance and compare its efficacy with a blockdiagonal preconditioner on several geometries. This is joint work with Eric Darve and Pieter Coulier. LowRank Correction Preconditioning Techniques Yousef Saad, University of Minnesota This presentation will discuss a class of preconditioning methods for solving linear systems of equations that are based on exploiting lowrank approximations to certain matrices. These methods have a number of appealing features. Because they are essentially approximate inverse techniques, they handle indefiniteness quite well. Furthermore, they are amenable to SIMD comptuations such those inherent to GPUs. The talk will first describe a few techniques geared toward a Domain Decomposition framework for a parallel computing environment. These techniques are all based on exploiting Schur complements and lowrank corrections. We will then see how to extend this DD approach by considering socalled `hierarchical interface decomposition orderings' which are essentially algebraic generalizations of `wirebaskets' techniques used in Domain Decomposition methods. Finally, we will describe ongoing work to develop lowrank correction methods for highly indefinite systems such as those that arise from Helmholtz or Maxwell equations.
Two Versions of MPIBased Direct Sparse Solver with Low Rank Approximation for Intermediate Data Compression
Sergey Solovyev, Victor Kostin, Institute of Petroleum
Geology and Geophysics SB RAS, Russia We compare LeftLooking and RightLooking approaches to create an MPIbased version of multifrontal direct sparse solver with compression based on use of HSS format for diagonal blocks and lowrank approximation of offdiagonal matrix blocks. The solver based on Left –Looking approach has less memory consumptions but suffers from low scalability, which is a consequence of overloading of the elimination tree top nodes. The issue is resolved in the RightLooking version of the solver, and it scales better though needs more memory.
'Compress and Eliminate' Solver for Sparse and Block LowRank Matrices Daria Sushnikova, Russian Academy of Sciences
We propose fast direct sparse or hierarchical solver
[5], based on the idea that inversion of sparse and hierarchical matrix
has FMM structure [3,2].
Why Are So Many Matrices of Low Rank? Alex Townsend, Cornell University In computational mathematics, matrices and functions that appear in practice are so often of low rank. Since random ("average") matrices are almost surely of full rank, mathematics needs to explain the remarkable abundance of low rank structures. In this talk, we will give a new characterization of low rank matrices that relates singular values to Zolotarev's third problem. We use this and other techniques to explore why (1) Droplets on a pond, (2) Nonequally sampling of functions, (3) Elliptic PDEs, and (4) The world flags, all lead to low rank objects. This talk is joint work with Bernhard Beckermann and Gil Strang. Hierarchical Matrix Algorithms on GPUs George Turkiyyah, Extreme Computing Research Center, KAUST and Department of Computer Science, American University of BeirutThe high throughput of GPUs makes them an attractive platform for executing expensive matrix computations, including operations on hierarchical matrices. However, the tree data structures used in hierarchical matrix representations and the recursive nature of the algorithms that operate on them are not a natural match to the architecture of manycore processors and make it difficult to obtain performance on them. In this talk, we describe how hierarchical matrix algorithms can be reformulated to express more arithmetic intensity and more concurrencyfeatures that are essential for obtaining performance on GPUs. In particular, we show how the trees representing the bases of the block rows and columns, and the quadtrees representing the coupling blocks are flattened, and how batched linear algebra routines can be used levelbylevel in the tree representations to exploit the GPU hardware optimally. The resulting algorithms, including matrixvector multiplication, compression, hierarchical matrix generation from matrixvector products, and matrixmatrix multiplication, show nearideal execution performance on a roofline model. Fast Structured Spectral Methods Yingwei Wang, Purdue UniversitySpectral methods have been used extensively in numerical PDEs due to their higher order of accuracy when compared to finite differences and finite elements methods. However, while other low order methods usually lead to sparse linear systems, spectral methods often suffer from a huge computational complexity caused by dense matrices. In this talk, I will show that some of these dense matrices enjoy a hidden lowrank structure. This property could be exploited to dramatically reduce the computational cost and give birth to direct solvers with nearly optimal complexity and memory, thanks to structured matrices. Furthermore, I will present a framework of fast structured spectral methods, including Jacobi transforms, spectral Galerkin and collocation methods. Both theoretical analysis and numerical experiments verify the efficiency and accuracy of our proposed methods. Numerical Computing with Functions on the Sphere and Disk Heather Wilber, Cornell University By synthesizing a classic procedure known as the double Fourier sphere (DFS) method with a structurepreserving variant of Gaussian elimination, approximants to functions on the sphere and disk can be constructed that (1) preserve the biperiodicity of the sphere, (2) are smooth over the poles of the sphere (and origin of the disk), (3) allow for the use of FFTbased algorithms, and (4) are nearoptimal in their underlying discretizations. This method is used to develop a suite of fast, scalable algorithms that exploit the low rank form of approximants to reduce many operations to essentially 1D procedures. This includes algorithms for differentiation, integration, and vector calculus. Combining these ideas with Fourier and ultraspherical spectral methods results in an optimal complexity solver for Poisson’s equation, which can be used to solve problems with 10^8 degrees of freedom in just under a minute on a laptop computer. All of these algorithms have been implemented and are publicly available in the opensource computing system called Chebfun.
Fast ContourIntegral Preconditioner Yuanzhe Xi, University of Minnesota In this talk, we present a fast contourintegral preconditioner for linear systems with indefinite sparse matrices A. By resorting to rational approximation techniques, the algorithm decomposes the spectrum of A into two disjoint regions and approximates the restriction of A^{1} on these regions separately. We show a systematic way to construct these rational functions so that they can be applied stably and inexpensively. An attractive feature of the proposed approach is that the construction and application of the preconditioner can exploit two levels of parallelism. Moreover, the proposed preconditioner can be modified at a negligible cost into a preconditioner for a nearby matrix of the form AcI, which can be useful in some applications. The efficiency and robustness of the proposed preconditioner are demonstrated on a few tests with challenging model problems, including problems arising from the Helmholtz equation in three dimensions. This is joint work with Yousef Saad.
Updating Linear Systems of Equations with Augmented System Solvers: Surgery Simulations and Power Grid Contingency Analyses YuHong Yeung and Alex Pothen, Department of Computer Science, Purdue University
Lowrank Approximation of a Matrix: Novel Insights and New Progress Victor Y. Pan, Lehman College of the City University of New York
