An Interpolation Perspective on Modern Machine Learning
Misha Belkin, Ohio State University
A striking feature of modern supervised machine learning is its consistent use of techniques that nearly interpolate the data. Deep networks often containing several orders of magnitude more parameters than data points and are trained to obtain near zero error on the training set. Yet, at odds with most theory, they show excellent test performance.
In this talk I will discuss and give some historical context for the phenomenon of interpolation (zero training loss). I will show how it provides a new perspective on machine learning and statistical inference, forcing us to rethink some commonly held assumptions and point to the significant gaps in our understanding, even in the simplest settings, of why such classifiers generalize well.
I will also demonstrate the power of this point of view by describing its advantages for optimization and showing how its simplicity can be used to construct very efficient and theoretically sound methods for training large-scale kernel machines.
Linear algebra support for scalable kernel methods
David Bindel, Cornell University
Gaussian processes (GPs) define a distribution over functions that generalizes the multivariate normal distribution over vector spaces. Long used as a tool for spatio-temporal statistical modeling, GPs are also a key part of the modern arsenal in machine learning. Unfortunately, GP regression and kernel hyper-parameter estimation with N training examples involve a dense N-by-N kernel matrix, and standard factorization approaches to the underlying linear algebra problems have O(N3) scaling. In this talk, we discuss recent work on more scalable kernel methods that draw on a combination of preconditioned Krylov subspace methods, stochastic estimators, and tricks with structured matrices.
This is joint work with Kun Dong, David Eriksson, Eric Lee, and Andrew Wilson.
Approximation theoretic advice for supervised learning
Paul Constantine, University of Colorado, Boulder
Algorithms for constructive approximation of functions share many features with supervised learning methods. For example, least-squares is a common step in both polynomial approximation and regression analysis. Also, kernel methods are a bridge between radial basis function approximation and Gaussian process regression. Although the algorithms are similar, analyses of constructive approximation methods and regression methods differ significantly. Moreover, the underlying processes that are assumed to generate the data set in each case (approximation versus regression) are quite different. Therefore, interpretations of results differ, too. These differences become apparent when we ask simple questions like: what do error and convergence mean in each context? Or how do practitioners from each field assess confidence in their results?
I will discuss essential concepts in approximation theory and contrast them with similar ideas from regression analysis with emphasis on practical consequences, i.e., how can complexity and error analysis in approximation theory inform computational strategies to improve supervised learning, and vice versa. I will explore implications for the related fields of "computer experiments" in statistics and "uncertainty quantification" in computational science, where the data sets are comprised of a set of evaluations of deterministic function representing a physics-based computer model and their associated physical inputs.
Here's the take-home: don't seek comfort in theoretical guarantees and interpretations if you cannot practically verify that the assumptions needed to derive those guarantees or interpretations are satisfied in your particular problem or data set.
How good is your classifier? Revisiting the role of evaluation metrics in machine learning
Sanmi Koyejo, University of Illinois
With the increasing integration of machine learning into real systems, it is crucial that trained models are optimized to reflect real world tradeoffs. Increasing interest in proper evaluation has led to a wide variety of metrics employed in practice, often specially designed by experts. However, modern training strategies have not kept up with the explosion of metrics, leaving practitioners to resort to heuristics. To address this shortcoming, I will present a simple, yet consistent post-processing rule which improves the performance of trained binary, multilabel, and multioutput classifiers. Building on these results, I will propose a framework for metric elicitation, which addresses the broader question of how one might select an evaluation metric for real world problems so that it reflects true preferences.
Optimization Methods for Machine-Learning
Sven Leyffer, Argonne National Labs
We review optimization methods that underpin machine-learning approaches. Our goal is to provide an overview of the connections between optimization and machine learning. We discuss the impact of nonsmooth optimization methods, splitting methods, gradient and Newton-type methods. We also examine the impact of different model formulations such as robust optimization, mixed-integer optimization, and complementarity constraints. Our goal is to highlight both the limitations and promises of optimization methodologies and formulations.
Learning and Geometry for Stochastic Dynamical Systems in high dimensions
Mauro Maggioni, John Hopkins University
We discuss geometry-based statistical learning techniques for performing model reduction and modeling of certain classes of stochastic high-dimensional dynamical systems. We consider two complementary settings. In the first one, we are given long trajectories of a system, e.g. from molecular dynamics, and we estimate, in a robust fashion, an effective number of degrees of freedom of the system, which may vary in the state space of then system, and a local scale where the dynamics is well-approximated by a reduced dynamics with a small number of degrees of freedom. We then use these ideas to produce an approximation to the generator of the system and obtain, via eigenfunctions of an empirical Fokker-Planck equation (constructed from data), reaction coordinates for the system that capture the large time behavior of the dynamics. We present various examples from molecular dynamics illustrating these ideas.
In the second setting we only have access to a (large number of expensive) simulators that can return short paths of the stochastic system, and introduce a statistical learning framework for estimating local approximations to the system, that can be (automatically) pieced together to form a fast global reduced model for the system, called ATLAS. ATLAS is guaranteed to be accurate (in the sense of producing stochastic paths whose distribution is close to that of paths generated by the original system) not only at small time scales, but also at large time scales, under suitable assumptions on the dynamics. We discuss applications to homogenization of rough diffusions in low and high dimensions, as well as relatively simple systems with separations of time scales, and deterministic chaotic systems in high-dimensions, that are well-approximated by stochastic diffusion-like equations.
Approximation using fixed sets of weight vectors in single hidden-layer networks
David Stewart, Professor of Mathematics, University of Iowa
Neural networks with a single hidden layer can be understood as giving approximations using sums of ridge functions. If we fix the set of weight vectors, as is done in the case of Extreme Learning Machines, then the problem of approximation by sums of ridge functions should give a lower bound on the number of hidden units needed in such systems. We are able to show that good approximations in general require a number of weight vectors that is at least quadratic in the dimension of the problem. Similar recommendations might be made for more conventional single hidden layer neural networks if the optimization over the weights before the activation function is not considered effective.
Graph reconstruction via discrete Morse theory
Yusu Wang, Ohio State University
Graphs form one of the most important types of data in various applications across science and engineering. They could be geometric in nature, such as road networks in GIS, or relational and abstract, such as protein-protein interaction networks. A fundamental problem is to reconstruct a hidden graph from different types of potentially noisy input data. In this talk, I will focus on a specific framework to reconstruct a hidden geometric graph from density data via a persistence-guided discrete Morse based framework. I will describe this framework using an application from automatic road network reconstruction. I will then provide theoretical understanding of this framework, including theoretical guarantees of the reconstruction when input data satisfies certain noise model.
This is joint work with various collaborators: Tamal K. Dey, Yanjie Li, Jiayuan Wang and Suyi Wang.
Robust Learning of Trimmed Estimators via Manifold Sampling
Stefan Wild, Argonne National Lab
We adapt a manifold sampling algorithm for the nonsmooth, nonconvex formulations of learning that arise when imposing robustness to outliers present in training data. We demonstrate the approach on objectives based on trimmed loss and highlight the challenges resulting from the nonsmooth, nonconvex paradigm needed for robust learning. Initial results demonstrate that the method has favorable scaling properties for robust training on trimmed linear regression and multiclass classification problems. Savings in time on large-scale problems arise at the expense of not certifying global optimality. We focus on the first-order, large-scale paradigm in the empirical studies, but the method also extends to cases where the loss is determined by a black-box oracle. Joint work with Matt Menickelly.
Conference funded by IMA conference grant, Purdue College of Science, Purdue Department of Mathematics and Purdue Department of Computer Science.