Abstract
Estimation and prediction with "contaminated" data is a classical problem, considered in Huber's seminal work of the 1960s. This set-up has regained much attention in the modern literature on robust statistics and machine learning. A modern take on the statistical optimality of "robust" estimators concerns the "best" convergence rate in terms of the sample size, effective dimension, contamination fraction and, sometimes, the failure probability, restricted to computationally tractable estimators. In this talk we consider the least-squares regression problem with a linear subgaussian class, including the setting where a fraction of the label sample is contaminated by adversarial outliers. Our contributions to this problem span several perspectives. First, we study outlier-robust high-dimensional least squares when the parameter is sparse or low-rank, and we establish computationally tractable estimators that are optimal adaptively to the sparsity/low-rankness and the contamination fraction. Second, the literature on high-dimensional least squares typically assumes the noise is independent of the features. We present a statistical theory that allows for heterogeneous, feature-dependent noise without extra structural assumptions. A third concern is optimality with respect to the failure probability. Most rates for this problem are optimal only "on average" but suboptimal with respect to the failure probability. We present new optimal rates for an estimator whose tuning is adaptive to the failure probability, achieving optimal rates uniformly over all confidence levels. Our estimator is based on a new "Sorted Huber loss", which our numerical experiments show can significantly outperform the classical Huber loss. Fourth, we consider the problem of trace regression with matrix decomposition. While this problem is natural in high-dimensional statistics and compressed sensing, to the best of our knowledge no optimality theory had been presented before.
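To fix ideas, a minimal sketch of the setting and losses involved (the contaminated model below is one standard formulation; the precise form of the Sorted Huber loss used in the talk is not specified in the abstract, so the sorted variant shown here is an illustrative analogue of sorted-\(\ell_1\) penalization, not necessarily the authors' definition):

```latex
% Label-contaminated linear model: an o-sparse vector \theta^* corrupts
% a fraction o/n of the labels adversarially.
y_i = \langle x_i, b^* \rangle + \xi_i + \sqrt{n}\,\theta_i^*, \qquad i = 1, \dots, n,
\qquad \|\theta^*\|_0 \le o.

% Classical Huber loss with threshold \tau > 0:
\rho_\tau(u) =
\begin{cases}
  u^2/2, & |u| \le \tau,\\
  \tau |u| - \tau^2/2, & |u| > \tau.
\end{cases}

% Illustrative "sorted" variant (hypothetical form): apply decreasing
% thresholds \tau_1 \ge \cdots \ge \tau_n to the residuals sorted in
% decreasing absolute value, |r(b)|_{(1)} \ge \cdots \ge |r(b)|_{(n)}:
L(b) = \sum_{i=1}^n \rho_{\tau_i}\!\bigl(|r(b)|_{(i)}\bigr),
\qquad r_i(b) = y_i - \langle x_i, b \rangle.
```

The sorted construction penalizes the largest residuals most leniently through smaller effective quadratic regions, in the spirit of SLOPE-type sorted penalties; again, this is a sketch under stated assumptions rather than the estimator's exact definition.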
Our statistical theory relies crucially on new concentration inequalities for the multiplier process and product processes, derived via Talagrand-type generic chaining techniques. Chevet's inequality is also essential in showing optimality of a robust estimator under label contamination.
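For readers unfamiliar with the terminology, the two empirical processes mentioned above are, in standard notation (over a function class \(F\), with multipliers \(\xi_i\) and i.i.d. samples \(X_i\)):

```latex
% Multiplier process: indexed by f \in F.
f \;\longmapsto\; \frac{1}{\sqrt{n}} \sum_{i=1}^n
  \bigl( \xi_i f(X_i) - \mathbb{E}[\xi f(X)] \bigr).

% Product process: indexed by pairs (f, g) \in F \times G.
(f, g) \;\longmapsto\; \frac{1}{\sqrt{n}} \sum_{i=1}^n
  \bigl( f(X_i)\, g(X_i) - \mathbb{E}[f(X) g(X)] \bigr).
```

Uniform bounds on these processes control, respectively, the cross term between noise and estimation error and the quadratic term of the excess risk in least-squares analyses.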