SIAM Journal on Mathematics of Data Science
Hybrid journal (may contain Open Access articles). ISSN (Online): 2577-0187. Published by the Society for Industrial and Applied Mathematics.
 $k$-Variance: A Clustered Notion of Variance

Authors: Justin Solomon, Kristjan Greenewald, Haikady Nagaraja
Pages: 957–978
Abstract: SIAM Journal on Mathematics of Data Science, Volume 4, Issue 3, Pages 957–978, September 2022.
We introduce $k$-variance, a generalization of variance built on the machinery of random bipartite matchings. $k$-variance measures the expected cost of matching two sets of $k$ samples from a distribution to each other, capturing local rather than global information about a measure as $k$ increases; it is easily approximated stochastically using sampling and linear programming. In addition to defining $k$-variance and proving its basic properties, we provide in-depth analysis of this quantity in several key cases, including one-dimensional measures, clustered measures, and measures concentrated on low-dimensional subsets of ${\mathbb R}^n$. We conclude with experiments and open problems motivated by this new way to summarize distributional shape.
Citation: SIAM Journal on Mathematics of Data Science
PubDate: 2022-07-07T07:00:00Z
DOI: 10.1137/20M1385895
Issue No: Vol. 4, No. 3 (2022)
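The matching quantity described in the abstract can be sketched in a few lines. This is a minimal Monte Carlo illustration, assuming a one-dimensional empirical distribution and brute-force matching over permutations in place of the linear-programming solver the abstract mentions; it ignores the paper's exact normalization, and all names are illustrative:

```python
import itertools
import random

def matching_cost(xs, ys):
    """Minimum average cost of a bipartite matching between two equal-size
    1-D sample sets, by brute force over permutations (fine for small k)."""
    k = len(xs)
    return min(
        sum(abs(x - ys[j]) for x, j in zip(xs, perm)) / k
        for perm in itertools.permutations(range(k))
    )

def k_variance(sample, k, trials=2000, seed=0):
    """Monte Carlo estimate of a (simplified, unnormalized) k-variance:
    the expected cost of optimally matching two independent sets of k
    draws from the empirical distribution `sample`."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        xs = [rng.choice(sample) for _ in range(k)]
        ys = [rng.choice(sample) for _ in range(k)]
        total += matching_cost(xs, ys)
    return total / trials
```

For a point mass the estimate is exactly zero; for the uniform distribution on {0, 1} with k = 1 it approximates E|X − Y| = 0.5.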

 Convergence of a Constrained Vector Extrapolation Scheme

Authors: Mathieu Barré, Adrien Taylor, Alexandre d'Aspremont
Pages: 979–1002
Abstract: SIAM Journal on Mathematics of Data Science, Volume 4, Issue 3, Pages 979–1002, September 2022.
Among extrapolation methods, Anderson acceleration (AA) is a popular technique for speeding up convergence of iterative processes toward their limit points. AA proceeds by extrapolating a better approximation of the limit using a weighted combination of previous iterates. Whereas AA was originally developed in the context of nonlinear integral equations, or to accelerate the convergence of iterative methods for solving linear systems, it is also used to extrapolate the solution of nonlinear systems. Simple additional stabilization strategies can be used in this context to control conditioning issues. In this work, we study a constrained vector extrapolation scheme based on an offline version of AA with fixed window size, for solving nonlinear systems arising in optimization problems, where the stabilization strategy consists in bounding the magnitude of the extrapolation weights. We provide explicit convergence bounds for this method and, as a byproduct, upper bounds on a constrained version of the Chebyshev problem on polynomials.
Citation: SIAM Journal on Mathematics of Data Science
PubDate: 2022-07-11T07:00:00Z
DOI: 10.1137/21M1428030
Issue No: Vol. 4, No. 3 (2022)
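One offline Anderson extrapolation step can be sketched as follows, assuming NumPy. The clip-and-renormalize step is a crude stand-in for the bounded-weight stabilization the paper analyzes, not the paper's exact constrained scheme; the function name is illustrative:

```python
import numpy as np

def anderson_extrapolate(xs, bound=5.0):
    """Offline Anderson acceleration with a fixed window.

    xs: iterates x_0..x_m of a fixed-point iteration x_{k+1} = g(x_k)
    (scalars or 1-D arrays). Computes affine weights alpha minimizing
    ||sum_i alpha_i (x_{i+1} - x_i)|| subject to sum(alpha) = 1, then clips
    and renormalizes them (a crude stand-in for the bounded-weight
    stabilization analyzed in the paper) and returns sum_i alpha_i x_{i+1}.
    """
    X = np.asarray(xs, dtype=float)
    if X.ndim == 1:
        X = X[:, None]                       # treat scalars as 1-D vectors
    R = X[1:] - X[:-1]                       # residuals r_i = x_{i+1} - x_i
    G = R @ R.T + 1e-10 * np.eye(len(R))     # regularized Gram matrix
    w = np.linalg.solve(G, np.ones(len(R)))
    alpha = w / w.sum()                      # affine (sum-to-one) weights
    alpha = np.clip(alpha, -bound, bound)
    alpha = alpha / alpha.sum()              # renormalize after clipping
    return (alpha @ X[1:]).squeeze()
```

On the linear fixed-point iteration x_{k+1} = 0.5 x_k + 1 (fixed point 2), extrapolating from the iterates 0, 1, 1.5, 1.75 recovers the limit up to numerical error.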

 Convergence of a Piggyback-Style Method for the Differentiation of
Solutions of Standard Saddle-Point Problems
Authors: Lea Bogensperger, Antonin Chambolle, Thomas Pock
Pages: 1003–1030
Abstract: SIAM Journal on Mathematics of Data Science, Volume 4, Issue 3, Pages 1003–1030, September 2022.
We analyze a “piggyback”-style method for computing the derivative of a loss which depends on the solution of a convex-concave saddle-point problem, with respect to the bilinear term. We attempt to derive guarantees for the algorithm under minimal regularity assumptions on the functions. Our final convergence results include possibly nonsmooth objectives. We illustrate the versatility of the proposed piggyback algorithm by learning optimized shearlet transforms, which are a class of popular sparsifying transforms in the field of imaging.
Citation: SIAM Journal on Mathematics of Data Science
PubDate: 2022-07-14T07:00:00Z
DOI: 10.1137/21M1455887
Issue No: Vol. 4, No. 3 (2022)

 DESTRESS: Computation-Optimal and Communication-Efficient Decentralized
Nonconvex Finite-Sum Optimization
Authors: Boyue Li, Zhize Li, Yuejie Chi
Pages: 1031–1051
Abstract: SIAM Journal on Mathematics of Data Science, Volume 4, Issue 3, Pages 1031–1051, September 2022.
Emerging applications in multi-agent environments, such as internet-of-things, networked sensing, autonomous systems, and federated learning, call for decentralized algorithms for finite-sum optimization that are resource-efficient in terms of both computation and communication. In this paper, we consider the prototypical setting where the agents work collaboratively to minimize the sum of local loss functions by only communicating with their neighbors over a predetermined network topology. We develop a new algorithm, called DEcentralized STochastic REcurSive gradient methodS (DESTRESS), for nonconvex finite-sum optimization, which matches the optimal incremental first-order oracle complexity of centralized algorithms for finding first-order stationary points, while maintaining communication efficiency. Detailed theoretical and numerical comparisons corroborate that the resource efficiencies of DESTRESS improve upon prior decentralized algorithms over a wide range of parameter regimes. DESTRESS leverages several key algorithm design ideas, including stochastic recursive gradient updates with mini-batches for local computation, gradient tracking with extra mixing (i.e., multiple gossiping rounds) for per-iteration communication, together with careful choices of hyperparameters and new analysis frameworks to provably achieve a desirable computation-communication trade-off.
Citation: SIAM Journal on Mathematics of Data Science
PubDate: 2022-08-04T07:00:00Z
DOI: 10.1137/21M1450677
Issue No: Vol. 4, No. 3 (2022)

 A Nonlinear Matrix Decomposition for Mining the Zeros of Sparse Data

Authors: Lawrence K. Saul
Pages: 431–463
Abstract: SIAM Journal on Mathematics of Data Science, Volume 4, Issue 2, Pages 431–463, June 2022.
We describe a simple iterative solution to a widely recurring problem in multivariate data analysis: given a sparse nonnegative matrix $\mathbf{X}$, how to estimate a low-rank matrix $\Theta$ such that $\mathbf{X} \approx f(\Theta)$, where $f$ is an elementwise nonlinearity? We develop a latent variable model for this problem and consider those sparsifying nonlinearities, popular in neural networks, that map all negative values to zero. The model seeks to explain the variability of sparse high-dimensional data in terms of a smaller number of degrees of freedom. We show that exact inference in this model is tractable and derive an expectation-maximization (EM) algorithm to estimate the low-rank matrix $\Theta$. Notably, we do not parameterize $\Theta$ as a product of smaller matrices to be alternately optimized; instead, we estimate $\Theta$ directly via the singular value decomposition of matrices that are repeatedly inferred (at each iteration of the EM algorithm) from the model's posterior distribution. We use the model to analyze large sparse matrices that arise from data sets of binary, grayscale, and color images. In all of these cases, we find that the model discovers much lower-rank decompositions than purely linear approaches.
Citation: SIAM Journal on Mathematics of Data Science
PubDate: 2022-04-07T07:00:00Z
DOI: 10.1137/21M1405769
Issue No: Vol. 4, No. 2 (2022)
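The "estimate $\Theta$ directly by repeated truncated SVDs" structure can be caricatured as follows, assuming NumPy. This is not the paper's EM algorithm (which fills in the zero entries with posterior expectations under the latent variable model); here the zeros are filled with min(Theta, 0) as a crude hard surrogate, purely to illustrate the iteration shape:

```python
import numpy as np

def relu_decompose(X, rank, iters=50):
    """Toy alternating scheme for X ~= ReLU(Theta) with low-rank Theta.
    Each iteration fills the observed positive entries of X back in, fills
    the zero entries with a nonpositive surrogate, and projects onto the
    set of rank-`rank` matrices via a truncated SVD."""
    mask = X > 0
    Theta = np.where(mask, X, -1.0)          # crude initial guess
    for _ in range(iters):
        M = np.where(mask, X, np.minimum(Theta, 0.0))
        U, s, Vt = np.linalg.svd(M, full_matrices=False)
        s[rank:] = 0.0                       # keep only the top `rank` modes
        Theta = (U * s) @ Vt
    return Theta
```

When X itself is a strictly positive rank-1 matrix, the scheme is a fixed point and returns Theta with ReLU(Theta) equal to X.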

 What Kinds of Functions Do Deep Neural Networks Learn? Insights from
Variational Spline Theory
Authors: Rahul Parhi, Robert D. Nowak
Pages: 464–489
Abstract: SIAM Journal on Mathematics of Data Science, Volume 4, Issue 2, Pages 464–489, June 2022.
We develop a variational framework to understand the properties of functions learned by fitting deep neural networks with rectified linear unit (ReLU) activations to data. We propose a new function space, which is related to classical bounded-variation-type spaces, that captures the compositional structure associated with deep neural networks. We derive a representer theorem showing that deep ReLU networks are solutions to regularized data-fitting problems over functions from this space. The function space consists of compositions of functions from the Banach space of second-order bounded variation in the Radon domain. This Banach space has a sparsity-promoting norm, giving insight into the role of sparsity in deep neural networks. The neural network solutions have skip connections and rank-bounded weight matrices, providing new theoretical support for these common architectural choices. The variational problem we study can be recast as a finite-dimensional neural network training problem with regularization schemes related to the notions of weight decay and path-norm regularization. Finally, our analysis builds on techniques from variational spline theory, providing new connections between deep neural networks and splines.
Citation: SIAM Journal on Mathematics of Data Science
PubDate: 2022-04-13T07:00:00Z
DOI: 10.1137/21M1418642
Issue No: Vol. 4, No. 2 (2022)

 Statistical Methods for Minimax Estimation in Linear Models with Unknown
Design Over Finite Alphabets
Authors: Merle Behr, Axel Munk
Pages: 490–513
Abstract: SIAM Journal on Mathematics of Data Science, Volume 4, Issue 2, Pages 490–513, June 2022.
We provide a minimax optimal estimation procedure for $F$ and $\Omega$ in matrix-valued linear models $Y = F \Omega + Z$, where the parameter matrix $\Omega$ and the design matrix $F$ are unknown but the latter takes values in a known finite set. The proposed finite alphabet linear model is justified in a variety of applications, ranging from signal processing to cancer genetics. We show that this allows one to separate $F$ and $\Omega$ uniquely under weak identifiability conditions, a task which is not possible in general. To this end, we quantify, in the noiseless case ($Z = 0$), the perturbation range of $Y$ that permits stable recovery of $F$ and $\Omega$. Based on this, we derive an iterative Lloyd's-type estimation procedure that attains minimax estimation rates for $\Omega$ and $F$ for a Gaussian error matrix $Z$. In contrast to the least squares solution, the estimation procedure can be computed efficiently and scales linearly with the total number of observations. We confirm our theoretical results in a simulation study and illustrate them with a genetic sequencing data example.
Citation: SIAM Journal on Mathematics of Data Science
PubDate: 2022-04-25T07:00:00Z
DOI: 10.1137/21M1398860
Issue No: Vol. 4, No. 2 (2022)

 Approximation Bounds for Sparse Programs

Authors: Armin Askari, Alexandre d'Aspremont, Laurent El Ghaoui
Pages: 514–530
Abstract: SIAM Journal on Mathematics of Data Science, Volume 4, Issue 2, Pages 514–530, June 2022.
We show that sparsity-constrained optimization problems over low-dimensional spaces tend to have a small duality gap. We use the Shapley–Folkman theorem to derive both data-driven bounds on the duality gap and an efficient primalization procedure to recover feasible points satisfying these bounds. These error bounds are proportional to the rate of growth of the objective with the target cardinality, which means in particular that the relaxation is nearly tight as soon as the target cardinality is large enough so that only uninformative features are added.
Citation: SIAM Journal on Mathematics of Data Science
PubDate: 2022-04-27T07:00:00Z
DOI: 10.1137/21M1398677
Issue No: Vol. 4, No. 2 (2022)

 Stochastic Geometry to Generalize the Mondrian Process

Authors: Eliza O'Reilly, Ngoc Mai Tran
Pages: 531–552
Abstract: SIAM Journal on Mathematics of Data Science, Volume 4, Issue 2, Pages 531–552, June 2022.
The stable under iteration (STIT) tessellation process is a stochastic process that produces a recursive partition of space with cut directions drawn independently from a distribution over the sphere. The case of random axis-aligned cuts is known as the Mondrian process. Random forests and Laplace kernel approximations built from the Mondrian process have led to efficient online learning methods and Bayesian optimization. In this work, we utilize tools from stochastic geometry to resolve some fundamental questions concerning STIT processes in machine learning. First, we show that STIT processes can be efficiently simulated by lifting to a higher-dimensional axis-aligned Mondrian process. Second, we characterize all possible kernels that STIT processes and their mixtures can approximate. We also give a uniform convergence rate for the approximation error of the STIT kernels to the targeted kernels, completely generalizing the work of Balog et al. [The Mondrian kernel, 2016] from the Mondrian case. Third, we obtain consistency results for STIT forests in density estimation and regression. Finally, we give a precise formula for the density estimator arising from a STIT forest. This allows for precise comparisons between the STIT forest, the STIT kernel, and the targeted kernel in density estimation. Our paper calls for further developments at the novel intersection of stochastic geometry and machine learning.
Citation: SIAM Journal on Mathematics of Data Science
PubDate: 2022-05-02T07:00:00Z
DOI: 10.1137/20M1354490
Issue No: Vol. 4, No. 2 (2022)
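The axis-aligned special case, the Mondrian process, can be simulated directly. A minimal sketch of the standard generative recursion (cut times exponential in the cell's linear dimension, cut axis chosen proportionally to side length, cut position uniform); the lifting of a general STIT process to such a Mondrian is the paper's first contribution and is not shown here:

```python
import random

def mondrian(cell, budget, rng=None):
    """Sample a Mondrian partition of an axis-aligned box.

    cell: list of (lo, hi) intervals; budget: remaining lifetime lambda.
    Returns the list of leaf cells. A cut arrives after an Exp(linear
    dimension) waiting time; if it exceeds the budget, the cell is a leaf."""
    rng = rng or random.Random(0)
    lengths = [hi - lo for lo, hi in cell]
    lin_dim = sum(lengths)
    cost = rng.expovariate(lin_dim) if lin_dim > 0 else float("inf")
    if cost > budget:
        return [cell]                        # no cut within the lifetime
    d = rng.choices(range(len(cell)), weights=lengths)[0]
    lo, hi = cell[d]
    pos = rng.uniform(lo, hi)                # uniform cut along axis d
    left = cell[:d] + [(lo, pos)] + cell[d + 1:]
    right = cell[:d] + [(pos, hi)] + cell[d + 1:]
    rest = budget - cost
    return mondrian(left, rest, rng) + mondrian(right, rest, rng)
```

The leaves always tile the original box, and a zero budget returns the box uncut.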

 Quantitative Approximation Results for Complex-Valued Neural Networks

Authors: Andrei Caragea, Dae Gwan Lee, Johannes Maly, Götz Pfander, Felix Voigtlaender
Pages: 553–580
Abstract: SIAM Journal on Mathematics of Data Science, Volume 4, Issue 2, Pages 553–580, June 2022.
Until recently, applications of neural networks in machine learning have almost exclusively relied on real-valued networks. It was recently observed, however, that complex-valued neural networks (CVNNs) exhibit superior performance in applications in which the input is naturally complex-valued, such as MRI fingerprinting. While the mathematical theory of real-valued networks has, by now, reached some level of maturity, this is far from true for complex-valued networks. In this paper, we analyze the expressivity of complex-valued networks by providing explicit quantitative error bounds for approximating $C^n$ functions on compact subsets of $\mathbb{C}^d$ by CVNNs that employ the modReLU activation function, given by $\sigma(z) = \mathrm{ReLU}(|z| - 1) \cdot \mathrm{sgn}(z)$, which is one of the most popular complex activation functions used in practice. We show that the derived approximation rates are optimal (up to log factors) in the class of modReLU networks with weights of moderate growth.
Citation: SIAM Journal on Mathematics of Data Science
PubDate: 2022-05-02T07:00:00Z
DOI: 10.1137/21M1429540
Issue No: Vol. 4, No. 2 (2022)
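A direct transcription of the modReLU activation, with the bias fixed at -1 as in the abstract's formula (the general definition allows an arbitrary bias b):

```python
def modrelu(z, b=-1.0):
    """modReLU: sigma(z) = ReLU(|z| + b) * z/|z|, with sigma(0) = 0.
    With b = -1 it zeroes out inputs of modulus at most 1 and shrinks the
    modulus of the rest, while preserving the phase sgn(z) = z/|z|."""
    r = abs(z)
    if r == 0 or r + b <= 0:
        return 0j
    return (r + b) * (z / r)
```

For example, inputs inside the unit disk map to zero, while 3j maps to 2j: the modulus shrinks by 1 and the phase is untouched.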

 Wasserstein-Based Projections with Applications to Inverse Problems

Authors: Howard Heaton, Samy Wu Fung, Alex Tong Lin, Stanley Osher, Wotao Yin
Pages: 581–603
Abstract: SIAM Journal on Mathematics of Data Science, Volume 4, Issue 2, Pages 581–603, June 2022.
Inverse problems consist of recovering a signal from a collection of noisy measurements. These are typically cast as optimization problems, with classic approaches using a data fidelity term and an analytic regularizer that stabilizes recovery. Recent plug-and-play (PnP) works propose replacing the operator for analytic regularization in optimization methods by a data-driven denoiser. These schemes obtain state-of-the-art results, but at the cost of limited theoretical guarantees. To bridge this gap, we present a new algorithm that takes samples from the manifold of true data as input and outputs an approximation of the projection operator onto this manifold. Under standard assumptions, we prove this algorithm generates a learned operator, called Wasserstein-based projection (WP), that approximates the true projection with high probability. Thus, WPs can be inserted into optimization methods in the same manner as PnP, but now with theoretical guarantees. Numerical examples show that WPs obtain state-of-the-art results for unsupervised PnP signal recovery. All codes for this work can be found at https://github.com/swufung/WassersteinBasedProjections.
Citation: SIAM Journal on Mathematics of Data Science
PubDate: 2022-05-05T07:00:00Z
DOI: 10.1137/20M1376790
Issue No: Vol. 4, No. 2 (2022)

 Tukey Depths and Hamilton–Jacobi Differential Equations

Authors: Martin MolinaFructuoso, Ryan Murray
Pages: 604–633
Abstract: SIAM Journal on Mathematics of Data Science, Volume 4, Issue 2, Pages 604–633, June 2022.
Widespread application of modern machine learning has increased the need for robust statistical algorithms. This work studies one such fundamental statistical concept known as the Tukey depth. We study the problem in the continuum (population) limit. In particular, we formally derive the associated necessary conditions, which take the form of a first-order partial differential equation that is satisfied at points where the Tukey depth is smooth. We discuss the interpretation of this formal necessary condition in terms of the viscosity solution of a Hamilton–Jacobi equation, but with a nonclassical Hamiltonian with discontinuous dependence on the gradient at zero. We prove that this equation possesses a unique viscosity solution and that this solution always bounds the Tukey depth from below. In certain cases we prove that the Tukey depth is equal to the viscosity solution, and we give some illustrations of standard numerical methods from the optimal control community which deal directly with the partial differential equation. We conclude by outlining several promising research directions both in terms of new numerical algorithms and theoretical challenges.
Citation: SIAM Journal on Mathematics of Data Science
PubDate: 2022-05-10T07:00:00Z
DOI: 10.1137/21M1411998
Issue No: Vol. 4, No. 2 (2022)
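The object being studied is the halfspace (Tukey) depth: the minimum, over directions, of the probability mass in the closed halfspace through the point. A Monte Carlo sketch of the empirical version in 2-D, for intuition only; the paper works with the continuum PDE formulation rather than this sampled estimator:

```python
import math
import random

def tukey_depth(point, data, directions=500, rng=None):
    """Monte Carlo estimate of the Tukey depth of `point` in 2-D:
    the minimum, over sampled directions v, of the fraction of `data`
    lying in the closed halfspace {x : <v, x - point> >= 0}."""
    rng = rng or random.Random(0)
    px, py = point
    depth = 1.0
    for _ in range(directions):
        t = rng.uniform(0.0, 2.0 * math.pi)
        vx, vy = math.cos(t), math.sin(t)
        frac = sum(1 for (x, y) in data
                   if vx * (x - px) + vy * (y - py) >= 0) / len(data)
        depth = min(depth, frac)
    return depth
```

A point far outside the convex hull of the data has depth 0, while the center of a symmetric cloud has depth close to 1/2.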

 Adaptivity of Stochastic Gradient Methods for Nonconvex Optimization

Authors: Samuel Horváth, Lihua Lei, Peter Richtárik, Michael I. Jordan
Pages: 634–648
Abstract: SIAM Journal on Mathematics of Data Science, Volume 4, Issue 2, Pages 634–648, June 2022.
Adaptivity is an important yet understudied property in modern optimization theory. The gap between the state-of-the-art theory and the current practice is striking in that algorithms with desirable theoretical guarantees typically involve drastically different settings of hyperparameters, such as step size schemes and batch sizes, in different regimes. Despite the appealing theoretical results, such divisive strategies provide little, if any, insight to practitioners to select algorithms that work broadly without tweaking the hyperparameters. In this work, blending the “geometrization” technique introduced by [L. Lei and M. I. Jordan, Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, 2017, pp. 148–156] and the SARAH algorithm of [L. M. Nguyen, J. Liu, K. Scheinberg, and M. Takáč, Proceedings of the 34th International Conference on Machine Learning, 2017, pp. 2613–2621], we propose the geometrized SARAH algorithm for nonconvex finite-sum and stochastic optimization. Our algorithm is proved to achieve adaptivity to both the magnitude of the target accuracy and the Polyak–Łojasiewicz (PL) constant, if present. In addition, it achieves the best-available convergence rate for non-PL objectives simultaneously while outperforming existing algorithms for PL objectives.
Citation: SIAM Journal on Mathematics of Data Science
PubDate: 2022-05-12T07:00:00Z
DOI: 10.1137/21M1394308
Issue No: Vol. 4, No. 2 (2022)

 A Variational Formulation of Accelerated Optimization on Riemannian
Manifolds
Authors: Valentin Duruisseaux, Melvin Leok
Pages: 649–674
Abstract: SIAM Journal on Mathematics of Data Science, Volume 4, Issue 2, Pages 649–674, June 2022.
It was shown recently by [W. Su, S. Boyd, and E. Candes, J. Mach. Learn. Res., 17 (2016), pp. 1–43] that Nesterov's accelerated gradient method for minimizing a smooth convex function $f$ can be thought of as the time discretization of a second-order ODE and that $f(x(t))$ converges to its optimal value at a rate of $\mathcal{O}(1/t^2)$ along any trajectory $x(t)$ of this ODE. A variational formulation was introduced in [A. Wibisono, A. Wilson, and M. Jordan, Proc. Natl. Acad. Sci. USA, 113 (2016), pp. E7351–E7358] which allowed for accelerated convergence at a rate of $\mathcal{O}(1/t^p)$, for arbitrary $p>0$, in normed vector spaces. This framework was exploited in [V. Duruisseaux, J. Schmitt, and M. Leok, SIAM J. Sci. Comput., 43 (2021), pp. A2949–A2980] using time-adaptive geometric integrators to design efficient explicit algorithms for symplectic accelerated optimization. In [F. Alimisis, A. Orvieto, G. Bécigneul, and A. Lucchi, Proceedings of the 23rd International AISTATS Conference, 2020, pp. 1297–1307], a second-order ODE was proposed as the continuous-time limit of a Riemannian accelerated algorithm, and it was shown that the objective function $f(x(t))$ converges to its optimal value at a rate of $\mathcal{O}(1/t^2)$ along solutions of this ODE, thereby generalizing the earlier Euclidean result to the Riemannian manifold setting. In this paper, we show that on Riemannian manifolds, the convergence rate of $f(x(t))$ to its optimal value can also be accelerated to an arbitrary convergence rate $\mathcal{O}(1/t^p)$, by considering a family of time-dependent Bregman Lagrangian and Hamiltonian systems on Riemannian manifolds. This generalizes the results of Wibisono, Wilson, and Jordan to Riemannian manifolds and also provides a variational framework for accelerated optimization on Riemannian manifolds. In particular, we establish results for objective functions on Riemannian manifolds that are geodesically convex, weakly quasiconvex, and strongly convex.
An approach based on the time-invariance property of the family of Bregman Lagrangians and Hamiltonians was used to construct very efficient optimization algorithms by Duruisseaux, Schmitt, and Leok, and we establish a similar time-invariance property in the Riemannian setting. This lays the foundation for constructing similarly efficient optimization algorithms on Riemannian manifolds, once the Riemannian analogues of time-adaptive Hamiltonian variational integrators have been developed. Experience with the numerical discretization of variational accelerated optimization flows on vector spaces suggests that the combination of time-adaptivity and symplecticity is important for the efficient, robust, and stable discretization of these variational flows describing accelerated optimization. One expects that a geometric numerical integrator that is time-adaptive, symplectic, and Riemannian-manifold-preserving will yield a similarly promising class of optimization algorithms on manifolds.
Citation: SIAM Journal on Mathematics of Data Science
PubDate: 2022-06-02T07:00:00Z
DOI: 10.1137/21M1395648
Issue No: Vol. 4, No. 2 (2022)
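For context, the time-dependent Bregman Lagrangian family referenced above, in the normed-vector-space form introduced by Wibisono, Wilson, and Jordan (the paper generalizes it to Riemannian manifolds), can be written as:

```latex
% Bregman Lagrangian (normed-space setting); D_h is the Bregman divergence
% of a convex distance-generating function h, and alpha_t, beta_t, gamma_t
% are time-dependent scaling functions.
\mathcal{L}(x, \dot{x}, t)
  = e^{\alpha_t + \gamma_t}\left(
      D_h\!\left(x + e^{-\alpha_t}\dot{x},\, x\right) - e^{\beta_t} f(x)
    \right),
\qquad
D_h(y, x) = h(y) - h(x) - \langle \nabla h(x),\, y - x \rangle .
```

Under the ideal-scaling conditions $\dot{\beta}_t \le e^{\alpha_t}$ and $\dot{\gamma}_t = e^{\alpha_t}$, solutions of the associated Euler-Lagrange equations satisfy $f(x(t)) - f(x^\star) = \mathcal{O}(e^{-\beta_t})$, and the polynomial choice $e^{\beta_t} = C t^p$ yields the $\mathcal{O}(1/t^p)$ rates discussed in the abstract.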

 Speedy Categorical Distributional Reinforcement Learning and Complexity
Analysis
Authors: Markus Böck, Clemens Heitzinger
Pages: 675–693
Abstract: SIAM Journal on Mathematics of Data Science, Volume 4, Issue 2, Pages 675–693, June 2022.
In distributional reinforcement learning, the entire distribution of the return, instead of just the expected return, is modeled. The approach with categorical distributions as the approximation method is well-known in Q-learning, and convergence results have been established in the tabular case. In this work, speedy Q-learning is extended to categorical distributions, a finite-time analysis is performed, and probably approximately correct bounds in terms of the Cramér distance are established. It is shown that, in the distributional case as well, the new update rule yields faster policy evaluation than the standard Q-learning one, and that the sample complexity is essentially the same as that of the value-based algorithmic counterpart. Without the need for more state-action-reward samples, one gains significantly more information about the return with categorical distributions. Even though the results do not easily extend to the case of policy control, a slight modification to the update rule yields promising numerical results.
Citation: SIAM Journal on Mathematics of Data Science
PubDate: 2022-06-06T07:00:00Z
DOI: 10.1137/20M1364436
Issue No: Vol. 4, No. 2 (2022)

 Nonlinear Weighted Directed Acyclic Graph and A Priori Estimates for
Neural Networks
Authors: Yuqing Li, Tao Luo, Chao Ma
Pages: 694–720
Abstract: SIAM Journal on Mathematics of Data Science, Volume 4, Issue 2, Pages 694–720, June 2022.
In an attempt to better understand structural benefits and generalization power of deep neural networks, we first present a novel graph-theoretical formulation of neural network models, including fully connected networks, residual networks (ResNet), and densely connected networks (DenseNet). Second, we extend the error analysis of the population risk for a two-layer network [W. E., C. Ma, and L. Wu, Commun. Math. Sci., 17 (2019), pp. 1407–1425] and ResNet [W. E., C. Ma, and Q. Wang, Commun. Math. Sci., 18 (2020), pp. 1755–1774] to DenseNet, and show further that for neural networks satisfying certain mild conditions, similar estimates can be obtained. These estimates are a priori in nature since they depend solely on the information prior to the training process; in particular, the bounds for the estimation errors do not suffer from the curse of dimensionality.
Citation: SIAM Journal on Mathematics of Data Science
PubDate: 2022-06-13T07:00:00Z
DOI: 10.1137/21M140955X
Issue No: Vol. 4, No. 2 (2022)

 Intrinsic Dimension Adaptive Partitioning for Kernel Methods

Authors: Thomas Hamm, Ingo Steinwart
Pages: 721–749
Abstract: SIAM Journal on Mathematics of Data Science, Volume 4, Issue 2, Pages 721–749, June 2022.
We prove minimax optimal learning rates for kernel ridge regression and support vector machines based on a data-dependent partition of the input space, where the dependence on the dimension of the input space is replaced by the fractal dimension of the support of the data-generating distribution. We further show that these optimal rates can be achieved by a training-validation procedure without any prior knowledge of this intrinsic dimension of the data. Finally, we conduct extensive experiments which demonstrate that the learning methods considered here are actually able to generalize from a dataset that is nontrivially embedded in a much higher-dimensional space just as well as from the original dataset.
Citation: SIAM Journal on Mathematics of Data Science
PubDate: 2022-06-23T07:00:00Z
DOI: 10.1137/21M1435690
Issue No: Vol. 4, No. 2 (2022)

 Two Steps at a Time – Taking GAN Training in Stride with Tseng's Method

Authors: Axel Böhm, Michael Sedlmayer, Ernö Robert Csetnek, Radu Ioan Boţ
Pages: 750  771
Abstract: SIAM Journal on Mathematics of Data Science, Volume 4, Issue 2, Page 750771, June 2022.
Motivated by the training of generative adversarial networks (GANs), we study methods for solving minimax problems with additional nonsmooth regularizers. We do so by employing monotone operator theory, in particular the forwardbackwardforward method, which avoids the known issue of limit cycling by correcting each update by a second gradient evaluation and does so requiring fewer projection steps compared to the extragradient method in the presence of constraints. Furthermore, we propose a seemingly new scheme which recycles old gradients to mitigate the additional computational cost. In doing so we rediscover a known method, related to optimistic gradient descent ascent. For both schemes we prove novel convergence rates for convexconcave minimax problems via a unifying approach. The derived error bounds are in terms of the gap function for the ergodic iterates. For the deterministic and the stochastic problem we show a convergence rate of $\mathcal{O}({1}/{k})$ and $\mathcal{O}({1}/{\sqrt{k}})$, respectively. We complement our theoretical results with empirical improvements in the training of Wasserstein GANs on the CIFAR10 dataset.
Citation: SIAM Journal on Mathematics of Data Science
PubDate: 2022-06-23T07:00:00Z
DOI: 10.1137/21M1420939
Issue No: Vol. 4, No. 2 (2022)
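The forward-backward-forward correction the abstract refers to can be sketched on a toy bilinear saddle-point problem (min over $x$, max over $y$ of $xy$; with no regularizer the backward/prox step is the identity, and the step size below is illustrative):

```python
import numpy as np

# Tseng's forward-backward-forward (FBF) method on min_x max_y x*y,
# whose associated monotone operator is F(x, y) = (y, -x).  Each
# iteration costs two evaluations of F and no extra projection.

def F(z):
    x, y = z
    return np.array([y, -x])

def fbf_step(z, gamma=0.5):
    z_half = z - gamma * F(z)                    # forward (+ trivial backward) step
    return z_half - gamma * (F(z_half) - F(z))   # correcting forward step

z = np.array([1.0, 1.0])
for _ in range(50):
    z = fbf_step(z)
# z is now close to the saddle point (0, 0); plain gradient descent-ascent
# with a fixed step spirals away on this problem (the "limit cycling" issue).
```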

 Sequential Construction and Dimension Reduction of Gaussian Processes
Under Inequality Constraints
Authors: François Bachoc, Andrés F. López-Lopera, Olivier Roustant
Pages: 772–800
Abstract: SIAM Journal on Mathematics of Data Science, Volume 4, Issue 2, Page 772–800, June 2022.
Accounting for inequality constraints, such as boundedness, monotonicity, or convexity, is challenging when modeling costly-to-evaluate black box functions. In this regard, finite-dimensional Gaussian process (GP) regression models bring a valuable solution, as they guarantee that the inequality constraints are satisfied everywhere. Nevertheless, these models are currently restricted to low-dimensional situations (up to dimension 5). Addressing this issue, we introduce the MaxMod algorithm, which sequentially inserts one-dimensional knots or adds active variables, thereby performing dimension reduction and efficient knot allocation at the same time. We prove the convergence of this algorithm. In intermediary steps of the proof, we propose the notion of multiaffine extension and study its properties. We also prove the convergence of finite-dimensional GPs when the knots are not dense in the input space, extending the recent literature. With simulated and real data, we demonstrate that the MaxMod algorithm remains efficient in higher dimensions (at least in dimension 20) and needs fewer knots than other state-of-the-art constrained GP models to reach a given approximation error.
Citation: SIAM Journal on Mathematics of Data Science
PubDate: 2022-06-23T07:00:00Z
DOI: 10.1137/21M1407513
Issue No: Vol. 4, No. 2 (2022)
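A minimal sketch of the finite-dimensional representation behind such constrained GP models, assuming piecewise-linear hat functions at one-dimensional knots (knot locations and values below are illustrative, and MaxMod's knot-insertion logic is not shown): the function is determined by its knot values, so a constraint such as monotonicity everywhere reduces to finitely many linear inequalities on those values.

```python
import numpy as np

knots = np.array([0.0, 0.3, 0.6, 1.0])   # one-dimensional knot locations
xi = np.array([0.0, 0.2, 0.5, 1.1])      # function values at the knots

def hat_interp(x):
    """Piecewise-linear (hat-basis) interpolant of the knot values:
    f(x) = sum_j xi_j * phi_j(x) with phi_j the hat function at knot j."""
    return np.interp(x, knots, xi)

# Because xi is nondecreasing, the interpolant is nondecreasing on the
# whole interval -- the "constraints satisfied everywhere" guarantee.
monotone = bool(np.all(np.diff(xi) >= 0))
```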

 Auto-differentiable Ensemble Kalman Filters

Authors: Yuming Chen, Daniel Sanz-Alonso, Rebecca Willett
Pages: 801–833
Abstract: SIAM Journal on Mathematics of Data Science, Volume 4, Issue 2, Page 801–833, June 2022.
Data assimilation is concerned with sequentially estimating a temporally evolving state. This task, which arises in a wide range of scientific and engineering applications, is particularly challenging when the state is high-dimensional and the state-space dynamics are unknown. This paper introduces a machine learning framework for learning dynamical systems in data assimilation. Our auto-differentiable ensemble Kalman filters (AD-EnKFs) blend ensemble Kalman filters for state recovery with machine learning tools for learning the dynamics. In doing so, AD-EnKFs leverage the ability of ensemble Kalman filters to scale to high-dimensional states and the power of automatic differentiation to train high-dimensional surrogate models for the dynamics. Numerical results using the Lorenz-96 model show that AD-EnKFs outperform existing methods that use expectation-maximization or particle filters to merge data assimilation and machine learning. In addition, AD-EnKFs are easy to implement and require minimal tuning.
Citation: SIAM Journal on Mathematics of Data Science
PubDate: 2022-06-23T07:00:00Z
DOI: 10.1137/21M1434477
Issue No: Vol. 4, No. 2 (2022)
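For reference, the analysis step of a plain stochastic ensemble Kalman filter, the building block that AD-EnKFs differentiate through, can be sketched as follows (a textbook EnKF update, not the paper's training loop; the linear observation operator and noise levels are illustrative):

```python
import numpy as np

def enkf_analysis(ensemble, y, H, R, rng):
    """Stochastic EnKF analysis: update an (N, d) forecast ensemble with
    observation y, linear observation operator H, and noise covariance R."""
    N, d = ensemble.shape
    X = ensemble - ensemble.mean(0)          # state anomalies
    HX = ensemble @ H.T                      # ensemble in observation space
    HA = HX - HX.mean(0)                     # observation-space anomalies
    P_xy = X.T @ HA / (N - 1)                # cross covariance
    P_yy = HA.T @ HA / (N - 1) + R           # innovation covariance
    K = P_xy @ np.linalg.inv(P_yy)           # Kalman gain
    # perturbed observations, one draw per ensemble member
    Y = y + rng.multivariate_normal(np.zeros(len(y)), R, size=N)
    return ensemble + (Y - HX) @ K.T

rng = np.random.default_rng(0)
d, N = 3, 200
truth = np.array([1.0, -2.0, 0.5])
H = np.eye(d)                                # observe the full state
R = 0.01 * np.eye(d)
prior = truth + rng.normal(0, 1.0, size=(N, d))   # spread-out forecast
post = enkf_analysis(prior, truth + rng.normal(0, 0.1, d), H, R, rng)
```

The update contracts the ensemble spread toward the (accurate) observation; in the AD-EnKF setting, gradients of a likelihood flow back through exactly this kind of computation into a surrogate dynamics model.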

 Feel-Good Thompson Sampling for Contextual Bandits and Reinforcement
Learning
Authors: Tong Zhang
Pages: 834–857
Abstract: SIAM Journal on Mathematics of Data Science, Volume 4, Issue 2, Page 834–857, June 2022.
Thompson sampling has been widely used for contextual bandit problems due to the flexibility of its modeling power. However, a general theory for this class of methods in the frequentist setting is still lacking. In this paper, we present a theoretical analysis of Thompson sampling, with a focus on frequentist regret bounds. In this setting, we show that standard Thompson sampling is not aggressive enough in exploring new actions, leading to suboptimality in some pessimistic situations. A simple modification called Feel-Good Thompson sampling, which favors high-reward models more aggressively than standard Thompson sampling, is proposed to remedy this problem. We show that the theoretical framework can be used to derive Bayesian regret bounds for standard Thompson sampling and frequentist regret bounds for Feel-Good Thompson sampling. It is shown that in both cases, we can reduce the bandit regret problem to online least squares regression estimation. For the frequentist analysis, the online least squares regression bound can be directly obtained using online aggregation techniques, which have been well studied. The resulting bandit regret bound matches the minimax lower bound in the finite action case. Moreover, the analysis can be generalized to handle a class of linearly embeddable contextual bandit problems (which generalizes the popular linear contextual bandit model). The obtained result again matches the minimax lower bound. Finally, we illustrate that the analysis can be extended to handle some MDP problems.
Citation: SIAM Journal on Mathematics of Data Science
PubDate: 2022-06-23T07:00:00Z
DOI: 10.1137/21M140924X
Issue No: Vol. 4, No. 2 (2022)
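The baseline that Feel-Good Thompson sampling modifies can be sketched for a Beta-Bernoulli bandit (standard Thompson sampling only; the Feel-Good variant additionally tilts the posterior toward high-reward models, which is not shown, and the arm means below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
true_means = np.array([0.3, 0.5, 0.7])   # unknown Bernoulli arm means
K, T = len(true_means), 2000
wins = np.ones(K)                        # Beta(1, 1) prior per arm
losses = np.ones(K)
pulls = np.zeros(K, dtype=int)

for _ in range(T):
    theta = rng.beta(wins, losses)       # one posterior sample per arm
    a = int(np.argmax(theta))            # play the sampled-best arm
    r = rng.random() < true_means[a]     # Bernoulli reward
    wins[a] += r
    losses[a] += 1 - r
    pulls[a] += 1
```

After enough rounds the sampler concentrates its pulls on the best arm; the abstract's point is that in the frequentist setting this vanilla rule can still explore too timidly, which the Feel-Good modification repairs.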

 Persistent Laplacians: Properties, Algorithms and Implications

Authors: Facundo Mémoli, Zhengchao Wan, Yusu Wang
Pages: 858–884
Abstract: SIAM Journal on Mathematics of Data Science, Volume 4, Issue 2, Page 858–884, June 2022.
We present a thorough study of the theoretical properties of, and devise efficient algorithms for, the persistent Laplacian, an extension of the standard combinatorial Laplacian to the setting of pairs (or, in more generality, sequences) of simplicial complexes $K \hookrightarrow L$, which was recently introduced by Wang, Nguyen, and Wei. In particular, in analogy with the nonpersistent case, we first prove that the nullity of the $q$th persistent Laplacian $\Delta_q^{K,L}$ equals the $q$th persistent Betti number of the inclusion $K \hookrightarrow L$. We then present an initial algorithm for finding a matrix representation of $\Delta_q^{K,L}$, which itself helps interpret the persistent Laplacian. We exhibit a novel relationship between the persistent Laplacian and the notion of the Schur complement of a matrix, which has several important implications. In the graph case, it both uncovers a link with the notion of effective resistance and leads to a persistent version of the Cheeger inequality. This relationship also yields an additional, very simple algorithm for finding (a matrix representation of) the $q$th persistent Laplacian, which in turn leads to a novel and fundamentally different algorithm for computing the $q$th persistent Betti number of a pair $K\hookrightarrow L$; this algorithm can be significantly more efficient than standard ones. Finally, we study persistent Laplacians for simplicial filtrations and establish novel functoriality properties and stability results for their eigenvalues. Our work brings methods from spectral graph theory, circuit theory, and persistent homology together with a topological view of the combinatorial Laplacian on simplicial complexes.
Citation: SIAM Journal on Mathematics of Data Science
PubDate: 2022-06-24T07:00:00Z
DOI: 10.1137/21M1435471
Issue No: Vol. 4, No. 2 (2022)
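In the graph case, the Schur-complement relationship mentioned in the abstract is easy to make concrete: the $0$th up persistent Laplacian of a pair of graphs is the Schur complement of the larger graph's Laplacian onto the retained vertices (the path graph and vertex pair below are an illustrative example):

```python
import numpy as np

def graph_laplacian(n, edges):
    """Combinatorial Laplacian of an unweighted graph on n vertices."""
    L = np.zeros((n, n))
    for i, j in edges:
        L[i, i] += 1; L[j, j] += 1
        L[i, j] -= 1; L[j, i] -= 1
    return L

def persistent_laplacian_0(L_full, keep):
    """Schur complement of the Laplacian onto the vertex set `keep`,
    eliminating the remaining vertices."""
    n = L_full.shape[0]
    out = [v for v in range(n) if v not in keep]
    A = L_full[np.ix_(keep, keep)]
    B = L_full[np.ix_(keep, out)]
    D = L_full[np.ix_(out, out)]
    return A - B @ np.linalg.inv(D) @ B.T

# path 0 - 1 - 2 - 3, with K the vertex pair {0, 3}
L_path = graph_laplacian(4, [(0, 1), (1, 2), (2, 3)])
Delta = persistent_laplacian_0(L_path, keep=[0, 3])
# For three unit-resistance edges in series, the result is the Laplacian of a
# single edge of weight 1/3: the effective conductance between vertices 0 and
# 3, illustrating the link with effective resistance.
```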

 Overparameterization and Generalization Error: Weighted Trigonometric
Interpolation
Authors: Yuege Xie, HungHsu Chou, Holger Rauhut, Rachel Ward
Pages: 885–908
Abstract: SIAM Journal on Mathematics of Data Science, Volume 4, Issue 2, Page 885–908, June 2022.
Motivated by surprisingly good generalization properties of learned deep neural networks in overparameterized scenarios and by the related double descent phenomenon, this paper analyzes the relation between smoothness and low generalization error in an overparameterized linear learning problem. We study a random Fourier series model, where the task is to estimate the unknown Fourier coefficients from equidistant samples. We derive exact expressions for the generalization error of both plain and weighted least squares estimators. We show precisely how a bias toward smooth interpolants, in the form of weighted trigonometric interpolation, can lead to smaller generalization error in the overparameterized regime compared to the underparameterized regime. This provides insight into the power of overparameterization, which is common in modern machine learning.
Citation: SIAM Journal on Mathematics of Data Science
PubDate: 2022-06-27T07:00:00Z
DOI: 10.1137/21M1390955
Issue No: Vol. 4, No. 2 (2022)
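A small numerical sketch of the phenomenon (sizes, target function, and the $1/(1+|k|)^2$ weights below are illustrative): with more Fourier modes than samples, the minimum weighted-norm interpolant that downweights high frequencies generalizes off the sample grid far better than the unweighted minimum-norm interpolant.

```python
import numpy as np

n, p = 8, 32                                    # overparameterized: p > n
x = np.arange(n) / n                            # equidistant samples on [0, 1)
freqs = np.arange(-p // 2, p // 2)
Phi = np.exp(2j * np.pi * np.outer(x, freqs))   # n-by-p Fourier feature matrix
y = np.cos(2 * np.pi * x)                       # smooth target, sampled

def min_norm_interpolant(w):
    # minimize sum_k |c_k|^2 / w_k subject to Phi c = y
    # (closed form: c = W Phi^H (Phi W Phi^H)^{-1} y with W = diag(w))
    B = Phi * w
    return w * (Phi.conj().T @ np.linalg.solve(B @ Phi.conj().T, y))

c_plain = min_norm_interpolant(np.ones(p))                  # unweighted
c_smooth = min_norm_interpolant(1.0 / (1.0 + np.abs(freqs)) ** 2)

# L2 generalization error on a fine grid (exact here by Nyquist)
t = np.linspace(0, 1, 400, endpoint=False)
Phi_t = np.exp(2j * np.pi * np.outer(t, freqs))
err = lambda c: np.sqrt(np.mean(np.abs(Phi_t @ c - np.cos(2 * np.pi * t)) ** 2))
```

Both coefficient vectors interpolate the samples exactly; the weighted one concentrates mass on the low-frequency aliases and so tracks the target between samples, which is the smoothness bias the abstract analyzes.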

 GNMR: A Provable One-Line Algorithm for Low Rank Matrix Recovery

Authors: Pini Zilber, Boaz Nadler
Pages: 909–934
Abstract: SIAM Journal on Mathematics of Data Science, Volume 4, Issue 2, Page 909–934, June 2022.
Low rank matrix recovery problems appear in a broad range of applications. In this work we present GNMR, an extremely simple iterative algorithm for low rank matrix recovery based on a Gauss-Newton linearization. On the theoretical front, we derive recovery guarantees for GNMR in both the matrix sensing and the matrix completion settings. Some of these results improve upon the best currently known for other methods. A key property of GNMR is that it implicitly keeps the factor matrices approximately balanced throughout its iterations. On the empirical front, we show that for matrix completion with uniform sampling, GNMR performs better than several popular methods, especially when given very few observations close to the information limit.
Citation: SIAM Journal on Mathematics of Data Science
PubDate: 2022-06-28T07:00:00Z
DOI: 10.1137/21M1433812
Issue No: Vol. 4, No. 2 (2022)
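The Gauss-Newton step underlying GNMR can be sketched as a single least-squares solve per iteration (our reading of the linearization; the sizes, observation rate, spectral initialization, and iteration count below are illustrative, and the paper's variants differ in details):

```python
import numpy as np

rng = np.random.default_rng(1)
m, n, r = 20, 20, 2
X = rng.normal(size=(m, r)) @ rng.normal(size=(r, n))   # rank-r ground truth
mask = rng.random((m, n)) < 0.7                          # observed entries
rows, cols = np.nonzero(mask)

def gnmr_step(U, V):
    """Linearize X ~ U_new V^T + U V_new^T - U V^T on the observed entries
    and solve for the new factors by (minimum-norm) least squares."""
    A = np.zeros((len(rows), (m + n) * r))
    b = np.empty(len(rows))
    for k, (i, j) in enumerate(zip(rows, cols)):
        A[k, i * r:(i + 1) * r] = V[j]                   # U_new[i] . V[j]
        A[k, (m + j) * r:(m + j + 1) * r] = U[i]         # U[i] . V_new[j]
        b[k] = X[i, j] + U[i] @ V[j]
    z = np.linalg.lstsq(A, b, rcond=None)[0]             # min-norm solution
    return z[:m * r].reshape(m, r), z[m * r:].reshape(n, r)

# spectral initialization: top-r SVD of the zero-filled, rescaled observations
U0, s, V0t = np.linalg.svd(X * mask / mask.mean())
U = U0[:, :r] * np.sqrt(s[:r])
V = V0t[:r].T * np.sqrt(s[:r])
for _ in range(15):
    U, V = gnmr_step(U, V)
rel_err = np.linalg.norm(U @ V.T - X) / np.linalg.norm(X)
```

The minimum-norm choice in the least-squares solve is what keeps the two factors approximately balanced across iterations.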

 Benefit of Interpolation in Nearest Neighbor Algorithms

Authors: Yue Xing, Qifan Song, Guang Cheng
Pages: 935–956
Abstract: SIAM Journal on Mathematics of Data Science, Volume 4, Issue 2, Page 935–956, June 2022.
In some studies of deep learning (e.g., [C. Zhang et al., in Proceedings of the 5th International Conference on Learning Representations, OpenReview.net, 2017]), it is observed that overparametrized deep neural networks achieve a small testing error even when the training error is almost zero. Despite numerous works toward understanding this so-called double-descent phenomenon (e.g., [M. Belkin et al., Proc. Natl. Acad. Sci. USA, 116 (2019), pp. 15849–15854; M. Belkin, D. Hsu, and J. Xu, SIAM J. Math. Data Sci., 2 (2020), pp. 1167–1180]), in this paper we turn to another way to enforce zero training error (without overparametrization) through a data interpolation mechanism. Specifically, we consider a class of interpolated weighting schemes in the nearest neighbors (NN) algorithms. By carefully characterizing the multiplicative constant in the statistical risk, we reveal a U-shaped performance curve for the level of data interpolation in both classification and regression setups. This sharpens the existing result [M. Belkin, A. Rakhlin, and A. B. Tsybakov, in Proceedings of Machine Learning Research 89, PMLR, 2019, pp. 1611–1619] that zero training error does not necessarily jeopardize predictive performance, and establishes the counterintuitive result that a mild degree of data interpolation strictly improves the prediction performance and statistical stability over those of the (uninterpolated) $k$-NN algorithm. In the end, we also discuss the universality of our results, e.g., under a change of distance measure or corrupted testing data.
Citation: SIAM Journal on Mathematics of Data Science
PubDate: 2022-06-30T07:00:00Z
DOI: 10.1137/21M1437457
Issue No: Vol. 4, No. 2 (2022)
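One interpolated weighting scheme in the spirit of the abstract can be sketched as follows (an inverse-distance-power weighting; the exponent, the choice of $k$, and the data are illustrative, not the authors' exact scheme): the weight diverges as a query approaches a training point, so the fit attains zero training error while remaining a weighted nearest-neighbor average elsewhere.

```python
import numpy as np

def interpolated_knn(x_train, y_train, x_query, k=5, gamma=2.0):
    """Weighted k-NN regression with weights d^(-gamma); gamma = 0 recovers
    plain (uninterpolated) k-NN averaging."""
    preds = np.empty(len(x_query))
    for q, x in enumerate(x_query):
        d = np.abs(x_train - x)
        idx = np.argsort(d)[:k]
        d_k = d[idx]
        if d_k[0] == 0.0:                # exact hit: interpolate the data
            preds[q] = y_train[idx[0]]
        else:
            w = d_k ** -gamma
            preds[q] = np.sum(w * y_train[idx]) / np.sum(w)
    return preds

rng = np.random.default_rng(0)
x_train = rng.uniform(0, 1, 200)
y_train = np.sin(2 * np.pi * x_train) + rng.normal(0, 0.1, 200)
x_test = np.linspace(0.05, 0.95, 50)
pred = interpolated_knn(x_train, y_train, x_test, k=10, gamma=2.0)
```

The paper's U-shaped curve concerns how the risk of such estimators varies with the degree of interpolation (here, the exponent).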
