|
|
- A new maximum mean discrepancy based two-sample test for equal
distributions in separable metric spaces-
Free pre-print version: Loading...
Rate this result:
What is this?
Please help us test our new pre-print finding feature by giving the pre-print link a rating. A 5 star rating indicates the linked pre-print has the exact same content as the published article.
Abstract: Abstract This paper presents a novel two-sample test for equal distributions in separable metric spaces, utilizing the maximum mean discrepancy (MMD). The test statistic is derived from the decomposition of the total variation of data in the reproducing kernel Hilbert space, and can be regarded as a V-statistic-based estimator of the squared MMD. The paper establishes the asymptotic null and alternative distributions of the test statistic. To approximate the null distribution accurately, a three-cumulant matched chi-squared approximation method is employed. The parameters for this approximation are consistently estimated from the data. Additionally, the paper introduces a new data-adaptive method based on the median absolute deviation to select the kernel width of the Gaussian kernel, and a new permutation test combining two different Gaussian kernel width selection methods, which improve the adaptability of the test to different data sets. Fast implementation of the test using matrix calculation is discussed. Extensive simulation studies and three real data examples are presented to demonstrate the good performance of the proposed test. PubDate: 2024-08-25
- Wasserstein principal component analysis for circular measures
-
Free pre-print version: Loading...
Rate this result:
What is this?
Please help us test our new pre-print finding feature by giving the pre-print link a rating. A 5 star rating indicates the linked pre-print has the exact same content as the published article.
Abstract: Abstract We consider the 2-Wasserstein space of probability measures supported on the unit-circle, and propose a framework for Principal Component Analysis (PCA) for data living in such a space. We build on a detailed investigation of the optimal transportation problem for measures on the unit-circle which might be of independent interest. In particular, building on previously obtained results, we derive an expression for optimal transport maps in (almost) closed form and propose an alternative definition of the tangent space at an absolutely continuous probability measure, together with fundamental characterizations of the associated exponential and logarithmic maps. PCA is performed by mapping data on the tangent space at the Wasserstein barycentre, which we approximate via an iterative scheme, and for which we establish a sufficient a posteriori condition to assess its convergence. Our methodology is illustrated on several simulated scenarios and a real data analysis of measurements of optical nerve thickness. PubDate: 2024-08-24
- Individualized causal mediation analysis with continuous treatment using
conditional generative adversarial networks-
Free pre-print version: Loading...
Rate this result:
What is this?
Please help us test our new pre-print finding feature by giving the pre-print link a rating. A 5 star rating indicates the linked pre-print has the exact same content as the published article.
Abstract: Abstract Traditional methods used in causal mediation analysis with continuous treatment often focus on estimating average causal effects, limiting their applicability in precision medicine. Machine learning techniques have emerged as a powerful approach for precisely estimating individualized causal effects. This paper proposes a novel method called CGAN-ICMA-CT that leverages Conditional Generative Adversarial Networks (CGANs) to infer individualized causal effects with continuous treatment. We thoroughly investigate the convergence properties of CGAN-ICMA-CT and show that the estimated distribution of our inferential conditional generator converges to the true conditional distribution under mild conditions. We conduct numerical experiments to validate the effectiveness of CGAN-ICMA-CT and compare it with four commonly used methods: linear regression, support vector machine regression, decision tree, and random forest regression. The results demonstrate that CGAN-ICMA-CT outperforms these methods regarding accuracy and precision. Furthermore, we apply the CGAN-ICMA-CT model to the real-world Job Corps dataset, showcasing its practical utility. By utilizing CGAN-ICMA-CT, we estimate the individualized causal effects of the Job Corps program on the number of arrests, providing insights into both direct effects and effects mediated through intermediate variables. Our findings confirm the potential of CGAN-ICMA-CT in advancing individualized causal mediation analysis with continuous treatment in precision medicine settings. PubDate: 2024-08-23
- Correction to: R-VGAL: a sequential variational Bayes algorithm for
generalised linear mixed models-
Free pre-print version: Loading...
Rate this result:
What is this?
Please help us test our new pre-print finding feature by giving the pre-print link a rating. A 5 star rating indicates the linked pre-print has the exact same content as the published article.
PubDate: 2024-08-21
- Taming numerical imprecision by adapting the KL divergence to negative
probabilities-
Free pre-print version: Loading...
Rate this result:
What is this?
Please help us test our new pre-print finding feature by giving the pre-print link a rating. A 5 star rating indicates the linked pre-print has the exact same content as the published article.
Abstract: Abstract The Kullback–Leibler (KL) divergence is frequently used in data science. For discrete distributions on large state spaces, approximations of probability vectors may result in a few small negative entries, rendering the KL divergence undefined. We address this problem by introducing a parameterized family of substitute divergence measures, the shifted KL (sKL) divergence measures. Our approach is generic and does not increase the computational overhead. We show that the sKL divergence shares important theoretical properties with the KL divergence and discuss how its shift parameters should be chosen. If Gaussian noise is added to a probability vector, we prove that the average sKL divergence converges to the KL divergence for small enough noise. We also show that our method solves the problem of negative entries in an application from computational oncology, the optimization of Mutual Hazard Networks for cancer progression using tensor-train approximations. PubDate: 2024-08-13
- A Bayesian approach to modeling finite element discretization error
-
Free pre-print version: Loading...
Rate this result:
What is this?
Please help us test our new pre-print finding feature by giving the pre-print link a rating. A 5 star rating indicates the linked pre-print has the exact same content as the published article.
Abstract: Abstract In this work, the uncertainty associated with the finite element discretization error is modeled following the Bayesian paradigm. First, a continuous formulation is derived, where a Gaussian process prior over the solution space is updated based on observations from a finite element discretization. To avoid the computation of intractable integrals, a second, finer, discretization is introduced that is assumed sufficiently dense to represent the true solution field. A prior distribution is assumed over the fine discretization, which is then updated based on observations from the coarse discretization. This yields a posterior distribution with a mean that serves as an estimate of the solution, and a covariance that models the uncertainty associated with this estimate. Two particular choices of prior are investigated: a prior defined implicitly by assigning a white noise distribution to the right-hand side term, and a prior whose covariance function is equal to the Green’s function of the partial differential equation. The former yields a posterior distribution with a mean close to the reference solution, but a covariance that contains little information regarding the finite element discretization error. The latter, on the other hand, yields posterior distribution with a mean equal to the coarse finite element solution, and a covariance with a close connection to the discretization error. For both choices of prior a contradiction arises, since the discretization error depends on the right-hand side term, but the posterior covariance does not. We demonstrate how, by rescaling the eigenvalues of the posterior covariance, this independence can be avoided. PubDate: 2024-08-09
- Roughness regularization for functional data analysis with free knots
spline estimation-
Free pre-print version: Loading...
Rate this result:
What is this?
Please help us test our new pre-print finding feature by giving the pre-print link a rating. A 5 star rating indicates the linked pre-print has the exact same content as the published article.
Abstract: Abstract In the era of big data, an ever-growing volume of information is recorded, either continuously over time or sporadically, at distinct time intervals. Functional Data Analysis (FDA) stands at the cutting edge of this data revolution, offering a powerful framework for handling and extracting meaningful insights from such complex datasets. The currently proposed FDA methods can often encounter challenges, especially when dealing with curves of varying shapes. This can largely be attributed to the method’s strong dependence on data approximation as a key aspect of the analysis process. In this work, we propose a free knots spline estimation method for functional data with two penalty terms and demonstrate its performance by comparing the results of several clustering methods on simulated and real data. PubDate: 2024-08-08
- AR-ADASYN: angle radius-adaptive synthetic data generation approach for
imbalanced learning-
Free pre-print version: Loading...
Rate this result:
What is this?
Please help us test our new pre-print finding feature by giving the pre-print link a rating. A 5 star rating indicates the linked pre-print has the exact same content as the published article.
Abstract: Abstract Imbalanced data often leads to biased models favoring the majority class, limiting the representation of the training data and complicating generalization to minority classes. Previous studies have addressed this problem by rebalancing the dataset using resampling methods, such as SMOTE. However, these methods often encounter challenges such as overfitting and information loss. To overcome these limitations, we propose a novel approach called Angle Radius-Adaptive Synthetic Sampling (AR-ADASYN). The proposed method aims to maintain the distribution of minority classes by taking into account both the distance and the angle between existing samples of the minority class. We conducted experiments on 14 datasets composed of real data to compare the classification performance of the proposed method with other oversampling techniques. The results revealed that the proposed method demonstrated superior classification performance compared to the other oversampling techniques. As a result, AR-ADASYN showed potential as a valuable tool for improving the robustness and generalization of machine learning models trained on imbalanced datasets. PubDate: 2024-08-08
- Learning variational autoencoders via MCMC speed measures
-
Free pre-print version: Loading...
Rate this result:
What is this?
Please help us test our new pre-print finding feature by giving the pre-print link a rating. A 5 star rating indicates the linked pre-print has the exact same content as the published article.
Abstract: Abstract Variational autoencoders (VAEs) are popular likelihood-based generative models which can be efficiently trained by maximising an evidence lower bound. There has been much progress in improving the expressiveness of the variational distribution to obtain tighter variational bounds and increased generative performance. Whilst previous work has leveraged Markov chain Monte Carlo methods for constructing variational densities, gradient-based methods for adapting the proposal distributions for deep latent variable models have received less attention. This work suggests an entropy-based adaptation for a short-run metropolis-adjusted Langevin or Hamiltonian Monte Carlo (HMC) chain while optimising a tighter variational bound to the log-evidence. Experiments show that this approach yields higher held-out log-likelihoods as well as improved generative metrics. Our implicit variational density can adapt to complicated posterior geometries of latent hierarchical representations arising in hierarchical VAEs. PubDate: 2024-08-06
- The COR criterion for optimal subset selection in distributed estimation
-
Free pre-print version: Loading...
Rate this result:
What is this?
Please help us test our new pre-print finding feature by giving the pre-print link a rating. A 5 star rating indicates the linked pre-print has the exact same content as the published article.
Abstract: Abstract The problem of selecting an optimal subset in distributed regression is a crucial issue, as each distributed data subset may contain redundant information, which can be attributed to various sources such as outliers, dispersion, inconsistent duplicates, too many independent variables, and excessive data points, among others. Efficient reduction and elimination of this redundancy can help alleviate inconsistency issues for statistical inference. Therefore, it is imperative to track redundancy while measuring and processing data. We develop a criterion for optimal subset selection that is related to Covariance matrices, Observation matrices, and Response vectors (COR). We also derive a novel distributed interval estimation for the proposed criterion and establish the existence of optimal subset length. Finally, numerical experiments are conducted to verify the experimental feasibility of the proposed criterion. PubDate: 2024-08-02
- High-dimensional missing data imputation via undirected graphical model
-
Free pre-print version: Loading...
Rate this result:
What is this?
Please help us test our new pre-print finding feature by giving the pre-print link a rating. A 5 star rating indicates the linked pre-print has the exact same content as the published article.
Abstract: Abstract Multiple imputation is a practical approach in analyzing incomplete data, with multiple imputation by chained equations (MICE) being popularly used. MICE specifies a conditional distribution for each variable to be imputed, but estimating it is inherently a high-dimensional problem for large-scale data. Existing approaches propose to utilize regularized regression models, such as lasso. However, the estimation of them occurs iteratively across all incomplete variables, leading to a considerable increase in computational burden, as demonstrated in our simulation study. To overcome this computational bottleneck, we propose a novel method that estimates the conditional independence structure among variables before the imputation procedure. We extract such information from an undirected graphical model, leveraging the graphical lasso method based on the inverse probability weighting estimator. Our simulation study verifies the proposed method is way faster against the existing methods, while still maintaining comparable imputation performance. PubDate: 2024-08-01
- Detection of spatiotemporal changepoints: a generalised additive model
approach-
Free pre-print version: Loading...
Rate this result:
What is this?
Please help us test our new pre-print finding feature by giving the pre-print link a rating. A 5 star rating indicates the linked pre-print has the exact same content as the published article.
Abstract: Abstract The detection of changepoints in spatio-temporal datasets has been receiving increased focus in recent years and is utilised in a wide range of fields. With temporal data observed at different spatial locations, the current approach is typically to use univariate changepoint methods in a marginal sense with the detected changepoint being representative of a single location only. We present a spatio-temporal changepoint method that utilises a generalised additive model (GAM) dependent on the 2D spatial location and the observation time to account for the underlying spatio-temporal process. We use the full likelihood of the GAM in conjunction with the pruned linear exact time (PELT) changepoint search algorithm to detect multiple changepoints across spatial locations in a computationally efficient manner. When compared to a univariate marginal approach our method is shown to perform more efficiently in simulation studies at detecting true changepoints and demonstrates less evidence of overfitting. Furthermore, as the approach explicitly models spatio-temporal dependencies between spatial locations, any changepoints detected are common across the locations. We demonstrate an application of the method to an air quality dataset covering the COVID-19 lockdown in the United Kingdom. PubDate: 2024-08-01
- Distributed subsampling for multiplicative regression
-
Free pre-print version: Loading...
Rate this result:
What is this?
Please help us test our new pre-print finding feature by giving the pre-print link a rating. A 5 star rating indicates the linked pre-print has the exact same content as the published article.
Abstract: Abstract Multiplicative regression is a useful alternative tool in modeling positive response data. This paper proposes two distributed estimators for multiplicative error model on distributed system with non-randomly distributed massive data. We first present a Poisson subsampling procedure to obtain a subsampling estimator based on the least product relative error (LPRE) loss, which is effective on a distributed system. Theoretically, we justify the subsampling estimator by establishing its convergence rate, asymptotic normality and deriving the optimal subsampling probabilities in terms of the L-optimality criterion. Then, we provide a distributed LPRE estimator based on the Poisson subsampling (DLPRE-P), which is communication-efficient since it needs to transmit a very small subsample from local machines to the central site, which is empirically feasible, together with the gradient of the loss. Practically, due to the use of Newton–Raphson iteration, the Hessian matrix can be computed more robustly using the subsampled data than using one local dataset. We also show that the DLPRE-P estimator is statistically efficient as the global estimator, which is based on putting all the datasets together. Furthermore, we propose a distributed regularized LPRE estimator (DRLPRE-P) to consider the variable selection problem in high dimension. A distributed algorithm based on the alternating direction method of multipliers (ADMM) is developed for implementing the DRLPRE-P. The oracle property holds for DRLPRE-P. Finally, simulation experiments and two real-world data analyses are conducted to illustrate the performance of our methods. PubDate: 2024-08-01
- A Mallows-type model averaging estimator for ridge regression
with randomly right censored data-
Free pre-print version: Loading...
Rate this result:
What is this?
Please help us test our new pre-print finding feature by giving the pre-print link a rating. A 5 star rating indicates the linked pre-print has the exact same content as the published article.
Abstract: Abstract Instead of picking up a single ridge parameter in ridge regression, this paper considers a frequentist model averaging approach to appropriately combine the set of ridge estimators with different ridge parameters, when the response is randomly right censored. Within this context, we propose a weighted least squares ridge estimation for unknown regression parameter. A new Mallows-type weight choice criterion is then developed to allocate model weights, where the unknown distribution function of the censoring random variable is replaced by the Kaplan–Meier estimator and the covariance matrix of random errors is substituted by its averaging estimator. Under some mild conditions, we show that when the fitting model is misspecified, the resulting model averaging estimator achieves optimality in terms of minimizing the loss function. Whereas, when the fitting model is correctly specified, the model averaging estimator of the regression parameter is root-n consistent. Additionally, for the weight vector which is obtained by minimizing the new criterion, we establish its rate of convergence to the infeasible optimal weight vector. Simulation results show that our method is better than some existing methods. A real dataset is analyzed for illustration as well. PubDate: 2024-07-29
- Byzantine-robust and efficient distributed sparsity learning: a surrogate
composite quantile regression approach-
Free pre-print version: Loading...
Rate this result:
What is this?
Please help us test our new pre-print finding feature by giving the pre-print link a rating. A 5 star rating indicates the linked pre-print has the exact same content as the published article.
Abstract: Abstract Distributed statistical learning has gained significant traction recently, mainly due to the availability of unprecedentedly massive datasets. The objective of distributed statistical learning is to learn models by effectively utilizing data scattered across various machines. However, its performance can be impeded by three significant challenges: arbitrary noises, high dimensionality, and machine failures—the latter being specifically referred to as Byzantine failure. To address the first two challenges, we propose leveraging the potential of composite quantile regression in conjunction with the \(\ell _1\) penalty. However, this combination introduces a doubly nonsmooth objective function, posing new challenges. In such scenarios, most existing Byzantine-robust methods exhibit slow sublinear convergence rates and fail to achieve near-optimal statistical convergence rates. To fill this gap, we introduce a novel smoothing procedure that effectively handles the nonsmooth aspects. This innovation allows us to develop a Byzantine-robust sparsity learning algorithm that converges provably to the near-optimal convergence rate linearly. Moreover, we establish support recovery guarantees for our proposed methods. We substantiate the effectiveness of our approaches through comprehensive empirical analyses. PubDate: 2024-07-22
- ForLion: a new algorithm for D-optimal designs under general parametric
statistical models with mixed factors-
Free pre-print version: Loading...
Rate this result:
What is this?
Please help us test our new pre-print finding feature by giving the pre-print link a rating. A 5 star rating indicates the linked pre-print has the exact same content as the published article.
Abstract: Abstract In this paper, we address the problem of designing an experimental plan with both discrete and continuous factors under fairly general parametric statistical models. We propose a new algorithm, named ForLion, to search for locally optimal approximate designs under the D-criterion. The algorithm performs an exhaustive search in a design space with mixed factors while keeping high efficiency and reducing the number of distinct experimental settings. Its optimality is guaranteed by the general equivalence theorem. We present the relevant theoretical results for multinomial logit models (MLM) and generalized linear models (GLM), and demonstrate the superiority of our algorithm over state-of-the-art design algorithms using real-life experiments under MLM and GLM. Our simulation studies show that the ForLion algorithm could reduce the number of experimental settings by 25% or improve the relative efficiency of the designs by 17.5% on average. Our algorithm can help the experimenters reduce the time cost, the usage of experimental devices, and thus the total cost of their experiments while preserving high efficiencies of the designs. PubDate: 2024-07-18
- Optimal designs for nonlinear mixed-effects models using competitive swarm
optimizer with mutated agents-
Free pre-print version: Loading...
Rate this result:
What is this?
Please help us test our new pre-print finding feature by giving the pre-print link a rating. A 5 star rating indicates the linked pre-print has the exact same content as the published article.
Abstract: Abstract Nature-inspired meta-heuristic algorithms are increasingly used in many disciplines to tackle challenging optimization problems. Our focus is to apply a newly proposed nature-inspired meta-heuristics algorithm called CSO-MA to solve challenging design problems in biosciences and demonstrate its flexibility to find various types of optimal approximate or exact designs for nonlinear mixed models with one or several interacting factors and with or without random effects. We show that CSO-MA is efficient and can frequently outperform other algorithms either in terms of speed or accuracy. The algorithm, like other meta-heuristic algorithms, is free of technical assumptions and flexible in that it can incorporate cost structure or multiple user-specified constraints, such as, a fixed number of measurements per subject in a longitudinal study. When possible, we confirm some of the CSO-MA generated designs are optimal with theory by developing theory-based innovative plots. Our applications include searching optimal designs to estimate (i) parameters in mixed nonlinear models with correlated random effects, (ii) a function of parameters for a count model in a dose combination study, and (iii) parameters in a HIV dynamic model. In each case, we show the advantages of using a meta-heuristic approach to solve the optimization problem, and the added benefits of the generated designs. PubDate: 2024-07-17
- Sparse and geometry-aware generalisation of the mutual information for
joint discriminative clustering and feature selection-
Free pre-print version: Loading...
Rate this result:
What is this?
Please help us test our new pre-print finding feature by giving the pre-print link a rating. A 5 star rating indicates the linked pre-print has the exact same content as the published article.
Abstract: Abstract Feature selection in clustering is a hard task which involves simultaneously the discovery of relevant clusters as well as relevant variables with respect to these clusters. While feature selection algorithms are often model-based through optimised model selection or strong assumptions on the data distribution, we introduce a discriminative clustering model trying to maximise a geometry-aware generalisation of the mutual information called GEMINI with a simple \(\ell _1\) penalty: the Sparse GEMINI. This algorithm avoids the burden of combinatorial feature subset exploration and is easily scalable to high-dimensional data and large amounts of samples while only designing a discriminative clustering model. We demonstrate the performances of Sparse GEMINI on synthetic datasets and large-scale datasets. Our results show that Sparse GEMINI is a competitive algorithm and has the ability to select relevant subsets of variables with respect to the clustering without using relevance criteria or prior hypotheses. PubDate: 2024-07-17
- A mixture of experts regression model for functional response with
functional covariates-
Free pre-print version: Loading...
Rate this result:
What is this?
Please help us test our new pre-print finding feature by giving the pre-print link a rating. A 5 star rating indicates the linked pre-print has the exact same content as the published article.
Abstract: Abstract Due to the fast growth of data that are measured on a continuous scale, functional data analysis has undergone many developments in recent years. Regression models with a functional response involving functional covariates, also called “function-on-function”, are thus becoming very common. Studying this type of model in the presence of heterogeneous data can be particularly useful in various practical situations. We mainly develop in this work a function-on-function Mixture of Experts (FFMoE) regression model. Like most of the inference approach for models on functional data, we use basis expansion (B-splines) both for covariates and parameters. A regularized inference approach is also proposed, it accurately smoothes functional parameters in order to provide interpretable estimators. Numerical studies on simulated data illustrate the good performance of FFMoE as compared with competitors. Usefullness of the proposed model is illustrated on two data sets: the reference Canadian weather data set, in which the precipitations are modeled according to the temperature, and a Cycling data set, in which the developed power is explained by the speed, the cyclist heart rate and the slope of the road. PubDate: 2024-07-11
- Correction to: Explainable generalized additive neural networks with
independent neural network training-
Free pre-print version: Loading...
Rate this result:
What is this?
Please help us test our new pre-print finding feature by giving the pre-print link a rating. A 5 star rating indicates the linked pre-print has the exact same content as the published article.
PubDate: 2024-07-08
|