Abstract: In this paper, we propose a generalized expectation model selection (GEMS) algorithm for latent variable selection in multidimensional item response theory models, which are commonly used for identifying the relationships between latent traits and test items. Under some mild assumptions, we prove the numerical convergence of GEMS for model selection by minimizing the generalized information criteria of observed data in the presence of missing data. For latent variable selection in the multidimensional two-parameter logistic (M2PL) models, we present an efficient implementation of GEMS to minimize the Bayesian information criterion. To ensure parameter identifiability, the variances of all latent traits are assumed to be unity and each latent trait is required to have an item exclusively associated with it. The convergence of GEMS for the M2PL models is verified. Simulation studies show that GEMS is computationally more efficient than the expectation model selection (EMS) algorithm and the expectation-maximization-based \(L_{1}\)-penalized method (EML1), and it achieves a higher correct rate of latent variable selection and a lower mean squared error of parameter estimates than EMS and EML1. The GEMS algorithm is illustrated by analyzing a real dataset related to the Eysenck Personality Questionnaire. PubDate: 2023-11-25
Abstract: Hamiltonian Monte Carlo (HMC) algorithms, which combine numerical approximation of Hamiltonian dynamics on finite intervals with stochastic refreshment and Metropolis correction, are popular sampling schemes, but it is known that they may suffer from slow convergence in the continuous time limit. A recent paper of Bou-Rabee and Sanz-Serna (Ann Appl Prob, 27:2159–2194, 2017) demonstrated that this issue can be addressed by simply randomizing the duration parameter of the Hamiltonian paths. In this article, we use the same idea to enhance the sampling efficiency of a constrained version of HMC, with potential benefits in a variety of application settings. We demonstrate both the conservation of the stationary distribution and the ergodicity of the method. We also compare the performance of various schemes in numerical studies of model problems, including an application to high-dimensional covariance estimation. PubDate: 2023-11-24
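For readers unfamiliar with the randomization idea, here is a minimal sketch of unconstrained HMC with a freshly drawn exponential path duration at every iteration. This is not the paper's constrained algorithm (which additionally enforces constraints on the dynamics); the target, step size, and mean duration below are illustrative assumptions.

```python
import numpy as np

def hmc_randomized(logp_grad, logp, q0, n_samples, eps=0.1, mean_T=1.0, rng=None):
    """Sketch of HMC with randomized path duration (cf. Bou-Rabee & Sanz-Serna):
    the integration time of each Hamiltonian path is drawn afresh from an
    exponential distribution with mean `mean_T`, then discretized with
    leapfrog steps of size `eps`."""
    rng = np.random.default_rng() if rng is None else rng
    q = np.asarray(q0, dtype=float)
    samples = []
    for _ in range(n_samples):
        p = rng.standard_normal(q.shape)             # momentum refreshment
        T = rng.exponential(mean_T)                  # randomized duration
        L = max(1, int(T / eps))
        q_new, p_new = q.copy(), p.copy()
        # leapfrog integration of the Hamiltonian dynamics
        p_new += 0.5 * eps * logp_grad(q_new)
        for _ in range(L - 1):
            q_new += eps * p_new
            p_new += eps * logp_grad(q_new)
        q_new += eps * p_new
        p_new += 0.5 * eps * logp_grad(q_new)
        # Metropolis correction preserves the stationary distribution
        log_accept = (logp(q_new) - 0.5 * p_new @ p_new) - (logp(q) - 0.5 * p @ p)
        if np.log(rng.uniform()) < log_accept:
            q = q_new
        samples.append(q.copy())
    return np.array(samples)

# toy usage: standard bivariate Gaussian target
draws = hmc_randomized(lambda q: -q, lambda q: -0.5 * q @ q, np.zeros(2), 1000)
```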
Abstract: This paper introduces the randomized self-updating process (rSUP) algorithm for clustering large-scale data. rSUP is an extension of the self-updating process (SUP) algorithm, which has shown effectiveness in clustering data with characteristics such as noise, varying cluster shapes and sizes, and numerous clusters. However, SUP’s reliance on pairwise dissimilarities between data points makes it computationally inefficient for large-scale data. To address this challenge, rSUP performs location updates within randomly generated data subsets at each iteration. The Law of Large Numbers guarantees that the clustering results of rSUP converge to those of the original SUP as the partition size grows. This paper demonstrates the effectiveness and computational efficiency of rSUP in large-scale data clustering through simulations and real datasets. PubDate: 2023-11-24
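A minimal sketch of the subset-wise update described above, under assumed choices (an exponentially decaying influence function truncated at radius `r`, a fixed number of random subsets per iteration; the SUP literature considers several influence functions):

```python
import numpy as np

def rsup(X, n_subsets=10, r=0.5, T=0.2, n_iter=50, seed=0):
    """Sketch of the randomized self-updating process (rSUP): at each
    iteration the points are randomly partitioned into subsets, and every
    point moves to a weighted average of the points in its own subset, with
    weights exp(-d/T) for pairwise distances d <= r. Points that end up
    (numerically) at the same location form a cluster."""
    rng = np.random.default_rng(seed)
    Z = X.astype(float)
    n = len(Z)
    for _ in range(n_iter):
        perm = rng.permutation(n)
        for block in np.array_split(perm, n_subsets):
            # pairwise distances only within the random subset
            D = np.linalg.norm(Z[block, None, :] - Z[None, block, :], axis=-1)
            W = np.where(D <= r, np.exp(-D / T), 0.0)
            Z[block] = W @ Z[block] / W.sum(axis=1, keepdims=True)
    return Z  # cluster by grouping near-identical rows of Z

rng = np.random.default_rng(1)
X = np.vstack([rng.standard_normal((100, 2)), rng.standard_normal((100, 2)) + 5.0])
Z = rsup(X)
```

Because each update only touches a subset, the per-iteration cost drops from quadratic in the sample size to quadratic in the subset size, which is the source of the claimed scalability.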
Abstract: Multivariate panel data of mixed type are routinely collected in many different areas of application, often jointly with additional covariates which complicate the statistical analysis. Moreover, it is often of interest to identify unknown groups of subjects in a study population using such a data structure, i.e., to perform clustering. In the Bayesian framework, we propose a finite mixture of multivariate generalised linear mixed effects regression models to cluster numeric, binary, ordinal and categorical panel outcomes jointly. The specification of suitable priors on the model parameters allows for convenient posterior inference based on Markov chain Monte Carlo (MCMC) sampling with data augmentation. This approach makes it possible to classify both the subjects in the data and new subjects, as well as to characterise the cluster-specific models. Model estimation and selection of the number of data clusters are performed simultaneously when approximating the posterior for a single model using MCMC sampling, without resorting to multiple model estimations. The performance of the proposed methodology is evaluated in a simulation study. Its application is illustrated on two data sets, one from a longitudinal patient study to infer prognosis groups, and a second one from the Czech part of the EU-SILC survey, where households are annually interviewed to obtain insights into changes in their financial capability. PubDate: 2023-11-22
Abstract: Multiple systems estimation using a Poisson loglinear model is a standard approach to quantifying hidden populations where data sources are based on lists of known cases. Information criteria are often used for selecting between the large number of possible models. Confidence intervals are often reported conditional on the model selected, providing an over-optimistic impression of estimation accuracy. A bootstrap approach is a natural way to account for the model selection. However, because the model selection step has to be carried out for every bootstrap replication, there may be a high or even prohibitive computational burden. We explore the merit of modifying the model selection procedure in the bootstrap to look only among a subset of models, chosen on the basis of their information criterion score on the original data. This provides large computational gains with little apparent effect on inference. We also incorporate rigorous and economical ways of approaching issues of the existence of estimators when applying the method to sparse data tables. PubDate: 2023-11-21
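The restricted-search idea can be sketched as follows. Everything about the inputs here is an assumption for illustration: `y` is the vector of observed overlap-cell counts, `designs[j]` the design matrix of the j-th candidate Poisson loglinear model, `unobs_rows[j]` that model's 1×p covariate row for the unobserved cell, and the multinomial resample of the table is just one simple resampling choice.

```python
import numpy as np
import statsmodels.api as sm

def restricted_bootstrap(y, designs, unobs_rows, n_boot=500, top_k=5, seed=0):
    """Sketch of bootstrap inference with a restricted candidate-model set:
    models are ranked by AIC once, on the original data, and each bootstrap
    replicate searches only the `top_k` short list rather than all models."""
    rng = np.random.default_rng(seed)
    fits = [sm.GLM(y, X, family=sm.families.Poisson()).fit() for X in designs]
    shortlist = np.argsort([f.aic for f in fits])[:top_k]
    n, totals = y.sum(), []
    for _ in range(n_boot):
        yb = rng.multinomial(n, y / n)             # resample the table
        refits = [sm.GLM(yb, designs[j], family=sm.families.Poisson()).fit()
                  for j in shortlist]
        j_best = int(np.argmin([f.aic for f in refits]))
        # total population = observed total + predicted unobserved cell
        totals.append(n + refits[j_best].predict(unobs_rows[shortlist[j_best]])[0])
    return np.percentile(totals, [2.5, 97.5])
```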
Abstract: Stochastic variational inference algorithms are derived for fitting various heteroskedastic time series models. We examine Gaussian, t, and skewed t response GARCH models and fit these using Gaussian variational approximating densities. We implement efficient stochastic gradient ascent procedures based on the use of control variates or the reparameterization trick and demonstrate that the proposed implementations provide a fast and accurate alternative to Markov chain Monte Carlo sampling. Additionally, we present sequential updating versions of our variational algorithms, which are suitable for efficient portfolio construction and dynamic asset allocation. PubDate: 2023-11-21
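The reparameterization trick mentioned above can be illustrated with a generic mean-field Gaussian approximation. The sketch below is a toy, not the paper's GARCH implementation: `grad_logpost` stands in for the gradient of a (GARCH) log posterior, and a quadratic target is used so that correctness is easy to check. With q = N(mu, diag(e^{2 rho})), the entropy is sum(rho) plus a constant, so its gradient in rho is 1 per coordinate.

```python
import numpy as np

def reparam_svi(grad_logpost, dim, n_iter=2000, n_mc=10, lr=0.01, seed=0):
    """Sketch of stochastic gradient ascent on the ELBO via the
    reparameterization trick, with theta = mu + sigma * eps, eps ~ N(0, I)."""
    rng = np.random.default_rng(seed)
    mu, rho = np.zeros(dim), np.zeros(dim)
    for _ in range(n_iter):
        eps = rng.standard_normal((n_mc, dim))
        sigma = np.exp(rho)
        theta = mu + sigma * eps                    # theta = T(eps; mu, rho)
        g = np.array([grad_logpost(t) for t in theta])
        grad_mu = g.mean(axis=0)                    # d ELBO / d mu
        grad_rho = (g * eps * sigma).mean(axis=0) + 1.0  # chain rule + entropy
        mu, rho = mu + lr * grad_mu, rho + lr * grad_rho
    return mu, np.exp(rho)

# toy usage: posterior N(3, 0.5^2) per coordinate, so grad logp = -(t - 3)/0.25
mu, sd = reparam_svi(lambda t: -(t - 3.0) / 0.25, dim=2)  # mu -> 3, sd -> 0.5
```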
Abstract: We are interested in renewable estimations and algorithms for nonparametric models with streaming data. In our method, the nonparametric function of interest is expressed through a functional depending on a weight function and a conditional distribution function (CDF). The CDF is estimated by renewable kernel estimations together with function interpolations, based on which we propose the method of renewable weighted composite quantile regression (WCQR). Then, by fully utilizing the model structure, we obtain new selectors for the weight function, such that the WCQR can achieve asymptotic unbiasedness when estimating specific functions in the model. We also propose practical bandwidth selectors for streaming data and find the optimal weight function by minimizing the asymptotic variance. The asymptotic results show that our estimator is almost equivalent to the oracle estimator obtained from the entire dataset. Moreover, our method enjoys adaptiveness to error distributions, robustness to outliers, and efficiency in both estimation and computation. Simulation studies and real data analyses further confirm our theoretical findings. PubDate: 2023-11-17
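To make the WCQR criterion concrete, the sketch below minimizes the underlying batch objective: a weighted sum of quantile check losses sharing a common slope across quantile levels, each level with its own intercept. The paper's renewable/streaming updates and data-driven weight selection are omitted; the simplex solver and equal weights are illustrative choices.

```python
import numpy as np
from scipy.optimize import minimize

def wcqr(x, y, taus, weights):
    """Sketch of the batch weighted composite quantile regression criterion."""
    def check(u, tau):                        # quantile check loss rho_tau(u)
        return u * (tau - (u < 0))
    def objective(par):
        beta, intercepts = par[0], par[1:]
        return sum(w * check(y - b - beta * x, tau).sum()
                   for w, tau, b in zip(weights, taus, intercepts))
    par0 = np.zeros(1 + len(taus))
    return minimize(objective, par0, method="Nelder-Mead").x

rng = np.random.default_rng(0)
x = rng.uniform(-2, 2, 500)
y = 1.5 * x + rng.standard_t(df=3, size=500)     # heavy-tailed errors
taus = np.linspace(0.1, 0.9, 5)
est = wcqr(x, y, taus, weights=np.ones(5) / 5)   # est[0] is the common slope
```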
Abstract: A model-based biclustering method for multivariate discrete longitudinal data is proposed. We consider a finite mixture of generalized linear models to cluster units and, within each mixture component, we adopt a flexible and parsimonious parameterization of the component-specific canonical parameter to define subsets of variables (segments) sharing common dynamics over time. We develop an Expectation-Maximization-type algorithm for maximum likelihood estimation of model parameters. The performance of the proposed model is evaluated in a large-scale simulation study, where we consider different choices for the sample size, the number of measurement occasions, and the number of components and segments. The proposal is applied to Italian crime data (source: ISTAT) with the aim of detecting areas sharing common longitudinal trajectories for specific subsets of crime types. The identification of such biclusters may potentially be helpful for policymakers making decisions on safety. PubDate: 2023-11-17
Abstract: This paper addresses the problem of offline evaluation in tabular reinforcement learning (RL). We propose a novel method that leverages synthetic trajectories constructed from the available data on a “sampling with replacement” basis, combining the advantages of model-based and Monte Carlo policy evaluation. The method is accompanied by theoretically derived finite sample upper error bounds, offering performance guarantees and allowing for a trade-off between statistical efficiency and computational cost. The results from computational experiments demonstrate that our method consistently achieves lower upper error bounds and relative mean square errors compared to Importance Sampling, Doubly Robust methods, and other existing approaches. Furthermore, this method achieves these superior results in significantly shorter running times compared to traditional model-based approaches. These findings highlight the effectiveness and efficiency of this synthetic trajectory method for accurate offline policy evaluation in RL. PubDate: 2023-11-17
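A minimal sketch of the resampling idea for the tabular case, under assumed inputs (logged transitions as (s, a, r, s') tuples, a target `policy` returning an action-probability vector per state, a fixed start state and horizon); the paper's error bounds and exact construction are not reproduced here.

```python
import numpy as np
from collections import defaultdict

def synthetic_policy_value(transitions, policy, s0, gamma=0.99,
                           n_traj=1000, horizon=100, seed=0):
    """Sketch of offline evaluation with synthetic trajectories: logged
    transitions are grouped by state-action pair, and rollouts under the
    target policy resample the observed (r, s') outcomes with replacement."""
    rng = np.random.default_rng(seed)
    pool = defaultdict(list)                     # (s, a) -> list of (r, s')
    for s, a, r, s_next in transitions:
        pool[(s, a)].append((r, s_next))
    returns = []
    for _ in range(n_traj):
        s, G, disc = s0, 0.0, 1.0
        for _ in range(horizon):
            a = rng.choice(len(policy[s]), p=policy[s])
            outcomes = pool.get((s, a))
            if not outcomes:                     # no logged data for this pair
                break
            r, s = outcomes[rng.integers(len(outcomes))]
            G += disc * r
            disc *= gamma
        returns.append(G)
    return np.mean(returns)                      # Monte Carlo value estimate
```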
Abstract: It is common to observe significant heterogeneity in clustered data across scientific fields. Cluster-wise conditional distributions are widely used to explore variations and relationships within and among clusters. This paper aims to capture such heterogeneity by employing cluster-wise finite mixture models. To address the heterogeneity among clusters, we introduce a latent group structure and incorporate heterogeneous mixing proportions across different groups, accommodating the diverse characteristics observed in the data. The specific number of groups and their membership are unknown. To identify the latent group structure, we apply concave penalty functions to the pairwise differences of preliminary consistent estimators of the mixing proportions. This approach enables the automatic division of clusters into finite subgroups. Theoretical results demonstrate that as the number of clusters and the cluster sizes tend to infinity, the true latent group structure can be recovered with probability close to one, and the post-classification estimators exhibit oracle efficiency. We demonstrate the performance and applicability of our proposed approach through extensive simulations and an analysis of basic consumption expenditure among urban households in China. PubDate: 2023-11-15
Abstract: In this paper, we study doubly robust estimation and robust empirical likelihood of the regression parameter for generalized linear models with missing responses. A doubly robust estimating equation is proposed to estimate the regression parameter, and the resulting estimator is consistent and asymptotically normal, regardless of whether the assumed model contains the true model. A robust empirical log-likelihood ratio statistic for the regression parameter is constructed, and the statistic is shown to converge weakly to the standard \(\chi^2\) distribution. The result can be directly used to construct the confidence region of the regression parameter. A method for selecting the tuning parameters of the \(\psi\)-function is also given. Simulation studies show the robustness of the estimator of the regression parameter and evaluate the performance of the robust empirical likelihood method. A real data example shows that the proposed method is feasible. PubDate: 2023-11-14
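The double-robustness structure is easiest to see in the simpler problem of estimating a mean with responses missing at random; the sketch below shows that augmented inverse-probability-weighted (AIPW) analogue, with hypothetical fitted values `pi_hat` and `m_hat` standing in for estimated propensity and outcome models. The paper's estimating equation for GLM regression parameters has the same two-model structure.

```python
import numpy as np

def aipw_mean(y, observed, pi_hat, m_hat):
    """Doubly robust estimator of E[Y]: consistent if either the propensity
    model pi_hat = P(observed | x) or the outcome model m_hat = E[Y | x]
    is correctly specified."""
    r = observed.astype(float)
    y_filled = np.where(observed, y, 0.0)   # unobserved y values never used
    return np.mean(r * y_filled / pi_hat - (r - pi_hat) / pi_hat * m_hat)

# toy usage with the true propensity and outcome models plugged in
rng = np.random.default_rng(0)
x = rng.normal(size=2000)
y = 2.0 + x + rng.normal(size=2000)
pi = 1 / (1 + np.exp(-(0.5 + x)))            # missingness propensity
observed = rng.uniform(size=2000) < pi
print(aipw_mean(y, observed, pi_hat=pi, m_hat=2.0 + x))   # close to E[Y] = 2
```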
Abstract: Supersaturated designs are used in science and engineering to efficiently explore a large number of factors with a limited number of runs. It is not uncommon in engineering to consider a few, if not all, factors at more than two levels. Multi- and mixed-level supersaturated designs may, therefore, be handy. While two-level supersaturated designs are widely studied, the literature on multi- and mixed-level designs is still scarce. A recent paper establishes that the group LASSO should be preferred as an analysis method because it can retain the natural group structure of multi- and mixed-level designs. A few optimality criteria for such designs also exist in the literature. These criteria typically aim to find designs that maximize average pairwise orthogonality. However, the literature lacks guidance on the better or ‘right’ optimality criteria from a screening perspective. In addition, the existing optimal designs are often balanced and are rarely available. We propose two new optimality criteria based on the large-sample properties of the group LASSO. Our criteria fill the gap in the literature by providing design selection criteria that are directly related to the preferred analysis method. We then construct Pareto-efficient designs on the two new criteria and demonstrate that (a) our optimality criteria can be used to order existing optimal designs on their screening performance, (b) the Pareto-efficient designs are often better than or as good as the existing optimal designs, and (c) the Pareto-efficient designs can be constructed using a coordinate exchange algorithm and are, therefore, available for any choice of the number of runs, factors, and levels. A repository of three- and four-level designs with the number of runs between 8 and 16 is also provided. PubDate: 2023-11-11
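For readers unfamiliar with the analysis method the criteria are built around, here is a generic group LASSO sketch via proximal gradient descent (block soft-thresholding); the dummy coding, grouping, and data below are hypothetical and unrelated to the paper's designs.

```python
import numpy as np

def group_lasso(X, y, groups, lam, lr=None, n_iter=1000):
    """Sketch of group LASSO by proximal gradient (ISTA): minimize
    0.5*||y - X beta||^2 + lam * sum_g ||beta_g||_2, so each factor's
    dummy-column group enters or leaves the model as a block."""
    n, p = X.shape
    lr = lr or 1.0 / np.linalg.norm(X, 2) ** 2   # 1 / Lipschitz constant
    beta = np.zeros(p)
    for _ in range(n_iter):
        beta = beta - lr * X.T @ (X @ beta - y)  # gradient step
        for g in groups:                         # block soft-thresholding
            norm_g = np.linalg.norm(beta[g])
            beta[g] = 0.0 if norm_g <= lr * lam else beta[g] * (1 - lr * lam / norm_g)
    return beta

# hypothetical usage: ten 3-level factors, each coded by two dummy columns
rng = np.random.default_rng(0)
X = rng.standard_normal((14, 20))                # 14-run design stand-in
y = X[:, 0] - X[:, 1] + 0.1 * rng.standard_normal(14)
groups = [np.arange(2 * f, 2 * f + 2) for f in range(10)]
beta = group_lasso(X, y, groups, lam=1.0)        # only group 0 should survive
```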
Abstract: Analogously to the well-known Langevin Monte Carlo method, in this article we provide a method to sample from a target distribution \(\pi\) by simulating a solution of a stochastic differential equation. Here, the stochastic differential equation is driven by a general Lévy process which—unlike the case of Langevin Monte Carlo—allows for non-smooth targets. Our method is fully explored in the particular setting of target distributions supported on the half-line \((0,\infty)\) with a compound Poisson driving noise. Several illustrative examples conclude the article. PubDate: 2023-11-10
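A generic simulation sketch of such a jump-driven SDE: an Euler scheme for dX_t = b(X_t) dt + dL_t with L a compound Poisson process. The drift and jump distribution below are toy assumptions chosen so the path stays on the half-line; the paper derives the specific drift/noise pair that makes a given target \(\pi\) stationary.

```python
import numpy as np

def euler_cp_sde(drift, jump_sampler, rate, x0, T=10.0, dt=1e-3, seed=0):
    """Euler scheme for an SDE driven by a compound Poisson process with
    jump intensity `rate` and jump sizes drawn by `jump_sampler`."""
    rng = np.random.default_rng(seed)
    n = int(T / dt)
    path = np.empty(n + 1)
    path[0] = x0
    for k in range(n):
        x = path[k] + drift(path[k]) * dt
        if rng.uniform() < rate * dt:        # at most one jump per small step
            x += jump_sampler(rng)
        path[k + 1] = x
    return path

# toy usage: linear pull toward zero plus positive exponential jumps,
# which keeps the path on (0, infinity)
path = euler_cp_sde(drift=lambda x: -x, rate=2.0,
                    jump_sampler=lambda rng: rng.exponential(1.0), x0=1.0)
```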
Abstract: We introduce a data-adaptive nonparametric dimension reduction tool to obtain a low-dimensional approximation of functional data contaminated by erratic measurement errors following symmetric or asymmetric distributions. We propose to apply robust submatrix completion techniques to matrices consisting of coefficients of basis functions, calculated by projecting the observed trajectories onto a given orthogonal basis set. In this process, we use a composite asymmetric Huber loss function to accommodate domain-specific erratic behaviors in a data-adaptive manner. We further incorporate the \(L_1\) penalty to regularize the smoothness of latent factor curves. The proposed method can also be applied to partially observed functional data, where each trajectory contains individual-specific missing segments. Moreover, since our method does not require estimating the covariance operator, the extension to functional data of any dimension observed over a continuum is straightforward. We demonstrate the empirical performance of the proposed method in estimating the lower-dimensional space and reconstructing trajectories through simulation studies. We then apply the proposed method to two real datasets: one-dimensional Advanced Metering Infrastructure (AMI) data from South Korea, and two-dimensional maximum precipitation spatial data collected in North America and South America. PubDate: 2023-11-08
Abstract: A common goal in network modeling is to uncover the latent community structure present among nodes. For many real-world networks, the true connections consist of events arriving as streams, which are then aggregated to form edges, ignoring the dynamic temporal component. A natural way to take account of these temporal dynamics of interactions is to use point processes as the foundation of network models for community detection. Computational complexity hampers the scalability of such approaches to large sparse networks. To circumvent this challenge, we propose a fast online variational inference algorithm for estimating the latent structure underlying dynamic event arrivals on a network, using continuous-time point process latent network models. We describe this procedure for network models capturing community structure. This structure can be learned as new events are observed on the network, updating the inferred community assignments. We investigate the theoretical properties of such an inference scheme, and provide regret bounds on the loss function of this procedure. The proposed inference procedure is then thoroughly compared, using both simulation studies and real data, to non-online variants. We demonstrate that online inference can obtain comparable performance, in terms of community recovery, to non-online variants, while realising computational gains. Our proposed inference framework can also be readily modified to incorporate other popular network structures. PubDate: 2023-11-08
Abstract: This paper adopts a tool from computational topology, the Euler characteristic curve (ECC) of a sample, to perform one- and two-sample goodness of fit tests. We call our procedure TopoTests. The presented tests work for samples of arbitrary dimension, with power comparable to the state-of-the-art tests in the one-dimensional case. It is demonstrated that the type I error of TopoTests can be controlled and that their type II error vanishes exponentially with increasing sample size. Extensive numerical simulations of TopoTests are conducted to demonstrate their power for samples of various sizes. PubDate: 2023-11-08
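To make the test statistic concrete, here is a brute-force sketch of the ECC of a Vietoris-Rips filtration truncated at dimension 2 (vertices, edges, triangles), so chi(r) = #V - #E(r) + #T(r). A practical TopoTests-style pipeline would use a dedicated topology library and higher-dimensional simplices; this O(n^3) version only illustrates the statistic.

```python
import numpy as np
from itertools import combinations
from scipy.spatial.distance import pdist, squareform

def euler_characteristic_curve(points, radii):
    """Euler characteristic curve of a dimension-2-truncated Rips filtration:
    an edge is born at the pairwise distance, a triangle at the largest of
    its three pairwise distances."""
    D = squareform(pdist(points))
    n = len(points)
    edge_births = D[np.triu_indices(n, k=1)]
    tri_births = np.array([max(D[i, j], D[i, k], D[j, k])
                           for i, j, k in combinations(range(n), 3)])
    return np.array([n - (edge_births <= r).sum() + (tri_births <= r).sum()
                     for r in radii])

rng = np.random.default_rng(0)
sample = rng.standard_normal((60, 2))
ecc = euler_characteristic_curve(sample, radii=np.linspace(0.0, 2.0, 50))
```

A one-sample test would then compare this curve against the (simulated) expected ECC under the null distribution, with a two-sample version comparing the curves of the two samples.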
Abstract: The paper tackles the problem of clustering multiple networks, directed or not, that do not share the same set of vertices, into groups of networks with similar topology. A model-based approach relying on a finite mixture of stochastic block models is proposed. A clustering is obtained by maximizing the integrated classification likelihood criterion. This is done by a hierarchical agglomerative algorithm that starts from singleton clusters and successively merges clusters of networks. As such, a sequence of nested clusterings is computed, which can be represented by a dendrogram providing valuable insights into the collection of networks. Using a Bayesian framework, model selection is performed in an automated way, since the algorithm stops when the best number of clusters is attained. The algorithm is computationally efficient when carefully implemented. The aggregation of clusters requires a means to overcome the label-switching problem of the stochastic block model and to match the block labels of the networks. To address this problem, a new tool is proposed based on a comparison of the graphons of the associated stochastic block models. The clustering approach is assessed on synthetic data. An application to a set of ecological networks illustrates the interpretability of the obtained results. PubDate: 2023-11-07
Abstract: Consider \(\ell_{\alpha}\)-regularized linear regression, also termed Bridge regression. For \(\alpha \in (0,1)\), Bridge regression enjoys several statistical properties of interest, such as sparsity and near-unbiasedness of the estimates (Fan and Li in J Am Stat Assoc 96(456):1348–1360, 2001). However, the main difficulty lies in the non-convex nature of the penalty for these values of \(\alpha\), which makes optimization challenging and usually only a local optimum can be found. To address this issue, Polson et al. (J R Stat Soc B 76(4):713–733, 2013) took a sampling-based fully Bayesian approach to this problem, using the correspondence between the Bridge penalty and a power exponential prior on the regression coefficients. However, their sampling procedure relies on Markov chain Monte Carlo (MCMC) techniques, which are inherently sequential and do not scale to large problem dimensions. Cross-validation approaches are similarly computation-intensive. To this end, our contribution is a novel non-iterative method to fit a Bridge regression model. The main contribution lies in an explicit formula for Stein's unbiased risk estimate for the out-of-sample prediction risk of Bridge regression, which can then be optimized to select the desired tuning parameters, allowing us to completely bypass MCMC as well as computation-intensive cross-validation approaches. Our procedure yields results in a fraction of the computational time of iterative schemes, without any appreciable loss in statistical performance. An R implementation is publicly available online at: https://github.com/loriaJ/Sure-tuned_BridgeRegression. PubDate: 2023-11-07
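The select-by-SURE idea can be illustrated on a case where the formula is classical. The paper's closed-form SURE for Bridge regression is its own contribution and is not reproduced here (see the authors' R package); the sketch below instead uses ridge regression, a linear smoother f(y) = Hy, for which SURE(lambda) = ||y - Hy||^2 + 2*sigma^2*tr(H) - n*sigma^2.

```python
import numpy as np

def sure_ridge(X, y, sigma2, lambdas):
    """Tune a ridge penalty by minimizing Stein's unbiased risk estimate."""
    n, p = X.shape
    XtX, Xty = X.T @ X, X.T @ y
    scores = []
    for lam in lambdas:
        beta_hat = np.linalg.solve(XtX + lam * np.eye(p), Xty)
        fitted = X @ beta_hat
        # effective degrees of freedom = trace of the hat matrix H
        df = np.trace(X @ np.linalg.solve(XtX + lam * np.eye(p), X.T))
        scores.append(((y - fitted) ** 2).sum() + 2 * sigma2 * df - n * sigma2)
    return lambdas[int(np.argmin(scores))], np.array(scores)

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 10))
beta = np.r_[np.ones(3), np.zeros(7)]
y = X @ beta + rng.standard_normal(200)
best_lam, _ = sure_ridge(X, y, sigma2=1.0, lambdas=np.logspace(-2, 3, 30))
```

Since SURE is an unbiased estimate of prediction risk, minimizing it over a grid plays the role that cross-validation or MCMC-based tuning would otherwise play, with a single pass over the grid.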
Abstract: Generalised hyperbolic (GH) processes are a class of stochastic processes used to model the dynamics of a wide range of complex systems that exhibit heavy-tailed behavior, including systems in finance, economics, biology, and physics. In this paper, we present novel simulation methods based on subordination with a generalised inverse Gaussian (GIG) process and a generalised shot-noise representation that involves random thinning of infinite series of decreasing jump sizes. Compared with our previous work on GIG processes, we provide tighter bounds for the construction of rejection sampling ratios, leading to improved acceptance probabilities in simulation. Furthermore, we derive methods for the adaptive determination of the number of points required in the associated random series using concentration inequalities. Residual small jumps are then approximated using an appropriately scaled Brownian motion term with drift. Finally, the rejection sampling steps are made significantly more computationally efficient through the use of squeezing functions based on lower and upper bounds on the Lévy density. Experimental results are presented illustrating the strong performance under various parameter settings and comparing the marginal distribution of the GH paths with exact simulations of GH random variates. The new simulation methodology is made available to researchers through the publication of a Python code repository. PubDate: 2023-11-07
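The "exact simulations of GH random variates" used as the marginal reference can be obtained from the standard normal variance-mean mixture representation: X = mu + beta*W + sqrt(W)*Z with W ~ GIG(lambda, chi = delta^2, psi = alpha^2 - beta^2) and Z standard normal. The sketch below uses scipy's `geninvgauss`, mapping parameterizations via p = lambda, b = sqrt(chi*psi), scale = sqrt(chi/psi); the shot-noise construction of the full GH process is considerably more involved and is not attempted here.

```python
import numpy as np
from scipy.stats import geninvgauss

def sample_gh(n, lam, alpha, beta, delta, mu, seed=0):
    """Exact GH variates via the GIG normal variance-mean mixture
    (requires alpha^2 > beta^2 so that psi > 0)."""
    rng = np.random.default_rng(seed)
    chi, psi = delta**2, alpha**2 - beta**2
    w = geninvgauss.rvs(p=lam, b=np.sqrt(chi * psi),
                        scale=np.sqrt(chi / psi), size=n, random_state=rng)
    return mu + beta * w + np.sqrt(w) * rng.standard_normal(n)

# toy usage: lam = -0.5 gives the normal-inverse Gaussian special case
x = sample_gh(10_000, lam=-0.5, alpha=2.0, beta=0.5, delta=1.0, mu=0.0)
```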
Abstract: Various privacy-preserving frameworks that respect the individual’s privacy in the analysis of data have been developed in recent years. However, available model classes such as simple statistics or generalized linear models lack the flexibility required for a good approximation of the underlying data-generating process in practice. In this paper, we propose an algorithm for a distributed, privacy-preserving, and lossless estimation of generalized additive mixed models (GAMM) using component-wise gradient boosting (CWB). Making use of CWB allows us to reframe the GAMM estimation as a distributed fitting of base learners using the \(L_2\)-loss. In order to account for the heterogeneity of different data location sites, we propose a distributed version of a row-wise tensor product that allows the computation of site-specific (smooth) effects. Our adaptation of CWB preserves all the important properties of the original algorithm, such as unbiased feature selection and the feasibility of fitting models in high-dimensional feature spaces, and yields model estimates equivalent to those of CWB on pooled data. In addition to deriving the equivalence of the two algorithms, we showcase the efficacy of our algorithm on a distributed heart disease data set and compare it with state-of-the-art methods. PubDate: 2023-11-07
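A minimal single-site sketch of component-wise gradient boosting with the L2-loss, using univariate linear fits as base learners (one simple choice; the paper's base learners include tensor-product splines, and its distributed aggregation is omitted). With the L2-loss the negative gradient is exactly the residual vector, which is the property the distributed, lossless reformulation builds on.

```python
import numpy as np

def cwb(X, y, n_iter=500, nu=0.1):
    """Component-wise gradient boosting sketch: each iteration fits every
    base learner to the current residuals and updates only the single
    best-fitting one, shrunk by the learning rate `nu`."""
    n, p = X.shape
    coefs, intercept = np.zeros(p), y.mean()
    resid = y - intercept
    for _ in range(n_iter):
        # least-squares slope of each univariate base learner on the residuals
        slopes = X.T @ resid / (X ** 2).sum(axis=0)
        sse = ((resid[:, None] - X * slopes) ** 2).sum(axis=0)
        j = int(np.argmin(sse))                  # best-fitting component
        coefs[j] += nu * slopes[j]
        resid -= nu * slopes[j] * X[:, j]
    return intercept, coefs

rng = np.random.default_rng(0)
X = rng.standard_normal((300, 20))
y = 3 * X[:, 0] - 2 * X[:, 3] + rng.standard_normal(300)
b0, b = cwb(X, y)       # coefficients concentrate on features 0 and 3
```

Because only base-learner sufficient statistics (here, the per-feature inner products) are needed per iteration, these can be computed site-locally and aggregated, which is what makes the distributed fit match the pooled one.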