Abstract: The Hilbert–Schmidt Independence Criterion (HSIC) has recently been introduced to the field of single-index models to estimate the directions. Compared with other well-established methods, the HSIC-based method requires relatively weak conditions. However, its performance has not yet been studied in the prevalent high-dimensional scenarios, where the number of covariates can be much larger than the sample size. In this article, based on HSIC, we propose to estimate the possibly sparse directions in high-dimensional single-index models through a parameter reformulation. Our approach estimates the subspace of the direction directly and performs variable selection simultaneously. Due to the non-convexity of the objective function and the complexity of the constraints, a majorize-minimize algorithm together with the linearized alternating direction method of multipliers is developed to solve the optimization problem. Since it does not involve the inverse of the covariance matrix, the algorithm naturally handles large-p-small-n scenarios. Through extensive simulation studies and a real data analysis, we show that our proposal is efficient and effective in high-dimensional settings. The \(\texttt{Matlab}\) codes for this method are available online. PubDate: 2024-02-27
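The HSIC at the heart of this approach has a simple empirical form. Below is a minimal, pure-Python sketch of the biased estimator \(\widehat{\mathrm{HSIC}} = \tfrac{1}{n^2}\operatorname{tr}(KHLH)\) with Gaussian kernels; this is the generic dependence measure only, not the authors' high-dimensional estimator, which adds the sparse reformulation and the MM/ADMM steps.

```python
import math

def rbf_gram(xs, sigma=1.0):
    # Gram matrix K_ij = exp(-(x_i - x_j)^2 / (2 sigma^2)).
    n = len(xs)
    return [[math.exp(-((xs[i] - xs[j]) ** 2) / (2 * sigma ** 2))
             for j in range(n)] for i in range(n)]

def hsic(xs, ys, sigma=1.0):
    # Biased empirical HSIC: (1/n^2) tr(K H L H), with H = I - 11'/n.
    n = len(xs)
    K, L = rbf_gram(xs, sigma), rbf_gram(ys, sigma)
    H = [[(1.0 if i == j else 0.0) - 1.0 / n for j in range(n)] for i in range(n)]
    def matmul(A, B):
        return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
                for i in range(n)]
    M = matmul(matmul(K, H), matmul(L, H))
    return sum(M[i][i] for i in range(n)) / n ** 2

xs = [0.1, 0.4, 0.9, 1.3, 2.0, 2.2]
dep = hsic(xs, xs)                                  # y identical to x
indep = hsic(xs, [1.7, 0.2, 2.1, 0.5, 1.1, 0.0])    # unrelated values
```

HSIC is zero for a constant response (the centering matrix annihilates a constant Gram matrix) and positive under dependence, which is what makes it usable as an estimation criterion.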

Abstract: Multivariate Hawkes Processes (MHPs) are a class of point processes that can account for complex temporal dynamics among event sequences. In this work, we study the accuracy and computational efficiency of three classes of algorithms that, while widely used in the context of Bayesian inference, have rarely been applied to MHPs: stochastic gradient expectation-maximization, stochastic gradient variational inference, and stochastic gradient Langevin Monte Carlo. An important contribution of this paper is a novel approximation to the likelihood function that allows us to retain the computational advantages of conjugate settings while reducing the approximation errors associated with boundary effects. The comparisons are based on various simulated scenarios as well as an application to the study of risk dynamics in intraday prices of the Standard & Poor’s 500 index across its 11 sectors. PubDate: 2024-02-27
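As background for the inference methods compared above, a univariate Hawkes process with exponential excitation can be simulated by Ogata's thinning algorithm. A minimal sketch with illustrative parameter values (the multivariate, Bayesian machinery of the paper is not reproduced here):

```python
import math, random

def simulate_hawkes(mu, alpha, beta, horizon, rng):
    # Ogata's thinning for a univariate Hawkes process with intensity
    # lambda(t) = mu + alpha * sum_{t_i < t} exp(-beta * (t - t_i)).
    # Between events the intensity decays, so its value just after the
    # current time is a valid upper bound for the thinning step.
    times, t = [], 0.0
    while True:
        lam_bar = mu + alpha * sum(math.exp(-beta * (t - ti)) for ti in times)
        t += rng.expovariate(lam_bar)          # candidate event time
        if t > horizon:
            return times
        lam_t = mu + alpha * sum(math.exp(-beta * (t - ti)) for ti in times)
        if rng.random() <= lam_t / lam_bar:    # accept with prob lam_t/lam_bar
            times.append(t)

rng = random.Random(0)
times = simulate_hawkes(mu=0.5, alpha=0.8, beta=1.2, horizon=50.0, rng=rng)
```

With \(\alpha/\beta < 1\) the process is stationary; each accepted event raises the intensity, producing the clustered event sequences that MHPs are designed to capture.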

Abstract: Phylogenetic trees are key data objects in biology, and methods for phylogenetic reconstruction are highly developed. The space of phylogenetic trees is a nonpositively curved metric space, and statistical methods that exploit this property to analyze samples of trees are currently being developed. Meanwhile, in Euclidean space, the log-concave maximum likelihood method has emerged as a new nonparametric method for probability density estimation. In this paper, we derive a sufficient condition for the existence and uniqueness of the log-concave maximum likelihood estimator on tree space. We also propose an estimation algorithm for the one- and two-dimensional cases. Since various factors affect the inferred trees, it is difficult to specify the distribution of a sample of trees. The class of log-concave densities is nonparametric, and yet estimation can be conducted by maximum likelihood without selecting hyperparameters. We numerically compare the estimation performance with a previously developed kernel density estimator. In our examples, where the true density is log-concave, we demonstrate that our estimator has a smaller integrated squared error when the sample size is large. We also conduct numerical experiments on clustering using the Expectation-Maximization algorithm and compare the results with k-means++ clustering based on the Fréchet mean. PubDate: 2024-02-23

Abstract: Bootstrap and jackknife estimates, \(T_{n,B}^*\) and \(T_{n,J}\) respectively, of a population parameter \(\theta \) are both used in statistical computations; n is the sample size and B the number of bootstrap samples. For any \(n_0\) and \(B_0\), bootstrap samples add no new information about \(\theta \), being drawn from the original sample, and when \(B_0<\infty \), \(T_{n_0,B_0}^*\) also carries resampling variability, an additional source of uncertainty that does not affect \(T_{n_0, J}\). These facts are neglected in theoretical papers, whose results concern the idealized \(T_{n, \infty }^* \) and do not hold for \(B<\infty \). The consequence is that \(T^*_{n_0, B_0}\) is expected to have a larger mean squared error (MSE) than \(T_{n_0,J}\); that is, \(T_{n_0,B_0}^*\) is inadmissible. The amount of inadmissibility can be very large when population parameters, e.g. the variance, are unbounded and/or with big data. A palliative remedy is to increase B, the larger the better, but the MSE ordering remains unchanged for \(B<\infty \). This is confirmed theoretically when \(\theta \) is a population mean, and is observed in the estimated total MSE for linear regression coefficients; in the latter case, the probability that the estimated total MSE with \(T_{n,B}^*\) improves on that with \(T_{n,J}\) decreases to 0 as B increases. PubDate: 2024-02-22
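The resampling variability the abstract refers to is easy to exhibit numerically. A small illustration for the sample mean, on toy data: the jackknife standard error is deterministic, while two bootstrap runs with the same finite B disagree. This is only the qualitative phenomenon, not the paper's MSE analysis.

```python
import random, statistics

def jackknife_se(xs):
    # Jackknife standard error of the sample mean (no randomness involved).
    n = len(xs)
    loo = [(sum(xs) - x) / (n - 1) for x in xs]   # leave-one-out means
    m = sum(loo) / n
    return ((n - 1) / n * sum((v - m) ** 2 for v in loo)) ** 0.5

def bootstrap_se(xs, B, rng):
    # Bootstrap standard error of the sample mean; for finite B the answer
    # depends on which resamples were drawn -- extra resampling variability.
    means = [statistics.fmean(rng.choices(xs, k=len(xs))) for _ in range(B)]
    return statistics.stdev(means)

xs = [2.1, 3.4, 1.9, 4.2, 2.8, 3.1, 2.5, 3.9]
jk = jackknife_se(xs)
rng = random.Random(1)
b1 = bootstrap_se(xs, 200, rng)   # two runs with the same B ...
b2 = bootstrap_se(xs, 200, rng)   # ... give different answers
```

For the mean, the jackknife standard error coincides exactly with the classical \(s/\sqrt{n}\), which is why it serves as the deterministic benchmark here.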

Abstract: Most scientific publications follow the familiar recipe of (i) obtain data, (ii) fit a model, and (iii) comment on the scientific relevance of the effects of particular covariates in that model. This approach, however, ignores the fact that there may exist a multitude of similarly accurate models in which the implied effects of individual covariates can be vastly different. The problem of finding an entire collection of plausible models has also received relatively little attention in the statistics community, with nearly all proposed methodologies being narrowly tailored to a particular model class and/or requiring an exhaustive search over all possible models, making them largely infeasible in the current big-data era. This work develops the idea of forward stability and proposes a novel, computationally efficient approach to finding collections of accurate models, which we refer to as model path selection (MPS). MPS builds up a plausible model collection via forward selection and is entirely agnostic to the model class and loss function employed. The resulting model collection can be displayed in a simple and intuitive graphical fashion, allowing practitioners to easily visualize whether some covariates can be swapped for others with minimal loss. PubDate: 2024-02-20
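The core idea can be sketched in a few lines: at each forward-selection step, record not just the best covariate but every covariate whose loss is nearly as good. The toy loss below is hypothetical (it makes covariates 0 and 1 carry the same information), and the real MPS procedure adds resampling-based stability on top of this skeleton.

```python
def forward_paths(p, loss, steps, eps=0.05):
    # Greedy forward selection that also records, at each step, every
    # covariate whose loss is within a (1 + eps) factor of the best addition.
    selected, alternatives = [], []
    for _ in range(steps):
        scores = {j: loss(selected + [j]) for j in range(p) if j not in selected}
        best = min(scores.values())
        alternatives.append(sorted(j for j, s in scores.items()
                                   if s <= (1 + eps) * best))
        selected.append(min(scores, key=scores.get))
    return selected, alternatives

def toy_loss(subset):
    # Covariates 0 and 1 are interchangeable; covariate 2 adds its own signal.
    return 1.0 - (0.4 if (0 in subset or 1 in subset) else 0.0) \
               - (0.3 if 2 in subset else 0.0)

selected, alts = forward_paths(4, toy_loss, steps=2)
```

At the first step both 0 and 1 appear as near-best alternatives, which is exactly the "swappable covariates" pattern the MPS graphics are designed to reveal.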

Abstract: In the social sciences, studies are often based on questionnaires asking participants to express ordered responses several times over a study period. We present a model-based clustering algorithm for such longitudinal ordinal data. Assuming that an ordinal variable is the discretization of an underlying latent continuous variable, the model relies on a mixture of matrix-variate normal distributions, accounting simultaneously for within- and between-time dependence structures. The model can thus concurrently capture the heterogeneity, the association among the responses, and the temporal dependence structure. An EM algorithm is developed for parameter estimation, and approaches to deal with some of the arising computational challenges are outlined. An evaluation of the model on synthetic data shows its estimation abilities and its advantages over competitors. A real-world application concerning changes in eating behavior during the Covid-19 pandemic in France is presented. PubDate: 2024-02-17

Abstract: Effective estimation of covariance matrices is crucial for statistical analyses and applications. In this paper, we focus on robust estimation of the covariance matrix for interval-valued data in low and moderately high dimensions. In the low-dimensional scenario, we extend the Minimum Covariance Determinant (MCD) estimator to interval-valued data. We derive an iterative algorithm for computing this estimator, demonstrate its convergence, and establish theoretically that it retains the high breakdown-point property of the MCD estimator. We then propose a projection-based estimator and a regularization-based estimator to extend the MCD estimator to moderately high-dimensional settings. We develop efficient iterative algorithms for computing these two estimators and demonstrate their convergence properties. Extensive simulation studies and real data analysis validate the finite-sample properties of the proposed estimators. PubDate: 2024-02-17
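For intuition about the MCD idea itself (not the interval-valued extension above): in one dimension, the size-h subset with the smallest variance is always a contiguous block of the sorted sample, so the exact MCD reduces to a sliding window. A minimal sketch:

```python
def mcd_1d(xs, h):
    # Univariate Minimum Covariance Determinant: among all size-h subsets,
    # the one with smallest variance is a contiguous block of the sorted
    # sample, so a sliding window over sorted data finds it exactly.
    s = sorted(xs)
    best = None
    for i in range(len(s) - h + 1):
        block = s[i:i + h]
        m = sum(block) / h
        var = sum((x - m) ** 2 for x in block) / h
        if best is None or var < best[0]:
            best = (var, m)
    return best[1], best[0]       # robust location, robust scatter

xs = [1.0, 1.2, 0.9, 1.1, 1.3, 0.8, 1.0, 9.0, 10.0]   # two gross outliers
loc, scat = mcd_1d(xs, h=6)
```

With h just above half the sample, the winning block excludes the outliers, which is the breakdown-point property the paper shows carries over to interval-valued data.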

Abstract: To improve the estimation efficiency of high-dimensional regression problems, penalized regularization is routinely used. However, accurately estimating the model remains challenging in the presence of correlated effects, wherein irrelevant covariates exhibit strong correlation with relevant ones. This situation, referred to as correlated data, poses additional complexities for model estimation. In this paper, we propose the elastic-net multi-step screening procedure (EnMSP), an iterative algorithm designed to recover sparse linear models in the context of correlated data. EnMSP uses a small repeated penalty strategy to identify the truly relevant covariates within a few iterations. Specifically, in each iteration, EnMSP enhances the adaptive lasso method by adding a weighted \(l_2\) penalty, which improves the selection of relevant covariates. The method is shown to select the true model and achieve the \(l_2\)-norm error bound under certain conditions. The effectiveness of EnMSP is demonstrated through numerical comparisons and applications to financial data. PubDate: 2024-02-16
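The building block behind penalties that combine an adaptive \(l_1\) term with a weighted \(l_2\) term is a weighted soft-thresholding update. A minimal sketch of one coordinate step; in an adaptive scheme the weights would come from an initial estimate, and EnMSP's full iteration is more involved than this.

```python
def soft_threshold(z, lam):
    # S(z, lam) = sign(z) * max(|z| - lam, 0).
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0

def weighted_enet_update(z, lam1, lam2, w1=1.0, w2=1.0):
    # One coordinate update for a weighted elastic net: soft-threshold with
    # the (adaptive) l1 weight, then shrink by the weighted l2 ridge term.
    return soft_threshold(z, w1 * lam1) / (1.0 + w2 * lam2)

b1 = weighted_enet_update(2.0, 0.5, 0.0)    # pure lasso step
b2 = weighted_enet_update(0.3, 0.5, 0.0)    # small signal is zeroed out
b3 = weighted_enet_update(-2.0, 0.5, 1.0)   # ridge term shrinks further
```

The \(l_1\) part performs the variable selection (b2 is set exactly to zero), while the \(l_2\) part stabilizes estimates among correlated covariates, which is the motivation for the weighted ridge term in EnMSP.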

Abstract: In this article we consider Bayesian parameter inference for a type of partially observed stochastic Volterra equation (SVE). SVEs arise in many areas, such as physics and mathematical finance; in the latter field they can be used to represent long memory in unobserved volatility processes. In many cases of practical interest, SVEs must be time-discretized, and parameter inference is then based upon the posterior associated with the time-discretized process. Building on recent studies of the time-discretization of SVEs (e.g. Richard et al. in Stoch Proc Appl 141:109–138, 2021), we use Euler–Maruyama methods for this discretization. We then show how multilevel Markov chain Monte Carlo (MCMC) methods (Jasra et al. in SIAM J Sci Comp 40:A887–A902, 2018) can be applied in this context. For the examples we study, we prove that the cost to achieve a mean square error (MSE) of \(\mathcal{O}(\epsilon ^2)\), \(\epsilon >0\), is \(\mathcal{O}(\epsilon ^{-\tfrac{4}{2H+1}})\), where H is the Hurst parameter. If one uses a single-level MCMC method, the cost to achieve the same MSE is \(\mathcal{O}(\epsilon ^{-\tfrac{2(2H+3)}{2H+1}})\). We illustrate these results in the context of state-space and stochastic volatility models, with the latter applied to real data. PubDate: 2024-02-14
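The multilevel coupling underlying such estimators can be illustrated on a plain SDE (an Ornstein–Uhlenbeck process rather than a Volterra equation, for brevity): a fine Euler–Maruyama path and a coarse path with half as many steps, driven by the same Brownian increments. A sketch:

```python
import math, random

def coupled_em(theta, mu, sigma, x0, T, n_fine, rng):
    # Euler-Maruyama for dX = theta*(mu - X) dt + sigma dW on a fine grid,
    # together with a coarse path (half as many steps) driven by the SAME
    # Brownian increments -- the coupling used by multilevel estimators to
    # make fine/coarse path differences small.
    h = T / n_fine
    xf = xc = x0
    for _ in range(n_fine // 2):
        dw1 = rng.gauss(0.0, h ** 0.5)
        dw2 = rng.gauss(0.0, h ** 0.5)
        xf += theta * (mu - xf) * h + sigma * dw1
        xf += theta * (mu - xf) * h + sigma * dw2
        xc += theta * (mu - xc) * (2 * h) + sigma * (dw1 + dw2)
    return xf, xc

# With sigma = 0 the scheme is deterministic and both paths approximate
# x(T) = exp(-T) for theta = 1, mu = 0, x0 = 1; the fine path is closer.
xf, xc = coupled_em(1.0, 0.0, 0.0, 1.0, 1.0, 64, random.Random(0))
```

Multilevel estimators telescope over such level pairs: because the two paths share their noise, the variance of the fine-minus-coarse correction shrinks with the step size, which is where the cost improvement over single-level MCMC comes from.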

Abstract: Massive survival data are increasingly common in many research fields, and subsampling is a practical strategy for analyzing such data. Although optimal subsampling strategies have been developed for Cox models, little has been done for semiparametric accelerated failure time (AFT) models, owing to the challenges posed by non-smooth estimating functions for the regression coefficients. We develop optimal subsampling algorithms for fitting semiparametric AFT models using the least-squares approach. By efficiently estimating the slope matrix of the non-smooth estimating functions with a resampling approach, we construct optimal subsampling probabilities for the observations. For feasible point and interval estimation of the unknown coefficients, we propose a two-step method that draws multiple subsamples in the second stage to correct for the overestimation of the variance under heavy censoring. We validate the performance of our estimators through a simulation study comparing single and multiple subsampling methods, and apply the methods to analyze the survival times of lymphoma patients in the Surveillance, Epidemiology, and End Results program. PubDate: 2024-02-14
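The role of subsampling probabilities can be seen in the simplest setting: draw r of n observations with replacement according to probabilities \(\pi_i\) and reweight each draw by \(1/(r\pi_i)\) so the estimator stays unbiased. This sketch shows the inverse-probability weighting only; the paper's optimal probabilities are derived from the AFT estimating functions, not assumed here.

```python
import random

def subsample_mean(values, probs, r, rng):
    # Hansen-Hurwitz style estimator of the full-data mean from a weighted
    # subsample: each drawn value is reweighted by 1 / (r * probs[i]).
    n = len(values)
    idx = rng.choices(range(n), weights=probs, k=r)
    return sum(values[i] / (r * probs[i]) for i in idx) / n

# Sanity check: with constant data the estimator is exact whatever is drawn.
values = [3.0] * 5
est = subsample_mean(values, [0.2] * 5, r=8, rng=random.Random(7))
```

Choosing the \(\pi_i\) to minimize the variance of such a weighted estimator is what "optimal subsampling" refers to.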

Abstract: This article proposes omnibus portmanteau tests for assessing the adequacy of time series models. The test statistics combine the autocorrelation function of the conditional residuals, the autocorrelation function of the conditional squared residuals, and the cross-correlation function between the residuals and their squares. The maximum likelihood estimator is used to derive the asymptotic distribution of the proposed test statistics under a general class of time series models, including ARMA, GARCH, and other nonlinear structures. An extensive Monte Carlo simulation study shows that the proposed tests successfully control the type I error probability and tend to have more power than competing tests in many scenarios. Two applications to a set of weekly stock returns for 92 companies from the S&P 500 demonstrate the practical use of the proposed tests. PubDate: 2024-02-12
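A classical building block for such tests is the Ljung–Box statistic on a single correlation sequence; the omnibus tests above additionally pool the ACF of squared residuals and the residual–square cross-correlations. A minimal sketch of the classical piece:

```python
import random

def acf(xs, lag):
    # Sample autocorrelation at the given lag.
    n = len(xs)
    m = sum(xs) / n
    c0 = sum((x - m) ** 2 for x in xs) / n
    ck = sum((xs[t] - m) * (xs[t - lag] - m) for t in range(lag, n)) / n
    return ck / c0

def ljung_box(xs, max_lag):
    # Q = n(n+2) * sum_k acf(k)^2 / (n - k); large values signal remaining
    # serial correlation in the residuals.
    n = len(xs)
    return n * (n + 2) * sum(acf(xs, k) ** 2 / (n - k)
                             for k in range(1, max_lag + 1))

alternating = [(-1.0) ** t for t in range(50)]   # extreme serial correlation
rng = random.Random(3)
noise = [rng.gauss(0.0, 1.0) for _ in range(50)] # roughly uncorrelated
q_alt, q_noise = ljung_box(alternating, 3), ljung_box(noise, 3)
```

Under the null of white residuals, Q is asymptotically \(\chi^2\) with max_lag degrees of freedom; the strongly alternating series produces a far larger statistic than the noise.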

Abstract: The problem of best subset selection in linear regression is considered with the aim of finding a fixed-size subset of features that best fits the response. This is particularly challenging when the total number of available features is very large compared to the number of data samples. Existing optimal methods for solving this problem tend to be slow, while fast methods tend to have low accuracy. Ideally, a new method would perform best subset selection faster than existing optimal methods with comparable accuracy, or be more accurate than methods of comparable computational speed. Here, we propose a novel continuous optimization method that identifies a subset solution path: a small set of models of varying size containing candidates for the single best subset of features, which is optimal in a specific sense in linear regression. Our method turns out to be fast, making best subset selection feasible when the number of features is well in excess of thousands. Because of this strong overall performance, framing best subset selection as a continuous optimization problem opens new research directions for feature extraction in a large variety of regression models. PubDate: 2024-02-12

Abstract: Quantile regression (QR) models offer an interesting alternative to ordinary regression models for the response mean. Besides allowing a more appropriate characterization of the response distribution, the former are less sensitive to outlying observations than the latter. Indeed, QR models allow modeling other characteristics of the response distribution, such as the lower and/or upper tails. In the presence of outlying observations, however, the estimates can still be affected. In this context, a robust parametric quantile regression model for bounded responses is developed, based on a new distribution, the Kumaraswamy Rectangular (KR) distribution. The KR model corresponds to a finite mixture structure similar to the Beta Rectangular distribution; that is, the KR distribution has heavier tails than the Kumaraswamy model. Indeed, we show that the corresponding KR quantile regression model is more robust and flexible than the usual Kumaraswamy one. Bayesian inference, including parameter estimation, model fit assessment, model comparison, and influence analysis, is developed through a hybrid MCMC approach. Since the quantile of the KR distribution is not analytically tractable, we model the conditional quantile through a suitable data augmentation scheme. To link both quantiles in terms of a regression structure, a two-step estimation algorithm under a Bayesian approach is proposed to obtain numerical approximations of the respective posterior distributions of the parameters of the regression structure for the KR quantile. This algorithm combines a Markov chain Monte Carlo algorithm with the ordinary least squares approach. Our proposal proved robust against outlying observations in the response while keeping the estimation process simple and without adding much computational complexity. We demonstrated the effectiveness of our estimation method in a simulation study, and two further studies showed the benefits of the proposed model in terms of robustness and flexibility. To illustrate the adequacy of our approach in the presence of outlying observations, we analyzed two data sets of socio-economic indicators from Brazil and compared the results with alternatives. PubDate: 2024-02-10

Abstract: A new data-based smoothing parameter for circular kernel density (and density derivative) estimation is proposed. Following plug-in ideas, the unknown quantities in the optimal smoothing parameter are replaced by suitable estimates. This paper provides a circular version of the well-known Sheather and Jones bandwidths (J R Stat Soc Ser B Stat Methodol 53(3):683–690, 1991. https://doi.org/10.1111/j.2517-6161.1991.tb01857.x), with direct and solve-the-equation plug-in rules. Theoretical support for our developments, related to the asymptotic mean squared error of the estimators of the density, its derivatives, and its functionals for circular data, is provided. The proposed selectors are compared with previous data-based smoothing parameters for circular kernel density estimation. This paper also contributes to the study of the optimal kernel for circular data. The proposed plug-in rules are illustrated using real data on the times of car accidents. PubDate: 2024-02-07
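The circular kernel density estimator that these plug-in rules tune can be written with a von Mises kernel, whose concentration \(\kappa\) plays the role of an inverse bandwidth. A minimal sketch, with a series approximation of the Bessel function \(I_0\); the value of \(\kappa\) here is arbitrary, not a plug-in choice.

```python
import math

def bessel_i0(x, terms=30):
    # Series I0(x) = sum_k (x^2/4)^k / (k!)^2, accurate for moderate x.
    s = term = 1.0
    for k in range(1, terms):
        term *= (x * x / 4.0) / (k * k)
        s += term
    return s

def vonmises_kde(theta, sample, kappa):
    # Circular KDE: average of von Mises kernels centred at the data angles,
    # normalized by 2*pi*I0(kappa) so each kernel integrates to one.
    norm = 2.0 * math.pi * bessel_i0(kappa)
    return sum(math.exp(kappa * math.cos(theta - ti)) for ti in sample) \
        / (len(sample) * norm)

angles = [0.3, 0.5, 0.4, 3.1, 2.9, 0.6]            # angles in radians
grid = [2 * math.pi * j / 400 for j in range(400)]
mass = sum(vonmises_kde(t, angles, kappa=4.0) for t in grid) * 2 * math.pi / 400
```

The estimate is automatically \(2\pi\)-periodic and integrates to one over the circle; the plug-in rules of the paper choose \(\kappa\) from the data instead of fixing it.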

Abstract: Recent advances in Bayesian models for random partitions have led to the formulation and exploration of Exchangeable Sequences of Clusters (ESC) models. Under ESC models, it is the cluster sizes that are exchangeable, rather than the observations themselves. This property is particularly useful for obtaining microclustering behavior, whereby cluster sizes grow sublinearly in the number of observations, as is common in applications such as record linkage, sparse networks, and genomics. Unfortunately, the exchangeable-clusters property comes at the cost of projectivity. As a consequence, in contrast to more traditional Dirichlet process or Pitman–Yor process mixture models, a priori samples from ESC models cannot easily be obtained in a sequential fashion and instead require rejection or importance sampling. In this work, drawing on connections between ESC models and discrete renewal theory, we obtain closed-form expressions for certain ESC models and develop faster methods for generating a priori samples from these models than the existing state of the art. In the process, we establish analytical expressions for the distribution of the number of clusters under ESC models, which was unknown prior to this work. PubDate: 2024-02-07

Abstract: Temporal sequences of satellite images constitute a highly valuable and abundant resource for analyzing regions of interest. However, the automatic acquisition of knowledge on a large scale is a challenging task, owing to factors such as the lack of precisely labeled data, the definition and variability of terrain entities, and the inherent complexity of the images and their fusion. In this context, we present a fully unsupervised and general methodology for building spatio-temporal taxonomies of large regions from sequences of satellite images. Our approach relies on a combination of deep embeddings and time series clustering to capture the semantic properties of the ground and its evolution over time, providing a comprehensive understanding of the region of interest. The proposed method is enhanced by a novel procedure specifically devised to refine the embedding and exploit the underlying spatio-temporal patterns. We use this methodology to conduct an in-depth analysis of a 220 km\(^2\) region in northern Spain under different settings. The results provide a broad and intuitive perspective of the land, in which large areas are connected in a compact and well-structured manner, mainly according to climatic, phytological, and hydrological factors. PubDate: 2024-02-05

Abstract: Multiple-type outcomes are often encountered in statistical applications, where one may want to study the association between multiple responses and determine the covariates useful for prediction. However, the literature on variable selection methods for multiple-type data is arguably underdeveloped. In this article, we develop a novel global–local shrinkage prior for multiple response-type settings, where the observed dataset consists of multiple response types (e.g., continuous, count-valued, Bernoulli trials, etc.), by combining the perspectives of global–local shrinkage and the conjugate multivariate distribution. One benefit of our model is that no transformation or Gaussian approximation of the data is needed to perform variable selection for multiple response-type data, so one avoids computational difficulties and restrictions on the joint distribution of the responses. Another benefit is that it allows one to parsimoniously model cross-variable dependence. Specifically, our method uses basis functions with random effects, which can be given as known covariates or pre-defined basis functions, to model dependence between responses; this dependence can be detected by our proposed sparsity-inducing global–local shrinkage model. We provide connections to the original horseshoe model and to existing basis function models. An efficient block Gibbs sampler is developed, which proves effective in obtaining accurate estimates and variable selection results. We also provide a motivating analysis of the public health and financial costs of natural disasters in the U.S., using data provided by the National Centers for Environmental Information. PubDate: 2024-02-03

Abstract: Compositional data pose challenges to current classification methods owing to the non-negativity and unit-sum constraints, especially when some of the components are zero. In this paper, we develop an effective classification method for multivariate compositional data with some components equal to zero. Specifically, a Kent feature embedding technique is first proposed to transform the compositional data and improve data quality. We then use a support vector machine, a state-of-the-art machine learning model, to build the classifier. The proposed method is shown to be effective through numerical simulations. Results on multiple real datasets, including species classification, day-night image classification, and household consumption pattern recognition, further verify that the proposed method achieves good classification performance and outperforms its competitors. This method helps broaden the practical use of compositional data with zeros in classification tasks. PubDate: 2024-01-31
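The zero-handling advantage of a sphere embedding can be seen with the square-root map, which sends a composition to the unit sphere and is well defined at zero, unlike log-ratio transforms. This is a generic embedding shown only for intuition; the Kent feature embedding of the paper is its own construction.

```python
import math

def sphere_embed(comp):
    # Square-root map: a non-negative, unit-sum composition lands on the unit
    # sphere; components equal to zero pose no problem, whereas log-ratio
    # transforms are undefined there.
    assert all(c >= 0.0 for c in comp) and abs(sum(comp) - 1.0) < 1e-9
    return [math.sqrt(c) for c in comp]

z = sphere_embed([0.5, 0.3, 0.2, 0.0])   # a composition with a zero part
```

Once the data live on the sphere, directional distributions (the Kent distribution among them) and standard classifiers such as SVMs can be applied without special treatment of zeros.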

Abstract: The concepts of sparsity and regularised estimation have proven useful in many high-dimensional statistical applications. Dynamic factor models (DFMs) provide a parsimonious approach to modelling high-dimensional time series; however, it is often hard to interpret the meaning of the latent factors. This paper formally introduces a class of sparse DFMs in which the loading matrices are constrained to have few non-zero entries, thus increasing the interpretability of the factors. We present a regularised M-estimator for the model parameters and construct an efficient expectation maximisation algorithm to enable estimation. Synthetic experiments demonstrate consistency in estimating the loading structure and superior predictive performance where a low-rank factor structure is appropriate. The utility of the method is further illustrated in an application forecasting electricity consumption across a large set of smart meters. PubDate: 2024-01-22

Abstract: The log-logistic regression model is one of the most commonly used accelerated failure time (AFT) models in survival analysis, for which statistical inference methods have mainly been established under the frequentist framework. Recently, Bayesian inference for log-logistic AFT models using Markov chain Monte Carlo (MCMC) techniques has also been widely developed. In this work, we develop an alternative to MCMC methods and infer the parameters of the log-logistic AFT model via a mean-field variational Bayes (VB) algorithm. A piecewise approximation technique is embedded in deriving the VB algorithm to achieve conjugacy. The proposed VB algorithm is evaluated and compared with frequentist and MCMC techniques on simulated data under various scenarios, and a publicly available dataset is employed for illustration. We demonstrate that our VB algorithm consistently produces satisfactory estimates and, in most scenarios, outperforms the likelihood-based method in terms of empirical mean squared error (MSE). Compared to MCMC, the proposed VB achieves similar performance and, in certain scenarios, the lowest MSE, at a significantly reduced computational cost, with an average speedup of roughly 300 times. PubDate: 2024-01-18