Abstract: In this paper, additive models with p-order autoregressive conditional symmetric errors, based on penalized regression splines, are proposed for modeling trend and seasonality in time series. The aim of this approach is to model the autocorrelation and seasonality properly in order to assess the existence of a significant trend. A backfitting iterative process combined with a quasi-Newton algorithm is developed for estimating the additive components, the dispersion parameter and the autocorrelation coefficients. The effective degrees of freedom of the fit are derived from an appropriate smoother. Inferential results and model selection procedures are proposed, as well as diagnostic methods such as residual analysis based on the conditional quantile residual and sensitivity studies based on the local influence approach. Simulation studies are performed to assess the large-sample behavior of the maximum penalized likelihood estimators. Finally, the methodology is applied to model the daily average temperature of the city of San Francisco from January 1995 to April 2020. PubDate: 2021-05-03
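The backfitting step can be illustrated with a minimal sketch, assuming simulated data, a plain running-mean smoother standing in for penalized regression splines, and independent errors (no AR structure); all names and settings below are hypothetical:

```python
import numpy as np

def running_mean_smoother(x, v, k=31):
    """Smooth v against covariate x with a k-point running mean."""
    order = np.argsort(x)
    vs = v[order]
    num = np.convolve(vs, np.ones(k), mode="same")
    den = np.convolve(np.ones_like(vs), np.ones(k), mode="same")
    out = np.empty_like(v)
    out[order] = num / den          # den normalizes the shorter edge windows
    return out

rng = np.random.default_rng(0)
n = 400
x1, x2 = rng.uniform(size=n), rng.uniform(size=n)
f1 = np.sin(2 * np.pi * x1)
f2 = 4 * (x2 - 0.5) ** 2
y = f1 + f2 + rng.normal(scale=0.3, size=n)

mu = y.mean()
g1, g2 = np.zeros(n), np.zeros(n)
for _ in range(10):                 # backfitting: cycle over the components
    g1 = running_mean_smoother(x1, y - mu - g2)
    g1 -= g1.mean()                 # center each component for identifiability
    g2 = running_mean_smoother(x2, y - mu - g1)
    g2 -= g2.mean()

fit = mu + g1 + g2
print(np.corrcoef(fit, f1 + f2)[0, 1])   # close to 1: signal recovered
```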

Abstract: Penalized spline smoothing is a well-established nonparametric regression method that is efficient for one and two covariates. Its extension to more than two covariates is straightforward but suffers from exponentially increasing memory demands and computational complexity, which brings the method to its numerical limit. Penalized spline smoothing with multiple covariates requires solving a large-scale, regularized least-squares problem in which the occurring matrices do not fit into the memory of common computer systems. To overcome this restriction, we introduce a matrix-free implementation of the conjugate gradient method. We further present matrix-free implementations of a simple diagonal preconditioner and of a more advanced geometric multigrid preconditioner to significantly speed up convergence of the conjugate gradient method. All algorithms require a negligible amount of memory and therefore allow for penalized spline smoothing with multiple covariates. Moreover, for arbitrary but fixed covariate dimension, we show grid-independent convergence of the multigrid preconditioner, which is fundamental to achieving algorithmic scalability. PubDate: 2021-04-30
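The matrix-free idea can be sketched in one dimension: the normal-equations operator \(B'B + \lambda D'D\) is applied as a sequence of sparse products and never formed explicitly, and a Jacobi (diagonal) preconditioner is assembled from squared column sums. This is an illustrative sketch with hypothetical toy matrices, not the authors' implementation:

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import LinearOperator, cg

rng = np.random.default_rng(1)
n, m = 2000, 200                     # observations, spline coefficients
# Toy sparse "design" matrix standing in for a B-spline basis
B = sp.random(n, m, density=0.02, random_state=1, format="csr")
D = sp.diags([1.0, -2.0, 1.0], [0, 1, 2], shape=(m - 2, m))  # 2nd differences
lam = 10.0
y = rng.normal(size=n)

def matvec(v):
    # apply (B'B + lam * D'D) v without ever forming the m x m matrix
    return B.T @ (B @ v) + lam * (D.T @ (D @ v))

A = LinearOperator((m, m), matvec=matvec)

# Jacobi preconditioner: the diagonal of B'B + lam * D'D via column sums
diag = np.asarray(B.power(2).sum(axis=0)).ravel() + \
       lam * np.asarray(D.power(2).sum(axis=0)).ravel()
diag[diag == 0] = 1.0
M = LinearOperator((m, m), matvec=lambda v: v / diag)

rhs = B.T @ y
coef, info = cg(A, rhs, M=M)
print(info)   # 0 indicates convergence
```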

Abstract: Given noisy, partial observations of a time-homogeneous, finite state-space Markov chain, conceptually simple, direct statistical inference is available, in theory, via its rate matrix, or infinitesimal generator, \({\mathsf {Q}}\) , since \(\exp ({\mathsf {Q}}t)\) is the transition matrix over time t. However, perhaps because of inadequate tools for matrix exponentiation in programming languages commonly used amongst statisticians, or a belief that the necessary calculations are prohibitively expensive, statistical inference for continuous-time Markov chains with a large but finite state space is typically conducted via particle MCMC or other relatively complex inference schemes. When, as in many applications, \({\mathsf {Q}}\) arises from a reaction network, it is usually sparse. We describe variations on known algorithms which allow fast, robust and accurate evaluation of the product of a non-negative vector with the exponential of a large, sparse rate matrix. Our implementation uses relatively recently developed, efficient linear-algebra tools that take advantage of such sparsity. We demonstrate the straightforward statistical application of the key algorithm on a model for the mixing of two alleles in a population and on the Susceptible-Infectious-Removed epidemic model. PubDate: 2021-04-19
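The key operation, the action of \(\exp ({\mathsf {Q}}t)\) on a probability vector, is available in SciPy as expm_multiply (the Al-Mohy and Higham algorithm for the action of a matrix exponential). A sketch on a hypothetical sparse birth-death generator, not the paper's models:

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import expm_multiply

K = 200
birth = np.full(K, 0.8)                  # up-rate from states 0..K-1
death = np.full(K, 1.0)                  # down-rate from states 1..K
# Sparse tridiagonal rate matrix Q: rows sum to zero
Q = sp.diags(
    [death, -(np.r_[birth, 0.0] + np.r_[0.0, death]), birth],
    offsets=[-1, 0, 1], format="csr",
)

p0 = np.zeros(K + 1)
p0[50] = 1.0                             # chain starts in state 50
t = 2.0
# p_t = p_0 exp(Qt): a column operation on the transposed generator
pt = expm_multiply(t * Q.T, p0)
print(pt.sum())                          # probabilities still sum to 1
```

Because deaths outrun births here, the distribution drifts toward lower states while remaining a proper probability vector.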

Abstract: The maximal correlation is an attractive measure of dependence between the components of a random vector; however, it has the drawback of being difficult to calculate. Here, we consider the case of bivariate vectors whose components are order statistics from discrete distributions supported on \(N\ge 2\) points. Except for the case \(N=2\) , the maximal correlation does not have a closed form, so we propose the use of a gradient-based optimization method. The gradient vector of the objective function, the correlation coefficient of pairs of order statistics, can be extraordinarily complicated, and for that reason an automatic differentiation algorithm is proposed. PubDate: 2021-04-12
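The optimization problem can be sketched on a toy joint pmf: parametrize the transformations f and g by their values on the support and maximize the resulting correlation numerically. Here SciPy's BFGS with numerical gradients stands in for the authors' automatic-differentiation scheme, and the example pair is hypothetical (not restricted to order statistics):

```python
import numpy as np
from scipy.optimize import minimize

# Toy joint pmf of (X, Y) with X uniform on {-1, 0, 1} and Y = X**2:
# the Pearson correlation is 0, but the maximal correlation is 1.
P = np.array([[0.0, 1/3],    # rows: x in {-1, 0, 1}; cols: y in {0, 1}
              [1/3, 0.0],
              [0.0, 1/3]])
px, py = P.sum(axis=1), P.sum(axis=0)

def corr_fg(theta):
    """Correlation of f(X) and g(Y), with f, g given by value vectors a, b."""
    a, b = theta[:3], theta[3:]
    mf, mg = a @ px, b @ py
    vf, vg = ((a - mf) ** 2) @ px, ((b - mg) ** 2) @ py
    cov = (a - mf) @ P @ (b - mg)
    return cov / np.sqrt(vf * vg + 1e-12)   # guard against degeneracy

res = minimize(lambda th: -corr_fg(th), x0=[1.0, 0.1, 0.9, 0.0, 1.0])
print(-res.fun)   # estimated maximal correlation, here close to 1
```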

Abstract: Missing data reconstruction is a critical step in the analysis and mining of spatio-temporal data. However, few studies comprehensively consider missing data patterns, sample selection and spatio-temporal relationships. To take into account the uncertainty in the point forecast, prediction intervals may be of interest. In particular, for (possibly long) missing sequences of consecutive time points, joint prediction regions are desirable. In this paper we propose a bootstrap resampling scheme to construct joint prediction regions that approximately contain missing paths of the time component in a spatio-temporal framework, with global probability \(1-\alpha \) . In many applications, requiring coverage of the whole missing sample path might appear too restrictive. To obtain more informative inference, we also derive smaller joint prediction regions that contain all but a small number k of the elements of the missing paths with probability \(1-\alpha \) . A simulation experiment is performed to validate the empirical performance of the proposed joint bootstrap prediction and to compare it with some alternative procedures based on a simple nominal coverage correction, loosely inspired by the Bonferroni approach, which are expected to work well in standard scenarios. PubDate: 2021-04-05
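The joint-region construction can be sketched as follows: given bootstrap replicates of the missing path, shrink a single pointwise quantile level until the whole path lies inside the band for at least a fraction \(1-\alpha \) of the replicates. The AR(1)-style replicates below are a hypothetical stand-in for the paper's bootstrap scheme:

```python
import numpy as np

rng = np.random.default_rng(2)
B, H, alpha = 2000, 10, 0.10
# Stand-in for B bootstrap replicates of a missing length-H path
paths = np.empty((B, H))
x = np.zeros(B)
for h in range(H):
    x = 0.7 * x + rng.normal(size=B)
    paths[:, h] = x

def joint_region(paths, alpha):
    """Shrink the pointwise level until >= 1-alpha of paths are fully inside."""
    for beta in np.linspace(alpha, alpha / 100, 200):
        lo = np.quantile(paths, beta / 2, axis=0)
        hi = np.quantile(paths, 1 - beta / 2, axis=0)
        inside = np.all((paths >= lo) & (paths <= hi), axis=1)
        if inside.mean() >= 1 - alpha:
            return lo, hi
    return lo, hi

lo, hi = joint_region(paths, alpha)
# In-sample check: fraction of replicate paths entirely inside the band
covered = np.all((paths >= lo) & (paths <= hi), axis=1).mean()
print(covered)   # at least 1 - alpha by construction
```

The Bonferroni-style alternative mentioned in the abstract would instead fix the pointwise level at \(\alpha /H\), which is simpler but typically more conservative.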

Abstract: Sparse convex clustering simultaneously groups observations and conducts variable selection in the framework of convex clustering. Although a weighted \(L_1\) norm is usually employed for the regularization term in sparse convex clustering, its use increases the dependence on the data and reduces the estimation accuracy if the sample size is not sufficient. To tackle these problems, this paper proposes a Bayesian sparse convex clustering method based on the ideas of the Bayesian lasso and global-local shrinkage priors. We introduce Gibbs sampling algorithms for our method using scale mixtures of normal distributions. The effectiveness of the proposed methods is shown in simulation studies and a real data analysis. PubDate: 2021-04-05
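The scale-mixture-of-normals device can be illustrated with the classical Bayesian lasso Gibbs sampler for linear regression (Park and Casella); the convex-clustering likelihood of the paper would replace the regression likelihood while keeping analogous mixture updates. Data and settings below are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 100, 5
beta_true = np.array([3.0, 0.0, 0.0, 1.5, 0.0])
X = rng.normal(size=(n, p))
y = X @ beta_true + rng.normal(size=n)

lam, sig2 = 1.0, 1.0
tau2 = np.ones(p)
draws = []
for it in range(1200):
    # beta | rest ~ N(A^{-1} X'y, sig2 * A^{-1}),  A = X'X + diag(1/tau2)
    A_inv = np.linalg.inv(X.T @ X + np.diag(1.0 / tau2))
    beta = rng.multivariate_normal(A_inv @ X.T @ y, sig2 * A_inv)
    # sig2 | rest ~ Inverse-Gamma
    resid = y - X @ beta
    rate = 0.5 * (resid @ resid + beta @ (beta / tau2))
    sig2 = 1.0 / rng.gamma((n - 1 + p) / 2.0, 1.0 / rate)
    # 1/tau2_j | rest ~ Inverse-Gaussian (numpy's 'wald' distribution)
    mu = np.sqrt(lam**2 * sig2 / beta**2)
    tau2 = 1.0 / rng.wald(mu, lam**2)
    if it >= 200:                        # discard burn-in
        draws.append(beta)

post_mean = np.mean(draws, axis=0)
print(post_mean.round(2))   # null coefficients are shrunk toward 0
```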

Abstract: In this paper, a highly effective Bayesian sampling algorithm based on auxiliary variables is used to estimate the graded response model with non-ignorable missing response data. Compared with the traditional marginal likelihood method and other Bayesian algorithms, the advantages of the new algorithm are discussed in detail. Based on the Markov chain Monte Carlo samples from the posterior distributions, the deviance information criterion and the logarithm of the pseudo-marginal likelihood are employed to compare the different missing-mechanism models. Two simulation studies are conducted and a detailed analysis of the sexual compulsivity scale data is carried out to further illustrate the proposed methodology. PubDate: 2021-04-02

Abstract: We propose a fast Newton algorithm for \(\ell _0\) regularized high-dimensional generalized linear models based on support detection and root finding. We refer to the proposed method as GSDAR. GSDAR is developed based on the KKT conditions for \(\ell _0\) -penalized maximum likelihood estimators and generates a sequence of solutions of the KKT system iteratively. We show that GSDAR can be equivalently formulated as a generalized Newton algorithm. Under a restricted invertibility condition on the likelihood function and a sparsity condition on the regression coefficient, we establish an explicit upper bound on the estimation errors of the solution sequence generated by GSDAR in supremum norm and show that it achieves the optimal order in finite iterations with high probability. Moreover, we show that the oracle estimator can be recovered with high probability if the target signal is above the detectable level. These results directly concern the solution sequence generated from the GSDAR algorithm, instead of a theoretically defined global solution. We conduct simulations and real data analysis to illustrate the effectiveness of the proposed method. PubDate: 2021-03-29
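The support-detection-and-root-finding iteration can be sketched for the linear model (a stand-in for the generalized linear models of the paper): alternate between detecting the active set from the primal-dual quantity \(\beta + d\) and solving exactly on that set. All data and settings are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(4)
n, p, T = 100, 200, 3                    # T = assumed sparsity level
beta_true = np.zeros(p)
beta_true[[5, 50, 120]] = [4.0, -3.0, 5.0]
X = rng.normal(size=(n, p)) / np.sqrt(n)  # roughly unit-norm columns
y = X @ beta_true + 0.1 * rng.normal(size=n)

beta = np.zeros(p)
for _ in range(20):
    d = X.T @ (y - X @ beta)                      # dual (gradient) variable
    support = np.argsort(np.abs(beta + d))[-T:]   # support detection
    beta = np.zeros(p)
    # root finding: solve the KKT system exactly on the detected support
    beta[support] = np.linalg.lstsq(X[:, support], y, rcond=None)[0]

print(sorted(int(j) for j in support))   # recovered support
```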

Abstract: Mixed-Poisson distributions have been used in many fields for modeling over-dispersed count data. To open a new opportunity in modeling over-dispersed count data, we introduce a new mixed-Poisson distribution using the generalized Lindley distribution as a mixing distribution. The moment and probability generating functions, factorial moments, and skewness and kurtosis measures are derived. Using the mean-parametrized version of the proposed distribution, we introduce a new count regression model that is appropriate for over-dispersed counts. Healthcare data sets are analyzed employing the new count regression model. We conclude that the new regression model works well in the case of over-dispersion. PubDate: 2021-03-28
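The mixed-Poisson construction can be sketched by simulation, here with the standard one-parameter Lindley distribution as a simpler stand-in for the generalized version, using the fact that Lindley(\(\theta \)) is a two-component gamma mixture:

```python
import numpy as np

rng = np.random.default_rng(5)
theta, n = 1.0, 100_000

# Lindley(theta) is a mixture: Gamma(1, theta) w.p. theta/(theta+1),
# Gamma(2, theta) w.p. 1/(theta+1)   (shape-rate parametrization)
shape = np.where(rng.uniform(size=n) < theta / (theta + 1), 1.0, 2.0)
lam = rng.gamma(shape, 1.0 / theta)      # mixing draws for the Poisson mean

counts = rng.poisson(lam)                # mixed-Poisson (Poisson-Lindley) draws
print(counts.mean(), counts.var())       # variance exceeds mean: over-dispersion
```

For \(\theta =1\) the mean is 1.5 and the variance is about 3.25, so the variance-to-mean ratio is roughly 2.2, illustrating why such mixtures suit over-dispersed counts.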

Abstract: Survival data including potentially cured subjects are common in clinical studies, and mixture cure rate models are often used for analysis. The non-cured probabilities are often predicted from nonparametric, high-dimensional, or even unstructured (e.g. image) predictors, which is a challenging task for traditional nonparametric methods such as splines and local kernels. We propose to use a neural network to model the effect of the nonparametric or unstructured predictors in cure rate models, and retain the proportional hazards structure for its explanatory ability. We estimate the parameters by the Expectation–Maximization algorithm. The estimators are shown to be consistent. Simulation studies show good performance in both prediction and estimation. Finally, we analyze Open Access Series of Imaging Studies data to illustrate the practical use of our methods. PubDate: 2021-03-27

Abstract: Statisticians are frequently confronted with highly complex data such as clustered data, missing data or censored data. In this manuscript, we consider hierarchically clustered survival data. This type of data arises when a sample consists of clusters, and each cluster has several correlated sub-clusters containing various dependent survival times. Two approaches are commonly used to analyze such data and estimate the association between the survival times within a cluster and/or sub-cluster. The first approach uses random effects in a frailty model, while the second uses copula models. Here we assume that the joint survival function is described by a copula function evaluated in the marginal survival functions of the different individuals within a cluster. In this manuscript, we introduce a copula model based on a nested Archimedean copula function for hierarchical survival data, where both the clusters and sub-clusters are allowed to be moderate to large and varying in size. We investigate one-stage, two-stage and three-stage parametric estimation procedures for the association parameters in this model. In a simulation study we examine the finite-sample properties of these estimators. Furthermore, we illustrate the methods on a real-life data set on Chronic Granulomatous Disease. PubDate: 2021-03-26
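The Archimedean building block can be sketched with a bivariate Clayton copula, the basic ingredient of nested Archimedean constructions, together with a simple moment-type stage-wise estimate obtained by inverting Kendall's tau. This is a hypothetical illustration, not the paper's nested model or its likelihood-based estimators:

```python
import numpy as np
from scipy.stats import kendalltau

rng = np.random.default_rng(6)
theta, n = 2.0, 2000

# Conditional-inversion sampling from a bivariate Clayton(theta) copula
u1 = rng.uniform(size=n)
w = rng.uniform(size=n)
u2 = (u1 ** (-theta) * (w ** (-theta / (theta + 1)) - 1) + 1) ** (-1 / theta)

# Stage-wise moment estimate: invert Kendall's tau = theta / (theta + 2)
tau_hat, _ = kendalltau(u1, u2)
theta_hat = 2 * tau_hat / (1 - tau_hat)
print(theta_hat)   # close to the true value 2
```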

Abstract: We consider the problem of sample degeneracy in Approximate Bayesian Computation. It arises when proposed values of the parameters, once given as input to the generative model, rarely lead to simulations resembling the observed data and are hence discarded. Such “poor” parameter proposals do not contribute at all to the representation of the parameter’s posterior distribution. This leads to a very large number of required simulations and/or a waste of computational resources, as well as to distortions in the computed posterior distribution. To mitigate this problem, we propose an algorithm, referred to as the Large Deviations Weighted Approximate Bayesian Computation algorithm, where, via Sanov’s Theorem, strictly positive weights are computed for all proposed parameters, thus avoiding the rejection step altogether. In order to derive a computable asymptotic approximation from Sanov’s result, we adopt the information theoretic “method of types” formulation of the method of Large Deviations, thus restricting our attention to models for i.i.d. discrete random variables. Finally, we experimentally evaluate our method through a proof-of-concept implementation. PubDate: 2021-03-22
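The weighting idea can be sketched on a Bernoulli toy problem: by Sanov's Theorem, the probability that a sample from the model with parameter \(\theta \) produces the observed empirical type is approximately \(\exp (-n\,\mathrm{KL}(\hat{p}\,\Vert \,p_\theta ))\), so every proposal receives a strictly positive weight. (For i.i.d. data the entropy term is constant in \(\theta \), so these weights are proportional to the likelihood of the observed type.) The grid of proposals below is a hypothetical stand-in for prior draws:

```python
import numpy as np

# Observed i.i.d. binary data: 30 ones out of n = 100
n, ones = 100, 30
p_hat = np.array([1 - ones / n, ones / n])     # observed empirical "type"

def kl(p, q):
    mask = p > 0                                # convention: 0 * log 0 = 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

# Proposed parameters (a grid stands in for uniform prior draws)
thetas = np.linspace(0.01, 0.99, 99)
weights = np.array([np.exp(-n * kl(p_hat, np.array([1 - t, t])))
                    for t in thetas])           # no proposal is rejected

post_mean = np.sum(thetas * weights) / np.sum(weights)
print(post_mean)   # close to the exact posterior mean 31/102 = 0.304
```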

Abstract: The article presents an algorithm for fast and error-free determination of statistics such as the arithmetic mean and variance of all contiguous subsequences, and of fixed-length contiguous subsequences, for a sequence of industrial measurement data. Additionally, it shows that both floating-point and integer representations can be used to perform this kind of statistical calculation. The author proves a theorem on the number of bits of precision that an arithmetic type must have to guarantee error-free determination of the arithmetic mean and variance. The article also presents the extension of Welford's formula for determining variance to the sliding-window method, that is, determining the variance of fixed-length contiguous subsequences. The section dedicated to implementation tests shows the running times of the individual algorithms depending on the arithmetic type used. The research shows that using integers in the calculations makes the determination of the aforementioned statistics much faster. PubDate: 2021-03-19
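One way to write such a sliding-window Welford update is sketched below: for a fixed window length k, the sum of squared deviations can be updated in O(1) when one point enters and one leaves. This is a floating-point Python sketch; the paper's integer-arithmetic formulation differs:

```python
import numpy as np

def sliding_var(x, k):
    """Population variance of every length-k window via a Welford-style update."""
    x = np.asarray(x, dtype=float)
    mean = x[:k].mean()
    m2 = np.sum((x[:k] - mean) ** 2)        # sum of squared deviations
    out = [m2 / k]
    for i in range(k, len(x)):
        old, new = x[i - k], x[i]
        new_mean = mean + (new - old) / k
        # Welford-type O(1) update for a fixed-length window
        m2 += (new - old) * (new - new_mean + old - mean)
        mean = new_mean
        out.append(m2 / k)
    return np.array(out)

data = np.random.default_rng(7).normal(size=500)
ref = np.array([data[i:i + 30].var() for i in range(len(data) - 29)])
print(np.allclose(sliding_var(data, 30), ref))   # True
```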

Abstract: Accident taxonomy or classification can be used to direct the attention of policymakers to specific concerns in traffic safety, and can subsequently bring about effective regulatory change. Despite the widespread usage of accident taxonomy for general motor vehicle crashes, its use for analyzing bus crashes is limited. We apply a two-stage clustering-based approach based on self-organizing maps followed by neural gas clustering to construct a data-driven taxonomy of bus crashes. Using the 2005–2015 data from general estimates system, we identify four clusters and expose the qualitative traits that characterize four distinct types of bus crash. Our analysis suggests that cluster characteristics are largely stable over time. Consequently, we make targeted policy recommendations for each of the four subtypes of bus crash. PubDate: 2021-03-16

Abstract: The application of spatial Cliff–Ord models requires information about spatial coordinates of statistical units to be reliable, which is usually the case in the context of areal data. With micro-geographic point-level data, however, such information is inevitably affected by locational errors, that can be generated intentionally by the data producer for privacy protection or can be due to inaccuracy of the geocoding procedures. This unfortunate circumstance can potentially limit the use of the spatial autoregressive modelling framework for the analysis of micro data, as the presence of locational errors may have a non-negligible impact on the estimates of model parameters. This contribution aims at developing a strategy to reduce the bias and produce more reliable inference for spatial models with location errors. The proposed estimation strategy models both the spatial stochastic process and the coarsening mechanism by means of a marked point process. The model is fitted through the maximisation of a doubly-marginalised likelihood function of the marked point process, which cleans out the effects of coarsening. The validity of the proposed approach is assessed by means of a Monte Carlo simulation study under different real-case scenarios, whereas it is applied to real data on house prices. PubDate: 2021-03-13

Abstract: We consider versions of the Metropolis algorithm which avoid the inefficiency of rejections. We first illustrate that a natural Uniform Selection algorithm might not converge to the correct distribution. We then analyse the use of Markov jump chains which avoid successive repetitions of the same state. After exploring the properties of jump chains, we show how they can exploit parallelism in computer hardware to produce more efficient samples. We apply our results to the Metropolis algorithm, to Parallel Tempering, to a Bayesian model, to a two-dimensional ferromagnetic 4 \(\times \) 4 Ising model, and to a pseudo-marginal MCMC algorithm. PubDate: 2021-03-13
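The jump-chain idea can be sketched on a small discrete target: the chain never repeats its current state, and each visited state is weighted by its expected (geometric) holding time \(1/\alpha (x)\), where \(\alpha (x)\) is the total acceptance probability at x. The target and proposal below are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(8)
pi = np.array([1.0, 2.0, 3.0, 2.0, 1.0])      # unnormalized target on {0..4}
pi = pi / pi.sum()
K = len(pi)

def accept_probs(x):
    """q(x, y) * min(1, pi(y)/pi(x)) for the +/-1 random-walk proposal."""
    probs = np.zeros(K)
    for y in (x - 1, x + 1):
        if 0 <= y < K:
            probs[y] = 0.5 * min(1.0, pi[y] / pi[x])
    return probs

x, num, den = 2, 0.0, 0.0
for _ in range(50_000):
    a = accept_probs(x)
    alpha = a.sum()                  # total acceptance probability at x
    w = 1.0 / alpha                  # expected holding time = weight
    num += w * x                     # weighted estimate of E[X]
    den += w
    x = rng.choice(K, p=a / alpha)   # jump chain: move to an accepted state
print(num / den)                     # close to the target mean 2.0
```

Using the expected holding time rather than simulating geometric repeats is the Rao-Blackwellized variant of the estimator; by symmetry of the target, the true mean here is exactly 2.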

Abstract: In this paper we consider mixture generalized autoregressive conditional heteroskedastic models, and propose a new EM-type iterative algorithm for the estimation of the model parameters. The maximum likelihood estimates are shown to be consistent, and their asymptotic properties are investigated. More precisely, we derive simple closed-form expressions for the asymptotic covariance matrix and the expected Fisher information matrix of the ML estimator. Finally, we study model selection and propose testing procedures. A simulation study and an application to real financial series illustrate the results. PubDate: 2021-03-12

Abstract: This paper proposes an extension of principal component analysis to non-stationary multivariate time series data. A criterion for determining the number of finally retained components is proposed. An advance correlation matrix is developed to evaluate dynamic relationships among the chosen components. The theoretical properties of the proposed method are given. Extensive simulation experiments show that our approach performs well on both stationary and non-stationary data. Real data examples are also presented as illustrations. We develop four packages in the statistical software R that contain the functions needed to obtain and assess the results of the proposed method. PubDate: 2021-03-07
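The component-retention step can be sketched with classical PCA and a cumulative-variance criterion; this is a generic NumPy stand-in for the paper's criterion and R packages, on hypothetical factor-driven series:

```python
import numpy as np

rng = np.random.default_rng(9)
# Hypothetical 3-factor multivariate series: 10 channels, 3 latent signals
T, d = 500, 10
t = np.arange(T)
factors = np.c_[np.sin(t / 20),
                np.cos(t / 35),
                rng.normal(size=T).cumsum() / 30]   # a non-stationary factor
loadings = rng.normal(size=(3, d))
X = factors @ loadings + 0.1 * rng.normal(size=(T, d))

Xc = X - X.mean(axis=0)
cov = Xc.T @ Xc / (T - 1)
evals, evecs = np.linalg.eigh(cov)
evals, evecs = evals[::-1], evecs[:, ::-1]          # descending order

explained = np.cumsum(evals) / np.sum(evals)
r = int(np.searchsorted(explained, 0.95)) + 1       # retain 95% of variance
scores = Xc @ evecs[:, :r]                          # retained component series
print(r)   # a small number of components captures the 3-factor structure
```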

Abstract: We propose to extend CART to bivariate marked point processes, providing a segmentation of the space into areas that are homogeneous in terms of the interaction between marks. While the usual CART tree considers the marginal distribution of the response variable at each node, the proposed algorithm, SpatCART, takes into account the spatial location of the observations in the splitting criterion. We introduce a dissimilarity index based on Ripley's intertype K-function, quantifying the interaction between two populations. This index, used for the growing step of the CART strategy, leads to a heterogeneity function consistent with the original CART algorithm. The new variant is therefore a way to explore spatial data as a bivariate marked point process using binary classification trees. The proposed procedure is implemented in an R package and illustrated on simulated examples. SpatCART is finally applied to a tropical forest example. PubDate: 2021-03-04
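The intertype K-function underlying the dissimilarity index can be sketched with a naive estimator (no edge correction) on hypothetical point patterns; under independence of the two types, \(K_{12}(r)\approx \pi r^2\):

```python
import numpy as np
from scipy.spatial.distance import cdist

rng = np.random.default_rng(10)
n1 = n2 = 500
pts1 = rng.uniform(size=(n1, 2))     # type-1 points in the unit square
pts2 = rng.uniform(size=(n2, 2))     # type-2 points in the unit square

def k12_naive(p1, p2, r, area=1.0):
    """Intertype K: scaled count of type-2 points within r of type-1 points
    (no edge correction, so slightly biased downward near the boundary)."""
    d = cdist(p1, p2)
    return area * np.sum(d <= r) / (len(p1) * len(p2))

r = 0.1
print(k12_naive(pts1, pts2, r), np.pi * r**2)   # comparable under independence
```

In SpatCART this quantity is compared between candidate child regions; a large discrepancy from the independence benchmark signals interaction between the two populations.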

Abstract: In the original publication of the article, the corrections in Eq. (13) were missed, in which 2v − 1 was changed to 2v in the exponent. PubDate: 2021-03-01 DOI: 10.1007/s00180-020-01035-6