Computational Statistics
Journal Prestige (SJR): 0.803 | Citation Impact (CiteScore): 1 | Followers: 15
Hybrid journal (may contain Open Access articles)
ISSN (Print) 0943-4062 | ISSN (Online) 1613-9658
Published by Springer-Verlag
• An extended approach for the generalized powered uniform distribution

Abstract: A new uniform distribution model, the generalized powered uniform distribution (GPUD), is introduced in this paper. It is based on incorporating a parameter k into the probability density function (pdf) associated with the power of the random variable values, and it includes a powered mean operator. From this new model, the shape properties of the pdf as well as the higher-order moments, the moment generating function, a scheme for simulating the GPUD and other important statistics can be derived. This approach allows the generalization of the distribution presented by Jayakumar and Sankaran (2016) through the new $$\text{GPUD}_{(J\text{-}S)}$$ distribution. Two sets of real data, related to COVID-19 and bladder cancer, were used to demonstrate the proposed model’s potential. The maximum likelihood method was used to calculate the parameter estimates by applying the maxLik package in R. The results showed that this new model is more flexible and useful than other comparable models.
PubDate: 2022-11-16
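The abstract does not reproduce the GPUD density itself, so as a hedged stand-in the sketch below fits the classical power-function distribution f(x; k) = k x^(k-1) on (0, 1) by maximum likelihood, mirroring the maxLik-style workflow mentioned above (all names and the choice of distribution are illustrative, not the paper's):

```python
import math
import random

def power_pdf(x, k):
    """pdf of the power-function distribution on (0, 1): f(x; k) = k * x^(k-1)."""
    return k * x ** (k - 1)

def mle_k(sample):
    """Closed-form MLE for k: k_hat = -n / sum(log x_i)."""
    return -len(sample) / sum(math.log(x) for x in sample)

random.seed(42)
k_true = 3.0
# inverse-CDF sampling: if U ~ Uniform(0,1) then U^(1/k) follows the power pdf
sample = [random.random() ** (1 / k_true) for _ in range(5000)]
k_hat = mle_k(sample)
```

With 5000 draws the estimate lands close to the true k = 3.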

• Robust estimation for the one-parameter exponential family integer-valued
GARCH(1,1) models based on a modified Tukey’s biweight function

Abstract: In this paper, we study a robust estimation method for observation-driven integer-valued time series models whose conditional distribution belongs to the one-parameter exponential family. The maximum likelihood estimator (MLE) is commonly used to estimate the parameters, but it is highly affected by outliers. We resort to the Mallows’ quasi-likelihood estimator based on a modified Tukey’s biweight function as a robust estimator and establish its existence, uniqueness, consistency and asymptotic normality under some regularity conditions. Simulation results illustrate the better performance of the new estimator compared with the MLE. An application to two real data sets is presented, together with a comparison against other existing robust estimators.
PubDate: 2022-11-09
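Tukey's biweight (bisquare) function referenced above has a standard form; the paper uses a modified version not reproduced in the abstract, so the sketch below shows only the classical definition, whose bounded, redescending influence is what caps the effect of outliers:

```python
def tukey_biweight_rho(u, c=4.685):
    """Tukey's biweight (bisquare) loss: bounded, so large residuals add at most c^2/6."""
    if abs(u) <= c:
        return (c ** 2 / 6) * (1 - (1 - (u / c) ** 2) ** 3)
    return c ** 2 / 6

def tukey_biweight_psi(u, c=4.685):
    """Influence function (d rho / d u): roughly linear near 0, exactly 0 for |u| > c."""
    if abs(u) <= c:
        return u * (1 - (u / c) ** 2) ** 2
    return 0.0
```

An outlier with residual 10 contributes zero influence, while a small residual behaves almost like least squares.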

• The computing of the Poisson multinomial distribution and applications in
ecological inference and machine learning

Abstract: The Poisson multinomial distribution (PMD) describes the distribution of the sum of n independent but non-identically distributed random vectors, in which each random vector is of length m with 0/1-valued elements and only one of its elements can take value 1, with a certain probability. Those probabilities differ across the m elements and the n random vectors, and form an $$n \times m$$ matrix with row sums equal to 1. We call this $$n\times m$$ matrix the success probability matrix (SPM). Each SPM uniquely defines a PMD. The PMD is useful in many areas such as voting theory, ecological inference, and machine learning. The distribution functions of the PMD, however, are usually difficult to compute, and no efficient algorithm has been available for computing them. In this paper, we develop efficient methods to compute the probability mass function (pmf) of the PMD using the multivariate Fourier transform, normal approximation, and simulations. We study the accuracy and efficiency of those methods and give recommendations on which methods to use under various scenarios. We also illustrate the use of the PMD via three applications, namely ecological inference, uncertainty quantification in classification, and voting probability calculation. We build an R package that implements the proposed methods and illustrate the package with examples. This paper has online supplementary materials.
PubDate: 2022-11-04
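For intuition, the m = 2 case of the PMD reduces to the Poisson binomial distribution (sum of independent, non-identical Bernoulli trials), whose pmf can be computed exactly by dynamic programming; the paper's multivariate-Fourier method for general m is not shown here:

```python
def poisson_binomial_pmf(probs):
    """Exact pmf of sum of independent Bernoulli(p_i) variables, the m = 2
    case of the PMD, via O(n^2) dynamic programming over the running sum."""
    pmf = [1.0]                      # pmf of the empty sum: P(S = 0) = 1
    for p in probs:
        new = [0.0] * (len(pmf) + 1)
        for k, mass in enumerate(pmf):
            new[k] += mass * (1 - p)   # this trial fails
            new[k + 1] += mass * p     # this trial succeeds
        pmf = new
    return pmf

pmf = poisson_binomial_pmf([0.2, 0.5, 0.9])
```

Here `pmf[k]` is P(S = k); for instance P(S = 0) = 0.8 * 0.5 * 0.1 = 0.04.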

• Feature selection algorithms in generalized additive models under
concurvity

Abstract: In this paper, the properties of 10 different feature selection algorithms for generalized additive models (GAMs) are compared on one simulated and two real-world datasets under concurvity. Concurvity can be interpreted as a redundancy in the feature set of a GAM. Like multicollinearity in linear models, concurvity causes unstable parameter estimates in GAMs and makes the marginal effects of features harder to interpret. Feature selection algorithms for GAMs can be separated into four clusters: stepwise, boosting, regularization and concurvity-controlled methods. Our numerical results show that algorithms with no constraints on concurvity tend to select a large feature set, without significant improvements in predictive performance compared to a more parsimonious feature set. A large feature set is accompanied by harmful concurvity in the proposed models. To tackle the concurvity phenomenon, recent feature selection algorithms such as mRMR and HSIC-Lasso incorporate constraints on concurvity in their objective functions. However, these algorithms interpret concurvity as a pairwise non-linear relationship between features, so they do not account for the case when a feature can be accurately estimated as a multivariate function of several other features. This is confirmed by our numerical results. Our own solution to the problem, a hybrid genetic–harmony search algorithm (HA), introduces constraints on multivariate concurvity directly. Due to this constraint, the HA proposes a small, non-redundant feature set with predictive performance similar to that of models with far more features.
PubDate: 2022-11-03
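A minimal linear proxy for the multivariate redundancy described above, analogous to the VIF diagnostic in linear models, regresses one feature on all the others and reports the R²; the paper's non-linear multivariate concurvity notion is richer, so treat this only as an illustration:

```python
import numpy as np

def concurvity_r2(X, j):
    """Linear proxy for multivariate concurvity: R^2 from regressing
    column j of X on all other columns (cf. VIF = 1 / (1 - R^2))."""
    y = X[:, j]
    others = np.delete(X, j, axis=1)
    A = np.column_stack([np.ones(len(y)), others])   # intercept + predictors
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    return 1 - (resid @ resid) / ((y - y.mean()) ** 2).sum()

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = rng.normal(size=200)
x3 = x1 + x2 + 0.01 * rng.normal(size=200)   # nearly redundant feature
r2_redundant = concurvity_r2(np.column_stack([x1, x2, x3]), 2)
r2_independent = concurvity_r2(np.column_stack([x1, x2]), 1)
```

The redundant feature scores near 1, the independent one near 0, which is exactly the distinction pairwise measures can miss when redundancy involves several features jointly.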

• Multiple partition Markov model for B.1.1.7, B.1.351, B.1.617.2, and P.1
variants of the SARS-CoV-2 virus

Abstract: With tools originating from Markov processes, we investigate the similarities and differences between genomic sequences in FASTA format coming from four variants of the SARS-CoV-2 virus: B.1.1.7 (UK), B.1.351 (South Africa), B.1.617.2 (India), and P.1 (Brazil). We treat the virus’ sequences as samples of finite-memory Markov processes acting on $$A=\{a,c,g,t\}.$$ We model each sequence, revealing some heterogeneity between sequences belonging to the same variant. We identified the five most representative sequences for each variant using a robust notion of classification, see Fernández et al. (Math Methods Appl Sci 43(13):7537–7549. https://doi.org/10.1002/mma.5705). Using a notion derived from a metric between processes, see García et al. (Appl Stoch Models Bus Ind 34(6):868–878. https://doi.org/10.1002/asmb.2346), we identify four groups, each group representing a variant. This metric also detects a global proximity between the variants B.1.351 and B.1.1.7. With the selected sequences, we assemble a multiple partition model, see Cordeiro et al. (Math Methods Appl Sci 43(13):7677–7691. https://doi.org/10.1002/mma.6079), revealing in which states of the state space the variants differ concerning the mechanisms for choosing the next element of A. Through this model, we identify that the variants differ in their transition probabilities in eleven states out of a total of 256. For these eleven states, we reveal how the transition probabilities change from variant (or group of variants) to variant (or group of variants). In other words, we indicate precisely the stochastic reasons for the discrepancies.
PubDate: 2022-11-01
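A first-order version of the idea above, estimating transition probabilities over the alphabet A = {a, c, g, t} from a toy sequence; the paper fits higher-order, multiple-partition models, so this only sketches the basic estimation step:

```python
from collections import defaultdict

def transition_probs(seq, alphabet="acgt"):
    """Estimate first-order Markov transition probabilities from a symbol
    sequence by counting consecutive pairs and normalizing each row."""
    counts = {a: defaultdict(int) for a in alphabet}
    for prev, nxt in zip(seq, seq[1:]):
        counts[prev][nxt] += 1
    probs = {}
    for a in alphabet:
        total = sum(counts[a].values())
        probs[a] = {b: (counts[a][b] / total if total else 0.0)
                    for b in alphabet}
    return probs

P = transition_probs("acgtacgtaacgt")
```

In the toy sequence, 'a' is followed by 'c' in 3 of its 4 occurrences as a predecessor, so P['a']['c'] = 0.75.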

• Multiway clustering with time-varying parameters

Abstract: This paper proposes a clustering approach for multivariate time series with time-varying parameters in a multiway framework. Although clustering techniques based on time series distribution characteristics have been extensively studied, methods based on time-varying parameters have only recently been explored and are missing for multivariate time series. This paper fills the gap by proposing a multiway approach for distribution-based clustering of multivariate time series. To show the validity of the proposed clustering procedure, we provide both a simulation study and an application to real air quality time series data.
PubDate: 2022-11-01

• Bayesian quantile regression models for heavy tailed bounded variables
using the No-U-Turn sampler

Abstract: When we are interested in knowing how covariates impact different levels of the response variable, quantile regression models can be very useful, and their practical use has benefited from increasing computational power. Bounded response variables are also very common when data contain percentages, rates, or proportions. In this work, with the generalized Gompertz distribution as the baseline distribution, we derive two new two-parameter distributions with bounded support, and new quantile parametric mixed regression models are proposed based on these distributions, which accommodate bounded response variables with heavy tails. Bayesian estimation of the parameters is considered for both models, relying on the No-U-Turn sampler algorithm. The inferential methods can be implemented and then easily used for data analysis. Simulation studies with different quantiles ($$q=0.1$$, $$q=0.5$$ and $$q=0.9$$) and sample sizes ($$n=100$$, $$n=200$$, $$n=500$$, $$n=2000$$, $$n=5000$$) were conducted with 100 replicates of simulated data for each combination of settings, on the (0, 1) and [0, 1) supports, showing good parameter recovery for the proposed inferential methods and models, which were compared to Beta Rectangular and Kumaraswamy regression models. Furthermore, a dataset on extreme poverty is analyzed using the proposed regression models with fixed and mixed effects. The quantile parametric models proposed in this work are an alternative and complementary modeling tool for the analysis of bounded data.
PubDate: 2022-11-01
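Quantile regression in general rests on the check (pinball) loss, whose minimizer over a constant is the empirical q-quantile; a tiny grid-search sketch of that fact, unrelated to the paper's Bayesian machinery:

```python
def pinball_loss(y, pred, q):
    """Check (pinball) loss for quantile level q: asymmetric absolute error."""
    u = y - pred
    return q * u if u >= 0 else (q - 1) * u

def quantile_by_loss(ys, q, grid):
    """The grid value minimizing total pinball loss approximates the q-quantile."""
    return min(grid, key=lambda c: sum(pinball_loss(y, c, q) for y in ys))

ys = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
grid = [i / 10 for i in range(0, 110)]      # 0.0, 0.1, ..., 10.9
med = quantile_by_loss(ys, 0.5, grid)       # q = 0.5 recovers the median
```

For q = 0.5 the loss is half the absolute error, so the minimizer sits on the median plateau [5, 6] of this sample.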

• Multivariate understanding of income and expenditure in United States
households with statistical learning

Abstract: In recent decades, data-driven approaches have been developed to analyze demographic and economic surveys on a large scale. Despite advances in multivariate techniques and learning methods, in practice the analysis and interpretations are often focused on a small portion of the available data and limited to a single perspective. This paper applies a selected array of multivariate statistical learning methods to the analysis of income and expenditure patterns of households in the United States, using the Public-Use Microdata from the Bureau of Labor Statistics Consumer Expenditure Survey (CE). The objective is to propose an effective data pipeline that provides visualizations and comprehensive interpretations for applications in governmental regulation and economic research, using thirty-five original survey variables covering the categories of demographics, income and expenditure. Details on feature extraction not only showcase the CE as a unique publicly shared big data resource with high potential for in-depth analysis, but also assist interested researchers with pre-processing. Challenges from missing values and categorical variables are treated in the exploratory analysis, while statistical learning methods are comprehensively employed to address multiple economic perspectives. Principal component analysis identifies after-tax income, wage/salary income, and the quarterly expenditure on food, housing and overall as the five most important of the selected variables, while cluster analysis identifies and visualizes the implicit structure between variables. Based on this, canonical correlation analysis reveals high correlation between two selected groups of variables, one of income and the other of expenditure.
PubDate: 2022-11-01

• Nomclust 2.0: an R package for hierarchical clustering of objects
characterized by nominal variables

Abstract: In this paper, we present the second generation of the nomclust R package, which we developed for the hierarchical clustering of data containing nominal variables (nominal data). The package covers the complete hierarchical clustering process, from dissimilarity matrix calculation, through the choice of a clustering method, to the evaluation of the final clusters. Throughout the clustering process, similarity measures, clustering methods, and evaluation criteria developed solely for nominal data are used, which makes this package unique. In the first part of the paper, the theoretical background of the methods used in the package is described. In the second part, the functionality of the package is demonstrated with several examples. The second generation of the package has been completely rewritten to fit more naturally into the workflow of R users. It includes new similarity measures and evaluation criteria. We also added several graphical outputs and support for S3 generic functions. Finally, due to code optimizations, the time required for dissimilarity matrix calculation was substantially reduced.
PubDate: 2022-11-01
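The most basic nominal dissimilarity that packages like nomclust build on is simple matching: the fraction of variables on which two objects disagree. A sketch of the dissimilarity-matrix step in Python (the package itself is R, and uses more refined nominal measures):

```python
def simple_matching_dissimilarity(a, b):
    """Fraction of nominal variables on which two objects disagree."""
    return sum(x != y for x, y in zip(a, b)) / len(a)

# toy objects described by three nominal variables
objects = [
    ("red", "small", "round"),
    ("red", "small", "square"),
    ("blue", "large", "round"),
    ("blue", "large", "square"),
]
n = len(objects)
# full symmetric dissimilarity matrix, the input to hierarchical clustering
D = [[simple_matching_dissimilarity(objects[i], objects[j]) for j in range(n)]
     for i in range(n)]
```

The matrix D can then be fed to any agglomerative linkage routine.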

• Non-parametric seasonal unit root tests under periodic non-stationary
volatility

Abstract: This paper presents a new non-parametric seasonal unit root testing framework that is robust to periodic non-stationary volatility in the innovation variance, extending the fractional seasonal variance ratio unit root tests of Eroğlu et al. (Econ Lett 167:75–80, 2018). The setup allows for both the periodic heteroskedasticity structure of Burridge and Taylor (J Econ 104(1):91–117, 2001) and the non-stationary volatility structure of Cavaliere and Taylor (Econ Theory 24(1):43–71, 2008). We show that the limiting null distributions of the variance ratio tests depend on nuisance parameters derived from the underlying volatility process. Monte Carlo simulations show that the standard variance ratio tests can be substantially oversized in the presence of such effects. Consequently, we propose wild bootstrap implementations of the variance ratio tests. The wild bootstrap resampling schemes are shown to deliver asymptotically pivotal inference. The simulation evidence shows that the proposed bootstrap tests perform well in practice and essentially correct the size problems observed in the standard fractional seasonal variance ratio tests, even under extreme patterns of heteroskedasticity.
PubDate: 2022-11-01
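A wild bootstrap resamples residuals by random sign-flipping so that each draw preserves the original (possibly heteroskedastic) magnitude pattern; a minimal Rademacher-weights sketch, with the paper's seasonal-unit-root test statistic omitted:

```python
import random

def wild_bootstrap_samples(residuals, n_boot, seed=1):
    """Rademacher wild bootstrap: multiply each residual by an independent
    +1/-1 draw, so every resample keeps the residuals' magnitudes (and hence
    any heteroskedasticity pattern) intact."""
    rng = random.Random(seed)
    return [[e * rng.choice([-1.0, 1.0]) for e in residuals]
            for _ in range(n_boot)]

resamples = wild_bootstrap_samples([0.5, -1.2, 2.0, -0.3], n_boot=200)
```

In a bootstrap test, the statistic would be recomputed on each resample to build a null reference distribution.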

• Overcoming convergence problems in PLS path modelling

Abstract: The present paper deals with convergence issues of Lohmöller’s procedure for the computation of the components in the PLS-PM algorithm. Additional datasets and proofs are given to highlight the convergence failures of this procedure. Consequently, a new procedure based on the signless Laplacian matrix of the undirected graph between constructs is introduced. In several cases that are specified in this paper, both monotone convergence and error convergence for this new procedure are established. Several comparisons are presented between the new procedure and the two conventionally used procedures (Lohmöller’s and Hanafi–Wold’s).
PubDate: 2022-11-01
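The signless Laplacian mentioned above is Q = D + A, where A is the adjacency matrix of the graph and D the diagonal degree matrix; a small sketch of that construction, with the PLS-PM context omitted:

```python
def signless_laplacian(adj):
    """Signless Laplacian Q = D + A of an undirected graph given as a 0/1
    adjacency matrix (list of lists); D holds the row sums (degrees)."""
    n = len(adj)
    return [[(sum(adj[i]) if i == j else 0) + adj[i][j] for j in range(n)]
            for i in range(n)]

# path graph on three nodes: 1 - 2 - 3
A = [[0, 1, 0],
     [1, 0, 1],
     [0, 1, 0]]
Q = signless_laplacian(A)
```

Unlike the ordinary Laplacian D - A, Q is entrywise non-negative, which matters for the iteration's monotonicity.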

• Oblique decision tree induction by cross-entropy optimization based on the
von Mises–Fisher distribution

Abstract: Oblique decision trees recursively divide the feature space by using splits based on linear combinations of attributes. Compared to their univariate counterparts, which only use a single attribute per split, they are often smaller and more accurate. A common approach to learning decision trees is to iteratively introduce splits on a training set in a top-down manner, yet determining a single optimal oblique split is in general computationally intractable. Therefore, one has to rely on heuristics to find near-optimal splits. In this paper, we adapt the cross-entropy optimization method to tackle this problem. The approach is motivated geometrically by the observation that equivalent oblique splits can be interpreted as connected regions on a unit hypersphere which are defined by the samples in the training data. In each iteration, the algorithm samples multiple candidate solutions from this hypersphere using the von Mises–Fisher distribution, which is parameterized by a mean direction and a concentration parameter. These parameters are then updated based on the best-performing samples so that, when the algorithm terminates, a high probability mass is assigned to a region of near-optimal solutions. Our experimental results show that the proposed method is well suited for the induction of compact and accurate oblique decision trees in a small amount of time.
PubDate: 2022-11-01

• Ensemble updating of categorical state vectors

Abstract: An ensemble updating method for categorical state vectors is proposed. The method is based on a Bayesian view of the ensemble Kalman filter (EnKF). In the EnKF, Gaussian approximations to the forecast and filtering distributions are introduced, and the forecast ensemble is updated with a linear shift. Given that the Gaussian approximation to the forecast distribution is correct, the EnKF linear update corresponds to conditional simulation from a Gaussian distribution with mean and covariance such that the posterior samples are marginally distributed according to the Gaussian approximation to the filtering distribution. In the proposed approach for categorical vectors, the Gaussian approximations are replaced with a (possibly higher-order) Markov chain model, and the linear update is replaced with simulation based on a class of decomposable graphical models. To make the update robust against errors in the assumed forecast and filtering distributions, an optimality criterion is formulated, for which the resulting optimal updating procedure can be found by solving a linear program. We explore the properties of the proposed updating procedure in a simulation example where each state variable can take three values.
PubDate: 2022-11-01

• Regularized target encoding outperforms traditional methods in supervised
machine learning with high cardinality features

Abstract: Since most machine learning (ML) algorithms are designed for numerical inputs, efficiently encoding categorical variables is a crucial aspect of data analysis. A common problem is high cardinality features, i.e. unordered categorical predictor variables with a high number of levels. We study techniques that yield numeric representations of categorical variables which can then be used in subsequent ML applications. We focus on the impact of these techniques on a subsequent algorithm’s predictive performance, and, where possible, derive best practices on when to use which technique. We conducted a large-scale benchmark experiment in which we compared different encoding strategies together with five ML algorithms (lasso, random forest, gradient boosting, k-nearest neighbors, support vector machine) using datasets from regression, binary classification and multiclass classification settings. In our study, regularized versions of target encoding (i.e. using target predictions based on the feature levels in the training set as a new numerical feature) consistently provided the best results. Traditionally widely used encodings that make unreasonable assumptions to map levels to integers (e.g. integer encoding) or to reduce the number of levels (possibly based on target information, e.g. leaf encoding) before creating binary indicator variables (one-hot or dummy encoding) were not as effective in comparison.
PubDate: 2022-11-01
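A hedged sketch of regularized target encoding as described above: each row's level is replaced by an out-of-fold target mean, smoothed toward the global mean so rare levels are not overfit. The fold scheme and smoothing constant here are illustrative choices, not the benchmark's exact setup:

```python
def oof_target_encode(levels, targets, n_folds=2, prior_weight=1.0):
    """Out-of-fold regularized target encoding: for each row, use the target
    mean of its level computed on the OTHER folds, shrunk toward the global
    mean by prior_weight pseudo-observations."""
    n = len(levels)
    global_mean = sum(targets) / n
    encoded = [0.0] * n
    for fold in range(n_folds):
        hold = [i for i in range(n) if i % n_folds == fold]
        train = [i for i in range(n) if i % n_folds != fold]
        sums, counts = {}, {}
        for i in train:
            sums[levels[i]] = sums.get(levels[i], 0.0) + targets[i]
            counts[levels[i]] = counts.get(levels[i], 0) + 1
        for i in hold:
            s = sums.get(levels[i], 0.0)       # unseen level falls back
            c = counts.get(levels[i], 0)       # to the global mean
            encoded[i] = (s + prior_weight * global_mean) / (c + prior_weight)
    return encoded

enc = oof_target_encode(["a", "a", "b", "b"], [1.0, 1.0, 0.0, 0.0])
```

The out-of-fold split is what prevents target leakage: a row never sees its own target through its encoding.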

• Improved confidence intervals based on ranked set sampling designs within
a parametric bootstrap approach

Abstract: We study the problem of obtaining confidence intervals (CIs) within a parametric framework under different ranked set sampling (RSS) designs. This is an important research issue, since it has not yet been adequately addressed in the RSS literature. We focus on evaluating CIs based on a recently developed parametric bootstrap approach, with the asymptotic maximum likelihood CIs under simple random sampling (SRS) taken as the counterpart. A comprehensive simulation study was carried out to evaluate the accuracy and precision of the CIs. We considered as sampling designs the paired RSS, neoteric RSS, and double RSS, besides the original RSS and SRS. Different estimation methods and bootstrap CIs were evaluated. In addition, the robustness of the CIs to imperfect ranking was evaluated by inducing varied levels of ranking errors. The simulation results allowed us to identify accurate bootstrap CIs based on RSS and some of its extensions, which outperform the usual asymptotic or bootstrap CIs based on SRS in terms of accuracy (coverage rate) and/or precision (average width).
PubDate: 2022-11-01
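A percentile parametric bootstrap CI in miniature, here for the mean of an exponential model under plain SRS; the paper's RSS designs and specific bootstrap variant are not reproduced:

```python
import random

def parametric_bootstrap_ci(sample, n_boot=2000, alpha=0.05, seed=7):
    """Percentile parametric bootstrap CI for the mean of an Exponential
    model: fit by MLE, resample whole datasets from the fitted model,
    take empirical quantiles of the resampled means."""
    rng = random.Random(seed)
    n = len(sample)
    mean_hat = sum(sample) / n                 # MLE of the mean
    boot_means = sorted(
        sum(rng.expovariate(1 / mean_hat) for _ in range(n)) / n
        for _ in range(n_boot)
    )
    lo = boot_means[int(alpha / 2 * n_boot)]
    hi = boot_means[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

random.seed(3)
data = [random.expovariate(1 / 2.0) for _ in range(100)]   # true mean 2
lo, hi = parametric_bootstrap_ci(data)
```

Unlike the nonparametric bootstrap, resamples are drawn from the fitted parametric model rather than from the data themselves.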

• The 2017 Data Challenge of the American Statistical Association

PubDate: 2022-11-01

• Uniform design with prior information of factors under weighted
wrap-around $$L_2$$-discrepancy

Abstract: Uniform design is one of the most frequently used designs of experiments, and all factors are usually regarded as equally important in the existing uniform design literature. If prior information on certain factors is known, the differing importance of the factors should be reflected. In this paper, by assigning different weights to factors of different importance, the weighted wrap-around $$L_2$$-discrepancy is proposed to measure the uniformity of a design when prior information on certain factors is known. The properties of the weighted wrap-around $$L_2$$-discrepancy are explored. Accordingly, the weighted generalized wordlength pattern is proposed to describe the aberration of such designs. The relationship between the weighted wrap-around $$L_2$$-discrepancy and the weighted generalized wordlength pattern is established, and a lower bound for the weighted wrap-around $$L_2$$-discrepancy is obtained. Numerical results show that both the weighted wrap-around $$L_2$$-discrepancy and the weighted generalized wordlength pattern precisely capture the differences in importance among the columns of a design.
PubDate: 2022-11-01
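The standard (unweighted) wrap-around $$L_2$$-discrepancy has a closed form; the sketch below implements it for design points in [0, 1)^s. The paper's weighted version attaches a factor-specific weight to each coordinate's term, which is not shown here:

```python
def wrap_around_l2_discrepancy(design):
    """Squared wrap-around L2-discrepancy of a design with points in [0,1)^s:
    WD^2 = -(4/3)^s + (1/n^2) sum_i sum_j prod_k [3/2 - d(1 - d)],
    where d = |x_ik - x_jk| (the standard, unweighted form)."""
    n, s = len(design), len(design[0])
    acc = 0.0
    for xi in design:
        for xj in design:
            prod = 1.0
            for k in range(s):
                d = abs(xi[k] - xj[k])
                prod *= 1.5 - d * (1 - d)   # wrap-around kernel per factor
            acc += prod
    return -(4 / 3) ** s + acc / n ** 2

wd2 = wrap_around_l2_discrepancy([[0.25, 0.75], [0.75, 0.25]])
```

Lower values indicate a more uniform design; the wrap-around kernel makes the measure invariant to coordinate-wise cyclic shifts.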

• A variational inference for the Lévy adaptive regression with
multiple kernels

Abstract: This paper presents a variational Bayes approach to the Lévy adaptive regression kernel (LARK) model, which represents functions with an overcomplete system. In particular, we develop a variational inference method for a LARK model with multiple kernels (LARMuK) that estimates arbitrary functions which may have jump discontinuities. The algorithm is based on a variational Bayes approximation method with simulated annealing. We compare the proposed algorithm to a simulation-based reversible jump Markov chain Monte Carlo (RJMCMC) method using numerical experiments and discuss its potential and limitations.
PubDate: 2022-11-01

• Fast simulation of tempered stable Ornstein–Uhlenbeck processes

Abstract: Constructing Lévy-driven Ornstein–Uhlenbeck processes is a task closely related to the notion of self-decomposability. In particular, their transition laws are linked to the properties of what will hereafter be called the $$a$$-remainder of their self-decomposable stationary laws. In the present study we fully characterize the Lévy triplet of these $$a$$-remainders and provide a general framework to deduce the transition laws of finite-variation Ornstein–Uhlenbeck processes associated with tempered stable distributions. We focus finally on the subclass of the exponentially modulated tempered stable laws, and we derive algorithms for an exact generation of the skeleton of the related Ornstein–Uhlenbeck processes, with the further advantage that the adopted procedures are tens of times faster than those already available in the literature.
PubDate: 2022-11-01

• Optimal control for parameter estimation in partially observed
hypoelliptic stochastic differential equations

Abstract: We deal with the problem of parameter estimation in stochastic differential equations (SDEs) in a partially observed framework. We aim to design a method that works for both elliptic and hypoelliptic SDEs, the latter being characterized by degenerate diffusion coefficients. This feature often causes the failure of contrast estimators based on the Euler–Maruyama discretization scheme and dramatically impairs classic stochastic filtering methods used to reconstruct the unobserved states. All of these issues make the estimation problem in hypoelliptic SDEs difficult to solve. To overcome this, we construct a cost function that is well defined regardless of the elliptic nature of the SDEs. We also bypass the filtering step by adopting a control theory perspective. The unobserved states are estimated by solving deterministic optimal control problems with numerical methods that do not require strong assumptions on the conditioning of the diffusion coefficient. Numerical simulations on different partially observed hypoelliptic SDEs show that our method produces accurate estimates while dramatically reducing the computational cost compared to other estimation procedures.
PubDate: 2022-11-01

JournalTOCs
School of Mathematical and Computer Sciences
Heriot-Watt University
Edinburgh, EH14 4AS, UK
Email: journaltocs@hw.ac.uk
Tel: +00 44 (0)131 4513762