Computational Statistics
Journal Prestige (SJR): 0.803
Citation Impact (CiteScore): 1
Number of Followers: 15  
 
Hybrid journal (it can contain Open Access articles)
ISSN (Print) 0943-4062 - ISSN (Online) 1613-9658
Published by Springer-Verlag [2467 journals]
  • Kernel density estimation by stagewise algorithm with a simple dictionary

      Abstract: This study proposes multivariate kernel density estimation by stagewise minimization algorithm based on U-divergence and a simple dictionary. The dictionary consists of an appropriate scalar bandwidth matrix and a part of the original data. The resulting estimator brings us data-adaptive weighting parameters and bandwidth matrices, and provides a sparse representation of kernel density estimation. We develop the non-asymptotic error bound of the estimator that we obtained via the proposed stagewise minimization algorithm. It is confirmed from simulation studies that the proposed estimator performs as well as, or sometimes better than, other well-known density estimators.
      PubDate: 2022-12-02
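
To make the idea of a sparse kernel density representation concrete, the following is a minimal one-dimensional sketch, assuming a Gaussian kernel and a single scalar bandwidth. The dictionary points and weights are hand-picked and hypothetical; the stagewise U-divergence minimization from the abstract is not implemented here, and the function names are illustrative only.

```python
import math


def gaussian_kernel(u):
    """Standard normal density, used here as the kernel."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)


def weighted_kde(x, centers, weights, bandwidth):
    """Evaluate sum_j w_j * K((x - c_j) / h) / h at the point x."""
    return sum(
        w * gaussian_kernel((x - c) / bandwidth) / bandwidth
        for c, w in zip(centers, weights)
    )


# Hypothetical one-dimensional data and a hand-picked sparse dictionary.
data = [0.1, 0.3, 0.9, 1.1, 1.2, 3.8, 4.1]


def full(x):
    """Ordinary KDE: every data point is a center with equal weight."""
    return weighted_kde(x, data, [1.0 / len(data)] * len(data), 0.5)


def sparse(x):
    """Sparse variant: three centers with unequal weights summing to 1."""
    return weighted_kde(x, [0.2, 1.0, 4.0], [0.3, 0.4, 0.3], 0.5)


for x in (0.0, 1.0, 4.0):
    print(x, round(full(x), 4), round(sparse(x), 4))
```

As long as the weights are non-negative and sum to one, the sparse estimate is itself a density, which is what allows a handful of weighted centers to stand in for the full sample.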
       
  • Deep support vector quantile regression with non-crossing constraints

      Abstract: We propose a new nonparametric regression approach that combines deep neural networks with support vector quantile regression models. The nature of deep neural networks enables complex nonlinear regression quantiles to be estimated more accurately. Because deep learning models have a complicated structure, the proposed method can easily fit both smooth and non-smooth data sets. For this reason, we can effectively model data sets with truncated points or locally different smoothness in which spline-based smoothing methods often fail. Stepwise fitting is used to increase computing speed when fitting multiple quantiles. This produces stable fits, especially when observations are scarce near the target quantile. In addition, we employ certain constraints to prevent the fitted quantiles from crossing. The benefits of the proposed method are more apparent when the errors are heteroscedastic, although quantile regression does not require homogeneous errors. We illustrate the flexibility of the proposed method using simulated data sets and six real data examples with univariate and multivariate input variables.
      PubDate: 2022-12-02
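
For readers unfamiliar with quantile regression, the sketch below shows the standard check (pinball) loss for a quantile level and a simple count of quantile crossings. It is an illustration only: the abstract does not specify the paper's exact loss or its non-crossing constraints, and the fitted values used here are made up.

```python
def pinball_loss(y_true, y_pred, tau):
    """Average check (pinball) loss for a quantile level tau in (0, 1)."""
    total = 0.0
    for y, q in zip(y_true, y_pred):
        diff = y - q
        # Under-prediction is weighted by tau, over-prediction by (1 - tau).
        total += tau * diff if diff >= 0 else (tau - 1.0) * diff
    return total / len(y_true)


def count_crossings(lower_fit, upper_fit):
    """Count points where the lower-quantile fit exceeds the upper-quantile fit."""
    return sum(1 for lo, hi in zip(lower_fit, upper_fit) if lo > hi)


# Hypothetical observations and fitted 10% and 90% quantiles.
y = [1.0, 2.0, 3.0, 4.0]
q10 = [0.5, 1.5, 3.2, 3.0]
q90 = [1.5, 2.5, 3.1, 5.0]

print(pinball_loss(y, q10, 0.1))
print(pinball_loss(y, q90, 0.9))
print(count_crossings(q10, q90))
```

A crossing count above zero means the fitted 10% quantile lies above the fitted 90% quantile somewhere, which is the inconsistency that non-crossing constraints are meant to rule out.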
       
  • Fair evaluation of classifier predictive performance based on binary confusion matrix

      Abstract: Evaluating the ability of a classifier to make predictions on unseen data and increasing it by tweaking the learning algorithm are two of the main reasons motivating the evaluation of classifier predictive performance. In this study, the behavior of Balanced \(AC_1\), a novel classifier accuracy measure, is investigated under different class imbalance conditions via a Monte Carlo simulation. The behavior of Balanced \(AC_1\) is compared against that of several well-known performance measures based on the binary confusion matrix. Study results reveal the suitability of Balanced \(AC_1\) for both balanced and imbalanced data sets. A real example of the effects of class imbalance on the behavior of the investigated classifier performance measures is provided by comparing the performance of several machine learning algorithms in a churn prediction problem.
      PubDate: 2022-11-29
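
The effect of class imbalance on confusion-matrix measures can be illustrated with a few standard metrics. The sketch below computes accuracy, balanced accuracy, F1 and the Matthews correlation coefficient; Balanced \(AC_1\) itself is not implemented because the abstract does not give its formula, and the counts are hypothetical.

```python
import math


def binary_metrics(tp, fp, fn, tn):
    """Common performance measures from a binary confusion matrix.

    Assumes both classes are present (tp + fn > 0 and tn + fp > 0).
    """
    n = tp + fp + fn + tn
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    f1_denom = precision + sensitivity
    f1 = 2.0 * precision * sensitivity / f1_denom if f1_denom else 0.0
    mcc_denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / mcc_denom if mcc_denom else 0.0
    return {
        "accuracy": (tp + tn) / n,
        "balanced_accuracy": 0.5 * (sensitivity + specificity),
        "f1": f1,
        "mcc": mcc,
    }


# Hypothetical 95:5 imbalanced test set, classifier always predicts "negative".
trivial = binary_metrics(tp=0, fp=0, fn=5, tn=95)
print(trivial)
```

In this example plain accuracy is high even though the classifier never detects a positive case, while balanced accuracy, F1 and MCC expose the failure. This is the kind of behavior a comparison across imbalance conditions is designed to reveal.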
       
  • An evolutionary estimation procedure for generalized semilinear regression trees

      Abstract: In many applications, the presence of interactions or even mild non-linearities can affect inference and predictions. For that reason, we suggest the use of a class of models lying between statistics and machine learning and we propose a learning procedure. The models combine a linear part and a tree component that is selected via an evolutionary algorithm, and they can be adopted for any kind of response, such as continuous, categorical, or ordinal responses, and survival times. They are inherently interpretable but more flexible than standard regression models, as they easily capture non-linear and interaction effects. The proposed genetic-like learning algorithm avoids a greedy search of the tree component. In a simulation study, we show that the proposed approach has a performance comparable with other machine learning algorithms, with a substantial gain in interpretability and transparency, and we illustrate the method on a real data set.
      PubDate: 2022-11-27
       
  • An extended approach for the generalized powered uniform distribution

      Abstract: A new uniform distribution model, generalized powered uniform distribution (GPUD), which is based on incorporating the parameter k into the probability density function (pdf) associated with the power of random variable values and includes a powered mean operator, is introduced in this paper. From this new model, the shape properties of the pdf as well as the higher-order moments, the moment generating function, the model that simulates the GPUD and other important statistics can be derived. This approach allows the generalization of the distribution presented by Jayakumar and Sankaran (2016) through the new \(\text{GPUD}_{(J-S)}\) distribution. Two sets of real data related to COVID-19 and bladder cancer were tested to demonstrate the proposed model’s potential. The maximum likelihood method was used to calculate the parameter estimators by applying the maxLik package in R. The results showed that this new model is more flexible and useful than other comparable models.
      PubDate: 2022-11-16
       
  • Robust estimation for the one-parameter exponential family integer-valued GARCH(1,1) models based on a modified Tukey’s biweight function

      Abstract: In this paper, we study a robust estimation method for observation-driven integer-valued time series models whose conditional distribution belongs to the one-parameter exponential family. The maximum likelihood estimator (MLE) is commonly used to estimate parameters, but it is highly affected by outliers. We resort to Mallows’ quasi-likelihood estimator based on a modified Tukey’s biweight function as a robust estimator and establish its existence, uniqueness, consistency and asymptotic normality under some regularity conditions. Simulation results illustrate the better performance of the new estimator compared with the MLE. The method is applied to two real data sets, and a comparison with other existing robust estimators is also given.
      PubDate: 2022-11-09
       
  • The computing of the Poisson multinomial distribution and applications in ecological inference and machine learning

      Abstract: The Poisson multinomial distribution (PMD) describes the distribution of the sum of n independent but non-identically distributed random vectors, in which each random vector is of length m with 0/1 valued elements and only one of its elements can take value 1 with a certain probability. Those probabilities are different for the m elements across the n random vectors, and form an \(n \times m\) matrix in which each row sums to 1. We call this \(n \times m\) matrix the success probability matrix (SPM). Each SPM uniquely defines a PMD. The PMD is useful in many areas such as voting theory, ecological inference, and machine learning. The distribution functions of the PMD, however, are usually difficult to compute and there is no efficient algorithm available for computing them. In this paper, we develop efficient methods to compute the probability mass function (pmf) for the PMD using multivariate Fourier transform, normal approximation, and simulations. We study the accuracy and efficiency of those methods and give recommendations for which methods to use under various scenarios. We also illustrate the use of the PMD via three applications, namely, in ecological inference, uncertainty quantification in classification, and voting probability calculation. We build an R package that implements the proposed methods, and illustrate the package with examples. This paper has online supplementary materials.
      PubDate: 2022-11-04
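
The definition of the PMD in the abstract can be turned directly into a small exact computation. The sketch below builds the pmf by adding one random vector at a time (a plain convolution over count vectors). This is a brute-force illustration for small n and m, not the Fourier-transform, normal-approximation or simulation methods the paper develops, and the function name `pmd_pmf` is hypothetical.

```python
from collections import defaultdict


def pmd_pmf(spm):
    """Exact pmf of the Poisson multinomial distribution for a small SPM.

    spm is a list of n rows, each a length-m list of probabilities summing to 1.
    Returns a dict mapping count tuples (x_1, ..., x_m) to probabilities.
    """
    m = len(spm[0])
    pmf = {(0,) * m: 1.0}
    for row in spm:
        if len(row) != m or abs(sum(row) - 1.0) > 1e-9:
            raise ValueError("each SPM row must have length m and sum to 1")
        updated = defaultdict(float)
        for counts, prob in pmf.items():
            for j, p_j in enumerate(row):
                if p_j == 0.0:
                    continue
                # The current vector puts its single 1 in category j.
                new_counts = counts[:j] + (counts[j] + 1,) + counts[j + 1:]
                updated[new_counts] += prob * p_j
        pmf = dict(updated)
    return pmf


# Hypothetical 2 x 2 success probability matrix.
example = pmd_pmf([[0.5, 0.5], [0.2, 0.8]])
for counts in sorted(example):
    print(counts, round(example[counts], 4))
```

The number of reachable count vectors grows quickly with n and m, which is one way to see why faster or approximate methods are needed at realistic sizes.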
       
  • Feature selection algorithms in generalized additive models under concurvity

      Abstract: In this paper, the properties of 10 different feature selection algorithms for generalized additive models (GAMs) are compared on one simulated and two real-world datasets under concurvity. Concurvity can be interpreted as a redundancy in the feature set of a GAM. Like multicollinearity in linear models, concurvity causes unstable parameter estimates in GAMs and makes the marginal effect of features harder to interpret. Feature selection algorithms for GAMs can be separated into four clusters: stepwise, boosting, regularization and concurvity controlled methods. Our numerical results show that algorithms with no constraints on concurvity tend to select a large feature set, without significant improvements in predictive performance compared to a more parsimonious feature set. A large feature set is accompanied by harmful concurvity in the proposed models. To tackle the concurvity phenomenon, recent feature selection algorithms such as the mRMR and the HSIC-Lasso incorporated some constraints on concurvity in their objective function. However, these algorithms interpret concurvity as a pairwise non-linear relationship between features, so they do not account for the case when a feature can be accurately estimated as a multivariate function of several other features. This is confirmed by our numerical results. Our own solution to the problem, a hybrid genetic–harmony search algorithm (HA), introduces constraints on multivariate concurvity directly. Due to this constraint, the HA proposes a small and non-redundant feature set with predictive performance similar to that of models with far more features.
      PubDate: 2022-11-03
       
  • Multiple partition Markov model for B.1.1.7, B.1.351, B.1.617.2, and P.1 variants of SARS-CoV 2 virus

      Abstract: With tools originating from Markov processes, we investigate the similarities and differences between genomic sequences in FASTA format coming from four variants of the SARS-CoV 2 virus, B.1.1.7 (UK), B.1.351 (South Africa), B.1.617.2 (India), and P.1 (Brazil). We treat the virus’ sequences as samples of finite memory Markov processes acting in \(A=\{a,c,g,t\}\). We model each sequence, revealing some heterogeneity between sequences belonging to the same variant. We identified the five most representative sequences for each variant using a robust notion of classification, see Fernández et al. (Math Methods Appl Sci 43(13):7537–7549. https://doi.org/10.1002/mma.5705). Using a notion derived from a metric between processes, see García et al. (Appl Stoch Models Bus Ind 34(6):868–878. https://doi.org/10.1002/asmb.2346), we identify four groups, each group representing a variant. This metric also detects global proximity between the variants B.1.351 and B.1.1.7. With the selected sequences, we assemble a multiple partition model, see Cordeiro et al. (Math Methods Appl Sci 43(13):7677–7691. https://doi.org/10.1002/mma.6079), revealing in which states of the state space the variants differ, concerning the mechanisms for choosing the next element in A. Through this model, we identify that the variants differ in their transition probabilities in eleven states out of a total of 256 states. For these eleven states, we reveal how the transition probabilities change from variant (group of variants) to variant (group of variants). In other words, we indicate precisely the stochastic reasons for the discrepancies.
      PubDate: 2022-11-01
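
A finite-memory Markov model on the alphabet {a, c, g, t} can be estimated by counting which symbol follows each block of preceding symbols. The sketch below does this for a chosen memory length. The 256 states mentioned in the abstract equal 4^4, which suggests a memory of 4; that reading is an inference, and the partition model and the metric between processes are not implemented here. The toy sequence is made up.

```python
from collections import Counter, defaultdict

ALPHABET = "acgt"


def transition_probs(seq, order):
    """Empirical transition probabilities of a Markov chain with the given memory.

    Returns {state: {symbol: probability}} for every state observed in seq,
    where a state is a string of `order` consecutive symbols.
    """
    allowed = set(ALPHABET)
    counts = defaultdict(Counter)
    seq = seq.lower()
    for i in range(order, len(seq)):
        state, nxt = seq[i - order:i], seq[i]
        # Skip windows containing ambiguous bases such as 'n'.
        if set(state) <= allowed and nxt in allowed:
            counts[state][nxt] += 1
    probs = {}
    for state, counter in counts.items():
        total = sum(counter.values())
        probs[state] = {a: counter[a] / total for a in ALPHABET}
    return probs


# Toy sequence, not real viral data.
probs = transition_probs("acgtacgtacga", order=4)
for state in sorted(probs):
    print(state, probs[state])
print(len(ALPHABET) ** 4)
```

Comparing such estimated transition rows state by state across sequences is the basic ingredient for asking in which states two variants behave differently.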
       
  • Multiway clustering with time-varying parameters

      Abstract: This paper proposes a clustering approach for multivariate time series with time-varying parameters in a multiway framework. Although clustering techniques based on time series distribution characteristics have been extensively studied, methods based on time-varying parameters have only recently been explored and are missing for multivariate time series. This paper fills the gap by proposing a multiway approach for distribution-based clustering of multivariate time series. To show the validity of the proposed clustering procedure, we provide both a simulation study and an application to real air quality time series data.
      PubDate: 2022-11-01
       
  • Bayesian quantile regression models for heavy tailed bounded variables using the No-U-Turn sampler

      Abstract: When we are interested in knowing how covariates impact different levels of the response variable, quantile regression models can be very useful, and their practical use has benefited from increasing computational power. Bounded response variables are also very common when the data contain percentages, rates, or proportions. In this work, with the generalized Gompertz distribution as the baseline distribution, we derive two new two-parameter distributions with bounded support, and new quantile parametric mixed regression models are proposed based on these distributions, which consider bounded response variables with heavy tails. Estimation of the parameters using the Bayesian approach is considered for both models, relying on the No-U-Turn sampler algorithm. The inferential methods can be implemented and then easily used for data analysis. Simulation studies with different quantiles (\(q=0.1\), \(q=0.5\) and \(q=0.9\)) and sample sizes (\(n=100\), \(n=200\), \(n=500\), \(n=2000\), \(n=5000\)) were conducted for 100 replicas of simulated data for each combination of settings, on the intervals (0, 1) and [0, 1), showing the good performance of the recovery of parameters for the proposed inferential methods and models, which were compared to Beta Rectangular and Kumaraswamy regression models. Furthermore, a dataset on extreme poverty is analyzed using the proposed regression models with fixed and mixed effects. The quantile parametric models proposed in this work are an alternative and complementary modeling tool for the analysis of bounded data.
      PubDate: 2022-11-01
       
  • Multivariate understanding of income and expenditure in United States households with statistical learning

      Abstract: In recent decades, data-driven approaches have been developed to analyze demographic and economic surveys on a large scale. Despite advances in multivariate techniques and learning methods, in practice the analysis and interpretations are often focused on a small portion of available data and limited to a single perspective. This paper aims to utilize a selected array of multivariate statistical learning methods in the analysis of income and expenditure patterns of households in the United States using the Public-Use Microdata from the Bureau of Labor Statistics Consumer Expenditure Survey (CE). The objective is to propose an effective data pipeline that provides visualizations and comprehensive interpretations for applications in governmental regulations and economic research, using thirty-five original survey variables covering the categories of demographics, income and expenditure. Details on feature extraction not only showcase CE as a unique publicly-shared big data resource with high potential for in-depth analysis, but also assist interested researchers with pre-processing. Challenges from missing values and categorical variables are treated in the exploratory analysis, while statistical learning methods are comprehensively employed to address multiple economic perspectives. Principal component analysis suggests after-tax income, wage/salary income, and the quarterly expenditure on food, housing and overall as the five most important of the selected variables, while cluster analysis identifies and visualizes the implicit structure between variables. Based on this, canonical correlation analysis reveals high correlation between two selected groups of variables, one of income and the other of expenditure.
      PubDate: 2022-11-01
       
  • Nomclust 2.0: an R package for hierarchical clustering of objects characterized by nominal variables

      Abstract: In this paper, we present the second generation of the nomclust R package, which we developed for the hierarchical clustering of data containing nominal variables (nominal data). The package completely covers the hierarchical clustering process, from dissimilarity matrix calculation, through the choice of a clustering method, to the evaluation of the final clusters. Throughout the whole clustering process, similarity measures, clustering methods, and evaluation criteria developed solely for nominal data are used, which makes this package unique. In the first part of the paper, the theoretical background of the methods used in the package is described. In the second part, the functionality of the package is demonstrated in several examples. The second generation of the package is completely rewritten to be more natural for the workflow of R users. It includes new similarity measures and evaluation criteria. We also added several graphical outputs and support for S3 generic functions. Finally, due to code optimizations, the time needed to compute the dissimilarity matrix was substantially reduced.
      PubDate: 2022-11-01
       
  • Non-parametric seasonal unit root tests under periodic non-stationary volatility

      Abstract: This paper presents a new non-parametric seasonal unit root testing framework that is robust to periodic non-stationary volatility in innovation variance by making an extension to the fractional seasonal variance ratio unit root tests of Eroğlu et al. (Econ Lett 167:75–80, 2018). The setup allows for both the periodic heteroskedasticity structure of Burridge and Taylor (J Econ 104(1):91–117, 2001) and the non-stationary volatility structure of Cavaliere and Taylor (Econ Theory 24(1):43–71, 2008). We show that the limiting null distributions of the variance ratio tests depend on nuisance parameters derived from the underlying volatility process. Monte Carlo simulations show that the standard variance ratio tests can be substantially oversized in the presence of such effects. Consequently, we propose wild bootstrap implementations of the variance ratio tests. Wild bootstrap resampling schemes are shown to deliver asymptotically pivotal inference. The simulation evidence shows that the proposed bootstrap tests perform well in practice and essentially correct the size problems observed in the standard fractional seasonal variance ratio tests, even under extreme patterns of heteroskedasticity.
      PubDate: 2022-11-01
       
  • Overcoming convergence problems in PLS path modelling

      Abstract: The present paper deals with convergence issues of Lohmöller’s procedure for the computation of the components in the PLS-PM algorithm. More datasets and proofs are given to highlight the convergence failure of this procedure. Consequently, a new procedure based on the signless Laplacian matrix of the indirect graph between constructs is introduced. In several cases that will be specified in this paper, both monotonicity and error convergence for this new procedure will be established. Several comparisons will be presented between the new procedure and the two conventionally used procedures (Lohmöller’s and Hanafi-Wold’s procedures).
      PubDate: 2022-11-01
       
  • Oblique decision tree induction by cross-entropy optimization based on the von Mises–Fisher distribution

      Abstract: Oblique decision trees recursively divide the feature space by using splits based on linear combinations of attributes. Compared to their univariate counterparts, which only use a single attribute per split, they are often smaller and more accurate. A common approach to learn decision trees is by iteratively introducing splits on a training set in a top-down manner, yet determining a single optimal oblique split is in general computationally intractable. Therefore, one has to rely on heuristics to find near-optimal splits. In this paper, we adapt the cross-entropy optimization method to tackle this problem. The approach is motivated geometrically by the observation that equivalent oblique splits can be interpreted as connected regions on a unit hypersphere which are defined by the samples in the training data. In each iteration, the algorithm samples multiple candidate solutions from this hypersphere using the von Mises–Fisher distribution, which is parameterized by a mean direction and a concentration parameter. These parameters are then updated based on the best performing samples such that when the algorithm terminates a high probability mass is assigned to a region of near-optimal solutions. Our experimental results show that the proposed method is well-suited for the induction of compact and accurate oblique decision trees in a small amount of time.
      PubDate: 2022-11-01
       
  • Ensemble updating of categorical state vectors

      Abstract: An ensemble updating method for categorical state vectors is proposed. The method is based on a Bayesian view of the ensemble Kalman filter (EnKF). In the EnKF, Gaussian approximations to the forecast and filtering distributions are introduced, and the forecast ensemble is updated with a linear shift. Given that the Gaussian approximation to the forecast distribution is correct, the EnKF linear update corresponds to conditional simulation from a Gaussian distribution with mean and covariance such that the posterior samples marginally are distributed according to the Gaussian approximation to the filtering distribution. In the proposed approach for categorical vectors, the Gaussian approximations are replaced with a (possibly higher order) Markov chain model, and the linear update is replaced with simulation based on a class of decomposable graphical models. To make the update robust against errors in the assumed forecast and filtering distributions, an optimality criterion is formulated, for which the resulting optimal updating procedure can be found by solving a linear program. We explore the properties of the proposed updating procedure in a simulation example where each state variable can take three values.
      PubDate: 2022-11-01
       
  • Regularized target encoding outperforms traditional methods in supervised machine learning with high cardinality features

      Abstract: Since most machine learning (ML) algorithms are designed for numerical inputs, efficiently encoding categorical variables is a crucial aspect in data analysis. A common problem is high cardinality features, i.e. unordered categorical predictor variables with a high number of levels. We study techniques that yield numeric representations of categorical variables which can then be used in subsequent ML applications. We focus on the impact of these techniques on a subsequent algorithm’s predictive performance, and, if possible, derive best practices on when to use which technique. We conducted a large-scale benchmark experiment, where we compared different encoding strategies together with five ML algorithms (lasso, random forest, gradient boosting, k-nearest neighbors, support vector machine) using datasets from regression, binary and multiclass classification settings. In our study, regularized versions of target encoding (i.e. using target predictions based on the feature levels in the training set as a new numerical feature) consistently provided the best results. Traditionally widely used encodings that make unreasonable assumptions to map levels to integers (e.g. integer encoding) or to reduce the number of levels (possibly based on target information, e.g. leaf encoding) before creating binary indicator variables (one-hot or dummy encoding) were not as effective in comparison.
      PubDate: 2022-11-01
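
Target encoding replaces each level of a categorical feature by a statistic of the target computed on the training set. The sketch below uses one simple form of regularization, shrinking each level mean toward the global mean with a smoothing constant. This is an assumption chosen for illustration and not necessarily the regularization studied in the paper; the function names, the smoothing parameter and the data are hypothetical.

```python
from collections import defaultdict


def fit_target_encoding(levels, targets, smoothing=10.0):
    """Learn a shrunken mean of the target for each level on training data.

    encoding[level] = (sum of targets in level + smoothing * global mean)
                      / (count of level + smoothing)
    """
    global_mean = sum(targets) / len(targets)
    sums = defaultdict(float)
    counts = defaultdict(int)
    for level, y in zip(levels, targets):
        sums[level] += y
        counts[level] += 1
    encoding = {
        level: (sums[level] + smoothing * global_mean) / (counts[level] + smoothing)
        for level in counts
    }
    return encoding, global_mean


def apply_target_encoding(levels, encoding, global_mean):
    """Map levels to numbers; unseen levels fall back to the global mean."""
    return [encoding.get(level, global_mean) for level in levels]


# Hypothetical training data for a regression target.
train_levels = ["a", "a", "b", "c"]
train_targets = [1.0, 3.0, 10.0, 2.0]
encoding, global_mean = fit_target_encoding(train_levels, train_targets, smoothing=2.0)

print(encoding)
print(apply_target_encoding(["a", "b", "z"], encoding, global_mean))
```

Rare levels are pulled strongly toward the global mean, which limits the overfitting that an unregularized level mean would cause for features with many sparsely populated levels.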
       
  • Improved confidence intervals based on ranked set sampling designs within a parametric bootstrap approach

      Abstract: We study the problem of obtaining confidence intervals (CIs) within a parametric framework under different ranked set sampling (RSS) designs. This is an important research issue since it has not yet been adequately addressed in the RSS literature. We focused on evaluating CIs based on a recently developed parametric bootstrap approach, and the asymptotic maximum likelihood CIs under simple random sampling (SRS) were taken as the counterpart. A comprehensive simulation study was carried out to evaluate the accuracy and precision of the CIs. We have considered as sampling designs the paired RSS, neoteric RSS, and double RSS, besides the original RSS and SRS. Different estimation methods and bootstrap CIs were evaluated. In addition, the robustness of the CIs to imperfect ranking was evaluated by inducing varied levels of ranking errors. The simulated results allowed us to identify accurate bootstrap CIs based on RSS and some of its extensions, which outperform the usual asymptotic or bootstrap CIs based on SRS in terms of accuracy (coverage rate) and/or precision (average width).
      PubDate: 2022-11-01
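
The original RSS design referred to in the abstract can be sketched in a few lines. For each rank r in a cycle, a fresh set of units is drawn and ranked, and only the r-th smallest is measured. The simulation below assumes perfect ranking and a standard normal population, and compares the variance of the sample mean under RSS and SRS of equal size. It does not implement the paper's bootstrap CIs or the paired, neoteric and double RSS variants; all names and settings are illustrative.

```python
import random
import statistics


def ranked_set_sample(draw, set_size, cycles, rng):
    """Balanced ranked set sample with perfect ranking.

    In each cycle and for each rank r, draw `set_size` units, sort them,
    and keep only the r-th smallest. Returns set_size * cycles values.
    """
    sample = []
    for _ in range(cycles):
        for rank in range(set_size):
            candidates = sorted(draw(rng) for _ in range(set_size))
            sample.append(candidates[rank])
    return sample


def normal_draw(rng):
    """One unit from a hypothetical standard normal population."""
    return rng.gauss(0.0, 1.0)


def compare_designs(reps=2000, set_size=3, cycles=4, seed=1):
    """Monte Carlo variances of the sample mean under RSS and SRS."""
    rng = random.Random(seed)
    n = set_size * cycles
    rss_means = [
        statistics.fmean(ranked_set_sample(normal_draw, set_size, cycles, rng))
        for _ in range(reps)
    ]
    srs_means = [
        statistics.fmean(normal_draw(rng) for _ in range(n))
        for _ in range(reps)
    ]
    return statistics.variance(rss_means), statistics.variance(srs_means)


rss_var, srs_var = compare_designs()
print(rss_var, srs_var)
```

The lower variance of the RSS mean for the same number of measured units is the source of the narrower intervals that RSS-based CIs can achieve.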
       
  • The 2017 Data Challenge of the American Statistical Association

      PubDate: 2022-11-01
       
 
 

