
  Subjects -> STATISTICS (Total: 130 journals)
Advances in Data Analysis and Classification
Journal Prestige (SJR): 1.09
Citation Impact (CiteScore): 1
Number of Followers: 52  
 
  Hybrid journal (may contain Open Access articles)
ISSN (Print) 1862-5355 - ISSN (Online) 1862-5347
Published by Springer-Verlag  [2467 journals]
  • Correction to: Robust optimal classification trees under noisy labels

      PubDate: 2022-12-01
       
  • Correction to: Principal component analysis constrained by layered simple structures

      Abstract: A Correction to this paper has been published: https://doi.org/10.1007/s11634-022-00503-9
      PubDate: 2022-12-01
       
  • Correction to: Multivariate cluster weighted models using skewed distributions

      Abstract: In the original publication of the article, the line after equation (5) was published incorrectly.
      PubDate: 2022-12-01
       
  • Polynomial approximate discretization of geometric centers in high-dimensional Euclidean space

      Abstract: Many geometric optimization problems can be reduced to choosing points in space (centers) minimizing some objective function which depends continuously on the distances from the chosen centers to given input points. We prove that, for any fixed \(\varepsilon > 0\), every finite set of points in real space of any dimension admits a polynomial-size set of candidate centers which can be computed in polynomial time and which contains a \((1+\varepsilon)\)-approximation of each point of space with respect to the Euclidean distances to all the given points. This provides a universal approximation-preserving reduction of any geometric center-based problem whose objective function satisfies a natural continuity-type condition to its discrete version, where the desired centers are selected from a polynomial-size set of candidates. The obtained polynomial upper bound for the size of a universal center set is supplemented by a theoretical lower bound for this size in the worst case.
      PubDate: 2022-12-01
       
  • Independence versus indetermination: basis of two canonical clustering criteria

      Abstract: This paper compares two coupling approaches as basic layers for building clustering criteria suited to modularizing and clustering very large networks. We briefly use “optimal transport theory” as a starting point, and as a way to derive two canonical couplings: “statistical independence” and “logical indetermination”. A symmetric list of properties is provided, notably the so-called “Monge properties”, applied to contingency matrices and justifying the \(\otimes \) versus \(\oplus \) notation. A study is proposed highlighting “logical indetermination”, as it is by far the lesser known of the two. Finally, we estimate the average difference between both couplings as the key explanation of their usually close results in network clustering.
      PubDate: 2022-12-01
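      To make the two couplings concrete, the sketch below builds both from the margins of a contingency table. The additive form used for “logical indetermination” is the one common in the relational-analysis literature and is an assumption here, not taken from the paper; all names are illustrative.

      ```python
      import numpy as np

      p, q = 3, 4                                # table dimensions
      n = 120.0                                  # grand total
      row = np.array([30.0, 50.0, 40.0])         # row margins (sum to n)
      col = np.array([20.0, 40.0, 25.0, 35.0])   # column margins (sum to n)

      # statistical independence coupling: n_ij = n_i. * n_.j / n
      indep = np.outer(row, col) / n

      # logical indetermination coupling (assumed additive form):
      # n_ij = n_i. / q + n_.j / p - n / (p * q)
      indet = row[:, None] / q + col[None, :] / p - n / (p * q)

      # both couplings reproduce the prescribed margins exactly
      print(np.allclose(indep.sum(axis=1), row), np.allclose(indet.sum(axis=1), row))
      ```

      The multiplicative structure of the first coupling versus the additive structure of the second mirrors the \(\otimes \) versus \(\oplus \) notation mentioned in the abstract.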
       
  • Least-squares bilinear clustering of three-way data

      Abstract: A least-squares bilinear clustering framework for modelling three-way data, where each observation consists of an ordinary two-way matrix, is introduced. The method combines bilinear decompositions of the two-way matrices with clustering over observations. Different clusterings are defined for each part of the bilinear decomposition, which decomposes the matrix-valued observations into overall means, row margins, column margins and row–column interactions. Therefore, up to four different classifications are defined jointly, one for each type of effect. The computational burden is greatly reduced by the orthogonality of the bilinear model, so that the joint clustering problem reduces to separate problems which can be handled independently. Three of these sub-problems are specific cases of k-means clustering; a special algorithm is formulated for the row–column interactions, which are displayed in clusterwise biplots. The method is illustrated via an empirical example, and the interpretation of the interaction biplots is discussed. Supplemental materials for this paper are available online, including the dedicated R package, lsbclust.
      PubDate: 2022-12-01
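      The decomposition underlying this framework can be sketched for a single two-way matrix: the four parts below (overall mean, row-margin effects, column-margin effects, row–column interactions) are orthogonal and sum back to the data. This is a generic two-way decomposition illustrating the orthogonality the abstract relies on, not the paper's full clustering algorithm.

      ```python
      import numpy as np

      Y = np.array([[2.0, 4.0, 6.0, 1.0],
                    [3.0, 7.0, 5.0, 2.0],
                    [8.0, 6.0, 9.0, 4.0]])

      overall = Y.mean()                                   # overall mean
      row_eff = Y.mean(axis=1, keepdims=True) - overall    # row-margin effects
      col_eff = Y.mean(axis=0, keepdims=True) - overall    # column-margin effects
      inter = Y - overall - row_eff - col_eff              # row-column interactions

      # the centred parts reconstruct Y exactly, so each part can be
      # clustered separately without affecting the others
      print(np.allclose(overall + row_eff + col_eff + inter, Y))  # True
      ```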
       
  • A comparison of two dissimilarity functions for mixed-type predictor variables in the $$\delta$$-machine

      Abstract: The \(\delta\)-machine is a statistical learning tool for classification based on dissimilarities or distances between profiles of the observations and profiles of a representation set, proposed by Yuan et al. (J Classif 36(3): 442–470, 2019). So far, the \(\delta\)-machine was restricted to continuous predictor variables only. In this article, we extend the \(\delta\)-machine to handle continuous, ordinal, nominal, and binary predictor variables. We utilized a tailored dissimilarity function for mixed-type variables defined by Gower; this measure has the properties of a Manhattan distance. We develop, in a similar vein, a Euclidean dissimilarity function for mixed-type variables. In simulation studies we compare the performance of the two dissimilarity functions, and we compare the predictive performance of the \(\delta\)-machine to logistic regression models. We generated data according to two population distributions, varying the type of predictor variables, the distribution of categorical variables, and the number of predictor variables. The performance of the \(\delta\)-machine using the two dissimilarity functions and different types of representation set was investigated. The simulation studies showed that the adjusted Euclidean dissimilarity function performed better than the adjusted Gower dissimilarity function; that the \(\delta\)-machine outperformed logistic regression; and that for constructing the representation set, K-medoids clustering achieved fewer active exemplars than K-means clustering while maintaining accuracy. We also applied the \(\delta\)-machine to an empirical example, discussed its interpretation in detail, and compared its classification performance with five other classification methods. The results showed that the \(\delta\)-machine has a good balance between accuracy and interpretability.
      PubDate: 2022-12-01
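      A minimal sketch of a Gower-style dissimilarity for mixed-type profiles, the kind of measure this article builds on: range-scaled Manhattan contributions for continuous variables and simple mismatch for categorical ones. The function and variable names are illustrative, not taken from the paper's code.

      ```python
      import numpy as np

      def gower_dissimilarity(x, y, is_categorical, ranges):
          """Gower-style dissimilarity between two mixed-type profiles.

          x, y           : 1-D arrays of equal length (categorical entries coded numerically)
          is_categorical : boolean mask marking categorical variables
          ranges         : observed range of each continuous variable (used for scaling)
          """
          d = np.where(
              is_categorical,
              (x != y).astype(float),       # simple mismatch for categorical variables
              np.abs(x - y) / ranges,       # range-scaled Manhattan for continuous variables
          )
          return d.mean()

      # toy profiles: two continuous variables and one binary variable
      x = np.array([1.0, 10.0, 0.0])
      y = np.array([2.0, 30.0, 1.0])
      mask = np.array([False, False, True])
      rng = np.array([5.0, 40.0, 1.0])   # ranges of the continuous variables (last entry unused)
      print(gower_dissimilarity(x, y, mask, rng))  # (0.2 + 0.5 + 1.0) / 3 ≈ 0.567
      ```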
       
  • Sparse dimension reduction based on energy and ball statistics

      Abstract: Two new methods for sparse dimension reduction are introduced, based on martingale difference divergence and ball covariance, respectively. These methods can be utilized straightforwardly as sufficient dimension reduction (SDR) techniques to estimate a sufficient dimension reduced subspace, which contains all information sufficient to explain a dependent variable. Moreover, owing to their sparsity, they intrinsically perform sufficient variable selection (SVS) and present two attractive new approaches to variable selection in a context of nonlinear dependencies that require few model assumptions. The two new methods are compared to a similar existing approach for SDR and SVS based on distance covariance, as well as to classical and robust sparse partial least squares. A simulation study shows that each of the new estimators can achieve correct variable selection in highly nonlinear contexts, yet are sensitive to outliers and computationally intensive. The study sheds light on the subtle differences between the methods. Two examples illustrate how they can be applied in practice, with a slight preference for the option based on martingale difference divergence in a bioinformatics example.
      PubDate: 2022-12-01
       
  • Quantile composite-based path modeling: algorithms, properties and applications

      Abstract: Composite-based path modeling aims to study the relationships among a set of constructs, that is, representations of theoretical concepts. Such constructs are operationalized as composites (i.e. linear combinations of observed or manifest variables). The traditional partial least squares approach to composite-based path modeling focuses on the conditional means of the response distributions, being based on ordinary least squares regressions. In several cases, restricting attention to the mean may fail to reveal interesting effects at other locations of the outcome variables. Among these: when response variables are highly skewed; when distributions have heavy tails and the analysis is also concerned with the tail part; when the error variances are heteroscedastic; and when distributions are characterized by outliers and other extreme data. In such cases, the quantile approach to path modeling is a valuable tool to complement the traditional approach, analyzing the entire distribution of outcome variables. Previous research has already shown the benefits of Quantile Composite-based Path Modeling, but the methodological properties of the method have never been investigated. This paper offers a complete description of Quantile Composite-based Path Modeling, illustrating in detail the method, the algorithms, and the partial optimization criteria, along with the machinery for validating and assessing the models. The asymptotic properties of the method are investigated through a simulation study. Moreover, an application on chronic kidney disease in diabetic patients is used to provide guidelines for the interpretation of results and to show the potential of the method to detect heterogeneity in the variable relationships.
      PubDate: 2022-12-01
       
  • The minimum weighted covariance determinant estimator for high-dimensional data

      Abstract: In a variety of diverse applications, it is desirable to perform a robust analysis of high-dimensional measurements without being harmed by the presence of a possibly large percentage of outlying measurements. The minimum weighted covariance determinant (MWCD) estimator, based on implicit weights assigned to individual observations, represents a promising and flexible extension of the popular minimum covariance determinant (MCD) estimator of the expectation and scatter matrix of multivariate data. In this work, a regularized version of the MWCD, denoted the minimum regularized weighted covariance determinant (MRWCD) estimator, is proposed. At the same time, it is accompanied by an outlier detection procedure. The novel MRWCD estimator is able to outperform other available robust estimators in several simulation scenarios, especially in estimating the scatter matrix of contaminated high-dimensional data.
      PubDate: 2022-12-01
       
  • Is there a role for statistics in artificial intelligence?

      Abstract: The research on and application of artificial intelligence (AI) has triggered a comprehensive scientific, economic, social and political discussion. Here we argue that statistics, as an interdisciplinary scientific field, plays a substantial role both for the theoretical and practical understanding of AI and for its future development. Statistics might even be considered a core element of AI. With its specialist knowledge of data evaluation, starting with the precise formulation of the research question and passing through a study design stage on to analysis and interpretation of the results, statistics is a natural partner for other disciplines in teaching, research and practice. This paper aims at highlighting the relevance of statistical methodology in the context of AI development. In particular, we discuss contributions of statistics to the field of artificial intelligence concerning methodological development, planning and design of studies, assessment of data quality and data collection, differentiation of causality and associations and assessment of uncertainty in results. Moreover, the paper also discusses the equally necessary and meaningful extensions of curricula in schools and universities to integrate statistical aspects into AI teaching.
      PubDate: 2022-12-01
       
  • SUBiNN: a stacked uni- and bivariate kNN sparse ensemble

      Abstract: Nearest neighbor classification is an intuitive distance-based classification method. It has, however, two drawbacks: (1) it is sensitive to the number of features, and (2) it does not give information about the importance of single features or pairs of features. In stacking, a set of base learners is combined into one overall ensemble classifier by means of a meta-learner. In this manuscript we combine univariate and bivariate nearest neighbor classifiers that are by themselves easily interpretable, and we combine these classifiers via a Lasso method, resulting in a sparse ensemble of nonlinear main and pairwise interaction effects. We christened the new method SUBiNN: Stacked Uni- and Bivariate Nearest Neighbors. SUBiNN overcomes the two drawbacks of simple nearest neighbor methods. In extensive simulations and on benchmark data sets, we evaluate the predictive performance of SUBiNN and compare it to other nearest neighbor ensemble methods as well as Random Forests and Support Vector Machines. Results indicate that SUBiNN often outperforms other nearest neighbor methods, and that SUBiNN is well capable of identifying noise features, but that Random Forests is often, though not always, the best classifier.
      PubDate: 2022-12-01
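      The stacking construction can be sketched with scikit-learn: one kNN base learner per feature and per feature pair, out-of-fold probability predictions as meta-features, and an L1-penalized logistic regression standing in for the Lasso meta-learner. This is a rough sketch of the idea, not the authors' implementation; all parameter choices are illustrative.

      ```python
      import numpy as np
      from itertools import combinations
      from sklearn.datasets import make_classification
      from sklearn.neighbors import KNeighborsClassifier
      from sklearn.model_selection import cross_val_predict
      from sklearn.linear_model import LogisticRegression

      X, y = make_classification(n_samples=300, n_features=4, n_informative=2,
                                 n_redundant=0, random_state=0)

      # base learners: one kNN per single feature and per feature pair
      subsets = [(j,) for j in range(X.shape[1])] + list(combinations(range(X.shape[1]), 2))
      meta_features = np.column_stack([
          cross_val_predict(KNeighborsClassifier(n_neighbors=5), X[:, s], y,
                            cv=5, method="predict_proba")[:, 1]
          for s in subsets
      ])

      # sparse meta-learner (L1 penalty stands in for the Lasso step)
      meta = LogisticRegression(penalty="l1", solver="liblinear", C=0.5).fit(meta_features, y)
      selected = [subsets[k] for k, w in enumerate(meta.coef_[0]) if abs(w) > 1e-8]
      print("base learners kept:", selected)
      ```

      Base learners whose meta-coefficient is shrunk to zero are dropped, which is how the ensemble identifies noise features.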
       
  • A power-controlled reliability assessment for multi-class probabilistic classifiers

      Abstract: In multi-class classification, the output of a probabilistic classifier is a probability distribution over the classes. In this work, we focus on a statistical assessment of the reliability of probabilistic classifiers for multi-class problems. Our approach generates a Pearson \(\chi ^2\) statistic based on the k-nearest neighbors in the prediction space. Further, we develop a Bayesian approach for estimating the expected power of the reliability test, which can be used to choose an appropriate sample size k. We propose a sampling algorithm and demonstrate that this algorithm obtains a valid prior distribution. The effectiveness of the proposed reliability test and expected power is evaluated through a simulation study. We also provide illustrative examples of the proposed methods with practical applications.
      PubDate: 2022-11-17
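      One plausible reading of such a statistic, sketched on data simulated from a perfectly calibrated classifier: pool a Pearson \(\chi ^2\) comparison of observed class counts against calibration-expected counts over each point's k nearest neighbors in the probability simplex. The exact form of the statistic here is an assumption for illustration, not taken from the paper.

      ```python
      import numpy as np

      rng = np.random.default_rng(0)
      n, m, k = 300, 3, 25

      # simulate a perfectly calibrated classifier: labels drawn from the stated probabilities
      probs = rng.dirichlet(np.ones(m), size=n)
      labels = np.array([rng.choice(m, p=p) for p in probs])

      # k nearest neighbours of each point in the prediction (probability) space
      d = np.linalg.norm(probs[:, None, :] - probs[None, :, :], axis=2)
      idx = np.argsort(d, axis=1)[:, :k]

      # pooled Pearson chi^2: observed class counts vs counts expected under calibration
      stat = 0.0
      for nb in idx:
          observed = np.bincount(labels[nb], minlength=m)
          expected = probs[nb].sum(axis=0)
          stat += ((observed - expected) ** 2 / expected).sum()
      print("mean local chi^2:", stat / n)
      ```

      A miscalibrated classifier (e.g. one whose reported probabilities are systematically overconfident) would inflate this statistic relative to the calibrated baseline.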
       
  • A dual subspace parsimonious mixture of matrix normal distributions

      Abstract: We present a parsimonious dual-subspace clustering approach for a mixture of matrix-normal distributions. By assuming certain principal components of the row and column covariance matrices are equally important, we express the model in fewer parameters without sacrificing discriminatory information. We derive update rules for an ECM algorithm and set forth necessary conditions to ensure identifiability. We use simulation to demonstrate parameter recovery, and we illustrate the parsimony and competitive performance of the model through two data analyses.
      PubDate: 2022-11-16
       
  • Monitoring photochemical pollutants based on symbolic interval-valued data analysis

      Abstract: This study considers monitoring photochemical pollutants for anomaly detection based on symbolic interval-valued data analysis. For this task, we construct control charts based on the principal component scores of symbolic interval-valued data. Herein, the symbolic interval-valued data are assumed to follow a normal distribution, and an approximate expectation formula for order statistics from the normal distribution is used in the univariate case to estimate the mean and variance via the method of moments. In addition, we consider the bivariate case, wherein we use the maximum likelihood estimator calculated from the likelihood function derived under a bivariate copula. We also establish the procedures for the statistical control chart based on the univariate and bivariate interval-valued variables, and these procedures are potentially extendable to higher-dimensional cases. Monte Carlo simulations and real data analysis using photochemical pollutants confirm the validity of the proposed method. In particular, the results show its superiority over the conventional method, which uses averages, in identifying the date on which the abnormal maximum occurred.
      PubDate: 2022-11-12
       
  • Editorial for ADAC issue 4 of volume 16 (2022)

      PubDate: 2022-10-31
       
  • Attraction-repulsion clustering: a way of promoting diversity linked to demographic parity in fair clustering

      Abstract: We consider the problem of diversity-enhancing clustering, i.e., developing clustering methods which produce clusters that favour diversity with respect to a set of protected attributes such as race, sex, age, etc. In the context of fair clustering, diversity plays a major role when fairness is understood as demographic parity. To promote diversity, we introduce perturbations to the distance in the unprotected attributes that account for protected attributes in a way that resembles the attraction-repulsion of charged particles in physics. These perturbations are defined through dissimilarities with a tractable interpretation. Cluster analysis based on attraction-repulsion dissimilarities penalizes homogeneity of the clusters with respect to the protected attributes and leads to an improvement in diversity. An advantage of our approach, which falls into a pre-processing set-up, is its compatibility with a wide variety of clustering methods and with non-Euclidean data. We illustrate the use of our procedures with both synthetic and real data and discuss the relation between diversity, fairness, and cluster structure.
      PubDate: 2022-10-20
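      A minimal sketch of the pre-processing idea with one binary protected attribute: perturb a base Euclidean dissimilarity so that distances across protected groups shrink (attraction) and distances within a group grow (repulsion), then hand the perturbed matrix to any clustering method that accepts precomputed dissimilarities. The scaling factors are illustrative, not the paper's dissimilarities; assumes scikit-learn ≥ 1.2 for the `metric="precomputed"` argument.

      ```python
      import numpy as np
      from sklearn.cluster import AgglomerativeClustering

      rng = np.random.default_rng(1)
      n = 60
      X = rng.normal(size=(n, 2))              # unprotected attributes
      protected = rng.integers(0, 2, size=n)   # a single binary protected attribute

      # base Euclidean dissimilarity on the unprotected attributes
      D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)

      # attraction-repulsion perturbation (illustrative factors):
      # inflate within-group distances (repulsion), shrink cross-group ones (attraction)
      same = protected[:, None] == protected[None, :]
      D_pert = np.where(same, 1.5 * D, 0.5 * D)
      np.fill_diagonal(D_pert, 0.0)

      model = AgglomerativeClustering(n_clusters=3, metric="precomputed", linkage="average")
      labels = model.fit_predict(D_pert)

      # diversity check: share of the protected group within each cluster
      for c in range(3):
          print(c, protected[labels == c].mean())
      ```

      Because the perturbation only changes the dissimilarity matrix, the same pre-processing works with any clustering method that consumes precomputed dissimilarities, including non-Euclidean ones.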
       
  • A structured covariance ensemble for sufficient dimension reduction

      Abstract: Sufficient dimension reduction (SDR) is a useful tool for high-dimensional data analysis. SDR aims at reducing the data dimensionality without loss of regression information between the response and its high-dimensional predictors. Many existing SDR methods are designed for data with continuous responses. Motivated by a recent work on aggregate dimension reduction (Wang in Stat Sin 30:1027–1048, 2020), we propose a unified SDR framework for both continuous and binary responses through a structured covariance ensemble. The connection with existing approaches is discussed in detail, and an efficient algorithm is proposed. Numerical examples and a real data application demonstrate its satisfactory performance.
      PubDate: 2022-10-19
       
  • Semiparametric finite mixture of regression models with Bayesian P-splines

      Abstract: Mixture models provide a useful tool to account for unobserved heterogeneity and are at the basis of many model-based clustering methods. To gain additional flexibility, some model parameters can be expressed as functions of concomitant covariates. In this paper, a semiparametric finite mixture of regression models is defined, with concomitant information assumed to influence both the component weights and the conditional means. In particular, linear predictors are replaced with smooth functions of the covariate considered, by resorting to cubic splines. An estimation procedure within the Bayesian paradigm is suggested, where smoothness of the covariate effects is controlled by suitable choices for the prior distributions of the spline coefficients. A data augmentation scheme based on difference random utility models is exploited to describe the mixture weights as functions of the covariate. The performance of the proposed methodology is investigated via simulation experiments and two real-world datasets, one about baseball salaries and the other concerning nitrogen oxide in engine exhaust.
      PubDate: 2022-10-18
       
  • On smoothing and scaling language model for sentiment based information retrieval

      Abstract: Sentiment analysis, or opinion mining, refers to the discovery of sentiment information within textual documents, tweets, or review posts. This field has emerged with the growth of social media and has become of great interest for several applications such as marketing, tourism, and business. In this work, we approach Twitter sentiment analysis through a novel framework that simultaneously addresses the text-representation problems of sparseness and high dimensionality. We propose an information retrieval probabilistic model based on a new distribution, namely the Smoothed Scaled Dirichlet distribution. We present a likelihood learning method for estimating the parameters of the distribution, and we propose a feature generation scheme from the information retrieval system. We apply the proposed approach, the Smoothed Scaled Relevance Model, to four Twitter sentiment datasets: STD, STS-Gold, SemEval14, and SentiStrength. We evaluate the performance of the proposed solution against baseline models and related works.
      PubDate: 2022-10-13
       
 
JournalTOCs
School of Mathematical and Computer Sciences
Heriot-Watt University
Edinburgh, EH14 4AS, UK
Email: journaltocs@hw.ac.uk
Tel: +00 44 (0)131 4513762
 



JournalTOCs © 2009-