
  Subjects -> STATISTICS (Total: 130 journals)
Advances in Data Analysis and Classification
Journal Prestige (SJR): 1.09
Citation Impact (CiteScore): 1
Number of Followers: 52  
 
  Hybrid journal (may contain Open Access articles)
ISSN (Print) 1862-5355 - ISSN (Online) 1862-5347
Published by Springer-Verlag  [2468 journals]
  • Robust functional logistic regression


      Abstract: Functional logistic regression is a popular model to capture a linear relationship between a binary response and functional predictor variables. However, many methods used for parameter estimation in functional logistic regression are sensitive to outliers, which may lead to inaccurate parameter estimates and inferior classification accuracy. We propose a robust estimation procedure for functional logistic regression, in which the observations of the functional predictor are projected onto a set of finite-dimensional subspaces via robust functional principal component analysis. This dimension-reduction step reduces the outlying effects in the functional predictor. The logistic regression coefficient is estimated using an M-type estimator based on the binary response and robust principal component scores. In doing so, we provide robust estimates by minimizing the effects of outliers in the binary response and functional predictor variables. Via a series of Monte Carlo simulations and using hand radiograph data, we examine the parameter estimation and classification accuracy for the response variable. We find that the robust procedure outperforms some existing robust and non-robust methods when outliers are present, while producing competitive results when outliers are absent. In addition, the proposed method is computationally more efficient than some existing robust alternatives.
      PubDate: 2024-02-12
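The two-stage pipeline in this abstract (robust FPCA projection, then an M-type logistic fit on the scores) can be miniaturized. The sketch below fits a plain gradient-descent logistic regression on hypothetical one-dimensional principal-component scores; the robust FPCA step and the M-type downweighting of outliers are omitted, so this is only the non-robust skeleton of the second stage.

```python
import math

def fit_logistic(scores, labels, lr=0.1, iters=2000):
    """Plain gradient-descent logistic regression on 1-D scores.

    A robust M-type estimator (as in the paper) would additionally
    multiply each gradient term by a weight that shrinks for outlying
    observations; here every weight is 1.
    """
    b0, b1 = 0.0, 0.0
    n = len(scores)
    for _ in range(iters):
        g0 = g1 = 0.0
        for x, y in zip(scores, labels):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))
            g0 += (p - y) / n
            g1 += (p - y) * x / n
        b0 -= lr * g0
        b1 -= lr * g1
    return b0, b1

def predict(b0, b1, x):
    return 1 if b0 + b1 * x > 0 else 0

# Hypothetical principal-component scores for two classes.
xs = [-2.0, -1.5, -1.0, 1.0, 1.5, 2.0]
ys = [0, 0, 0, 1, 1, 1]
b0, b1 = fit_logistic(xs, ys)
```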
       
  • Neural networks with functional inputs for multi-class supervised
           classification of replicated point patterns


      Abstract: A spatial point pattern is a collection of points observed in a bounded region of the Euclidean plane or space. With the dynamic development of modern imaging methods, large datasets of point patterns have become available, representing, for example, sub-cellular location patterns for human proteins or large forest populations. The main goal of this paper is to show the possibility of solving the supervised multi-class classification task for this particular type of complex data via functional neural networks. To predict the class membership of a newly observed point pattern, we compute an empirical estimate of a selected functional characteristic and treat the estimated function as a functional variable entering the network. In a simulation study, we show that the neural network approach outperforms the kernel regression classifier, which we consider the benchmark method in the point pattern setting. We also analyse a real dataset of point patterns of intramembranous particles and illustrate the practical applicability of the proposed method.
      PubDate: 2024-02-07
       
  • k-means clustering for persistent homology


      Abstract: Persistent homology is a methodology central to topological data analysis that extracts and summarizes the topological features within a dataset as a persistence diagram. It has recently gained much popularity from its myriad successful applications to many domains; however, its algebraic construction induces a metric space of persistence diagrams with a highly complex geometry. In this paper, we prove convergence of the k-means clustering algorithm on persistence diagram space and establish theoretical properties of the solution to the optimization problem in the Karush–Kuhn–Tucker framework. Additionally, we perform numerical experiments on both simulated and real data for various representations of persistent homology, including embeddings of persistence diagrams as well as the diagrams themselves and their generalizations as persistence measures. We find that k-means clustering performed directly on persistence diagrams and persistence measures outperforms clustering of their vectorized representations.
      PubDate: 2024-01-31
       
  • RGA: a unified measure of predictive accuracy


      Abstract: A key point in assessing statistical forecasts is the evaluation of their predictive accuracy. Recently, a new measure, called Rank Graduation Accuracy (RGA), based on the concordance between the ranks of the predicted values and the ranks of the actual values of a series of observations to be forecast, was proposed for assessing the quality of predictions. In this paper, we demonstrate that, from a classification perspective, when the response to be predicted is binary, the RGA coincides with both the AUROC and the Wilcoxon–Mann–Whitney statistic, and can be employed to evaluate the accuracy of probability forecasts. When the response to be predicted is real valued, the RGA can still be applied, unlike the AUROC and similarly to measures such as the RMSE. Unlike the RMSE, however, the RGA evaluates point predictions in terms of their ranks rather than their values, improving robustness.
      PubDate: 2024-01-17
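The abstract's claim that, for a binary response, the RGA coincides with the AUROC and the Wilcoxon–Mann–Whitney statistic can be illustrated directly, since the AUROC is computable purely from ranks. A minimal sketch of that rank identity (not the general RGA formula):

```python
def auroc_by_ranks(scores, labels):
    """AUROC via the Wilcoxon-Mann-Whitney rank statistic:
    (sum of ranks of positives - n1*(n1+1)/2) / (n1*n0).
    Assumes no tied scores for simplicity."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = {idx: r + 1 for r, idx in enumerate(order)}  # 1-based ranks
    pos = [i for i, y in enumerate(labels) if y == 1]
    n1, n0 = len(pos), len(labels) - len(pos)
    rank_sum = sum(ranks[i] for i in pos)
    return (rank_sum - n1 * (n1 + 1) / 2) / (n1 * n0)

# One misranked pair out of four (pos, neg) pairs -> 3/4.
auc = auroc_by_ranks([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1])  # 0.75
```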
       
  • QDA classification of high-dimensional data with rare and weak signals


      Abstract: This paper addresses the two-class classification problem for data with rare and weak signals, under the modern high-dimensional setup \(p \gg n\). Considering a two-component mixture of Gaussian features with different random mean vectors of rare and weak signals but a common covariance matrix (homoscedastic Gaussian), Fan (AS 41:2537-2571, 2013) investigated the optimality of linear discriminant analysis (LDA) and proposed an efficient variable selection and classification procedure. We extend their work by incorporating the more general scenario in which the two components have different random covariance matrices whose difference consists of rare and weak signals, in order to assess the effect of the covariance difference on classification. Under this model, we investigate the behaviour of the quadratic discriminant analysis (QDA) classifier. On the theoretical side, we derive the successful and unsuccessful classification regions of QDA. For data with rare signals, variable selection can substantially improve the performance of statistical procedures. Thus, on the implementation side, we propose a variable selection procedure for QDA based on the Higher Criticism Thresholding (HCT) that was proven efficient for LDA. In addition, we conduct extensive simulation studies to demonstrate the successful and unsuccessful classification regions of QDA and to evaluate the effectiveness of the proposed HCT-thresholded QDA.
      PubDate: 2023-12-18
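The Higher Criticism Thresholding step mentioned here can be sketched. The snippet below implements one common form of the HC objective over sorted p-values and returns the maximizing p-value as the selection threshold; restricting the search to the smaller half of the p-values is a conventional choice, and the toy p-values are invented.

```python
import math

def hct_threshold(pvalues):
    """Higher Criticism threshold: the p-value maximizing
    HC(i) = sqrt(N) * (i/N - p_(i)) / sqrt(p_(i) * (1 - p_(i)))
    over the sorted p-values p_(1) <= ... <= p_(N).
    Features with p-values at or below the returned threshold are kept."""
    p = sorted(pvalues)
    N = len(p)
    best_hc, best_p = -math.inf, p[0]
    for i in range(1, N // 2 + 1):  # conventional: search the smaller half
        pi = p[i - 1]
        if pi <= 0.0 or pi >= 1.0:
            continue
        hc = math.sqrt(N) * (i / N - pi) / math.sqrt(pi * (1 - pi))
        if hc > best_hc:
            best_hc, best_p = hc, pi
    return best_p

# Two strong signals (tiny p-values) among six nulls:
pvals = [1e-6, 2e-6, 0.3, 0.5, 0.6, 0.7, 0.8, 0.9]
t = hct_threshold(pvals)
```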
       
  • Loss-guided stability selection


      Abstract: In modern data analysis, sparse model selection becomes inevitable once the number of predictor variables is very high. It is well known that model selection procedures like the Lasso or Boosting tend to overfit on real data. The celebrated Stability Selection overcomes these weaknesses by aggregating models based on subsamples of the training data and then choosing a stable predictor set, which is usually much sparser than the predictor sets from the raw models. The standard Stability Selection is based on a global criterion, namely the per-family error rate, and additionally requires expert knowledge to suitably configure the hyperparameters. Since model selection depends on the loss function, i.e., predictor sets selected w.r.t. one loss function differ from those selected w.r.t. another, we propose a Stability Selection variant that respects the chosen loss function via an additional validation step based on out-of-sample validation data, optionally enhanced with an exhaustive search strategy. Our Stability Selection variants are widely applicable and user-friendly. Moreover, they avoid the severe underfitting that affects the original Stability Selection on noisy high-dimensional data: our priority is not to avoid false positives at all costs but to obtain a sparse, stable model with which one can make predictions. Experiments considering both regression and binary classification, with Boosting as the model selection algorithm, reveal a significant precision improvement compared to raw Boosting models while not suffering from any of the mentioned issues of the original Stability Selection.
      PubDate: 2023-12-15
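The aggregation idea behind Stability Selection (run a base selector on many subsamples, keep variables that are selected often) fits in a few lines. In this sketch the base selector is a stand-in, top-k absolute covariance with the response, rather than the Boosting or Lasso selectors discussed in the abstract, and all constants are illustrative.

```python
import random

def select_top_cov(rows, y, k):
    """Stand-in base selector: rank variables by |covariance with y|."""
    n, p = len(rows), len(rows[0])
    ybar = sum(y) / n
    scores = []
    for j in range(p):
        xbar = sum(r[j] for r in rows) / n
        cov = sum((r[j] - xbar) * (yi - ybar) for r, yi in zip(rows, y)) / n
        scores.append((abs(cov), j))
    return [j for _, j in sorted(scores, reverse=True)[:k]]

def stability_selection(data, labels, n_sub=100, frac=0.5, top_k=2, seed=0):
    """Run the base selector on n_sub random subsamples and return,
    per variable, the fraction of subsamples in which it was selected."""
    rng = random.Random(seed)
    n, p = len(data), len(data[0])
    counts = [0] * p
    for _ in range(n_sub):
        idx = rng.sample(range(n), int(frac * n))
        for j in select_top_cov([data[i] for i in idx],
                                [labels[i] for i in idx], top_k):
            counts[j] += 1
    return [c / n_sub for c in counts]

# Only variable 0 drives the response; it should be selected almost always.
rng = random.Random(1)
X = [[rng.gauss(0, 1) for _ in range(5)] for _ in range(60)]
y = [row[0] + rng.gauss(0, 0.1) for row in X]
freq = stability_selection(X, y)
```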
       
  • A fresh look at mean-shift based modal clustering


      Abstract: Modal clustering is an unsupervised learning technique where cluster centers are identified as the local maxima of nonparametric probability density estimates. A natural algorithmic engine for the computation of these maxima is the mean shift procedure, which is essentially an iteratively computed chain of local means. We revisit this technique, focusing on its link to kernel density gradient estimation, and in doing so propose a novel approach to bandwidth selection based on the concept of a critical bandwidth. Furthermore, in the one-dimensional case, an inverse version of the mean shift is developed to provide a novel approach to the estimation of antimodes, which is then used to identify cluster boundaries. A simulation study assesses, in the univariate case, the classification accuracy of the mean-shift-based clustering approach. Three (univariate and multivariate) examples from the fields of philately, engineering, and imaging illustrate how modal clusterings identified through mean shift based methods relate directly and naturally to physical properties of the data-generating system. Solutions are proposed to deal computationally efficiently with large data sets.
      PubDate: 2023-12-14
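The mean shift engine described here, an iteratively computed chain of local means, is short enough to show in full for the one-dimensional Gaussian-kernel case. The bandwidth and data below are illustrative; the paper's critical-bandwidth selection and the inverse mean shift for antimodes are not reproduced.

```python
import math

def mean_shift_mode(x0, data, h=0.5, iters=200):
    """One-dimensional mean shift: repeatedly replace x by the
    Gaussian-kernel-weighted local mean of the data, so that x settles
    at a mode of the kernel density estimate with bandwidth h."""
    x = x0
    for _ in range(iters):
        w = [math.exp(-((x - xi) ** 2) / (2 * h * h)) for xi in data]
        x = sum(wi * xi for wi, xi in zip(w, data)) / sum(w)
    return x

# Two well-separated groups: starts near 0 and near 5 find the two modes.
data = [-0.1, 0.0, 0.1, 4.9, 5.0, 5.1]
lo_mode = mean_shift_mode(0.3, data)
hi_mode = mean_shift_mode(4.5, data)
```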
       
  • A probabilistic method for reconstructing the Foreign Direct Investments
           network in search of ultimate host economies


      Abstract: The Ultimate Host Economies (UHEs) of a given country are defined as the ultimate destinations of Foreign Direct Investment (FDI) originating in that country. Bilateral FDI statistics struggle to identify them because of the non-negligible presence of conduit jurisdictions, whose favorable tax regimes make them attractive intermediate destinations for pass-through investments. At the same time, determining UHEs is crucial for understanding the actual paths followed by FDI among increasingly interdependent economies. In this paper, we first reconstruct the global FDI network through mirroring and clustering techniques, starting from data collected by the International Monetary Fund. Then we provide a method for computing an (approximate) distribution of the UHEs of a country by applying a probabilistic approach, based on Markov chains, to this network. More specifically, we analyze the Italian case.
      PubDate: 2023-12-08
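The Markov-chain idea can be illustrated on a toy network: the investment mass leaving an origin country is pushed through the transition probabilities until it is absorbed in ultimate host economies. The three-country network below is invented for illustration; the actual method operates on the reconstructed global FDI network.

```python
def ultimate_host_distribution(transitions, origin, absorbing, iters=200):
    """Push the full investment mass from `origin` through the transition
    map until (almost) all of it sits in absorbing ultimate-host states.
    `transitions[s]` maps a state to {next_state: probability}."""
    mass = {origin: 1.0}
    absorbed = {a: 0.0 for a in absorbing}
    for _ in range(iters):
        new_mass = {}
        for s, m in mass.items():
            for t, p in transitions[s].items():
                if t in absorbed:
                    absorbed[t] += m * p
                else:
                    new_mass[t] = new_mass.get(t, 0.0) + m * p
        mass = new_mass
    return absorbed

# Hypothetical network: country O routes half its FDI through conduit C.
transitions = {
    "O": {"C": 0.5, "A": 0.5},
    "C": {"A": 0.6, "B": 0.4},
}
dist = ultimate_host_distribution(transitions, "O", absorbing={"A", "B"})
# A receives 0.5 + 0.5*0.6 = 0.8 of the mass, B the remaining 0.2.
```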
       
  • Variational inference for semiparametric Bayesian novelty detection in
           large datasets


      Abstract: After being trained on a fully labeled training set, where the observations are grouped into a certain number of known classes, novelty detection methods aim to classify the instances of an unlabeled test set while allowing for the presence of previously unseen classes. These models are valuable in many areas, ranging from social network and food adulteration analyses to biology, where an evolving population may be present. In this paper, we focus on a two-stage Bayesian semiparametric novelty detector, also known as Brand, recently introduced in the literature. Leveraging a model-based mixture representation, Brand allows clustering the test observations into known training terms or a single novelty term. Furthermore, the novelty term is modeled with a Dirichlet Process mixture model to flexibly capture any departure from the known patterns. Brand was originally estimated using MCMC schemes, which are prohibitively costly when applied to high-dimensional data. To scale up Brand's applicability to large datasets, we propose resorting to a variational Bayes approach, providing an efficient algorithm for posterior approximation. We demonstrate a significant gain in efficiency and excellent classification performance through thorough simulation studies. Finally, to showcase its applicability, we perform a novelty detection analysis using the openly available Statlog dataset, a large collection of satellite imaging spectra, to search for novel soil types.
      PubDate: 2023-12-04
       
  • Sparse correspondence analysis for large contingency tables


      Abstract: We propose sparse variants of correspondence analysis (CA) for large contingency tables, such as the document-term matrices used in text mining. By seeking to obtain many zero coefficients, sparse CA remedies the difficulty of interpreting CA results when the size of the table is large. Since CA is a doubly weighted PCA (for rows and columns), or equivalently a weighted generalized SVD, we adapt known sparse versions of these methods, with specific developments to obtain orthogonal solutions and to tune the sparseness parameters. We distinguish two cases depending on whether sparseness is required for both rows and columns or for only one of the two.
      PubDate: 2023-12-01
      DOI: 10.1007/s11634-022-00531-5
       
  • Monitoring photochemical pollutants based on symbolic interval-valued data
           analysis


      Abstract: This study considers monitoring photochemical pollutants for anomaly detection based on symbolic interval-valued data analysis. For this task, we construct control charts based on the principal component scores of symbolic interval-valued data. Herein, the symbolic interval-valued data are assumed to follow a normal distribution, and an approximate expectation formula for order statistics from the normal distribution is used in the univariate case to estimate the mean and variance via the method of moments. In addition, we consider the bivariate case, wherein we use the maximum likelihood estimator calculated from the likelihood function derived under a bivariate copula. We also establish procedures for the statistical control chart based on univariate and bivariate interval-valued variables, and these procedures are potentially extendable to higher-dimensional cases. Monte Carlo simulations and real data analysis using photochemical pollutants confirm the validity of the proposed method. In particular, the results show its superiority over the conventional method, which uses averages, in identifying the date on which the abnormal maximum occurred.
      PubDate: 2023-12-01
      DOI: 10.1007/s11634-022-00527-1
       
  • A power-controlled reliability assessment for multi-class probabilistic
           classifiers


      Abstract: In multi-class classification, the output of a probabilistic classifier is a probability distribution over the classes. In this work, we focus on a statistical assessment of the reliability of probabilistic classifiers for multi-class problems. Our approach generates a Pearson \(\chi ^2\) statistic based on the k-nearest neighbors in the prediction space. Further, we develop a Bayesian approach for estimating the expected power of the reliability test, which can be used to determine an appropriate sample size k. We propose a sampling algorithm and demonstrate that this algorithm obtains a valid prior distribution. The effectiveness of the proposed reliability test and expected power is evaluated through a simulation study. We also provide illustrative examples of the proposed methods with practical applications.
      PubDate: 2023-12-01
      DOI: 10.1007/s11634-022-00528-0
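One plausible reading of the k-nearest-neighbor Pearson \(\chi ^2\) construction can be sketched as follows: for a given instance, collect its k nearest neighbors in the space of predicted probability vectors, then compare observed class counts against the summed predicted probabilities. This is an illustrative reconstruction, not the authors' exact statistic, and the toy predictions are invented.

```python
def knn_chi2(predictions, outcomes, i, k):
    """Pearson chi-squared reliability sketch for instance i: among its
    k nearest neighbours in prediction space, compare observed class
    counts (from `outcomes`) with expected counts (summed predicted
    class probabilities).  Small values suggest well-calibrated output."""
    def d2(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))
    neigh = sorted(range(len(predictions)),
                   key=lambda j: d2(predictions[i], predictions[j]))[:k]
    n_classes = len(predictions[0])
    observed = [0.0] * n_classes
    expected = [0.0] * n_classes
    for j in neigh:
        observed[outcomes[j]] += 1
        for c in range(n_classes):
            expected[c] += predictions[j][c]
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)

# Toy two-class predictions that are roughly calibrated:
preds = [[0.9, 0.1], [0.8, 0.2], [0.85, 0.15], [0.2, 0.8], [0.1, 0.9]]
outs = [0, 0, 0, 1, 1]
stat = knn_chi2(preds, outs, 0, k=3)
```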
       
  • Attraction-repulsion clustering: a way of promoting diversity linked to
           demographic parity in fair clustering


      Abstract: We consider the problem of diversity-enhancing clustering, i.e., developing clustering methods which produce clusters that favour diversity with respect to a set of protected attributes such as race, sex, and age. In the context of fair clustering, diversity plays a major role when fairness is understood as demographic parity. To promote diversity, we introduce perturbations to the distance in the unprotected attributes that account for the protected attributes in a way that resembles the attraction-repulsion of charged particles in physics. These perturbations are defined through dissimilarities with a tractable interpretation. Cluster analysis based on attraction-repulsion dissimilarities penalizes homogeneity of the clusters with respect to the protected attributes and leads to an improvement in diversity. An advantage of our approach, which falls into a pre-processing set-up, is its compatibility with a wide variety of clustering methods and with non-Euclidean data. We illustrate the use of our procedures with both synthetic and real data and discuss the relation between diversity, fairness, and cluster structure.
      PubDate: 2023-12-01
      DOI: 10.1007/s11634-022-00516-4
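A minimal version of the attraction-repulsion perturbation might look like this: inflate within-group distances (repulsion) and shrink between-group distances (attraction) in the unprotected attributes, so that downstream clustering is pushed toward group-diverse clusters. The multiplicative form and the delta value are illustrative assumptions, not the dissimilarities defined in the paper.

```python
def attraction_repulsion(xi, xj, gi, gj, delta=0.5):
    """Perturbed dissimilarity on the unprotected attributes xi, xj:
    points sharing protected group (gi == gj) repel (distance inflated),
    points from different groups attract (distance shrunk)."""
    base = sum((a - b) ** 2 for a, b in zip(xi, xj)) ** 0.5
    return base * (1 + delta) if gi == gj else base * (1 - delta)

# Same Euclidean distance (5.0), opposite perturbations:
same = attraction_repulsion([0, 0], [3, 4], "F", "F")  # repelled: 7.5
diff = attraction_repulsion([0, 0], [3, 4], "F", "M")  # attracted: 2.5
```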
       
  • Proximal methods for sparse optimal scoring and discriminant analysis


      Abstract: Linear discriminant analysis (LDA) is a classical method for dimensionality reduction, where discriminant vectors are sought to project data to a lower-dimensional space for optimal separability of classes. Several recent papers have outlined strategies, based on exploiting sparsity of the discriminant vectors, for performing LDA in the high-dimensional setting where the number of features exceeds the number of observations in the data. However, many of these proposed methods lack scalable procedures for solving the underlying optimization problems. We consider an optimization scheme for solving the sparse optimal scoring formulation of LDA based on block coordinate descent. Each iteration of this algorithm requires an update of a scoring vector, which admits an analytic formula, and an update of the corresponding discriminant vector, which requires solving a convex subproblem; we propose several variants of this algorithm in which the proximal gradient method or the alternating direction method of multipliers is used to solve this subproblem. We show that the per-iteration cost of these methods scales linearly in the dimension of the data provided restricted regularization terms are employed, and cubically in the dimension of the data in the worst case. Furthermore, we establish that when this block coordinate descent framework generates convergent subsequences of iterates, these subsequences converge to stationary points of the sparse optimal scoring problem. We demonstrate the effectiveness of our new methods with empirical results for classification of Gaussian data and data sets drawn from benchmarking repositories, including time-series and multispectral X-ray data, and provide Matlab and R implementations of our optimization schemes.
      PubDate: 2023-12-01
      DOI: 10.1007/s11634-022-00530-6
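The proximal-gradient solver mentioned in the abstract is built around the soft-thresholding operator, the proximal map of the \(\ell_1\) penalty, which is what produces exact zeros in the discriminant vector. A minimal sketch of the operator itself (the surrounding gradient steps and the optimal-scoring objective are not reproduced):

```python
def soft_threshold(v, t):
    """Proximal operator of t*||.||_1: shrink each coordinate toward
    zero by t, zeroing coordinates whose magnitude is below t.  Inside a
    proximal gradient loop this is applied after every gradient step."""
    return [max(abs(x) - t, 0.0) * (1 if x > 0 else -1) for x in v]

# Large coordinates shrink, small ones are set exactly to zero:
res = soft_threshold([3.0, -0.5, 1.2], 1.0)
```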
       
  • Determinantal consensus clustering


      Abstract: Random restarts of a given algorithm produce many partitions that can be aggregated to yield a consensus clustering. Ensemble methods have been recognized as more robust approaches to data clustering than single clustering algorithms. We propose the use of determinantal point processes (DPPs) for the random restarts of clustering algorithms based on initial sets of center points, such as k-medoids or k-means. The relation between DPPs and kernel-based methods makes DPPs suitable for describing and quantifying similarity between objects. DPPs favor diversity of the center points in initial sets, so that sets with similar points have less chance of being generated than sets with very distinct points. Most current initial sets are generated with center points sampled uniformly at random. We show through extensive simulations that, contrary to DPPs, this technique fails both to ensure diversity and to obtain a good coverage of all data facets, two key properties behind the good performance of DPPs. Simulations with artificial datasets and applications to real datasets show that determinantal consensus clustering outperforms consensus clusterings based on uniform random sampling of center points.
      PubDate: 2023-12-01
      DOI: 10.1007/s11634-022-00514-6
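The contrast drawn here, diversity-seeking initialization versus uniform random restarts, can be illustrated with a greedy farthest-point rule: a cheap deterministic stand-in for DPP diversity. A real DPP samples subsets with probability proportional to a kernel determinant, which is not reproduced here; like a DPP, though, this rule avoids near-duplicate centers that uniform sampling happily produces.

```python
def diverse_init(points, k):
    """Greedy max-min (farthest-point) choice of k initial centers:
    repeatedly add the point whose minimum distance to the already
    chosen centers is largest.  Near-duplicates are never both picked."""
    def d2(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))
    chosen = [0]  # start from the first point
    while len(chosen) < k:
        best = max((j for j in range(len(points)) if j not in chosen),
                   key=lambda j: min(d2(points[j], points[c]) for c in chosen))
        chosen.append(best)
    return chosen

# Points 0 and 1 are near-duplicates; the three chosen centers
# cover the three well-separated regions instead.
pts = [(0, 0), (0.1, 0.1), (5, 5), (5.1, 4.9), (10, 0)]
centers = diverse_init(pts, 3)  # [0, 4, 2]
```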
       
  • Robust instance-dependent cost-sensitive classification


      Abstract: Instance-dependent cost-sensitive (IDCS) learning methods have proven useful for binary classification tasks where individual instances are associated with variable misclassification costs. However, we demonstrate in this paper by means of a series of experiments that IDCS methods are sensitive to noise and outliers in relation to instance-dependent misclassification costs and their performance strongly depends on the cost distribution of the data sample. Therefore, we propose a generic three-step framework to make IDCS methods more robust: (i) detect outliers automatically, (ii) correct outlying cost information in a data-driven way, and (iii) construct an IDCS learning method using the adjusted cost information. We apply this framework to cslogit, a logistic regression-based IDCS method, to obtain its robust version, which we name r-cslogit. The robustness of this approach is introduced in steps (i) and (ii), where we make use of robust estimators to detect and impute outlying costs of individual instances. The newly proposed r-cslogit method is tested on synthetic and semi-synthetic data and proven to be superior in terms of savings compared to its non-robust counterpart for variable levels of noise and outliers. All our code is made available online at https://github.com/SimonDeVos/Robust-IDCS.
      PubDate: 2023-12-01
      DOI: 10.1007/s11634-022-00533-3
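Steps (i) and (ii) of the framework, detecting outlying costs and correcting them in a data-driven way, can be sketched with a median/MAD rule. The cutoff of 3 robust standard deviations and the capping (winsorizing) correction are illustrative choices, not necessarily those made in r-cslogit.

```python
def robustify_costs(costs, cutoff=3.0):
    """Steps (i)-(ii) in miniature: flag costs whose robust z-score
    (median/MAD based) exceeds `cutoff`, and cap them at the cutoff
    boundary.  The factor 1.4826 rescales the MAD to be consistent
    with the standard deviation under normality."""
    s = sorted(costs)
    n = len(s)
    med = (s[n // 2] + s[(n - 1) // 2]) / 2
    devs = sorted(abs(c - med) for c in costs)
    mad = 1.4826 * (devs[n // 2] + devs[(n - 1) // 2]) / 2
    upper = med + cutoff * mad
    lower = med - cutoff * mad
    return [min(max(c, lower), upper) for c in costs]

# One grossly outlying misclassification cost gets capped:
costs = [10, 12, 11, 9, 10, 500]
capped = robustify_costs(costs)
```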
       
  • Clustering data with non-ignorable missingness using semi-parametric
           mixture models assuming independence within components


      Abstract: We propose a semi-parametric clustering model assuming conditional independence given the component. One advantage is that this model can handle non-ignorable missingness. The model defines each component as a product of univariate probability distributions but makes no assumption on the form of each univariate density. Note that the mixture model is used for clustering, not for estimating the density of the full variables (observed and unobserved). Estimation is performed by maximizing an extension of the smoothed likelihood that allows for missingness. This optimization is achieved by a Minorization-Maximization (MM) algorithm. We illustrate the relevance of our approach by numerical experiments conducted on simulated data. Under mild assumptions, we show the identifiability of the model defining the distribution of the observed data and the monotonicity of the algorithm. We also propose an extension of this new method to the case of mixed-type data, which we illustrate on a real data set. The proposed method is implemented in the R package MNARclust, available on CRAN.
      PubDate: 2023-12-01
      DOI: 10.1007/s11634-023-00534-w
       
  • LASSO regularization within the LocalGLMnet architecture


      Abstract: Deep learning models have been very successful in applications of machine learning methods, often outperforming classical statistical models such as linear regression models or generalized linear models. On the other hand, deep learning models are often criticized for being neither explainable nor amenable to variable selection. There are two different ways of dealing with this problem: either we use post-hoc model interpretability methods, or we design specific deep learning architectures that allow for easier interpretation and explanation. This paper builds on our previous work on the LocalGLMnet architecture, which gives an interpretable deep learning architecture. In the present paper, we show how group LASSO regularization (and other regularization schemes) can be implemented within the LocalGLMnet architecture so as to obtain feature sparsity for variable selection. We benchmark our approach against the recently developed LassoNet of Lemhadri et al. (LassoNet: a neural network with feature sparsity. J Mach Learn Res 22:1-29, 2021).
      PubDate: 2023-12-01
      DOI: 10.1007/s11634-022-00529-z
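The group LASSO regularization mentioned here acts through the group soft-thresholding operator: it shrinks an entire coefficient group in Euclidean norm and zeroes the group out wholesale when its norm is small, which is what switches a feature off completely. A minimal sketch of the operator itself, detached from any network architecture:

```python
import math

def group_soft_threshold(v, lam):
    """Group-LASSO proximal step on one coefficient group `v`:
    shrink the whole group toward zero by `lam` in Euclidean norm,
    returning the all-zero group when its norm falls below `lam`."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm <= lam:
        return [0.0] * len(v)
    scale = 1.0 - lam / norm
    return [scale * x for x in v]

# A strong group survives (shrunk), a weak group is eliminated entirely:
g = group_soft_threshold([3.0, 4.0], 1.0)            # about [2.4, 3.2]
weak = group_soft_threshold([0.3, 0.4], 1.0)         # [0.0, 0.0]
```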
       
  • Claims fraud detection with uncertain labels


      Abstract: Insurance fraud is a non-self-revealing type of fraud: the true historical labels (fraud or legitimate) are only as precise as the investigators' efforts and successes in uncovering them. Popular approaches of supervised and unsupervised learning fail to capture the ambiguous nature of uncertain labels. Imprecisely observed labels can be represented in the Dempster-Shafer theory of belief functions, a generalization of supervised and unsupervised learning suited to representing uncertainty. In this paper, we show that partial information from historical investigations can add valuable, learnable information to the fraud detection system and improve its performance. We also show that belief function theory provides a flexible mathematical framework for concept drift detection and cost-sensitive learning, two common challenges in fraud detection. Finally, we present an application to real-world motor insurance claims fraud.
      PubDate: 2023-11-30
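The belief-function machinery referred to here combines pieces of evidence with Dempster's rule of combination. The sketch below combines two invented mass functions over the frame {fraud, legit}, say an investigator's partial finding and a model's uncertain verdict; the numbers are purely illustrative.

```python
def dempster_combine(m1, m2):
    """Dempster's rule of combination for two mass functions over the
    same frame.  Focal sets are frozensets; mass landing on the empty
    intersection is conflict and is renormalized away."""
    combined = {}
    conflict = 0.0
    for a, ma in m1.items():
        for b, mb in m2.items():
            inter = a & b
            if inter:
                combined[inter] = combined.get(inter, 0.0) + ma * mb
            else:
                conflict += ma * mb
    return {s: v / (1.0 - conflict) for s, v in combined.items()}

F, L = frozenset({"fraud"}), frozenset({"legit"})
FL = F | L  # total ignorance: mass on the whole frame

m_investigation = {F: 0.6, FL: 0.4}   # partial evidence of fraud
m_model = {F: 0.5, L: 0.3, FL: 0.2}   # model's uncertain verdict
m = dempster_combine(m_investigation, m_model)
```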
       
  • Editorial for ADAC issue 4 of volume 17 (2023)


      PubDate: 2023-10-14
      DOI: 10.1007/s11634-023-00564-4
       
 
JournalTOCs
School of Mathematical and Computer Sciences
Heriot-Watt University
Edinburgh, EH14 4AS, UK
Email: journaltocs@hw.ac.uk
Tel: +00 44 (0)131 4513762
 


 

JournalTOCs © 2009-