- Variational inference for semiparametric Bayesian novelty detection in large datasets
Abstract: After being trained on a fully labeled training set, where the observations are grouped into a certain number of known classes, novelty detection methods aim to classify the instances of an unlabeled test set while allowing for the presence of previously unseen classes. These models are valuable in many areas, ranging from social network and food adulteration analyses to biology, where an evolving population may be present. In this paper, we focus on a two-stage Bayesian semiparametric novelty detector, known as Brand, recently introduced in the literature. Leveraging a model-based mixture representation, Brand allows clustering the test observations into known training terms or a single novelty term. Furthermore, the novelty term is modeled with a Dirichlet process mixture model to flexibly capture any departure from the known patterns. Brand was originally estimated using MCMC schemes, which are prohibitively costly when applied to high-dimensional data. To scale Brand up to large datasets, we propose a variational Bayes approach, providing an efficient algorithm for posterior approximation. We demonstrate a significant gain in efficiency and excellent classification performance in thorough simulation studies. Finally, to showcase its applicability, we perform a novelty detection analysis using the openly available Statlog dataset, a large collection of satellite imaging spectra, to search for novel soil types.
PubDate: 2023-12-04
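A minimal two-stage sketch of the idea, assuming scikit-learn and SciPy: Gaussian densities for the known classes, a likelihood threshold to route test points to the novelty term, and scikit-learn's variational Dirichlet process mixture (BayesianGaussianMixture) for the novelty term. This is not Brand itself, which fits one joint semiparametric model and approximates its posterior directly; the threshold and priors below are arbitrary illustrative choices.

```python
import numpy as np
from scipy.stats import multivariate_normal
from sklearn.mixture import BayesianGaussianMixture

rng = np.random.default_rng(0)

# Training set: two known classes; test set contains a third, unseen class.
X_train = np.vstack([rng.normal(0, 1, (200, 2)), rng.normal(5, 1, (200, 2))])
y_train = np.repeat([0, 1], 200)
X_test = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2)),
                    rng.normal([-5, 8], 0.5, (50, 2))])   # novelty

# Stage 1: Gaussian density per known class, estimated on the labelled data.
classes = np.unique(y_train)
dens = [multivariate_normal(X_train[y_train == c].mean(0),
                            np.cov(X_train[y_train == c].T)) for c in classes]
log_lik = np.column_stack([d.logpdf(X_test) for d in dens])

# Points with low likelihood under every known class go to the novelty term
# (threshold chosen by eye for this sketch).
is_novel = log_lik.max(axis=1) < -10.0

# Stage 2: variational DP mixture on the novelty term, to discover how many
# new sub-populations it contains.
dp = BayesianGaussianMixture(
    n_components=10,
    weight_concentration_prior_type="dirichlet_process",
    random_state=0).fit(X_test[is_novel])
print("novel points:", is_novel.sum(),
      "active novelty clusters:", np.sum(dp.weights_ > 0.05))
```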
- Sparse correspondence analysis for large contingency tables
Abstract: We propose sparse variants of correspondence analysis (CA) for large contingency tables such as the document-term matrices used in text mining. By seeking many zero coefficients, sparse CA remedies the difficulty of interpreting CA results when the table is large. Since CA is a doubly weighted PCA (for rows and columns), or equivalently a weighted generalized SVD, we adapt known sparse versions of these methods, with specific developments to obtain orthogonal solutions and to tune the sparseness parameters. We distinguish two cases depending on whether sparseness is required for both rows and columns, or for only one set.
PubDate: 2023-12-01
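Since CA amounts to an SVD of the standardized residual matrix, the core idea can be sketched in a few lines, assuming NumPy: compute the residuals, take the SVD, and soft-threshold the singular vectors. The paper's method goes further (orthogonality of successive sparse axes, principled tuning of the sparseness parameters); the fixed threshold below is only illustrative.

```python
import numpy as np

N = np.array([[30, 5, 2], [4, 40, 6], [1, 7, 55], [10, 2, 3]], float)
P = N / N.sum()
r, c = P.sum(1), P.sum(0)                           # row / column masses
S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))  # standardized residuals

U, d, Vt = np.linalg.svd(S, full_matrices=False)

def soft(x, lam):
    """Soft-thresholding operator, the proximal map of the L1 norm."""
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

# Sparse first axis: threshold, then renormalize (plain CA keeps U[:, 0] as is).
u1 = soft(U[:, 0], 0.1); u1 /= np.linalg.norm(u1)
v1 = soft(Vt[0], 0.1);   v1 /= np.linalg.norm(v1)
print("sparse row loadings:", np.round(u1, 2))
print("sparse col loadings:", np.round(v1, 2))
```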
- Monitoring photochemical pollutants based on symbolic interval-valued data analysis
Abstract: This study considers monitoring photochemical pollutants for anomaly detection based on symbolic interval-valued data analysis. For this task, we construct control charts based on the principal component scores of symbolic interval-valued data. The symbolic interval-valued data are assumed to follow a normal distribution, and in the univariate case an approximate expectation formula for order statistics from the normal distribution is used to estimate the mean and variance via the method of moments. In the bivariate case, we use the maximum likelihood estimator calculated from the likelihood function derived under a bivariate copula. We establish the procedures for the statistical control chart based on univariate and bivariate interval-valued variables, and the procedures are potentially extendable to higher-dimensional cases. Monte Carlo simulations and real data analysis using photochemical pollutants confirm the validity of the proposed method. In particular, the results show its superiority over the conventional method, which uses averages, in identifying the date on which the abnormal maximum occurred.
PubDate: 2023-12-01
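A rough sketch of the chart construction, assuming NumPy/SciPy: the paper estimates the interval model via order-statistics moments and a copula, which is omitted here; this toy version reduces intervals to midpoints, computes principal component scores, and monitors them with a Hotelling T² statistic against an approximate chi-square limit.

```python
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(1)
lo = rng.normal(0, 1, (100, 3))
hi = lo + rng.uniform(0.5, 1.5, (100, 3))
mid = (lo + hi) / 2                      # interval midpoints

# PCA via eigendecomposition of the covariance of the (in-control) midpoints.
mu = mid.mean(0)
vals, vecs = np.linalg.eigh(np.cov(mid.T))
scores = (mid - mu) @ vecs[:, -2:]       # two leading components

# Hotelling T^2 on the scores with an (approximate) chi-square control limit.
T2 = np.sum(scores**2 / vals[-2:], axis=1)
UCL = chi2.ppf(0.99, df=2)
print("out-of-control points:", np.where(T2 > UCL)[0])
```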
- A power-controlled reliability assessment for multi-class probabilistic classifiers
Abstract: In multi-class classification, the output of a probabilistic classifier is a probability distribution over the classes. In this work, we focus on a statistical assessment of the reliability of probabilistic classifiers for multi-class problems. Our approach generates a Pearson \(\chi^2\) statistic based on the k-nearest neighbors in the prediction space. Further, we develop a Bayesian approach for estimating the expected power of the reliability test, which can be used to choose an appropriate sample size k. We propose a sampling algorithm and demonstrate that this algorithm obtains a valid prior distribution. The effectiveness of the proposed reliability test and expected power is evaluated through a simulation study. We also provide illustrative examples of the proposed methods with practical applications.
PubDate: 2023-12-01
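The basic construction can be sketched as follows, assuming scikit-learn: within the k-nearest neighborhood of each point in the prediction (probability) space, observed class counts should match the summed predicted probabilities if the classifier is reliable, and a Pearson χ² statistic measures the discrepancy. The exact test calibration and the Bayesian expected-power analysis are the paper's contribution and are not reproduced here.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(2)
n, k, n_classes = 500, 50, 3
probs = rng.dirichlet(alpha=[2, 2, 2], size=n)             # classifier outputs
y = np.array([rng.choice(n_classes, p=p) for p in probs])  # calibrated labels

nn = NearestNeighbors(n_neighbors=k).fit(probs)
_, idx = nn.kneighbors(probs)

chi2_stats = []
for neigh in idx:
    expected = probs[neigh].sum(axis=0)                    # expected counts
    observed = np.bincount(y[neigh], minlength=n_classes)
    chi2_stats.append(np.sum((observed - expected) ** 2 / expected))

# Under reliability the local statistic behaves roughly like chi-square(2).
print("mean local chi-square:", np.mean(chi2_stats))
```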
- Attraction-repulsion clustering: a way of promoting diversity linked to demographic parity in fair clustering
Abstract: We consider the problem of diversity-enhancing clustering, i.e., developing clustering methods that produce clusters favouring diversity with respect to a set of protected attributes such as race, sex, or age. In the context of fair clustering, diversity plays a major role when fairness is understood as demographic parity. To promote diversity, we introduce perturbations to the distance in the unprotected attributes that account for the protected attributes, in a way that resembles the attraction-repulsion of charged particles in physics. These perturbations are defined through dissimilarities with a tractable interpretation. Cluster analysis based on attraction-repulsion dissimilarities penalizes homogeneity of the clusters with respect to the protected attributes and leads to an improvement in diversity. An advantage of our approach, which falls into a pre-processing set-up, is its compatibility with a wide variety of clustering methods and with non-Euclidean data. We illustrate the use of our procedures with both synthetic and real data and discuss the relation between diversity, fairness, and cluster structure.
PubDate: 2023-12-01
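A minimal sketch of the pre-processing idea, assuming SciPy: inflate distances between points sharing a protected attribute (repulsion) and shrink distances across groups (attraction), then feed the perturbed dissimilarities to any standard clustering method. The multiplicative factor below is an arbitrary stand-in for the paper's dissimilarities.

```python
import numpy as np
from scipy.spatial.distance import squareform, pdist
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(3)
X = rng.normal(size=(60, 2))                 # unprotected attributes
g = rng.integers(0, 2, size=60)              # protected attribute (2 groups)

D = squareform(pdist(X))
same = (g[:, None] == g[None, :]).astype(float)
lam = 0.5
# repulse within a protected group, attract across groups
D_fair = D * (1 + lam * same) * (1 - lam * (1 - same))

labels = fcluster(linkage(squareform(D_fair, checks=False), method="average"),
                  t=3, criterion="maxclust")
for c in np.unique(labels):
    print(f"cluster {c}: group balance = {g[labels == c].mean():.2f}")
```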
- Proximal methods for sparse optimal scoring and discriminant analysis
Abstract: Linear discriminant analysis (LDA) is a classical method for dimensionality reduction, where discriminant vectors are sought to project data to a lower-dimensional space for optimal separability of classes. Several recent papers have outlined strategies, based on exploiting sparsity of the discriminant vectors, for performing LDA in the high-dimensional setting where the number of features exceeds the number of observations. However, many of these proposals lack scalable methods for solving the underlying optimization problems. We consider an optimization scheme for the sparse optimal scoring formulation of LDA based on block coordinate descent. Each iteration of this algorithm requires an update of a scoring vector, which admits an analytic formula, and an update of the corresponding discriminant vector, which requires the solution of a convex subproblem; we propose several variants of this algorithm in which the proximal gradient method or the alternating direction method of multipliers is used to solve this subproblem. We show that the per-iteration cost of these methods scales linearly in the dimension of the data provided restricted regularization terms are employed, and cubically in the dimension of the data in the worst case. Furthermore, we establish that when this block coordinate descent framework generates convergent subsequences of iterates, these subsequences converge to stationary points of the sparse optimal scoring problem. We demonstrate the effectiveness of our new methods with empirical results for classification of Gaussian data and data sets drawn from benchmarking repositories, including time-series and multispectral X-ray data, and provide Matlab and R implementations of our optimization schemes.
PubDate: 2023-12-01
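A sketch of the proximal gradient variant for the discriminant-vector subproblem, assuming NumPy: the subproblem is a least-squares term plus an l1 penalty, whose proximal map is soft-thresholding (ISTA). The scoring-vector update, other penalties, and the ADMM variant are omitted.

```python
import numpy as np

def soft(x, t):
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def ista_discriminant(X, y_theta, lam, n_iter=500):
    """Minimize (1/2n)||y_theta - X b||^2 + lam * ||b||_1 by ISTA."""
    n, p = X.shape
    b = np.zeros(p)
    L = np.linalg.norm(X, 2) ** 2 / n          # Lipschitz constant of grad
    for _ in range(n_iter):
        grad = X.T @ (X @ b - y_theta) / n
        b = soft(b - grad / L, lam / L)        # gradient step + prox
    return b

rng = np.random.default_rng(4)
X = rng.normal(size=(100, 20))
y_theta = X[:, :3] @ np.array([2.0, -1.5, 1.0]) + rng.normal(0, 0.1, 100)
b = ista_discriminant(X, y_theta, lam=0.05)
print("nonzero coefficients:", np.flatnonzero(np.round(b, 3)))
```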
- Determinantal consensus clustering
Abstract: Random restarts of a given algorithm produce many partitions that can be aggregated into a consensus clustering. Ensemble methods of this kind are recognized as more robust approaches to data clustering than single clustering algorithms. We propose the use of determinantal point processes, or DPPs, for the random restarts of clustering algorithms based on initial sets of center points, such as k-medoids or k-means. The relation between DPPs and kernel-based methods makes DPPs suitable for describing and quantifying similarity between objects. DPPs favor diversity of the center points within initial sets, so that sets with similar points have less chance of being generated than sets with very distinct points. Most current initial sets are generated with center points sampled uniformly at random. We show through extensive simulations that, contrary to DPPs, this technique fails both to ensure diversity and to obtain good coverage of all data facets, two key properties behind the good performance of DPPs. Simulations with artificial datasets and applications to real datasets show that determinantal consensus clustering outperforms consensus clusterings based on uniform random sampling of center points.
PubDate: 2023-12-01
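A sketch of diversity-seeking initialization in this spirit, assuming scikit-learn: a DPP with an RBF similarity kernel assigns low probability to sets of similar centers. Exact DPP sampling needs an eigendecomposition-based sampler, so this sketch substitutes a greedy determinant-maximizing selection, then aggregates k-means restarts into a co-association (consensus) matrix.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.default_rng(5)
X = np.vstack([rng.normal(m, 0.5, (50, 2)) for m in ([0, 0], [4, 0], [2, 4])])
K = rbf_kernel(X, gamma=0.5)                 # similarity kernel of the DPP

def greedy_diverse(K, k, rng):
    """Greedily grow a set maximizing det(K[S, S]) (diverse, DPP-like)."""
    S = [int(rng.integers(len(K)))]
    while len(S) < k:
        gains = [np.linalg.det(K[np.ix_(S + [j], S + [j])])
                 if j not in S else -np.inf for j in range(len(K))]
        S.append(int(np.argmax(gains)))
    return S

partitions = []
for seed in range(10):                       # random restarts
    centers = X[greedy_diverse(K, 3, np.random.default_rng(seed))]
    partitions.append(KMeans(3, init=centers, n_init=1).fit_predict(X))

# Consensus (co-association) matrix: fraction of runs clustering i with j.
P = np.array(partitions)
co = (P[:, :, None] == P[:, None, :]).mean(axis=0)
print("co-association of two points from the same blob:", co[0, 1].round(2))
```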
- Robust instance-dependent cost-sensitive classification
Abstract: Instance-dependent cost-sensitive (IDCS) learning methods have proven useful for binary classification tasks where individual instances are associated with variable misclassification costs. However, we demonstrate in this paper, by means of a series of experiments, that IDCS methods are sensitive to noise and outliers in the instance-dependent misclassification costs, and that their performance strongly depends on the cost distribution of the data sample. We therefore propose a generic three-step framework to make IDCS methods more robust: (i) detect outliers automatically, (ii) correct outlying cost information in a data-driven way, and (iii) construct an IDCS learning method using the adjusted cost information. We apply this framework to cslogit, a logistic regression-based IDCS method, to obtain its robust version, which we name r-cslogit. The robustness of this approach is introduced in steps (i) and (ii), where we make use of robust estimators to detect and impute outlying costs of individual instances. The newly proposed r-cslogit method is tested on synthetic and semi-synthetic data and proves superior in terms of savings to its non-robust counterpart across varying levels of noise and outliers. All our code is made available online at https://github.com/SimonDeVos/Robust-IDCS.
PubDate: 2023-12-01
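The three-step framework itself is easy to sketch, assuming scikit-learn. cslogit optimizes an instance-cost-based objective that is not reproduced here; as a crude stand-in, the final step weights an ordinary logistic regression by the adjusted costs.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(6)
n = 1000
X = rng.normal(size=(n, 5))
y = (X[:, 0] + rng.normal(0, 1, n) > 0).astype(int)
cost = rng.lognormal(0, 0.5, n)
cost[:10] *= 50                                   # contaminate with outliers

# (i) robust outlier detection on the costs (median / MAD rule)
med = np.median(cost)
mad = 1.4826 * np.median(np.abs(cost - med))
outlier = cost > med + 3 * mad

# (ii) data-driven correction: winsorize the flagged costs at the cutoff
cost_adj = np.where(outlier, med + 3 * mad, cost)

# (iii) cost-sensitive fit on the adjusted costs
clf = LogisticRegression().fit(X, y, sample_weight=cost_adj)
print(f"flagged {outlier.sum()} outlying costs; coef: {clf.coef_.round(2)}")
```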
- Clustering data with non-ignorable missingness using semi-parametric mixture models assuming independence within components
Abstract: We propose a semi-parametric clustering model assuming conditional independence given the component. One advantage is that this model can handle non-ignorable missingness. The model defines each component as a product of univariate probability distributions but makes no assumption on the form of each univariate density. Note that the mixture model is used for clustering, not for estimating the density of the full variables (observed and unobserved). Estimation is performed by maximizing an extension of the smoothed likelihood that allows for missingness. This optimization is achieved by a Minorization-Maximization (MM) algorithm. We illustrate the relevance of our approach with numerical experiments on simulated data. Under mild assumptions, we show the identifiability of the model defining the distribution of the observed data and the monotonicity of the algorithm. We also propose an extension of this new method to the case of mixed-type data, which we illustrate on a real data set. The proposed method is implemented in the R package MNARclust, available on CRAN.
PubDate: 2023-12-01
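A complete-data sketch of the smoothed-likelihood idea, assuming SciPy's weighted gaussian_kde: each component is a product of univariate kernel density estimates updated from the posterior memberships (as in the npEM algorithm of Benaglia et al.). The paper's MM algorithm additionally handles the non-ignorable missingness mechanism, which this sketch omits entirely.

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(7)
X = np.vstack([rng.normal(0, 1, (150, 2)), rng.normal(4, 1, (150, 2))])
n, p, G = len(X), X.shape[1], 2

# crude initial memberships from a coordinate split (clipped away from 0/1)
tau = np.column_stack([X[:, 0] < 2, X[:, 0] >= 2]).astype(float)
tau = np.clip(tau, 0.05, 0.95)
tau /= tau.sum(axis=1, keepdims=True)

for _ in range(20):
    # M-step: proportions and per-coordinate weighted kernel density estimates
    pi = tau.mean(axis=0)
    dens = np.ones((n, G))
    for g in range(G):
        for j in range(p):
            kde = gaussian_kde(X[:, j], weights=tau[:, g])
            dens[:, g] *= kde(X[:, j])
    # E-step: posterior memberships under the product densities
    num = pi * dens
    tau = num / num.sum(axis=1, keepdims=True)

print("component proportions:", pi.round(2))
print("component 0 mean:", X[tau[:, 0] > 0.5].mean(axis=0).round(2))
```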
- LASSO regularization within the LocalGLMnet architecture
Abstract: Deep learning models have been very successful in machine learning applications, often outperforming classical statistical models such as linear regression models or generalized linear models. On the other hand, deep learning models are often criticized for being neither explainable nor amenable to variable selection. There are two ways of dealing with this problem: either we use post-hoc model interpretability methods, or we design specific deep learning architectures that allow for easier interpretation and explanation. This paper builds on our previous work on the LocalGLMnet architecture, an interpretable deep learning architecture. In the present paper, we show how group LASSO regularization (and other regularization schemes) can be implemented within the LocalGLMnet architecture to obtain feature sparsity for variable selection. We benchmark our approach against the recently developed LassoNet of Lemhadri et al. (LassoNet: a neural network with feature sparsity. J Mach Learn Res 22:1-29, 2021).
PubDate: 2023-12-01
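A sketch of the architecture and the penalty shape, assuming PyTorch: LocalGLMnet predicts through a local GLM b0 + β(x)ᵀx with β(x) produced by a network, and a group-lasso-style penalty on each feature's attentions β_j over the batch pushes whole features to zero. The paper's exact penalty placement and training protocol may differ.

```python
import torch
import torch.nn as nn

class LocalGLMnet(nn.Module):
    def __init__(self, p, hidden=32):
        super().__init__()
        # beta(x): network mapping features to regression attentions
        self.beta = nn.Sequential(nn.Linear(p, hidden), nn.Tanh(),
                                  nn.Linear(hidden, p))
        self.b0 = nn.Parameter(torch.zeros(1))

    def forward(self, x):
        b = self.beta(x)
        return self.b0 + (b * x).sum(dim=1), b   # local GLM: b0 + beta(x)'x

torch.manual_seed(0)
p, n = 8, 512
X = torch.randn(n, p)
y = 2 * X[:, 0] - X[:, 1] + 0.1 * torch.randn(n)  # two informative features

model = LocalGLMnet(p)
opt = torch.optim.Adam(model.parameters(), lr=1e-2)
lam = 0.05
for _ in range(300):
    opt.zero_grad()
    pred, b = model(X)
    # group-lasso-style penalty: sum over features of ||beta_j|| across batch
    penalty = (b.pow(2).mean(dim=0) + 1e-8).sqrt().sum()
    loss = nn.functional.mse_loss(pred, y) + lam * penalty
    loss.backward()
    opt.step()

_, b = model(X)
print("per-feature attention size:", b.abs().mean(0).detach().numpy().round(2))
```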
- Claims fraud detection with uncertain labels
Abstract: Insurance fraud is a non-self-revealing type of fraud: the true historical labels (fraud or legitimate) are only as precise as the investigators' efforts and successes in uncovering them. Popular approaches of supervised and unsupervised learning fail to capture the ambiguous nature of uncertain labels. Imprecisely observed labels can be represented in the Dempster-Shafer theory of belief functions, a generalization of supervised and unsupervised learning suited to representing uncertainty. In this paper, we show that partial information from historical investigations can add valuable, learnable information to a fraud detection system and improve its performance. We also show that belief function theory provides a flexible mathematical framework for concept drift detection and cost-sensitive learning, two common challenges in fraud detection. Finally, we present an application to real-world motor insurance claims fraud.
PubDate: 2023-11-30
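How imprecise labels enter the picture can be sketched with a toy mass-function calculation: each piece of evidence is a mass function on the frame {fraud, legitimate}, mass on the whole frame encodes ignorance (an uninvestigated claim), and Dempster's rule pools two pieces of evidence. The numbers below are arbitrary; the paper builds a full learning system on top of such representations.

```python
def dempster(m1, m2):
    """Combine two mass functions on frame {'F','L'} ('FL' = ignorance)."""
    events = ["F", "L", "FL"]
    m = {e: 0.0 for e in events}
    conflict = 0.0
    for a in events:
        for b in events:
            inter = set(a) & set(b)
            prod = m1[a] * m2[b]
            if not inter:
                conflict += prod                 # contradictory evidence
            else:
                key = "".join(sorted(inter, key="FL".index))
                m[key] += prod
    return {e: v / (1 - conflict) for e, v in m.items()}

# Investigator found fraud with confidence 0.7; an automated score is vague.
investigation = {"F": 0.7, "L": 0.0, "FL": 0.3}
model_score = {"F": 0.4, "L": 0.2, "FL": 0.4}
print(dempster(investigation, model_score))
```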
- Robust and sparse logistic regression
Abstract: Logistic regression is one of the most popular statistical techniques for solving (binary) classification problems in various applications (e.g. credit scoring, cancer detection, ad click prediction and churn classification). Typically, the maximum likelihood estimator is used, which is very sensitive to outlying observations. In this paper, we propose a robust and sparse logistic regression estimator, where robustness is achieved by means of the \(\gamma\)-divergence. An elastic net penalty ensures sparsity in the regression coefficients, making the model more stable and interpretable. We show that the influence function is bounded and demonstrate the estimator's robustness properties in simulations. The good performance of the proposed estimator is also illustrated in an empirical application classifying the type of fuel used by cars.
PubDate: 2023-11-27
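A numerical sketch, assuming SciPy: replace the negative log-likelihood with a γ-divergence-based loss (one common form of the γ-cross-entropy) plus an elastic net penalty, and minimize directly. The paper's estimator, influence-function analysis, and tuning are not reproduced.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit

rng = np.random.default_rng(8)
n, p = 300, 4
X = rng.normal(size=(n, p))
beta_true = np.array([2.0, -1.0, 0.0, 0.0])
y = rng.binomial(1, expit(X @ beta_true))
y[:15] = 1 - y[:15]                              # contaminate: flipped labels

def objective(beta, gamma=0.5, lam=0.05, alpha=0.5):
    prob = expit(X @ beta)
    py = np.where(y == 1, prob, 1 - prob)        # P(y_i | x_i)
    norm = (prob**(1 + gamma) + (1 - prob)**(1 + gamma))**(gamma / (1 + gamma))
    gamma_loss = -np.log(np.mean(py**gamma / norm)) / gamma
    enet = lam * (alpha * np.abs(beta).sum() + (1 - alpha) * (beta**2).sum())
    return gamma_loss + enet

fit = minimize(objective, np.zeros(p), method="Nelder-Mead")
print("estimated coefficients:", fit.x.round(2))
```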
- Semiparametric mixture of linear regressions with nonparametric Gaussian scale mixture errors
Abstract: In finite mixtures of regression models, a normal assumption for the errors of each regression component is typically adopted. Though this common assumption is theoretically and computationally convenient, it often produces inefficient and undesirable estimates that undermine the applicability of the model, particularly in the presence of outliers. To mitigate these defects, we propose using nonparametric Gaussian scale mixture distributions for the component error distributions. In this way, we can lessen the risk of misspecification and obtain robust estimators. In this paper, we study the identifiability of the proposed model and develop a feasible estimation algorithm. Numerical studies, including simulations and real data analysis, demonstrate the performance of the proposed method.
PubDate: 2023-11-23
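For orientation, here is the standard EM baseline the paper starts from, assuming NumPy/SciPy: a two-component mixture of linear regressions with normal errors, alternating posterior memberships with weighted least squares. The paper's contribution, replacing the normal errors by nonparametric Gaussian scale mixtures, requires a substantially more involved error update that is omitted here.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(9)
n = 400
x = rng.uniform(-3, 3, n)
z = rng.random(n) < 0.5
y = np.where(z, 1 + 2 * x, -1 - x) + rng.normal(0, 0.5, n)
Xd = np.column_stack([np.ones(n), x])

pi = 0.5
betas = [np.array([0.0, 2.0]), np.array([0.0, -1.0])]  # rough slope init
sigma = [1.0, 1.0]
for _ in range(100):
    # E-step: posterior component memberships
    d0 = pi * norm.pdf(y, Xd @ betas[0], sigma[0])
    d1 = (1 - pi) * norm.pdf(y, Xd @ betas[1], sigma[1])
    tau = d0 / (d0 + d1)
    # M-step: weighted least squares per component
    for g, w in ((0, tau), (1, 1 - tau)):
        betas[g] = np.linalg.solve(Xd.T @ (Xd * w[:, None]), Xd.T @ (w * y))
        resid = y - Xd @ betas[g]
        sigma[g] = np.sqrt((w * resid**2).sum() / w.sum())
    pi = tau.mean()

print("pi:", round(pi, 2),
      "beta_0:", betas[0].round(2), "beta_1:", betas[1].round(2))
```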
- Functional clustering of fictional narratives using Vonnegut curves
Abstract: Motivated by a public suggestion by the famous novelist Kurt Vonnegut, we clustered functional data representing sentiment curves of famous fictional stories. We analyzed text data from novels written between 1612 and 1925 and transformed them into curves measuring sentiment as a function of the percentage of the novel's elapsed contents. We employed sentence-level sentiment evaluation and nonparametric curve smoothing. Our clustering methods involved finding the optimal number of clusters, aligning curves with chronological warping functions to account for phase and amplitude variation, and implementing functional K-means algorithms under the square root velocity framework. Our results revealed insights about patterns in fictional narratives that Vonnegut and others have suggested but not analyzed in a functional way.
PubDate: 2023-11-04
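A plain discretized sketch of the pipeline, assuming scikit-learn and SciPy: evaluate each (noisy) sentiment curve on a common grid of elapsed-narrative percentages, smooth nonparametrically, and run K-means on the resulting vectors. The square root velocity alignment and warping used in the paper are skipped.

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d
from sklearn.cluster import KMeans

rng = np.random.default_rng(10)
t = np.linspace(0, 1, 100)                # % of elapsed contents of the novel

# Toy arcs: "rags to riches" (rise) vs "man in a hole" (fall then rise).
rise = [np.sin(np.pi * t / 2) + rng.normal(0, 0.3, t.size) for _ in range(15)]
hole = [-np.sin(np.pi * t) + rng.normal(0, 0.3, t.size) for _ in range(15)]
curves = np.array(rise + hole)

smoothed = gaussian_filter1d(curves, sigma=5, axis=1)   # nonparametric smooth
labels = KMeans(n_clusters=2, n_init=10).fit_predict(smoothed)
print("cluster sizes:", np.bincount(labels))
```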
- A between-cluster approach for clustering skew-symmetric data
Abstract: In order to investigate exchanges between objects, a clustering model for skew-symmetric data is proposed that relies on the between-cluster effects of the skew-symmetries, which represent the imbalances of the observed exchanges between pairs of objects. The aim is to detect clusters of objects that share the same exchange behaviour, so that origin and destination clusters are identified. The proposed model decomposes the skew-symmetric matrix of between-cluster imbalances into a sum of off-diagonal block matrices, each approximated by a skew-symmetric matrix obtained from a truncated Singular Value Decomposition (SVD) that exploits the properties of skew-symmetric matrices. The model is fitted in a least-squares framework, and an efficient Alternating Least Squares algorithm is provided. Finally, to show the potential of the model and the features of the resulting clusters, an extensive simulation study and an illustrative application to real data are presented.
PubDate: 2023-10-28
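The algebra the model builds on is easy to demonstrate, assuming NumPy: the skew-symmetric part of an exchange matrix carries the imbalances, and its singular values come in equal pairs, so an even-rank truncated SVD is again skew-symmetric. The paper's between-cluster block decomposition and ALS fit are not shown.

```python
import numpy as np

rng = np.random.default_rng(11)
X = rng.poisson(10, (6, 6)).astype(float)     # observed exchanges i -> j
X[:3, 3:] += 20                               # objects 0-2 ship to objects 3-5

K = (X - X.T) / 2                             # skew-symmetric imbalances
U, d, Vt = np.linalg.svd(K)
print("singular values (note the equal pairs):", d.round(2))

# Rank-2 truncation: best low-rank skew-symmetric fit of the imbalances.
K2 = U[:, :2] @ np.diag(d[:2]) @ Vt[:2]
print("dominant imbalance, rounded:\n", K2.round(1))
```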
- Applications of dual regularized Laplacian matrix for community detection
Abstract: Spectral clustering is widely used for detecting communities in networks, and a small change to the graph Laplacian matrix can bring a dramatic improvement. In this paper, we propose a dual regularized graph Laplacian matrix and employ it in the classical spectral clustering approach under the degree-corrected stochastic block model. When the number of communities K is known, we consider more than K leading eigenvectors and weight them by their corresponding eigenvalues in the spectral clustering procedure to improve performance. The improved method is dual regularized spectral clustering (DRSC). Theoretical analysis shows that, under mild conditions, DRSC yields stable, consistent community detection. Meanwhile, we develop a strategy for estimating the number of communities K by taking advantage of DRSC and Newman's modularity. We compare the performance of DRSC with several spectral methods and investigate the behavior of our strategy for estimating K on extensive simulated and real-world networks. Numerical results show that DRSC enjoys satisfactory performance and that our strategy estimates K accurately and consistently, even when there is only one community in a network.
PubDate: 2023-10-26
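A sketch of the procedure on a toy stochastic block model, assuming scikit-learn: regularize both the adjacency matrix and the degrees (hence "dual"), form the normalized Laplacian, take a few more than K leading eigenvectors weighted by their eigenvalues, and run k-means. The paper's exact regularizers and its modularity-based strategy for choosing K are not reproduced; the tau values below are arbitrary.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(12)
n, K = 90, 3
blocks = np.repeat(np.arange(K), n // K)
P = np.where(blocks[:, None] == blocks[None, :], 0.30, 0.05)
A = (rng.random((n, n)) < P).astype(float)
A = np.triu(A, 1); A = A + A.T                # simple undirected network

tau_A, tau_D = 0.1, 1.0
A_reg = A + tau_A / n                         # regularize adjacency
d_reg = A_reg.sum(1) + tau_D                  # regularize degrees
L = A_reg / np.sqrt(np.outer(d_reg, d_reg))   # normalized Laplacian form

vals, vecs = np.linalg.eigh(L)
m = K + 2                                     # use more than K eigenvectors
emb = vecs[:, -m:] * vals[-m:]                # weight by eigenvalues
labels = KMeans(K, n_init=10).fit_predict(emb)
print("confusion with planted blocks:\n",
      np.array([[np.sum((blocks == b) & (labels == c)) for c in range(K)]
                for b in range(K)]))
```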
- A new model for counterfactual analysis for functional data
Abstract: Counterfactual explanations have become a very popular interpretability tool for understanding and explaining how complex machine learning models make decisions for individual instances. Most of the research on counterfactual explainability focuses on tabular and image data, and much less on models dealing with functional data. In this paper, a counterfactual analysis for functional data is addressed, in which the goal is to identify the samples of the dataset from which the counterfactual explanation is built, as well as how they are combined, so that the individual instance and its counterfactual are as close as possible. Our methodology can be used with different distance measures for multivariate functional data and is applicable to any score-based classifier. We illustrate our methodology using two real-world datasets, one univariate and one multivariate.
PubDate: 2023-10-25
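A single-donor sketch of the search, assuming scikit-learn: starting from a query curve, move along a convex combination toward the closest training curve of the target class until the classifier flips. The paper instead optimizes which samples are combined and with what weights, under functional distances; this line search only conveys the flavor.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(13)
t = np.linspace(0, 1, 50)
X0 = np.array([np.sin(2 * np.pi * t) + rng.normal(0, 0.2, t.size)
               for _ in range(40)])
X1 = np.array([np.cos(2 * np.pi * t) + rng.normal(0, 0.2, t.size)
               for _ in range(40)])
X = np.vstack([X0, X1]); y = np.repeat([0, 1], 40)
clf = LogisticRegression().fit(X, y)          # any score-based classifier

query = X0[0]
donors = X[y == 1]
donor = donors[np.argmin(np.linalg.norm(donors - query, axis=1))]

for alpha in np.linspace(0, 1, 101):          # smallest flip along the path
    x_cf = (1 - alpha) * query + alpha * donor
    if clf.predict(x_cf[None])[0] == 1:
        break
print(f"flip at alpha={alpha:.2f}; L2 distance to query:",
      round(np.linalg.norm(x_cf - query), 2))
```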
- Profile-based latent class distance association analyses for sparse tables: application to the attitude of European citizens towards sustainable tourism
Abstract: Social and behavioural sciences often deal with the analysis of associations in cross-classified data. This paper focuses on the patterns observed among European citizens regarding their attitude towards sustainable tourism, specifically their willingness to change travel and tourism habits to be more sustainable. The data record the intention to comply with nine sustainable actions; answers to these questions generate individual profiles, and each respondent's European country is also reported. Therefore, unlike a variable-oriented approach, here we are interested in a person-oriented approach through profiles. Some traditional methods perform poorly on such profiles, for example because of the sparseness of the contingency table. We remove many of these limitations by using a latent class distance association model, clustering the row profiles into classes and representing these, together with the categories of the response variable, in a low-dimensional space. We show, furthermore, that an intuitive interpretation of the associations between cluster centres and the categories of a response variable can be incorporated in this framework using unfolding. The analyses show that the citizens most committed to environmentally friendly behaviour live in Sweden and Romania, while those less willing to change their habits towards more sustainable behaviour live in Belgium, Cyprus, France, Lithuania and the Netherlands. Citizens' preparedness to change habits also depends on socio-demographic characteristics such as gender, age, occupation, type of community they live in, household size, and the frequency of travelling before the Covid-19 pandemic.
PubDate: 2023-10-18
- Editorial for ADAC issue 4 of volume 17 (2023)
PubDate: 2023-10-14
- Discovering interpretable structure in longitudinal predictors via coefficient trees
Abstract: We consider the regression setting in which the response variable is not longitudinal (i.e., it is observed once per case) but is assumed to depend functionally on a set of predictors that are observed longitudinally, a specific form of functional predictors. In this situation, we often expect that the same predictor observed at nearby time points is more likely to be associated with the response in the same way. We can exploit this to discover groups of predictors that share the same (or a similar) coefficient according to their temporal proximity. We propose a new algorithm, coefficient tree regression, for data in which a non-longitudinal response depends on longitudinal predictors, to efficiently discover the underlying temporal characteristics of the data. The approach yields a simple and highly interpretable tree structure from which the hierarchical relationships between groups of predictors that affect the response in a similar manner, based on their temporal proximity, can be read off, and we demonstrate with a real example that it can provide a clear and concise interpretation of the data. In numerical comparisons over a variety of examples, we show that our approach achieves substantially better predictive accuracy than existing competitors, most likely due to the form of dimensionality reduction that is automatically discovered when fitting the model, in addition to its interpretability advantages and lower computational expense.
PubDate: 2023-10-11
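A greedy sketch of the grouping intuition, assuming NumPy: start with one group per time point and repeatedly merge the adjacent pair of groups that most improves BIC, so that time points sharing a coefficient collapse into one summed predictor. The paper's coefficient tree algorithm builds the hierarchy differently and far more efficiently; this brute-force version is only illustrative.

```python
import numpy as np

rng = np.random.default_rng(14)
n, T = 300, 12
X = rng.normal(size=(n, T))                       # predictor at 12 time points
beta = np.array([1.0] * 4 + [0.0] * 4 + [-0.5] * 4)  # 3 true coefficient groups
y = X @ beta + rng.normal(0, 0.5, n)

def bic(groups):
    # one summed predictor per group, then ordinary least squares
    Z = np.column_stack([X[:, g].sum(axis=1) for g in groups])
    coef, *_ = np.linalg.lstsq(Z, y, rcond=None)
    rss = np.sum((y - Z @ coef) ** 2)
    return n * np.log(rss / n) + len(groups) * np.log(n)

groups = [[j] for j in range(T)]                  # start: one group per time
while len(groups) > 1:
    cands = [groups[:i] + [groups[i] + groups[i + 1]] + groups[i + 2:]
             for i in range(len(groups) - 1)]     # merge adjacent pairs only
    best = min(cands, key=bic)
    if bic(best) >= bic(groups):
        break
    groups = best
print("recovered groups of time points:", groups)
```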