- Identification of representative trees in random forests based on a new tree-based distance measure
Abstract: In life sciences, random forests are often used to train predictive models. However, gaining explanatory insight into the mechanics leading to a specific outcome is rather complex, which impedes the implementation of random forests in clinical practice. By simplifying a complex ensemble of decision trees to a single, most representative tree, it should be possible to observe common tree structures, the importance of specific features, and variable interactions. Representative trees could thus also help to understand interactions between genetic variants. Intuitively, representative trees are those with the minimal distance to all other trees, which requires a proper definition of the distance between two trees. We therefore developed a new tree-based distance measure that incorporates more of the underlying tree structure than other metrics. We compared our new method with existing metrics in an extensive simulation study and applied it to predict the age at onset based on a set of genetic risk factors in a clinical data set. In our simulation study we were able to show the advantages of our weighted splitting-variable approach. Our real data application revealed that representative trees are not only able to replicate the results of a recent genome-wide association study, but can also give additional explanations of the genetic mechanisms. Finally, we implemented all compared distance measures in R and made them publicly available in the R package timbR (https://github.com/imbs-hl/timbR).
PubDate: 2023-03-16
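For orientation, a minimal R sketch of how a representative tree might be extracted with timbR; the function names measure_distances() and select_trees() are assumptions based on the repository and should be verified against the package documentation:

    # Hedged sketch: fit a small forest with ranger, then pick one
    # representative tree. measure_distances() and select_trees() are
    # assumed names from the timbR repository -- verify before use.
    # remotes::install_github("imbs-hl/timbR")
    library(ranger)
    library(timbR)

    rf <- ranger(Species ~ ., data = iris, num.trees = 100)

    d <- measure_distances(rf, metric = "weighted splitting variables")
    rep_tree <- select_trees(rf, num.trees = 1, distance.matrix = d)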
- Threshold-based Naïve Bayes classifier
Abstract: The Threshold-based Naïve Bayes (Tb-NB) classifier is introduced as a simple improvement of the original Naïve Bayes classifier. Tb-NB extracts sentiment from a natural-language text corpus and allows the user not only to predict whether a sentence is positive or negative, but also to quantify the sentiment with a numeric value. It is based on the estimation of a single threshold value that defines a decision rule classifying a text as expressing a positive or negative opinion based on its content. One of the main advantages of Tb-NB is that its results can serve as input for post-hoc analyses aimed at observing how the quality associated with the different dimensions of a product or service, or, in a mirrored fashion, the different dimensions of customer satisfaction, evolves over time or changes across locations. The effectiveness of Tb-NB is evaluated on data from the tourism industry, specifically hotel guests' reviews of all hotels in the Sardinian region available on Booking.com. Moreover, Tb-NB is compared with other popular classifiers used in sentiment analysis in terms of model accuracy, resistance to noise and computational efficiency.
PubDate: 2023-03-14
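A generic R sketch of the idea (hypothetical helpers, not the authors' implementation): score a sentence by its Naïve Bayes log-likelihood ratio and compare it to a tuned threshold instead of the default zero.

    tbnb_score <- function(words, loglik_pos, loglik_neg) {
      # loglik_pos / loglik_neg: named vectors of per-word log-probabilities
      w <- intersect(words, names(loglik_pos))  # known vocabulary only
      sum(loglik_pos[w]) - sum(loglik_neg[w])   # signed sentiment score
    }

    tbnb_classify <- function(score, threshold) {
      ifelse(score > threshold, "positive", "negative")
    }

    # The threshold would be tuned on labelled training scores, e.g.:
    # grid <- sort(unique(train_scores))
    # acc  <- sapply(grid, function(t)
    #   mean(tbnb_classify(train_scores, t) == train_labels))
    # threshold <- grid[which.max(acc)]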
- Over-optimistic evaluation and reporting of novel cluster algorithms: an illustrative study
Abstract: When researchers publish new cluster algorithms, they usually demonstrate the strengths of their novel approaches by comparing the algorithms' performance with existing competitors. However, such studies are likely to be optimistically biased towards the new algorithms, as the authors have a vested interest in presenting their method as favorably as possible in order to increase their chances of getting published. Therefore, the superior performance of newly introduced cluster algorithms is over-optimistic and might not be confirmed in independent benchmark studies performed by neutral and unbiased authors. This problem is known among many researchers, but so far, the different mechanisms leading to over-optimism in cluster algorithm evaluation have never been systematically studied and discussed. Researchers are thus often not aware of the full extent of the problem. We present an illustrative study to illuminate the mechanisms by which authors, consciously or unconsciously, paint their cluster algorithm's performance in an over-optimistic light. Using the recently published cluster algorithm Rock as an example, we demonstrate how optimization of the used datasets or data characteristics, of the algorithm's parameters and of the choice of the competing cluster algorithms leads to Rock's performance appearing better than it actually is. Our study is thus a cautionary tale that illustrates how easy it can be for researchers to claim apparent "superiority" of a new cluster algorithm. This illuminates the vital importance of strategies for avoiding the problems of over-optimism (such as, e.g., neutral benchmark studies), which we also discuss in the article.
PubDate: 2023-03-01
- Optimal projections for Gaussian discriminants
Abstract: We study the problem of obtaining optimal projections for performing discriminant analysis with Gaussian class densities. Unlike in most existing approaches to the problem, we focus on the optimisation of the multinomial likelihood based on posterior probability estimates, which directly captures discriminability of classes. Finding optimal projections offers utility for dimension reduction and regularisation, as well as instructive visualisation for better model interpretability. Practical applications of the proposed approach show that it is highly competitive with existing Gaussian discriminant models. Code to implement the proposed method is available in the form of an R package from https://github.com/DavidHofmeyr/OPGD.
PubDate: 2023-03-01
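One plausible reading of the objective (our notation, not necessarily the paper's): with projection matrix V and Gaussian class densities in the projected space, maximize the multinomial log-likelihood of the posterior class probabilities,

    \ell(V) = \sum_{i=1}^{n} \log \hat{\pi}\!\left(y_i \mid V^{\top} x_i\right),
    \qquad
    \hat{\pi}(k \mid z) = \frac{p_k \, \phi(z; \mu_k, \Sigma_k)}{\sum_{l} p_l \, \phi(z; \mu_l, \Sigma_l)},

where \phi denotes the Gaussian density in the projected space and p_k the class priors.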
- Sparsifying the least-squares approach to PCA: comparison of lasso and cardinality constraint
Abstract: Sparse PCA methods are used to overcome the difficulty of interpreting the solution obtained from PCA. However, constraining PCA to obtain sparse solutions is an intractable problem, especially in a high-dimensional setting. Penalized methods are used to obtain sparse solutions because of their computational tractability. Nevertheless, recent developments make it possible to obtain good solutions to cardinality-constrained PCA problems efficiently, allowing a comparison between the two approaches. Here, we compare a penalized PCA method with its cardinality-constrained counterpart for the least-squares formulation of PCA, imposing sparseness on the component weights. We compare the penalized and cardinality-constrained methods through a simulation study that estimates recovery of the sparse structure, mean absolute bias, mean variance, and mean squared error. Additionally, we use a high-dimensional data set to illustrate the methods in practice. Results suggest that cardinality-constrained methods lead to better recovery of the sparse structure.
PubDate: 2023-03-01
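An illustrative R sketch of the lasso-type route (not the paper's code): the first sparse component via alternating updates with soft-thresholding of the weights.

    soft <- function(x, lambda) sign(x) * pmax(abs(x) - lambda, 0)

    sparse_pc1 <- function(X, lambda, iter = 200) {
      X <- scale(X, center = TRUE, scale = FALSE)
      v <- svd(X)$v[, 1]                        # warm start from ordinary PCA
      for (k in seq_len(iter)) {
        u <- X %*% v
        u <- u / sqrt(sum(u^2))                 # normalized scores
        v <- soft(crossprod(X, u), lambda)      # lasso step on the weights
        if (sum(v^2) > 0) v <- v / sqrt(sum(v^2))
      }
      drop(v)                                   # sparse component weights
    }

    w <- sparse_pc1(as.matrix(USArrests), lambda = 50)  # lambda set by hand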
- Clusterwise elastic-net regression based on a combined information criterion
Abstract: Many research questions concern a regression problem in which the population under study is not homogeneous with respect to the underlying model. In this setting, we propose an original method called Combined Information criterion CLUSterwise elastic-net regression (Ciclus). The method addresses several methodological and application-related challenges. It is derived from both information theory and microeconomic utility theory and maximizes a well-defined criterion combining three weighted sub-criteria, each related to a specific aim: obtaining a parsimonious partition, compact clusters for better prediction of cluster membership, and a good within-cluster regression fit. The solving algorithm is monotonically convergent under mild assumptions. The Ciclus principle provides an innovative solution to two key issues: (i) automatic optimization of the number of clusters, and (ii) the proposal of a prediction model. We applied it to elastic-net regression in order to handle high-dimensional data involving redundant explanatory variables. Ciclus is illustrated through both a simulation study and a real example in the field of omics data, showing how it improves the quality of the prediction and facilitates interpretation. It should therefore prove useful whenever the data involve a population mixture, as, for example, in biology, the social sciences, economics or marketing.
PubDate: 2023-03-01
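A rough R sketch of the clusterwise elastic-net idea only (not the Ciclus criterion): alternate between per-cluster glmnet fits and reassignment of each observation to the cluster whose model predicts it best.

    library(glmnet)

    clusterwise_enet <- function(X, y, K = 2, alpha = 0.5, lambda = 0.1,
                                 iter = 10) {
      # X: numeric matrix (>= 2 columns); assumes clusters stay non-empty
      z <- sample(K, nrow(X), replace = TRUE)       # random initial partition
      for (it in seq_len(iter)) {
        fits <- lapply(1:K, function(k)
          glmnet(X[z == k, , drop = FALSE], y[z == k],
                 alpha = alpha, lambda = lambda))
        err <- sapply(fits, function(f) (y - predict(f, newx = X))^2)
        z <- max.col(-err)                          # reassign to best model
      }
      list(partition = z, models = fits)
    }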
- CenetBiplot: a new proposal of sparse and orthogonal biplots methods by means of elastic net CSVD
Abstract: In this work, a new mathematical algorithm for sparse and orthogonally constrained biplots, called CenetBiplots, is proposed. Biplots provide a joint representation of the observations and variables of a multidimensional matrix in the same reference system, a subspace in which the relationships between them can be interpreted in terms of geometric elements. CenetBiplots projects a matrix onto a low-dimensional space generated simultaneously by sparse and orthogonal principal components. Sparsity is desired to select variables automatically, and orthogonality is necessary to preserve the geometric properties that ensure the graphical interpretation of biplots. To this end, the present study focuses on two objectives: (1) the extension of constrained singular value decomposition to incorporate an elastic-net sparsity constraint (CenetSVD), and (2) the implementation of CenetBiplots using CenetSVD. The usefulness of the proposed methodologies for analysing high-dimensional and low-dimensional matrices is shown. Our method is implemented in R and available for download from https://github.com/ananieto/SparseCenetMA.
PubDate: 2023-03-01
- Assessing similarities between spatial point patterns with a Siamese neural network discriminant model
Abstract: Identifying structural differences among observed point patterns from several populations is of interest in several applications. We use deep convolutional neural networks and employ a Siamese framework to build a discriminant model for distinguishing structural differences between spatial point patterns. In a simulation study, and using a one-shot learning classification, we show that the Siamese network discriminant model outperforms the common dissimilarities based on intensity and K functions. The model is then used to analyze similarities between spatial point patterns of 130 species in a tropical rainforest study plot observed at different time instances. The simulation study and data analysis show the adequacy and generality of a Siamese network discriminant model in the classification of spatial point patterns.
PubDate: 2023-03-01
- Poisson degree corrected dynamic stochastic block model
Abstract: The Stochastic Block Model (SBM) provides a statistical tool for modeling and clustering network data. In this paper, we propose an extension of this model for discrete-time dynamic networks that takes into account the variability in node degrees, allowing us to model a broader class of networks. We develop a probabilistic model that generates temporal graphs with a dynamic cluster structure and time-dependent degree corrections for each node. Thanks to these degree corrections, nodes can have variable in- and out-degrees, allowing us to model complex cluster structures as well as interactions that decrease or increase over time. We compare the proposed model to a model without degree correction and highlight its advantages in the case of inhomogeneous degree distributions within clusters and in the recovery of unstable cluster dynamics. We propose an inference procedure based on Variational Expectation-Maximization (VEM) that also provides the means to estimate the time-dependent degree corrections. Extensive experiments on simulated and real datasets confirm the benefits of our approach and show the effectiveness of the proposed algorithm.
PubDate: 2023-03-01
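For context, a degree-corrected block model in the spirit of Karrer and Newman, extended with a time index as the abstract describes (our notation; the paper's exact parameterization may differ): the number of interactions from node i to node j at time t could be modeled as

    X_{ij}(t) \sim \operatorname{Poisson}\!\big(\delta_i(t)\,\gamma_j(t)\,\lambda_{z_i(t),\,z_j(t)}\big),

where z_i(t) is the cluster of node i at time t, \lambda_{kl} the block interaction intensity, and \delta_i(t), \gamma_j(t) the time-dependent out- and in-degree corrections.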
- Robust mixture regression modeling based on two-piece scale mixtures of normal distributions
Abstract: Inference for mixture regression models (MRM) is traditionally based on the assumption of normal (symmetric) component errors and is thus sensitive to outliers and to symmetric or asymmetric, light- or heavy-tailed errors. To deal with these problems, several new mixture regression models have been proposed recently. In this paper, a general class of robust mixture regression models is presented based on two-piece scale mixtures of normal (TP-SMN) distributions. The proposed model is flexible enough to accommodate asymmetry and heavy tails simultaneously. Its stochastic representation enables us to easily implement an EM-type algorithm to estimate the unknown parameters of the model based on a penalized likelihood. The performance of the considered estimators is illustrated using a simulation study and a real data example.
PubDate: 2023-03-01
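For reference, a standard two-piece construction (the paper's exact parameterization may differ): starting from a symmetric scale-mixture-of-normals density f_0 (e.g., normal, t, slash, or contaminated normal), the two-piece density with location \mu and scales \sigma_1, \sigma_2 is

    f(y) = \frac{2}{\sigma_1 + \sigma_2}
           \left[ f_0\!\left(\frac{y-\mu}{\sigma_1}\right) \mathbf{1}\{y < \mu\}
                + f_0\!\left(\frac{y-\mu}{\sigma_2}\right) \mathbf{1}\{y \ge \mu\} \right],

so that \sigma_1 \ne \sigma_2 induces skewness while the choice of f_0 controls the tails.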
- Kurtosis removal for data pre-processing
Abstract: Mesokurtic projections are linear projections with null fourth cumulants. They can be useful data pre-processing tools when nonnormality, as measured by the fourth cumulants, is either an opportunity or a challenge. Nonnull fourth cumulants are an opportunity when projections with extreme kurtosis are used to identify interesting nonnormal features, such as clusters and outliers. Unfortunately, this approach suffers from the curse of dimensionality, which may be addressed by projecting the data onto the subspace orthogonal to the mesokurtic projections. Nonnull fourth cumulants are a challenge when using statistical methods whose sampling properties depend heavily on the fourth cumulants themselves. Mesokurtic projections ease the problem by allowing the use of the inferential properties of the same methods under normality. The paper gives necessary and sufficient conditions for the existence of mesokurtic projections and compares them with other Gaussianization methods. Theoretical and empirical results suggest that mesokurtic transformations are particularly useful when sampling from finite normal mixtures. The practical use of mesokurtic projections is illustrated with the AIS and RANDU datasets.
PubDate: 2023-03-01
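In symbols, a projection direction a is mesokurtic when the fourth cumulant of the projected variable vanishes:

    \kappa_4\!\left(a^{\top}X\right)
      = \mathbb{E}\!\left[\left(a^{\top}X - \mu_a\right)^4\right] - 3\,\sigma_a^4 = 0,

where \mu_a and \sigma_a^2 are the mean and variance of a^{\top}X; a normal distribution satisfies this identically.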
- Notes on the H-measure of classifier performance
Abstract: The H-measure is a classifier performance measure which takes the context of application into account without requiring a rigid value of relative misclassification costs to be set. Since its introduction in 2009 it has become widely adopted. This paper answers various queries that users have raised since its introduction, including questions about its interpretation, the choice of a weighting function, whether it is strictly proper, and its coherence, and it relates the measure to other work.
PubDate: 2023-03-01
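A hedged usage example with the CRAN package hmeasure (by Anagnostopoulos and Hand); the argument names follow the package documentation as recalled and should be verified with ?HMeasure.

    # install.packages("hmeasure")
    library(hmeasure)

    set.seed(1)
    y      <- rbinom(200, 1, 0.4)        # true classes
    scores <- rnorm(200, mean = y)       # classifier scores

    out <- HMeasure(true.class = y, scores = scores)
    out$metrics$H                        # H-measure, alongside AUC and others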
- Early identification of biliary atresia using subspace and the bootstrap methods
Abstract: In clinical medicine, physicians often rely on information derived from medical imaging systems, such as image data, for diagnosis. To detect disease early, physicians manually extract essential information from the data to distinguish accurately between positive and negative cases. In recent years, deep learning (DL) has been used for this purpose, attracting the attention of prominent researchers because of its excellent performance. Consequently, DL and other artificial intelligence (AI) technologies are expected to develop further through integration with statistical and other approaches. Here, we examine biliary atresia (BA), a rare disease that primarily affects infants. Our study focuses on the identification of BA from image data (stool images of BA patients). Using AI and statistical approaches, we propose a machine learning classifier (model) for accurate diagnosis, efficient classification, and early detection of BA after exposure to limited training data. In an initial study, we used the subspace pattern recognition method to develop a similar classifier. In this study, we propose a filter based on the subspace method and a statistical approach. The filter enables the classifier to extract essential information from image data and discriminate efficiently between BA and non-BA patients.
PubDate: 2023-03-01
- Minimum adjusted Rand index for two clusterings of a given size
Abstract: The adjusted Rand index (ARI) is commonly used in cluster analysis to measure the degree of agreement between two data partitions. Since its introduction, exploring the situations of extreme agreement and disagreement under different circumstances has been a subject of interest, in order to achieve a better understanding of this index. Here, an explicit formula for the lowest possible value of the ARI for two clusterings of given sizes is shown, and moreover a specific pair of clusterings achieving such a bound is provided.
PubDate: 2023-03-01
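For reference, the standard (Hubert-Arabie) adjusted Rand index whose minimum is characterized: with contingency counts n_{ij} between the two clusterings, marginals a_i and b_j, and n objects,

    \mathrm{ARI} =
    \frac{\sum_{ij}\binom{n_{ij}}{2}
          - \Big[\sum_i \binom{a_i}{2}\sum_j \binom{b_j}{2}\Big]\Big/\binom{n}{2}}
         {\tfrac{1}{2}\Big[\sum_i \binom{a_i}{2} + \sum_j \binom{b_j}{2}\Big]
          - \Big[\sum_i \binom{a_i}{2}\sum_j \binom{b_j}{2}\Big]\Big/\binom{n}{2}}.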
- Editorial for ADAC issue 1 of volume 17 (2023)
PubDate: 2023-02-17
- Clustering data with non-ignorable missingness using semi-parametric mixture models assuming independence within components
Abstract: We propose a semi-parametric clustering model assuming conditional independence given the component. One advantage is that this model can handle non-ignorable missingness. The model defines each component as a product of univariate probability distributions but makes no assumption on the form of each univariate density. Note that the mixture model is used for clustering, not for estimating the density of the full variables (observed and unobserved). Estimation is performed by maximizing an extension of the smoothed likelihood that allows for missingness, and this optimization is achieved by a Minorization-Maximization algorithm. We illustrate the relevance of our approach by numerical experiments on simulated data. Under mild assumptions, we show the identifiability of the model defining the distribution of the observed data and the monotonicity of the algorithm. We also propose an extension of this new method to the case of mixed-type data, which we illustrate on a real data set. The proposed method is implemented in the R package MNARclust, available on CRAN.
PubDate: 2023-02-12
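A hedged sketch of getting started: MNARclust is on CRAN, but the entry-point name and arguments below are assumptions and should be checked against the package manual.

    # install.packages("MNARclust")
    library(MNARclust)

    X <- as.matrix(iris[, 1:4])
    X[sample(length(X), 50)] <- NA      # toy (not non-ignorable) missingness

    # res <- clustMNAR(x = X, nb.clust = 3)   # assumed call; see ?MNARclust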
- Robust instance-dependent cost-sensitive classification
Abstract: Instance-dependent cost-sensitive (IDCS) learning methods have proven useful for binary classification tasks where individual instances are associated with variable misclassification costs. However, we demonstrate in this paper, by means of a series of experiments, that IDCS methods are sensitive to noise and outliers in the instance-dependent misclassification costs, and that their performance strongly depends on the cost distribution of the data sample. We therefore propose a generic three-step framework to make IDCS methods more robust: (i) detect outliers automatically, (ii) correct outlying cost information in a data-driven way, and (iii) construct an IDCS learning method using the adjusted cost information. We apply this framework to cslogit, a logistic-regression-based IDCS method, to obtain its robust version, which we name r-cslogit. The robustness of this approach is introduced in steps (i) and (ii), where we use robust estimators to detect and impute outlying costs of individual instances. The newly proposed r-cslogit method is tested on synthetic and semi-synthetic data and proves superior in terms of savings compared to its non-robust counterpart for varying levels of noise and outliers. All our code is available online at https://github.com/SimonDeVos/Robust-IDCS.
PubDate: 2023-01-07
DOI: 10.1007/s11634-022-00533-3
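A generic R illustration of steps (i) and (ii) of the framework (not the authors' code): flag costs far from the bulk with a median/MAD rule and replace them with a robust central value before fitting the IDCS model.

    robustify_costs <- function(cost, k = 3) {
      m <- median(cost)
      s <- mad(cost)
      out <- abs(cost - m) > k * s      # step (i): detect outlying costs
      cost[out] <- m                    # step (ii): data-driven correction
      list(cost = cost, outliers = which(out))
    }

    set.seed(7)
    costs <- c(rlnorm(98, meanlog = 3), 5000, 8000)  # two inflated costs
    robustify_costs(costs)$outliers   # flags the inflated costs (and possibly
                                      # a few extreme genuine ones)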
- Flexible mixture regression with the generalized hyperbolic distribution
Abstract: When modeling the functional relationship between a response variable and covariates via linear regression, multiple relationships may be present depending on the underlying component structure. Deploying a flexible mixture distribution can help with capturing a wide variety of such structures, thereby successfully modeling the response–covariate relationship while addressing the components. In that spirit, a mixture regression model based on the finite mixture of generalized hyperbolic distributions is introduced, and its parameter estimation method is presented. The flexibility of the generalized hyperbolic distribution can identify better-fitting components, which can lead to a more meaningful functional relationship between the response variable and the covariates. In addition, we introduce an iterative component combining procedure to aid the interpretability of the model. The results from simulated and real data analyses indicate that our method offers a distinctive edge over some of the existing methods, and that it can generate useful insights on the data set at hand for further investigation.
PubDate: 2023-01-04
DOI: 10.1007/s11634-022-00532-4
- Sparse correspondence analysis for large contingency tables
Abstract: We propose sparse variants of correspondence analysis (CA) for large contingency tables, such as the document-term matrices used in text mining. By seeking to obtain many zero coefficients, sparse CA remedies the difficulty of interpreting CA results when the table is large. Since CA is a doubly weighted PCA (for rows and columns), or equivalently a weighted generalized SVD, we adapt known sparse versions of these methods, with specific developments to obtain orthogonal solutions and to tune the sparseness parameters. We distinguish two cases depending on whether sparseness is requested for both rows and columns or for only one set.
PubDate: 2023-01-02
DOI: 10.1007/s11634-022-00531-5
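For context, the decomposition that the sparse variants modify, in standard CA notation: with correspondence matrix P (the table of relative frequencies), row and column margins r and c, and diagonal margin matrices D_r and D_c, CA computes the SVD of the standardized residuals

    S = D_r^{-1/2}\left(P - r c^{\top}\right) D_c^{-1/2} = U \Sigma V^{\top},

and sparse CA seeks many exact zeros in the (suitably re-weighted) singular vectors U and/or V.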
- Proximal methods for sparse optimal scoring and discriminant analysis
Abstract: Linear discriminant analysis (LDA) is a classical method for dimensionality reduction, in which discriminant vectors are sought to project data to a lower-dimensional space for optimal separability of classes. Several recent papers have outlined strategies, based on exploiting sparsity of the discriminant vectors, for performing LDA in the high-dimensional setting where the number of features exceeds the number of observations. However, many of these proposals lack scalable methods for solving the underlying optimization problems. We consider an optimization scheme for solving the sparse optimal scoring formulation of LDA based on block coordinate descent. Each iteration of this algorithm requires an update of a scoring vector, which admits an analytic formula, and an update of the corresponding discriminant vector, which requires the solution of a convex subproblem; we propose several variants of this algorithm in which the proximal gradient method or the alternating direction method of multipliers is used to solve this subproblem. We show that the per-iteration cost of these methods scales linearly in the dimension of the data provided restricted regularization terms are employed, and cubically in the dimension of the data in the worst case. Furthermore, we establish that when this block coordinate descent framework generates convergent subsequences of iterates, these subsequences converge to stationary points of the sparse optimal scoring problem. We demonstrate the effectiveness of our new methods with empirical results for classification of Gaussian data and data sets drawn from benchmarking repositories, including time-series and multispectral X-ray data, and provide Matlab and R implementations of our optimization schemes.
PubDate: 2022-12-21
DOI: 10.1007/s11634-022-00530-6
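A generic R sketch of the proximal machinery such variants rely on: the proximal operator of the l1 penalty is coordinatewise soft-thresholding, and one proximal gradient step combines it with a gradient step on the smooth part of the objective (grad_f is a placeholder, not the paper's objective).

    prox_l1 <- function(v, t) sign(v) * pmax(abs(v) - t, 0)

    prox_step <- function(beta, grad_f, step, lambda) {
      # minimize f(beta) + lambda * sum(abs(beta)) by iterating this step
      prox_l1(beta - step * grad_f(beta), step * lambda)
    }

    # e.g. for f(beta) = 0.5 * ||A %*% beta - b||^2:
    # grad_f <- function(beta) crossprod(A, A %*% beta - b)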