Authors: Gérard Govaert; Mohamed Nadif Pages: 455 - 488 Abstract: Many of the datasets encountered in statistics are two-dimensional in nature and can be represented by a matrix. Classical clustering procedures seek to construct separately an optimal partition of rows or, sometimes, of columns. In contrast, co-clustering methods cluster the rows and the columns simultaneously and organize the data into homogeneous blocks (after suitable permutations). Methods of this kind have practical importance in a wide variety of applications such as document clustering, where data are typically organized in two-way contingency tables. Our goal is to offer coherent frameworks for understanding some existing criteria and algorithms for co-clustering contingency tables, and to propose new ones. We look at two different frameworks for the problem of co-clustering. The first involves minimizing an objective function based on measures of association, in particular on phi-squared and mutual information. The second uses a model-based co-clustering approach, and we consider two models: the block model and the latent block model. We establish connections between different approaches, criteria and algorithms, and we highlight a number of implicit assumptions in some commonly used algorithms. Our contribution is illustrated by numerical experiments on simulated and real-world datasets that show the relevance of the presented methods in the document clustering field. PubDate: 2018-09-01 DOI: 10.1007/s11634-016-0274-6 Issue No: Vol. 12, No. 3 (2018)
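The mutual-information criterion for co-clustering a contingency table can be sketched as follows: collapse the table into blocks according to the row and column partitions and measure the mutual information retained. This is a minimal illustration of the criterion only, not of the paper's optimization algorithms; the function and variable names are ours.

```python
import numpy as np

def block_mutual_information(table, row_labels, col_labels):
    """Mutual information of the collapsed (co-clustered) contingency table.

    Co-clustering criteria of this kind seek row/column partitions that
    preserve as much of the original table's mutual information as possible.
    """
    table = np.asarray(table, dtype=float)
    k = row_labels.max() + 1
    l = col_labels.max() + 1
    # Collapse the table into k x l blocks by summing within clusters.
    blocks = np.zeros((k, l))
    for i, r in enumerate(row_labels):
        for j, c in enumerate(col_labels):
            blocks[r, c] += table[i, j]
    p = blocks / blocks.sum()
    pr = p.sum(axis=1, keepdims=True)   # row-cluster marginals
    pc = p.sum(axis=0, keepdims=True)   # column-cluster marginals
    mask = p > 0
    return float((p[mask] * np.log(p[mask] / (pr @ pc)[mask])).sum())
```

A good co-clustering keeps this quantity high (here, the trivial one-row-cluster partition would drive it to zero), which is why maximizing it over partitions is a sensible objective.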

Authors: Aurore Lomet; Gérard Govaert; Yves Grandvalet Pages: 489 - 508 Abstract: Block clustering aims to reveal homogeneous block structures in a data table. Among the different approaches to block clustering, we consider here a model-based method: the Gaussian latent block model for continuous data, which is an extension of the Gaussian mixture model for one-way clustering. For a given data table, several candidate models are usually examined, which differ, for example, in the number of clusters. Model selection then becomes a critical issue. To this end, we develop a criterion based on an approximation of the integrated classification likelihood for the Gaussian latent block model, and propose a Bayesian information criterion-like variant following the same pattern. We also propose a non-asymptotic exact criterion, thus circumventing the controversial definition of the asymptotic regime arising from the dual nature of the rows and columns in co-clustering. The experimental results show steady performance of these criteria for medium to large data tables. PubDate: 2018-09-01 DOI: 10.1007/s11634-013-0161-3 Issue No: Vol. 12, No. 3 (2018)
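The model-selection workflow the paper addresses can be sketched for ordinary (one-way) Gaussian mixtures: fit candidate models with different numbers of components and pick the one minimizing an information criterion. This uses scikit-learn's BIC as a stand-in; it is not the latent block criterion developed in the paper, and the function name is ours.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def select_k_by_bic(X, k_max=5, seed=0):
    """Fit Gaussian mixtures with 1..k_max components and return the
    number of components minimizing BIC (lower is better)."""
    bics = []
    for k in range(1, k_max + 1):
        gm = GaussianMixture(n_components=k, n_init=5, random_state=seed).fit(X)
        bics.append(gm.bic(X))
    return int(np.argmin(bics)) + 1
```

For co-clustering, the same loop would run over pairs (number of row clusters, number of column clusters), which is exactly where the asymptotic regime becomes ambiguous and the paper's non-asymptotic criterion helps.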

Authors: Romain Guigourès; Marc Boullé; Fabrice Rossi Pages: 509 - 536 Abstract: This paper introduces a novel technique to track structures in time-varying graphs. The method uses a maximum a posteriori approach to adjust a three-dimensional co-clustering of the source vertices, the destination vertices and the time to the data under study, in a way that does not require any hyper-parameter tuning. The three dimensions are simultaneously segmented in order to build clusters of source vertices, destination vertices and time segments where the edge distributions across clusters of vertices follow the same evolution over the time segments. The main novelty of this approach lies in the fact that the time segments are directly inferred from the evolution of the edge distribution between the vertices, thus not requiring the user to make any a priori quantization. Experiments conducted on artificial data illustrate the good behavior of the technique, and a study of a real-life data set shows the potential of the proposed approach for exploratory data analysis. PubDate: 2018-09-01 DOI: 10.1007/s11634-015-0218-6 Issue No: Vol. 12, No. 3 (2018)

Authors: Parvin Ahmadi; Iman Gholampour; Mahmoud Tabandeh Pages: 537 - 558 Abstract: In this paper, we introduce a document clustering method based on Sparse Topical Coding, called Cluster-based Sparse Topical Coding. Topic modeling is capable of improving textual document clustering by describing documents via bag-of-words models and projecting them into a topic space. The latent semantic descriptions derived by the topic model can be utilized as features in a clustering process. In our proposed method, document clustering and topic modeling are integrated in a unified framework in order to achieve the highest performance. This framework includes Sparse Topical Coding, which is responsible for topic mining, and K-means, which discovers the latent clusters in the document collection. Experimental results on widely used datasets show that our proposed method significantly outperforms traditional and other topic-model-based clustering methods. Our method achieves from 4 to 39% improvement in clustering accuracy and from 2% to more than 44% improvement in normalized mutual information. PubDate: 2018-09-01 DOI: 10.1007/s11634-017-0280-3 Issue No: Vol. 12, No. 3 (2018)
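The "topic features, then cluster" pipeline described above can be sketched with off-the-shelf tools. NMF stands in here for Sparse Topical Coding, which is not available in scikit-learn, so this only illustrates the two-stage baseline the paper improves on by coupling the stages; the toy corpus is ours.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF
from sklearn.cluster import KMeans

def topic_then_cluster(docs, n_topics=2, n_clusters=2, seed=0):
    """Project documents into a topic space, then cluster the
    topic-space representations with K-means."""
    X = TfidfVectorizer().fit_transform(docs)                    # bag-of-words features
    W = NMF(n_components=n_topics, random_state=seed,
            max_iter=500).fit_transform(X)                       # document-topic weights
    return KMeans(n_clusters=n_clusters, n_init=10,
                  random_state=seed).fit_predict(W)              # cluster labels
```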

Authors: Kohei Adachi; Nickolay T. Trendafilov Pages: 559 - 585 Abstract: We propose a new procedure for sparse factor analysis (FA) such that each variable loads only one common factor. Thus, the loading matrix has a single nonzero element in each row and zeros elsewhere. Such a loading matrix is the sparsest possible for a given number of variables and common factors. For this reason, the proposed method is named sparsest FA (SSFA). It may also be called FA-based variable clustering, since the variables loading the same common factor can be classified into a cluster. In SSFA, all model parts of FA (common factors, their correlations, loadings, unique factors, and unique variances) are treated as fixed unknown parameter matrices and their least squares function is minimized through a specific data matrix decomposition. A useful feature of the algorithm is that the matrix of common factor scores is re-parameterized using QR decomposition in order to efficiently estimate factor correlations. A simulation study shows that the proposed procedure can exactly identify the true sparsest models. Real data examples demonstrate the usefulness of the variable clustering performed by SSFA. PubDate: 2018-09-01 DOI: 10.1007/s11634-017-0284-z Issue No: Vol. 12, No. 3 (2018)
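The sparsest loading structure, and the variable clustering it induces, look like this; the loading values below are hypothetical and only illustrate the one-nonzero-per-row pattern, not the SSFA estimation itself.

```python
import numpy as np

# Hypothetical sparsest loading matrix: 6 variables, 2 common factors,
# exactly one nonzero loading per row (the SSFA sparsity pattern).
L = np.array([
    [0.9, 0.0],
    [0.8, 0.0],
    [0.7, 0.0],
    [0.0, 0.6],
    [0.0, 0.9],
    [0.0, 0.8],
])
# Variable clustering induced by the sparsity pattern: each variable
# belongs to the cluster of the single factor it loads.
clusters = np.argmax(np.abs(L) > 0, axis=1)
```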

Authors: Mercedes Fernandez Sau; Daniela Rodriguez Pages: 587 - 603 Abstract: In this paper, we propose minimum-distance estimators for the unknown parameters of a parametric density on the unit sphere. We show that these estimators are consistent and asymptotically normally distributed. We also apply our proposal to develop a method that allows us to detect potential atypical values. The small-sample behavior of the proposed estimators is studied using Monte Carlo simulations. Two applications of our procedure are illustrated with real data sets. PubDate: 2018-09-01 DOI: 10.1007/s11634-017-0287-9 Issue No: Vol. 12, No. 3 (2018)

Authors: A. Felipe; N. Martín; P. Miranda; L. Pardo Pages: 605 - 636 Abstract: In this paper we explore the possibilities of applying \(\phi\)-divergence measures to inferential problems in the field of latent class models (LCMs) for multinomial data. We first treat the problem of estimating the model parameters. As explained below, the minimum \(\phi\)-divergence estimators (M\(\phi\)Es) considered in this paper are a natural extension of the maximum likelihood estimator (MLE), the usual estimator for this problem; we study the asymptotic properties of M\(\phi\)Es, showing that they share the same asymptotic distribution as the MLE. To compare the efficiency of the M\(\phi\)Es when the sample size is not large enough to apply the asymptotic results, we have carried out an extensive simulation study; from this study, we conclude that there are estimators in this family that are competitive with the MLE. Next, we deal with the problem of testing whether an LCM for multinomial data fits a data set; again, \(\phi\)-divergence measures can be used to generate a family of test statistics generalizing both the classical likelihood ratio test and the chi-squared test statistics. Finally, we treat the problem of choosing the best model out of a sequence of nested LCMs; as before, \(\phi\)-divergence measures can handle the problem, and we derive a family of \(\phi\)-divergence test statistics based on them; we study the asymptotic behavior of these test statistics, showing that it matches that of the classical test statistics. A simulation study for small and moderate sample sizes shows that there are test statistics in the family that can compete with the classical likelihood ratio and chi-squared test statistics. PubDate: 2018-09-01 DOI: 10.1007/s11634-017-0289-7 Issue No: Vol. 12, No. 3 (2018)
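The way \(\phi\)-divergences unify the likelihood ratio and chi-squared statistics can be made concrete for a simple multinomial goodness-of-fit test. The statistic is \(T_\phi = \frac{2n}{\phi''(1)} \sum_i p_{0i}\,\phi(\hat{p}_i/p_{0i})\): choosing \(\phi(t) = t\log t - t + 1\) gives the likelihood ratio (G) statistic, and \(\phi(t) = (t-1)^2/2\) gives Pearson's chi-squared. This sketch uses a fully specified null rather than a fitted LCM; the function names are ours.

```python
import numpy as np

def phi_divergence_stat(counts, p0, phi, phi_dd1=1.0):
    """Goodness-of-fit statistic T = (2n / phi''(1)) * sum_i p0_i * phi(phat_i / p0_i)."""
    counts = np.asarray(counts, dtype=float)
    n = counts.sum()
    phat = counts / n
    return 2.0 * n / phi_dd1 * np.sum(p0 * phi(phat / np.asarray(p0)))

# phi for the likelihood-ratio (Kullback-Leibler) statistic; value 1 at t = 0 by continuity.
kl = lambda t: np.where(t > 0, t * np.log(np.where(t > 0, t, 1.0)) - t + 1, 1.0)
# phi for Pearson's chi-squared statistic.
pearson = lambda t: 0.5 * (t - 1.0) ** 2
```

Both members have the same phi''(1) = 1 normalization, which is what makes them asymptotically chi-squared with the same degrees of freedom.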

Authors: Ana Justel; Marcela Svarc Pages: 637 - 656 Abstract: This paper presents DivClusFD, a new divisive hierarchical method for the non-supervised classification of functional data. Data of this type present the peculiarity that the differences among clusters may be caused by changes in level as well as in shape. Different clusters can be separated in different subregions, and there may be no subregion in which all clusters are separated. In each step of division, the DivClusFD method explores the functions and their derivatives at several fixed points, seeking the subregion in which the highest number of clusters can be separated. The number of clusters is estimated via the gap statistic. The functions are assigned to the new clusters by combining the k-means algorithm with the use of functional boxplots to identify functions that have been incorrectly classified because of their atypical local behavior. The DivClusFD method provides the number of clusters, the classification of the observed functions into the clusters and guidelines that may be useful for interpreting the clusters. A simulation study using synthetic data and tests of the performance of the DivClusFD method on real data sets indicate that this method is able to classify functions accurately. PubDate: 2018-09-01 DOI: 10.1007/s11634-017-0290-1 Issue No: Vol. 12, No. 3 (2018)
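The gap statistic used above to estimate the number of clusters compares the log within-cluster dispersion of the data with its expectation under a uniform reference distribution. A minimal multivariate sketch (not the functional-data version used by DivClusFD; function name is ours):

```python
import numpy as np
from sklearn.cluster import KMeans

def gap_curve(X, k_max=3, n_ref=5, seed=0):
    """Gap statistic of Tibshirani et al.: mean log within-cluster dispersion
    of uniform reference data minus that of the observed data, for k = 1..k_max."""
    rng = np.random.default_rng(seed)
    lo, hi = X.min(axis=0), X.max(axis=0)   # bounding box for the reference
    gaps = []
    for k in range(1, k_max + 1):
        wk = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X).inertia_
        ref = [KMeans(n_clusters=k, n_init=10, random_state=seed)
               .fit(rng.uniform(lo, hi, X.shape)).inertia_
               for _ in range(n_ref)]
        gaps.append(float(np.mean(np.log(ref)) - np.log(wk)))
    return gaps
```

The estimated number of clusters is then the k maximizing the gap (or, in the original rule, the smallest k whose gap is within one standard error of the next one).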

Authors: Andrew Marchese; Vasileios Maroulas Pages: 657 - 682 Abstract: In this paper, we consider the problem of signal classification. First, the signal is translated into a persistence diagram through the use of delay embedding and persistent homology. Endowing the data space of persistence diagrams with a metric from point processes, we show that it admits statistical structure in the form of Fréchet means and variances, and a classification scheme is established. In contrast with the Wasserstein distance, this metric accounts for changes in small persistence and changes in cardinality. The classification results using this distance are benchmarked on both synthetic data and real acoustic signals, and it is demonstrated that this classifier outperforms current signal classification techniques. PubDate: 2018-09-01 DOI: 10.1007/s11634-017-0294-x Issue No: Vol. 12, No. 3 (2018)

Authors: Juana-María Vivo; Manuel Franco; Donatella Vicari Pages: 683 - 704 Abstract: The area under a receiver operating characteristic (ROC) curve is valuable for evaluating the classification performance described by the entire ROC curve in many fields, including decision making and medical diagnosis. However, it can be misleading when clinical tasks demand a restricted specificity range. The partial area under a portion of the ROC curve (\(pAUC\)) has more practical relevance in such situations, but it is usually transformed to overcome some drawbacks and improve its interpretation. The standardized \(pAUC\) (\(SpAUC\)) index is considered a meaningful relative measure of predictive accuracy. Nevertheless, this \(SpAUC\) index might still show some limitations due to ROC curves crossing the diagonal line, and to the problem of comparing two tests with crossing ROC curves in the same restricted specificity range. This paper provides an alternative \(pAUC\) index which overcomes these limitations. Tighter bounds for the \(pAUC\) of an ROC curve are derived, and then a modified \(pAUC\) index for any restricted specificity range is established. In addition, the proposed tighter partial area index (\(TpAUC\)) is also shown for classifiers when high specificity must be clinically maintained. The variance of the \(TpAUC\) is studied analytically and by simulation studies in a theoretical framework based on the typical assumption of a binormal model, and estimated by using nonparametric bootstrap resampling in the empirical examples. Simulated and real datasets illustrate the practical utility of the \(TpAUC\). PubDate: 2018-09-01 DOI: 10.1007/s11634-017-0295-9 Issue No: Vol. 12, No. 3 (2018)
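The standardized \(pAUC\) that the paper's \(TpAUC\) refines is available off the shelf: scikit-learn's `roc_auc_score` with `max_fpr` computes the partial AUC over a restricted false-positive (i.e. 1 − specificity) range with the McClish standardization. The \(TpAUC\) itself, with its tighter bounds, is not in standard libraries; the toy scores below are ours.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Standardized partial AUC over the high-specificity range FPR <= 0.2
# (McClish correction, as implemented by scikit-learn's max_fpr option).
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1])
scores = np.array([0.1, 0.2, 0.3, 0.6, 0.5, 0.7, 0.8, 0.9])
spauc = roc_auc_score(y_true, scores, max_fpr=0.2)
```

Like the full AUC, the standardized index equals 0.5 for an uninformative classifier and 1.0 for a perfect one over the restricted range.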

Authors: Irene Epifanio; María Victoria Ibáñez; Amelia Simó Pages: 705 - 735 Abstract: Archetype and archetypoid analysis are extended to shapes. The objective is to find representative shapes; archetypal shapes are pure (extreme) shapes. We focus on the case where the shape of an object is represented by a configuration matrix of landmarks. As shape space is not a vector space, we work in the tangent space, the linearized space about the mean shape. Each observation is then approximated by a convex combination of actual observations (archetypoids) or of archetypes, which are themselves convex combinations of observations in the data set. These tools can contribute to the understanding of shapes, as in the usual multivariate case, since they lie somewhere between clustering and matrix factorization methods. A new simplex visualization tool is also proposed to provide a picture of the archetypal analysis results. We also propose new algorithms for performing archetypal analysis with missing data and its extension to incomplete shapes. A well-known data set is used to illustrate the methodologies developed, and the proposed methodology is applied to an apparel design problem in children. PubDate: 2018-09-01 DOI: 10.1007/s11634-017-0297-7 Issue No: Vol. 12, No. 3 (2018)

Authors: Gerhard Tutz; Moritz Berger Pages: 737 - 758 Abstract: Generalized linear and additive models are very efficient regression tools, but many parameters have to be estimated if categorical predictors with many categories are included. The method proposed here focusses on the main effects of categorical predictors by using tree-type methods to obtain clusters of categories. When a predictor has many categories, one wants to know in particular which of the categories have to be distinguished with respect to their effect on the response. The tree-structured approach makes it possible to detect clusters of categories that share the same effect, while letting other predictors, in particular metric predictors, have a linear or additive effect on the response. An algorithm for the fitting is proposed and various stopping criteria are evaluated. The preferred stopping criterion is based on p values representing a conditional inference procedure. In addition, the stability of clusters and the relevance of predictors are investigated by bootstrap methods. Several applications show the usefulness of the tree-structured approach, and small simulation studies demonstrate that the fitting procedure works well. PubDate: 2018-09-01 DOI: 10.1007/s11634-017-0298-6 Issue No: Vol. 12, No. 3 (2018)
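The core idea of fusing categories with indistinguishable effects can be shown on a toy example: estimate a mean response per category, then merge categories whose (sorted) means are close. The paper grows a tree with p-value-based stopping; the distance threshold and data below are a simplification of ours.

```python
import numpy as np

# Toy responses and category labels for a 4-level categorical predictor.
y = np.array([1.0, 1.1, 0.9, 3.0, 3.2, 2.9, 1.05, 3.1])
g = np.array([0, 0, 1, 2, 2, 3, 1, 3])
means = np.array([y[g == k].mean() for k in range(4)])

# Fuse consecutive (sorted-by-mean) categories whose effects differ by < 0.5.
order = np.argsort(means)
cluster_of = np.zeros(4, dtype=int)
c = 0
for i, k in enumerate(order[1:], start=1):
    if means[k] - means[order[i - 1]] >= 0.5:
        c += 1                     # effects differ: open a new cluster
    cluster_of[k] = c
```

Here categories 0 and 1 (effects near 1) are fused, as are categories 2 and 3 (effects near 3), so only one contrast between the two fused groups needs to be estimated.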

Authors: Pasquale Dolce; Vincenzo Esposito Vinzi; Natale Carlo Lauro Pages: 759 - 784 Abstract: Partial least squares path modeling presents some inconsistencies in terms of coherence with the predictive directions specified in the inner model (i.e. the path directions), because the directions of the links in the inner model are not taken into account in the iterative algorithm. In fact, the procedure amplifies interdependence among blocks and fails to distinguish between dependent and explanatory blocks. The method proposed in this paper takes into account and respects the specified path directions, with the aim of improving the predictive ability of the model while maintaining the hypothesized theoretical inner model. To highlight its properties, the proposed method is compared to classical PLS path modeling in terms of explained variability, predictive relevance and interpretation, using artificial data and a real-data application. A further development of the method allows the treatment of multi-dimensional blocks in composite-based path modeling. PubDate: 2018-09-01 DOI: 10.1007/s11634-017-0302-1 Issue No: Vol. 12, No. 3 (2018)

Authors: A. Pedro Duarte Silva; Peter Filzmoser; Paula Brito Pages: 785 - 822 Abstract: A multivariate outlier detection method for interval data is proposed that makes use of a parametric approach to model the interval data. The trimmed maximum likelihood principle is adapted in order to robustly estimate the model parameters. A simulation study demonstrates the usefulness of the robust estimates for outlier detection, and new diagnostic plots allow gaining deeper insight into the structure of real-world interval data. PubDate: 2018-09-01 DOI: 10.1007/s11634-017-0305-y Issue No: Vol. 12, No. 3 (2018)

Abstract: The Gaussian process is a common model in a wide variety of applications, such as environmental modeling, computer experiments, and geology. Two major challenges often arise: first, assuming that the process of interest is stationary over the entire domain often proves untenable; second, the traditional Gaussian process model formulation is computationally inefficient for large datasets. In this paper, we propose a new Gaussian process model to tackle these problems, based on the convolution of a smoothing kernel with a partitioned latent process. Nonstationarity can be modeled by allowing a separate latent process for each partition, which approximates a regional clustering structure. Partitioning follows a binary tree generating process similar to that of Classification and Regression Trees. A Bayesian approach is used to estimate the partitioning structure and model parameters simultaneously. Our motivating dataset consists of 11,918 precipitation anomalies. Results show that our model has promising prediction performance and is computationally efficient for large datasets. PubDate: 2018-09-15 DOI: 10.1007/s11634-018-0341-2

Authors: Nicola Loperfido Abstract: Finite mixtures of multivariate distributions play a fundamental role in model-based clustering. However, they pose several problems, especially in the presence of many irrelevant variables. Dimension reduction methods, such as projection pursuit, are commonly used to address these problems. In this paper, we use skewness-maximizing projections to recover the subspace which optimally separates the cluster means. Skewness might then be removed in order to search for other potentially interesting data structures or to perform skewness-sensitive statistical analyses, such as Hotelling's \(T^{2}\) test. Our approach is algebraic in nature and deals with the symmetric tensor rank of the third multivariate cumulant. We also derive closed-form expressions for the symmetric tensor rank of the third cumulants of several multivariate mixture models, including mixtures of skew-normal distributions and mixtures of two symmetric components with proportional covariance matrices. The theoretical results in this paper shed some light on the connection between the estimated number of mixture components and their skewness. PubDate: 2018-09-06 DOI: 10.1007/s11634-018-0336-z
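Why a skewness-maximizing projection picks up the direction separating unbalanced cluster means can be seen with a brute-force 2-D sketch: scan directions on the half-circle and keep the one with the largest absolute sample skewness. The paper works algebraically with the third-cumulant tensor rather than by search, and the simulated data here are ours.

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(0)
# Unbalanced two-component mixture: minority cluster displaced along the x-axis,
# so projections near that axis are the most skewed.
X = np.vstack([rng.normal(0, 1, (300, 2)),
               rng.normal([6, 0], 1, (100, 2))])

# Scan projection directions u(theta) on the half-circle [0, pi).
thetas = np.linspace(0, np.pi, 360, endpoint=False)
skews = np.array([skew(X @ np.array([np.cos(t), np.sin(t)])) for t in thetas])
t_best = thetas[np.argmax(np.abs(skews))]
best_u = np.array([np.cos(t_best), np.sin(t_best)])   # direction of maximal skewness
```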

Abstract: We propose a novel extension of nonparametric multivariate finite mixture models by dropping the standard conditional independence assumption and incorporating the independent component analysis (ICA) structure instead. This innovation extends nonparametric mixture model estimation methods to situations in which conditional independence, a necessary assumption for the unique identifiability of the parameters in such models, is clearly violated. We formulate an objective function in terms of penalized smoothed Kullback–Leibler distance and introduce the nonlinear smoothed majorization-minimization independent component analysis algorithm for optimizing this function and estimating the model parameters. Our algorithm does not require any labeled observations a priori; it may be used for fully unsupervised clustering problems in a multivariate setting. We have implemented a practical version of this algorithm, which utilizes the FastICA algorithm, in the R package icamix. We illustrate this new methodology using several applications in unsupervised learning and image processing. PubDate: 2018-08-28 DOI: 10.1007/s11634-018-0338-x
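The FastICA building block mentioned above recovers independent non-Gaussian sources from linear mixtures; a minimal demonstration with scikit-learn (the icamix package itself is in R, and the toy sources and mixing matrix below are ours):

```python
import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(0)
# Two independent non-Gaussian sources: heavy-tailed and bounded.
S = np.column_stack([rng.laplace(size=2000),
                     rng.uniform(-1, 1, size=2000)])
A = np.array([[1.0, 0.5], [0.5, 1.0]])   # mixing matrix
X = S @ A.T                              # observed mixtures
# Unmix: each recovered column matches one true source up to sign and order.
S_hat = FastICA(n_components=2, random_state=0).fit_transform(X)
```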

Authors: Sylvia Frühwirth-Schnatter; Gertraud Malsiner-Walli Abstract: In model-based clustering, mixture models are used to group data points into clusters. A useful concept, introduced for Gaussian mixtures by Malsiner Walli et al. (Stat Comput 26:303–324, 2016), is that of sparse finite mixtures, where the prior distribution on the weights of a mixture with K components is chosen in such a way that, a priori, the number of clusters in the data is random and is allowed to be smaller than K with high probability. The number of clusters is then inferred a posteriori from the data. The present paper makes the following contributions in the context of sparse finite mixture modelling. First, it is illustrated that the concept of sparse finite mixtures is very generic and easily extended to clustering various types of non-Gaussian data, in particular discrete data and continuous multivariate data arising from non-Gaussian clusters. Second, sparse finite mixtures are compared to Dirichlet process mixtures with respect to their ability to identify the number of clusters. For both model classes, a random hyperprior is considered for the parameters determining the weight distribution. By suitable matching of these priors, it is shown that the choice of this hyperprior is far more influential on the cluster solution than whether a sparse finite mixture or a Dirichlet process mixture is used. PubDate: 2018-08-24 DOI: 10.1007/s11634-018-0329-y
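The sparse-finite-mixture mechanism, overfit with K components but let a small symmetric Dirichlet prior on the weights empty the superfluous ones a posteriori, can be mimicked with scikit-learn's variational mixture. This is only an analogue: the paper uses MCMC with a hyperprior on the Dirichlet parameter, and the data below are simulated by us.

```python
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

rng = np.random.default_rng(0)
# Two well-separated Gaussian clusters.
X = np.vstack([rng.normal(0, 1, (200, 2)), rng.normal(8, 1, (200, 2))])

bgm = BayesianGaussianMixture(
    n_components=10,                                         # deliberately overfitted K
    weight_concentration_prior_type="dirichlet_distribution",
    weight_concentration_prior=1e-3,                         # sparse prior on the weights
    max_iter=500,
    random_state=0,
).fit(X)
# Components with non-negligible posterior weight = inferred clusters.
n_clusters = int(np.sum(bgm.weights_ > 0.05))
```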

Authors: Camila Borelli Zeller; Celso Rômulo Barbosa Cabral; Víctor Hugo Lachos; Luis Benites Abstract: In statistical analysis, particularly in econometrics, finite mixtures of regression models based on the normality assumption are routinely used to analyze censored data. In this work, an extension of this model is proposed by considering scale mixtures of normal distributions (SMN). This approach allows us to model data with great flexibility, accommodating multimodality and heavy tails at the same time. The main virtue of considering finite mixtures of regression models for censored data under the SMN class is that this class of models has a nice hierarchical representation which allows easy implementation of inference. We develop a simple EM-type algorithm to perform maximum likelihood inference for the parameters of the proposed model. To examine the performance of the proposed method, we present some simulation studies and analyze a real dataset. The proposed algorithm and methods are implemented in the new R package CensMixReg. PubDate: 2018-08-24 DOI: 10.1007/s11634-018-0337-y