Authors:Gérard Govaert; Mohamed Nadif Pages: 455 - 488 Abstract: Many of the datasets encountered in statistics are two-dimensional in nature and can be represented by a matrix. Classical clustering procedures seek to construct separately an optimal partition of rows or, sometimes, of columns. In contrast, co-clustering methods cluster the rows and the columns simultaneously and organize the data into homogeneous blocks (after suitable permutations). Methods of this kind have practical importance in a wide variety of applications such as document clustering, where data are typically organized in two-way contingency tables. Our goal is to offer coherent frameworks for understanding some existing criteria and algorithms for co-clustering contingency tables, and to propose new ones. We look at two different frameworks for the problem of co-clustering. The first involves minimizing an objective function based on measures of association and in particular on phi-squared and mutual information. The second uses a model-based co-clustering approach, and we consider two models: the block model and the latent block model. We establish connections between different approaches, criteria and algorithms, and we highlight a number of implicit assumptions in some commonly used algorithms. Our contribution is illustrated by numerical experiments on simulated and real-case datasets that show the relevance of the presented methods in the document clustering field. PubDate: 2018-09-01 DOI: 10.1007/s11634-016-0274-6 Issue No:Vol. 12, No. 3 (2018)
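
As a rough illustration of the mutual-information criterion mentioned in the abstract above, the following Python sketch aggregates a contingency table into blocks for a given row and column partition and computes the mutual information of the result. It is a minimal sketch under stated assumptions, not the authors' algorithm: the function names `block_table` and `mutual_information` and the toy table are hypothetical, and the partitions are supplied by hand rather than optimized.

```python
from math import log


def block_table(table, row_labels, col_labels, n_row_clusters, n_col_clusters):
    """Sum the cells of a contingency table within each (row cluster, column cluster) block."""
    blocks = [[0.0] * n_col_clusters for _ in range(n_row_clusters)]
    for i, row in enumerate(table):
        for j, count in enumerate(row):
            blocks[row_labels[i]][col_labels[j]] += count
    return blocks


def mutual_information(table):
    """Mutual information (in nats) of the joint distribution given by a table of counts."""
    total = sum(sum(row) for row in table)
    row_sums = [sum(row) for row in table]
    col_sums = [sum(col) for col in zip(*table)]
    mi = 0.0
    for i, row in enumerate(table):
        for j, count in enumerate(row):
            if count > 0:  # empty cells contribute nothing
                mi += (count / total) * log(count * total / (row_sums[i] * col_sums[j]))
    return mi


# Hypothetical toy table with two clear diagonal blocks.
toy = [
    [5, 5, 0, 0],
    [5, 5, 0, 0],
    [0, 0, 5, 5],
    [0, 0, 5, 5],
]
good = block_table(toy, [0, 0, 1, 1], [0, 0, 1, 1], 2, 2)
bad = block_table(toy, [0, 1, 0, 1], [0, 1, 0, 1], 2, 2)
```

In this toy case the partition that matches the block structure keeps all of the mutual information of the original table, while the partition that mixes the blocks loses all of it, which is the intuition behind using mutual information as a co-clustering objective.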

Authors:Aurore Lomet; Gérard Govaert; Yves Grandvalet Pages: 489 - 508 Abstract: Block clustering aims to reveal homogeneous block structures in a data table. Among the different approaches of block clustering, we consider here a model-based method: the Gaussian latent block model for continuous data which is an extension of the Gaussian mixture model for one-way clustering. For a given data table, several candidate models are usually examined, which differ for example in the number of clusters. Model selection then becomes a critical issue. To this end, we develop a criterion based on an approximation of the integrated classification likelihood for the Gaussian latent block model, and propose a Bayesian information criterion-like variant following the same pattern. We also propose a non-asymptotic exact criterion, thus circumventing the controversial definition of the asymptotic regime arising from the dual nature of the rows and columns in co-clustering. The experimental results show steady performances of these criteria for medium to large data tables. PubDate: 2018-09-01 DOI: 10.1007/s11634-013-0161-3 Issue No:Vol. 12, No. 3 (2018)

Authors:Romain Guigourès; Marc Boullé; Fabrice Rossi Pages: 509 - 536 Abstract: This paper introduces a novel technique to track structures in time varying graphs. The method uses a maximum a posteriori approach for adjusting a three-dimensional co-clustering of the source vertices, the destination vertices and the time, to the data under study, in a way that does not require any hyper-parameter tuning. The three dimensions are simultaneously segmented in order to build clusters of source vertices, destination vertices and time segments where the edge distributions across clusters of vertices follow the same evolution over the time segments. The main novelty of this approach lies in that the time segments are directly inferred from the evolution of the edge distribution between the vertices, thus not requiring the user to make any a priori quantization. Experiments conducted on artificial data illustrate the good behavior of the technique, and a study of a real-life data set shows the potential of the proposed approach for exploratory data analysis. PubDate: 2018-09-01 DOI: 10.1007/s11634-015-0218-6 Issue No:Vol. 12, No. 3 (2018)

Authors:Parvin Ahmadi; Iman Gholampour; Mahmoud Tabandeh Pages: 537 - 558 Abstract: In this paper, we introduce a document clustering method based on Sparse Topical Coding, called Cluster-based Sparse Topical Coding. Topic modeling is capable of improving textual document clustering by describing documents via bag-of-words models and projecting them into a topic space. The latent semantic descriptions derived by the topic model can be utilized as features in a clustering process. In our proposed method, document clustering and topic modeling are integrated in a unified framework in order to achieve the highest performance. This framework includes Sparse Topical Coding, which is responsible for topic mining, and K-means, which discovers the latent clusters in the document collection. Experimental results on widely-used datasets show that our proposed method significantly outperforms the traditional and other topic-model-based clustering methods. Our method achieves improvements of 4% to 39% in clustering accuracy and of 2% to more than 44% in normalized mutual information. PubDate: 2018-09-01 DOI: 10.1007/s11634-017-0280-3 Issue No:Vol. 12, No. 3 (2018)

Authors:Kohei Adachi; Nickolay T. Trendafilov Pages: 559 - 585 Abstract: We propose a new procedure for sparse factor analysis (FA) such that each variable loads only one common factor. Thus, the loading matrix has a single nonzero element in each row and zeros elsewhere. Such a loading matrix is the sparsest possible for a given number of variables and common factors. For this reason, the proposed method is named sparsest FA (SSFA). It may also be called FA-based variable clustering, since the variables loading the same common factor can be classified into a cluster. In SSFA, all model parts of FA (common factors, their correlations, loadings, unique factors, and unique variances) are treated as fixed unknown parameter matrices and their least squares function is minimized through a specific data matrix decomposition. A useful feature of the algorithm is that the matrix of common factor scores is re-parameterized using QR decomposition in order to efficiently estimate factor correlations. A simulation study shows that the proposed procedure can exactly identify the true sparsest models. Real data examples demonstrate the usefulness of the variable clustering performed by SSFA. PubDate: 2018-09-01 DOI: 10.1007/s11634-017-0284-z Issue No:Vol. 12, No. 3 (2018)

Authors:Mercedes Fernandez Sau; Daniela Rodriguez Pages: 587 - 603 Abstract: In this paper, we propose estimators based on the minimum distance for the unknown parameters of a parametric density on the unit sphere. We show that these estimators are consistent and asymptotically normally distributed. Also, we apply our proposal to develop a method that allows us to detect potential atypical values. The behavior under small samples of the proposed estimators is studied using Monte Carlo simulations. Two applications of our procedure are illustrated with real data sets. PubDate: 2018-09-01 DOI: 10.1007/s11634-017-0287-9 Issue No:Vol. 12, No. 3 (2018)

Authors:A. Felipe; N. Martín; P. Miranda; L. Pardo Pages: 605 - 636 Abstract: In this paper we explore the possibilities of applying \(\phi\)-divergence measures in inferential problems in the field of latent class models (LCMs) for multinomial data. We first treat the problem of estimating the model parameters. As explained below, minimum \(\phi\)-divergence estimators (M\(\phi\)Es) considered in this paper are a natural extension of the maximum likelihood estimator (MLE), the usual estimator for this problem; we study the asymptotic properties of M\(\phi\)Es, showing that they share the same asymptotic distribution as the MLE. To compare the efficiency of the M\(\phi\)Es when the sample size is not big enough to apply the asymptotic results, we have carried out an extensive simulation study; from this study, we conclude that there are estimators in this family that are competitive with the MLE. Next, we deal with the problem of testing whether a LCM for multinomial data fits a data set; again, \(\phi\)-divergence measures can be used to generate a family of test statistics generalizing both the classical likelihood ratio test and the chi-squared test statistics. Finally, we treat the problem of choosing the best model out of a sequence of nested LCMs; as before, \(\phi\)-divergence measures can handle the problem and we derive a family of \(\phi\)-divergence test statistics based on them; we study the asymptotic behavior of these test statistics, showing that it is the same as the classical test statistics. A simulation study for small and moderate sample sizes shows that there are some test statistics in the family that can compete with the classical likelihood ratio and the chi-squared test statistics. PubDate: 2018-09-01 DOI: 10.1007/s11634-017-0289-7 Issue No:Vol. 12, No. 3 (2018)
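
For readers unfamiliar with the terminology in the abstract above, the standard textbook definition of a \(\phi\)-divergence between two discrete distributions may help. The display below is a sketch of that general definition, not a formula taken from the paper: the symbols \(\hat{p}\) (observed relative frequencies), \(p(\theta)\) (model probabilities), \(k\) (number of cells) and \(n\) (sample size) are notation assumed here for illustration.

```latex
\[
  D_{\phi}(p, q) = \sum_{i=1}^{k} q_i \, \phi\!\left(\frac{p_i}{q_i}\right),
  \qquad \phi \text{ convex}, \quad \phi(1) = 0 ,
\]
\[
  \hat{\theta}_{\phi} = \arg\min_{\theta} \, D_{\phi}\bigl(\hat{p}, \, p(\theta)\bigr).
\]
% Two familiar special cases:
\[
  \phi(x) = x \log x - x + 1
  \;\Longrightarrow\;
  D_{\phi}(p, q) = \sum_{i=1}^{k} p_i \log \frac{p_i}{q_i}
  \quad \text{(Kullback--Leibler divergence)},
\]
\[
  \phi(x) = \tfrac{1}{2}(x - 1)^2
  \;\Longrightarrow\;
  D_{\phi}(p, q) = \frac{1}{2} \sum_{i=1}^{k} \frac{(p_i - q_i)^2}{q_i}
  \quad \text{(Pearson chi-squared statistic divided by } 2n \text{)}.
\]
```

With the Kullback-Leibler choice of \(\phi\), minimizing \(D_{\phi}(\hat{p}, p(\theta))\) over \(\theta\) is equivalent to maximizing \(\sum_i \hat{p}_i \log p_i(\theta)\), which is why the MLE appears as a member of the family.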

Authors:Ana Justel; Marcela Svarc Pages: 637 - 656 Abstract: This paper presents DivClusFD, a new divisive hierarchical method for the non-supervised classification of functional data. Data of this type present the peculiarity that the differences among clusters may be caused by changes in level as well as in shape. Different clusters can be separated in different subregions and there may be no subregion in which all clusters are separated. In each step of division, the DivClusFD method explores the functions and their derivatives at several fixed points, seeking the subregion in which the highest number of clusters can be separated. The number of clusters is estimated via the gap statistic. The functions are assigned to the new clusters by combining the k-means algorithm with the use of functional boxplots to identify functions that have been incorrectly classified because of their atypical local behavior. The DivClusFD method provides the number of clusters, the classification of the observed functions into the clusters and guidelines that may be used for interpreting the clusters. A simulation study using synthetic data and tests of the performance of the DivClusFD method on real data sets indicate that this method is able to classify functions accurately. PubDate: 2018-09-01 DOI: 10.1007/s11634-017-0290-1 Issue No:Vol. 12, No. 3 (2018)

Authors:Irene Epifanio; María Victoria Ibáñez; Amelia Simó Pages: 705 - 735 Abstract: Archetype and archetypoid analysis are extended to shapes. The objective is to find representative shapes. Archetypal shapes are pure (extreme) shapes. We focus on the case where the shape of an object is represented by a configuration matrix of landmarks. As shape space is not a vectorial space, we work in the tangent space, the linearized space about the mean shape. Then, each observation is approximated by a convex combination of actual observations (archetypoids) or archetypes, which are a convex combination of observations in the data set. These tools can contribute to the understanding of shapes, as in the usual multivariate case, since they lie somewhere between clustering and matrix factorization methods. A new simplex visualization tool is also proposed to provide a picture of the archetypal analysis results. We also propose new algorithms for performing archetypal analysis with missing data and its extension to incomplete shapes. A well-known data set is used to illustrate the methodologies developed. The proposed methodology is applied to an apparel design problem in children. PubDate: 2018-09-01 DOI: 10.1007/s11634-017-0297-7 Issue No:Vol. 12, No. 3 (2018)

Authors:Gerhard Tutz; Moritz Berger Pages: 737 - 758 Abstract: Generalized linear and additive models are very efficient regression tools but many parameters have to be estimated if categorical predictors with many categories are included. The method proposed here focuses on the main effects of categorical predictors by using tree-type methods to obtain clusters of categories. When the predictor has many categories one wants to know in particular which of the categories have to be distinguished with respect to their effect on the response. The tree-structured approach makes it possible to detect clusters of categories that share the same effect while letting other predictors, in particular metric predictors, have a linear or additive effect on the response. An algorithm for the fitting is proposed and various stopping criteria are evaluated. The preferred stopping criterion is based on p values representing a conditional inference procedure. In addition, the stability of clusters and the relevance of predictors are investigated by bootstrap methods. Several applications show the usefulness of the tree-structured approach and small simulation studies demonstrate that the fitting procedure works well. PubDate: 2018-09-01 DOI: 10.1007/s11634-017-0298-6 Issue No:Vol. 12, No. 3 (2018)

Authors:A. Pedro Duarte Silva; Peter Filzmoser; Paula Brito Pages: 785 - 822 Abstract: A multivariate outlier detection method for interval data is proposed that makes use of a parametric approach to model the interval data. The trimmed maximum likelihood principle is adapted in order to robustly estimate the model parameters. A simulation study demonstrates the usefulness of the robust estimates for outlier detection, and new diagnostic plots allow gaining deeper insight into the structure of real world interval data. PubDate: 2018-09-01 DOI: 10.1007/s11634-017-0305-y Issue No:Vol. 12, No. 3 (2018)

Authors:Claudio Conversano; Massimo Cannas; Francesco Mola; Emiliano Sironi Abstract: A novel criterion for estimating a latent partition of the observed groups based on the output of a hierarchical model is presented. It is based on a loss function combining the Gini income inequality ratio and the predictability index of Goodman and Kruskal in order to achieve maximum heterogeneity of random effects across groups and maximum homogeneity of predicted probabilities inside estimated clusters. The index is compared with alternative approaches in a simulation study and applied in a case study concerning the role of hospital level variables in deciding for a cesarean section. PubDate: 2018-10-12 DOI: 10.1007/s11634-018-0347-9

Authors:William Cipolli; Timothy Hanson Abstract: We propose a generative classification model that extends Quadratic Discriminant Analysis (QDA) (Cox in J R Stat Soc Ser B (Methodol) 20:215–242, 1958) and Linear Discriminant Analysis (LDA) (Fisher in Ann Eugen 7:179–188, 1936; Rao in J R Stat Soc Ser B 10:159–203, 1948) to the Bayesian nonparametric setting, providing a competitor to MclustDA (Fraley and Raftery in Am Stat Assoc 97:611–631, 2002). This approach models the data distribution for each class using a multivariate Polya tree and realizes impressive results in simulations and real data analyses. The flexibility gained from further relaxing the distributional assumptions of QDA can greatly improve the ability to correctly classify new observations for models with severe deviations from parametric distributional assumptions, while still performing well when the assumptions hold. The proposed method is quite fast compared to other supervised classifiers and very simple to implement, as there are no kernel tricks or initialization steps, perhaps making it one of the more user-friendly approaches to supervised learning. This highlights a significant feature of the proposed methodology, as suboptimal tuning can greatly hamper classification performance; e.g., SVMs fit with non-optimal kernels perform significantly worse. PubDate: 2018-10-12 DOI: 10.1007/s11634-018-0344-z

Authors:Abby Flynt; Nema Dean; Rebecca Nugent Abstract: Agreement indices are commonly used to summarize the performance of both classification and clustering methods. The easy interpretation/intuition and desirable properties that result from the Rand and adjusted Rand indices have led to their popularity over other available indices. While more algorithmic clustering approaches like k-means and hierarchical clustering produce hard partition assignments (assigning observations to a single cluster), other techniques like model-based clustering include information about the certainty of allocation of objects through class membership probabilities (soft partitions). To assess performance using traditional indices, e.g., the adjusted Rand index (ARI), the soft partition is mapped to a hard set of assignments, which commonly overstates the certainty of correct assignments. This paper proposes an extension of the ARI, the soft adjusted Rand index (sARI), with similar intuition and interpretation but also incorporating information from one or two soft partitions. It can be used in conjunction with the ARI, comparing the similarities of hard to soft, or soft to soft partitions to the similarities of the mapped hard partitions. Simulation study results support the intuition that in general, mapping to hard partitions tends to increase the measure of similarity between partitions. In applications, the sARI more accurately reflects the cluster boundary overlap commonly seen in real data. PubDate: 2018-10-09 DOI: 10.1007/s11634-018-0346-x
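
To make the baseline in the abstract above concrete, here is a minimal Python sketch of the classical adjusted Rand index for two hard partitions, together with the usual mapping of a soft partition to hard labels by taking the most probable cluster. This is the standard ARI, not the sARI proposed in the paper; the function names `harden` and `adjusted_rand_index` and the example probabilities are hypothetical.

```python
from collections import Counter
from math import comb


def harden(membership_probs):
    """Map a soft partition (one probability row per object) to hard labels via argmax."""
    return [max(range(len(row)), key=row.__getitem__) for row in membership_probs]


def adjusted_rand_index(labels_a, labels_b):
    """Classical adjusted Rand index between two hard partitions of the same objects."""
    if len(labels_a) != len(labels_b):
        raise ValueError("partitions must label the same objects")
    n = len(labels_a)
    if n < 2:
        return 1.0  # no pairs to compare; treat as trivial agreement
    # Pair counts within contingency-table cells and within each marginal.
    sum_cells = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    sum_rows = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_cols = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = sum_rows * sum_cols / comb(n, 2)
    max_index = (sum_rows + sum_cols) / 2
    if max_index == expected:
        return 1.0  # degenerate case, e.g. both partitions put everything in one cluster
    return (sum_cells - expected) / (max_index - expected)


# Hypothetical soft partition: the third object is almost a coin flip,
# but hardening discards that uncertainty.
soft = [[0.9, 0.1], [0.8, 0.2], [0.51, 0.49], [0.1, 0.9]]
hard = harden(soft)
```

The hardened labels carry no trace of how uncertain the third assignment was, which is the loss of information the abstract describes.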

Authors:Heidi Seibold; Torsten Hothorn; Achim Zeileis Abstract: Model-based trees are used to find subgroups in data which differ with respect to model parameters. In some applications it is natural to keep some parameters fixed globally for all observations while asking if and how other parameters vary across subgroups. Existing implementations of model-based trees can only deal with the scenario where all parameters depend on the subgroups. We propose partially additive linear model trees (PALM trees) as an extension of (generalised) linear model trees (LM and GLM trees, respectively), in which the model parameters are specified a priori to be estimated either globally from all observations or locally from the observations within the subgroups determined by the tree. Simulations show that the method has high power for detecting subgroups in the presence of global effects and reliably recovers the true parameters. Furthermore, treatment–subgroup differences are detected in an empirical application of the method to data from a mathematics exam: the PALM tree is able to detect a small subgroup of students that had a disadvantage in an exam with two versions while adjusting for overall ability effects. PubDate: 2018-10-05 DOI: 10.1007/s11634-018-0342-1

Authors:Moritz Berger; Thomas Welchowski; Steffen Schmitz-Valckenberg; Matthias Schmid Abstract: Cause-specific hazard models are a popular tool for the analysis of competing risks data. The classical modeling approach in discrete time consists of fitting parametric multinomial logit models. A drawback of this method is that the focus is on main effects only, and that higher order interactions are hard to handle. Moreover, the resulting models contain a large number of parameters, which may cause numerical problems when estimating coefficients. To overcome these problems, a tree-based model is proposed that extends the survival tree methodology developed previously for time-to-event models with one single type of event. The performance of the method, compared with several competitors, is investigated in simulations. The usefulness of the proposed approach is demonstrated by an analysis of age-related macular degeneration among elderly people who were monitored by annual study visits. PubDate: 2018-09-28 DOI: 10.1007/s11634-018-0345-y

Authors:Alban Mbina Mbina; Guy Martial Nkiet; Fulgence Eyi Obiang Abstract: We propose a method for variable selection in discriminant analysis with mixed continuous and binary variables. This method is based on a criterion that makes it possible to reduce the variable selection problem to a problem of estimating a suitable permutation and dimensionality. Then, estimators for these parameters are proposed and the resulting method for selecting variables is shown to be consistent. A simulation study that examines several properties of the proposed approach and compares it with an existing method is given, and an example on a real data set is provided. PubDate: 2018-09-21 DOI: 10.1007/s11634-018-0343-0

Abstract: The Gaussian process is a common model in a wide variety of applications, such as environmental modeling, computer experiments, and geology. Two major challenges often arise: First, assuming that the process of interest is stationary over the entire domain often proves to be untenable. Second, the traditional Gaussian process model formulation is computationally inefficient for large datasets. In this paper, we propose a new Gaussian process model to tackle these problems based on the convolution of a smoothing kernel with a partitioned latent process. Nonstationarity can be modeled by allowing a separate latent process for each partition, which approximates a regional clustering structure. Partitioning follows a binary tree generating process similar to that of Classification and Regression Trees. A Bayesian approach is used to estimate the partitioning structure and model parameters simultaneously. Our motivating dataset consists of 11918 precipitation anomalies. Results show that our model has promising prediction performance and is computationally efficient for large datasets. PubDate: 2018-09-15 DOI: 10.1007/s11634-018-0341-2

Authors:Nicola Loperfido Abstract: Finite mixtures of multivariate distributions play a fundamental role in model-based clustering. However, they pose several problems, especially in the presence of many irrelevant variables. Dimension reduction methods, such as projection pursuit, are commonly used to address these problems. In this paper, we use skewness-maximizing projections to recover the subspace which optimally separates the cluster means. Skewness might then be removed in order to search for other potentially interesting data structures or to perform skewness-sensitive statistical analyses, such as Hotelling’s \(T^{2}\) test. Our approach is algebraic in nature and deals with the symmetric tensor rank of the third multivariate cumulant. We also derive closed-form expressions for the symmetric tensor rank of the third cumulants of several multivariate mixture models, including mixtures of skew-normal distributions and mixtures of two symmetric components with proportional covariance matrices. Theoretical results in this paper shed some light on the connection between the estimated number of mixture components and their skewness. PubDate: 2018-09-06 DOI: 10.1007/s11634-018-0336-z