Authors: María Teresa Gallegos; Gunter Ritter
Pages: 179–202
Abstract: The present paper proposes a new strategy for probabilistic (often called model-based) clustering. It is well known that local maxima of mixture likelihoods can be used to partition an underlying data set. However, local maxima are rarely unique. Therefore, it remains to select the reasonable solutions, and in particular the desired one. Credible partitions are usually recognized by separation (and cohesion) of their clusters. We use here the p values provided by the classical tests of Wilks, Hotelling, and Behrens–Fisher to single out those solutions that are well separated by location. It has been shown that reasonable solutions to a clustering problem are related to Pareto points in a plot of scale balance vs. model fit of all local maxima. We briefly review this theory and propose as solutions all well-fitting Pareto points in the set of local maxima separated by location in the above sense. We also design a new iterative, parameter-free cutting plane algorithm for the multivariate Behrens–Fisher problem.
PubDate: 2018-06-01
DOI: 10.1007/s11634-016-0278-2
Issue No: Vol. 12, No. 2 (2018)
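As a rough illustration of the separation-by-location idea, here is a minimal sketch of one of the classical tests the abstract names: the two-sample Hotelling T² test, which assumes equal covariances. (The Behrens–Fisher variant, which drops that assumption and is the target of the paper's cutting-plane algorithm, is not sketched here.) The synthetic data and function name are illustrative assumptions, not from the paper.

```python
import numpy as np
from scipy import stats

def hotelling_t2_pvalue(x, y):
    """p value of the two-sample Hotelling T^2 test (equal covariances)."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    n1, n2, p = len(x), len(y), x.shape[1]
    diff = x.mean(axis=0) - y.mean(axis=0)
    # pooled covariance estimate
    s = ((n1 - 1) * np.cov(x, rowvar=False) +
         (n2 - 1) * np.cov(y, rowvar=False)) / (n1 + n2 - 2)
    t2 = (n1 * n2) / (n1 + n2) * diff @ np.linalg.solve(s, diff)
    # T^2 has an exact F distribution under normality
    f = (n1 + n2 - p - 1) / ((n1 + n2 - 2) * p) * t2
    return stats.f.sf(f, p, n1 + n2 - p - 1)

rng = np.random.default_rng(0)
a = rng.normal(0, 1, size=(50, 2))
b = rng.normal(3, 1, size=(50, 2))          # clearly separated in location
p_sep = hotelling_t2_pvalue(a, b)           # tiny: clusters well separated
p_same = hotelling_t2_pvalue(a, rng.normal(0, 1, size=(50, 2)))
```

A small p value flags a pair of clusters as genuinely separated by location, which is how such tests can screen candidate partitions.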

Authors: Luis Angel García-Escudero; Alfonso Gordaliza; Francesca Greselin; Salvatore Ingrassia; Agustín Mayo-Iscar
Pages: 203–233
Abstract: This paper reviews the use of eigenvalue restrictions for constrained parameter estimation in mixtures of elliptical distributions under the likelihood approach. The restrictions serve a twofold purpose: to avoid convergence to degenerate solutions and to reduce the onset of uninteresting (spurious) local maximizers related to complex likelihood surfaces. The paper shows how the constraints may play a key role in the theory of Euclidean data clustering. The aim here is to provide a reasoned survey of the constraints and their applications, considering the contributions of many authors and spanning the literature of the last 30 years.
PubDate: 2018-06-01
DOI: 10.1007/s11634-017-0293-y
Issue No: Vol. 12, No. 2 (2018)
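The core restriction surveyed in this line of work can be sketched in a few lines: truncate the eigenvalues of a scatter matrix so that the ratio of the largest to the smallest is at most a constant c. The truncation threshold `m` below is a simple illustrative choice, not the optimally tuned one used inside constrained EM algorithms.

```python
import numpy as np

def constrain_eigenvalues(sigma, c):
    """Truncate eigenvalues of a covariance matrix into [m, c*m]."""
    vals, vecs = np.linalg.eigh(sigma)
    m = vals.max() / c                     # illustrative threshold choice
    vals = np.clip(vals, m, c * m)
    return vecs @ np.diag(vals) @ vecs.T

# A nearly singular scatter matrix (eigenvalues ~10 and ~1e-8) would drive
# a mixture likelihood toward a degenerate solution; the constraint repairs it.
sigma = np.array([[10.0, 0.0], [0.0, 1e-8]])
fixed = constrain_eigenvalues(sigma, c=100.0)
ratio = np.linalg.eigvalsh(fixed).max() / np.linalg.eigvalsh(fixed).min()
```

After the truncation the eigenvalue ratio is exactly c, so no component can collapse onto a lower-dimensional subspace.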

Authors: Roberto Rocci; Stefano Antonio Gattone; Roberto Di Mari
Pages: 235–260
Abstract: Maximum likelihood estimation of Gaussian mixture models with different class-specific covariance matrices is known to be problematic. This is due to the unboundedness of the likelihood, together with the presence of spurious maximizers. Existing methods to bypass this obstacle are based on the fact that unboundedness is avoided if the eigenvalues of the covariance matrices are bounded away from zero. This can be done by imposing constraints on the covariance matrices, i.e. by incorporating a priori information on the covariance structure of the mixture components. The present work introduces a constrained approach, where the class-conditional covariance matrices are shrunk towards a pre-specified target matrix \(\varvec{\varPsi}\). Data-driven choices of the matrix \(\varvec{\varPsi}\), when a priori information is not available, and the optimal amount of shrinkage are investigated. Then, constraints based on a data-driven \(\varvec{\varPsi}\) are shown to be equivariant with respect to linear affine transformations, provided that the method used to select the target matrix is also equivariant. The effectiveness of the proposal is evaluated on the basis of a simulation study and an empirical example.
PubDate: 2018-06-01
DOI: 10.1007/s11634-016-0279-1
Issue No: Vol. 12, No. 2 (2018)
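The shrinkage idea in this abstract can be sketched directly: each class-conditional covariance estimate is replaced by a convex combination with a target matrix Psi, which keeps its eigenvalues away from zero and so bounds the likelihood. The shrinkage weight `lam` and the scaled-identity target below are illustrative assumptions; the paper studies data-driven selection of both.

```python
import numpy as np

def shrink_covariance(s_k, psi, lam):
    """Convex combination of a sample covariance and a target matrix."""
    return (1.0 - lam) * s_k + lam * psi

rng = np.random.default_rng(1)
x = rng.normal(size=(3, 3))              # tiny class: singular sample covariance
s = np.cov(x, rowvar=False)              # rank <= 2, so an eigenvalue is ~0
psi = np.eye(3) * np.trace(s) / 3        # scaled-identity target (illustrative)
s_shrunk = shrink_covariance(s, psi, lam=0.3)
min_eig = np.linalg.eigvalsh(s_shrunk).min()
```

Because Psi is positive definite, the shrunk matrix is too, even when the raw class covariance is singular.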

Authors: Šárka Brodinová; Maia Zaharieva; Peter Filzmoser; Thomas Ortner; Christian Breiteneder
Pages: 261–284
Abstract: Media content in large repositories usually exhibits multiple groups of strongly varying sizes. Media of potential interest often form notably smaller groups. Such media groups differ so much from the remaining data that it may be worthwhile to look at them in more detail. In contrast, media with popular content appear in larger groups. Identifying groups of varying sizes is addressed by clustering of imbalanced data. Clustering highly imbalanced media groups is additionally challenged by the high dimensionality of the underlying features. In this paper, we present the imbalanced clustering (IClust) algorithm designed to reveal group structures in high-dimensional media data. IClust employs an existing clustering method in order to find an initial set of a large number of potentially highly pure clusters, which are then successively merged. The main advantage of IClust is that the number of clusters does not have to be pre-specified and that no specific assumptions about the cluster or data characteristics need to be made. Experiments on real-world media data demonstrate that in comparison to existing methods, IClust is able to better identify media groups, especially groups of small sizes.
PubDate: 2018-06-01
DOI: 10.1007/s11634-017-0292-z
Issue No: Vol. 12, No. 2 (2018)
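The overcluster-then-merge idea can be sketched generically (this is a stand-in assembled from scikit-learn parts, not the authors' IClust code, and the choice of KMeans plus a hierarchical merge of centroids is an assumption for illustration):

```python
import numpy as np
from sklearn.cluster import KMeans, AgglomerativeClustering

rng = np.random.default_rng(2)
# imbalanced data: one large group and one much smaller, distant group
big = rng.normal(0, 1, size=(300, 5))
small = rng.normal(8, 1, size=(15, 5))
x = np.vstack([big, small])

# stage 1: deliberately overcluster into many small candidate clusters
km = KMeans(n_clusters=20, n_init=10, random_state=0).fit(x)

# stage 2: merge the candidate centroids hierarchically into final groups
merge = AgglomerativeClustering(n_clusters=2).fit(km.cluster_centers_)
labels = merge.labels_[km.labels_]        # propagate merged label to points
```

The small group survives as its own cluster because its candidate centroids are far from all the others, which is exactly what a single k-means run on imbalanced data tends to miss.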

Authors: Stéphanie Bougeard; Hervé Abdi; Gilbert Saporta; Ndèye Niang
Pages: 285–313
Abstract: Multiblock component methods are applied to data sets in which several blocks of variables are measured on the same set of observations, with the goal of analyzing the relationships between these blocks of variables. In this article, we focus on multiblock component methods that integrate the information found in several blocks of explanatory variables in order to describe and explain one set of dependent variables. In the following, multiblock PLS and multiblock redundancy analysis are chosen as particular cases of multiblock component methods in which one set of variables is explained by a set of predictor variables organized into blocks. Because these multiblock techniques assume that the observations come from a homogeneous population, they provide suboptimal results when the observations actually come from different populations. A strategy to palliate this problem, presented in this article, is to use a technique such as clusterwise regression to identify homogeneous clusters of observations. This approach creates two new methods that provide clusters with their own sets of regression coefficients. This combination of clustering and regression improves the overall quality of the prediction and facilitates the interpretation. In addition, the minimization of a well-defined criterion, by means of a sequential algorithm, ensures that the algorithm converges monotonically. Finally, the proposed method is distribution-free and can be used when the explanatory variables outnumber the observations within clusters. The proposed clusterwise multiblock methods are illustrated with a simulation study and a (simulated) example from marketing.
PubDate: 2018-06-01
DOI: 10.1007/s11634-017-0296-8
Issue No: Vol. 12, No. 2 (2018)
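The clusterwise mechanism, reduced to plain least squares for illustration (the paper combines it with multiblock PLS and redundancy analysis, not OLS), alternates between assigning each observation to the regression that predicts it best and refitting each cluster's own coefficients. The total squared residual can only decrease at each step, which is the monotone-convergence argument the abstract alludes to.

```python
import numpy as np

def clusterwise_ols(x, y, k=2, n_iter=20, seed=0):
    """Alternate cluster reassignment and per-cluster least-squares fits."""
    rng = np.random.default_rng(seed)
    labels = rng.integers(k, size=len(y))
    betas = [np.zeros(x.shape[1]) for _ in range(k)]
    history = []
    for _ in range(n_iter):
        for j in range(k):
            mask = labels == j
            if mask.sum() >= x.shape[1]:         # skip clusters that emptied
                betas[j] = np.linalg.lstsq(x[mask], y[mask], rcond=None)[0]
        resid = np.stack([(y - x @ b) ** 2 for b in betas])  # k x n residuals
        labels = resid.argmin(axis=0)                        # best-fit cluster
        history.append(resid.min(axis=0).sum())  # criterion after both steps
    return labels, betas, history

rng = np.random.default_rng(3)
x = np.column_stack([np.ones(200), rng.uniform(-1, 1, 200)])
group = np.repeat([0, 1], 100)                   # two latent populations
y = np.where(group == 0, 2.0, -2.0) * x[:, 1] + rng.normal(0, 0.1, 200)
labels, betas, history = clusterwise_ols(x, y)
```

Each recorded criterion value is no larger than the previous one, mirroring the sequential algorithm's monotone convergence.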

Authors: Kenichi Hayashi
Pages: 315–339
Abstract: It has been reported that using unlabeled data together with labeled data to construct a discriminant function works successfully in practice. However, theoretical studies have implied that unlabeled data can sometimes adversely affect the performance of discriminant functions. Therefore, it is important to know what situations call for the use of unlabeled data. In this paper, asymptotic relative efficiency is presented as the measure for comparing analyses with and without unlabeled data under the heteroscedastic normality assumption. The linear discriminant function maximizing the area under the receiver operating characteristic curve is considered. Asymptotic relative efficiency is evaluated to investigate when and how unlabeled data contribute to improving discriminant performance under several conditions. The results show that asymptotic relative efficiency depends mainly on the heteroscedasticity of the covariance matrices and the stochastic structure of observing the labels of the cases.
PubDate: 2018-06-01
DOI: 10.1007/s11634-016-0266-6
Issue No: Vol. 12, No. 2 (2018)

Authors: Karim Abou-Moustafa; Frank P. Ferrie
Pages: 341–363
Abstract: Finding the set of nearest neighbors for a query point of interest appears in a variety of algorithms for machine learning and pattern recognition. Examples include k nearest neighbor classification, information retrieval, case-based reasoning, manifold learning, and nonlinear dimensionality reduction. In this work, we propose a new approach for determining a distance metric from the data for finding such neighboring points. For a query point of interest, our approach learns a generalized quadratic distance (GQD) metric based on the statistical properties in a “small” neighborhood for the point of interest. The locally learned GQD metric captures information such as the density, curvature, and the intrinsic dimensionality for the points falling in this particular neighborhood. Unfortunately, learning the GQD parameters under such a local learning mechanism is a challenging problem with a high computational overhead. To address these challenges, we estimate the GQD parameters using the minimum volume covering ellipsoid (MVCE) for a set of points. The advantage of the MVCE is two-fold. First, the MVCE together with the local learning approach approximate the functionality of a well known robust estimator for covariance matrices. Second, computing the MVCE is a convex optimization problem which, in addition to having a unique global solution, can be efficiently solved using a first order optimization algorithm. We validate our metric learning approach on a large variety of datasets and show that the proposed metric has promising results when compared with five algorithms from the literature for supervised metric learning.
PubDate: 2018-06-01
DOI: 10.1007/s11634-017-0286-x
Issue No: Vol. 12, No. 2 (2018)
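One well-known first-order algorithm for the MVCE is Khachiyan's iterative weight update; the sketch below is that generic algorithm, under the assumption that it is representative of the kind of first-order solver the abstract refers to (the paper's own solver and tolerances may differ).

```python
import numpy as np

def mvce(points, tol=1e-7, max_iter=1000):
    """Khachiyan's algorithm for the minimum volume covering ellipsoid
    (x - c)^T A (x - c) <= 1 of a point set (rows of `points`)."""
    p = np.asarray(points, float).T          # d x n
    d, n = p.shape
    q = np.vstack([p, np.ones(n)])           # lift to homogeneous coordinates
    u = np.full(n, 1.0 / n)                  # point weights, a probability vector
    for _ in range(max_iter):
        x = q @ np.diag(u) @ q.T
        m = np.einsum('in,ij,jn->n', q, np.linalg.inv(x), q)
        j = m.argmax()                       # most-violating point
        step = (m[j] - d - 1) / ((d + 1) * (m[j] - 1))
        new_u = (1 - step) * u
        new_u[j] += step                     # shift weight toward point j
        converged = np.linalg.norm(new_u - u) < tol
        u = new_u
        if converged:
            break
    c = p @ u                                # ellipsoid center
    a = np.linalg.inv(p @ np.diag(u) @ p.T - np.outer(c, c)) / d
    return c, a

rng = np.random.default_rng(4)
pts = rng.normal(size=(40, 2))
c, a = mvce(pts)
vals = np.einsum('ni,ij,nj->n', pts - c, a, pts - c)   # <= ~1 for all points
```

At (approximate) convergence every point satisfies the ellipsoid inequality up to the tolerance, and the support points lie essentially on the boundary.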

Authors: Sonia Barahona; Ximo Gual-Arnau; Maria Victoria Ibáñez; Amelia Simó
Pages: 365–397
Abstract: Object classification according to shape and size is of key importance in many scientific fields. This work focuses on the case where the size and shape of an object is characterized by a current. A current is a mathematical object which has been proved relevant to the modeling of geometrical data, like submanifolds, through integration of vector fields along them. As a consequence of the choice of a vector-valued reproducing kernel Hilbert space (RKHS) as a test space for integrating manifolds, it is possible to consider that shapes are embedded in this Hilbert space. A vector-valued RKHS is a Hilbert space of vector fields; therefore, it is possible to compute a mean of shapes, or to calculate a distance between two manifolds. This embedding enables us to consider size-and-shape clustering algorithms. These algorithms are applied to a 3D database obtained from an anthropometric survey of the Spanish child population, with a potential application to online sales of children’s wear.
PubDate: 2018-06-01
DOI: 10.1007/s11634-017-0283-0
Issue No: Vol. 12, No. 2 (2018)

Authors: Alessandra Guglielmi; Francesca Ieva; Anna Maria Paganoni; Fernando A. Quintana
Pages: 399–423
Abstract: We propose a Bayesian semiparametric regression model to represent mixed-type multiple outcomes concerning patients affected by Acute Myocardial Infarction. Our approach is motivated by data coming from the ST-Elevation Myocardial Infarction (STEMI) Archive, a multi-center observational prospective clinical study planned as part of the Strategic Program of Lombardy, Italy. We specifically consider a joint model for a variable measuring treatment time and for in-hospital and 60-day survival indicators. One of our main motivations is to understand how the various hospitals differ in terms of the variety of information collected as part of the study. To do so we postulate a semiparametric random effects model that incorporates dependence on a location indicator used to explicitly differentiate among hospitals in or outside the city of Milano. The model is based on the two-parameter Poisson-Dirichlet prior, also known as the Pitman-Yor process prior. We discuss the resulting posterior inference, including sensitivity analysis, and a comparison with the particular sub-model arising when a Dirichlet process prior is assumed.
PubDate: 2018-06-01
DOI: 10.1007/s11634-016-0273-7
Issue No: Vol. 12, No. 2 (2018)
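The two-parameter Poisson-Dirichlet (Pitman-Yor) prior induces a random partition that can be simulated with the generalized Chinese restaurant scheme: a new cluster is opened with probability proportional to (theta + d·k), and an existing cluster of size n_j is joined with probability proportional to (n_j - d). The sketch below simulates that partition only; it is not the paper's regression model, and the parameter values are arbitrary illustrations.

```python
import numpy as np

def pitman_yor_crp(n, discount, strength, rng):
    """Sample a partition of n items from the Pitman-Yor Chinese restaurant
    process; discount d in [0, 1), strength theta > -d."""
    counts = []                               # current cluster sizes
    labels = np.empty(n, int)
    for i in range(n):
        k = len(counts)
        # existing cluster j: weight (n_j - d); new cluster: (theta + d*k)
        w = np.array([c - discount for c in counts] + [strength + discount * k])
        w /= w.sum()
        choice = rng.choice(k + 1, p=w)
        if choice == k:
            counts.append(1)                  # open a new cluster
        else:
            counts[choice] += 1
        labels[i] = choice
    return labels, counts

rng = np.random.default_rng(5)
labels, counts = pitman_yor_crp(500, discount=0.3, strength=1.0, rng=rng)
```

Setting the discount to zero recovers the Dirichlet process special case the paper compares against; a positive discount produces heavier-tailed cluster-size distributions.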

Authors: Vahe Avagyan; Andrés M. Alonso; Francisco J. Nogales
Pages: 425–447
Abstract: The accurate estimation of a precision matrix plays a crucial role in the current age of high-dimensional data explosion. To deal with this problem, one of the prominent and commonly used techniques is the \(\ell _1\) norm (Lasso) penalization for a given loss function. This approach guarantees the sparsity of the precision matrix estimate for properly selected penalty parameters. However, the \(\ell _1\) norm penalization often fails to control the bias of the obtained estimator because of its overestimation behavior. In this paper, we introduce two adaptive extensions of the recently proposed \(\ell _1\) norm penalized D-trace loss minimization method. They aim at reducing the bias produced in the estimator. Extensive numerical results, using both simulated and real datasets, show the advantage of our proposed estimators.
PubDate: 2018-06-01
DOI: 10.1007/s11634-016-0272-8
Issue No: Vol. 12, No. 2 (2018)
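The adaptive reweighting idea can be illustrated without the D-trace machinery: an initial precision estimate supplies entry-wise weights w_ij = 1/|omega_ij|^gamma, so entries that look large are penalized less and entries that look like noise are penalized more, reducing the bias of a flat l1 penalty. Entry-wise soft-thresholding below is only a proxy for the authors' penalized loss minimization, and every parameter value is an illustrative assumption.

```python
import numpy as np

def adaptive_soft_threshold(omega_init, lam, gamma=1.0, eps=1e-8):
    """Adaptively weighted soft-thresholding of an initial precision estimate."""
    w = 1.0 / (np.abs(omega_init) + eps) ** gamma      # adaptive weights
    thresh = np.sign(omega_init) * np.maximum(np.abs(omega_init) - lam * w, 0)
    np.fill_diagonal(thresh, np.diag(omega_init))      # never shrink diagonal
    return thresh

# tridiagonal truth: an AR(1)-type sparse precision matrix
truth = np.eye(5) * 2.0
idx = np.arange(4)
truth[idx, idx + 1] = truth[idx + 1, idx] = -0.9
rng = np.random.default_rng(6)
x = rng.multivariate_normal(np.zeros(5), np.linalg.inv(truth), size=4000)
omega_init = np.linalg.inv(np.cov(x, rowvar=False))    # noisy initial estimate
omega_hat = adaptive_soft_threshold(omega_init, lam=0.02)
```

The small off-band entries are killed (their weights are huge) while the true tridiagonal entries survive almost unshrunk, which is precisely the bias reduction the adaptive extensions target.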

Authors: Zakariya Yahya Algamal; Muhammad Hisyam Lee
Abstract: The common issues of high-dimensional gene expression data are that many of the genes may not be relevant and that there exist high correlations among genes. Gene selection has been proven to be an effective way to improve the results of many classification methods. Sparse logistic regression using the least absolute shrinkage and selection operator (lasso) or the smoothly clipped absolute deviation penalty is one of the most widely applicable methods for gene selection in cancer classification. However, this method faces a critical challenge in practical applications when there are high correlations among genes. To address this problem, a two-stage sparse logistic regression is proposed, with the aim of obtaining an efficient subset of genes with high classification capability by combining a screening approach as a filter method and the adaptive lasso with a new weight as an embedded method. In the first stage, the sure independence screening method retains those genes with high individual correlation with the cancer class label. In the second stage, the adaptive lasso with the new weight is implemented to address the high correlations among the genes screened in the first stage. Experimental results based on four publicly available gene expression datasets show that the proposed method significantly outperforms three state-of-the-art methods in terms of classification accuracy, G-mean, area under the curve, and stability. In addition, the results demonstrate that the top selected genes are biologically related to the cancer type. Thus, the proposed method can be useful for cancer classification using DNA gene expression data in real clinical practice.
PubDate: 2018-08-07
DOI: 10.1007/s11634-018-0334-1
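A generic two-stage pipeline of this shape (marginal screening, then adaptive lasso) can be assembled from scikit-learn parts; this is a sketch on synthetic data standing in for gene expression, not the authors' method, and in particular the ridge-based weights and all tuning constants are assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(7)
n, p, k = 120, 500, 5                       # few samples, many "genes"
x = rng.normal(size=(n, p))
beta = np.zeros(p)
beta[:k] = 2.0                              # only the first 5 genes matter
y = (x @ beta + rng.logistic(size=n) > 0).astype(int)

# stage 1: sure-independence-style screening by marginal correlation with y
corr = np.abs(np.corrcoef(x.T, y)[-1, :-1])
keep = np.argsort(corr)[-50:]               # retain the top 50 genes

# stage 2: adaptive lasso via feature rescaling: an initial ridge fit gives
# weights, and an l1 fit on x_j * |w_j| penalizes strong genes less
init = LogisticRegression(penalty='l2', C=1.0, max_iter=1000).fit(x[:, keep], y)
w = np.abs(init.coef_.ravel()) + 1e-4
lasso = LogisticRegression(penalty='l1', C=0.2, solver='liblinear',
                           max_iter=1000).fit(x[:, keep] * w, y)
selected = keep[lasso.coef_.ravel() != 0]
```

The screening stage makes the second-stage problem small enough for the adaptive penalty to behave well, which is the division of labor described in the abstract.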

Authors: Sandra Benítez-Peña; Rafael Blanquero; Emilio Carrizosa; Pepa Ramírez-Cobo
Abstract: The support vector machine (SVM) is a powerful tool in binary classification, known to attain excellent misclassification rates. On the other hand, many real-world classification problems, such as those found in medical diagnosis, churn or fraud prediction, involve misclassification costs which may differ between the classes. However, it may be hard for the user to provide precise values for such misclassification costs, whereas it may be much easier to identify acceptable misclassification rate values. In this paper we propose a novel SVM model in which misclassification costs are considered by incorporating performance constraints in the problem formulation. Specifically, our aim is to seek the hyperplane with maximal margin yielding misclassification rates below given threshold values. Such a maximal-margin hyperplane is obtained by solving a quadratic convex problem with linear constraints and integer variables. The reported numerical experience shows that our model gives the user control over the misclassification rate in one class (possibly at the expense of an increase in the misclassification rate of the other class) and is feasible in terms of running times.
PubDate: 2018-07-31
DOI: 10.1007/s11634-018-0330-5
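For contrast with the constraint-based formulation, a common practical proxy is to sweep class weights in a standard soft-margin SVM until the estimated misclassification rate of the protected class falls below the desired threshold. This heuristic is not the authors' mixed-integer model, which encodes the rate constraints directly; the data and weight grid are illustrative assumptions.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(8)
n = 400
x = np.vstack([rng.normal(0, 1, (n // 2, 2)), rng.normal(1.5, 1, (n // 2, 2))])
y = np.repeat([0, 1], n // 2)                # class 1 = costly to miss

target_rate = 0.05                           # max tolerated miss rate, class 1
for w in [1, 2, 4, 8, 16, 32]:
    clf = SVC(kernel='linear', C=1.0, class_weight={1: w}).fit(x, y)
    miss1 = np.mean(clf.predict(x[y == 1]) != 1)   # training miss rate, class 1
    if miss1 <= target_rate:
        break
```

Driving one class's rate down necessarily raises the other's, which is the trade-off the abstract notes; the MIO formulation makes that trade-off explicit rather than searched for.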

Authors: Francesca Torti; Domenico Perrotta; Marco Riani; Andrea Cerioli
Abstract: We assess the performance of state-of-the-art robust clustering tools for regression structures under a variety of data configurations. We focus on two methodologies that use trimming and restrictions on group scatters as their main ingredients. We also pay particular attention to the data generation process through the development of a flexible simulation tool for mixtures of regressions, where the user can control the degree of overlap between the groups. The trimming level and restriction factor are input parameters for which appropriate tuning is required. Since we find that incorrect specification of the second-level trimming in the Trimmed CLUSTering REGression model (TCLUST-REG) can deteriorate the performance of the method, we propose an improvement where the second-level trimming is not fixed in advance but is data dependent. We then compare our adaptive version of TCLUST-REG with the Trimmed Cluster Weighted Restricted Model (TCWRM), which provides a powerful extension of the robust clusterwise regression methodology. Our overall conclusion is that the two methods perform comparably, but with notable differences due to the inherent degree of modeling implied by them.
PubDate: 2018-07-30
DOI: 10.1007/s11634-018-0331-4
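The role of trimming can be illustrated on a single regression component (an LTS-style concentration step, not the TCLUST-REG code): refit on the (1 - alpha) fraction of points with the smallest residuals, so gross outliers cannot distort the fit. Here alpha is fixed; the paper's proposal is precisely to choose the second-level trimming from the data instead.

```python
import numpy as np

def trimmed_ols(x, y, alpha=0.1, n_iter=10):
    """Concentration steps: fit, keep the (1-alpha) best-fitting points, refit."""
    keep = np.arange(len(y))
    for _ in range(n_iter):
        beta = np.linalg.lstsq(x[keep], y[keep], rcond=None)[0]
        resid = np.abs(y - x @ beta)
        keep = np.argsort(resid)[:int(len(y) * (1 - alpha))]
    return beta

rng = np.random.default_rng(9)
x = np.column_stack([np.ones(200), rng.uniform(0, 1, 200)])
y = 1.0 + 3.0 * x[:, 1] + rng.normal(0, 0.1, 200)
y[:15] += 10.0                              # 15 gross outliers
beta_ols = np.linalg.lstsq(x, y, rcond=None)[0]   # contaminated fit
beta_trim = trimmed_ols(x, y, alpha=0.1)          # robust fit
```

The trimmed fit recovers the true intercept and slope while ordinary least squares is pulled toward the contamination; setting alpha too low would readmit outliers, which is the mis-specification risk the paper addresses adaptively.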

Authors: Yu-Shan Shih; Kuang-Hsun Liu
Abstract: A regression tree method for analyzing rank data is proposed. A key ingredient of the methodology is to convert ranks into scores by paired comparison. We then apply the GUIDE tree method to the score vectors to identify the preference patterns in the data. This method is exempt from selection bias, and the simulation results show that it performs well in selecting split variables and, in some cases, has better prediction accuracy than the two other investigated methods. Furthermore, it is applicable to complex data which may contain incomplete ranks and missing covariate values. We demonstrate its usefulness in two real data studies.
PubDate: 2018-07-25
DOI: 10.1007/s11634-018-0332-3
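The paired-comparison score conversion admits a compact sketch (the tree building itself uses GUIDE, a separate piece of software): item i's score counts the paired comparisons it wins, i.e. how many items it is ranked above, with incomplete ranks simply contributing fewer comparisons. The exact scoring convention here is an illustrative assumption.

```python
import numpy as np

def rank_to_scores(rank):
    """rank[i] = position of item i (1 = best, np.nan = not ranked).
    Returns the number of paired comparisons each item wins."""
    rank = np.asarray(rank, float)
    score = np.zeros(len(rank))
    for i in range(len(rank)):
        if np.isnan(rank[i]):
            continue                       # unranked items win nothing
        beats = rank[i] < rank             # wins against items ranked below i
        score[i] = np.nansum(beats[~np.isnan(rank)])
    return score

# full ranking of 4 items: item 2 best, item 0 worst
full = rank_to_scores([4, 2, 1, 3])
# incomplete ranking: item 3 unranked, so only 3 items are compared
partial = rank_to_scores([3, 1, 2, float('nan')])
```

The resulting score vectors are ordinary numeric responses, which is what lets a standard regression tree machinery operate on rank data.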

Authors: Volodymyr Melnykov; Xuwen Zhu
Abstract: Studying crime trends and tendencies is an important problem that helps to identify socioeconomic patterns and relationships of crucial significance. Finite mixture models are famous for their flexibility in modeling heterogeneity in data. A novel approach designed to account for skewness in the distributions of matrix observations is proposed and applied to United States crime data collected between 2000 and 2012. The model is then further extended by incorporating explanatory variables. A step-by-step model development demonstrates the differences and improvements associated with every stage of the process. Results obtained by the final model are illustrated and thoroughly discussed, and multiple interesting conclusions are drawn from the developed model and the obtained model-based clustering partition.
PubDate: 2018-06-23
DOI: 10.1007/s11634-018-0326-1

Authors: Toshiki Sato; Yuichi Takano; Takanobu Nakahara
Abstract: This paper is concerned with a store-choice model for investigating consumers’ store-choice behavior based on scanner panel data. Our store-choice model enables us to evaluate the effects of the consumer/product attributes not only on the consumer’s store choice but also on his/her purchase quantity. Moreover, we adopt a mixed-integer optimization (MIO) approach to selecting the best set of explanatory variables with which to construct the store-choice model. We devise two MIO models for hierarchical variable selection in which the hierarchical structure of product categories is used to enhance the reliability and computational efficiency of the variable selection. We assess the effectiveness of our MIO models through computational experiments on actual scanner panel data. These experiments are focused on the consumer’s choice among three types of stores in Japan: convenience stores, drugstores, and (grocery) supermarkets. The computational results demonstrate that our method has several advantages over the common methods for variable selection, namely, the stepwise method and \(L_1\) -regularized regression. Furthermore, our analysis reveals that convenience stores are most strongly chosen for gift cards and garbage disposal permits, drugstores are most strongly chosen for products that are specific to drugstores, and supermarkets are most strongly chosen for health food products by women with families.
PubDate: 2018-06-15
DOI: 10.1007/s11634-018-0327-0
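As a point of reference for the comparison in the abstract, here is a minimal sketch of the \(L_1\)-regularized benchmark (not the authors' MIO formulation, which additionally enforces the product-category hierarchy): the Lasso zeroes out weak explanatory variables in one shot. Data and tuning are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(10)
x = rng.normal(size=(300, 20))
beta = np.zeros(20)
beta[[0, 3, 7]] = [2.0, -1.5, 1.0]          # only three variables matter
y = x @ beta + rng.normal(0, 0.5, 300)

model = Lasso(alpha=0.1).fit(x, y)          # l1 penalty drives coefs to zero
selected = np.flatnonzero(model.coef_)      # surviving explanatory variables
```

Unlike this flat penalty, a hierarchical MIO selection can require, for example, that a sub-category variable enters only if its parent category does.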

Abstract: Statisticians are already aware that any task (exploration, prediction) involving a modeling process is largely dependent on the measurement units for the data, to the extent that it should be impossible to provide a statistical outcome without specifying the couple (unit, model). In this work, this general principle is formalized with a particular focus on model-based clustering and co-clustering in the case of possibly mixed data types (continuous and/or categorical and/or counting features), and this opportunity is used to revisit what the related data units are. Such a formalization allows us to highlight three important points: (i) the couple (unit, model) is not identifiable, so that different unit/model interpretations of the same whole modeling process are always possible; (ii) combining different “classical” units with different “classical” models is an interesting opportunity for a cheap, wide and meaningful expansion of the whole modeling process family designed by the couple (unit, model); (iii) if necessary, this couple, up to the non-identifiability property, can be selected by any traditional model selection criterion. Experiments on real data sets illustrate in detail the practical benefits arising from these three points.
PubDate: 2018-05-25
DOI: 10.1007/s11634-018-0325-2

Authors: Daniel Fernández; Richard Arnold; Shirley Pledger; Ivy Liu; Roy Costilla
Abstract: Many of the methods which deal with clustering in matrices of data are based on mathematical techniques such as distance-based algorithms or matrix decomposition and eigenvalues. In general, it is not possible to use statistical inferences or select the appropriateness of a model via information criteria with these techniques because there is no underlying probability model. This article summarizes some recent model-based methodologies for matrices of binary, count, and ordinal data, which are modelled under a unified statistical framework using finite mixtures to group the rows and/or columns. The model parameter can be constructed from a linear predictor of parameters and covariates through link functions. This likelihood-based one-mode and two-mode fuzzy clustering provides maximum likelihood estimation of parameters and the options of using likelihood information criteria for model comparison. Additionally, a Bayesian approach is presented in which the parameters and the number of clusters are estimated simultaneously from their joint posterior distribution. Visualization tools focused on ordinal data, the fuzziness of the clustering structures, and analogies of various standard plots used in the multivariate analysis are presented. Finally, a set of future extensions is enumerated.
PubDate: 2018-05-15
DOI: 10.1007/s11634-018-0324-3

Authors: Aghiles Salah; Mohamed Nadif
Abstract: Co-clustering addresses the problem of simultaneous clustering of both dimensions of a data matrix. When dealing with high dimensional sparse data, co-clustering turns out to be more beneficial than one-sided clustering even if one is interested in clustering along one dimension only. Aside from being high dimensional and sparse, some datasets, such as document-term matrices, exhibit directional characteristics, and the \(L_2\) normalization of such data, so that it lies on the surface of a unit hypersphere, is useful. Popular co-clustering assumptions such as Gaussian or Multinomial are inadequate for this type of data. In this paper, we extend the scope of co-clustering to directional data. We present Diagonal Block Mixture of Von Mises–Fisher distributions (dbmovMFs), a co-clustering model which is well suited for directional data lying on a unit hypersphere. By setting the estimate of the model parameters under the maximum likelihood (ML) and classification ML approaches, we develop a class of EM algorithms for estimating dbmovMFs from data. Extensive experiments, on several real-world datasets, confirm the advantage of our approach and demonstrate the effectiveness of our algorithms.
PubDate: 2018-04-30
DOI: 10.1007/s11634-018-0323-4
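The preprocessing step and the directional flavor of the assignment can be sketched with a k-means-style hard clustering on the sphere (a stand-in for the dbmovMF EM algorithms of the paper, which also cluster the columns): rows are \(L_2\)-normalized onto the unit hypersphere, and cosine similarity to unit-norm centroids drives the row assignment. The synthetic "documents" and the crude initialization are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(11)
# two groups of "documents" using mostly disjoint term blocks
a = np.hstack([rng.poisson(5, (30, 10)), rng.poisson(0.2, (30, 10))])
b = np.hstack([rng.poisson(0.2, (30, 10)), rng.poisson(5, (30, 10))])
x = np.vstack([a, b]).astype(float)
x /= np.linalg.norm(x, axis=1, keepdims=True)      # project onto the sphere

centroids = x[[0, 59]].copy()                      # crude init: one per group
for _ in range(10):
    sim = x @ centroids.T                          # cosine similarities
    labels = sim.argmax(axis=1)                    # hard spherical assignment
    for k in range(2):
        m = x[labels == k].mean(axis=0)
        centroids[k] = m / np.linalg.norm(m)       # renormalize mean direction
```

Replacing the hard cosine assignment with von Mises–Fisher responsibilities, and adding a column partition, leads toward the block-mixture model the abstract describes.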