Authors:Mia Hubert; Peter Rousseeuw; Pieter Segaert Pages: 445 - 466 Abstract: We construct classifiers for multivariate and functional data. Our approach is based on a kind of distance between data points and classes. The distance measure needs to be robust to outliers and invariant to linear transformations of the data. For this purpose we can use the bagdistance which is based on halfspace depth. It satisfies most of the properties of a norm but is able to reflect asymmetry when the class is skewed. Alternatively we can compute a measure of outlyingness based on the skew-adjusted projection depth. In either case we propose the DistSpace transform which maps each data point to the vector of its distances to all classes, followed by k-nearest neighbor (kNN) classification of the transformed data points. This combines invariance and robustness with the simplicity and wide applicability of kNN. The proposal is compared with other methods in experiments with real and simulated data. PubDate: 2017-09-01 DOI: 10.1007/s11634-016-0269-3 Issue No:Vol. 11, No. 3 (2017)
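
The distance-space idea in this abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the Mahalanobis distance is swapped in for the bagdistance and the skew-adjusted outlyingness (it is affine invariant but neither robust nor sensitive to skewness), and the function names `class_distances` and `distspace_knn` are hypothetical.

```python
import numpy as np

def class_distances(X, class_samples):
    """Map each row of X to its vector of distances to every class.

    Assumption: Mahalanobis distance stands in for the bagdistance or
    skew-adjusted outlyingness named in the abstract.
    """
    columns = []
    for S in class_samples:
        mean = S.mean(axis=0)
        cov_inv = np.linalg.inv(np.cov(S, rowvar=False))
        diff = X - mean
        # Quadratic form diff_i^T cov_inv diff_i for every row i.
        columns.append(np.sqrt(np.einsum("ij,jk,ik->i", diff, cov_inv, diff)))
    return np.column_stack(columns)

def distspace_knn(X_train, y_train, X_new, k=5):
    """Classify X_new by majority vote among k nearest neighbors in distance space."""
    labels = np.unique(y_train)
    class_samples = [X_train[y_train == c] for c in labels]
    D_train = class_distances(X_train, class_samples)
    D_new = class_distances(X_new, class_samples)
    preds = []
    for d in D_new:
        nearest = np.argsort(np.linalg.norm(D_train - d, axis=1))[:k]
        values, counts = np.unique(y_train[nearest], return_counts=True)
        preds.append(values[np.argmax(counts)])
    return np.array(preds)

# Two well-separated Gaussian classes as toy data.
rng = np.random.default_rng(0)
X_train = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(6, 1, (50, 2))])
y_train = np.array([0] * 50 + [1] * 50)
X_new = np.array([[0.0, 0.0], [6.0, 6.0]])
pred = distspace_knn(X_train, y_train, X_new, k=5)
```

Each point is represented by as many coordinates as there are classes, so the kNN step runs in a low-dimensional space regardless of the dimension of the original data.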

Authors:Christina Yassouridis; Friedrich Leisch Pages: 467 - 492 Abstract: Theoretical knowledge of clustering functions is still scarce and only a few models are available in the form of applicable code. In the literature, most methods are based on the projection of the functions onto a basis and building fixed or random effects models of the basis coefficients. They involve various parameters, among them the number of basis functions, the projection dimension, the number of iterations, etc. They usually work well on the data presented in the articles, but their performance has in most cases not been tested objectively on other data sets, nor against each other. The purpose of this paper is to give an overview of several existing methods to cluster functional data. An outline of their theoretical concepts is given and the meaning of their hyperparameters is explained. A simulation study was set up to analyze the parameters’ efficiency and sensitivity on different types of data sets that were registered on regular and on irregular grids. For each method, a linear model of the clustering results was evaluated with different parameter levels as predictors. Later, the methods’ performances were compared to each other with the help of a visualization tool, to identify which method works best on a specific kind of data. PubDate: 2017-09-01 DOI: 10.1007/s11634-016-0261-y Issue No:Vol. 11, No. 3 (2017)

Authors:Marek Śmieja; Magdalena Wiercioch Pages: 493 - 518 Abstract: In this contribution we present a novel constrained clustering method, Constrained clustering with a complex cluster structure (C4s), which incorporates equivalence constraints, both positive and negative, as the background information. C4s is capable of discovering groups of arbitrary structure, e.g. with a multi-modal distribution, since at the initial stage the equivalence classes of elements generated by the positive constraints are split into smaller parts. This provides a detailed description of elements that are in a positive equivalence relation. In order to enable automatic detection of the number of groups, cross-entropy clustering is applied for each partitioning process. Experiments show that the proposed method achieves significantly better results than previous constrained clustering approaches. The advantage of our algorithm increases when we focus on finding partitions with a complex structure of clusters. PubDate: 2017-09-01 DOI: 10.1007/s11634-016-0254-x Issue No:Vol. 11, No. 3 (2017)

Authors:Zahid A. Ansari; Syed Abdul Sattar; A. Vinaya Babu Pages: 519 - 546 Abstract: Clustering data from web user sessions is extensively applied to extract customer usage behavior to serve customized content to individual users. Due to the human involvement, web usage data usually contain noisy, incomplete and vague information. Neural networks have the capability to extract embedded knowledge in the form of user session clusters from the huge web usage data. Moreover, they provide tolerance against imperfect and noisy data. Fuzzy sets are another popular tool utilized for handling uncertainty and vagueness hidden in the data. In this paper a fuzzy neural clustering network (FNCN) based framework is proposed that makes use of the fuzzy membership concept of fuzzy c-means (FCM) clustering and the learning rate of a modified self-organizing map (MSOM) neural network model and tries to minimize the weighted sum of the squared error. FNCN is applied to cluster the users’ web access data extracted from the web logs of an educational institution’s proxy web server. The performance of FNCN is compared with FCM and MSOM based clustering methods using various validity indexes. Our results show that FNCN produces better quality of clusters than FCM and MSOM. PubDate: 2017-09-01 DOI: 10.1007/s11634-015-0228-4 Issue No:Vol. 11, No. 3 (2017)
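
The fuzzy membership concept of fuzzy c-means mentioned in this abstract can be illustrated with the standard FCM membership update for fixed cluster centers. This is only a sketch of that one formula, not the FNCN framework itself; the function name `fcm_memberships` and the toy data are hypothetical.

```python
import numpy as np

def fcm_memberships(X, centers, m=2.0):
    """Standard fuzzy c-means memberships for fixed centers.

    u[i, j] = 1 / sum_k (d[i, j] / d[i, k]) ** (2 / (m - 1)),
    where d[i, j] is the Euclidean distance from point i to center j
    and m > 1 is the fuzzifier.
    """
    d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    d = np.maximum(d, 1e-12)  # guard against a point sitting exactly on a center
    inv = d ** (-2.0 / (m - 1.0))
    return inv / inv.sum(axis=1, keepdims=True)

centers = np.array([[0.0, 0.0], [4.0, 0.0]])
X = np.array([[2.0, 0.0],   # equidistant from both centers
              [1.0, 0.0]])  # closer to the first center
U = fcm_memberships(X, centers)
```

Each row of `U` sums to one, so a session that lies between two clusters is shared between them rather than forced into one.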

Authors:Vaishali Mirge; Kesari Verma; Shubhrata Gupta Pages: 547 - 561 Abstract: Due to the rapid growth of wireless communications and positioning technologies, trajectory data have become increasingly popular, posing great challenges to researchers in the data mining and machine learning community. Trajectory data are obtained using GPS devices that capture the position of an object at specific time intervals. These enormous amounts of data make it necessary to explore efficient and effective techniques to extract useful information to solve real world problems. Traffic flow pattern mining is one of the challenging issues for many applications. In the literature a significant number of approaches are available to cluster trajectory data; however, clustering has not been explored for trajectory pattern mining in bi-directional road networks. This paper presents a novel technique for excavating heavy traffic flow patterns in a bi-directional road network, i.e. identifying divisions of the roads where the traffic flow is very dense. The proposed technique works in two phases: phase I finds the clusters of trajectory points based on the density of trajectory points; phase II arranges the clusters in sequence based on spatiotemporal values for each route and direction. These sequences represent the traffic flow patterns. All the routes and sections exceeding a user-specified minimum traffic threshold are marked as high-density traffic areas. The experiments are performed on a synthetic dataset. The proposed algorithm efficiently and accurately finds the dense traffic in bi-directional roads. The proposed clustering method is compared with the standard k-means clustering algorithm for the performance evaluation. PubDate: 2017-09-01 DOI: 10.1007/s11634-016-0256-8 Issue No:Vol. 11, No. 3 (2017)

Authors:Maurizio Vichi Pages: 563 - 591 Abstract: Disjoint factor analysis (DFA) is a new latent factor model that we propose here to identify factors that relate to disjoint subsets of variables, thus simplifying the loading matrix structure. Similarly to exploratory factor analysis (EFA), DFA does not hypothesize prior information on the number of factors and on the relevant relations between variables and factors. In DFA the population variance–covariance structure is hypothesized to be block diagonal after the proper permutation of variables and is estimated by maximum likelihood, using a coordinate descent type algorithm. Inference on the parameters, on the number of factors, and on confirming the hypothesized simple structure is provided. Properties such as scale equivariance, uniqueness, and optimal simplification of loadings are satisfied by DFA. Relevant cross-loadings are also estimated in case they are detected from the best DFA solution. DFA also has the option to constrain a variable to load on a pre-specified factor so that the researcher can assume, a priori, some relations between variables and loadings. A simulation study shows the performance of DFA, and an application to optimally identify the dimensions of well-being is used to illustrate characteristics of the new methodology. A final discussion concludes the paper. PubDate: 2017-09-01 DOI: 10.1007/s11634-016-0263-9 Issue No:Vol. 11, No. 3 (2017)

Authors:Leila Amiri; Mojtaba Khazaei; Mojtaba Ganjali Pages: 593 - 609 Abstract: The general location model (GLOM) is a well-known model for analyzing mixed data. In GLOM one decomposes the joint distribution of variables into the conditional distribution of continuous variables given categorical outcomes and the marginal distribution of categorical variables. The first version of GLOM assumes that the covariance matrices of continuous multivariate distributions across cells, which are obtained by different combinations of categorical variables, are equal. In this paper, the GLOMs are considered in both cases of equality and inequality of these covariance matrices. Three covariance structures are used across cells: the same factor analyzer, factor analyzer with unequal specific variance matrices (in the general and parsimonious forms) and factor analyzers with common factor loadings. These structures are used both for modeling the covariance structure and for reducing the number of parameters. The maximum likelihood estimates of parameters are computed via the EM algorithm. As an application for these models, we investigate the classification of continuous variables within cells. Based on these models, the classification is done for usual as well as for high dimensional data sets. Finally, to show the applicability of the proposed models for classification, results from analyzing three real data sets are presented. PubDate: 2017-09-01 DOI: 10.1007/s11634-016-0258-6 Issue No:Vol. 11, No. 3 (2017)

Authors:E. Emary; Hossam M. Zawbaa; Aboul Ella Hassanien; B. Parv Pages: 611 - 627 Abstract: This paper presents a multi-objective retinal blood vessel localization approach based on the flower pollination search algorithm (FPSA) and the pattern search (PS) algorithm. FPSA is a new evolutionary algorithm based on the flower pollination process of flowering plants. The proposed multi-objective fitness function uses FPSA, which searches for the optimal clustering of the given retinal image into compact clusters under some constraints. Pattern search (PS) as a local search method is then applied to further enhance the segmentation results using another objective function based on shape features. The proposed approach for retinal blood vessel localization is applied to a public database, namely the DRIVE data set. Results demonstrate that the performance of the proposed approach is comparable with state-of-the-art techniques in terms of accuracy, sensitivity, and specificity, with many extendable features. PubDate: 2017-09-01 DOI: 10.1007/s11634-016-0257-7 Issue No:Vol. 11, No. 3 (2017)

Authors:Thao Nguyen-Trang; Tai Vo-Van Pages: 629 - 643 Abstract: In this article, we suggest a new algorithm to identify the prior probabilities for the classification problem by the Bayesian method. The prior probabilities are determined by combining the information of populations in the training set and the new observations through the fuzzy clustering method (FCM), instead of using the uniform distribution, the ratio of samples, or the Laplace method as existing approaches do. We next combine the determined prior probabilities and the estimated likelihood functions to classify the new object. In practice, calculations are performed by Matlab procedures. The proposed algorithm is tested on three numerical examples including benchmark and real data sets. The results show that the new approach is reasonable and more efficient than existing ones. PubDate: 2017-09-01 DOI: 10.1007/s11634-016-0253-y Issue No:Vol. 11, No. 3 (2017)

Authors:Benjamin Quost; Thierry Denœux; Shoumei Li Abstract: Partially supervised learning extends both supervised and unsupervised learning, by considering situations in which only partial information about the response variable is available. In this paper, we consider partially supervised classification and we assume the learning instances to be labeled by Dempster–Shafer mass functions, called soft labels. Linear discriminant analysis and logistic regression are considered as special cases of generative and discriminative parametric models. We show that the evidential EM algorithm can be particularized to fit the parameters in each of these models. We describe experimental results with simulated data sets as well as with two real applications: K-complex detection in sleep EEG signals and facial expression recognition. These results confirm the interest of using soft labels for classification as compared to potentially erroneous crisp labels, when the true class membership is partially unknown or ill-defined. PubDate: 2017-11-11 DOI: 10.1007/s11634-017-0301-2

Authors:Pasquale Dolce; Vincenzo Esposito Vinzi; Natale Carlo Lauro Abstract: Partial least squares path modeling presents some inconsistencies in terms of coherence with the predictive directions specified in the inner model (i.e. the path directions), because the directions of the links in the inner model are not taken into account in the iterative algorithm. In fact, the procedure amplifies interdependence among blocks and fails to distinguish between dependent and explanatory blocks. The method proposed in this paper takes into account and respects the specified path directions, with the aim of improving the predictive ability of the model and maintaining the hypothesized theoretical inner model. To highlight its properties, the proposed method is compared to the classical PLS path modeling in terms of explained variability, predictive relevance and interpretation using artificial data through a real data application. A further development of the method allows the treatment of multi-dimensional blocks in composite-based path modeling. PubDate: 2017-11-10 DOI: 10.1007/s11634-017-0302-1

Authors:Stéphanie Bougeard; Hervé Abdi; Gilbert Saporta; Ndèye Niang Abstract: Multiblock component methods are applied to data sets for which several blocks of variables are measured on the same set of observations, with the goal of analyzing the relationships between these blocks of variables. In this article, we focus on multiblock component methods that integrate the information found in several blocks of explanatory variables in order to describe and explain one set of dependent variables. In the following, multiblock PLS and multiblock redundancy analysis are chosen as particular cases of multiblock component methods when one set of variables is explained by a set of predictor variables that is organized into blocks. Because these multiblock techniques assume that the observations come from a homogeneous population, they will provide suboptimal results when the observations actually come from different populations. A strategy to palliate this problem, presented in this article, is to use a technique such as clusterwise regression in order to identify homogeneous clusters of observations. This approach creates two new methods that provide clusters that have their own sets of regression coefficients. This combination of clustering and regression improves the overall quality of the prediction and facilitates the interpretation. In addition, the minimization of a well-defined criterion by means of a sequential algorithm ensures that the algorithm converges monotonically. Finally, the proposed method is distribution-free and can be used when the explanatory variables outnumber the observations within clusters. The proposed clusterwise multiblock methods are illustrated with a simulation study and a (simulated) example from marketing. PubDate: 2017-11-08 DOI: 10.1007/s11634-017-0296-8

Authors:Gunnar Carlsson; Facundo Mémoli; Alejandro Ribeiro; Santiago Segarra Abstract: This paper considers networks where relationships between nodes are represented by directed dissimilarities. The goal is to study methods that, based on the dissimilarity structure, output hierarchical clusters, i.e., a family of nested partitions indexed by a connectivity parameter. Our construction of hierarchical clustering methods is built around the concept of admissible methods, which are those that abide by the axioms of value—nodes in a network with two nodes are clustered together at the maximum of the two dissimilarities between them—and transformation—when dissimilarities are reduced, the network may become more clustered but not less. Two particular methods, termed reciprocal and nonreciprocal clustering, are shown to provide upper and lower bounds in the space of admissible methods. Furthermore, alternative clustering methodologies and axioms are considered. In particular, modifying the axiom of value such that clustering in two-node networks occurs at the minimum of the two dissimilarities entails the existence of a unique admissible clustering method. Finally, the developed clustering methods are implemented to analyze the internal migration in the United States. PubDate: 2017-11-08 DOI: 10.1007/s11634-017-0299-5
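
A small sketch may help make the axiom of value and reciprocal clustering concrete. It rests on an assumption not spelled out in the abstract: that reciprocal clustering merges two nodes at the smallest resolution for which a chain joins them whose links are within that resolution in both directions. Under that reading it amounts to single linkage on the dissimilarities symmetrized by their maximum. The function name `reciprocal_ultrametric` is hypothetical.

```python
import numpy as np

def reciprocal_ultrametric(A):
    """Resolution at which each pair of nodes merges (assumed definition).

    A[i, j] is the directed dissimilarity from node i to node j.
    """
    A = np.asarray(A, dtype=float)
    # A link counts only once both directions are within the resolution,
    # so each pair is charged the larger of its two dissimilarities.
    U = np.maximum(A, A.T)
    np.fill_diagonal(U, 0.0)
    # Floyd-Warshall variant: over all chains, minimize the largest link cost.
    for k in range(U.shape[0]):
        U = np.minimum(U, np.maximum(U[:, [k]], U[[k], :]))
    return U

# Two-node network: the nodes merge at the maximum of the two dissimilarities.
u2 = reciprocal_ultrametric([[0.0, 1.0],
                             [3.0, 0.0]])

# Three-node network: nodes 0 and 2 merge through node 1.
u3 = reciprocal_ultrametric([[0.0, 1.0, 9.0],
                             [2.0, 0.0, 1.0],
                             [9.0, 4.0, 0.0]])
```

Thresholding the returned matrix at increasing resolutions yields the family of nested partitions that the abstract refers to as hierarchical clusters.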

Authors:Šárka Brodinová; Maia Zaharieva; Peter Filzmoser; Thomas Ortner; Christian Breiteneder Abstract: Media content in large repositories usually exhibits multiple groups of strongly varying sizes. Media of potential interest often form notably smaller groups. Such media groups differ so much from the remaining data that it may be worthwhile to look at them in more detail. In contrast, media with popular content appear in larger groups. Identifying groups of varying sizes is addressed by clustering of imbalanced data. Clustering highly imbalanced media groups is additionally challenged by the high dimensionality of the underlying features. In this paper, we present the imbalanced clustering (IClust) algorithm designed to reveal group structures in high-dimensional media data. IClust employs an existing clustering method in order to find an initial set of a large number of potentially highly pure clusters which are then successively merged. The main advantage of IClust is that the number of clusters does not have to be pre-specified and that no specific assumptions about the cluster or data characteristics need to be made. Experiments on real-world media data demonstrate that in comparison to existing methods, IClust is able to better identify media groups, especially groups of small sizes. PubDate: 2017-11-07 DOI: 10.1007/s11634-017-0292-z

Authors:Juana-María Vivo; Manuel Franco; Donatella Vicari Abstract: The area under a receiver operating characteristic (ROC) curve is valuable for evaluating the classification performance described by the entire ROC curve in many fields including decision making and medical diagnosis. However, this can be misleading when clinical tasks demand a restricted specificity range. The partial area under a portion of the ROC curve (pAUC) has more practical relevance in such situations, but it is usually transformed to overcome some drawbacks and improve its interpretation. The standardized pAUC (SpAUC) index is considered as a meaningful relative measure of predictive accuracy. Nevertheless, this SpAUC index might still show some limitations due to ROC curves crossing the diagonal line, and to the problem when comparing two tests with crossing ROC curves in the same restricted specificity range. This paper provides an alternative pAUC index which overcomes these limitations. Tighter bounds for the pAUC of an ROC curve are derived, and then a modified pAUC index for any restricted specificity range is established. In addition, the proposed tighter partial area index (TpAUC) is also shown for classifiers when high specificity must be clinically maintained. The variance of the TpAUC is also studied analytically and by simulation studies in a theoretical framework based on the most typical assumption of a binormal model, and estimated by using nonparametric bootstrap resampling in the empirical examples. Simulated and real datasets illustrate the practical utility of the TpAUC. PubDate: 2017-10-27 DOI: 10.1007/s11634-017-0295-9
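
For orientation, the partial area and its usual standardization can be computed as in the sketch below. It covers only the classical quantities: the empirical pAUC over a false positive rate range and the standardization commonly attributed to McClish, which rescales the pAUC between the area under the diagonal and the area of the full rectangle. The tighter bounds and the TpAUC proposed in the paper are not reproduced here. The function names are hypothetical, untied scores are assumed, and a specificity range [s1, s2] corresponds to the false positive rate range [1 - s2, 1 - s1].

```python
import numpy as np

def roc_points(scores, labels):
    """Empirical ROC points, sweeping the threshold from high to low.

    Assumes labels are 0/1 and that no two scores are tied.
    """
    order = np.argsort(-np.asarray(scores, dtype=float))
    y = np.asarray(labels)[order]
    tpr = np.concatenate([[0.0], np.cumsum(y == 1) / np.sum(y == 1)])
    fpr = np.concatenate([[0.0], np.cumsum(y == 0) / np.sum(y == 0)])
    return fpr, tpr

def partial_auc(fpr, tpr, lo, hi):
    """Area under the piecewise linear ROC curve for FPR in [lo, hi]."""
    area = 0.0
    for f0, t0, f1, t1 in zip(fpr[:-1], tpr[:-1], fpr[1:], tpr[1:]):
        left, right = max(f0, lo), min(f1, hi)
        if f1 <= f0 or right <= left:
            continue  # vertical step, or segment outside the range
        slope = (t1 - t0) / (f1 - f0)
        t_left = t0 + slope * (left - f0)
        t_right = t0 + slope * (right - f0)
        area += 0.5 * (t_left + t_right) * (right - left)
    return area

def standardized_pauc(pauc, lo, hi):
    """Rescale pAUC so the diagonal maps to 0.5 and a perfect curve to 1."""
    min_area = 0.5 * (hi ** 2 - lo ** 2)  # area under the chance diagonal
    max_area = hi - lo                    # area of the full rectangle
    return 0.5 * (1.0 + (pauc - min_area) / (max_area - min_area))

scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2]
labels = [1, 1, 0, 1, 0, 1, 0, 0]
fpr, tpr = roc_points(scores, labels)
pauc = partial_auc(fpr, tpr, 0.0, 0.5)      # specificity between 0.5 and 1
spauc = standardized_pauc(pauc, 0.0, 0.5)
```

If part of the ROC curve lies below the diagonal within the chosen range, `standardized_pauc` can fall below 0.5, which is one of the limitations the abstract attributes to the SpAUC.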

Authors:Gerhard Tutz; Moritz Berger Abstract: Generalized linear and additive models are very efficient regression tools but many parameters have to be estimated if categorical predictors with many categories are included. The method proposed here focusses on the main effects of categorical predictors by using tree type methods to obtain clusters of categories. When the predictor has many categories one wants to know in particular which of the categories have to be distinguished with respect to their effect on the response. The tree-structured approach makes it possible to detect clusters of categories that share the same effect while letting other predictors, in particular metric predictors, have a linear or additive effect on the response. An algorithm for the fitting is proposed and various stopping criteria are evaluated. The preferred stopping criterion is based on p values representing a conditional inference procedure. In addition, the stability of clusters and the relevance of predictors are investigated by bootstrap methods. Several applications show the usefulness of the tree-structured approach and small simulation studies demonstrate that the fitting procedure works well. PubDate: 2017-10-26 DOI: 10.1007/s11634-017-0298-6

Authors:Irene Epifanio; María Victoria Ibáñez; Amelia Simó Abstract: Archetype and archetypoid analysis are extended to shapes. The objective is to find representative shapes. Archetypal shapes are pure (extreme) shapes. We focus on the case where the shape of an object is represented by a configuration matrix of landmarks. As shape space is not a vectorial space, we work in the tangent space, the linearized space about the mean shape. Then, each observation is approximated by a convex combination of actual observations (archetypoids) or archetypes, which are a convex combination of observations in the data set. These tools can contribute to the understanding of shapes, as in the usual multivariate case, since they lie somewhere between clustering and matrix factorization methods. A new simplex visualization tool is also proposed to provide a picture of the archetypal analysis results. We also propose new algorithms for performing archetypal analysis with missing data and its extension to incomplete shapes. A well-known data set is used to illustrate the methodologies developed. The proposed methodology is applied to an apparel design problem in children. PubDate: 2017-10-25 DOI: 10.1007/s11634-017-0297-7

Authors:Luis Angel García-Escudero; Alfonso Gordaliza; Francesca Greselin; Salvatore Ingrassia; Agustín Mayo-Iscar Abstract: This paper presents a review of the use of eigenvalue restrictions for constrained parameter estimation in mixtures of elliptical distributions according to the likelihood approach. The restrictions serve a twofold purpose: to avoid convergence to degenerate solutions and to reduce the onset of non-interesting (spurious) local maximizers, related to complex likelihood surfaces. The paper shows how the constraints may play a key role in the theory of Euclidean data clustering. The aim here is to provide a reasoned survey of the constraints and their applications, considering the contributions of many authors and spanning the literature of the last 30 years. PubDate: 2017-10-23 DOI: 10.1007/s11634-017-0293-y

Authors:Andrew Marchese; Vasileios Maroulas Abstract: In this paper, we consider the problem of signal classification. First, the signal is translated into a persistence diagram through the use of delay-embedding and persistent homology. Endowing the data space of persistence diagrams with a metric from point processes, we show that it admits statistical structure in the form of Fréchet means and variances and a classification scheme is established. In contrast with the Wasserstein distance, this metric accounts for changes in small persistence and changes in cardinality. The classification results using this distance are benchmarked on both synthetic data and real acoustic signals and it is demonstrated that this classifier outperforms current signal classification techniques. PubDate: 2017-10-13 DOI: 10.1007/s11634-017-0294-x
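
The first step of the pipeline in this abstract, delay embedding, can be sketched as below. Only the embedding is shown; computing persistent homology of the resulting point cloud and the point-process metric are outside this sketch. The function name `delay_embed` and the choices `dim=2`, `tau=25` are illustrative assumptions.

```python
import numpy as np

def delay_embed(signal, dim, tau):
    """Return delay vectors (x[t], x[t + tau], ..., x[t + (dim - 1) * tau]) as rows."""
    x = np.asarray(signal, dtype=float)
    n_vectors = len(x) - (dim - 1) * tau
    if n_vectors <= 0:
        raise ValueError("signal too short for this dim and tau")
    return np.column_stack([x[i * tau : i * tau + n_vectors] for i in range(dim)])

# A sine wave embedded in the plane traces out a closed loop.
t = np.linspace(0, 4 * np.pi, 200)
cloud = delay_embed(np.sin(t), dim=2, tau=25)

# Tiny example whose rows are easy to read off by hand.
small = delay_embed([0, 1, 2, 3, 4], dim=3, tau=1)
```

A periodic signal produces a loop in the embedded cloud, which is the kind of feature a persistence diagram is meant to record.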