Abstract: Many clustering algorithms for data consisting of curves or functions have recently been proposed. However, the presence of contamination in the sample of curves can influence the performance of most of them. In this work we propose a robust, model-based clustering method that relies on an approximation to the “density function” for functional data. The robustness follows from the joint application of data-driven trimming, for reducing the effect of contaminated observations, and constraints on the variances, for avoiding spurious clusters in the solution. The algorithm is designed to perform clustering and outlier detection simultaneously by maximizing a trimmed “pseudo” likelihood. The proposed method has been evaluated and compared with other existing methods through a simulation study. The proposed methodology shows better performance when a fraction of contaminating curves is added to a non-contaminated sample. Finally, an application to a real data set that has been previously considered in the literature is given. PubDate: 2018-02-03 DOI: 10.1007/s11634-018-0312-7

Authors:Giuseppe Bove; Akinori Okada Abstract: Asymmetric pairwise relationships are frequently observed in experimental and non-experimental studies. They can be analysed with different aims and approaches. A brief review of models and methods of multidimensional scaling and cluster analysis able to deal with asymmetric proximities is provided taking a ‘data-analytic’ approach and emphasizing data visualization. PubDate: 2018-02-01 DOI: 10.1007/s11634-017-0307-9

Authors:Dawit G. Tadesse; Mark Carpenter Abstract: In this paper, we give a new feature selection algorithm for the binary classification problem in sparse high-dimensional spaces. Singular value decomposition (SVD) is a popular dimension reduction method in higher-dimensional classification. The traditional SVD method begins by ranking the Singular Dimensions (SDs) from the largest singular value to the smallest. However, when the number of signal features is smaller than the number of noise features, the first few ranked SDs are not necessarily the best for classification. We demonstrate, theoretically and empirically, that our method efficiently selects the SDs most appropriate for classification and significantly reduces the misclassification error. We also apply our method to real data text mining applications. PubDate: 2018-01-25 DOI: 10.1007/s11634-018-0311-8

Authors:Sébastien Loisel; Yoshio Takane Abstract: Missing data are prevalent in many data analytic situations. Those in which principal component analysis (PCA) is applied are no exceptions. The performance of five methods for handling missing data in PCA is investigated: the missing data passive method, the weighted low rank approximation (WLRA) method, the regularized PCA (RPCA) method, the trimmed scores regression method, and the data augmentation (DA) method. Three complete data sets of varying sizes were selected, in which missing data were created randomly and non-randomly. These data were then analyzed by the five methods, and their parameter recovery capability, as measured by the mean congruence coefficient between loadings obtained from full and missing data, was compared as a function of the number of extracted components (dimensionality) and the proportion of missing data (censor rate). For randomly censored data, all five methods worked well when the dimensionality and censor rate were small. Their performance deteriorated as the dimensionality and censor rate increased, but the deterioration was distinctly faster with the WLRA method. The RPCA method worked best and the DA method came a close second in terms of parameter recovery. However, the latter, as implemented here, was found to be extremely time-consuming. For non-randomly censored data, the recovery was also affected by the degree of non-randomness in the censoring processes. Again the RPCA method worked best, maintaining good to excellent recoveries when the censor rate was small and the dimensionality of solutions was not excessive. PubDate: 2018-01-18 DOI: 10.1007/s11634-018-0310-9
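The recovery measure mentioned in this abstract can be illustrated with a minimal sketch. It assumes the congruence coefficient is Tucker's coefficient, which the abstract does not state explicitly, and it ignores sign and rotation alignment of the loadings. The function names `congruence` and `mean_congruence` are hypothetical.

```python
import math

def congruence(x, y):
    """Tucker's congruence coefficient between two loading vectors
    (assumed definition; the abstract only says 'congruence coefficient')."""
    num = sum(a * b for a, b in zip(x, y))
    den = math.sqrt(sum(a * a for a in x) * sum(b * b for b in y))
    return num / den

def mean_congruence(full_loadings, missing_loadings):
    """Average the coefficient over matched components.
    Each argument is a list of loading vectors, one per component."""
    pairs = list(zip(full_loadings, missing_loadings))
    return sum(congruence(f, m) for f, m in pairs) / len(pairs)
```

A value near 1 indicates that the loadings estimated from incomplete data are nearly proportional to those from the complete data.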

Authors:Amparo Baíllo; Javier Cárcamo; Konstantin Getman Abstract: The classification of X-ray sources into classes (such as extragalactic sources, background stars, ...) is an essential task in astronomy. Typically, one of the classes corresponds to extragalactic radiation, whose photon emission behaviour is well characterized by a homogeneous Poisson process. We propose to use normalized versions of the Wasserstein and Zolotarev distances to quantify the deviation of the distribution of photon interarrival times from the exponential class. Our main motivation is the analysis of a massive dataset from X-ray astronomy obtained by the Chandra Orion Ultradeep Project (COUP). This project yielded a large catalog of 1616 X-ray cosmic sources in the Orion Nebula region, with their series of photon arrival times and associated energies. We consider the plug-in estimators of these metrics, determine their asymptotic distributions, and illustrate their finite-sample performance with a Monte Carlo study. We estimate these metrics for each COUP source from three different classes. We conclude that our proposal provides a striking amount of information on the nature of the photon emitting sources. Further, these variables have the ability to identify X-ray sources wrongly catalogued before. As an appealing conclusion, we show that some sources, previously classified as extragalactic emissions, have a much higher probability of being young stars in the Orion Nebula. PubDate: 2018-01-18 DOI: 10.1007/s11634-018-0309-2

Authors:José E. Chacón Abstract: The two most widespread density-based approaches to clustering are surely mixture model clustering and modal clustering. In the mixture model approach, the density is represented as a mixture and clusters are associated with the different mixture components. In modal clustering, clusters are understood as regions of high density separated from each other by zones of lower density, so that they are closely related to certain regions around the density modes. If the true density is indeed in the assumed class of mixture densities, then mixture model clustering allows one to scrutinize more subtle situations than modal clustering. However, when mixture modeling is used in a nonparametric way, taking advantage of the denseness of the sieve of mixture densities to approximate any density, then the correspondence between clusters and mixture components may become questionable. In this paper we introduce two methods to adopt a modal clustering point of view after a mixture model fit. Examples are provided to illustrate that mixture modeling can also be used for clustering in a nonparametric sense, as long as clusters are understood as the domains of attraction of the density modes. Finally, a simulation study reveals that the new methods are extremely efficient from a computational point of view, while at the same time they retain a high level of accuracy. PubDate: 2018-01-13 DOI: 10.1007/s11634-018-0308-3

Authors:Francesco Dotto; Alessio Farcomeni; Luis Angel García-Escudero; Agustín Mayo-Iscar Pages: 691 - 710 Abstract: A new robust fuzzy regression clustering method is proposed. We estimate coefficients of a linear regression model in each unknown cluster. Our method aims to achieve robustness by trimming a fixed proportion of observations. Assignments to clusters are fuzzy: observations contribute to estimates in more than one single cluster. We describe general criteria for tuning the method. The proposed method seems to be robust with respect to different types of contamination. PubDate: 2017-12-01 DOI: 10.1007/s11634-016-0271-9 Issue No:Vol. 11, No. 4 (2017)

Authors:Alberto Fernández; Sara del Río; Abdullah Bawakid; Francisco Herrera Pages: 711 - 730 Abstract: Due to the vast amount of information available nowadays, and the advantages related to the processing of this data, the topics of big data and data science have acquired great importance in current research. Big data applications are mainly about scalability, which can be achieved via the MapReduce programming model. It is designed to divide the data into several chunks or groups that are processed in parallel, and whose result is “assembled” to provide a single solution. Among different classification paradigms adapted to this new framework, fuzzy rule based classification systems have shown interesting results with a MapReduce approach for big data. It is well known that the performance of these types of systems has a strong dependence on the selection of a good granularity level for the Data Base. However, in the context of MapReduce this parameter is even harder to determine as it can also be related to the number of Maps chosen for the processing stage. In this paper, we aim at analyzing the interrelation between the number of labels of the fuzzy variables and the scarcity of the data due to the data sampling in MapReduce. Specifically, we consider that as the partitioning of the initial instance set grows, the level of granularity necessary to achieve a good performance also becomes higher. The experimental results, carried out for several Big Data problems, and using the Chi-FRBCS-BigData algorithms, support our claims. PubDate: 2017-12-01 DOI: 10.1007/s11634-016-0260-z Issue No:Vol. 11, No. 4 (2017)

Authors:Sara de la Rosa de Sáa; María Asunción Lubiano; Beatriz Sinova; Peter Filzmoser Pages: 731 - 758 Abstract: Observations distant from the majority or deviating from the general pattern often appear in datasets. Classical estimates such as the sample mean or the sample variance can be substantially affected by these observations (outliers). Even a single outlier can have a huge distorting influence. However, when one deals with real-valued data there exist robust measures/estimates of location and scale (dispersion) which reduce the influence of these atypical values and provide approximately the same results as the classical estimates applied to the typical data without outliers. In real life, the data to be analyzed and interpreted are not always precisely defined and cannot always be properly expressed by using a numerical scale of measurement. Frequently, some of these imprecise data could be suitably described and modelled by considering a fuzzy rating scale of measurement. In this paper, several well-known scale (dispersion) estimators in the real-valued case are extended for random fuzzy numbers (i.e., random mechanisms generating fuzzy-valued data), and some of their properties as estimators for dispersion are examined. Furthermore, their robust behaviour is analyzed using two powerful tools, namely, the finite sample breakdown point and the sensitivity curves. Simulations, including empirical bias curves, are performed to complete the study. PubDate: 2017-12-01 DOI: 10.1007/s11634-015-0210-1 Issue No:Vol. 11, No. 4 (2017)

Authors:Rong Zhang; Baabak Ashuri; Yong Deng Pages: 759 - 783 Abstract: Time series attract much attention for their remarkable forecasting potential. This paper discusses how fuzzy logic improves accuracy when forecasting time series using a visibility graph and presents a novel method to make more accurate predictions. In the proposed method, historical data is first converted into a visibility graph. Then, the strategy of link prediction is utilized to preliminarily forecast the future data. Finally, the future data is revised based on fuzzy logic. To demonstrate the performance, the proposed method is applied to forecast the Construction Cost Index, the Taiwan Stock Index and student enrollments. The results show that fuzzy logic is able to improve the accuracy by designing appropriate fuzzy rules. In addition, a comparison shows that our method has high flexibility and predictability. It is expected that our work will not only make contributions to the theoretical study of time series forecasting, but also be beneficial to practical areas such as economy and engineering by providing more accurate predictions. PubDate: 2017-12-01 DOI: 10.1007/s11634-017-0300-3 Issue No:Vol. 11, No. 4 (2017)
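The first step of this method, converting a time series into a visibility graph, can be sketched as follows. The sketch assumes the natural visibility criterion with equally spaced time points, which the abstract does not specify, and it does not cover the link prediction or fuzzy revision steps. The function name `visibility_graph` is hypothetical.

```python
from itertools import combinations

def visibility_graph(series):
    """Return the edges (i, j), i < j, of the natural visibility graph.

    Points i and j are linked when every intermediate point lies strictly
    below the straight line joining (i, series[i]) and (j, series[j]).
    This criterion is an assumption, not taken from the abstract.
    """
    edges = set()
    for a, b in combinations(range(len(series)), 2):
        ya, yb = series[a], series[b]
        visible = all(
            series[c] < yb + (ya - yb) * (b - c) / (b - a)
            for c in range(a + 1, b)
        )
        if visible:
            edges.add((a, b))
    return edges
```

Neighbouring points are always linked because there is no intermediate point to block the line of sight. A peak between two lower points blocks their link, while a valley does not.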

Authors:Abdul Suleman Pages: 785 - 808 Abstract: We show that an improper initialization of the matrix of prototypes, \(\mathbf{V}\), can be misleading, and potentially gives rise to a degenerate fuzzy partition when performing fuzzy clustering by means of an archetypal analysis. Subsequently, we propose an algorithm to correct the initial guess for \(\mathbf{V}\), which is grounded in two theoretical results on convex hulls. A numerical experiment carried out to assess its accuracy, and involving more than 200,000 initializations, shows a failure rate of below 0.8%. PubDate: 2017-12-01 DOI: 10.1007/s11634-017-0303-0 Issue No:Vol. 11, No. 4 (2017)

Authors:Antonella Plaia; Mariangela Sciandra Abstract: Within the framework of preference rankings, the interest can lie in finding which predictors and which interactions are able to explain the observed preference structures, because preference decisions will usually depend on the characteristics of both the judges and the objects being judged. This work proposes the use of a univariate decision tree for ranking data based on the weighted distances for complete and incomplete rankings, and considers the area under the ROC curve both for pruning and model assessment. Two real and well-known datasets, the SUSHI preference data and the University ranking data, are used to display the performance of the methodology. PubDate: 2017-12-16 DOI: 10.1007/s11634-017-0306-x

Authors:A. Pedro Duarte Silva; Peter Filzmoser; Paula Brito Abstract: A multivariate outlier detection method for interval data is proposed that makes use of a parametric approach to model the interval data. The trimmed maximum likelihood principle is adapted in order to robustly estimate the model parameters. A simulation study demonstrates the usefulness of the robust estimates for outlier detection, and new diagnostic plots allow gaining deeper insight into the structure of real world interval data. PubDate: 2017-12-15 DOI: 10.1007/s11634-017-0305-y

Authors:Benjamin Quost; Thierry Denœux; Shoumei Li Abstract: Partially supervised learning extends both supervised and unsupervised learning, by considering situations in which only partial information about the response variable is available. In this paper, we consider partially supervised classification and we assume the learning instances to be labeled by Dempster–Shafer mass functions, called soft labels. Linear discriminant analysis and logistic regression are considered as special cases of generative and discriminative parametric models. We show that the evidential EM algorithm can be particularized to fit the parameters in each of these models. We describe experimental results with simulated data sets as well as with two real applications: K-complex detection in sleep EEG signals and facial expression recognition. These results confirm the interest of using soft labels for classification as compared to potentially erroneous crisp labels, when the true class membership is partially unknown or ill-defined. PubDate: 2017-11-11 DOI: 10.1007/s11634-017-0301-2

Authors:Pasquale Dolce; Vincenzo Esposito Vinzi; Natale Carlo Lauro Abstract: Partial least squares path modeling presents some inconsistencies in terms of coherence with the predictive directions specified in the inner model (i.e. the path directions), because the directions of the links in the inner model are not taken into account in the iterative algorithm. In fact, the procedure amplifies interdependence among blocks and fails to distinguish between dependent and explanatory blocks. The method proposed in this paper takes into account and respects the specified path directions, with the aim of improving the predictive ability of the model and maintaining the hypothesized theoretical inner model. To highlight its properties, the proposed method is compared to the classical PLS path modeling in terms of explained variability, predictive relevance and interpretation, using artificial data and a real data application. A further development of the method allows multi-dimensional blocks to be treated in composite-based path modeling. PubDate: 2017-11-10 DOI: 10.1007/s11634-017-0302-1

Authors:Stéphanie Bougeard; Hervé Abdi; Gilbert Saporta; Ndèye Niang Abstract: Multiblock component methods are applied to data sets for which several blocks of variables are measured on the same set of observations, with the goal of analyzing the relationships between these blocks of variables. In this article, we focus on multiblock component methods that integrate the information found in several blocks of explanatory variables in order to describe and explain one set of dependent variables. In the following, multiblock PLS and multiblock redundancy analysis are chosen, as particular cases of multiblock component methods when one set of variables is explained by a set of predictor variables that is organized into blocks. Because these multiblock techniques assume that the observations come from a homogeneous population, they will provide suboptimal results when the observations actually come from different populations. A strategy to palliate this problem, presented in this article, is to use a technique such as clusterwise regression in order to identify homogeneous clusters of observations. This approach creates two new methods that provide clusters that have their own sets of regression coefficients. This combination of clustering and regression improves the overall quality of the prediction and facilitates the interpretation. In addition, the minimization of a well-defined criterion by means of a sequential algorithm ensures that the algorithm converges monotonically. Finally, the proposed method is distribution-free and can be used when the explanatory variables outnumber the observations within clusters. The proposed clusterwise multiblock methods are illustrated with a simulation study and a (simulated) example from marketing. PubDate: 2017-11-08 DOI: 10.1007/s11634-017-0296-8

Authors:Gunnar Carlsson; Facundo Mémoli; Alejandro Ribeiro; Santiago Segarra Abstract: This paper considers networks where relationships between nodes are represented by directed dissimilarities. The goal is to study methods that, based on the dissimilarity structure, output hierarchical clusters, i.e., a family of nested partitions indexed by a connectivity parameter. Our construction of hierarchical clustering methods is built around the concept of admissible methods, which are those that abide by the axioms of value—nodes in a network with two nodes are clustered together at the maximum of the two dissimilarities between them—and transformation—when dissimilarities are reduced, the network may become more clustered but not less. Two particular methods, termed reciprocal and nonreciprocal clustering, are shown to provide upper and lower bounds in the space of admissible methods. Furthermore, alternative clustering methodologies and axioms are considered. In particular, modifying the axiom of value such that clustering in two-node networks occurs at the minimum of the two dissimilarities entails the existence of a unique admissible clustering method. Finally, the developed clustering methods are implemented to analyze the internal migration in the United States. PubDate: 2017-11-08 DOI: 10.1007/s11634-017-0299-5
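The two bounding methods named in this abstract can be sketched under assumed definitions that the abstract itself does not spell out. Reciprocal clustering is taken here to be single linkage on the symmetrized matrix max(A, Aᵀ). Nonreciprocal clustering is taken to be the larger of the two directed minimax path costs between a pair of nodes. The function names are hypothetical, and the input is a square list-of-lists dissimilarity matrix with a zero diagonal.

```python
def minimax_paths(d):
    """Floyd-Warshall variant: the cost of a path is its largest link,
    and each entry becomes the smallest such cost over all directed paths."""
    n = len(d)
    u = [row[:] for row in d]
    for k in range(n):
        for i in range(n):
            for j in range(n):
                u[i][j] = min(u[i][j], max(u[i][k], u[k][j]))
    return u

def reciprocal(a):
    """Merge levels when every link of a chain must be cheap in both directions."""
    n = len(a)
    sym = [[max(a[i][j], a[j][i]) for j in range(n)] for i in range(n)]
    return minimax_paths(sym)

def nonreciprocal(a):
    """Merge levels when the forward and return chains may use different links."""
    n = len(a)
    u = minimax_paths(a)
    return [[max(u[i][j], u[j][i]) for j in range(n)] for i in range(n)]

# Directed 3-cycle: cheap in one direction (1), expensive in the other (5).
cycle = [[0, 1, 5],
         [5, 0, 1],
         [1, 5, 0]]
```

On this cycle the nonreciprocal levels are all 1 and the reciprocal levels are all 5, which shows the gap between the lower and upper bounds. On a two-node network both sketches return the maximum of the two dissimilarities, in line with the axiom of value.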

Authors:Šárka Brodinová; Maia Zaharieva; Peter Filzmoser; Thomas Ortner; Christian Breiteneder Abstract: Media content in large repositories usually exhibits multiple groups of strongly varying sizes. Media of potential interest often form notably smaller groups. Such media groups differ so much from the remaining data that it may be worthwhile to look at them in more detail. In contrast, media with popular content appear in larger groups. Identifying groups of varying sizes is addressed by clustering of imbalanced data. Clustering highly imbalanced media groups is additionally challenged by the high dimensionality of the underlying features. In this paper, we present the imbalanced clustering (IClust) algorithm designed to reveal group structures in high-dimensional media data. IClust employs an existing clustering method in order to find an initial set of a large number of potentially highly pure clusters which are then successively merged. The main advantage of IClust is that the number of clusters does not have to be pre-specified and that no specific assumptions about the cluster or data characteristics need to be made. Experiments on real-world media data demonstrate that in comparison to existing methods, IClust is able to better identify media groups, especially groups of small sizes. PubDate: 2017-11-07 DOI: 10.1007/s11634-017-0292-z

Authors:Juana-María Vivo; Manuel Franco; Donatella Vicari Abstract: The area under a receiver operating characteristic (ROC) curve is valuable for evaluating the classification performance described by the entire ROC curve in many fields including decision making and medical diagnosis. However, this can be misleading when clinical tasks demand a restricted specificity range. The partial area under a portion of the ROC curve (\(pAUC\)) has more practical relevance in such situations, but it is usually transformed to overcome some drawbacks and improve its interpretation. The standardized \(pAUC\) (\(SpAUC\)) index is considered as a meaningful relative measure of predictive accuracy. Nevertheless, this \(SpAUC\) index might still show some limitations due to ROC curves crossing the diagonal line, and to the problem when comparing two tests with crossing ROC curves in the same restricted specificity range. This paper provides an alternative \(pAUC\) index which overcomes these limitations. Tighter bounds for the \(pAUC\) of an ROC curve are derived, and then a modified \(pAUC\) index for any restricted specificity range is established. In addition, the proposed tighter partial area index (\(TpAUC\)) is also shown for classifiers when high specificity must be clinically maintained. The variance of the \(TpAUC\) is also studied analytically and by simulation studies in a theoretical framework based on the most typical assumption of a binormal model, and estimated by using nonparametric bootstrap resampling in the empirical examples. Simulated and real datasets illustrate the practical utility of the \(TpAUC\). PubDate: 2017-10-27 DOI: 10.1007/s11634-017-0295-9
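For orientation, the partial area and its standardized version can be sketched as below. The sketch assumes that the standardization is the commonly used McClish form, which rescales the partial area between the chance-line area and the full rectangle. It also assumes a ROC curve given on a strictly increasing grid of false positive rates (FPR = 1 - specificity). The proposed \(TpAUC\) index is not reproduced here, and the function names are hypothetical.

```python
def partial_auc(fpr, tpr, lo, hi):
    """Trapezoidal area under a piecewise-linear ROC curve for lo <= FPR <= hi.
    Assumes fpr is strictly increasing and covers [lo, hi]."""
    def tpr_at(x):
        for i in range(len(fpr) - 1):
            if fpr[i] <= x <= fpr[i + 1]:
                w = (x - fpr[i]) / (fpr[i + 1] - fpr[i])
                return tpr[i] + w * (tpr[i + 1] - tpr[i])
        raise ValueError("x lies outside the ROC grid")

    xs = [lo] + [x for x in fpr if lo < x < hi] + [hi]
    ys = [tpr_at(x) for x in xs]
    return sum((xs[i + 1] - xs[i]) * (ys[i] + ys[i + 1]) / 2
               for i in range(len(xs) - 1))

def standardized_pauc(pauc, lo, hi):
    """McClish-style rescaling (assumed form): 0.5 on the chance line,
    1.0 for a curve with TPR = 1 over the whole range."""
    min_area = (hi ** 2 - lo ** 2) / 2   # area under the diagonal
    max_area = hi - lo                   # area of the full rectangle
    return 0.5 * (1 + (pauc - min_area) / (max_area - min_area))
```

A restricted specificity range [s1, s2] corresponds to the FPR range [1 - s2, 1 - s1]. A curve that dips below the diagonal in the range yields a value under 0.5, which is one of the limitations the abstract mentions.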