Authors:Francesco Dotto; Alessio Farcomeni; Luis Angel García-Escudero; Agustín Mayo-Iscar Pages: 691 - 710 Abstract: A new robust fuzzy regression clustering method is proposed. We estimate coefficients of a linear regression model in each unknown cluster. Our method aims to achieve robustness by trimming a fixed proportion of observations. Assignments to clusters are fuzzy: observations contribute to estimates in more than one single cluster. We describe general criteria for tuning the method. The proposed method seems to be robust with respect to different types of contamination. PubDate: 2017-12-01 DOI: 10.1007/s11634-016-0271-9 Issue No:Vol. 11, No. 4 (2017)

Authors:Alberto Fernández; Sara del Río; Abdullah Bawakid; Francisco Herrera Pages: 711 - 730 Abstract: Due to the vast amount of information available nowadays, and the advantages related to the processing of this data, the topics of big data and data science have acquired great importance in current research. Big data applications are mainly about scalability, which can be achieved via the MapReduce programming model. It is designed to divide the data into several chunks or groups that are processed in parallel, and whose results are “assembled” to provide a single solution. Among the different classification paradigms adapted to this new framework, fuzzy rule based classification systems have shown interesting results with a MapReduce approach for big data. It is well known that the performance of these types of systems has a strong dependence on the selection of a good granularity level for the Data Base. However, in the context of MapReduce this parameter is even harder to determine, as it can also be related to the number of Maps chosen for the processing stage. In this paper, we aim at analyzing the interrelation between the number of labels of the fuzzy variables and the scarcity of the data due to the data sampling in MapReduce. Specifically, we consider that as the partitioning of the initial instance set grows, the level of granularity necessary to achieve a good performance also becomes higher. The experimental results, carried out for several Big Data problems, and using the Chi-FRBCS-BigData algorithms, support our claims. PubDate: 2017-12-01 DOI: 10.1007/s11634-016-0260-z Issue No:Vol. 11, No. 4 (2017)
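The chunk-and-assemble idea described in this abstract can be illustrated with a minimal single-process sketch. It is not the Chi-FRBCS-BigData algorithm: the names `split_into_chunks`, `map_chunk`, `reduce_counts` and `class_counts` are hypothetical, and the map step merely counts class labels per chunk to show how partial results are merged.

```python
from collections import Counter
from functools import reduce


def split_into_chunks(records, n_maps):
    """Split records into n_maps chunks (round-robin assignment)."""
    return [records[i::n_maps] for i in range(n_maps)]


def map_chunk(chunk):
    """Map step (illustrative only): count class labels within one chunk."""
    return Counter(label for _, label in chunk)


def reduce_counts(left, right):
    """Reduce step: merge two partial results into one."""
    return left + right


def class_counts(records, n_maps):
    """Run the map step on every chunk, then assemble a single result."""
    partials = [map_chunk(chunk) for chunk in split_into_chunks(records, n_maps)]
    return reduce(reduce_counts, partials, Counter())


# Toy data: (feature value, class label) pairs, invented for illustration.
data = [(0.1, "a"), (0.4, "b"), (0.35, "a"), (0.8, "b"), (0.9, "b")]
counts = class_counts(data, n_maps=2)
```

Raising `n_maps` leaves the assembled result unchanged here, but each chunk holds fewer instances, which is the data scarcity the abstract relates to the choice of granularity.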

Authors:Sara de la Rosa de Sáa; María Asunción Lubiano; Beatriz Sinova; Peter Filzmoser Pages: 731 - 758 Abstract: Observations distant from the majority or deviating from the general pattern often appear in datasets. Classical estimates such as the sample mean or the sample variance can be substantially affected by these observations (outliers). Even a single outlier can have a huge distorting influence. However, when one deals with real-valued data there exist robust measures/estimates of location and scale (dispersion) which reduce the influence of these atypical values and provide approximately the same results as the classical estimates applied to the typical data without outliers. In real life, data to be analyzed and interpreted are not always precisely defined and they cannot be properly expressed by using a numerical scale of measurement. Frequently, some of these imprecise data could be suitably described and modelled by considering a fuzzy rating scale of measurement. In this paper, several well-known scale (dispersion) estimators in the real-valued case are extended for random fuzzy numbers (i.e., random mechanisms generating fuzzy-valued data), and some of their properties as estimators for dispersion are examined. Furthermore, their robust behaviour is analyzed using two powerful tools, namely, the finite sample breakdown point and the sensitivity curves. Simulations, including empirical bias curves, are performed to complete the study. PubDate: 2017-12-01 DOI: 10.1007/s11634-015-0210-1 Issue No:Vol. 11, No. 4 (2017)
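The effect of a single outlier on a classical versus a robust scale estimate can be seen in a small real-valued sketch. It does not cover the fuzzy-number extension studied in the paper; the median absolute deviation is chosen here only as one familiar robust estimator (an assumption, since the abstract does not list the estimators), and the data are invented.

```python
import statistics


def mad(values):
    """Median absolute deviation about the median (unscaled)."""
    center = statistics.median(values)
    return statistics.median(abs(v - center) for v in values)


clean = [1.0, 2.0, 3.0, 4.0, 5.0]
contaminated = [1.0, 2.0, 3.0, 4.0, 500.0]  # one value replaced by an outlier

# Classical estimate: the sample standard deviation reacts strongly.
sd_clean = statistics.stdev(clean)
sd_contaminated = statistics.stdev(contaminated)

# Robust estimate: the MAD is driven by the central half of the data.
mad_clean = mad(clean)
mad_contaminated = mad(contaminated)
```

In this toy sample the MAD is the same before and after contamination, while the standard deviation grows by roughly two orders of magnitude.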

Authors:Rong Zhang; Baabak Ashuri; Yong Deng Pages: 759 - 783 Abstract: Time series attracts much attention for its remarkable forecasting potential. This paper discusses how fuzzy logic improves accuracy when forecasting time series using a visibility graph and presents a novel method to make more accurate predictions. In the proposed method, historical data is first converted into a visibility graph. Then, the strategy of link prediction is utilized to preliminarily forecast the future data. Finally, the future data is revised based on fuzzy logic. To demonstrate the performance, the proposed method is applied to forecast the Construction Cost Index, the Taiwan Stock Index and student enrollments. The results show that fuzzy logic is able to improve the accuracy by designing appropriate fuzzy rules. In addition, through comparison, it is proved that our method has high flexibility and predictability. It is expected that our work will not only make contributions to the theoretical study of time series forecasting, but also be beneficial to practical areas such as economy and engineering by providing more accurate predictions. PubDate: 2017-12-01 DOI: 10.1007/s11634-017-0300-3 Issue No:Vol. 11, No. 4 (2017)
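The first step of the method, converting a series into a visibility graph, can be sketched as follows. This assumes the commonly used natural visibility criterion with unit time steps (two observations are linked when every observation between them lies strictly below the straight line joining them); the abstract does not state which variant the authors use, and the link prediction and fuzzy revision steps are not shown. The function name is hypothetical.

```python
def visibility_graph(series):
    """Return the edge set of the natural visibility graph of a series.

    Nodes are time indices 0..n-1; an edge (a, b) with a < b exists when
    every intermediate point lies strictly below the line from a to b.
    """
    n = len(series)
    edges = set()
    for a in range(n):
        for b in range(a + 1, n):
            visible = all(
                series[c]
                < series[b] + (series[a] - series[b]) * (b - c) / (b - a)
                for c in range(a + 1, b)
            )
            if visible:
                edges.add((a, b))
    return edges


valley = visibility_graph([3, 1, 2])  # the low middle point blocks nothing
peak = visibility_graph([1, 3, 1])    # the high middle point blocks the ends
```

Consecutive observations are always linked, because there is no intermediate point to block the view.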

Authors:Abdul Suleman Pages: 785 - 808 Abstract: We show that an improper initialization of the matrix of prototypes, \(\mathbf{V}\), can be misleading, and potentially gives rise to a degenerate fuzzy partition when performing fuzzy clustering by means of an archetypal analysis. Subsequently, we propose an algorithm to correct the initial guess for \(\mathbf{V}\), which is grounded in two theoretical results on convex hulls. A numerical experiment carried out to assess its accuracy, and involving more than 200,000 initializations, shows a failure rate of below 0.8%. PubDate: 2017-12-01 DOI: 10.1007/s11634-017-0303-0 Issue No:Vol. 11, No. 4 (2017)

Authors:Vaishali Mirge; Kesari Verma; Shubhrata Gupta Pages: 547 - 561 Abstract: Due to the rapid growth of wireless communications and positioning technologies, trajectory data have become increasingly popular, posing great challenges to researchers in the data mining and machine learning community. Trajectory data are obtained using GPS devices that capture the position of an object at specific time intervals. These enormous amounts of data make it necessary to explore efficient and effective techniques for extracting useful information to solve real-world problems. Traffic flow pattern mining is one of the challenging issues for many applications. A significant number of approaches for clustering trajectory data are available in the literature; however, clustering has not been explored for trajectory pattern mining in bi-directional road networks. This paper presents a novel technique for mining heavy traffic flow patterns in bi-directional road networks, i.e. identifying sections of the roads where the traffic flow is very dense. The proposed technique works in two phases: phase I finds the clusters of trajectory points based on the density of trajectory points; phase II arranges the clusters in sequence based on spatiotemporal values for each route and direction. These sequences represent the traffic flow patterns. All the routes and sections exceeding a user-specified minimum traffic threshold are marked as high-density traffic areas. The experiments are performed on a synthetic dataset. The proposed algorithm efficiently and accurately finds the dense traffic in bi-directional roads. The proposed clustering method is compared with the standard k-means clustering algorithm for performance evaluation. PubDate: 2017-09-01 DOI: 10.1007/s11634-016-0256-8 Issue No:Vol. 11, No. 3 (2017)

Authors:Maurizio Vichi Pages: 563 - 591 Abstract: Disjoint factor analysis (DFA) is a new latent factor model that we propose here to identify factors that relate to disjoint subsets of variables, thus simplifying the loading matrix structure. Similarly to exploratory factor analysis (EFA), DFA does not hypothesize prior information on the number of factors or on the relevant relations between variables and factors. In DFA the population variance–covariance structure is hypothesized to be block diagonal after the proper permutation of variables and is estimated by maximum likelihood, using a coordinate descent type algorithm. Inference on the parameters and on the number of factors, and for confirming the hypothesized simple structure, is provided. Properties such as scale equivariance, uniqueness and optimal simplification of loadings are satisfied by DFA. Relevant cross-loadings are also estimated in case they are detected from the best DFA solution. DFA also has the option to constrain a variable to load on a pre-specified factor so that the researcher can assume, a priori, some relations between variables and loadings. A simulation study shows the performance of DFA, and an application to optimally identify the dimensions of well-being is used to illustrate characteristics of the new methodology. A final discussion concludes the paper. PubDate: 2017-09-01 DOI: 10.1007/s11634-016-0263-9 Issue No:Vol. 11, No. 3 (2017)

Authors:Thao Nguyen-Trang; Tai Vo-Van Pages: 629 - 643 Abstract: In this article, we suggest a new algorithm to identify the prior probabilities for the classification problem by the Bayesian method. The prior probabilities are determined by combining the information of the populations in the training set and the new observations through the fuzzy clustering method (FCM), instead of using the uniform distribution, the ratio of samples, or the Laplace method as existing approaches do. We next combine the determined prior probabilities and the estimated likelihood functions to classify the new object. In practice, calculations are performed by Matlab procedures. The proposed algorithm is tested on three numerical examples including benchmark and real data sets. The results show that the new approach is reasonable and more efficient than existing ones. PubDate: 2017-09-01 DOI: 10.1007/s11634-016-0253-y Issue No:Vol. 11, No. 3 (2017)

Authors:Benjamin Quost; Thierry Denœux; Shoumei Li Abstract: Partially supervised learning extends both supervised and unsupervised learning, by considering situations in which only partial information about the response variable is available. In this paper, we consider partially supervised classification and we assume the learning instances to be labeled by Dempster–Shafer mass functions, called soft labels. Linear discriminant analysis and logistic regression are considered as special cases of generative and discriminative parametric models. We show that the evidential EM algorithm can be particularized to fit the parameters in each of these models. We describe experimental results with simulated data sets as well as with two real applications: K-complex detection in sleep EEG signals and facial expression recognition. These results confirm the interest of using soft labels for classification as compared to potentially erroneous crisp labels, when the true class membership is partially unknown or ill-defined. PubDate: 2017-11-11 DOI: 10.1007/s11634-017-0301-2

Authors:Pasquale Dolce; Vincenzo Esposito Vinzi; Natale Carlo Lauro Abstract: Partial least squares path modeling presents some inconsistencies in terms of coherence with the predictive directions specified in the inner model (i.e. the path directions), because the directions of the links in the inner model are not taken into account in the iterative algorithm. In fact, the procedure amplifies interdependence among blocks and fails to distinguish between dependent and explanatory blocks. The method proposed in this paper takes into account and respects the specified path directions, with the aim of improving the predictive ability of the model and maintaining the hypothesized theoretical inner model. To highlight its properties, the proposed method is compared to the classical PLS path modeling in terms of explained variability, predictive relevance and interpretation using artificial data through a real data application. A further development of the method allows multi-dimensional blocks to be treated in composite-based path modeling. PubDate: 2017-11-10 DOI: 10.1007/s11634-017-0302-1

Authors:Stéphanie Bougeard; Hervé Abdi; Gilbert Saporta; Ndèye Niang Abstract: Multiblock component methods are applied to data sets for which several blocks of variables are measured on the same set of observations, with the goal of analyzing the relationships between these blocks of variables. In this article, we focus on multiblock component methods that integrate the information found in several blocks of explanatory variables in order to describe and explain one set of dependent variables. In the following, multiblock PLS and multiblock redundancy analysis are chosen as particular cases of multiblock component methods in which one set of variables is explained by a set of predictor variables that is organized into blocks. Because these multiblock techniques assume that the observations come from a homogeneous population, they will provide suboptimal results when the observations actually come from different populations. A strategy to palliate this problem, presented in this article, is to use a technique such as clusterwise regression in order to identify homogeneous clusters of observations. This approach creates two new methods that provide clusters that have their own sets of regression coefficients. This combination of clustering and regression improves the overall quality of the prediction and facilitates the interpretation. In addition, the minimization of a well-defined criterion by means of a sequential algorithm ensures that the algorithm converges monotonically. Finally, the proposed method is distribution-free and can be used when the explanatory variables outnumber the observations within clusters. The proposed clusterwise multiblock methods are illustrated with a simulation study and a (simulated) example from marketing. PubDate: 2017-11-08 DOI: 10.1007/s11634-017-0296-8

Authors:Gunnar Carlsson; Facundo Mémoli; Alejandro Ribeiro; Santiago Segarra Abstract: This paper considers networks where relationships between nodes are represented by directed dissimilarities. The goal is to study methods that, based on the dissimilarity structure, output hierarchical clusters, i.e., a family of nested partitions indexed by a connectivity parameter. Our construction of hierarchical clustering methods is built around the concept of admissible methods, which are those that abide by the axioms of value—nodes in a network with two nodes are clustered together at the maximum of the two dissimilarities between them—and transformation—when dissimilarities are reduced, the network may become more clustered but not less. Two particular methods, termed reciprocal and nonreciprocal clustering, are shown to provide upper and lower bounds in the space of admissible methods. Furthermore, alternative clustering methodologies and axioms are considered. In particular, modifying the axiom of value such that clustering in two-node networks occurs at the minimum of the two dissimilarities entails the existence of a unique admissible clustering method. Finally, the developed clustering methods are implemented to analyze the internal migration in the United States. PubDate: 2017-11-08 DOI: 10.1007/s11634-017-0299-5
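The axiom of value and the reciprocal method named in this abstract can be illustrated with a short sketch. It assumes that reciprocal clustering amounts to single linkage on the max-symmetrized dissimilarities, so that two nodes merge at the smallest level at which a chain of bidirectionally close nodes joins them; that reading is an assumption not spelled out in the abstract, and the function name and input matrices are hypothetical.

```python
def reciprocal_ultrametric(A):
    """Merge levels under the assumed reciprocal clustering rule.

    A[i][j] is the directed dissimilarity from node i to node j.
    Returns u, where u[i][j] is the level at which i and j join a cluster.
    """
    n = len(A)
    # Symmetrize: a direct link costs the larger of the two directions.
    u = [[0 if i == j else max(A[i][j], A[j][i]) for j in range(n)]
         for i in range(n)]
    # Minimax path cost: cheapest chain, where a chain costs its worst link.
    for k in range(n):
        for i in range(n):
            for j in range(n):
                u[i][j] = min(u[i][j], max(u[i][k], u[k][j]))
    return u


# Two-node network: the nodes merge at the maximum of the two dissimilarities.
two_nodes = reciprocal_ultrametric([[0, 1], [3, 0]])

# Three nodes: the chain 0-1-2 (levels 2 and 3) beats the direct link (level 5).
three_nodes = reciprocal_ultrametric([[0, 1, 5], [2, 0, 1], [5, 3, 0]])
```

Cutting `u` at a connectivity level `delta` (grouping nodes with `u[i][j] <= delta`) yields one partition of the nested family the abstract describes.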

Authors:Šárka Brodinová; Maia Zaharieva; Peter Filzmoser; Thomas Ortner; Christian Breiteneder Abstract: Media content in large repositories usually exhibits multiple groups of strongly varying sizes. Media of potential interest often form notably smaller groups. Such media groups differ so much from the remaining data that it may be worthwhile to look at them in more detail. In contrast, media with popular content appear in larger groups. Identifying groups of varying sizes is addressed by clustering of imbalanced data. Clustering highly imbalanced media groups is additionally challenged by the high dimensionality of the underlying features. In this paper, we present the imbalanced clustering (IClust) algorithm designed to reveal group structures in high-dimensional media data. IClust employs an existing clustering method in order to find an initial set of a large number of potentially highly pure clusters, which are then successively merged. The main advantage of IClust is that the number of clusters does not have to be pre-specified and that no specific assumptions about the cluster or data characteristics need to be made. Experiments on real-world media data demonstrate that, in comparison to existing methods, IClust is able to better identify media groups, especially groups of small sizes. PubDate: 2017-11-07 DOI: 10.1007/s11634-017-0292-z

Authors:Juana-María Vivo; Manuel Franco; Donatella Vicari Abstract: The area under a receiver operating characteristic (ROC) curve is valuable for evaluating the classification performance described by the entire ROC curve in many fields including decision making and medical diagnosis. However, this can be misleading when clinical tasks demand a restricted specificity range. The partial area under a portion of the ROC curve (\(pAUC\)) has more practical relevance in such situations, but it is usually transformed to overcome some drawbacks and improve its interpretation. The standardized \(pAUC\) (\(SpAUC\)) index is considered as a meaningful relative measure of predictive accuracy. Nevertheless, this \(SpAUC\) index might still show some limitations due to ROC curves crossing the diagonal line, and to the problem when comparing two tests with crossing ROC curves in the same restricted specificity range. This paper provides an alternative \(pAUC\) index which overcomes these limitations. Tighter bounds for the \(pAUC\) of an ROC curve are derived, and then a modified \(pAUC\) index for any restricted specificity range is established. In addition, the proposed tighter partial area index (\(TpAUC\)) is also shown for classifiers when high specificity must be clinically maintained. The variance of the \(TpAUC\) is also studied analytically and by simulation studies in a theoretical framework based on the most typical assumption of a binormal model, and estimated by using nonparametric bootstrap resampling in the empirical examples. Simulated and real datasets illustrate the practical utility of the \(TpAUC\). PubDate: 2017-10-27 DOI: 10.1007/s11634-017-0295-9
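The partial area and its standardized form can be made concrete with a short sketch. It assumes the usual McClish-style standardization, which rescales the partial area between the area under the chance diagonal and the area under a perfect classifier over the same range; the paper's tighter bounds and its TpAUC index are not reproduced. The range is given in terms of the false positive rate (1 minus specificity), and the function names and ROC points are hypothetical.

```python
def partial_auc(fpr, tpr, lo, hi):
    """Trapezoidal area under the ROC curve for lo <= FPR <= hi.

    Assumes fpr is strictly increasing and covers the interval [lo, hi].
    """
    def tpr_at(x):
        # Linear interpolation of the ROC curve at false positive rate x.
        for i in range(1, len(fpr)):
            if fpr[i - 1] <= x <= fpr[i]:
                w = (x - fpr[i - 1]) / (fpr[i] - fpr[i - 1])
                return tpr[i - 1] + w * (tpr[i] - tpr[i - 1])
        raise ValueError("x lies outside the FPR range of the curve")

    inner = [(x, y) for x, y in zip(fpr, tpr) if lo < x < hi]
    points = [(lo, tpr_at(lo))] + inner + [(hi, tpr_at(hi))]
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


def standardized_pauc(pauc, lo, hi):
    """Rescale pAUC so the chance line maps to 0.5 and a perfect test to 1."""
    lower = (hi ** 2 - lo ** 2) / 2  # area under the chance diagonal
    upper = hi - lo                  # area under a perfect classifier
    return 0.5 * (1 + (pauc - lower) / (upper - lower))


# FPR in [0, 0.2] corresponds to specificity in [0.8, 1].
chance = partial_auc([0.0, 1.0], [0.0, 1.0], 0.0, 0.2)
perfect = partial_auc([0.0, 1.0], [1.0, 1.0], 0.0, 0.2)
example = partial_auc([0.0, 0.1, 1.0], [0.0, 0.6, 1.0], 0.0, 0.2)
```

A curve that dips below the diagonal inside the range gives a partial area below `lower`, so the standardized value falls under 0.5; this is one of the limitations the abstract attributes to the index.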

Authors:Gerhard Tutz; Moritz Berger Abstract: Generalized linear and additive models are very efficient regression tools but many parameters have to be estimated if categorical predictors with many categories are included. The method proposed here focuses on the main effects of categorical predictors by using tree type methods to obtain clusters of categories. When the predictor has many categories one wants to know in particular which of the categories have to be distinguished with respect to their effect on the response. The tree-structured approach allows the detection of clusters of categories that share the same effect while letting other predictors, in particular metric predictors, have a linear or additive effect on the response. An algorithm for the fitting is proposed and various stopping criteria are evaluated. The preferred stopping criterion is based on p values representing a conditional inference procedure. In addition, stability of clusters is investigated and the relevance of predictors is investigated by bootstrap methods. Several applications show the usefulness of the tree-structured approach and small simulation studies demonstrate that the fitting procedure works well. PubDate: 2017-10-26 DOI: 10.1007/s11634-017-0298-6

Authors:Irene Epifanio; María Victoria Ibáñez; Amelia Simó Abstract: Archetype and archetypoid analysis are extended to shapes. The objective is to find representative shapes. Archetypal shapes are pure (extreme) shapes. We focus on the case where the shape of an object is represented by a configuration matrix of landmarks. As shape space is not a vector space, we work in the tangent space, the linearized space about the mean shape. Then, each observation is approximated by a convex combination of actual observations (archetypoids) or archetypes, which are a convex combination of observations in the data set. These tools can contribute to the understanding of shapes, as in the usual multivariate case, since they lie somewhere between clustering and matrix factorization methods. A new simplex visualization tool is also proposed to provide a picture of the archetypal analysis results. We also propose new algorithms for performing archetypal analysis with missing data and its extension to incomplete shapes. A well-known data set is used to illustrate the methodologies developed. The proposed methodology is applied to an apparel design problem in children. PubDate: 2017-10-25 DOI: 10.1007/s11634-017-0297-7

Authors:Luis Angel García-Escudero; Alfonso Gordaliza; Francesca Greselin; Salvatore Ingrassia; Agustín Mayo-Iscar Abstract: This paper presents a review of the use of eigenvalue restrictions for constrained parameter estimation in mixtures of elliptical distributions according to the likelihood approach. The restrictions serve a twofold purpose: to avoid convergence to degenerate solutions and to reduce the onset of non-interesting (spurious) local maximizers, related to complex likelihood surfaces. The paper shows how the constraints may play a key role in the theory of Euclidean data clustering. The aim here is to provide a reasoned survey of the constraints and their applications, considering the contributions of many authors and spanning the literature of the last 30 years. PubDate: 2017-10-23 DOI: 10.1007/s11634-017-0293-y

Authors:Andrew Marchese; Vasileios Maroulas Abstract: In this paper, we consider the problem of signal classification. First, the signal is translated into a persistence diagram through the use of delay-embedding and persistent homology. Endowing the data space of persistence diagrams with a metric from point processes, we show that it admits statistical structure in the form of Fréchet means and variances and a classification scheme is established. In contrast with the Wasserstein distance, this metric accounts for changes in small persistence and changes in cardinality. The classification results using this distance are benchmarked on both synthetic data and real acoustic signals and it is demonstrated that this classifier outperforms current signal classification techniques. PubDate: 2017-10-13 DOI: 10.1007/s11634-017-0294-x
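The delay-embedding step mentioned in this abstract can be sketched in a few lines. This shows only the standard construction that turns a scalar signal into a point cloud; the persistent homology computation, the point-process metric and the classifier are not shown. The function name, the embedding dimension `dim` and the lag `tau` are illustrative assumptions, since the abstract does not report the authors' choices.

```python
def delay_embed(signal, dim, tau):
    """Map a scalar signal to points (x[t], x[t + tau], ..., x[t + (dim-1)*tau])."""
    n_points = len(signal) - (dim - 1) * tau
    if n_points <= 0:
        raise ValueError("signal too short for this dim and tau")
    return [
        tuple(signal[t + k * tau] for k in range(dim))
        for t in range(n_points)
    ]


# A short toy signal embedded in the plane with lag 2.
cloud = delay_embed([0, 1, 2, 3, 4], dim=2, tau=2)
```

The resulting point cloud is the kind of object on which a persistence diagram would then be computed.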