• Editorial for issue 3/2017
• Multivariate and functional classification using depth and distance
• Authors: Mia Hubert; Peter Rousseeuw; Pieter Segaert
Abstract: We construct classifiers for multivariate and functional data. Our approach is based on a kind of distance between data points and classes. The distance measure needs to be robust to outliers and invariant to linear transformations of the data. For this purpose we can use the bagdistance, which is based on halfspace depth. It satisfies most of the properties of a norm but is able to reflect asymmetry when the class is skewed. Alternatively we can compute a measure of outlyingness based on the skew-adjusted projection depth. In either case we propose the DistSpace transform, which maps each data point to the vector of its distances to all classes, followed by k-nearest neighbor (kNN) classification of the transformed data points. This combines invariance and robustness with the simplicity and wide applicability of kNN. The proposal is compared with other methods in experiments with real and simulated data.
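The DistSpace idea described above (map each point to its vector of distances to all classes, then run kNN on the transformed points) can be sketched as follows. This is only an illustration under stated assumptions: the bagdistance and skew-adjusted projection depth are not implemented here, and the Euclidean distance to a coordinate-wise median is swapped in as a simple stand-in that is neither affine invariant nor depth based. All function names are hypothetical.

```python
import math
from collections import Counter
from statistics import median


def class_center(points):
    # ASSUMPTION: coordinate-wise median as a stand-in for a depth-based center.
    return [median(coord) for coord in zip(*points)]


def distspace_transform(x, centers):
    # Map one point to the vector of its distances to every class.
    # ASSUMPTION: plain Euclidean distance replaces the bagdistance.
    return [math.dist(x, c) for c in centers]


def distspace_knn(train_X, train_y, new_X, k=3):
    # Build one center per class, in a fixed label order.
    labels = sorted(set(train_y))
    centers = [
        class_center([x for x, y in zip(train_X, train_y) if y == g])
        for g in labels
    ]
    # Transform the training points into distance space.
    train_Z = [distspace_transform(x, centers) for x in train_X]
    predictions = []
    for x in new_X:
        z = distspace_transform(x, centers)
        # kNN is carried out on the transformed points, not the raw data.
        nearest = sorted(
            range(len(train_Z)), key=lambda i: math.dist(z, train_Z[i])
        )[:k]
        votes = Counter(train_y[i] for i in nearest)
        predictions.append(votes.most_common(1)[0][0])
    return predictions
```

With G classes the transformed points always live in a G-dimensional space regardless of the original dimension, which is what makes a plain kNN step applicable to multivariate and functional data alike.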
• Benchmarking different clustering algorithms on functional data
• Authors: Christina Yassouridis; Friedrich Leisch
Abstract: Theoretical knowledge of clustering functions is still scarce and only a few models are available in the form of applicable code. In the literature, most methods are based on the projection of the functions onto a basis and on building fixed or random effects models of the basis coefficients. They involve various parameters, among them the number of basis functions, the projection dimension, the number of iterations etc. They usually work well on the data presented in the articles, but their performance has in most cases not been tested objectively on other data sets, nor against each other. The purpose of this paper is to give an overview of several existing methods to cluster functional data. An outline of their theoretic concepts is given and the meaning of their hyperparameters is explained. A simulation study was set up to analyze the parameters’ efficiency and sensitivity on different types of data sets that were registered on regular and on irregular grids. For each method, a linear model of the clustering results was evaluated with different parameter levels as predictors. Later, the methods’ performances were compared to each other with the help of a visualization tool, to identify which method works best on a specific kind of data.
• Constrained clustering with a complex cluster structure
• Authors: Marek Śmieja; Magdalena Wiercioch
Abstract: In this contribution we present a novel constrained clustering method, Constrained clustering with a complex cluster structure (C4s), which incorporates equivalence constraints, both positive and negative, as the background information. C4s is capable of discovering groups of arbitrary structure, e.g. with a multi-modal distribution, since at the initial stage the equivalence classes of elements generated by the positive constraints are split into smaller parts. This provides a detailed description of elements that are in a positive equivalence relation. In order to enable an automatic detection of the number of groups, cross-entropy clustering is applied for each partitioning process. Experiments show that the proposed method achieves significantly better results than previous constrained clustering approaches. The advantage of our algorithm increases when we focus on finding partitions with a complex structure of clusters.
• A fuzzy neural network based framework to discover user access patterns from web log data
• Authors: Zahid A. Ansari; Syed Abdul Sattar; A. Vinaya Babu
Abstract: Clustering data from web user sessions is extensively applied to extract customer usage behavior to serve customized content to individual users. Due to the human involvement, web usage data usually contain noisy, incomplete and vague information. Neural networks have the capability to extract embedded knowledge in the form of user session clusters from the huge web usage data. Moreover, they provide tolerance against imperfect and noisy data. Fuzzy sets are another popular tool utilized for handling uncertainty and vagueness hidden in the data. In this paper a fuzzy neural clustering network (FNCN) based framework is proposed that makes use of the fuzzy membership concept of fuzzy c-means (FCM) clustering and the learning rate of a modified self-organizing map (MSOM) neural network model and tries to minimize the weighted sum of the squared error. FNCN is applied to cluster the users’ web access data extracted from the web logs of an educational institution’s proxy web server. The performance of FNCN is compared with FCM and MSOM based clustering methods using various validity indexes. Our results show that FNCN produces better quality of clusters than FCM and MSOM.
• Dense traffic flow patterns mining in bi-directional road networks using density based trajectory clustering
• Authors: Vaishali Mirge; Kesari Verma; Shubhrata Gupta
Abstract: Due to the rapid growth of wireless communications and positioning technologies, trajectory data have become increasingly popular, posing great challenges to researchers in the data mining and machine learning community. Trajectory data are obtained using GPS devices that capture the position of an object at specific time intervals. These enormous amounts of data make it necessary to explore efficient and effective techniques to extract useful information to solve real world problems. Traffic flow pattern mining is one of the challenging issues for many applications. In the literature a significant number of approaches are available to cluster trajectory data; however, clustering has not been explored for trajectory pattern mining in bi-directional road networks. This paper presents a novel technique for mining heavy traffic flow patterns in a bi-directional road network, i.e. identifying divisions of the roads where the traffic flow is very dense. The proposed technique works in two phases: phase I finds the clusters of trajectory points based on the density of trajectory points; phase II arranges the clusters in sequence based on spatiotemporal values for each route and direction. These sequences represent the traffic flow patterns. All the routes and sections exceeding a user specified minimum traffic threshold are marked as high-density traffic areas. The experiments are performed on a synthetic dataset. The proposed algorithm efficiently and accurately finds the dense traffic in bi-directional roads. The proposed clustering method is compared with the standard k-means clustering algorithm for the performance evaluation.
• Authors: Maurizio Vichi
Abstract: Disjoint factor analysis (DFA) is a new latent factor model that we propose here to identify factors that relate to disjoint subsets of variables, thus simplifying the loading matrix structure. Similarly to exploratory factor analysis (EFA), DFA does not hypothesize prior information on the number of factors and on the relevant relations between variables and factors. In DFA the population variance–covariance structure is hypothesized to be block diagonal after the proper permutation of variables and is estimated by maximum likelihood, using a coordinate descent type algorithm. Inference on the parameters, on the number of factors, and to confirm the hypothesized simple structure is provided. Properties such as scale equivariance, uniqueness, and optimal simplification of loadings are satisfied by DFA. Relevant cross-loadings are also estimated in case they are detected from the best DFA solution. DFA also has the option to constrain a variable to load on a pre-specified factor so that the researcher can assume, a priori, some relations between variables and loadings. A simulation study shows the performance of DFA, and an application to optimally identify the dimensions of well-being is used to illustrate characteristics of the new methodology. A final discussion concludes the paper.
• General location model with factor analyzer covariance matrix structure and its applications
• Authors: Leila Amiri; Mojtaba Khazaei; Mojtaba Ganjali
Abstract: The general location model (GLOM) is a well-known model for analyzing mixed data. In GLOM one decomposes the joint distribution of variables into the conditional distribution of continuous variables given categorical outcomes and the marginal distribution of categorical variables. The first version of GLOM assumes that the covariance matrices of continuous multivariate distributions across cells, which are obtained by different combinations of categorical variables, are equal. In this paper, the GLOMs are considered in both cases of equality and inequality of these covariance matrices. Three covariance structures are used across cells: the same factor analyzer, factor analyzer with unequal specific variance matrices (in the general and parsimonious forms) and factor analyzers with common factor loadings. These structures are used both for modeling the covariance structure and for reducing the number of parameters. The maximum likelihood estimates of parameters are computed via the EM algorithm. As an application for these models, we investigate the classification of continuous variables within cells. Based on these models, the classification is done for usual as well as for high dimensional data sets. Finally, to show the applicability of the proposed models for classification, results from analyzing three real data sets are presented.
• Multi-objective retinal vessel localization using flower pollination search algorithm with pattern search
• Authors: E. Emary; Hossam M. Zawbaa; Aboul Ella Hassanien; B. Parv
Abstract: This paper presents a multi-objective retinal blood vessel localization approach based on the flower pollination search algorithm (FPSA) and the pattern search (PS) algorithm. FPSA is a new evolutionary algorithm based on the flower pollination process of flowering plants. The proposed multi-objective fitness function uses FPSA to search for the optimal clustering of the given retinal image into compact clusters under some constraints. PS is then applied as a local search method to further enhance the segmentation results using another objective function based on shape features. The proposed approach for retinal blood vessel localization is applied to a public database, namely the DRIVE data set. Results demonstrate that the performance of the proposed approach is comparable with state of the art techniques in terms of accuracy, sensitivity, and specificity, with many extendable features.
• A new approach for determining the prior probabilities in the classification problem by Bayesian method
• Authors: Thao Nguyen-Trang; Tai Vo-Van
Abstract: In this article, we suggest a new algorithm to identify the prior probabilities for the classification problem by the Bayesian method. The prior probabilities are determined by combining the information of the populations in the training set and the new observations through the fuzzy clustering method (FCM), instead of using the uniform distribution, the ratio of samples, or the Laplace method, as existing approaches do. We next combine the determined prior probabilities and the estimated likelihood functions to classify the new object. In practice, calculations are performed by Matlab procedures. The proposed algorithm is tested on three numerical examples, including benchmark and real data sets. The results show that the new approach is reasonable and more efficient than existing ones.
• Model-based regression clustering for high-dimensional data: application to functional data
• Authors: Emilie Devijver
Abstract: Finite mixture regression models are useful for modeling the relationship between response and predictors arising from different subpopulations. In this article, we study high-dimensional predictors and high-dimensional response and propose two procedures to cluster observations according to the link between predictors and the response. To reduce the dimension, we propose to use the Lasso estimator, which takes the sparsity into account, and a maximum likelihood estimator penalized by the rank, which takes the matrix structure into account. To choose the number of components and the sparsity level, we construct a collection of models by varying those two parameters, and we select a model from this collection with a non-asymptotic criterion. We extend these procedures to functional data, where predictors and responses are functions. For this purpose, we use a wavelet-based approach. For each situation, we provide algorithms and apply and evaluate our methods on both simulated and real datasets, to understand how they work in practice.
• Mixture models for ordinal responses to account for uncertainty of choice
• Authors: Gerhard Tutz; Micha Schneider; Maria Iannario; Domenico Piccolo
Abstract: In CUB models the uncertainty of choice is explicitly modelled as a Combination of discrete Uniform and shifted Binomial random variables. The basic concept, to model the response as a mixture of a deliberate choice of a response category and an uncertainty component that is represented by a uniform distribution on the response categories, is extended to a much wider class of models. The deliberate choice can in particular be determined by classical ordinal response models such as the cumulative and adjacent categories models. Then one obtains the traditional and flexible models as special cases when the uncertainty component is irrelevant. It is shown that the effect of explanatory variables is underestimated if the uncertainty component is neglected in a cumulative type mixture model. Visualization tools for the effects of variables are proposed and the modelling strategies are evaluated by use of real data sets. It is demonstrated that the extended class of models frequently yields better fit than classical ordinal response models without an uncertainty component.
• Principal component analysis for histogram-valued data
• Authors: J. Le-Rademacher; L. Billard
Abstract: This paper introduces a principal component methodology for analysing histogram-valued data under the symbolic data domain. Currently, no comparable method exists for this type of data. The proposed method uses a symbolic covariance matrix to determine the principal component space. The resulting observations on principal component space are presented as polytopes for visualization. Numerical representation of the resulting polytopes via histogram-valued output is also presented. The necessary algorithms are included. The technique is illustrated on a weather data set.
• T3C: improving a decision tree classification algorithm’s interval splits on continuous attributes
• Authors: Panagiotis Tzirakis; Christos Tjortjis
Abstract: This paper proposes, describes and evaluates T3C, a classification algorithm that builds decision trees of depth at most three, and results in high accuracy whilst keeping the size of the tree reasonably small. T3C is an improvement over algorithm T3 in the way it performs splits on continuous attributes. When run against publicly available data sets, T3C achieved lower generalisation error than T3 and the popular C4.5, and competitive results compared to Random Forest and Rotation Forest.
• ADCLUS and INDCLUS: analysis, experimentation, and meta-heuristic algorithm extensions
• Authors: Stephen L. France; Wen Chen; Yumin Deng
Abstract: The ADCLUS and INDCLUS models, along with associated fitting techniques, can be used to extract an overlapping clustering structure from similarity data. In this paper, we examine the scalability of these models. We test the SINDCLUS algorithm and an adapted version of the SYMPRES algorithm on medium size datasets and try to infer their scalability and the degree of the local optima problem as the problem size increases. We describe several meta-heuristic approaches to minimizing the INDCLUS and ADCLUS loss functions.
• Backtransformation: a new representation of data processing chains with a scalar decision function
• Authors: Mario Michael Krell; Sirko Straube
Abstract: Data processing often transforms a complex signal using a set of different preprocessing algorithms to a single value as the outcome of a final decision function. Still, it is challenging to understand and visualize the interplay between the algorithms performing this transformation. Especially when dimensionality reduction is used, the original data structure (e.g., spatio-temporal information) is hidden from subsequent algorithms. To tackle this problem, we introduce the backtransformation concept, which suggests looking at the combination of algorithms as one transformation that maps the original input signal to a single value. To this end, it takes the derivative of the final decision function and transforms it back through the previous processing steps via backward iteration and the chain rule. The resulting derivative of the composed decision function at the sample of interest represents the complete decision process. Using it for visualizations might improve the understanding of the process. Often, it is possible to construct a feasible processing chain with affine mappings, which greatly simplifies the calculation of the backtransformation and the interpretation of the result. In this case, the affine backtransformation provides the complete parameterization of the processing chain. This article introduces the theory, provides implementation guidelines, and presents three application examples.
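For the affine case mentioned in the abstract above, the idea of collapsing a whole chain into one mapping can be sketched as follows. This is a minimal illustration, not the authors' implementation: each stage is assumed to be an affine map x -> A x + b stored as a hypothetical `(A, b)` pair of plain Python lists, and the function names are made up for this example.

```python
def matmul(A, B):
    # Product of two matrices stored as lists of rows.
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def matvec(A, v):
    # Product of a matrix (list of rows) with a vector.
    return [sum(a * x for a, x in zip(row, v)) for row in A]


def affine_backtransformation(stages):
    """Collapse affine stages x -> A x + b, applied in order, into one (A, b).

    ASSUMPTION: every stage is affine; nonlinear stages would need the
    derivative at a specific sample instead.
    """
    A_total, b_total = stages[0]
    for A, b in stages[1:]:
        # Chain rule for affine maps: A2 (A1 x + b1) + b2 = (A2 A1) x + (A2 b1 + b2)
        A_total = matmul(A, A_total)
        b_total = [t + s for t, s in zip(matvec(A, b_total), b)]
    return A_total, b_total


# Hypothetical chain: a feature scaling followed by a linear decision function.
scaling = ([[2, 0], [0, 3]], [1, -1])
decision = ([[1, 1]], [0.5])
weights, offset = affine_backtransformation([scaling, decision])
```

When the last stage outputs a scalar, the single row of `weights` is the derivative of the composed decision function with respect to the original input, so it can be plotted in the shape of the input signal.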
• A divisive clustering method for functional data with special consideration of outliers
• Authors: Ana Justel; Marcela Svarc
Abstract: This paper presents DivClusFD, a new divisive hierarchical method for the non-supervised classification of functional data. Data of this type present the peculiarity that the differences among clusters may be caused by changes in level as well as in shape. Different clusters can be separated in different subregions and there may be no subregion in which all clusters are separated. In each step of division, the DivClusFD method explores the functions and their derivatives at several fixed points, seeking the subregion in which the highest number of clusters can be separated. The number of clusters is estimated via the gap statistic. The functions are assigned to the new clusters by combining the k-means algorithm with the use of functional boxplots to identify functions that have been incorrectly classified because of their atypical local behavior. The DivClusFD method provides the number of clusters, the classification of the observed functions into the clusters and guidelines that may be used for interpreting the clusters. A simulation study using synthetic data and tests of the performance of the DivClusFD method on real data sets indicate that this method is able to classify functions accurately.
• Statistical inference in constrained latent class models for multinomial data based on $$\phi$$-divergence measures
• Authors: A. Felipe; N. Martín; P. Miranda; L. Pardo
Abstract: In this paper we explore the possibilities of applying $$\phi$$-divergence measures in inferential problems in the field of latent class models (LCMs) for multinomial data. We first treat the problem of estimating the model parameters. As explained below, the minimum $$\phi$$-divergence estimators (M$$\phi$$Es) considered in this paper are a natural extension of the maximum likelihood estimator (MLE), the usual estimator for this problem; we study the asymptotic properties of M$$\phi$$Es, showing that they share the same asymptotic distribution as the MLE. To compare the efficiency of the M$$\phi$$Es when the sample size is not big enough to apply the asymptotic results, we have carried out an extensive simulation study; from this study, we conclude that there are estimators in this family that are competitive with the MLE. Next, we deal with the problem of testing whether an LCM for multinomial data fits a data set; again, $$\phi$$-divergence measures can be used to generate a family of test statistics generalizing both the classical likelihood ratio test and the chi-squared test statistics. Finally, we treat the problem of choosing the best model out of a sequence of nested LCMs; as before, $$\phi$$-divergence measures can handle the problem and we derive a family of $$\phi$$-divergence test statistics based on them; we study the asymptotic behavior of these test statistics, showing that it is the same as that of the classical test statistics. A simulation study for small and moderate sample sizes shows that there are some test statistics in the family that can compete with the classical likelihood ratio and the chi-squared test statistics.
• Minimum distance method for directional data and outlier detection
• Authors: Mercedes Fernandez Sau; Daniela Rodriguez
Abstract: In this paper, we propose estimators based on the minimum distance for the unknown parameters of a parametric density on the unit sphere. We show that these estimators are consistent and asymptotically normally distributed. Also, we apply our proposal to develop a method that allows us to detect potential atypical values. The behavior under small samples of the proposed estimators is studied using Monte Carlo simulations. Two applications of our procedure are illustrated with real data sets.
• Editorial for issue 2/2017
