Abstract: In real-world application scenarios, the identification of groups poses a significant challenge due to possibly occurring outliers and existing noise variables. Therefore, there is a need for a clustering method which is capable of revealing the group structure in data containing both outliers and noise variables without any pre-knowledge. In this paper, we propose a k-means-based algorithm incorporating a weighting function which leads to an automatic weight assignment for each observation. In order to cope with noise variables, a lasso-type penalty is used in an objective function adjusted by observation weights. We finally introduce a framework for selecting both the number of clusters and variables based on a modified gap statistic. The conducted experiments on simulated and real-world data demonstrate the advantage of the method to identify groups, outliers, and informative variables simultaneously. PubDate: 2019-03-19

Abstract: Several machine learning techniques assume that the number of objects in considered classes is approximately similar. Nevertheless, in real-world applications, the class of interest to be studied is generally scarce. The data imbalance status may allow high global accuracy through most standard learning algorithms, but it poses a real challenge when considering the minority class accuracy. To deal with this issue, we introduce in this paper a novel adaptation of the decision tree algorithm to imbalanced data situations. A new asymmetric entropy measure is proposed. It adjusts the most uncertain class distribution to the a priori class distribution and involves it in the node splitting-process. Unlike most competitive split criteria, which include only the maximum uncertainty vector in their formula, the proposed entropy is customizable with an adjustable concavity to better comply with the system expectations. The experimental results across thirty-five differently class-imbalanced data-sets show significant improvements over various split criteria adapted for imbalanced situations. Furthermore, being combined with sampling strategies and based-ensemble methods, our entropy proves significant enhancements on the minority class prediction, along with a good handling of the data difficulties related to the class imbalance problem. PubDate: 2019-03-02

Authors:Gregor Zens Abstract: A method for implicit variable selection in mixture-of-experts frameworks is proposed. We introduce a prior structure where information is taken from a set of independent covariates. Robust class membership predictors are identified using a normal gamma prior. The resulting model setup is used in a finite mixture of Bernoulli distributions to find homogenous clusters of women in Mozambique based on their information sources on HIV. Fully Bayesian inference is carried out via the implementation of a Gibbs sampler. PubDate: 2019-02-21 DOI: 10.1007/s11634-019-00353-y

Authors:Christian Hennig; Willi Sauerbrei Abstract: It is well known that variable selection in multiple regression can be unstable and that the model uncertainty can be considerable. The model uncertainty can be quantified and explored by bootstrap resampling, see Sauerbrei et al. (Biom J 57:531–555, 2015). Here approaches are introduced that use the results of bootstrap replications of the variable selection process to obtain more detailed information about the data. Analyses will be based on dissimilarities between the results of the analyses of different bootstrap samples. Dissimilarities are computed between the vector of predictions, and between the sets of selected variables. The dissimilarities are used to map the models by multidimensional scaling, to cluster them, and to construct heatplots. Clusters can point to different interpretations of the data that could arise from different selections of variables supported by different bootstrap samples. A new measure of variable selection instability is also defined. The methodology can be applied to various regression models, estimators, and variable selection methods. It will be illustrated by three real data examples, using linear regression and a Cox proportional hazards model, and model selection by AIC and BIC. PubDate: 2019-02-15 DOI: 10.1007/s11634-018-00351-6

Authors:Gonzalo Perez-de-la-Cruz; Guillermina Eslava-Gomez Abstract: The purpose of this paper is to illustrate the potential use of discriminant analysis for discrete variables whose dependence structure is assumed to follow, or can be approximated by, a tree-structured graphical model. This is done by comparing its empirical performance, using estimated error rates for real and simulated data, with the well-known Naive Bayes classification rule and with linear logistic regression, both of which do not consider any interaction between variables, and with models that consider interactions like a decomposable and the saturated model. The results show that discriminant analysis based on tree-structured graphical models, a simple nonlinear method including only some of the pairwise interactions between variables, is competitive with, and sometimes superior to, other methods which assume no interactions, and has the advantage over more complex decomposable models of finding the graph structure in a fast way and exact form. PubDate: 2019-02-12 DOI: 10.1007/s11634-019-00352-z

Authors:Asma Gul; Aris Perperoglou; Zardad Khan; Osama Mahmoud; Miftahuddin Miftahuddin; Werner Adler; Berthold Lausen Pages: 827 - 840 Abstract: Combining multiple classifiers, known as ensemble methods, can give substantial improvement in prediction performance of learning algorithms especially in the presence of non-informative features in the data sets. We propose an ensemble of subset of kNN classifiers, ESkNN, for classification task in two steps. Firstly, we choose classifiers based upon their individual performance using the out-of-sample accuracy. The selected classifiers are then combined sequentially starting from the best model and assessed for collective performance on a validation data set. We use bench mark data sets with their original and some added non-informative features for the evaluation of our method. The results are compared with usual kNN, bagged kNN, random kNN, multiple feature subset method, random forest and support vector machines. Our experimental comparisons on benchmark classification problems and simulated data sets reveal that the proposed ensemble gives better classification performance than the usual kNN and its ensembles, and performs comparable to random forest and support vector machines. PubDate: 2018-12-01 DOI: 10.1007/s11634-015-0227-5 Issue No:Vol. 12, No. 4 (2018)

Authors:Deana Desa Pages: 841 - 865 Abstract: This study examined how a non-linear modeling of ordered categorical variables within multiple-group confirmatory factor analysis supported measurement invariance. A four-item classroom disciplinary climate scale used in cross-cultural framework was empirically investigated. In the first part of the analysis, a separated categorical confirmatory factor analysis was initially applied to account for the complex structure of the relationships between the observed measures in each country. The categorical multiple-group confirmatory factor analysis (MGCFA) was then used to conduct a cross-country examination of full measurement invariance namely the configural, metric, and scalar levels of invariance in the classroom discipline climate measures. The categorical MGCFA modeling supported configural and metric invariances as well as scalar invariance for the latent factor structure of classroom disciplinary climate. This finding implying meaningful cross-country comparisons on the scale means, on the associations of classroom disciplinary climate scale with other scales and on the item-factor latent structure. Application of the categorical modeling appeared to correctly specify the factor structure of the scale, thereby promising the appropriateness of reporting comparisons such as rankings of many groups, and illustrating league tables of different heterogeneous groups. Limitations of the modeling in this study and future suggestions for measurement invariance testing in studies with large numbers of groups are discussed. PubDate: 2018-12-01 DOI: 10.1007/s11634-016-0240-3 Issue No:Vol. 12, No. 4 (2018)

Authors:Daniel Horn; Aydın Demircioğlu; Bernd Bischl; Tobias Glasmachers; Claus Weihs Pages: 867 - 883 Abstract: Kernelized support vector machines (SVMs) belong to the most widely used classification methods. However, in contrast to linear SVMs, the computation time required to train such a machine becomes a bottleneck when facing large data sets. In order to mitigate this shortcoming of kernel SVMs, many approximate training algorithms were developed. While most of these methods claim to be much faster than the state-of-the-art solver LIBSVM, a thorough comparative study is missing. We aim to fill this gap. We choose several well-known approximate SVM solvers and compare their performance on a number of large benchmark data sets. Our focus is to analyze the trade-off between prediction error and runtime for different learning and accuracy parameter settings. This includes simple subsampling of the data, the poor-man’s approach to handling large scale problems. We employ model-based multi-objective optimization, which allows us to tune the parameters of learning machine and solver over the full range of accuracy/runtime trade-offs. We analyze (differences between) solvers by studying and comparing the Pareto fronts formed by the two objectives classification error and training time. Unsurprisingly, given more runtime most solvers are able to find more accurate solutions, i.e., achieve a higher prediction accuracy. It turns out that LIBSVM with subsampling of the data is a strong baseline. Some solvers systematically outperform others, which allows us to give concrete recommendations of when to use which solver. PubDate: 2018-12-01 DOI: 10.1007/s11634-016-0265-7 Issue No:Vol. 12, No. 4 (2018)

Authors:Silke Janitza; Ender Celik; Anne-Laure Boulesteix Pages: 885 - 915 Abstract: Random forests are a commonly used tool for classification and for ranking candidate predictors based on the so-called variable importance measures. These measures attribute scores to the variables reflecting their importance. A drawback of variable importance measures is that there is no natural cutoff that can be used to discriminate between important and non-important variables. Several approaches, for example approaches based on hypothesis testing, were developed for addressing this problem. The existing testing approaches require the repeated computation of random forests. While for low-dimensional settings those approaches might be computationally tractable, for high-dimensional settings typically including thousands of candidate predictors, computing time is enormous. In this article a computationally fast heuristic variable importance test is proposed that is appropriate for high-dimensional data where many variables do not carry any information. The testing approach is based on a modified version of the permutation variable importance, which is inspired by cross-validation procedures. The new approach is tested and compared to the approach of Altmann and colleagues using simulation studies, which are based on real data from high-dimensional binary classification settings. The new approach controls the type I error and has at least comparable power at a substantially smaller computation time in the studies. Thus, it might be used as a computationally fast alternative to existing procedures for high-dimensional data settings where many variables do not carry any information. The new approach is implemented in the R package vita. PubDate: 2018-12-01 DOI: 10.1007/s11634-016-0276-4 Issue No:Vol. 12, No. 4 (2018)

Authors:Ludwig Lausser; Florian Schmid; Lyn-Rouven Schirra; Adalbert F. X. Wilhelm; Hans A. Kestler Pages: 917 - 936 Abstract: Predicting phenotypes on the basis of gene expression profiles is a classification task that is becoming increasingly important in the field of precision medicine. Although these expression signals are real-valued, it is questionable if they can be analyzed on an interval scale. As with many biological signals their influence on e.g. protein levels is usually non-linear and thus can be misinterpreted. In this article we study gene expression profiles with up to 54,000 dimensions. We analyze these measurements on an ordinal scale by replacing the real-valued profiles by their ranks. This type of rank transformation can be used for the construction of invariant classifiers that are not affected by noise induced by data transformations which can occur in the measurement setup. Our 10 \(\times \) 10 fold cross-validation experiments on 86 different data sets and 19 different classification models indicate that classifiers largely benefit from this transformation. Especially random forests and support vector machines achieve improved classification results on a significant majority of datasets. PubDate: 2018-12-01 DOI: 10.1007/s11634-016-0277-3 Issue No:Vol. 12, No. 4 (2018)

Authors:Afef Ben Brahim; Mohamed Limam Pages: 937 - 952 Abstract: The curse of dimensionality is based on the fact that high dimensional data is often difficult to work with. A large number of features can increase the noise of the data and thus the error of a learning algorithm. Feature selection is a solution for such problems where there is a need to reduce the data dimensionality. Different feature selection algorithms may yield feature subsets that can be considered local optima in the space of feature subsets. Ensemble feature selection combines independent feature subsets and might give a better approximation to the optimal subset of features. We propose an ensemble feature selection approach based on feature selectors’ reliability assessment. It aims at providing a unique and stable feature selection without ignoring the predictive accuracy aspect. A classification algorithm is used as an evaluator to assign a confidence to features selected by ensemble members based on their associated classification performance. We compare our proposed approach to several existing techniques and to individual feature selection algorithms. Results show that our approach often improves classification performance and feature selection stability for high dimensional data sets. PubDate: 2018-12-01 DOI: 10.1007/s11634-017-0285-y Issue No:Vol. 12, No. 4 (2018)

Authors:Qiang Wang; Thanh-Tung Nguyen; Joshua Z. Huang; Thuy Thi Nguyen Pages: 953 - 972 Abstract: In this paper, we propose a new random forest (RF) algorithm to deal with high dimensional data for classification using subspace feature sampling method and feature value searching. The new subspace sampling method maintains the diversity and randomness of the forest and enables one to generate trees with a lower prediction error. A greedy technique is used to handle cardinal categorical features for efficient node splitting when building decision trees in the forest. This allows trees to handle very high cardinality meanwhile reducing computational time in building the RF model. Extensive experiments on high dimensional real data sets including standard machine learning data sets and image data sets have been conducted. The results demonstrated that the proposed approach for learning RFs significantly reduced prediction errors and outperformed most existing RFs when dealing with high-dimensional data. PubDate: 2018-12-01 DOI: 10.1007/s11634-018-0318-1 Issue No:Vol. 12, No. 4 (2018)

Authors:Ravi Sankar Sangam; Hari Om Pages: 973 - 995 Abstract: In data stream environment, most of the conventional clustering algorithms are not sufficiently efficient, since large volumes of data arrive in a stream and these data points unfold with time. The problem of clustering time-evolving metric data and categorical time-evolving data has separately been well explored in recent years, but the problem of clustering mixed type time-evolving data remains a challenging issue due to an awkward gap between the structure of metric and categorical attributes. In this paper, we devise a generalized framework, termed Equi-Clustream to dynamically cluster mixed type time-evolving data, which comprises three algorithms: a Hybrid Drifting Concept Detection Algorithm that detects the drifting concept between the current sliding window and previous sliding window, a Hybrid Data Labeling Algorithm that assigns an appropriate cluster label to each data vector of the current non-drifting window based on the clustering result of the previous sliding window, and a visualization algorithm that analyses the relationship between the clusters at different timestamps and also visualizes the evolving trends of the clusters. The efficacy of the proposed framework is shown by experiments on synthetic and real world datasets. PubDate: 2018-12-01 DOI: 10.1007/s11634-018-0316-3 Issue No:Vol. 12, No. 4 (2018)

Authors:Hosik Choi; Seokho Lee Abstract: We present a new clustering algorithm for multivariate binary data. The new algorithm is based on the convex relaxation of hierarchical clustering, which is achieved by considering the binomial likelihood as a natural distribution for binary data and by formulating convex clustering using a pairwise penalty on prototypes of clusters. Under convex clustering, we show that the typical \(\ell _1\) pairwise fused penalty results in ineffective cluster formation. In an attempt to promote the clustering performance and select the relevant clustering variables, we propose the penalized maximum likelihood estimation with an \(\ell _2\) fused penalty on the fusion parameters and an \(\ell _1\) penalty on the loading matrix. We provide an efficient algorithm to solve the optimization by using majorization-minimization algorithm and alternative direction method of multipliers. Numerical studies confirmed its good performance and real data analysis demonstrates the practical usefulness of the proposed method. PubDate: 2018-11-14 DOI: 10.1007/s11634-018-0350-1

Authors:Hiroyasu Abe; Hiroshi Yadohisa Abstract: Orthogonal nonnegative matrix tri-factorization (ONMTF) is a biclustering method using a given nonnegative data matrix and has been applied to document-term clustering, collaborative filtering, and so on. In previously proposed ONMTF methods, it is assumed that the error distribution is normal. However, the assumption of normal distribution is not always appropriate for nonnegative data. In this paper, we propose three new ONMTF methods, which respectively employ the following error distributions: normal, Poisson, and compound Poisson. To develop the new methods, we adopt a k-means based algorithm but not a multiplicative updating algorithm, which was the main method used for obtaining estimators in previous methods. A simulation study and an application involving document-term matrices demonstrate that our method can outperform previous methods, in terms of the goodness of clustering and in the estimation of the factor matrix. PubDate: 2018-10-25 DOI: 10.1007/s11634-018-0348-8

Authors:Claudio Conversano; Massimo Cannas; Francesco Mola; Emiliano Sironi Abstract: A novel criterion for estimating a latent partition of the observed groups based on the output of a hierarchical model is presented. It is based on a loss function combining the Gini income inequality ratio and the predictability index of Goodman and Kruskal in order to achieve maximum heterogeneity of random effects across groups and maximum homogeneity of predicted probabilities inside estimated clusters. The index is compared with alternative approaches in a simulation study and applied in a case study concerning the role of hospital level variables in deciding for a cesarean section. PubDate: 2018-10-12 DOI: 10.1007/s11634-018-0347-9

Authors:William Cipolli; Timothy Hanson Abstract: We propose a generative classification model that extends Quadratic Discriminant Analysis (QDA) (Cox in J R Stat Soc Ser B (Methodol) 20:215–242, 1958) and Linear Discriminant Analysis (LDA) (Fisher in Ann Eugen 7:179–188, 1936; Rao in J R Stat Soc Ser B 10:159–203, 1948) to the Bayesian nonparametric setting, providing a competitor to MclustDA (Fraley and Raftery in Am Stat Assoc 97:611–631, 2002). This approach models the data distribution for each class using a multivariate Polya tree and realizes impressive results in simulations and real data analyses. The flexibility gained from further relaxing the distributional assumptions of QDA can greatly improve the ability to correctly classify new observations for models with severe deviations from parametric distributional assumptions, while still performing well when the assumptions hold. The proposed method is quite fast compared to other supervised classifiers and very simple to implement as there are no kernel tricks or initialization steps perhaps making it one of the more user-friendly approaches to supervised learning. This highlights a significant feature of the proposed methodology as suboptimal tuning can greatly hamper classification performance; e.g., SVMs fit with non-optimal kernels perform significantly worse. PubDate: 2018-10-12 DOI: 10.1007/s11634-018-0344-z

Authors:Abby Flynt; Nema Dean; Rebecca Nugent Abstract: Agreement indices are commonly used to summarize the performance of both classification and clustering methods. The easy interpretation/intuition and desirable properties that result from the Rand and adjusted Rand indices, has led to their popularity over other available indices. While more algorithmic clustering approaches like k-means and hierarchical clustering produce hard partition assignments (assigning observations to a single cluster), other techniques like model-based clustering include information about the certainty of allocation of objects through class membership probabilities (soft partitions). To assess performance using traditional indices, e.g., the adjusted Rand index (ARI), the soft partition is mapped to a hard set of assignments, which commonly overstates the certainty of correct assignments. This paper proposes an extension of the ARI, the soft adjusted Rand index (sARI), with similar intuition and interpretation but also incorporating information from one or two soft partitions. It can be used in conjunction with the ARI, comparing the similarities of hard to soft, or soft to soft partitions to the similarities of the mapped hard partitions. Simulation study results support the intuition that in general, mapping to hard partitions tends to increase the measure of similarity between partitions. In applications, the sARI more accurately reflects the cluster boundary overlap commonly seen in real data. PubDate: 2018-10-09 DOI: 10.1007/s11634-018-0346-x