
Publisher: Springer-Verlag (Total: 2352 journals)

Advances in Data Analysis and Classification
Journal Prestige (SJR): 1.09
Citation Impact (CiteScore): 1
Number of Followers: 56
Hybrid journal (it can contain Open Access articles)
ISSN (Print): 1862-5355; ISSN (Online): 1862-5347
• Ensemble of a subset of kNN classifiers
• Authors: Asma Gul; Aris Perperoglou; Zardad Khan; Osama Mahmoud; Miftahuddin Miftahuddin; Werner Adler; Berthold Lausen
Pages: 827 - 840
Abstract: Combining multiple classifiers, known as ensemble methods, can give substantial improvements in the prediction performance of learning algorithms, especially in the presence of non-informative features in the data sets. We propose an ensemble of a subset of kNN classifiers, ESkNN, for classification tasks, built in two steps. First, we choose classifiers based on their individual out-of-sample accuracy. The selected classifiers are then combined sequentially, starting from the best model, and assessed for collective performance on a validation data set. We use benchmark data sets, with their original and some added non-informative features, for the evaluation of our method. The results are compared with the usual kNN, bagged kNN, random kNN, the multiple feature subset method, random forests and support vector machines. Our experimental comparisons on benchmark classification problems and simulated data sets reveal that the proposed ensemble gives better classification performance than the usual kNN and its ensembles, and performs comparably to random forests and support vector machines.
PubDate: 2018-12-01
DOI: 10.1007/s11634-015-0227-5
Issue No: Vol. 12, No. 4 (2018)
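The two-step construction described in the abstract can be sketched in a few lines of plain Python. This is only an illustrative reading of the procedure, not the authors' implementation: the model count, subset size, and tie-breaking below are arbitrary choices, and the helper names (`knn_predict`, `esknn`) are invented for the sketch.

```python
import random
from collections import Counter

def knn_predict(train_X, train_y, x, k):
    """Majority vote among the k nearest training points (squared Euclidean)."""
    nearest = sorted(range(len(train_X)),
                     key=lambda i: sum((a - b) ** 2
                                       for a, b in zip(train_X[i], x)))[:k]
    return Counter(train_y[i] for i in nearest).most_common(1)[0][0]

def esknn(train_X, train_y, val_X, val_y, n_models=10, k=1, seed=0):
    """Step 1: rank feature-subset kNN models by validation accuracy.
    Step 2: add them best-first while ensemble accuracy does not drop."""
    rng = random.Random(seed)
    d = len(train_X[0])

    def ensemble_acc(feats_list):
        # majority vote over the member models, scored on the validation set
        hits = 0
        for x, y in zip(val_X, val_y):
            votes = Counter(
                knn_predict([[r[j] for j in f] for r in train_X], train_y,
                            [x[j] for j in f], k)
                for f in feats_list)
            hits += votes.most_common(1)[0][0] == y
        return hits / len(val_y)

    # step 1: score each random-feature-subset model individually
    subsets = [sorted(rng.sample(range(d), max(1, d // 2)))
               for _ in range(n_models)]
    ranked = sorted(subsets, key=lambda f: ensemble_acc([f]), reverse=True)

    # step 2: grow the ensemble best-first on the validation set
    chosen, best = [], -1.0
    for feats in ranked:
        acc = ensemble_acc(chosen + [feats])
        if acc >= best:
            chosen, best = chosen + [feats], acc
    return chosen, best
```

A usage pattern would be to hold out a validation split from the training data, call `esknn`, and then apply the selected feature subsets to new points with the same majority vote.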

• Understanding non-linear modeling of measurement invariance in
heterogeneous populations
• Authors: Deana Desa
Pages: 841 - 865
Abstract: This study examined how a non-linear modeling of ordered categorical variables within multiple-group confirmatory factor analysis supports measurement invariance. A four-item classroom disciplinary climate scale used in a cross-cultural framework was empirically investigated. In the first part of the analysis, a separate categorical confirmatory factor analysis was applied in each country to account for the complex structure of the relationships between the observed measures. Categorical multiple-group confirmatory factor analysis (MGCFA) was then used to conduct a cross-country examination of full measurement invariance, namely the configural, metric, and scalar levels of invariance in the classroom discipline climate measures. The categorical MGCFA modeling supported configural and metric invariance as well as scalar invariance for the latent factor structure of classroom disciplinary climate. This finding implies that meaningful cross-country comparisons can be made on the scale means, on the associations of the classroom disciplinary climate scale with other scales, and on the item-factor latent structure. Application of the categorical modeling appeared to correctly specify the factor structure of the scale, thereby supporting the appropriateness of reporting comparisons such as rankings of many groups and league tables of different heterogeneous groups. Limitations of the modeling in this study and suggestions for future measurement invariance testing in studies with large numbers of groups are discussed.
PubDate: 2018-12-01
DOI: 10.1007/s11634-016-0240-3
Issue No: Vol. 12, No. 4 (2018)

• A comparative study on large scale kernelized support vector machines
• Authors: Daniel Horn; Aydın Demircioğlu; Bernd Bischl; Tobias Glasmachers; Claus Weihs
Pages: 867 - 883
Abstract: Kernelized support vector machines (SVMs) are among the most widely used classification methods. However, in contrast to linear SVMs, the computation time required to train such a machine becomes a bottleneck when facing large data sets. To mitigate this shortcoming of kernel SVMs, many approximate training algorithms have been developed. While most of these methods claim to be much faster than the state-of-the-art solver LIBSVM, a thorough comparative study is missing. We aim to fill this gap. We choose several well-known approximate SVM solvers and compare their performance on a number of large benchmark data sets. Our focus is to analyze the trade-off between prediction error and runtime for different learning and accuracy parameter settings. This includes simple subsampling of the data, the poor man's approach to handling large-scale problems. We employ model-based multi-objective optimization, which allows us to tune the parameters of the learning machine and the solver over the full range of accuracy/runtime trade-offs. We analyze (differences between) solvers by studying and comparing the Pareto fronts formed by the two objectives, classification error and training time. Unsurprisingly, given more runtime most solvers are able to find more accurate solutions, i.e., achieve a higher prediction accuracy. It turns out that LIBSVM with subsampling of the data is a strong baseline. Some solvers systematically outperform others, which allows us to give concrete recommendations of when to use which solver.
PubDate: 2018-12-01
DOI: 10.1007/s11634-016-0265-7
Issue No: Vol. 12, No. 4 (2018)
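For two objectives, the Pareto-front comparison described above reduces to a simple skyline computation over (classification error, training time) pairs. A minimal sketch (the function name and tuple layout are assumptions of this example):

```python
def pareto_front(points):
    """Return the non-dominated (error, runtime) pairs, sorted by error.
    A point is dominated if another point is no worse in both objectives
    and strictly better in at least one."""
    front = []
    for p in sorted(points):                  # ascending error, then runtime
        if not front or p[1] < front[-1][1]:  # strictly better runtime so far
            front.append(p)
    return front
```

Sorting by error first means a point survives exactly when its runtime beats everything already kept, which yields the front in a single pass.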

• A computationally fast variable importance test for random forests for
high-dimensional data
• Authors: Silke Janitza; Ender Celik; Anne-Laure Boulesteix
Pages: 885 - 915
Abstract: Random forests are a commonly used tool for classification and for ranking candidate predictors based on so-called variable importance measures. These measures attribute scores to the variables, reflecting their importance. A drawback of variable importance measures is that there is no natural cutoff that can be used to discriminate between important and non-important variables. Several approaches, for example approaches based on hypothesis testing, have been developed to address this problem. The existing testing approaches require the repeated computation of random forests. While those approaches might be computationally tractable in low-dimensional settings, in high-dimensional settings, which typically include thousands of candidate predictors, the computing time is enormous. In this article a computationally fast heuristic variable importance test is proposed that is appropriate for high-dimensional data where many variables do not carry any information. The testing approach is based on a modified version of the permutation variable importance, which is inspired by cross-validation procedures. The new approach is tested and compared to the approach of Altmann and colleagues using simulation studies, which are based on real data from high-dimensional binary classification settings. In these studies, the new approach controls the type I error and has at least comparable power at a substantially smaller computation time. Thus, it might be used as a computationally fast alternative to existing procedures for high-dimensional data settings where many variables do not carry any information. The new approach is implemented in the R package vita.
PubDate: 2018-12-01
DOI: 10.1007/s11634-016-0276-4
Issue No: Vol. 12, No. 4 (2018)
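The permutation variable importance underlying the proposed test can be illustrated with a generic scorer: shuffle one column at a time and record the drop in score. This shows only the basic permutation idea, not the heuristic test of the paper or the vita package; `score` stands in for any fitted model's accuracy function, and all names here are hypothetical.

```python
import random

def permutation_importance(score, X, y, n_repeats=5, seed=0):
    """Average drop in score when each column is shuffled.
    A larger drop means the column carries more information.
    score(X, y) returns an accuracy-like number for a fixed model."""
    rng = random.Random(seed)
    base = score(X, y)
    importances = []
    for j in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            col = [row[j] for row in X]
            rng.shuffle(col)
            Xp = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, col)]
            drops.append(base - score(Xp, y))
        importances.append(sum(drops) / n_repeats)
    return importances
```

Uninformative columns produce drops scattered around zero, which is exactly the property the paper's test exploits to build a fast null distribution.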

• Rank-based classifiers for extremely high-dimensional gene expression data
• Authors: Ludwig Lausser; Florian Schmid; Lyn-Rouven Schirra; Adalbert F. X. Wilhelm; Hans A. Kestler
Pages: 917 - 936
Abstract: Predicting phenotypes on the basis of gene expression profiles is a classification task that is becoming increasingly important in the field of precision medicine. Although these expression signals are real-valued, it is questionable whether they can be analyzed on an interval scale. As with many biological signals, their influence on, e.g., protein levels is usually non-linear and can thus be misinterpreted. In this article we study gene expression profiles with up to 54,000 dimensions. We analyze these measurements on an ordinal scale by replacing the real-valued profiles with their ranks. This type of rank transformation can be used to construct invariant classifiers that are not affected by noise induced by data transformations that can occur in the measurement setup. Our 10 × 10-fold cross-validation experiments on 86 different data sets and 19 different classification models indicate that classifiers largely benefit from this transformation. Random forests and support vector machines, especially, achieve improved classification results on a significant majority of data sets.
PubDate: 2018-12-01
DOI: 10.1007/s11634-016-0277-3
Issue No: Vol. 12, No. 4 (2018)
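The rank transformation described above is straightforward to sketch: each value in a profile is replaced by its rank within that profile (average rank for ties), which makes the representation invariant under any strictly increasing transformation of the raw signal. A plain-Python illustration (the function name is an assumption of this sketch):

```python
def rank_transform(profile):
    """Replace each expression value by its within-profile rank.
    Ties receive the average of the 1-based positions they occupy."""
    order = sorted(range(len(profile)), key=profile.__getitem__)
    ranks = [0.0] * len(profile)
    i = 0
    while i < len(order):
        j = i
        # extend j over a run of tied values
        while j + 1 < len(order) and profile[order[j + 1]] == profile[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of 1-based positions i+1 .. j+1
        for pos in range(i, j + 1):
            ranks[order[pos]] = avg
        i = j + 1
    return ranks
```

Because only the ordering of values matters, rescaling or log-transforming a profile leaves its rank representation unchanged, which is the invariance property the classifiers exploit.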

• Ensemble feature selection for high dimensional data: a new method and a
comparative study
• Authors: Afef Ben Brahim; Mohamed Limam
Pages: 937 - 952
Abstract: The curse of dimensionality refers to the fact that high-dimensional data are often difficult to work with. A large number of features can increase the noise in the data and thus the error of a learning algorithm. Feature selection is a solution for such problems, where there is a need to reduce the data dimensionality. Different feature selection algorithms may yield feature subsets that can be considered local optima in the space of feature subsets. Ensemble feature selection combines independent feature subsets and may give a better approximation to the optimal subset of features. We propose an ensemble feature selection approach based on an assessment of the reliability of the feature selectors. It aims at providing a unique and stable feature selection without ignoring predictive accuracy. A classification algorithm is used as an evaluator to assign a confidence to the features selected by ensemble members, based on the associated classification performance. We compare our proposed approach to several existing techniques and to individual feature selection algorithms. Results show that our approach often improves classification performance and feature selection stability for high-dimensional data sets.
PubDate: 2018-12-01
DOI: 10.1007/s11634-017-0285-y
Issue No: Vol. 12, No. 4 (2018)
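One simple reading of confidence-weighted ensemble feature selection is a Borda-style aggregation in which each selector's ranking is weighted by the accuracy of its associated classifier. The paper's reliability assessment is more involved; the sketch below only conveys the weighting idea, with hypothetical names.

```python
from collections import defaultdict

def ensemble_feature_scores(rankings, confidences):
    """Aggregate feature rankings from several selectors.
    rankings: list of feature-index lists, best feature first.
    confidences: accuracy of the classifier built from each selector's subset.
    Returns all features ordered by their weighted Borda score."""
    scores = defaultdict(float)
    for ranking, conf in zip(rankings, confidences):
        for pos, feat in enumerate(ranking):
            # higher-ranked features and more reliable selectors weigh more
            scores[feat] += conf * (len(ranking) - pos)
    return sorted(scores, key=scores.get, reverse=True)
```

Changing the confidence weights can flip the aggregate ordering even when the individual rankings stay fixed, which is the mechanism by which the evaluator's classification performance steers the final selection.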

• An efficient random forests algorithm for high dimensional data
classification
• Authors: Qiang Wang; Thanh-Tung Nguyen; Joshua Z. Huang; Thuy Thi Nguyen
Pages: 953 - 972
Abstract: In this paper, we propose a new random forest (RF) algorithm to deal with high-dimensional data for classification, using a subspace feature sampling method and feature-value searching. The new subspace sampling method maintains the diversity and randomness of the forest and enables one to generate trees with a lower prediction error. A greedy technique is used to handle high-cardinality categorical features for efficient node splitting when building decision trees in the forest. This allows trees to handle very high cardinality while reducing the computational time of building the RF model. Extensive experiments have been conducted on high-dimensional real data sets, including standard machine learning data sets and image data sets. The results demonstrate that the proposed approach for learning RFs significantly reduces prediction errors and outperforms most existing RFs when dealing with high-dimensional data.
PubDate: 2018-12-01
DOI: 10.1007/s11634-018-0318-1
Issue No: Vol. 12, No. 4 (2018)

• Equi-Clustream: a framework for clustering time evolving mixed data
• Authors: Ravi Sankar Sangam; Hari Om
Pages: 973 - 995
Abstract: In a data stream environment, most conventional clustering algorithms are not sufficiently efficient, since large volumes of data arrive in a stream and these data points unfold with time. The problems of clustering time-evolving metric data and categorical time-evolving data have separately been well explored in recent years, but clustering mixed-type time-evolving data remains a challenging issue because of the awkward gap between the structures of metric and categorical attributes. In this paper, we devise a generalized framework, termed Equi-Clustream, to dynamically cluster mixed-type time-evolving data. It comprises three algorithms: a Hybrid Drifting Concept Detection Algorithm that detects the drifting concept between the current and previous sliding windows; a Hybrid Data Labeling Algorithm that assigns an appropriate cluster label to each data vector of the current non-drifting window based on the clustering result of the previous sliding window; and a visualization algorithm that analyses the relationship between the clusters at different timestamps and visualizes the evolving trends of the clusters. The efficacy of the proposed framework is shown by experiments on synthetic and real-world datasets.
PubDate: 2018-12-01
DOI: 10.1007/s11634-018-0316-3
Issue No: Vol. 12, No. 4 (2018)

• Editorial for issue 3/2018
• Pages: 449 - 454
PubDate: 2018-09-01
DOI: 10.1007/s11634-018-0340-3
Issue No: Vol. 12, No. 3 (2018)

• Convex clustering for binary data
• Authors: Hosik Choi; Seokho Lee
Abstract: We present a new clustering algorithm for multivariate binary data. The new algorithm is based on the convex relaxation of hierarchical clustering, which is achieved by considering the binomial likelihood as a natural distribution for binary data and by formulating convex clustering using a pairwise penalty on the prototypes of clusters. Under convex clustering, we show that the typical $$\ell _1$$ pairwise fused penalty results in ineffective cluster formation. To promote clustering performance and select the relevant clustering variables, we propose penalized maximum likelihood estimation with an $$\ell _2$$ fused penalty on the fusion parameters and an $$\ell _1$$ penalty on the loading matrix. We provide an efficient algorithm to solve the optimization, using a majorization-minimization algorithm and the alternating direction method of multipliers. Numerical studies confirm its good performance, and a real data analysis demonstrates the practical usefulness of the proposed method.
PubDate: 2018-11-14
DOI: 10.1007/s11634-018-0350-1

• Special issue on “Science of big data: theory, methods and
applications”
• Authors: Hans A. Kestler; Paul D. McNicholas; Adalbert F. X. Wilhelm
PubDate: 2018-11-01
DOI: 10.1007/s11634-018-0349-7

• Orthogonal nonnegative matrix tri-factorization based on Tweedie
distributions
• Authors: Hiroyasu Abe; Hiroshi Yadohisa
Abstract: Orthogonal nonnegative matrix tri-factorization (ONMTF) is a biclustering method for a given nonnegative data matrix and has been applied to document-term clustering, collaborative filtering, and so on. Previously proposed ONMTF methods assume that the error distribution is normal. However, the normal distribution is not always appropriate for nonnegative data. In this paper, we propose three new ONMTF methods, which employ the following error distributions, respectively: normal, Poisson, and compound Poisson. To develop the new methods, we adopt a k-means-based algorithm rather than the multiplicative update algorithm that was the main method for obtaining estimators in previous work. A simulation study and an application involving document-term matrices demonstrate that our method can outperform previous methods in terms of the goodness of clustering and the estimation of the factor matrix.
PubDate: 2018-10-25
DOI: 10.1007/s11634-018-0348-8

• Random effects clustering in multilevel modeling: choosing a proper
partition
• Authors: Claudio Conversano; Massimo Cannas; Francesco Mola; Emiliano Sironi
Abstract: A novel criterion for estimating a latent partition of the observed groups based on the output of a hierarchical model is presented. It is based on a loss function combining the Gini income inequality ratio and the predictability index of Goodman and Kruskal, in order to achieve maximum heterogeneity of random effects across groups and maximum homogeneity of predicted probabilities inside the estimated clusters. The index is compared with alternative approaches in a simulation study and applied in a case study concerning the role of hospital-level variables in the decision to perform a cesarean section.
PubDate: 2018-10-12
DOI: 10.1007/s11634-018-0347-9

• Supervised learning via smoothed Polya trees
• Authors: William Cipolli; Timothy Hanson
Abstract: We propose a generative classification model that extends Quadratic Discriminant Analysis (QDA) (Cox in J R Stat Soc Ser B (Methodol) 20:215–242, 1958) and Linear Discriminant Analysis (LDA) (Fisher in Ann Eugen 7:179–188, 1936; Rao in J R Stat Soc Ser B 10:159–203, 1948) to the Bayesian nonparametric setting, providing a competitor to MclustDA (Fraley and Raftery in Am Stat Assoc 97:611–631, 2002). This approach models the data distribution for each class using a multivariate Polya tree and realizes impressive results in simulations and real data analyses. The flexibility gained from further relaxing the distributional assumptions of QDA can greatly improve the ability to correctly classify new observations for models with severe deviations from parametric distributional assumptions, while still performing well when the assumptions hold. The proposed method is quite fast compared to other supervised classifiers and very simple to implement, as there are no kernel tricks or initialization steps, perhaps making it one of the more user-friendly approaches to supervised learning. This is a significant feature of the methodology, since suboptimal tuning can greatly hamper classification performance; SVMs fit with non-optimal kernels, for example, perform significantly worse.
PubDate: 2018-10-12
DOI: 10.1007/s11634-018-0344-z

• sARI: a soft agreement measure for class partitions incorporating
assignment probabilities
• Authors: Abby Flynt; Nema Dean; Rebecca Nugent
Abstract: Agreement indices are commonly used to summarize the performance of both classification and clustering methods. The easy interpretation and intuition of the Rand and adjusted Rand indices, and the desirable properties that result from them, have led to their popularity over other available indices. While more algorithmic clustering approaches like k-means and hierarchical clustering produce hard partition assignments (assigning observations to a single cluster), other techniques, like model-based clustering, include information about the certainty of allocation of objects through class membership probabilities (soft partitions). To assess performance using traditional indices, e.g., the adjusted Rand index (ARI), the soft partition is mapped to a hard set of assignments, which commonly overstates the certainty of correct assignments. This paper proposes an extension of the ARI, the soft adjusted Rand index (sARI), with similar intuition and interpretation that also incorporates information from one or two soft partitions. It can be used in conjunction with the ARI, comparing the similarities of hard-to-soft or soft-to-soft partitions with the similarities of the mapped hard partitions. Simulation study results support the intuition that, in general, mapping to hard partitions tends to increase the measure of similarity between partitions. In applications, the sARI more accurately reflects the cluster boundary overlap commonly seen in real data.
PubDate: 2018-10-09
DOI: 10.1007/s11634-018-0346-x
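One plausible construction in the spirit of the sARI is to replace the hard pair counts of the ARI with the expected contingency table implied by the assignment probabilities, then apply the usual adjusted-Rand formula with a continuous analogue of the binomial coefficient. The paper's exact definition may differ; this is a sketch under that assumption, with invented names.

```python
def soft_contingency(P, Q):
    """Expected contingency table from two soft partitions.
    P, Q: per-observation class-probability rows (each row sums to 1)."""
    K, L = len(P[0]), len(Q[0])
    return [[sum(p[k] * q[l] for p, q in zip(P, Q)) for l in range(L)]
            for k in range(K)]

def adjusted_rand(N):
    """Adjusted Rand index from a (possibly fractional) contingency table."""
    c2 = lambda x: x * (x - 1) / 2.0  # continuous analogue of C(x, 2)
    n = sum(sum(row) for row in N)
    a = [sum(row) for row in N]            # row sums
    b = [sum(col) for col in zip(*N)]      # column sums
    index = sum(c2(v) for row in N for v in row)
    expected = sum(c2(x) for x in a) * sum(c2(x) for x in b) / c2(n)
    maximum = (sum(c2(x) for x in a) + sum(c2(x) for x in b)) / 2.0
    return (index - expected) / (maximum - expected)
```

With one-hot (hard) rows, `soft_contingency` reduces to the ordinary contingency table and the formula to the ordinary ARI; softening the assignments spreads mass off the diagonal and lowers the measured agreement, matching the intuition in the abstract.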

• Generalised linear model trees with global additive effects
• Authors: Heidi Seibold; Torsten Hothorn; Achim Zeileis
Abstract: Model-based trees are used to find subgroups in data which differ with respect to model parameters. In some applications it is natural to keep some parameters fixed globally for all observations while asking if and how other parameters vary across subgroups. Existing implementations of model-based trees can only deal with the scenario where all parameters depend on the subgroups. We propose partially additive linear model trees (PALM trees) as an extension of (generalised) linear model trees (LM and GLM trees, respectively), in which the model parameters are specified a priori to be estimated either globally from all observations or locally from the observations within the subgroups determined by the tree. Simulations show that the method has high power for detecting subgroups in the presence of global effects and reliably recovers the true parameters. Furthermore, treatment–subgroup differences are detected in an empirical application of the method to data from a mathematics exam: the PALM tree is able to detect a small subgroup of students that had a disadvantage in an exam with two versions while adjusting for overall ability effects.
PubDate: 2018-10-05
DOI: 10.1007/s11634-018-0342-1

• A classification tree approach for the modeling of competing risks in
discrete time
• Authors: Moritz Berger; Thomas Welchowski; Steffen Schmitz-Valckenberg; Matthias Schmid
Abstract: Cause-specific hazard models are a popular tool for the analysis of competing risks data. The classical modeling approach in discrete time consists of fitting parametric multinomial logit models. A drawback of this method is that the focus is on main effects only, and that higher-order interactions are hard to handle. Moreover, the resulting models contain a large number of parameters, which may cause numerical problems when estimating coefficients. To overcome these problems, a tree-based model is proposed that extends the survival tree methodology developed previously for time-to-event models with a single type of event. The performance of the method, compared with several competitors, is investigated in simulations. The usefulness of the proposed approach is demonstrated by an analysis of age-related macular degeneration among elderly people who were monitored by annual study visits.
PubDate: 2018-09-28
DOI: 10.1007/s11634-018-0345-y

• Variable selection in discriminant analysis for mixed continuous-binary
variables and several groups
• Authors: Alban Mbina Mbina; Guy Martial Nkiet; Fulgence Eyi Obiang
Abstract: We propose a method for variable selection in discriminant analysis with mixed continuous and binary variables. This method is based on a criterion that permits reducing the variable selection problem to a problem of estimating a suitable permutation and dimensionality. Estimators for these parameters are then proposed, and the resulting method for selecting variables is shown to be consistent. A simulation study that examines several properties of the proposed approach and compares it with an existing method is given, and an example on a real data set is provided.
PubDate: 2018-09-21
DOI: 10.1007/s11634-018-0343-0

• Bayesian nonstationary Gaussian process models via treed process
convolutions
• Abstract: The Gaussian process is a common model in a wide variety of applications, such as environmental modeling, computer experiments, and geology. Two major challenges often arise: First, assuming that the process of interest is stationary over the entire domain often proves to be untenable. Second, the traditional Gaussian process model formulation is computationally inefficient for large datasets. In this paper, we propose a new Gaussian process model to tackle these problems based on the convolution of a smoothing kernel with a partitioned latent process. Nonstationarity can be modeled by allowing a separate latent process for each partition, which approximates a regional clustering structure. Partitioning follows a binary tree generating process similar to that of Classification and Regression Trees. A Bayesian approach is used to estimate the partitioning structure and model parameters simultaneously. Our motivating dataset consists of 11918 precipitation anomalies. Results show that our model has promising prediction performance and is computationally efficient for large datasets.
PubDate: 2018-09-15
DOI: 10.1007/s11634-018-0341-2

• Finite mixtures, projection pursuit and tensor rank: a triangulation
• Authors: Nicola Loperfido
Abstract: Finite mixtures of multivariate distributions play a fundamental role in model-based clustering. However, they pose several problems, especially in the presence of many irrelevant variables. Dimension reduction methods, such as projection pursuit, are commonly used to address these problems. In this paper, we use skewness-maximizing projections to recover the subspace which optimally separates the cluster means. Skewness might then be removed in order to search for other potentially interesting data structures or to perform skewness-sensitive statistical analyses, such as the Hotelling’s $$T^{2}$$ test. Our approach is algebraic in nature and deals with the symmetric tensor rank of the third multivariate cumulant. We also derive closed-form expressions for the symmetric tensor rank of the third cumulants of several multivariate mixture models, including mixtures of skew-normal distributions and mixtures of two symmetric components with proportional covariance matrices. Theoretical results in this paper shed some light on the connection between the estimated number of mixture components and their skewness.
PubDate: 2018-09-06
DOI: 10.1007/s11634-018-0336-z

JournalTOCs
School of Mathematical and Computer Sciences
Heriot-Watt University
Edinburgh, EH14 4AS, UK
Email: journaltocs@hw.ac.uk
Tel: +44 (0)131 4513762
Fax: +44 (0)131 4513327
