
Publisher: Springer-Verlag   (Total: 2341 journals)

 Advances in Data Analysis and Classification   [SJR: 1.113]   [H-I: 14]   Hybrid journal (may contain Open Access articles)   ISSN (Print) 1862-5355   ISSN (Online) 1862-5347   Published by Springer-Verlag
• On visual distances for spectrum-type functional data
• Authors: A. Cholaquidis; A. Cuevas; R. Fraiman
Pages: 5 - 24
Abstract: A functional distance $${\mathbb H}$$, based on the Hausdorff metric between the function hypographs, is proposed for the space $${\mathcal E}$$ of non-negative real upper semicontinuous functions on a compact interval. The main goal of the paper is to show that the space $$({\mathcal E},{\mathbb H})$$ is particularly suitable in some statistical problems with functional data which involve functions with very wiggly graphs and narrow, sharp peaks. A typical example is given by spectrograms, obtained either by magnetic resonance or by mass spectrometry. On the theoretical side, we show that $$({\mathcal E},{\mathbb H})$$ is a complete, separable, locally compact space and that the $${\mathbb H}$$-convergence of a sequence of functions implies the convergence of the respective maximum values of these functions. The probabilistic and statistical implications of these results are discussed, in particular regarding the consistency of k-NN classifiers for supervised classification problems with functional data in $${\mathbb H}$$. On the practical side, we provide the results of a small simulation study and also check the performance of our method in two real data problems of supervised classification involving mass spectra.
PubDate: 2017-03-01
DOI: 10.1007/s11634-015-0217-7
Issue No: Vol. 11, No. 1 (2017)
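The entry above proposes classifying spectrum-type curves via a Hausdorff distance between function hypographs. As an illustration only (not the authors' implementation; the function names below are invented), the distance can be approximated on sampled curves by discretizing each hypograph into a point cloud:

```python
import numpy as np
from scipy.spatial.distance import directed_hausdorff

def hypograph_points(x, f_vals, n_levels=50):
    """Discretize the hypograph {(x, y) : 0 <= y <= f(x)} as a point cloud."""
    pts = []
    for xi, fi in zip(x, f_vals):
        for y in np.linspace(0.0, fi, n_levels):
            pts.append((xi, y))
    return np.array(pts)

def hausdorff_hypograph(x, f_vals, g_vals):
    """Symmetric Hausdorff distance between the hypographs of f and g."""
    U = hypograph_points(x, f_vals)
    V = hypograph_points(x, g_vals)
    return max(directed_hausdorff(U, V)[0], directed_hausdorff(V, U)[0])

x = np.linspace(0, 1, 101)
f = np.exp(-((x - 0.5) ** 2) / 0.001)        # narrow, sharp peak at 0.5
g = np.exp(-((x - 0.52) ** 2) / 0.001)       # same peak, slightly shifted
h = np.exp(-((x - 0.5) ** 2) / 0.001) * 0.5  # half-height peak
print(hausdorff_hypograph(x, f, g) < hausdorff_hypograph(x, f, h))
```

Note how a small horizontal shift of a narrow peak gives a much smaller hypograph distance than halving the peak height, whereas the supremum distance would be large in both cases; this is the behavior that motivates hypograph-based metrics for spectra.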

• NMF versus ICA for blind source separation
• Authors: Andri Mirzal
Pages: 25 - 48
Abstract: Blind source separation (BSS) is the problem of recovering source signals from signal mixtures with no, or very limited, information about the sources and the mixing process. In the literature, nonnegative matrix factorization (NMF) and independent component analysis (ICA) appear to be the mainstream techniques for solving BSS problems. Even though the use of NMF and ICA for BSS is well studied, there is still a lack of work comparing the performances of these techniques. Moreover, the nonuniqueness property of NMF is rarely mentioned, even though this property can make the reconstructed signals vary significantly, and thus makes it difficult to choose representative reconstructions from several possible outcomes. In this paper, we compare the performances of NMF and ICA as BSS methods using some standard NMF and ICA algorithms, and point out the difficulty in choosing representative reconstructions that originates from the nonuniqueness property of NMF.
PubDate: 2017-03-01
DOI: 10.1007/s11634-014-0192-4
Issue No: Vol. 11, No. 1 (2017)
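The comparison above can be reproduced in miniature with off-the-shelf implementations; the following sketch (illustrative only, not the paper's experimental setup) mixes two nonnegative sources and unmixes them with scikit-learn's NMF and FastICA, running NMF from two different random initializations to hint at the nonuniqueness issue:

```python
import numpy as np
from sklearn.decomposition import NMF, FastICA

rng = np.random.default_rng(0)
t = np.linspace(0, 8 * np.pi, 400)
# Two nonnegative sources and a random nonnegative mixing matrix.
S = np.column_stack([np.abs(np.sin(t)), np.abs(np.cos(3 * t))])
A = rng.uniform(0.5, 1.5, size=(2, 2))
X = S @ A.T                                # observed mixtures, shape (400, 2)

# NMF recovers nonnegative sources, but the factorization is not unique:
# different initializations can yield visibly different reconstructions.
W1 = NMF(n_components=2, init='random', random_state=1, max_iter=1000).fit_transform(X)
W2 = NMF(n_components=2, init='random', random_state=2, max_iter=1000).fit_transform(X)

# ICA instead recovers statistically independent components (up to sign/scale).
S_ica = FastICA(n_components=2, random_state=0).fit_transform(X)
print(W1.shape, W2.shape, S_ica.shape)
```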

• Dichotomic lattices and local discretization for Galois lattices
• Authors: Nathalie Girard; Karell Bertet; Muriel Visani
Pages: 49 - 77
Abstract: The present paper deals with supervised classification methods based on Galois lattices and decision trees. Such ordered structures require attribute discretization, and it is known that, for decision trees, local discretization improves classification performance compared with global discretization. While most of the literature on discretization for Galois lattices relies on global discretization, the present work introduces a new local discretization algorithm for Galois lattices which hinges on a property of some specific lattices that we introduce as dichotomic lattices. Their properties, co-atomicity and $$\vee$$-complementarity, are proved, along with their links with decision trees. Finally, some quantitative and qualitative evaluations of the local discretization are proposed.
PubDate: 2017-03-01
DOI: 10.1007/s11634-015-0225-7
Issue No: Vol. 11, No. 1 (2017)

• Minimum Class Variance SVM+ for data classification
• Authors: Wenxin Zhu; Ping Zhong
Pages: 79 - 96
Abstract: In this paper, a new Support Vector Machine Plus (SVM+) type model called Minimum Class Variance SVM+ (MCVSVM+) is presented. Like SVM+, the proposed model utilizes the group information in the training data. We show that MCVSVM+ combines the advantages of SVM+ and the Minimum Class Variance Support Vector Machine (MCVSVM). That is, MCVSVM+ not only considers class distribution characteristics in its optimization problem but also utilizes the additional information (i.e. group information) hidden in the data, in contrast to SVM+, which takes into consideration only the samples at the class boundaries. The experimental results demonstrate the validity and advantage of the new model compared with the standard SVM, SVM+ and MCVSVM.
PubDate: 2017-03-01
DOI: 10.1007/s11634-015-0212-z
Issue No: Vol. 11, No. 1 (2017)

• A uniform framework for the combination of penalties in generalized
structured models
• Authors: Margret-Ruth Oelker; Gerhard Tutz
Pages: 97 - 120
Abstract: Penalized estimation has become an established tool for regularization and model selection in regression models. A variety of penalties with specific features are available, and effective algorithms for specific penalties have been proposed, but little is available for fitting models with a combination of different penalties. When modeling the rent data of Munich, as in our application, the various types of predictors call for a combination of a Ridge, a group Lasso and a Lasso-type penalty within one model. We propose to approximate penalties that are (semi-)norms of scalar linear transformations of the coefficient vector in generalized structured models, so that penalties of various kinds can be combined in one model. The approach is general enough that the Lasso, the fused Lasso, the Ridge, the smoothly clipped absolute deviation penalty, the elastic net and many more penalties are embedded. The computation is based on conventional penalized iteratively re-weighted least squares algorithms and is hence easy to implement. New penalties can be incorporated quickly. The approach is extended to penalties with vector-based arguments. There are several possibilities for choosing the penalty parameter(s). A software implementation is available. Some illustrative examples show promising results.
PubDate: 2017-03-01
DOI: 10.1007/s11634-015-0205-y
Issue No: Vol. 11, No. 1 (2017)
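The penalized-IRLS idea sketched in the abstract above, approximating a non-differentiable penalty by a local quadratic term so that each step reduces to a weighted ridge solve, can be illustrated as follows. This is a minimal sketch with an invented function name, combining a Ridge and an approximated Lasso penalty; it is not the authors' software:

```python
import numpy as np

def combined_penalty_lsq(X, y, lam_ridge, lam_lasso, n_iter=100, eps=1e-6):
    """Least squares with a Ridge plus an approximated Lasso penalty.

    The non-differentiable |b_j| term is replaced by the local quadratic
    approximation b_j^2 / sqrt(b_j^2 + eps), so each iteration is a
    weighted ridge solve -- the same idea behind penalized IRLS schemes.
    """
    n, p = X.shape
    b = np.zeros(p)
    for _ in range(n_iter):
        # Weights for the quadratic approximation of the Lasso term.
        w = lam_lasso / np.sqrt(b ** 2 + eps)
        b = np.linalg.solve(X.T @ X + np.diag(lam_ridge + w), X.T @ y)
    return b

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
beta_true = np.array([2.0, 0.0, 0.0, -1.5, 0.0])
y = X @ beta_true + 0.1 * rng.normal(size=200)
b = combined_penalty_lsq(X, y, lam_ridge=0.1, lam_lasso=5.0)
print(np.round(b, 2))
```

The approximated Lasso term drives the truly-zero coefficients toward zero while the ridge term stabilizes the solve; swapping in other penalty weights would embed other (semi-)norm penalties in the same loop.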

• Advances in credit scoring: combining performance and interpretation in
kernel discriminant analysis
• Authors: Caterina Liberati; Furio Camillo; Gilbert Saporta
Pages: 121 - 138
Abstract: In the wake of the recent financial turmoil, a discussion has arisen in the banking sector about how to accomplish long-term success and how to follow an exhaustive and powerful strategy in credit scoring. Recently, significant theoretical advances in machine learning algorithms have pushed the application of kernel-based classifiers, producing very effective results. Unfortunately, such tools are unable to provide an explanation, or comprehensible justification, for the solutions they supply. In this paper, we propose a new strategy to model credit scoring data, which indirectly exploits the classification power of kernel machines in an operative field. A reconstruction of the kernel classifier is performed via linear regression, if all predictors are numerical, or via a general linear model, if some or all predictors are categorical. The loss of performance due to this approximation is balanced by better interpretability for the end user, who is able to order, understand, and rank the influence of each category of the variable set on the prediction. An Italian bank case study is illustrated and discussed; empirical results reveal a promising performance of the introduced strategy.
PubDate: 2017-03-01
DOI: 10.1007/s11634-015-0213-y
Issue No: Vol. 11, No. 1 (2017)

• A generalized maximum entropy estimator to simple linear measurement error
model with a composite indicator
• Authors: Maurizio Carpita; Enrico Ciavolino
Pages: 139 - 158
Abstract: We extend the simple linear measurement error model through the inclusion of a composite indicator by using the generalized maximum entropy estimator. A Monte Carlo simulation study is proposed for comparing the performance of the proposed estimator with that of its counterpart, the ordinary least squares estimator “adjusted for attenuation”. The two estimators are compared in terms of correlation with the true latent variable, standard error, and root mean squared error. Two illustrative case studies are reported in order to discuss the results obtained on real data sets and to relate them to the conclusions drawn via the simulation study.
PubDate: 2017-03-01
DOI: 10.1007/s11634-016-0237-y
Issue No: Vol. 11, No. 1 (2017)

• Evaluation of the evolution of relationships between topics over time
• Authors: Wolfgang Gaul; Dominique Vincent
Pages: 159 - 178
Abstract: Topics that attract public attention can originate from current events or developments, might be influenced by situations in the past, and often continue to be of interest in the future. When respective information is made available textually, one possibility of detecting such topics of public importance consists in scrutinizing, e.g., appropriate press articles using—given the continual growth of information—text processing techniques enriched by computer routines which examine present-day textual material, check historical publications, find newly emerging topics, and are able to track topic trends over time. Information clustering based on content-(dis)similarity of the underlying textual material and graph-theoretical considerations to deal with the network of relationships between content-similar topics are described and combined in a new approach. Explanatory examples of topic detection and tracking in online news articles illustrate the usefulness of the approach in different situations.
PubDate: 2017-03-01
DOI: 10.1007/s11634-016-0241-2
Issue No: Vol. 11, No. 1 (2017)

• Supervised box clustering
• Authors: Vincenzo Spinelli
Pages: 179 - 204
Abstract: In this work we address a technique for effectively clustering points into specific convex sets, called homogeneous boxes, having sides aligned with the coordinate axes (isothetic condition). The proposed clustering approach is based on homogeneity conditions rather than on a distance measure and, although originally developed in the context of the logical analysis of data, is now placed within the framework of supervised clustering. First, we introduce the basic concepts of box geometry; then, we consider a generalized clustering algorithm based on a class of graphs, called incompatibility graphs. For supervised classification problems, we consider classifiers based on box sets and compare their overall performance to the accuracy levels of competing methods on a wide range of real data sets. The results show that the proposed method performs comparably with other supervised learning methods in terms of accuracy.
PubDate: 2017-03-01
DOI: 10.1007/s11634-016-0233-2
Issue No: Vol. 11, No. 1 (2017)

• Multiple straight-line fitting using a Bayes factor
• Authors: Carlos Lara-Alvarez; Leonardo Romero; Cuauhtemoc Gomez
Pages: 205 - 218
Abstract: This paper introduces a Bayesian approach to the problem of fitting multiple straight lines to a set of 2D points. Whereas other approaches use many arbitrary parameters and threshold values, the proposed criterion uses only the parameters of the measurement errors. Models with multiple lines are useful in many applications; this paper analyzes the performance of the new approach on a classical problem in robotics: finding a map of lines from laser measurements. Tests show that the Bayesian approach obtains reliable models.
PubDate: 2017-03-01
DOI: 10.1007/s11634-016-0236-z
Issue No: Vol. 11, No. 1 (2017)

• Sparsest factor analysis for clustering variables: a matrix decomposition
approach
• Authors: Kohei Adachi; Nickolay T. Trendafilov
Abstract: We propose a new procedure for sparse factor analysis (FA) in which each variable loads on only one common factor. Thus, the loading matrix has a single nonzero element in each row and zeros elsewhere. Such a loading matrix is the sparsest possible for a given number of variables and common factors. For this reason, the proposed method is named sparsest FA (SSFA). It may also be called FA-based variable clustering, since the variables loading the same common factor can be classified into a cluster. In SSFA, all model parts of FA (common factors, their correlations, loadings, unique factors, and unique variances) are treated as fixed unknown parameter matrices, and their least squares function is minimized through a specific data matrix decomposition. A useful feature of the algorithm is that the matrix of common factor scores is re-parameterized using QR decomposition in order to efficiently estimate factor correlations. A simulation study shows that the proposed procedure can exactly identify the true sparsest models. Real data examples demonstrate the usefulness of the variable clustering performed by SSFA.
PubDate: 2017-04-13
DOI: 10.1007/s11634-017-0284-z

• Unsupervised classification of children’s bodies using currents
• Authors: Sonia Barahona; Ximo Gual-Arnau; Maria Victoria Ibáñez; Amelia Simó
Abstract: Object classification according to shape and size is of key importance in many scientific fields. This work focuses on the case where the size and shape of an object are characterized by a current. A current is a mathematical object that has proved relevant to the modeling of geometrical data, such as submanifolds, through the integration of vector fields along them. As a consequence of choosing a vector-valued reproducing kernel Hilbert space (RKHS) as a test space for integrating manifolds, shapes can be considered as embedded in this Hilbert space. A vector-valued RKHS is a Hilbert space of vector fields; therefore, it is possible to compute a mean of shapes or to calculate a distance between two manifolds. This embedding enables us to consider size-and-shape clustering algorithms. These algorithms are applied to a 3D database obtained from an anthropometric survey of the Spanish child population, with a potential application to online sales of children’s wear.
PubDate: 2017-03-11
DOI: 10.1007/s11634-017-0283-0

• Relating brand confusion to ad similarities and brand strengths through
image data analysis and classification
• Authors: Daniel Baier; Sarah Frost
PubDate: 2017-03-04
DOI: 10.1007/s11634-017-0282-1

• Cluster-based sparse topical coding for topic mining and document
clustering
• Authors: Parvin Ahmadi; Iman Gholampour; Mahmoud Tabandeh
Abstract: In this paper, we introduce a document clustering method based on Sparse Topical Coding, called Cluster-based Sparse Topical Coding. Topic modeling is capable of improving textual document clustering by describing documents via bag-of-words models and projecting them into a topic space. The latent semantic descriptions derived by the topic model can be utilized as features in a clustering process. In our proposed method, document clustering and topic modeling are integrated in a unified framework in order to achieve the highest performance. This framework includes Sparse Topical Coding, which is responsible for topic mining, and K-means, which discovers the latent clusters in the document collection. Experimental results on widely used datasets show that our proposed method significantly outperforms traditional and other topic-model-based clustering methods. Our method achieves from 4% to 39% improvement in clustering accuracy and from 2% to more than 44% improvement in normalized mutual information.
PubDate: 2017-02-28
DOI: 10.1007/s11634-017-0280-3
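The two-stage structure described above, a topic projection followed by K-means, can be mimicked with standard tools. The sketch below substitutes plain NMF for Sparse Topical Coding, so it only illustrates the shape of the pipeline, not the proposed method:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import NMF
from sklearn.cluster import KMeans

docs = [
    "the cat sat on the mat",
    "dogs and cats are friendly pets",
    "stocks fell as markets closed lower",
    "the market rallied on strong earnings",
]
# Bag-of-words counts -> low-rank nonnegative topic projection -> K-means
# in the resulting topic space.
X = CountVectorizer().fit_transform(docs)
topics = NMF(n_components=2, init='nndsvda', random_state=0).fit_transform(X)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(topics)
print(topics.shape, len(set(labels)))
```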

• Editorial for issue 1/2017
• PubDate: 2017-02-22
DOI: 10.1007/s11634-017-0281-2

• A data driven equivariant approach to constrained Gaussian mixture
modeling
• Authors: Roberto Rocci; Stefano Antonio Gattone; Roberto Di Mari
Abstract: Maximum likelihood estimation of Gaussian mixture models with different class-specific covariance matrices is known to be problematic. This is due to the unboundedness of the likelihood, together with the presence of spurious maximizers. Existing methods to bypass this obstacle are based on the fact that unboundedness is avoided if the eigenvalues of the covariance matrices are bounded away from zero. This can be done by imposing constraints on the covariance matrices, i.e. by incorporating a priori information on the covariance structure of the mixture components. The present work introduces a constrained approach, where the class conditional covariance matrices are shrunk towards a pre-specified target matrix $$\varvec{\varPsi }$$. Data-driven choices of the matrix $$\varvec{\varPsi }$$, when a priori information is not available, and the optimal amount of shrinkage are investigated. Constraints based on a data-driven $$\varvec{\varPsi }$$ are then shown to be equivariant with respect to linear affine transformations, provided that the method used to select the target matrix is also equivariant. The effectiveness of the proposal is evaluated through a simulation study and an empirical example.
PubDate: 2017-01-06
DOI: 10.1007/s11634-016-0279-1
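The core device in the abstract above, shrinking each class-conditional covariance toward a target matrix so that the eigenvalues stay bounded away from zero, can be sketched as follows. This is illustrative only: the pooled-covariance target and the fixed shrinkage weight are assumptions, not the paper's data-driven selection rule:

```python
import numpy as np

def shrink_covariances(covs, target, gamma):
    """Shrink each class-conditional covariance toward a common target matrix.

    covs: list of (p, p) sample covariance matrices, one per mixture component.
    target: the (p, p) target matrix (Psi in the paper's notation); here a
    pooled covariance is used purely for illustration.
    gamma in [0, 1] controls the amount of shrinkage.
    """
    return [(1 - gamma) * S + gamma * target for S in covs]

# A nearly singular component covariance: the situation that makes the
# mixture likelihood unbounded.
S1 = np.array([[1.0, 0.0], [0.0, 1e-8]])
S2 = np.eye(2)
psi = 0.5 * (S1 + S2)                      # pooled (data-driven) target
shrunk = shrink_covariances([S1, S2], psi, gamma=0.5)
# After shrinkage, every eigenvalue is bounded away from zero.
print(min(np.linalg.eigvalsh(S)[0] for S in shrunk) > 1e-3)
```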

• Second special issue on “Advances in latent variables: methods,
models and applications”
• Authors: Angela Montanari; Maurizio Vichi
Pages: 417 - 421
PubDate: 2016-12-01
DOI: 10.1007/s11634-016-0275-5
Issue No: Vol. 10, No. 4 (2016)

• The determination of uncertainty levels in robust clustering of subjects
with longitudinal observations using the Dirichlet process mixture
• Authors: Reyhaneh Rikhtehgaran; Iraj Kazemi
Pages: 541 - 562
Abstract: In this paper we introduce a new method for the cluster analysis of longitudinal data that focuses on the determination of uncertainty levels for cluster memberships. The method uses the Dirichlet-t distribution, which exploits the robustness of the Student-t distribution within a Bayesian semi-parametric framework and, together with robust clustering of subjects, evaluates the uncertainty level of subjects' memberships to their clusters. We let the number of clusters and the uncertainty levels remain unknown while fitting Dirichlet process mixture models. Two simulation studies demonstrate the proposed methodology, and the method is applied to cluster a real data set taken from gene expression studies.
PubDate: 2016-12-01
DOI: 10.1007/s11634-016-0262-x
Issue No: Vol. 10, No. 4 (2016)

• Probabilistic clustering via Pareto solutions and significance tests
• Authors: María Teresa Gallegos; Gunter Ritter
Abstract: The present paper proposes a new strategy for probabilistic (often called model-based) clustering. It is well known that local maxima of mixture likelihoods can be used to partition an underlying data set. However, local maxima are rarely unique. It therefore remains to select the reasonable solutions, and in particular the desired one. Credible partitions are usually recognized by the separation (and cohesion) of their clusters. We use the p values provided by the classical tests of Wilks, Hotelling, and Behrens–Fisher to single out those solutions that are well separated by location. It has been shown that reasonable solutions to a clustering problem are related to Pareto points in a plot of scale balance vs. model fit of all local maxima. We briefly review this theory and propose as solutions all well-fitting Pareto points in the set of local maxima separated by location in the above sense. We also design a new iterative, parameter-free cutting-plane algorithm for the multivariate Behrens–Fisher problem.
PubDate: 2016-12-30
DOI: 10.1007/s11634-016-0278-2

• Rank-based classifiers for extremely high-dimensional gene expression data
• Authors: Ludwig Lausser; Florian Schmid; Lyn-Rouven Schirra; Adalbert F. X. Wilhelm; Hans A. Kestler
Abstract: Predicting phenotypes on the basis of gene expression profiles is a classification task that is becoming increasingly important in the field of precision medicine. Although these expression signals are real-valued, it is questionable whether they can be analyzed on an interval scale. As with many biological signals, their influence on, e.g., protein levels is usually non-linear and can thus be misinterpreted. In this article we study gene expression profiles with up to 54,000 dimensions. We analyze these measurements on an ordinal scale by replacing the real-valued profiles with their ranks. This type of rank transformation can be used to construct invariant classifiers that are not affected by noise induced by data transformations which can occur in the measurement setup. Our 10 $$\times$$ 10-fold cross-validation experiments on 86 different data sets and 19 different classification models indicate that classifiers largely benefit from this transformation. In particular, random forests and support vector machines achieve improved classification results on a significant majority of the datasets.
PubDate: 2016-12-19
DOI: 10.1007/s11634-016-0277-3
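The rank transformation itself is simple to apply. The sketch below (a toy setup, not the article's 86-data-set study) replaces each profile by its within-sample ranks and checks the claimed invariance under a monotone transformation of the raw intensities:

```python
import numpy as np
from scipy.stats import rankdata
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
# Toy "expression profiles": 60 samples x 100 genes, class signal in gene 0.
X = rng.lognormal(size=(60, 100))
y = np.repeat([0, 1], 30)
X[y == 1, 0] *= 8.0

# Replace each profile by its within-sample ranks: the classifier then sees
# only the ordering of the genes within a profile.
X_rank = np.apply_along_axis(rankdata, 1, X)

# Invariance: a strictly monotone transformation (here: log) of the raw
# intensities leaves the rank-transformed data unchanged.
print(np.allclose(np.apply_along_axis(rankdata, 1, np.log(X)), X_rank))

clf = RandomForestClassifier(n_estimators=100, random_state=0)
score = cross_val_score(clf, X_rank, y, cv=5).mean()
```

Because the ranks are unchanged by any order-preserving distortion of the measurement scale, a classifier trained on `X_rank` is immune to that whole family of measurement-setup transformations.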

JournalTOCs
School of Mathematical and Computer Sciences
Heriot-Watt University
Edinburgh, EH14 4AS, UK
Email: journaltocs@hw.ac.uk
Tel: +00 44 (0)131 4513762
Fax: +00 44 (0)131 4513327
