Authors:Karel Hron; Paula Brito; Peter Filzmoser Pages: 223 - 241 Abstract: Compositional data are considered as data where relative contributions of parts on a whole, conveyed by (log-)ratios between them, are essential for the analysis. In Symbolic Data Analysis (SDA), we are in the framework of interval data when elements are characterized by variables whose values are intervals on \(\mathbb {R}\) representing inherent variability. In this paper, we address the special problem of the analysis of interval compositions, i.e., when the interval data are obtained by the aggregation of compositions. It is assumed that the interval information is represented by the respective midpoints and ranges, and both sources of information are considered as compositions. In this context, we introduce the representation of interval data as three-way data. In the framework of the log-ratio approach from compositional data analysis, it is outlined how interval compositions can be treated in an exploratory context. The goal of the analysis is to represent the compositions by coordinates which are interpretable in terms of the original compositional parts. This is achieved by summarizing all relative information (logratios) about each part into one coordinate from the coordinate system. Based on an example from the European Union Statistics on Income and Living Conditions (EU-SILC), several possibilities for an exploratory data analysis approach for interval compositions are outlined and investigated. PubDate: 2017-06-01 DOI: 10.1007/s11634-016-0245-y Issue No:Vol. 11, No. 2 (2017)
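The log-ratio coordinates mentioned in this abstract can be illustrated with a minimal sketch. It assumes the common "pivot coordinate" construction, in which one coordinate aggregates all log-ratios of a chosen part to the remaining parts; the abstract does not spell out the exact coordinate system, so the function names and the example bounds below are hypothetical.

```python
import math

def clr(x):
    """Centred log-ratio coefficients of a composition with positive parts."""
    logs = [math.log(v) for v in x]
    mean_log = sum(logs) / len(logs)
    return [l - mean_log for l in logs]

def pivot_coordinate(x, part=0):
    """Scaled mean of the log-ratios of x[part] to every other part.

    Assumed form: sqrt((D-1)/D) * (ln x_part - mean of ln x_j over j != part).
    """
    D = len(x)
    other_logs = [math.log(v) for i, v in enumerate(x) if i != part]
    return math.sqrt((D - 1) / D) * (math.log(x[part]) - sum(other_logs) / (D - 1))

def midpoints_and_ranges(lower, upper):
    """Represent interval-valued parts by their midpoints and ranges."""
    mids = [(lo + up) / 2 for lo, up in zip(lower, upper)]
    ranges = [up - lo for lo, up in zip(lower, upper)]
    return mids, ranges

# Hypothetical interval composition with three parts.
mids, ranges = midpoints_and_ranges([0.1, 0.2, 0.4], [0.3, 0.4, 0.6])
z_mid = pivot_coordinate(mids, part=0)      # coordinate for the midpoints
z_range = pivot_coordinate(ranges, part=0)  # coordinate for the ranges
```

Both coordinates are unchanged if all parts are multiplied by a common factor, which is why only relative information enters the analysis. The pivot coordinate is also proportional to the corresponding clr coefficient, with factor sqrt(D/(D-1)).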

Authors:Emilie Devijver Pages: 243 - 279 Abstract: Finite mixture regression models are useful for modeling the relationship between response and predictors arising from different subpopulations. In this article, we study high-dimensional predictors and high-dimensional response and propose two procedures to cluster observations according to the link between predictors and the response. To reduce the dimension, we propose to use the Lasso estimator, which takes into account the sparsity and a maximum likelihood estimator penalized by the rank, to take into account the matrix structure. To choose the number of components and the sparsity level, we construct a collection of models, varying those two parameters and we select a model among this collection with a non-asymptotic criterion. We extend these procedures to functional data, where predictors and responses are functions. For this purpose, we use a wavelet-based approach. For each situation, we provide algorithms and apply and evaluate our methods both on simulated and real datasets, to understand how they work in practice. PubDate: 2017-06-01 DOI: 10.1007/s11634-016-0242-1 Issue No:Vol. 11, No. 2 (2017)

Authors:Gerhard Tutz; Micha Schneider; Maria Iannario; Domenico Piccolo Pages: 281 - 305 Abstract: In CUB models the uncertainty of choice is explicitly modelled as a Combination of discrete Uniform and shifted Binomial random variables. The basic concept to model the response as a mixture of a deliberate choice of a response category and an uncertainty component that is represented by a uniform distribution on the response categories is extended to a much wider class of models. The deliberate choice can in particular be determined by classical ordinal response models as the cumulative and adjacent categories model. Then one obtains the traditional and flexible models as special cases when the uncertainty component is irrelevant. It is shown that the effect of explanatory variables is underestimated if the uncertainty component is neglected in a cumulative type mixture model. Visualization tools for the effects of variables are proposed and the modelling strategies are evaluated by use of real data sets. It is demonstrated that the extended class of models frequently yields better fit than classical ordinal response models without an uncertainty component. PubDate: 2017-06-01 DOI: 10.1007/s11634-016-0247-9 Issue No:Vol. 11, No. 2 (2017)
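The basic CUB mixture described in this abstract can be written down in a few lines. The sketch below assumes the standard parameterization from the CUB literature (mixing weight `pi`, shifted-binomial parameter `xi`, `m` response categories); it covers only the basic model, not the extended class proposed in the paper.

```python
import math

def cub_pmf(m, pi, xi):
    """Probabilities P(R = r), r = 1..m, of a basic CUB model.

    Assumed form: pi * shifted binomial(r; m, xi) + (1 - pi) * uniform(1/m).
    """
    probs = []
    for r in range(1, m + 1):
        shifted_binomial = (
            math.comb(m - 1, r - 1) * (1 - xi) ** (r - 1) * xi ** (m - r)
        )
        probs.append(pi * shifted_binomial + (1 - pi) / m)
    return probs

# Hypothetical parameter values for a 7-point rating scale.
p = cub_pmf(m=7, pi=0.8, xi=0.3)
```

With `pi = 1` the uncertainty component vanishes and only the deliberate-choice part remains; with `pi = 0` every category has probability `1/m`.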

Authors:Julio César Hernández-Sánchez; José Luis Vicente-Villardón Pages: 307 - 326 Abstract: Classical biplot methods allow for the simultaneous representation of individuals (rows) and variables (columns) of a data matrix. For binary data, logistic biplots have been recently developed. When data are nominal, neither classical nor binary logistic biplots are adequate, and techniques such as multiple correspondence analysis (MCA), latent trait analysis (LTA) or item response theory (IRT) for nominal items should be used instead. In this paper we extend the binary logistic biplot to nominal data. The resulting method is termed “nominal logistic biplot” (NLB), although the variables are represented as convex prediction regions rather than vectors. Using methods from computational geometry, the set of prediction regions is converted to a set of points in such a way that the prediction for each individual is established by its closest “category point”. Then interpretation is based on distances rather than on projections. We study the geometry of such a representation and construct computational algorithms for the estimation of parameters and the calculation of prediction regions. Nominal logistic biplots extend both MCA and LTA in the sense that they give a graphical representation for LTA similar to the one obtained in MCA. PubDate: 2017-06-01 DOI: 10.1007/s11634-016-0249-7 Issue No:Vol. 11, No. 2 (2017)

Authors:J. Le-Rademacher; L. Billard Pages: 327 - 351 Abstract: This paper introduces a principal component methodology for analysing histogram-valued data under the symbolic data domain. Currently, no comparable method exists for this type of data. The proposed method uses a symbolic covariance matrix to determine the principal component space. The resulting observations on principal component space are presented as polytopes for visualization. Numerical representation of the resulting polytopes via histogram-valued output is also presented. The necessary algorithms are included. The technique is illustrated on a weather data set. PubDate: 2017-06-01 DOI: 10.1007/s11634-016-0255-9 Issue No:Vol. 11, No. 2 (2017)

Authors:Panagiotis Tzirakis; Christos Tjortjis Pages: 353 - 370 Abstract: This paper proposes, describes and evaluates T3C, a classification algorithm that builds decision trees of depth at most three, and results in high accuracy whilst keeping the size of the tree reasonably small. T3C is an improvement over algorithm T3 in the way it performs splits on continuous attributes. When run against publicly available data sets, T3C achieved lower generalisation error than T3 and the popular C4.5, and competitive results compared to Random Forest and Rotation Forest. PubDate: 2017-06-01 DOI: 10.1007/s11634-016-0246-x Issue No:Vol. 11, No. 2 (2017)

Authors:Stephen L. France; Wen Chen; Yumin Deng Pages: 371 - 393 Abstract: The ADCLUS and INDCLUS models, along with associated fitting techniques, can be used to extract an overlapping clustering structure from similarity data. In this paper, we examine the scalability of these models. We test the SINDCLUS algorithm and an adapted version of the SYMPRES algorithm on medium-sized datasets and try to infer their scalability and the degree of the local optima problem as the problem size increases. We describe several meta-heuristic approaches to minimizing the INDCLUS and ADCLUS loss functions. PubDate: 2017-06-01 DOI: 10.1007/s11634-016-0244-z Issue No:Vol. 11, No. 2 (2017)

Authors:Nadia Solaro; Alessandro Barbiero; Giancarlo Manzi; Pier Alda Ferrari Pages: 395 - 414 Abstract: Missing data recurrently affect datasets in almost every field of quantitative research. The subject is vast and complex and has originated a literature rich in very different approaches to the problem. Within an exploratory framework, distance-based methods such as nearest-neighbour imputation (NNI), or procedures involving multivariate data analysis (MVDA) techniques seem to treat the problem properly. In NNI, the metric and the number of donors can be chosen at will. MVDA-based procedures expressly account for variable associations. The new approach proposed here, called Forward Imputation, ideally meets these features. It is designed as a sequential procedure that imputes missing data in a step-by-step process involving subsets of units according to their “completeness rate”. Two methods within this context are developed for the imputation of quantitative data. One applies NNI with the Mahalanobis distance, the other combines NNI and principal component analysis. Statistical properties of the two methods are discussed, and their performance is assessed, also in comparison with alternative imputation methods. To this purpose, a simulation study in the presence of different data patterns along with an application to real data are carried out, and practical hints for users are also provided. PubDate: 2017-06-01 DOI: 10.1007/s11634-016-0243-0 Issue No:Vol. 11, No. 2 (2017)
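To make the NNI-with-Mahalanobis idea concrete, here is a minimal single-pass sketch. It is not the Forward Imputation procedure itself: it assumes one donor per incomplete unit, a covariance matrix estimated from the complete units only, and `None` as the missing-value marker. All function names and the toy data are hypothetical.

```python
def solve(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                f = M[r][col] / M[col][col]
                M[r] = [a - f * c for a, c in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]

def covariance(rows, cols):
    """Sample covariance matrix of the given columns."""
    n = len(rows)
    means = [sum(r[j] for r in rows) / n for j in cols]
    return [[sum((r[a] - ma) * (r[b] - mb) for r in rows) / (n - 1)
             for b, mb in zip(cols, means)]
            for a, ma in zip(cols, means)]

def impute_nn_mahalanobis(data):
    """Fill each missing value from the nearest complete unit (the donor)."""
    complete = [r for r in data if None not in r]
    result = []
    for row in data:
        obs = [j for j, v in enumerate(row) if v is not None]
        if len(obs) == len(row):
            result.append(row[:])
            continue
        S = covariance(complete, obs)  # covariance of the observed variables

        def dist2(donor):
            d = [row[j] - donor[j] for j in obs]
            return sum(di * zi for di, zi in zip(d, solve(S, d)))

        donor = min(complete, key=dist2)
        result.append([donor[j] if v is None else v for j, v in enumerate(row)])
    return result

# Toy data: the last unit lacks its third variable.
data = [[1.0, 2.0, 3.0], [2.0, 1.0, 5.0], [4.0, 5.0, 9.0],
        [5.0, 4.0, 11.0], [4.1, 4.9, None]]
imputed = impute_nn_mahalanobis(data)
```

In this toy example the unit `[4.0, 5.0, 9.0]` is the closest donor, so the missing entry is filled with `9.0`. A sequential scheme such as the one described in the abstract would instead process subsets of units ordered by their completeness rate.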

Authors:Mario Michael Krell; Sirko Straube Pages: 415 - 439 Abstract: Data processing often transforms a complex signal using a set of different preprocessing algorithms to a single value as the outcome of a final decision function. Still, it is challenging to understand and visualize the interplay between the algorithms performing this transformation. Especially when dimensionality reduction is used, the original data structure (e.g., spatio-temporal information) is hidden from subsequent algorithms. To tackle this problem, we introduce the backtransformation concept suggesting to look at the combination of algorithms as one transformation which maps the original input signal to a single value. Therefore, it takes the derivative of the final decision function and transforms it back through the previous processing steps via backward iteration and the chain rule. The resulting derivative of the composed decision function in the sample of interest represents the complete decision process. Using it for visualizations might improve the understanding of the process. Often, it is possible to construct a feasible processing chain with affine mappings which simplifies the calculation for the backtransformation and the interpretation of the result a lot. In this case, the affine backtransformation provides the complete parameterization of the processing chain. This article introduces the theory, provides implementation guidelines, and presents three application examples. PubDate: 2017-06-01 DOI: 10.1007/s11634-015-0229-3 Issue No:Vol. 11, No. 2 (2017)
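For the affine case mentioned in this abstract, the whole processing chain collapses into a single affine map, and its weight row is the derivative of the decision function with respect to the input. The following sketch shows that collapse under the assumption that each step is given as a pair `(A, b)` meaning `x -> A x + b`; the three example steps and all names are hypothetical, not taken from the article.

```python
def matvec(A, x):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(a * v for a, v in zip(row, x)) for row in A]

def matmul(A, B):
    """Matrix-matrix product for matrices stored as lists of rows."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))]
            for i in range(len(A))]

def affine_backtransformation(steps):
    """Collapse a chain of affine steps into one pair (A, b)."""
    A, b = steps[0]
    for A_next, b_next in steps[1:]:
        b = [u + v for u, v in zip(matvec(A_next, b), b_next)]
        A = matmul(A_next, A)
    return A, b

def apply_chain(steps, x):
    """Apply the steps one after another, as the original pipeline would."""
    for A, b in steps:
        x = [u + v for u, v in zip(matvec(A, x), b)]
    return x

# Hypothetical chain: feature scaling, 3 -> 2 reduction, linear decision function.
steps = [
    ([[0.5, 0, 0], [0, 2, 0], [0, 0, 1]], [-1, 0, 0.5]),
    ([[1, 1, 0], [0, 0, 1]], [0, 0]),
    ([[2, -1]], [0.25]),
]
weights, offset = affine_backtransformation(steps)
```

Here `weights[0]` assigns one weight to each original input component, so it can be visualized in the original data space even though the intermediate reduction step hides that structure from the final classifier.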

Authors:A. Cholaquidis; A. Cuevas; R. Fraiman Pages: 5 - 24 Abstract: A functional distance \({\mathbb H}\), based on the Hausdorff metric between the function hypographs, is proposed for the space \({\mathcal E}\) of non-negative real upper semicontinuous functions on a compact interval. The main goal of the paper is to show that the space \(({\mathcal E},{\mathbb H})\) is particularly suitable in some statistical problems with functional data which involve functions with very wiggly graphs and narrow, sharp peaks. A typical example is given by spectrograms, either obtained by magnetic resonance or by mass spectrometry. On the theoretical side, we show that \(({\mathcal E},{\mathbb H})\) is a complete, separable, locally compact space and that the \({\mathbb H}\)-convergence of a sequence of functions implies the convergence of the respective maximum values of these functions. The probabilistic and statistical implications of these results are discussed, in particular regarding the consistency of k-NN classifiers for supervised classification problems with functional data in \({\mathbb H}\). On the practical side, we provide the results of a small simulation study and check also the performance of our method in two real data problems of supervised classification involving mass spectra. PubDate: 2017-03-01 DOI: 10.1007/s11634-015-0217-7 Issue No:Vol. 11, No. 1 (2017)

Authors:Wolfgang Gaul; Dominique Vincent Pages: 159 - 178 Abstract: Topics that attract public attention can originate from current events or developments, might be influenced by situations in the past, and often continue to be of interest in the future. When respective information is made available textually, one possibility of detecting such topics of public importance consists in scrutinizing, e.g., appropriate press articles using—given the continual growth of information—text processing techniques enriched by computer routines which examine present-day textual material, check historical publications, find newly emerging topics, and are able to track topic trends over time. Information clustering based on content-(dis)similarity of the underlying textual material and graph-theoretical considerations to deal with the network of relationships between content-similar topics are described and combined in a new approach. Explanatory examples of topic detection and tracking in online news articles illustrate the usefulness of the approach in different situations. PubDate: 2017-03-01 DOI: 10.1007/s11634-016-0241-2 Issue No:Vol. 11, No. 1 (2017)

Authors:A. Felipe; N. Martín; P. Miranda; L. Pardo Abstract: In this paper we explore the possibilities of applying \(\phi\)-divergence measures in inferential problems in the field of latent class models (LCMs) for multinomial data. We first treat the problem of estimating the model parameters. As explained below, minimum \(\phi\)-divergence estimators (M\(\phi\)Es) considered in this paper are a natural extension of the maximum likelihood estimator (MLE), the usual estimator for this problem; we study the asymptotic properties of M\(\phi\)Es, showing that they share the same asymptotic distribution as the MLE. To compare the efficiency of the M\(\phi\)Es when the sample size is not big enough to apply the asymptotic results, we have carried out an extensive simulation study; from this study, we conclude that there are estimators in this family that are competitive with the MLE. Next, we deal with the problem of testing whether an LCM for multinomial data fits a data set; again, \(\phi\)-divergence measures can be used to generate a family of test statistics generalizing both the classical likelihood ratio test and the chi-squared test statistics. Finally, we treat the problem of choosing the best model out of a sequence of nested LCMs; as before, \(\phi\)-divergence measures can handle the problem and we derive a family of \(\phi\)-divergence test statistics based on them; we study the asymptotic behavior of these test statistics, showing that it is the same as that of the classical test statistics. A simulation study for small and moderate sample sizes shows that there are some test statistics in the family that can compete with the classical likelihood ratio and the chi-squared test statistics. PubDate: 2017-07-04 DOI: 10.1007/s11634-017-0289-7

Authors:Mercedes Fernandez Sau; Daniela Rodriguez Abstract: In this paper, we propose estimators based on the minimum distance for the unknown parameters of a parametric density on the unit sphere. We show that these estimators are consistent and asymptotically normally distributed. Also, we apply our proposal to develop a method that allows us to detect potential atypical values. The behavior under small samples of the proposed estimators is studied using Monte Carlo simulations. Two applications of our procedure are illustrated with real data sets. PubDate: 2017-06-02 DOI: 10.1007/s11634-017-0287-9

Authors:Karim Abou-Moustafa; Frank P. Ferrie Abstract: Finding the set of nearest neighbors for a query point of interest appears in a variety of algorithms for machine learning and pattern recognition. Examples include k nearest neighbor classification, information retrieval, case-based reasoning, manifold learning, and nonlinear dimensionality reduction. In this work, we propose a new approach for determining a distance metric from the data for finding such neighboring points. For a query point of interest, our approach learns a generalized quadratic distance (GQD) metric based on the statistical properties in a “small” neighborhood for the point of interest. The locally learned GQD metric captures information such as the density, curvature, and the intrinsic dimensionality for the points falling in this particular neighborhood. Unfortunately, learning the GQD parameters under such a local learning mechanism is a challenging problem with a high computational overhead. To address these challenges, we estimate the GQD parameters using the minimum volume covering ellipsoid (MVCE) for a set of points. The advantage of the MVCE is two-fold. First, the MVCE together with the local learning approach approximate the functionality of a well known robust estimator for covariance matrices. Second, computing the MVCE is a convex optimization problem which, in addition to having a unique global solution, can be efficiently solved using a first order optimization algorithm. We validate our metric learning approach on a large variety of datasets and show that the proposed metric has promising results when compared with five algorithms from the literature for supervised metric learning. PubDate: 2017-04-25 DOI: 10.1007/s11634-017-0286-x

Authors:Afef Ben Brahim; Mohamed Limam Abstract: The curse of dimensionality is based on the fact that high dimensional data is often difficult to work with. A large number of features can increase the noise of the data and thus the error of a learning algorithm. Feature selection is a solution for such problems where there is a need to reduce the data dimensionality. Different feature selection algorithms may yield feature subsets that can be considered local optima in the space of feature subsets. Ensemble feature selection combines independent feature subsets and might give a better approximation to the optimal subset of features. We propose an ensemble feature selection approach based on feature selectors’ reliability assessment. It aims at providing a unique and stable feature selection without ignoring the predictive accuracy aspect. A classification algorithm is used as an evaluator to assign a confidence to features selected by ensemble members based on their associated classification performance. We compare our proposed approach to several existing techniques and to individual feature selection algorithms. Results show that our approach often improves classification performance and feature selection stability for high dimensional data sets. PubDate: 2017-04-24 DOI: 10.1007/s11634-017-0285-y

Authors:Kohei Adachi; Nickolay T. Trendafilov Abstract: We propose a new procedure for sparse factor analysis (FA) such that each variable loads only one common factor. Thus, the loading matrix has a single nonzero element in each row and zeros elsewhere. Such a loading matrix is the sparsest possible for certain number of variables and common factors. For this reason, the proposed method is named sparsest FA (SSFA). It may also be called FA-based variable clustering, since the variables loading the same common factor can be classified into a cluster. In SSFA, all model parts of FA (common factors, their correlations, loadings, unique factors, and unique variances) are treated as fixed unknown parameter matrices and their least squares function is minimized through specific data matrix decomposition. A useful feature of the algorithm is that the matrix of common factor scores is re-parameterized using QR decomposition in order to efficiently estimate factor correlations. A simulation study shows that the proposed procedure can exactly identify the true sparsest models. Real data examples demonstrate the usefulness of the variable clustering performed by SSFA. PubDate: 2017-04-13 DOI: 10.1007/s11634-017-0284-z

Authors:Sonia Barahona; Ximo Gual-Arnau; Maria Victoria Ibáñez; Amelia Simó Abstract: Object classification according to their shape and size is of key importance in many scientific fields. This work focuses on the case where the size and shape of an object is characterized by a current. A current is a mathematical object which has been proved relevant to the modeling of geometrical data, like submanifolds, through integration of vector fields along them. As a consequence of the choice of a vector-valued reproducing kernel Hilbert space (RKHS) as a test space for integrating manifolds, it is possible to consider that shapes are embedded in this Hilbert Space. A vector-valued RKHS is a Hilbert space of vector fields; therefore, it is possible to compute a mean of shapes, or to calculate a distance between two manifolds. This embedding enables us to consider size-and-shape clustering algorithms. These algorithms are applied to a 3D database obtained from an anthropometric survey of the Spanish child population with a potential application to online sales of children’s wear. PubDate: 2017-03-11 DOI: 10.1007/s11634-017-0283-0

Authors:Daniel Baier; Sarah Frost Abstract: Brand confusion occurs when a consumer is exposed to an advertisement (ad) for brand A but believes that it is for brand B. If more consumers are confused in this direction than in the other one (assuming that an ad for B is for A), this asymmetry is a disadvantage for A. Consequently, the confusion potential and structure of ads have to be checked: A sample of consumers is exposed to a sample of ads. For each ad the consumers have to specify their guess about the advertised brand. Then, the collected data are aggregated and analyzed using, e.g., MDS or two-mode clustering. In this paper we compare this approach to a new one where image data analysis and classification are applied: The confusion potential and structure of ads are related to featurewise distances between ads and, to model asymmetric effects, to the strengths of the advertised brands. A sample application for the German beer market is presented; the results are encouraging. PubDate: 2017-03-04 DOI: 10.1007/s11634-017-0282-1