Authors: John C. Gower Pages: 33-41 Abstract: The paper gives a short account of how I became interested in analysing asymmetry in square tables. The early history of the canonical analysis of skew-symmetry and the associated development of its geometrical interpretation are described. PubDate: 2018-03-01 DOI: 10.1007/s11634-014-0181-7 Issue No: Vol. 12, No. 1 (2018)

Authors: Donatella Vicari Pages: 43-64 Abstract: A CLUstering model for SKew-symmetric data including EXTernal information (CLUSKEXT) is proposed. It relies on the decomposition of a skew-symmetric matrix into within- and between-cluster effects, which are further decomposed into regression and residual effects when external information on the objects is available. In order to fit the imbalances between objects, the model jointly searches for a partition of the objects and for appropriate weights, which are in turn linearly linked to the external variables. The model is fitted in a least-squares framework and a decomposition of the fit is derived. An appropriate alternating least-squares algorithm is provided, and the model is illustrated on real and artificial data. PubDate: 2018-03-01 DOI: 10.1007/s11634-015-0203-0 Issue No: Vol. 12, No. 1 (2018)
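The starting point of such models is the classical least-squares split of a square asymmetric matrix into a symmetric and a skew-symmetric part. A minimal sketch in Python (the matrix X is an arbitrary toy example, not the paper's data):

```python
import numpy as np

# Toy asymmetric proximity matrix between four objects (illustrative only).
X = np.array([[0.0, 2.0, 5.0, 1.0],
              [4.0, 0.0, 3.0, 2.0],
              [1.0, 6.0, 0.0, 4.0],
              [3.0, 1.0, 2.0, 0.0]])

S = (X + X.T) / 2.0  # symmetric part (average proximities)
K = (X - X.T) / 2.0  # skew-symmetric part (imbalances), K = -K.T

# X = S + K, and the two parts are orthogonal in the Frobenius sense,
# so their least-squares fits separate cleanly.
assert np.allclose(X, S + K)
assert np.isclose(np.sum(S * K), 0.0)
```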

Authors: Gunnar Carlsson; Facundo Mémoli; Alejandro Ribeiro; Santiago Segarra Pages: 65-105 Abstract: This paper considers networks where relationships between nodes are represented by directed dissimilarities. The goal is to study methods that, based on the dissimilarity structure, output hierarchical clusters, i.e., a family of nested partitions indexed by a connectivity parameter. Our construction of hierarchical clustering methods is built around the concept of admissible methods: those that abide by the axiom of value (the two nodes of a two-node network are clustered together at the maximum of the two dissimilarities between them) and the axiom of transformation (when dissimilarities are reduced, the network may become more clustered but not less). Two particular methods, termed reciprocal and nonreciprocal clustering, are shown to provide upper and lower bounds in the space of admissible methods. Furthermore, alternative clustering methodologies and axioms are considered. In particular, modifying the axiom of value so that clustering in two-node networks occurs at the minimum of the two dissimilarities entails the existence of a unique admissible clustering method. Finally, the developed clustering methods are applied to analyze internal migration in the United States. PubDate: 2018-03-01 DOI: 10.1007/s11634-017-0299-5 Issue No: Vol. 12, No. 1 (2018)
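To my understanding, the reciprocal method merges nodes through chains whose links are cheap in both directions, which amounts to single linkage on the max-symmetrized dissimilarity. A hedged illustration (the matrix D is a toy example, not the authors' code):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import squareform

# Toy directed dissimilarities among 4 nodes (rows: from, cols: to).
D = np.array([[0., 1., 4., 3.],
              [2., 0., 1., 5.],
              [4., 2., 0., 1.],
              [1., 5., 2., 0.]])

# Reciprocal clustering merges nodes through chains whose links are
# small in *both* directions, i.e. single linkage on max(D, D^T).
D_rec = np.maximum(D, D.T)
Z = linkage(squareform(D_rec, checks=False), method='single')
print(Z)  # nested partitions indexed by the connectivity parameter
```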

Authors: Mark de Rooij Pages: 107-130 Abstract: Longitudinal categorical data are often collected using an experimental design where the interest is in the differential development of the treatment group compared to the control group. Such differential development is often assessed from average growth curves, but it can also be assessed from transitions. For longitudinal multinomial data we describe a transitional methodology for statistical analysis based on a distance model. Such a distance approach has two advantages over a multinomial regression model: (1) sparse data can be handled more efficiently; (2) a graphical representation of the model can be made to enhance interpretation. Within this approach it is possible to jointly model the observations and the missing values by adding to the response variable a new category representing the missingness condition. This approach is investigated in a Monte Carlo simulation study. The results show that it is a promising way to deal with missing data, although the mechanism is not yet completely understood in all cases. Finally, an empirical example is presented in which the advantages of the modelling procedure are highlighted. PubDate: 2018-03-01 DOI: 10.1007/s11634-015-0226-6 Issue No: Vol. 12, No. 1 (2018)

Authors: Marti Sagarra; Frank M. T. A. Busing; Cecilio Mar-Molinero; Josep Rialp Pages: 131-153 Abstract: Spanish financial institutions have been heavily affected by the banking crisis that began in 2008. Many of them, especially Spanish savings banks (or Cajas), had to merge with other institutions or had to be rescued. We address the question of to what extent the nature of competition in this sector has changed as a result of the crisis. Although institutions compete in many ways, we concentrate on their presence on the main street through bank branches open to the public (i.e., retail banking competition). Our measure of inter-firm rivalry is based on a geographical proximity measure that we calculate for the years 2008 (before the crisis) and 2012 (the last available data set). The technical approach is based on multidimensional unfolding, a methodology that allows us to graphically represent the asymmetric nature of such rivalry. The resulting maps visualise the salient aspects of the system at the two dates analysed and can be understood without detailed technical knowledge. PubDate: 2018-03-01 DOI: 10.1007/s11634-014-0186-2 Issue No: Vol. 12, No. 1 (2018)

Authors: Daniel Baier; Sarah Frost Pages: 155-171 Abstract: Brand confusion occurs when a consumer is exposed to an advertisement (ad) for brand A but believes that it is for brand B. If more consumers are confused in this direction than in the other (believing that an ad for B is for A), this asymmetry is a disadvantage for A. Consequently, the confusion potential and structure of ads have to be checked: a sample of consumers is exposed to a sample of ads, and for each ad the consumers have to specify their guess about the advertised brand. The collected data are then aggregated and analyzed using, e.g., MDS or two-mode clustering. In this paper we compare this approach to a new one in which image data analysis and classification are applied: the confusion potential and structure of ads are related to featurewise distances between ads and, to model asymmetric effects, to the strengths of the advertised brands. A sample application to the German beer market is presented; the results are encouraging. PubDate: 2018-03-01 DOI: 10.1007/s11634-017-0282-1 Issue No: Vol. 12, No. 1 (2018)

Authors: José R. Berrendero; Javier Cárcamo Abstract: We obtain a decomposition of any quadratic classifier in terms of products of hyperplanes. These hyperplanes can be viewed as relevant linear components of the quadratic rule (with respect to the underlying classification problem). As an application, we introduce the associated multidirectional classifier: a piecewise linear classification rule induced by the approximating products. Such a classifier is useful for determining linear combinations of the predictor variables with discriminating ability. We also show that this classifier can be used as a tool to reduce the dimension of the data and to help identify the variables most important for classifying new elements. Finally, we illustrate with a real data set the use of these linear components to construct oblique classification trees. PubDate: 2018-04-07 DOI: 10.1007/s11634-018-0321-6
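The textbook identity that motivates such decompositions can be sketched as follows: eigendecomposing the quadratic part rewrites it as a sum of squared linear components, each a product of two (identical) hyperplanes. This is only the standard rewriting, not the authors' exact construction, and A, b, c are arbitrary toy values:

```python
import numpy as np

# A quadratic rule g(x) = x' A x + b' x + c, classify by sign(g).
A = np.array([[2.0, 1.0], [1.0, -1.0]])
b = np.array([0.5, -0.3])
c = 0.1

# A = V diag(lam) V' rewrites the quadratic part as sum_i lam_i (v_i' x)^2,
# exposing the linear directions v_i that drive the rule.
lam, V = np.linalg.eigh(A)

def g(x):
    return x @ A @ x + b @ x + c

def g_decomposed(x):
    proj = V.T @ x                      # linear components v_i' x
    return np.sum(lam * proj ** 2) + b @ x + c

x = np.array([1.2, -0.7])
assert np.isclose(g(x), g_decomposed(x))
```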

Authors: Shuji Ando; Kouji Tahata; Sadao Tomizawa Abstract: For square contingency tables, a double symmetry model having a matrix structure that combines symmetry and point symmetry was previously proposed, along with an index representing the degree of departure from double symmetry. However, that index cannot simultaneously characterize the degree of departure from symmetry and the degree of departure from point symmetry. For measuring the degree of departure from double symmetry, the present paper proposes a bivariate index vector that can characterize both simultaneously. PubDate: 2018-03-26 DOI: 10.1007/s11634-018-0320-7
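For reference, the two structures involved in an r x r table of probabilities are p_ij = p_ji (symmetry) and p_ij = p_{r+1-i, r+1-j} (point symmetry, a 180-degree rotation of the table). The sketch below checks both on a toy table using crude squared-distance departures; the paper's bivariate index is defined differently:

```python
import numpy as np

# Toy square contingency table of probabilities p_ij.
P = np.array([[0.10, 0.05, 0.02],
              [0.08, 0.20, 0.05],
              [0.04, 0.06, 0.40]])
P = P / P.sum()

# Symmetry:       P equals its transpose.
# Point symmetry: P equals its 180-degree rotation.
P_sym = P.T
P_point = P[::-1, ::-1]

# Crude squared-distance measures of departure from each structure
# (illustrative only; not the information-theoretic index of the paper).
dep_symmetry = np.sum((P - P_sym) ** 2)
dep_point = np.sum((P - P_point) ** 2)
print(dep_symmetry, dep_point)
```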

Authors: Qiang Wang; Thanh-Tung Nguyen; Joshua Z. Huang; Thuy Thi Nguyen Abstract: In this paper, we propose a new random forest (RF) algorithm for classification of high-dimensional data, using a subspace feature sampling method and feature value searching. The new subspace sampling method maintains the diversity and randomness of the forest and enables one to generate trees with a lower prediction error. A greedy technique is used to handle high-cardinality categorical features for efficient node splitting when building the decision trees in the forest. This allows the trees to handle very high cardinality while reducing the computational time of building the RF model. Extensive experiments on high-dimensional real data sets, including standard machine learning data sets and image data sets, have been conducted. The results demonstrate that the proposed approach to learning RFs significantly reduces prediction errors and outperforms most existing RFs when dealing with high-dimensional data. PubDate: 2018-03-21 DOI: 10.1007/s11634-018-0318-1

Authors: Matthijs J. Warrens; Alexandra de Raadt Abstract: Cohen's kappa is the most widely used coefficient for assessing interobserver agreement on a nominal scale. An alternative coefficient for quantifying agreement between two observers is Bangdiwala's B. To interpret an agreement coefficient properly, one must first understand its meaning. Properties of the kappa coefficient have been studied extensively and are well documented; properties of coefficient B have been studied, but not extensively. In this paper, various new properties of B are presented. Category B-coefficients are defined that are the basic building blocks of B. We study how coefficient B, Cohen's kappa, the observed agreement, and the associated category coefficients may be related. It turns out that the relationships between the coefficients are quite different for 2×2 tables than for agreement tables with three or more categories. PubDate: 2018-03-19 DOI: 10.1007/s11634-018-0319-0
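Both coefficients have standard closed forms: kappa = (p_o - p_e)/(1 - p_e), and B = sum_i n_ii^2 / sum_i n_i+ n_+i, the area of the diagonal agreement squares relative to the marginal rectangles in Bangdiwala's agreement chart. A minimal sketch on a toy agreement table:

```python
import numpy as np

# Agreement table: n[i, j] = count of items rated category i by
# observer 1 and category j by observer 2.
n = np.array([[20,  3,  1],
              [ 4, 15,  2],
              [ 1,  2, 12]], dtype=float)

N = n.sum()
row = n.sum(axis=1)          # marginals of observer 1
col = n.sum(axis=0)          # marginals of observer 2

p_o = np.trace(n) / N                # observed agreement
p_e = np.sum(row * col) / N ** 2     # chance-expected agreement
kappa = (p_o - p_e) / (1 - p_e)      # Cohen's kappa

B = np.sum(np.diag(n) ** 2) / np.sum(row * col)  # Bangdiwala's B

print(f"kappa = {kappa:.3f}, B = {B:.3f}")
```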

Authors: Wan-Lun Wang; Luis M. Castro; Yen-Ting Chang; Tsung-I Lin Abstract: Mixtures of common t factor analyzers (MCtFA) have been shown to be effective in robustifying mixtures of common factor analyzers (MCFA) for model-based clustering of high-dimensional data with heavy tails. However, the MCtFA model may still lack robustness against observations whose distributions are highly asymmetric. This paper presents a further robust extension of the MCFA and MCtFA models, called the mixture of common restricted skew-t factor analyzers (MCrstFA), obtained by assuming a restricted multivariate skew-t distribution for the common factors. The MCrstFA model can accommodate severely non-normal (skewed and leptokurtic) random phenomena while preserving the parsimony of the factor-analytic representation and allowing graphical visualization in low-dimensional plots. A computationally feasible expectation conditional maximization either (ECME) algorithm is developed to carry out maximum likelihood estimation. The numbers of factors and mixture components are determined simultaneously based on common penalized likelihood criteria. The usefulness of the proposed model is illustrated with simulated and real datasets, and experimental results signify its superiority over some existing competitors. PubDate: 2018-03-08 DOI: 10.1007/s11634-018-0317-2

Authors: Ravi Sankar Sangam; Hari Om Abstract: In a data stream environment, most conventional clustering algorithms are not sufficiently efficient, since large volumes of data arrive in a stream and these data points unfold with time. The problems of clustering time-evolving metric data and time-evolving categorical data have separately been well explored in recent years, but clustering mixed-type time-evolving data remains a challenging issue because of the awkward gap between the structures of metric and categorical attributes. In this paper, we devise a generalized framework, termed Equi-Clustream, to dynamically cluster mixed-type time-evolving data. It comprises three algorithms: a Hybrid Drifting Concept Detection Algorithm that detects the drifting concept between the current and previous sliding windows; a Hybrid Data Labeling Algorithm that assigns an appropriate cluster label to each data vector of the current non-drifting window based on the clustering result of the previous sliding window; and a visualization algorithm that analyses the relationships between the clusters at different timestamps and visualizes the evolving trends of the clusters. The efficacy of the proposed framework is shown by experiments on synthetic and real-world datasets. PubDate: 2018-02-26 DOI: 10.1007/s11634-018-0316-3

Authors: Christian Carmona; Luis Nieto-Barajas; Antonio Canale Abstract: The Ministry of Social Development in Mexico is in charge of creating and assigning social programmes targeting specific needs in the population for the improvement of the quality of life. To better target these programmes, the Ministry aims to find clusters of households with the same needs based on demographic characteristics as well as poverty conditions of the household. The available data consist of continuous, ordinal, and nominal variables, all of which come from a non-i.i.d. complex-design survey sample. We propose a Bayesian nonparametric mixture model that jointly models a set of latent variables associated with the observed mixed-scale data, as in an underlying-variable response approach, and accommodates the different sampling probabilities. The performance of the model is assessed via simulated data. A full analysis of socio-economic conditions of households in the Mexican State of Mexico is presented. PubDate: 2018-02-08 DOI: 10.1007/s11634-018-0313-6

Authors: Edoardo Otranto; Massimo Mucciardi Abstract: The STAR model is widely used to represent the dynamics of a variable recorded at several locations at the same time. Its advantages are often discussed in terms of parsimony with respect to space-time VAR structures, because it considers a single coefficient for each time and spatial lag. This hypothesis can be very strong; we add a degree of flexibility to the STAR model by allowing the coefficients to vary across groups of locations. The new class of models, called Flexible STAR (FSTAR), is compared to the classical STAR and the space-time VAR through simulations and an application. PubDate: 2018-02-07 DOI: 10.1007/s11634-018-0314-5
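A toy simulation may clarify the idea. A STAR(1,1) model is y_t = phi * y_{t-1} + psi * W y_{t-1} + eps_t, with a single (phi, psi) pair shared by all locations; the FSTAR extension lets the pair vary across groups of locations. The weight matrix, groups, and coefficients below are arbitrary illustrations, not the paper's specification:

```python
import numpy as np

rng = np.random.default_rng(0)
n_loc, T = 6, 200

# Row-normalized spatial weight matrix W (here: a ring of neighbours).
W = np.zeros((n_loc, n_loc))
for i in range(n_loc):
    W[i, (i - 1) % n_loc] = W[i, (i + 1) % n_loc] = 0.5

# FSTAR idea: let (phi, psi) vary across groups of locations,
# e.g. two groups of three locations each.
groups = np.array([0, 0, 0, 1, 1, 1])
phi = np.array([0.4, 0.7])[groups]   # group-specific own-lag effects
psi = np.array([0.3, 0.1])[groups]   # group-specific spatial-lag effects

y = np.zeros((T, n_loc))
for t in range(1, T):
    y[t] = phi * y[t - 1] + psi * (W @ y[t - 1]) + rng.normal(0, 1, n_loc)
```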

Abstract: Many clustering algorithms for data that are curves or functions have recently been proposed. However, the presence of contamination in the sample of curves can affect the performance of most of them. In this work we propose a robust, model-based clustering method that relies on an approximation to the "density function" for functional data. The robustness follows from the joint application of data-driven trimming, which reduces the effect of contaminated observations, and constraints on the variances, which avoid spurious clusters in the solution. The algorithm is designed to perform clustering and outlier detection simultaneously by maximizing a trimmed "pseudo" likelihood. The proposed method is evaluated and compared with existing methods through a simulation study, and shows better performance when a fraction of contaminating curves is added to a non-contaminated sample. Finally, an application to a real data set previously considered in the literature is given. PubDate: 2018-02-03 DOI: 10.1007/s11634-018-0312-7
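The trimming step can be sketched independently of the functional-data details. Below is a minimal sketch of one concentration step, assuming per-curve, per-cluster log-densities are already available; the paper's pseudo-likelihood and variance constraints are not reproduced here:

```python
import numpy as np

def trimmed_assignment(loglik, alpha):
    """One concentration step of trimmed model-based clustering:
    loglik[i, k] = log-density of curve i under cluster k.
    The floor(alpha * n) curves with the lowest best log-density are
    flagged as outliers (label -1) and excluded from the next fit."""
    best = loglik.max(axis=1)             # best cluster fit per curve
    labels = loglik.argmax(axis=1)
    n_trim = int(np.floor(alpha * len(best)))
    if n_trim > 0:
        labels[np.argsort(best)[:n_trim]] = -1
    return labels

rng = np.random.default_rng(3)
ll = rng.normal(size=(10, 2))             # toy log-densities, 10 curves
print(trimmed_assignment(ll, alpha=0.2))  # two curves flagged as -1
```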

Authors: Giuseppe Bove; Akinori Okada Abstract: Asymmetric pairwise relationships are frequently observed in experimental and non-experimental studies. They can be analysed with different aims and approaches. A brief review of models and methods of multidimensional scaling and cluster analysis able to deal with asymmetric proximities is provided, taking a 'data-analytic' approach and emphasizing data visualization. PubDate: 2018-02-01 DOI: 10.1007/s11634-017-0307-9

Authors: Dawit G. Tadesse; Mark Carpenter Abstract: In this paper, we give a new feature selection algorithm for the binary classification problem in sparse high-dimensional spaces. Singular value decomposition (SVD) is a popular dimension-reduction method in high-dimensional classification. The traditional SVD approach begins by ranking the singular dimensions (SDs) from the largest singular value to the smallest. However, when signal features are outnumbered by noise features, the first few ranked SDs are not necessarily the best for classification. We demonstrate, theoretically and empirically, that our method efficiently selects the SDs most appropriate for classification and significantly reduces the misclassification error. We also apply our method to real text-mining applications. PubDate: 2018-01-25 DOI: 10.1007/s11634-018-0311-8
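The phenomenon is easy to reproduce: when noise features dominate, the leading SDs need not separate the classes. The sketch below ranks SDs by a two-sample separation score computed on their score vectors; the scoring rule is an illustrative stand-in, not the authors' algorithm:

```python
import numpy as np

rng = np.random.default_rng(1)

# Sparse high-dimensional toy data: 2 classes, signal in 3 of 500 features.
n, p = 100, 500
y = np.repeat([0, 1], n // 2)
X = rng.normal(size=(n, p))
X[y == 1, :3] += 1.5

# Project onto singular dimensions and rank them not by singular value
# but by how well their score vectors separate the two classes.
U, s, Vt = np.linalg.svd(X - X.mean(axis=0), full_matrices=False)
scores = U * s                     # n x r matrix of SD scores

def separation(z, y):
    z0, z1 = z[y == 0], z[y == 1]
    se = np.sqrt(z0.var(ddof=1) / len(z0) + z1.var(ddof=1) / len(z1))
    return abs(z0.mean() - z1.mean()) / se

sep = np.array([separation(scores[:, k], y) for k in range(scores.shape[1])])
print("SDs ranked by class separation:", np.argsort(sep)[::-1][:5])
print("SDs ranked by singular value:  ", np.arange(5))
```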

Authors: Sébastien Loisel; Yoshio Takane Abstract: Missing data are prevalent in many data-analytic situations, and those in which principal component analysis (PCA) is applied are no exception. The performance of five methods for handling missing data in PCA is investigated: the missing data passive method, the weighted low rank approximation (WLRA) method, the regularized PCA (RPCA) method, the trimmed scores regression method, and the data augmentation (DA) method. Three complete data sets of varying sizes were selected, in which missing data were created randomly and non-randomly. These data were then analyzed by the five methods, and their parameter recovery capability, as measured by the mean congruence coefficient between loadings obtained from full and missing data, was compared as a function of the number of extracted components (dimensionality) and the proportion of missing data (censor rate). For randomly censored data, all five methods worked well when the dimensionality and censor rate were small. Their performance deteriorated as the dimensionality and censor rate increased, but the deterioration was distinctly faster with the WLRA method. The RPCA method worked best, and the DA method came a close second in terms of parameter recovery; however, the latter, as implemented here, was found to be extremely time-consuming. For non-randomly censored data, the recovery was also affected by the degree of non-randomness in the censoring process. Again the RPCA method worked best, maintaining good to excellent recovery when the censor rate was small and the dimensionality of the solutions was not too excessive. PubDate: 2018-01-18 DOI: 10.1007/s11634-018-0310-9
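The recovery measure used here, Tucker's congruence coefficient, is simply the cosine between two loading vectors. A minimal sketch (the loading values are invented toy numbers):

```python
import numpy as np

def congruence(a, b):
    """Tucker's congruence coefficient between two loading vectors."""
    return np.dot(a, b) / np.sqrt(np.dot(a, a) * np.dot(b, b))

# Loadings from the full data vs. loadings recovered from censored data;
# in the study this is averaged over components and replications.
full = np.array([0.80, 0.65, 0.10, -0.40])
recovered = np.array([0.75, 0.70, 0.05, -0.35])
print(f"congruence = {congruence(full, recovered):.3f}")  # near 1 = good
```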

Authors: Amparo Baíllo; Javier Cárcamo; Konstantin Getman Abstract: The classification of X-ray sources into classes (such as extragalactic sources, background stars, ...) is an essential task in astronomy. Typically, one of the classes corresponds to extragalactic radiation, whose photon emission behaviour is well characterized by a homogeneous Poisson process. We propose to use normalized versions of the Wasserstein and Zolotarev distances to quantify the deviation of the distribution of photon interarrival times from the exponential class. Our main motivation is the analysis of a massive dataset from X-ray astronomy obtained by the Chandra Orion Ultradeep Project (COUP). This project yielded a large catalog of 1616 X-ray cosmic sources in the Orion Nebula region, with their series of photon arrival times and associated energies. We consider the plug-in estimators of these metrics, determine their asymptotic distributions, and illustrate their finite-sample performance with a Monte Carlo study. We then estimate these metrics for COUP sources from three different classes and conclude that our proposal provides a striking amount of information on the nature of the photon-emitting sources. Further, these variables are able to identify X-ray sources that were previously wrongly catalogued. As an appealing conclusion, we show that some sources previously classified as extragalactic emissions have a much higher probability of being young stars in the Orion Nebula. PubDate: 2018-01-18 DOI: 10.1007/s11634-018-0309-2
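The underlying quantity can be sketched as the L1 (Wasserstein-1) distance between the empirical distribution of the interarrival times and the exponential fitted by their mean. The paper's normalized versions and the Zolotarev metric are not reproduced here, and the grid-based integration is a crude plug-in:

```python
import numpy as np

def w1_to_exponential(arrival_times):
    """Wasserstein-1 distance between the empirical distribution of the
    interarrival times and the exponential fitted by their mean,
    computed as the integral of |F_n(t) - F_exp(t)| on a finite grid."""
    gaps = np.sort(np.diff(np.sort(arrival_times)))
    lam = 1.0 / gaps.mean()
    grid = np.linspace(0.0, gaps.max() * 1.5, 20000)
    F_n = np.searchsorted(gaps, grid, side='right') / len(gaps)
    F_exp = 1.0 - np.exp(-lam * grid)
    return np.trapz(np.abs(F_n - F_exp), grid)

# A homogeneous Poisson process should give a distance near zero.
rng = np.random.default_rng(2)
arrivals = np.cumsum(rng.exponential(scale=2.0, size=1000))
print(w1_to_exponential(arrivals))
```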