Authors:John C. Gower Pages: 33 - 41 Abstract: The paper gives a short account of how I became interested in analysing asymmetry in square tables. The early history of the canonical analysis of skew-symmetry and the associated development of its geometrical interpretation are described. PubDate: 2018-03-01 DOI: 10.1007/s11634-014-0181-7 Issue No:Vol. 12, No. 1 (2018)

Authors:Donatella Vicari Pages: 43 - 64 Abstract: A CLUstering model for SKew-symmetric data including EXTernal information (CLUSKEXT) is proposed, which relies on the decomposition of a skew-symmetric matrix into within and between cluster effects which are further decomposed into regression and residual effects when possible external information on the objects is available. In order to fit the imbalances between objects, the model jointly searches for a partition of objects and appropriate weights which are in turn linearly linked to the external variables. The proposal is fitted in a least-squares framework and a decomposition of the fit is derived. An appropriate Alternating Least-Squares algorithm is provided to fit the model to illustrative real and artificial data. PubDate: 2018-03-01 DOI: 10.1007/s11634-015-0203-0 Issue No:Vol. 12, No. 1 (2018)
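
The two abstracts above both start from the skew-symmetric part of a square table. As a minimal sketch of that starting point only (it is not the CLUSKEXT model or its Alternating Least-Squares algorithm), any square table X splits into a symmetric part (X + X^T)/2 and a skew-symmetric part (X - X^T)/2, and the second part carries the imbalances between objects. The function name and the example table are hypothetical.

```python
def split_symmetric_skew(table):
    """Split a square table into its symmetric and skew-symmetric parts."""
    n = len(table)
    if any(len(row) != n for row in table):
        raise ValueError("table must be square")
    # Symmetric part: average of each cell with its mirror cell.
    sym = [[(table[i][j] + table[j][i]) / 2 for j in range(n)] for i in range(n)]
    # Skew-symmetric part: half the difference between a cell and its mirror.
    skew = [[(table[i][j] - table[j][i]) / 2 for j in range(n)] for i in range(n)]
    return sym, skew


# Hypothetical flows between three objects (row = origin, column = destination).
flows = [
    [0, 8, 3],
    [2, 0, 5],
    [7, 1, 0],
]
sym, skew = split_symmetric_skew(flows)
```

A positive entry skew[i][j] means object i sends more to object j than it receives from it; the two parts always add back to the original table.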

Authors:Gunnar Carlsson; Facundo Mémoli; Alejandro Ribeiro; Santiago Segarra Pages: 65 - 105 Abstract: This paper considers networks where relationships between nodes are represented by directed dissimilarities. The goal is to study methods that, based on the dissimilarity structure, output hierarchical clusters, i.e., a family of nested partitions indexed by a connectivity parameter. Our construction of hierarchical clustering methods is built around the concept of admissible methods, which are those that abide by the axioms of value—nodes in a network with two nodes are clustered together at the maximum of the two dissimilarities between them—and transformation—when dissimilarities are reduced, the network may become more clustered but not less. Two particular methods, termed reciprocal and nonreciprocal clustering, are shown to provide upper and lower bounds in the space of admissible methods. Furthermore, alternative clustering methodologies and axioms are considered. In particular, modifying the axiom of value such that clustering in two-node networks occurs at the minimum of the two dissimilarities entails the existence of a unique admissible clustering method. Finally, the developed clustering methods are implemented to analyze the internal migration in the United States. PubDate: 2018-03-01 DOI: 10.1007/s11634-017-0299-5 Issue No:Vol. 12, No. 1 (2018)
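
The abstract above names reciprocal and nonreciprocal clustering without defining them. The sketch below assumes the usual reading of the two methods: reciprocal clustering symmetrizes each pair of directed dissimilarities to their maximum and then takes minimax chain costs, while nonreciprocal clustering takes minimax chain costs in each direction separately and then the larger of the two. These definitions, the function names, and the example network are assumptions for illustration, not text from the abstract.

```python
from itertools import product


def minimax_path_costs(d):
    """For every ordered pair, the smallest achievable 'largest hop' over directed chains."""
    n = len(d)
    cost = [[0 if i == j else d[i][j] for j in range(n)] for i in range(n)]
    # Floyd-Warshall variant: k (the intermediate node) is the outermost index.
    for k, i, j in product(range(n), repeat=3):
        cost[i][j] = min(cost[i][j], max(cost[i][k], cost[k][j]))
    return cost


def reciprocal(d):
    """Merge levels when every hop must be cheap in both directions."""
    n = len(d)
    sym = [[max(d[i][j], d[j][i]) for j in range(n)] for i in range(n)]
    return minimax_path_costs(sym)


def nonreciprocal(d):
    """Merge levels when the two directions may use different chains."""
    n = len(d)
    cost = minimax_path_costs(d)
    return [[max(cost[i][j], cost[j][i]) for j in range(n)] for i in range(n)]


# Hypothetical directed cycle: cheap one way round (1), expensive the other (5).
cycle = [
    [0, 1, 5],
    [5, 0, 1],
    [1, 5, 0],
]
upper = reciprocal(cycle)     # merge levels under reciprocal clustering
lower = nonreciprocal(cycle)  # merge levels under nonreciprocal clustering
```

In this example the nodes merge at level 1 under the nonreciprocal rule and at level 5 under the reciprocal rule, consistent with the lower and upper bounds mentioned in the abstract. On a two-node network both rules return the maximum of the two dissimilarities, as the axiom of value requires.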

Authors:Mark de Rooij Pages: 107 - 130 Abstract: Longitudinal categorical data are often collected using an experimental design where the interest is in the differential development of the treatment group compared to the control group. Such differential development is often assessed based on average growth curves but can also be based on transitions. For longitudinal multinomial data we describe a transitional methodology for the statistical analysis based on a distance model. Such a distance approach has two advantages compared to a multinomial regression model: (1) sparse data can be handled more efficiently; (2) a graphical representation of the model can be made to enhance interpretation. Within this approach it is possible to jointly model the observations and missing values by adding a new category to the response variable representing the missingness condition. This approach is investigated in a Monte Carlo simulation study. The results show this is a promising way to deal with missing data, although the mechanism is not yet completely understood in all cases. Finally, an empirical example is presented where the advantages of the modeling procedure are highlighted. PubDate: 2018-03-01 DOI: 10.1007/s11634-015-0226-6 Issue No:Vol. 12, No. 1 (2018)

Authors:Marti Sagarra; Frank M. T. A. Busing; Cecilio Mar-Molinero; Josep Rialp Pages: 131 - 153 Abstract: Spanish financial institutions have been heavily affected by the banking crisis that began in 2008. Many of them, especially Spanish savings banks (or Cajas), had to merge with other institutions or had to be rescued. We address the question of to what extent the nature of competition in this sector has changed as a result of the crisis. Although institutions compete in many ways, we concentrate on their presence on the high street through bank branches open to the public (i.e., retail banking competition). Our measure of inter-firm rivalry is based on a geographical proximity measure that we calculate for the years 2008 (before the crisis) and 2012 (the latest available data set). The technical approach is based on multidimensional unfolding, a methodology which allows us to graphically represent the asymmetric nature of such rivalry. These maps visualise the salient aspects of the system at the two dates analysed, and can be understood without detailed technical knowledge. PubDate: 2018-03-01 DOI: 10.1007/s11634-014-0186-2 Issue No:Vol. 12, No. 1 (2018)

Authors:Daniel Baier; Sarah Frost Pages: 155 - 171 Abstract: Brand confusion occurs when a consumer is exposed to an advertisement (ad) for brand A but believes that it is for brand B. If more consumers are confused in this direction than in the other one (assuming that an ad for B is for A), this asymmetry is a disadvantage for A. Consequently, the confusion potential and structure of ads have to be checked: a sample of consumers is exposed to a sample of ads. For each ad the consumers have to specify their guess about the advertised brand. Then, the collected data are aggregated and analyzed using, e.g., MDS or two-mode clustering. In this paper we compare this approach to a new one where image data analysis and classification are applied: the confusion potential and structure of ads are related to featurewise distances between ads and, to model asymmetric effects, to the strengths of the advertised brands. A sample application for the German beer market is presented; the results are encouraging. PubDate: 2018-03-01 DOI: 10.1007/s11634-017-0282-1 Issue No:Vol. 12, No. 1 (2018)

Abstract: Statisticians are already aware that any task (exploration, prediction) involving a modeling process is largely dependent on the measurement units for the data, to the extent that it should be impossible to provide a statistical outcome without specifying the couple (unit, model). In this work, this general principle is formalized with a particular focus on model-based clustering and co-clustering in the case of possibly mixed data types (continuous and/or categorical and/or count features), and this opportunity is used to revisit what the related data units are. Such a formalization allows us to raise three important points: (i) the couple (unit, model) is not identifiable, so that different unit/model interpretations of the same whole modeling process are always possible; (ii) combining different “classical” units with different “classical” models should be an interesting opportunity for a cheap, wide and meaningful expansion of the whole family of modeling processes defined by the couple (unit, model); (iii) if necessary, this couple, up to the non-identifiability property, could be selected by any traditional model selection criterion. Some experiments on real data sets illustrate in detail the practical benefits arising from these three points. PubDate: 2018-05-25 DOI: 10.1007/s11634-018-0325-2

Authors:Daniel Fernández; Richard Arnold; Shirley Pledger; Ivy Liu; Roy Costilla Abstract: Many of the methods which deal with clustering in matrices of data are based on mathematical techniques such as distance-based algorithms or matrix decomposition and eigenvalues. In general, it is not possible to use statistical inferences or select the appropriateness of a model via information criteria with these techniques because there is no underlying probability model. This article summarizes some recent model-based methodologies for matrices of binary, count, and ordinal data, which are modelled under a unified statistical framework using finite mixtures to group the rows and/or columns. The model parameter can be constructed from a linear predictor of parameters and covariates through link functions. This likelihood-based one-mode and two-mode fuzzy clustering provides maximum likelihood estimation of parameters and the options of using likelihood information criteria for model comparison. Additionally, a Bayesian approach is presented in which the parameters and the number of clusters are estimated simultaneously from their joint posterior distribution. Visualization tools focused on ordinal data, the fuzziness of the clustering structures, and analogies of various standard plots used in the multivariate analysis are presented. Finally, a set of future extensions is enumerated. PubDate: 2018-05-15 DOI: 10.1007/s11634-018-0324-3

Authors:Aghiles Salah; Mohamed Nadif Abstract: Co-clustering addresses the problem of simultaneous clustering of both dimensions of a data matrix. When dealing with high dimensional sparse data, co-clustering turns out to be more beneficial than one-sided clustering even if one is interested in clustering along one dimension only. Aside from being high dimensional and sparse, some datasets, such as document-term matrices, exhibit directional characteristics, and the \(L_2\) normalization of such data, so that it lies on the surface of a unit hypersphere, is useful. Popular co-clustering assumptions such as Gaussian or Multinomial are inadequate for this type of data. In this paper, we extend the scope of co-clustering to directional data. We present Diagonal Block Mixture of Von Mises–Fisher distributions (dbmovMFs), a co-clustering model which is well suited for directional data lying on a unit hypersphere. By setting the estimate of the model parameters under the maximum likelihood (ML) and classification ML approaches, we develop a class of EM algorithms for estimating dbmovMFs from data. Extensive experiments, on several real-world datasets, confirm the advantage of our approach and demonstrate the effectiveness of our algorithms. PubDate: 2018-04-30 DOI: 10.1007/s11634-018-0323-4
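
The abstract above relies on L2 normalization to place each row of a document-term matrix on the unit hypersphere. A minimal sketch of that preprocessing step follows; it does not implement dbmovMFs or its EM algorithms, and the function names and the example rows are hypothetical.

```python
import math


def l2_normalize(row):
    """Scale a vector to unit Euclidean length; an all-zero vector is returned unchanged."""
    norm = math.sqrt(sum(v * v for v in row))
    if norm == 0:
        return list(row)
    return [v / norm for v in row]


def cosine_similarity(u, v):
    """After L2 normalization, the dot product of two rows is their cosine similarity."""
    return sum(a * b for a, b in zip(l2_normalize(u), l2_normalize(v)))


# Hypothetical term counts: doc_b is doc_a repeated ten times over.
doc_a = [3, 0, 4]
doc_b = [30, 0, 40]
unit_a = l2_normalize(doc_a)
similarity = cosine_similarity(doc_a, doc_b)
```

Once normalized, only the direction of a row matters, so a document and a longer copy of it with the same term proportions map to the same point on the sphere.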

Authors:Gilles Celeux; Cathy Maugis-Rabusseau; Mohammed Sedki Abstract: Several methods for variable selection have been proposed in model-based clustering and classification. These make use of backward or forward procedures to define the roles of the variables. Unfortunately, such stepwise procedures are slow and the resulting algorithms inefficient when analyzing large data sets with many variables. In this paper, we propose an alternative regularization approach for variable selection in model-based clustering and classification. In our approach the variables are first ranked using a lasso-like procedure in order to avoid slow stepwise algorithms. Thus, the variable selection methodology of Maugis et al. (Comput Stat Data Anal 53:3872–3882, 2009b) can be efficiently applied to high-dimensional data sets. PubDate: 2018-04-11 DOI: 10.1007/s11634-018-0322-5

Authors:José R. Berrendero; Javier Cárcamo Abstract: We obtain a decomposition of any quadratic classifier in terms of products of hyperplanes. These hyperplanes can be viewed as relevant linear components of the quadratic rule (with respect to the underlying classification problem). As an application, we introduce the associated multidirectional classifier; a piecewise linear classification rule induced by the approximating products. Such a classifier is useful to determine linear combinations of the predictor variables with ability to discriminate. We also show that this classifier can be used as a tool to reduce the dimension of the data and helps identify the most important variables to classify new elements. Finally, we illustrate with a real data set the use of these linear components to construct oblique classification trees. PubDate: 2018-04-07 DOI: 10.1007/s11634-018-0321-6

Authors:Shuji Ando; Kouji Tahata; Sadao Tomizawa Abstract: For square contingency tables, a double symmetry model having a matrix structure that combines both symmetry and point symmetry was proposed. Also, an index which represents the degree of departure from double symmetry was proposed. However, this index cannot simultaneously characterize the degree of departure from symmetry and the degree of departure from point symmetry. For measuring the degree of departure from double symmetry, the present paper proposes a bivariate index vector that can simultaneously characterize the degree of departure from symmetry and the degree of departure from point symmetry. PubDate: 2018-03-26 DOI: 10.1007/s11634-018-0320-7
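
The abstract above combines two structures in a square table. The sketch below assumes the standard definitions: symmetry means cell (i, j) equals cell (j, i), and point symmetry means cell (i, j) equals the cell reflected through the centre of the table, (r+1-i, r+1-j) for an r x r table. It only checks whether a table has each structure exactly; it does not compute the proposed bivariate index. The function names and example tables are hypothetical.

```python
def is_symmetric(table):
    """True when every cell equals its mirror across the main diagonal."""
    n = len(table)
    return all(table[i][j] == table[j][i] for i in range(n) for j in range(n))


def is_point_symmetric(table):
    """True when every cell equals its reflection through the centre of the table."""
    n = len(table)
    return all(
        table[i][j] == table[n - 1 - i][n - 1 - j]
        for i in range(n)
        for j in range(n)
    )


def is_double_symmetric(table):
    """Double symmetry is taken here as symmetry and point symmetry holding together."""
    return is_symmetric(table) and is_point_symmetric(table)


# Hypothetical 3 x 3 tables of counts.
both = [[1, 2, 3], [2, 5, 2], [3, 2, 1]]
only_symmetric = [[1, 2, 3], [2, 4, 5], [3, 5, 6]]
only_point_symmetric = [[1, 2, 3], [4, 5, 4], [3, 2, 1]]
```

The second and third tables each fail exactly one of the two conditions, which is the situation a single scalar index cannot distinguish and a two-component index can.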

Authors:Qiang Wang; Thanh-Tung Nguyen; Joshua Z. Huang; Thuy Thi Nguyen Abstract: In this paper, we propose a new random forest (RF) algorithm to deal with high-dimensional data for classification, using a subspace feature sampling method and feature value searching. The new subspace sampling method maintains the diversity and randomness of the forest and enables one to generate trees with a lower prediction error. A greedy technique is used to handle high-cardinality categorical features for efficient node splitting when building decision trees in the forest. This allows trees to handle very high cardinality while reducing the computational time needed to build the RF model. Extensive experiments on high-dimensional real data sets, including standard machine learning data sets and image data sets, have been conducted. The results demonstrate that the proposed approach for learning RFs significantly reduces prediction errors and outperforms most existing RFs when dealing with high-dimensional data. PubDate: 2018-03-21 DOI: 10.1007/s11634-018-0318-1

Authors:Matthijs J. Warrens; Alexandra de Raadt Abstract: Cohen’s kappa is the most widely used coefficient for assessing interobserver agreement on a nominal scale. An alternative coefficient for quantifying agreement between two observers is Bangdiwala’s B. To provide a proper interpretation of an agreement coefficient one must first understand its meaning. Properties of the kappa coefficient have been extensively studied and are well documented. Properties of coefficient B have been studied, but not extensively. In this paper, various new properties of B are presented. Category B-coefficients are defined that are the basic building blocks of B. It is studied how coefficient B, Cohen’s kappa, the observed agreement and associated category coefficients may be related. It turns out that the relationships between the coefficients are quite different for \(2\times 2\) tables than for agreement tables with three or more categories. PubDate: 2018-03-19 DOI: 10.1007/s11634-018-0319-0

Authors:Wan-Lun Wang; Luis M. Castro; Yen-Ting Chang; Tsung-I Lin Abstract: Mixtures of common t factor analyzers (MCtFA) have been shown to be effective in robustifying mixtures of common factor analyzers (MCFA) when handling model-based clustering of high-dimensional data with heavy tails. However, the MCtFA model may still suffer from a lack of robustness against observations whose distributions are highly asymmetric. This paper presents a further robust extension of the MCFA and MCtFA models, called the mixture of common restricted skew-t factor analyzers (MCrstFA), by assuming a restricted multivariate skew-t distribution for the common factors. The MCrstFA model can be used to accommodate severely non-normal (skewed and leptokurtic) random phenomena while preserving its parsimony in factor-analytic representation and allowing graphical visualization in low-dimensional plots. A computationally feasible expectation conditional maximization either algorithm is developed to carry out maximum likelihood estimation. The numbers of factors and mixture components are simultaneously determined based on common penalized likelihood criteria. The usefulness of our proposed model is illustrated with simulated and real datasets, and experimental results indicate its superiority over some existing competitors. PubDate: 2018-03-08 DOI: 10.1007/s11634-018-0317-2

Authors:Ravi Sankar Sangam; Hari Om Abstract: In a data stream environment, most conventional clustering algorithms are not sufficiently efficient, since large volumes of data arrive in a stream and these data points unfold with time. The problems of clustering time-evolving metric data and time-evolving categorical data have each been well explored in recent years, but the problem of clustering mixed-type time-evolving data remains a challenging issue due to an awkward gap between the structure of metric and categorical attributes. In this paper, we devise a generalized framework, termed Equi-Clustream, to dynamically cluster mixed-type time-evolving data. It comprises three algorithms: a Hybrid Drifting Concept Detection Algorithm that detects the drifting concept between the current sliding window and the previous sliding window, a Hybrid Data Labeling Algorithm that assigns an appropriate cluster label to each data vector of the current non-drifting window based on the clustering result of the previous sliding window, and a visualization algorithm that analyses the relationship between the clusters at different timestamps and also visualizes the evolving trends of the clusters. The efficacy of the proposed framework is shown by experiments on synthetic and real-world datasets. PubDate: 2018-02-26 DOI: 10.1007/s11634-018-0316-3

Authors:Christian Carmona; Luis Nieto-Barajas; Antonio Canale Abstract: The Ministry of Social Development in Mexico is in charge of creating and assigning social programmes targeting specific needs in the population for the improvement of the quality of life. To better target the social programmes, the Ministry aims to find clusters of households with the same needs based on demographic characteristics as well as the poverty conditions of the household. The available data consist of continuous, ordinal, and nominal variables, all of which come from a non-i.i.d. complex design survey sample. We propose a Bayesian nonparametric mixture model that jointly models a set of latent variables, as in an underlying variable response approach, associated with the observed mixed-scale data, and that accommodates the different sampling probabilities. The performance of the model is assessed via simulated data. A full analysis of socio-economic conditions in households in the Mexican State of Mexico is presented. PubDate: 2018-02-08 DOI: 10.1007/s11634-018-0313-6

Authors:Edoardo Otranto; Massimo Mucciardi Abstract: The STAR model is widely used to represent the dynamics of a certain variable recorded at several locations at the same time. Its advantages are often discussed in terms of parsimony with respect to space-time VAR structures because it considers a single coefficient for each time and spatial lag. This hypothesis can be very strong; we add a certain degree of flexibility to the STAR model, providing the possibility for coefficients to vary in groups of locations. The new class of models (called Flexible STAR–FSTAR) is compared to the classical STAR and the space-time VAR by simulations and an application. PubDate: 2018-02-07 DOI: 10.1007/s11634-018-0314-5

Authors:Giuseppe Bove; Akinori Okada Abstract: Asymmetric pairwise relationships are frequently observed in experimental and non-experimental studies. They can be analysed with different aims and approaches. A brief review of models and methods of multidimensional scaling and cluster analysis able to deal with asymmetric proximities is provided taking a ‘data-analytic’ approach and emphasizing data visualization. PubDate: 2018-02-01 DOI: 10.1007/s11634-017-0307-9