Abstract: We propose a new stochastic block model that focuses on the analysis of interaction lengths in dynamic networks. The model does not rely on a discretization of the time dimension and may be used to analyze networks that evolve continuously over time. The framework relies on a clustering structure on the nodes, whereby two nodes belonging to the same latent group tend to create interactions and non-interactions of similar lengths. We introduce a variational expectation–maximization algorithm to perform inference, and adapt a widely used clustering criterion to perform model choice. Finally, we validate our methodology with simulated data experiments and show two illustrative applications concerning face-to-face interaction data and a bike sharing network. PubDate: 2020-06-18

Abstract: We analyze the dynamic structure of lower tail dependence coefficients within groups of assets defined such that assets belonging to the same group are characterized by pairwise high association between extremely low values. The groups are identified by means of a fuzzy cluster analysis algorithm. The tail dependence coefficients are estimated using the Joe–Clayton copula function, and the 75th percentile within clusters is used as a measure of each cluster’s overall tail dependence. The interdependence structure of the clusters’ tail dependence dynamics is then analyzed in order to determine whether the pattern of a cluster can be predicted based on the past values of the others, using a Granger causality approach. The hypothesis of possible regime-switching dynamics in tail dependence is also investigated by means of a Threshold Vector AutoRegressive model, and the results are compared to those obtained with a linear autoregression. The whole procedure is described with reference to a case study dealing with the assets composing the Eurostoxx 50, but it can be viewed as a general method that can be applied to any set of asset-return time series. PubDate: 2020-06-16
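For reference, the Joe–Clayton (BB7) copula used above has closed-form tail dependence coefficients. A minimal sketch, assuming the usual parameterisation with \(\kappa \ge 1\) and \(\delta > 0\) (function and variable names are ours, not the paper's):

```python
def joe_clayton_tail_dependence(kappa, delta):
    """Tail dependence coefficients implied by a Joe-Clayton (BB7)
    copula with parameters kappa >= 1 and delta > 0:
    lower tail  lambda_L = 2**(-1/delta)
    upper tail  lambda_U = 2 - 2**(1/kappa)
    """
    lam_lower = 2.0 ** (-1.0 / delta)
    lam_upper = 2.0 - 2.0 ** (1.0 / kappa)
    return lam_lower, lam_upper

# Lower-tail clustering strengthens as delta grows:
lo_weak, _ = joe_clayton_tail_dependence(kappa=1.5, delta=0.5)
lo_strong, _ = joe_clayton_tail_dependence(kappa=1.5, delta=2.0)
assert lo_weak < lo_strong
```

In the procedure described above, such coefficients would be estimated pairwise and summarised within each fuzzy cluster by their 75th percentile.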

Abstract: We present a novel nonparametric Bayesian approach for performing cluster analysis in a context where observational units have data arising from multiple sources. Our approach uses a particle Gibbs sampler for inference in which cluster allocations are jointly updated using a conditional particle filter within a Gibbs sampler, improving the mixing of the MCMC chain. We develop several approaches to improving the computational performance of our algorithm. These methods can achieve greater than an order-of-magnitude improvement in performance at no cost to accuracy and can be applied more broadly to Bayesian inference for mixture models with a single dataset. We apply our algorithm to the discovery of risk cohorts amongst 243 patients presenting with kidney renal clear cell carcinoma, using samples from the Cancer Genome Atlas, for which there are gene expression, copy number variation, DNA methylation, protein expression and microRNA data. We identify 4 distinct consensus subtypes and show they are prognostic for survival rate (\(p < 0.0001\)). PubDate: 2020-06-12

Abstract: In many real classification problems a monotone relation between some predictors and the classes may be assumed, when higher (or lower) values of those predictors are related to higher levels of the response. In this paper, we propose new boosting algorithms, based on LogitBoost, that incorporate this isotonicity information, yielding more accurate and easily interpretable rules. These algorithms are based on theoretical developments that consider isotonic regression. We show the good performance of these procedures not only on simulations, but also on real data sets coming from two very different contexts, namely cancer diagnosis and failure of induction motors. PubDate: 2020-06-12
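Since the proposed boosting algorithms build on isotonic regression, a generic pool-adjacent-violators (PAVA) fit may help fix ideas; this is a textbook sketch, not the authors' LogitBoost variant:

```python
def isotonic_regression(y):
    """Pool-adjacent-violators algorithm (PAVA): least-squares fit of a
    non-decreasing sequence to y. Each block stores (sum, count);
    adjacent blocks are merged while their means violate monotonicity."""
    blocks = []  # list of [total, count]
    for v in y:
        blocks.append([v, 1])
        while (len(blocks) > 1
               and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]):
            s, c = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += c
    fit = []
    for s, c in blocks:
        fit.extend([s / c] * c)
    return fit

print(isotonic_regression([3, 1, 2, 5]))  # pools 3, 1, 2 into their mean: [2.0, 2.0, 2.0, 5.0]
```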

Abstract: Different approaches to robustly measure the location of data associated with a random experiment have been proposed in the literature, with the aim of avoiding the high sensitivity to outliers or data changes typical of the mean. In particular, M-estimators and trimmed means have been studied in general spaces, and can be used to handle Hilbert-valued data. Both alternatives are of interest due to their success in the classical framework. Since fuzzy set-valued data can be identified with a convex cone of a separable Hilbert space, the previous concepts have recently been applied to the one-dimensional fuzzy case. The aim of this paper is to extend M-estimators and trimmed means to p-dimensional fuzzy set-valued data, and to prove theoretically that they inherit robustness from the real settings. Some of these theoretical results are more general and apply directly to Hilbert-valued estimators and, in consequence, to functional data. A real-life example is also included to illustrate the computation and behaviour of these estimators under contamination. PubDate: 2020-06-12
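As a scalar illustration of the robustness being extended here, a symmetrically trimmed mean can be sketched as follows; the paper's estimators act on fuzzy set-valued and Hilbert-valued data, so this one-dimensional analogue only shows the trimming idea:

```python
def trimmed_mean(data, alpha):
    """Symmetrically trimmed mean: drop the floor(alpha * n) smallest
    and largest observations, then average the rest -- a scalar
    analogue of the Hilbert-valued trimmed means discussed above."""
    xs = sorted(data)
    k = int(alpha * len(xs))
    kept = xs[k:len(xs) - k] if k > 0 else xs
    return sum(kept) / len(kept)

# A single gross outlier barely moves the trimmed mean:
print(trimmed_mean([1.0, 2.0, 3.0, 4.0, 500.0], 0.2))  # 3.0
```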

Abstract: Data-driven algorithms stand and fall with the availability and quality of existing data sources. Both can be limited in high-dimensional settings (\(n \gg m\)). For example, supervised learning algorithms designed for molecular pheno- or genotyping are restricted to samples of the corresponding diagnostic classes. Samples of other related entities, such as those arising in differential diagnosis, are usually not utilized in this learning scheme. Nevertheless, they might provide domain knowledge on the background or context of the original diagnostic task. In this work, we discuss the possibility of incorporating samples of foreign classes, as related to the task of differential diagnosis, in the training of diagnostic classification models. Especially in heterogeneous data collections comprising multiple diagnostic categories, the foreign classes can substantially change the number of available samples. More precisely, we utilize this information for the internal feature selection process of diagnostic models. We propose the use of chained correlations of original and foreign diagnostic classes. This method allows the detection of intermediate foreign classes by evaluating the correlation between class labels and features for each pair of original and foreign categories. Interestingly, this criterion does not require direct comparisons of the initial diagnostic groups and therefore might be suitable for settings with restricted data access. PubDate: 2020-06-09
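The pairwise criterion described above correlates class labels with features for each pair of one original and one foreign category. A hypothetical sketch (the exact correlation measure and label encoding used in the paper may differ):

```python
import math

def pearson(x, y):
    """Plain Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = math.sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))
    return num / den

def pairwise_label_correlation(feature, labels, cls_a, cls_b):
    """Correlation between one feature and a binary label built from a
    single pair of classes (e.g. one original, one foreign). Samples of
    all other classes are ignored, so no direct comparison of the
    original diagnostic groups is needed."""
    xs = [f for f, l in zip(feature, labels) if l in (cls_a, cls_b)]
    ys = [1.0 if l == cls_b else 0.0 for l in labels if l in (cls_a, cls_b)]
    return pearson(xs, ys)

feat = [0.1, 0.2, 0.9, 1.0, 0.5]
labs = ["A", "A", "B", "B", "C"]  # "C" plays the foreign class here
print(round(pairwise_label_correlation(feat, labs, "A", "B"), 3))
```

Repeating this over all (original, foreign) pairs yields the chain of correlations from which intermediate foreign classes could be detected.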

Abstract: In recent years, data have become increasingly higher dimensional and, therefore, an increased need has arisen for dimension reduction techniques for clustering. Although such techniques are firmly established in the literature for multivariate data, there is a relative paucity in the area of matrix variate, or three-way, data. Furthermore, the few methods that are available all assume matrix variate normality, which is not always sensible if cluster skewness or excess kurtosis is present. Mixtures of bilinear factor analyzers using skewed matrix variate distributions are proposed. In all, four such mixture models are presented, based on matrix variate skew-t, generalized hyperbolic, variance-gamma, and normal inverse Gaussian distributions, respectively. PubDate: 2020-06-01

Abstract: We consider model-based clustering methods for continuous, correlated data that account for external information available in the presence of mixed-type fixed covariates by proposing the MoEClust suite of models. These models allow different subsets of covariates to influence the component weights and/or component densities by modelling the parameters of the mixture as functions of the covariates. A familiar range of constrained eigen-decomposition parameterisations of the component covariance matrices is also accommodated. This paper thus addresses the equivalent aims of including covariates in Gaussian parsimonious clustering models and incorporating parsimonious covariance structures into all special cases of the Gaussian mixture of experts framework. The MoEClust models demonstrate significant improvement from both perspectives in applications to both univariate and multivariate data sets. Novel extensions to include a uniform noise component for capturing outliers and to address initialisation of the EM algorithm, model selection, and the visualisation of results are also proposed. PubDate: 2020-06-01
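Covariate-dependent component weights can be pictured as a multinomial-logistic (softmax) gating function, as is standard in mixtures of experts; a sketch with illustrative, unfitted coefficients:

```python
import math

def gating_weights(x, coefs):
    """Mixture-of-experts style gating: component weights are a
    multinomial-logistic (softmax) function of the covariate vector x.
    `coefs` holds one (intercept, slopes...) row per component; the
    values used below are illustrative, not fitted parameters."""
    scores = [b0 + sum(b * xi for b, xi in zip(bs, x)) for b0, *bs in coefs]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]  # shift by max for stability
    total = sum(exps)
    return [e / total for e in exps]

w = gating_weights([1.0, -0.5],
                   [(0.0, 0.2, 0.1), (0.5, -0.3, 0.0), (0.0, 0.0, 0.0)])
assert abs(sum(w) - 1.0) < 1e-12 and all(wi > 0 for wi in w)
```

Letting only a subset of covariates enter the gating (and others the component densities) is exactly the kind of choice the model suite above parameterises.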

Abstract: Linear regression models based on finite Gaussian mixtures represent a flexible tool for the analysis of linear dependencies in multivariate data. They are suitable for dealing with correlated response variables when data come from a heterogeneous population composed of two or more sub-populations, each of which is characterised by a different linear regression model. Several types of finite mixtures of linear regression models have been specified by changing the assumptions on the parameters that differentiate the sub-populations and/or the vectors of regressors that affect the response variables. The class of models defined by mixtures of seemingly unrelated Gaussian linear regressions, illustrated in this paper, makes them more flexible: these models allow the researcher to use a different vector of regressors for each dependent variable. The proposed class includes parsimonious models obtained by imposing suitable constraints on the variances and covariances of the response variables in the sub-populations. Details of model identification and maximum likelihood estimation are given. The usefulness of these models is shown through the analysis of a real dataset. Regularity conditions for the model class are illustrated, and a proof is provided that, when these conditions are met, the consistency of the maximum likelihood estimator under the examined models is ensured. In addition, the behaviour of this estimator in finite samples is numerically evaluated through the analysis of simulated datasets. PubDate: 2020-06-01

Abstract: In a standard classification framework a set of trustworthy learning data are employed to build a decision rule, with the final aim of classifying unlabelled units belonging to the test set. Therefore, unreliable labelled observations, namely outliers and data with incorrect labels, can strongly undermine the classifier performance, especially if the training size is small. The present work introduces a robust modification to the Model-Based Classification framework, employing impartial trimming and constraints on the ratio between the maximum and the minimum eigenvalue of the group scatter matrices. The proposed method effectively handles noise presence in both response and explanatory variables, providing reliable classification even when dealing with contaminated datasets. A robust information criterion is proposed for model selection. Experiments on real and simulated data, artificially adulterated, are provided to underline the benefits of the proposed method. PubDate: 2020-06-01
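The impartial-trimming idea can be sketched as: score every observation by its log-density under the current mixture fit and discard the worst-fitting fraction. A minimal one-dimensional Gaussian sketch with illustrative parameters (not the authors' estimator, which also constrains the scatter-matrix eigenvalue ratio):

```python
import math

def trim_lowest_density(data, means, sds, weights, alpha):
    """Impartial-trimming step (sketch): score each point by its
    log-density under the current Gaussian mixture and discard the
    floor(alpha * n) worst-fitting points. Parameter values used
    below are illustrative, not fitted."""
    def log_density(x):
        dens = sum(w * math.exp(-0.5 * ((x - m) / s) ** 2) / (s * math.sqrt(2 * math.pi))
                   for m, s, w in zip(means, sds, weights))
        return math.log(dens) if dens > 0 else -math.inf
    scored = sorted(data, key=log_density)
    return scored[int(alpha * len(data)):]

data = [0.1, -0.2, 0.3, 5.1, 4.9, 100.0]  # 100.0 plays the gross outlier
kept = trim_lowest_density(data, means=[0.0, 5.0], sds=[1.0, 1.0],
                           weights=[0.5, 0.5], alpha=0.2)
assert 100.0 not in kept
```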

Abstract: Many relevant multidimensional phenomena are defined by nested latent concepts, which can be represented by a tree structure supposing a hierarchical relationship among manifest variables. The root of the tree is a general concept which includes more specific ones. The aim of the paper is to reconstruct an observed correlation matrix of manifest variables through an ultrametric correlation matrix which is able to pinpoint the hierarchical nature of the phenomenon under study. To this end, we introduce a novel model which detects consistent latent concepts and their relationships starting from the observed correlation matrix. PubDate: 2020-05-28
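An ultrametric correlation (similarity) matrix satisfies, for every triple of variables, \(S_{ik} \ge \min(S_{ij}, S_{jk})\), which is what makes it representable by a tree. A small checker sketch of that condition (our naming, not the paper's model):

```python
def is_ultrametric(S, tol=1e-9):
    """Check the ultrametric condition for a similarity/correlation
    matrix S: for every triple (i, j, k) of distinct indices,
    S[i][k] >= min(S[i][j], S[j][k]) up to a numerical tolerance."""
    n = len(S)
    for i in range(n):
        for j in range(n):
            for k in range(n):
                if len({i, j, k}) == 3 and S[i][k] < min(S[i][j], S[j][k]) - tol:
                    return False
    return True

# Two tight blocks {0, 1} and {2, 3} joined at a lower correlation level:
S = [[1.0, 0.8, 0.3, 0.3],
     [0.8, 1.0, 0.3, 0.3],
     [0.3, 0.3, 1.0, 0.7],
     [0.3, 0.3, 0.7, 1.0]]
assert is_ultrametric(S)
```

The model in the abstract fits the nearest such matrix to the observed correlations, so that the fitted blocks reveal the nested latent concepts.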

Abstract: Examining the efficacy of composite-based structural equation modeling (SEM) features prominently in research. However, studies analyzing the efficacy of corresponding estimators usually rely on factor model data. Thereby, they assess and analyze their performance on erroneous grounds (i.e., factor model data instead of composite model data). A potential reason for this malpractice lies in the lack of available composite model-based data generation procedures for prespecified model parameters in the structural model and the measurement models. Addressing this gap in research, we derive model formulations and present a composite model-based data generation approach. The findings will assist researchers in their composite-based SEM simulation studies. PubDate: 2020-05-26

Abstract: Topic detection in short textual data is a challenging task due to its representation as a high-dimensional and extremely sparse document-term matrix. In this paper we focus on the problem of classifying textual data on the basis of their (unique) topic. For unsupervised classification, a popular approach called Mixture of Unigrams considers a mixture of multinomial distributions over the word counts, each component corresponding to a different topic. The multinomial distribution can be easily extended, via a Dirichlet prior, to the compound mixture of Dirichlet-Multinomial distributions, which is preferable for sparse data. We propose a gradient descent estimation method for fitting the model, and investigate supervised and unsupervised classification performance on real empirical problems. PubDate: 2020-05-25
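The Dirichlet-Multinomial component density has a closed form in terms of Gamma functions; a minimal sketch of its log-probability (the standard compound-multinomial formula, not the authors' fitting code):

```python
from math import lgamma, exp

def dirichlet_multinomial_logpmf(counts, alpha):
    """Log-probability of a word-count vector under a
    Dirichlet-Multinomial (compound multinomial) distribution with
    concentration vector alpha, via the Gamma-function identity."""
    n = sum(counts)
    a0 = sum(alpha)
    lp = lgamma(a0) - lgamma(n + a0) + lgamma(n + 1)
    for c, a in zip(counts, alpha):
        lp += lgamma(c + a) - lgamma(a) - lgamma(c + 1)
    return lp

# Sanity check on a 2-word vocabulary: the probabilities of both
# outcomes of a length-1 document sum to 1.
p = [exp(dirichlet_multinomial_logpmf(c, [0.5, 0.5])) for c in ([1, 0], [0, 1])]
assert abs(sum(p) - 1.0) < 1e-12
```

In the mixture described above, each topic would carry its own concentration vector, and the overall likelihood is a weighted sum of such component terms.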

Abstract: Mixture models are among the most popular tools for clustering. However, when the dimension and the number of clusters are large, the estimation of the clusters becomes challenging, as does their interpretation. Restrictions on the parameters can be used to reduce the dimension. An example is given by mixtures of factor analyzers (MFA) for Gaussian mixtures. The extension of MFA to non-Gaussian mixtures is not straightforward. We propose a new constraint for the parameters of non-Gaussian mixture models: the K component parameter vectors are combinations of elements from a small dictionary, say H elements, with \(H \ll K\). Including a nonnegative matrix factorization (NMF) in the EM algorithm allows us to simultaneously estimate the dictionary and the parameters of the mixture. We propose the acronym NMF-EM for this algorithm, implemented in the R package nmfem. This original approach is motivated by passenger clustering from ticketing data: we apply NMF-EM to data from two Transdev public transport networks. In this case, the words are easily interpreted as typical slots in a timetable. PubDate: 2020-05-25
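The constraint can be pictured as factorizing the K-by-dimension matrix of component parameters into nonnegative loadings times a small dictionary. A generic Lee–Seung multiplicative-update NMF sketch (not the nmfem implementation, which couples the factorization with EM):

```python
import random

def matmul(A, B):
    """Plain list-of-lists matrix product."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def nmf(V, rank, iters=200, seed=0):
    """Nonnegative factorization V ~ W @ H via Lee-Seung multiplicative
    updates -- a generic sketch. In the setting above, the rows of V
    would be the K component parameter vectors, H the dictionary of
    `rank` shared elements, and W the nonnegative loadings."""
    rng = random.Random(seed)
    n, m = len(V), len(V[0])
    W = [[rng.random() + 0.1 for _ in range(rank)] for _ in range(n)]
    H = [[rng.random() + 0.1 for _ in range(m)] for _ in range(rank)]
    eps = 1e-9  # guards against division by zero
    for _ in range(iters):
        Wt = [list(r) for r in zip(*W)]
        num, den = matmul(Wt, V), matmul(Wt, matmul(W, H))
        H = [[H[i][j] * num[i][j] / (den[i][j] + eps) for j in range(m)]
             for i in range(rank)]
        Ht = [list(r) for r in zip(*H)]
        num, den = matmul(V, Ht), matmul(matmul(W, H), Ht)
        W = [[W[i][j] * num[i][j] / (den[i][j] + eps) for j in range(rank)]
             for i in range(n)]
    return W, H

# A rank-one nonnegative matrix is recovered almost exactly:
V = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0]]
W, H = nmf(V, rank=1)
R = matmul(W, H)
assert max(abs(V[i][j] - R[i][j]) for i in range(3) for j in range(2)) < 0.05
```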

Abstract: There is a need for models that are able to account for discreteness in data, along with its time series properties and correlation. Our focus falls on INteger-valued AutoRegressive (INAR) type models, which can be used in conjunction with existing model-based clustering techniques to cluster discrete-valued time series data. With the use of a finite mixture model, several existing techniques, such as the selection of the number of clusters, estimation via expectation-maximization, and model selection, are applicable. The proposed model is then demonstrated on real data to illustrate its clustering applications. PubDate: 2020-05-20
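An INAR(1) series is built from binomial thinning of the previous count plus an innovation count; a simulation sketch under Poisson innovations (an illustrative choice, not necessarily the paper's specification):

```python
import math
import random

def simulate_inar1(n, alpha, lam, seed=1):
    """Simulate an INAR(1) series X_t = alpha o X_{t-1} + eps_t, where
    'o' is binomial thinning (each count in X_{t-1} survives with
    probability alpha) and eps_t ~ Poisson(lam). A minimal sketch of
    the kind of count series the clustering framework targets."""
    rng = random.Random(seed)
    def poisson(l):
        # Knuth's product method; adequate for small lam.
        limit, k, p = math.exp(-l), 0, 1.0
        while True:
            p *= rng.random()
            if p <= limit:
                return k
            k += 1
    x = [poisson(lam)]
    for _ in range(n - 1):
        survivors = sum(1 for _ in range(x[-1]) if rng.random() < alpha)
        x.append(survivors + poisson(lam))
    return x

series = simulate_inar1(200, alpha=0.5, lam=2.0)
assert all(v >= 0 for v in series)  # counts stay nonnegative integers
```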

Abstract: Finite mixtures present a powerful tool for modeling complex heterogeneous data. One of their most important applications is model-based clustering. It assumes that each data group can be reasonably described by one mixture model component. This establishes a one-to-one relationship between mixture components and clusters. In some cases, however, this relationship can be broken due to the presence of observations from the same class recorded in different ways. This effect can occur because of recording inconsistencies due to the use of different scales, operator errors, or simply various recording styles. The idea presented in this paper aims to alleviate this issue through modifications incorporated into mixture models. While the proposed methodology is applicable to a broad class of mixture models, in this paper it is illustrated on Gaussian mixtures. Several simulation studies and an application to a real-life data set are considered, yielding promising results. PubDate: 2020-05-12

Abstract: In this article, we propose two classes of semiparametric single-index mixture regression models for model-based clustering. Unlike many semiparametric/nonparametric mixture regression models that can only be applied to low-dimensional predictors, the new semiparametric models can easily incorporate high-dimensional predictors into the nonparametric components. The proposed models are very general, and many of the recently proposed semiparametric/nonparametric mixture regression models are indeed special cases of the new models. Backfitting estimates and the corresponding modified EM algorithms are proposed to achieve optimal convergence rates for both the parametric and nonparametric parts. We establish the identifiability of the two proposed models and investigate the asymptotic properties of the proposed estimation procedures. Simulation studies are conducted to demonstrate the finite sample performance of the proposed models. Two real data applications using the new models reveal some interesting findings. PubDate: 2020-04-23

Abstract: In this paper, a new flexible approach to modeling data with multiple partial right-censoring points is proposed. The method is based on finite mixture models, a flexible tool for modeling heterogeneity in data. A general framework to accommodate partial censoring is considered. In this setting, it is assumed that a certain portion of the data points are censored and the rest are not, a situation that occurs in many insurance loss data sets. A novel probability function is proposed for use as a mixture component, and the expectation-maximization algorithm is employed for estimating the model parameters. The Bayesian information criterion is used for model selection. Additionally, an approach for assessing the variability of parameter estimates, as well as for computing the quantiles commonly known as risk measures, is considered. The proposed model is evaluated using a simulation study based on four common probability distributions used to model right-skewed loss data, and is applied to a real data set with good results. PubDate: 2020-04-21
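Under partial right censoring, each observation's likelihood contribution splits: the mixture density f(x) if the loss is fully observed, the survival function S(x) = P(X &gt; x) if it is censored at x. A sketch using an exponential mixture as an illustrative stand-in for the paper's component density:

```python
import math

def censored_mixture_loglik(obs, weights, rates):
    """Log-likelihood for an exponential mixture under partial right
    censoring (an illustrative stand-in for the paper's novel component
    density). `obs` is a list of (value, is_censored) pairs: uncensored
    points contribute the mixture density f(x), censored points the
    survival function S(x)."""
    ll = 0.0
    for x, censored in obs:
        if censored:
            contrib = sum(w * math.exp(-r * x) for w, r in zip(weights, rates))
        else:
            contrib = sum(w * r * math.exp(-r * x) for w, r in zip(weights, rates))
        ll += math.log(contrib)
    return ll

obs = [(0.5, False), (1.2, False), (2.0, True)]  # last loss capped at 2.0
print(censored_mixture_loglik(obs, weights=[0.7, 0.3], rates=[1.0, 0.2]))
```

The EM algorithm mentioned above would alternate between such likelihood evaluations and parameter updates; the weights and rates here are illustrative inputs, not estimates.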

Abstract: Multivariate scale mixtures of skew-normal distributions are flexible models that account for the non-normality of data by means of a tail weight parameter and a shape vector representing the asymmetry of the model in a directional fashion. Their stochastic representation involves a skew-normal vector and a nonnegative mixing scalar variable, independent of the skew-normal vector, that injects tail weight behavior into the model. In this paper we look into the problem of finding the projection that maximizes skewness for vectors that follow a scale mixture of skew-normal distributions; when a simple condition on the moments of the mixing variable is fulfilled, it can be shown that the direction yielding the maximal skewness is proportional to the shape vector. This finding stresses the directional role of the shape vector in regulating the asymmetry; it also provides the theoretical foundations motivating the skewness-based projection pursuit problem in this class of distributions. Some examples illustrating the application of our results are also given; they include a simulation experiment with artificial data, which sheds light on the usefulness and implications of our results, and an application to real data. PubDate: 2020-03-10
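The projection-pursuit objective here is the sample skewness of the projected data; a sketch of evaluating it along a candidate direction (per the result above, the maximizing direction is proportional to the shape vector):

```python
import math

def projection_skewness(data, direction):
    """Sample skewness of data projected onto a direction vector:
    the standardised third central moment of <x, direction>."""
    norm = math.sqrt(sum(d * d for d in direction))
    proj = [sum(x * d for x, d in zip(row, direction)) / norm for row in data]
    n = len(proj)
    mu = sum(proj) / n
    m2 = sum((p - mu) ** 2 for p in proj) / n
    m3 = sum((p - mu) ** 3 for p in proj) / n
    return m3 / m2 ** 1.5

# Asymmetry concentrated along the first axis:
data = [(0.0, 1.0), (0.0, -1.0), (0.0, 1.0), (3.0, -1.0)]
print(projection_skewness(data, (1.0, 0.0)))  # positive
print(projection_skewness(data, (0.0, 1.0)))  # symmetric margin: 0.0
```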