Advances in Data Analysis and Classification
Journal Prestige (SJR): 1.09
Citation Impact (citeScore): 1
Number of Followers: 52
Hybrid journal (it can contain Open Access articles)
ISSN (Print) 1862-5355 - ISSN (Online) 1862-5347
Published by Springer-Verlag [2469 journals]
• On mathematical optimization for clustering categories in contingency tables

Abstract: Many applications in data analysis study whether two categorical variables are independent using a function of the entries of their contingency table. Often, the categories of the variables, associated with the rows and columns of the table, are grouped, yielding a less granular representation of the categorical variables. The purpose of this is to attain reasonable sample sizes in the cells of the table and, more importantly, to incorporate expert knowledge on the allowable groupings. However, it is known that the conclusions on independence depend, in general, on the chosen granularity, as in the Simpson paradox. In this paper we propose a methodology to, for a given contingency table and a fixed granularity, find a clustered table with the highest $\chi^2$ statistic. Repeating this procedure for different values of the granularity, we can either identify an extreme grouping, namely the largest granularity for which the statistical dependence is still detected, or conclude that it does not exist and that the two variables are dependent regardless of the size of the clustered table. For this problem, we propose an assignment mathematical formulation and a set partitioning one. Our approach is flexible enough to include constraints on the desirable structure of the clusters, such as must-link or cannot-link constraints on the categories that can, or cannot, be merged together, and ensure reasonable sample sizes in the cells of the clustered table from which trustful statistical conclusions can be derived. We illustrate the usefulness of our methodology using a dataset of a medical study.
PubDate: 2022-06-28
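
The core quantity in the abstract above, the $\chi^2$ statistic of a clustered contingency table, can be sketched in a few lines. The example below is only an illustration: it merges rows by exhaustive enumeration instead of the assignment or set partitioning formulations the paper proposes, it handles no must-link or cannot-link constraints, and the function names (`chi2_statistic`, `merge_rows`, `best_row_grouping`) are hypothetical. It assumes every row and column of the table has a positive total.

```python
from itertools import product


def chi2_statistic(table):
    """Pearson chi-squared statistic of a contingency table (list of rows)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    return stat


def merge_rows(table, labels, k):
    """Sum the rows of `table` that share a cluster label in range(k)."""
    merged = [[0] * len(table[0]) for _ in range(k)]
    for row, group in zip(table, labels):
        for j, value in enumerate(row):
            merged[group][j] += value
    return merged


def best_row_grouping(table, k):
    """Brute-force search for the row grouping into k non-empty clusters
    with the highest chi-squared statistic (feasible for tiny tables only)."""
    best_stat, best_labels = -1.0, None
    for labels in product(range(k), repeat=len(table)):
        if len(set(labels)) < k:  # skip groupings with an empty cluster
            continue
        stat = chi2_statistic(merge_rows(table, labels, k))
        if stat > best_stat:
            best_stat, best_labels = stat, labels
    return best_stat, best_labels
```

For example, `best_row_grouping([[10, 20], [12, 18], [20, 10]], 2)` groups the first two rows, whose column profiles are similar, and keeps the third row on its own.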

• Classification based on multivariate mixed type longitudinal data with an application to the EU-SILC database

Abstract: Although many present-day studies gather data of a diverse nature (numeric quantities, binary indicators or ordered categories) on the same units repeatedly over time, there exist only a limited number of approaches in the literature to analyse so-called mixed-type longitudinal data. We present a statistical model capable of jointly modelling several mixed-type outcomes, which also accounts for possible dependencies among the investigated outcomes. A thresholding approach to link binary or ordinal variables to their latent numeric counterparts allows us to jointly model all, including latent, numeric outcomes using a multivariate version of the linear mixed-effects model. We avoid the independence assumption over outcomes by relaxing the variance matrix of random effects to a completely general positive definite matrix. Moreover, we follow model-based clustering methodology to create a mixture of such models to model heterogeneity in the temporal evolution of the considered outcomes. The estimation of such a hierarchical model is approached by Bayesian principles with the use of Markov chain Monte Carlo methods. After a successful simulation study with the aim to examine the ability to consistently estimate the true parameter values and thus discover the different patterns, the EU-SILC dataset consisting of Czech households that were followed for 4 years in a time span from 2005 to 2016 was analysed. The households were classified into groups with a similar evolution of several closely related indicators of monetary poverty based on estimated classification probabilities.
PubDate: 2022-06-25

• Correction to: Principal component analysis constrained by layered simple structures

Abstract: A Correction to this paper has been published: 10.1007/s11634-022-00503-9
PubDate: 2022-06-24

• Constrained clustering and multiple kernel learning without pairwise constraint relaxation

Abstract: Clustering under pairwise constraints is an important knowledge discovery tool that enables the learning of appropriate kernels or distance metrics to improve clustering performance. These pairwise constraints, which come in the form of must-link and cannot-link pairs, arise naturally in many applications and are intuitive for users to provide. However, the common practice of relaxing discrete constraints to a continuous domain to ease optimization when learning kernels or metrics can harm generalization, as information which only encodes linkage is transformed to informing distances. We introduce a new constrained clustering algorithm that jointly clusters data and learns a kernel in accordance with the available pairwise constraints. To generalize well, our method is designed to maximize constraint satisfaction without relaxing pairwise constraints to a continuous domain where they inform distances. We show that the proposed method outperforms existing approaches on a large number of diverse publicly available datasets, and we discuss how our method can scale to handling large data.
PubDate: 2022-06-22

• Model-based two-way clustering of second-level units in ordinal multilevel latent Markov models

Abstract: In this paper, an ordinal multilevel latent Markov model based on separate random effects is proposed. In detail, two distinct second-level discrete effects are considered in the model, one affecting the initial probability vector and the other affecting the transition probability matrix of the first-level ordinal latent Markov process. To model these separate effects, we consider a bi-dimensional mixture specification that avoids unverifiable assumptions on the random effect distribution and yields a two-way clustering of second-level units. Starting from a general model where the two random effects are dependent, we also obtain the independence model as a special case. The proposal is applied to data on the physical health status of a sample of elderly residents grouped into nursing homes. A simulation study assessing the performance of the proposal is also included.
PubDate: 2022-06-01

• Robust logistic zero-sum regression for microbiome compositional data

Abstract: We introduce the Robust Logistic Zero-Sum Regression (RobLZS) estimator, which can be used for a two-class problem with high-dimensional compositional covariates. Since the log-contrast model is employed, the estimator is able to do feature selection among the compositional parts. The proposed method attains robustness by minimizing a trimmed sum of deviances. A comparison of the performance of the RobLZS estimator with a non-robust counterpart and with other sparse logistic regression estimators is conducted via Monte Carlo simulation studies. Two microbiome data applications are considered to investigate the stability of the estimators to the presence of outliers. Robust Logistic Zero-Sum Regression is available as an R package that can be downloaded at https://github.com/giannamonti/RobZS.
PubDate: 2022-06-01
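
Two ingredients named in the abstract above, the log-contrast model with a zero-sum constraint and the trimmed sum of deviances, can be illustrated with a short sketch. This is not the RobZS package or the RobLZS estimator: it only evaluates the trimmed objective for a fixed coefficient vector and does no fitting, penalization, or feature selection. All function names and the trimming parameter `h` are hypothetical, and compositional parts are assumed to be strictly positive.

```python
import math


def log_contrast_predictor(composition, beta, intercept=0.0):
    """Linear predictor of a log-contrast model: intercept + sum_j beta_j * log(x_j)."""
    return intercept + sum(b * math.log(x) for b, x in zip(beta, composition))


def binomial_deviance(y, eta):
    """Deviance contribution -2 * log-likelihood of one binary observation,
    computed with a numerically stable softplus."""
    z = -eta if y == 1 else eta
    return 2.0 * (max(z, 0.0) + math.log1p(math.exp(-abs(z))))


def trimmed_deviance(compositions, labels, beta, intercept=0.0, h=None):
    """Sum of the h smallest deviances; h=None keeps all observations."""
    if abs(sum(beta)) > 1e-9:
        raise ValueError("log-contrast coefficients must sum to zero")
    deviances = sorted(
        binomial_deviance(y, log_contrast_predictor(x, beta, intercept))
        for x, y in zip(compositions, labels)
    )
    h = len(deviances) if h is None else h
    return sum(deviances[:h])
```

Because the coefficients sum to zero, rescaling a composition (for instance from proportions to counts) leaves the linear predictor unchanged, and dropping the largest deviances keeps badly fitted observations from dominating the objective.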

• A von Mises–Fisher mixture model for clustering numerical and categorical variables

Abstract: This work presents a mixture model that allows clustering variables of different types. All variables being measured on the same n statistical units, we first represent every variable with a unit-norm operator in $\mathbb{R}^{n\times n}$ endowed with an appropriate inner product. We propose a von Mises–Fisher mixture model on the unit-sphere containing these operators. The parameters of the mixture model are estimated with an EM algorithm, combined with a K-means procedure to obtain a good starting point. The method is tested on simulated data and eventually applied to wine data.
PubDate: 2022-06-01

• A new three-step method for using inverse propensity weighting with latent class analysis

Abstract: Bias-adjusted three-step latent class analysis (LCA) is widely popular to relate covariates to class membership. However, if the causal effect of a treatment on class membership is of interest and only observational data is available, causal inference techniques such as inverse propensity weighting (IPW) need to be used. In this article, we extend the bias-adjusted three-step LCA to incorporate IPW. This approach separates the estimation of the measurement model from the estimation of the treatment effect using IPW only for the later step. Compared to previous methods, this solves several conceptual issues and more easily facilitates model selection and the use of multiple imputation. This new approach, implemented in the software Latent GOLD, is evaluated in a simulation study and its use is illustrated using data of prostate cancer patients.
PubDate: 2022-06-01
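
As a rough illustration of the weighting idea in the abstract above, the sketch below computes inverse propensity weights and applies them to posterior class membership probabilities. It is a simplified stand-in, not the bias-adjusted three-step estimator implemented in Latent GOLD: the propensity scores and posterior probabilities are assumed to be given, no classification-error correction is applied, and all function names are hypothetical.

```python
def ipw_weights(treated, propensity):
    """Inverse propensity weights: 1/e for treated units, 1/(1-e) for controls."""
    weights = []
    for t, e in zip(treated, propensity):
        if not 0.0 < e < 1.0:
            raise ValueError("propensity scores must lie strictly between 0 and 1")
        weights.append(1.0 / e if t else 1.0 / (1.0 - e))
    return weights


def weighted_class_shares(posteriors, weights):
    """Weighted average of posterior class membership probabilities."""
    total = sum(weights)
    n_classes = len(posteriors[0])
    return [
        sum(w * p[c] for p, w in zip(posteriors, weights)) / total
        for c in range(n_classes)
    ]


def ipw_effect_on_class_shares(treated, propensity, posteriors):
    """Difference in weighted class shares between treated and control units."""
    weights = ipw_weights(treated, propensity)
    groups = {1: ([], []), 0: ([], [])}
    for t, p, w in zip(treated, posteriors, weights):
        groups[1 if t else 0][0].append(p)
        groups[1 if t else 0][1].append(w)
    share_treated = weighted_class_shares(*groups[1])
    share_control = weighted_class_shares(*groups[0])
    return [a - b for a, b in zip(share_treated, share_control)]
```

When every propensity score equals 0.5, the weights are constant and the result reduces to the unweighted difference in mean posterior probabilities.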

• A two-step estimator for generalized linear models for longitudinal data with time-varying measurement error

Abstract: We propose a novel approach for longitudinal data modeling within the Generalized Linear Models family, whenever a covariate of interest is affected by measurement error. We jointly model the response (outcome model), the covariate observed with error (measurement model) and the underlying unobserved time-varying error-free covariate (true score). This is done by assuming a first-order latent Markov chain for the true score. The estimation of the full joint model is hardly feasible when the number of covariates is large, as is typical in real-data applications. Available algorithms are severely affected by numerical underflow and multiple local maxima. To overcome these problems, we propose an efficient two-step approach. With an extensive simulation study, we show that the two-step approach produces point estimates and standard errors which are almost identical to those obtained by the more time-consuming simultaneous (one-step) approach. The proposal is also illustrated by analyzing data from the Chinese Longitudinal Healthy Longevity Survey.
PubDate: 2022-06-01

• How many data clusters are in the Galaxy data set?

Abstract: In model-based clustering, the Galaxy data set is often used as a benchmark data set to study the performance of different modeling approaches. Aitkin (Stat Model 1:287–304) compares maximum likelihood and Bayesian analyses of the Galaxy data set and expresses reservations about the Bayesian approach due to the fact that the prior assumptions imposed remain rather obscure while playing a major role in the results obtained and conclusions drawn. The aim of the paper is to address Aitkin’s concerns about the Bayesian approach by shedding light on how the specified priors influence the number of estimated clusters. We perform a sensitivity analysis of different prior specifications for the mixtures of finite mixture model, i.e., the mixture model where a prior on the number of components is included. We use an extensive set of different prior specifications in a full factorial design and assess their impact on the estimated number of clusters for the Galaxy data set. Results highlight the interaction effects of the prior specifications and provide insights into which prior specifications are recommended to obtain a sparse clustering solution. A simulation study with artificial data provides further empirical evidence to support the recommendations. A clear understanding of the impact of the prior specifications removes restraints preventing the use of Bayesian methods due to the complexity of selecting suitable priors. Also, the regularizing properties of the priors may be intentionally exploited to obtain a suitable clustering solution meeting prior expectations and needs of the application.
PubDate: 2022-06-01

• Factor and hybrid components for model-based clustering

Abstract: A major challenge when performing model-based clustering is a large increase in the number of free parameters as the data dimensionality increases. To combat this issue, parsimonious methods allow component covariance matrices to share parameters by exploiting geometric redundancies. The present work considers an additional level of intracluster structure that also captures hybridisation of mean and covariance parameters between components for the multivariate normal distribution. We posit components with heterogeneous parameterisation; a subset are considered factor components and have explicit mean and covariance parameters, and the remainder are considered hybrid components that have means and covariances implied by a set of factor loadings that weight factor component parameters. An estimation procedure is provided using the Expectation-Maximization algorithm, and comparison to Gaussian mixture models with parsimonious covariances is made by evaluation on a collection of datasets.
PubDate: 2022-06-01

• Gaussian mixture model with an extended ultrametric covariance structure

Abstract: Gaussian Mixture Models (GMMs) are one of the most widespread methodologies for model-based clustering. They assume a multivariate Gaussian distribution for each component of the mixture, centered at the mean vector and with volume, shape and orientation derived from the covariance matrix. To reduce the large number of parameters produced by the covariance matrices, parsimonious parameterizations of the latter were proposed in the literature, e.g., the eigen-decomposition and the parsimonious GMMs based on mixtures of probabilistic principal component analyzers and mixtures of factor analyzers. We introduce a new parameterization of a covariance matrix by defining an extended ultrametric covariance matrix and we implement it into a GMM. This structure can be used to describe multidimensional phenomena which are characterized by nested latent concepts having different levels of abstraction, from the most specific to the most general. The proposal is able to pinpoint a hierarchical structure on variables for each component of the GMM, thus identifying a different characterization of a multidimensional phenomenon for each component (cluster, subpopulation) of the mixture. At the same time, it defines a new parsimonious GMM since the ultrametric covariance structure reconstructs the relationships among variables with a limited number of parameters. The proposal is applied to synthetic and real data. On the former it shows good performance in terms of classification when compared to the other existing parameterizations, and on the latter it also provides insight into the hierarchical relationships among the variables for each cluster.
PubDate: 2022-06-01

• Are attitudes toward immigration changing in Europe? An analysis based on latent class IRT models

Abstract: We analyze the changing attitudes toward immigration in EU host countries in the last few years (2010–2018) on the basis of the European Social Survey data. These data are collected by the administration of a questionnaire made of items concerning different aspects related to the immigration phenomenon. For this analysis, we rely on a latent class approach considering a variety of models that allow for: (1) multidimensionality; (2) discreteness of the latent trait distribution; (3) time-constant and time-varying covariates; and (4) sample weights. Through these models we find latent classes of Europeans with similar levels of immigration acceptance and we study the effect of different socio-economic covariates on the probability of belonging to these classes for which we provide a specific interpretation. In this way we show which countries tend to be more or less positive toward immigration and we analyze the temporal dynamics of the phenomenon under study.
PubDate: 2022-06-01

• Special issue on “Models and learning for clustering and Classification”

PubDate: 2022-05-31

• Principal component analysis constrained by layered simple structures

Abstract: The paper proposes a procedure for principal component analysis called layered principal component analysis (LPCA) to produce a simple and interpretable loading matrix. The novelty of LPCA is that a loading matrix is constrained as a sum of matrices with simple structures called layers, and the resulting simplicity of the LPCA solution is controlled by how many layers are used. LPCA is a generalization of disjoint PCA proposed as reported by Ferrara (in: Giommi (ed) Topics in theoretical and applied statistics, Springer, Cham 2016). The number of layers controls the balance of simplicity and the fit to the data, and the user can choose the desired level of simplicity between the most restrictive but simplest case with a single layer or multiple layers with better fit to the data. The optimal number of layers is specified in terms of explained variance and two information criteria. Two simulation studies were conducted to evaluate how accurately the LPCA procedure recovers the true parameter values. The results showed that LPCA was effective for parameter recovery. The paper presents three examples of LPCA applied to real data, which show the potential of LPCA for producing simple and interpretable loading matrices.
PubDate: 2022-05-23

• Extending finite mixtures of nonlinear mixed-effects models with covariate-dependent mixing weights

Abstract: Finite mixtures of nonlinear mixed-effects models have emerged as a prominent tool for modeling and clustering longitudinal data following nonlinear growth patterns with heterogeneous behavior. This paper proposes an extended finite mixtures of nonlinear mixed-effects model in which the mixing proportions are related to some explanatory covariates. A logistic function is incorporated to describe the relationship between the prior classification probabilities and the covariates of interest. For parameter estimation, we develop an analytically simple expectation conditional maximization algorithm coupled with the first-order Taylor approximation to linearize the model with pseudo data. The calculation of the standard errors of estimators via a general information-based method and the empirical Bayes estimation of random effects are also discussed. The methodology is illustrated through several simulation experiments and an application to the AIDS Clinical Trials Group Protocol 315 study.
PubDate: 2022-05-08

• Modal clustering of matrix-variate data

Abstract: The nonparametric formulation of density-based clustering, known as modal clustering, draws a correspondence between groups and the attraction domains of the modes of the density function underlying the data. Its probabilistic foundation allows for a natural, yet not trivial, generalization of the approach to the matrix-valued setting, increasingly widespread, for example, in longitudinal and multivariate spatio-temporal studies. In this work we introduce nonparametric estimators of matrix-variate distributions based on kernel methods, and analyze their asymptotic properties. Additionally, we propose a generalization of the mean-shift procedure for the identification of the modes of the estimated density. Given the intrinsic high dimensionality of matrix-variate data, we discuss some locally adaptive solutions to handle the problem. We test the procedure via extensive simulations, also with respect to some competitors, and illustrate its performance through two high-dimensional real data applications.
PubDate: 2022-05-05
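
The mean-shift procedure mentioned in the abstract above can be sketched for the simplest possible case. The example below uses scalar observations and a Gaussian kernel with a fixed bandwidth, so it is the classical mean-shift iteration and not the matrix-variate generalization or the locally adaptive variants the paper develops. The function names and the tolerances are hypothetical choices.

```python
import math


def mean_shift_mode(points, start, bandwidth, tol=1e-10, max_iter=1000):
    """Climb from `start` to a mode of the Gaussian kernel density estimate."""
    x = float(start)
    for _ in range(max_iter):
        weights = [math.exp(-0.5 * ((p - x) / bandwidth) ** 2) for p in points]
        total = sum(weights)
        if total == 0.0:  # start is too far from every observation
            return x
        new_x = sum(w * p for w, p in zip(weights, points)) / total
        if abs(new_x - x) < tol:
            return new_x
        x = new_x
    return x


def modal_clusters(points, bandwidth, merge_tol=1e-3):
    """Assign each point to the mode its mean-shift path converges to."""
    modes, labels = [], []
    for p in points:
        mode = mean_shift_mode(points, p, bandwidth)
        for index, known in enumerate(modes):
            if abs(mode - known) < merge_tol:
                labels.append(index)
                break
        else:  # no existing mode is close enough: record a new one
            modes.append(mode)
            labels.append(len(modes) - 1)
    return modes, labels
```

Each observation is labelled by the mode it is attracted to, which is the correspondence between groups and attraction domains that modal clustering relies on.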

• Sparsifying the least-squares approach to PCA: comparison of lasso and cardinality constraint

Abstract: Sparse PCA methods are used to overcome the difficulty of interpreting the solution obtained from PCA. However, constraining PCA to obtain sparse solutions is an intractable problem, especially in a high-dimensional setting. Penalized methods are used to obtain sparse solutions due to their computational tractability. Nevertheless, recent developments permit efficiently obtaining good solutions of cardinality-constrained PCA problems, allowing comparison between these approaches. Here, we compare a penalized PCA method with its cardinality-constrained counterpart for the least-squares formulation of PCA imposing sparseness on the component weights. We compare the penalized and cardinality-constrained methods through a simulation study that estimates the sparse structure’s recovery, mean absolute bias, mean variance, and mean squared error. Additionally, we use a high-dimensional data set to illustrate the methods in practice. Results suggest that using cardinality-constrained methods leads to better recovery of the sparse structure.
PubDate: 2022-04-27

• Basis expansion approaches for functional analysis of variance with repeated measures

Abstract: The methodological contribution in this paper is motivated by biomechanical studies where data characterizing human movement are waveform curves representing joint measures such as flexion angles, velocity, acceleration, and so on. In many cases the aim consists of detecting differences in gait patterns when several independent samples of subjects walk or run under different conditions (repeated measures). Classic kinematic studies often analyse discrete summaries of the sample curves, discarding important information and providing biased results. As the sample data are obviously curves, a Functional Data Analysis approach is proposed to solve the problem of testing the equality of the mean curves of a functional variable observed on several independent groups under different treatments or time periods. A novel approach for Functional Analysis of Variance (FANOVA) for repeated measures that takes into account the complete curves is introduced. By assuming a basis expansion for each sample curve, the two-way FANOVA problem is reduced to a Multivariate ANOVA (MANOVA) for the multivariate response of basis coefficients. Then, two different approaches for MANOVA with repeated measures are considered. In addition, an extensive simulation study is carried out to check their performance. Finally, two applications with gait data are presented.
PubDate: 2022-04-09
DOI: 10.1007/s11634-022-00500-y

• Early identification of biliary atresia using subspace and the bootstrap methods

Abstract: In clinical medicine, physicians often rely on information derived from medical imaging systems, such as image data for diagnosis. To detect disease early, physicians extract essential information from data manually to distinguish accurately between positive and negative cases of disease. In recent years, deep learning (DL) has been used for this purpose, attracting the attention of prominent researchers because of its excellent performance. Consequently, DL and other artificial intelligence (AI) technologies are expected to develop further through integration with statistical and other approaches. Here, we examine biliary atresia (BA), a rare disease that affects primarily infants. Our study focuses on the identification of BA from image data (stool images of BA patients). Using AI and statistical approaches, we propose a machine learning classifier (model) for accurate diagnosis, efficient classification, and early detection of BA after exposure to limited training data. In an initial study, we used the subspace pattern recognition method for the development of a similar classifier. In this study, we propose the development of a filter based on the subspace method and a statistical approach. The filter enables the classifier to extract essential information from image data and discriminate efficiently between BA and non-BA patients.
PubDate: 2022-04-04
DOI: 10.1007/s11634-022-00493-8
