Abstract: We present a novel method, REMAXINT, that captures the gist of two-way interaction in row-by-column (i.e., two-mode) data with one observation per cell. REMAXINT is a probabilistic two-mode clustering model that yields two-mode partitions with maximal interaction between row and column clusters. To estimate the parameters of REMAXINT, we maximize a conditional classification likelihood in which the random row (or column) main effects are conditioned out. To test the null hypothesis of no interaction between row and column clusters, we propose a \(max-F\) test statistic and discuss its properties. We develop a Monte Carlo approach to obtain its sampling distribution under the null hypothesis. We evaluate the performance of the method through simulation studies. Specifically, for selected values of data size and (true) numbers of clusters, we obtain critical values of the \(max-F\) statistic, determine the empirical Type I error rate of the proposed inferential procedure and study its power to reject the null hypothesis. Next, we show that the novel method is useful in a variety of applications by presenting two empirical case studies, and we end with some concluding remarks. PubDate: 2021-04-27
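The max-F construction can be illustrated in miniature: double-centre the data matrix to remove row and column main effects, score a candidate two-mode partition by the size-weighted sum of squared block means of the residuals, and maximize over candidate partitions; the null distribution then comes from repeating the maximization on data simulated without interaction. The sketch below is a stylized stand-in for the actual REMAXINT estimator (random search instead of classification-likelihood maximization; all names are ours):

```python
import random
import statistics

def interaction_ss(data, row_lab, col_lab):
    """Interaction sum of squares for a two-mode partition: double-centre
    the matrix, then sum the size-weighted squared block means."""
    rows = sorted({i for i, _ in data})
    cols = sorted({j for _, j in data})
    grand = statistics.mean(data.values())
    rmean = {i: statistics.mean(data[i, j] for j in cols) for i in rows}
    cmean = {j: statistics.mean(data[i, j] for i in rows) for j in cols}
    blocks = {}
    for i in rows:
        for j in cols:
            resid = data[i, j] - rmean[i] - cmean[j] + grand
            blocks.setdefault((row_lab[i], col_lab[j]), []).append(resid)
    return sum(len(v) * statistics.mean(v) ** 2 for v in blocks.values())

def max_f(data, n_row_cl, n_col_cl, tries=200, rng=random):
    """Maximize the interaction statistic over random candidate partitions
    (a crude stand-in for maximizing over all two-mode partitions)."""
    rows = sorted({i for i, _ in data})
    cols = sorted({j for _, j in data})
    best = 0.0
    for _ in range(tries):
        rl = {i: rng.randrange(n_row_cl) for i in rows}
        cl = {j: rng.randrange(n_col_cl) for j in cols}
        best = max(best, interaction_ss(data, rl, cl))
    return best
```

A Monte Carlo critical value would then be, say, the 95th percentile of `max_f` computed over many datasets simulated under the no-interaction null.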

Abstract: Finding a set of nested partitions of a dataset is useful to uncover relevant structure at different scales, and is often handled with a data-dependent methodology. In this paper, we introduce a general two-step methodology for model-based hierarchical clustering. Considering the integrated classification likelihood criterion as an objective function, this work applies to any discrete latent variable model (DLVM) where this quantity is tractable. The first step of the methodology involves maximizing the criterion with respect to the partition. Addressing the known problem of sub-optimal local maxima found by greedy hill-climbing heuristics, we introduce a new hybrid algorithm, based on a genetic algorithm, which efficiently explores the space of solutions. The resulting algorithm carefully combines and merges different solutions, and allows the joint inference of the number K of clusters as well as the clusters themselves. Starting from this natural partition, the second step of the methodology is based on a bottom-up greedy procedure to extract a hierarchy of clusters. In a Bayesian context, this is achieved by considering the Dirichlet cluster proportion prior parameter \(\alpha \) as a regularization term controlling the granularity of the clustering. A new approximation of the criterion is derived as a log-linear function of \(\alpha \), enabling a simple functional form of the merge decision criterion. This second step allows the exploration of the clustering at coarser scales. The proposed approach is compared with existing strategies in simulated as well as real settings, and its results are shown to be particularly relevant. A reference implementation of this work is available in the R package greed accompanying the paper. PubDate: 2021-04-13

Abstract: In practice, data often display heteroscedasticity, making quantile regression (QR) a more appropriate methodology. Modeling the data while maintaining a flexible nonparametric fit requires smoothing over a high-dimensional space, which might not be feasible when the number of predictor variables is large. This makes dimension reduction techniques for conditional quantiles necessary; these focus on extracting linear combinations of the predictor variables without losing any information about the conditional quantile. However, nonlinear features can achieve greater dimension reduction. We therefore present the first nonlinear extension of the linear algorithm for estimating the central quantile subspace (CQS) using kernel data. First, we describe the feature CQS within the framework of reproducing kernel Hilbert spaces, and second, we illustrate its performance through simulation examples and real data applications. Specifically, we emphasize visualizing various aspects of the data structure using the first two feature extractors, and we highlight the ability to combine the proposed algorithm with linear classification and regression algorithms. The results show that the feature CQS is an effective kernel tool for performing nonlinear dimension reduction for conditional quantiles. PubDate: 2021-03-23

Abstract: Shapelets are discriminative subsequences extracted from time-series data. Classifiers using shapelets have proven to achieve performance competitive with state-of-the-art methods, while enhancing the model’s interpretability. While a lot of research has been done on univariate time-series shapelets, extensions to the multivariate setting have not yet received much attention. To extend shapelets-based classification to a multidimensional setting, we developed a novel architecture for shapelets learning by embedding them as trainable weights in a multi-layer neural network. We also investigated a novel learning strategy for the shapelets, comprising two additional terms in the optimization goal, to retrieve a reduced set of uncorrelated shapelets. This paper describes the proposed architecture and presents results on ten publicly available benchmark datasets, as well as a comparison with existing state-of-the-art methods. Moreover, the proposed optimization objective leads the model to automatically select smaller sets of uncorrelated shapelets, thus requiring no additional manual tuning of typically important hyper-parameters such as the number and length of shapelets. The results show that the proposed approach achieves competitive performance across the datasets and always leads to a significant reduction in the number of shapelets used. This can make it faster for a domain expert to match shapelets to real patterns, thus enhancing the interpretability of the model. Finally, since the shapelets learnt during training can be extracted from the model, they can serve as meaningful insights into the classifier’s decisions and the interactions between different dimensions. PubDate: 2021-03-04 DOI: 10.1007/s11634-021-00437-8
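The primitive underlying any shapelet classifier, whether the shapelets are learned as network weights or extracted from the data, is the distance between a shapelet and a series: the minimum Euclidean distance over all windows of the shapelet's length. A minimal sketch (naming is ours):

```python
import math

def shapelet_distance(series, shapelet):
    """Minimum Euclidean distance between a shapelet and every
    window of the same length in the series."""
    m = len(shapelet)
    best = math.inf
    for start in range(len(series) - m + 1):
        d = math.sqrt(sum((series[start + k] - shapelet[k]) ** 2 for k in range(m)))
        best = min(best, d)
    return best
```

A classifier then thresholds (or feeds into further layers) one such distance per shapelet; in the learned setting the shapelet values themselves are updated by gradient descent on a differentiable version of this minimum.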

Abstract: The Lasso has become a benchmark data analysis procedure, and numerous variants have been proposed in the literature. Although Lasso formulations are stated so that overall prediction error is optimized, they allow no direct control over the prediction accuracy on particular individuals of interest. In this work we propose a novel version of the Lasso in which quadratic performance constraints are added to Lasso-based objective functions, in such a way that threshold values are set to bound the prediction errors in the different groups of interest (not necessarily disjoint). As a result, a constrained sparse regression model is defined by a nonlinear optimization problem. This cost-sensitive constrained Lasso has a direct application in heterogeneous samples where data are collected from distinct sources, as is standard in many biomedical contexts. Both theoretical properties and empirical studies concerning the new method are explored in this paper. In addition, two illustrations of the method in biomedical and sociological contexts are considered. PubDate: 2021-03-01
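As a baseline for the constrained variant, the plain Lasso can be fit by cyclic coordinate descent with soft-thresholding; the cost-sensitive version described above additionally imposes quadratic group-error constraints, which in practice calls for a quadratic or conic solver rather than this simple loop. A sketch of the unconstrained baseline, assuming the objective \(\mathrm{RSS}/(2n) + \lambda \Vert \beta \Vert _1\) (our naming, not the authors' code):

```python
def soft_threshold(z, g):
    """Proximal operator of the l1 penalty."""
    if z > g:
        return z - g
    if z < -g:
        return z + g
    return 0.0

def lasso_cd(X, y, lam, n_iter=200):
    """Cyclic coordinate descent for the Lasso on list-of-lists data."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            # partial residual excluding feature j
            r = [y[i] - sum(X[i][k] * beta[k] for k in range(p) if k != j)
                 for i in range(n)]
            rho = sum(X[i][j] * r[i] for i in range(n)) / n
            denom = sum(X[i][j] ** 2 for i in range(n)) / n
            beta[j] = soft_threshold(rho, lam) / denom
    return beta
```

The group constraints of the cost-sensitive Lasso would enter as conditions of the form "mean squared error on group g below its threshold", turning each coordinate step into a constrained subproblem.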

Abstract: Two types of nominal classifications are distinguished, namely regular nominal classifications and dichotomous-nominal classifications. The first type does not include an ‘absence’ category (for example, no disorder), whereas the second type does include an ‘absence’ category. Cohen’s unweighted kappa can be used to quantify agreement between two regular nominal classifications with the same categories, but there are no coefficients for assessing agreement between two dichotomous-nominal classifications. Kappa coefficients for dichotomous-nominal classifications with identical categories are defined. All coefficients proposed belong to a one-parameter family. It is studied how the coefficients for dichotomous-nominal classifications are related and whether the values of the coefficients depend on the number of categories. It turns out that the values of the new kappa coefficients can be strictly ordered in precisely two ways. The orderings suggest that the new coefficients are measuring the same thing, but to a different extent. If one accepts the use of magnitude guidelines, it is recommended to use stricter criteria for the new coefficients that tend to produce higher values. PubDate: 2021-03-01
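For reference, Cohen's unweighted kappa, the starting point that the new coefficients generalize, is observed agreement corrected for chance agreement, \(\kappa = (p_o - p_e)/(1 - p_e)\), computed from a square agreement table:

```python
def cohens_kappa(table):
    """Unweighted Cohen's kappa from a square agreement table,
    table[i][j] = number of objects rater 1 put in category i
    and rater 2 put in category j."""
    n = sum(sum(row) for row in table)
    k = len(table)
    # observed proportion of agreement (diagonal mass)
    po = sum(table[i][i] for i in range(k)) / n
    # chance agreement from the product of the marginal proportions
    pe = sum((sum(table[i]) / n) * (sum(row[i] for row in table) / n)
             for i in range(k))
    return (po - pe) / (1 - pe)
```

The dichotomous-nominal coefficients of the paper modify how the 'absence' category enters this correction; the formula above is only the regular-nominal special case.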

Abstract: In forecasting, we often require interval forecasts instead of just a specific point forecast. To track streaming data effectively, this interval forecast should reliably cover the observed data and yet be as narrow as possible. To achieve this, we propose two methods based on regression trees: one ensemble method and one method based on a single tree. For the ensemble method, we use weighted results from the most recent models, and for the single-tree method, we retain one model until it becomes necessary to train a new one. We propose a novel method to update the interval forecast adaptively using root mean square prediction errors calculated from the latest data batch. We use wavelet-transformed data to capture information over long time scales, and conditional inference trees as the underlying regression tree model. Results show that both methods perform well, having good coverage without the intervals being excessively wide. When the underlying data-generation mechanism changes, their performance is initially affected but recovers relatively quickly as time proceeds. The single-tree method performs best in terms of computational (CPU) time compared to the ensemble method. When compared to ARIMA and GARCH modelling, our methods achieve better or similar coverage and width but require considerably less CPU time. PubDate: 2021-03-01
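The adaptive-interval idea reduces to a simple update: recompute the root mean square prediction error (RMSPE) on the latest data batch and widen or narrow the interval around the point forecast accordingly. A sketch (the normal-quantile multiplier is our simplifying assumption, not necessarily the authors' calibration):

```python
import math

def rmspe(actuals, preds):
    """Root mean square prediction error over a batch."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actuals, preds)) / len(actuals))

def interval(point_forecast, recent_actuals, recent_preds, z=1.96):
    """Interval forecast: point forecast plus/minus z times the RMSPE
    computed from the most recent batch of (actual, prediction) pairs."""
    half = z * rmspe(recent_actuals, recent_preds)
    return point_forecast - half, point_forecast + half
```

As new batches arrive, only the recent errors feed the width, so the interval automatically widens after a change in the data-generation mechanism and narrows again once the model recovers.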

Abstract: Motivated by the problem of identifying rod-shaped particles (e.g. bacilliform bacteria), in this paper we consider the multiple generalized circle detection problem. We propose a method for solving this problem based on center-based clustering, where the cluster-centers are generalized circles. An efficient algorithm is proposed, based on a modification of the well-known k-means algorithm with generalized circles as cluster-centers. In doing so, it is extremely important to have a good initial approximation. For the purpose of recognizing detected generalized circles, a QAD-indicator is proposed, along with a new DBC-index specialized for such situations. The recognition process is initiated by searching for a good initial partition using the DBSCAN algorithm. If the QAD-indicator shows that the generalized circle-cluster-center of some cluster does not recognize the sought generalized circle (which typically happens when the data points stem from intersecting or touching generalized circles), the procedure continues searching for corresponding initial generalized circles for these clusters using the Incremental algorithm, after which corresponding generalized circle-cluster-centers are calculated for the obtained clusters. The method is illustrated and tested on different artificial data sets coming from a number of generalized circles, and on real images. PubDate: 2021-03-01
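The center-based step can be sketched with ordinary circles as cluster-centers: assign each point to the circle whose ring it lies closest to, then refit each circle from its cluster. The crude refit below (centroid as center, mean distance as radius) is our stand-in for a proper least-squares generalized-circle fit, and it assumes no cluster becomes empty:

```python
import math

def point_circle_dist(p, circle):
    """Distance from a point to a circle's ring: | |p - center| - radius |."""
    (cx, cy), r = circle
    return abs(math.hypot(p[0] - cx, p[1] - cy) - r)

def assign(points, circles):
    """k-means-style assignment: each point goes to the nearest circle."""
    return [min(range(len(circles)), key=lambda k: point_circle_dist(p, circles[k]))
            for p in points]

def update(points, labels, k):
    """Crude circle refit per cluster: centroid center, mean-distance radius.
    Assumes every cluster is non-empty."""
    circles = []
    for c in range(k):
        cluster = [p for p, l in zip(points, labels) if l == c]
        cx = sum(p[0] for p in cluster) / len(cluster)
        cy = sum(p[1] for p in cluster) / len(cluster)
        r = sum(math.hypot(p[0] - cx, p[1] - cy) for p in cluster) / len(cluster)
        circles.append(((cx, cy), r))
    return circles
```

Alternating `assign` and `update` is the Lloyd-style loop the abstract modifies; as it stresses, the result depends heavily on the initial approximation, hence the DBSCAN-based initialization.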

Abstract: The Laplacian support vector machine (LapSVM), which is based on the semi-supervised manifold regularization learning framework, performs better than the standard SVM, especially when the supervised information is insufficient. However, the use of the hinge loss makes LapSVM sensitive to noise around the decision boundary. To enhance the performance of LapSVM, we present a novel semi-supervised SVM with an asymmetric squared loss (asy-LapSVM), which deals with the expectile distance and is less sensitive to noise-corrupted data. We further present a simple and efficient functional iterative method to solve the proposed asy-LapSVM; in addition, we demonstrate the convergence of the functional iterative method both theoretically and experimentally. Numerical experiments performed on a number of commonly used datasets with noise of different variances demonstrate the validity of the proposed asy-LapSVM and the feasibility of the presented functional iterative method. PubDate: 2021-03-01
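The asymmetric squared (expectile) loss weights positive and negative residuals by \(\tau\) and \(1-\tau\); minimizing it over a constant yields the \(\tau\)-expectile, which can be computed by iteratively reweighted means. A standalone sketch of the loss and of the expectile (ours, not the asy-LapSVM solver):

```python
def asymmetric_squared_loss(residual, tau):
    """Expectile loss: squared error weighted by tau on one side and
    1 - tau on the other; tau = 0.5 recovers (half) the squared loss."""
    w = tau if residual >= 0 else 1 - tau
    return w * residual ** 2

def expectile(values, tau, n_iter=100):
    """tau-expectile of a sample via iteratively reweighted means."""
    m = sum(values) / len(values)
    for _ in range(n_iter):
        ws = [tau if v >= m else 1 - tau for v in values]
        m = sum(w * v for w, v in zip(ws, values)) / sum(ws)
    return m
```

Because the loss grows quadratically but with a bounded weight on each side, a single boundary point mislabeled by noise moves the solution less abruptly than under the hinge loss.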

Abstract: There is a need for models that account for discreteness in data, along with its time series properties and correlation. Our focus is on INteger-valued AutoRegressive (INAR) type models, which can be used in conjunction with existing model-based clustering techniques to cluster discrete-valued time series data. With the use of a finite mixture model, several existing techniques, such as selecting the number of clusters, estimation via expectation-maximization and model selection, become applicable. The proposed model is then demonstrated on real data to illustrate its clustering applications. PubDate: 2021-03-01
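An INAR(1) process replaces the multiplication in an AR(1) with binomial thinning, so values stay integer-valued: \(X_t = \alpha \circ X_{t-1} + \varepsilon_t\), where \(\alpha \circ X\) keeps each of the \(X\) counts independently with probability \(\alpha\). A simulation sketch with Poisson innovations (Poisson draws via Knuth's method to stay in the standard library; the exact innovation family used in the paper may differ):

```python
import math
import random

def poisson(lam, rng):
    """Poisson draw via Knuth's multiplication method."""
    L, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= rng.random()
        if p <= L:
            return k
        k += 1

def binomial_thinning(x, alpha, rng):
    """alpha ∘ x: each of the x counts survives independently w.p. alpha."""
    return sum(1 for _ in range(x) if rng.random() < alpha)

def simulate_inar1(n, alpha, lam, rng):
    """Simulate X_t = alpha ∘ X_{t-1} + Poisson(lam) innovations."""
    x = poisson(lam / (1 - alpha), rng)  # start near the stationary mean
    out = []
    for _ in range(n):
        x = binomial_thinning(x, alpha, rng) + poisson(lam, rng)
        out.append(x)
    return out
```

Clustering then amounts to fitting a finite mixture of such processes, with each cluster contributing its own \((\alpha, \lambda)\).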

Abstract: During the past few years, Boolean matrix factorization (BMF) has become an important direction in data analysis. The minimum description length (MDL) principle was successfully adapted in BMF for model order selection. Nevertheless, a BMF algorithm that performs well with respect to standard BMF measures has been missing. In this paper, we propose a novel from-below Boolean matrix factorization algorithm based on formal concept analysis. The algorithm utilizes the MDL principle as a criterion for factor selection. In various experiments we show that the proposed algorithm outperforms, from different standpoints, existing state-of-the-art BMF algorithms. PubDate: 2021-03-01
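In BMF the matrix product is Boolean (OR of ANDs), and a from-below factorization never covers a 0 of the input, so its reconstruction error is just the number of uncovered 1s. A minimal sketch with 0/1 list-of-lists matrices (naming is ours):

```python
def boolean_product(A, B):
    """Boolean matrix product: (A ∘ B)[i][j] = OR_k (A[i][k] AND B[k][j])."""
    n, k, m = len(A), len(B), len(B[0])
    return [[int(any(A[i][l] and B[l][j] for l in range(k))) for j in range(m)]
            for i in range(n)]

def uncovered(X, A, B):
    """Number of 1s of X not covered by the factorization A ∘ B.
    For a from-below factorization (A ∘ B <= X entrywise) this is
    the entire reconstruction error."""
    P = boolean_product(A, B)
    return sum(X[i][j] and not P[i][j]
               for i in range(len(X)) for j in range(len(X[0])))
```

Factor selection then trades this coverage against the description length of the factors, which is where the MDL criterion enters.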

Abstract: Mixture models are among the most popular tools for clustering. However, when the dimension and the number of clusters are large, estimating the clusters becomes challenging, as does their interpretation. Restrictions on the parameters can be used to reduce the dimension. An example is given by mixtures of factor analyzers (MFA) for Gaussian mixtures. The extension of MFA to non-Gaussian mixtures is not straightforward. We propose a new constraint for the parameters of non-Gaussian mixture models: the K component parameters are combinations of elements from a small dictionary of, say, H elements, with \(H \ll K\). Including a nonnegative matrix factorization (NMF) in the EM algorithm allows us to simultaneously estimate the dictionary and the parameters of the mixture. We propose the acronym NMF-EM for this algorithm, implemented in the R package nmfem. This original approach is motivated by passenger clustering from ticketing data: we apply NMF-EM to data from two Transdev public transport networks. In this case, the words are easily interpreted as typical slots in a timetable. PubDate: 2021-03-01
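The NMF ingredient can be sketched with Lee–Seung multiplicative updates: both factors stay nonnegative because each update multiplies by a nonnegative ratio. In NMF-EM the analogous factorization expresses the K component parameters as nonnegative combinations of H dictionary elements; the standalone sketch below (pure Python, our naming) simply factorizes a small nonnegative matrix:

```python
import random

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(r) for r in zip(*A)]

def nmf(V, H, n_iter=200, seed=0):
    """Lee–Seung multiplicative updates for V ≈ W·Hm with W, Hm >= 0."""
    rng = random.Random(seed)
    n, m = len(V), len(V[0])
    W = [[rng.random() + 0.1 for _ in range(H)] for _ in range(n)]
    Hm = [[rng.random() + 0.1 for _ in range(m)] for _ in range(H)]
    eps = 1e-9
    for _ in range(n_iter):
        WH = matmul(W, Hm)
        # Hm <- Hm * (W^T V) / (W^T W Hm)
        num, den = matmul(transpose(W), V), matmul(transpose(W), WH)
        Hm = [[Hm[i][j] * num[i][j] / (den[i][j] + eps) for j in range(m)]
              for i in range(H)]
        WH = matmul(W, Hm)
        # W <- W * (V Hm^T) / (W Hm Hm^T)
        num, den = matmul(V, transpose(Hm)), matmul(WH, transpose(Hm))
        W = [[W[i][j] * num[i][j] / (den[i][j] + eps) for j in range(H)]
             for i in range(n)]
    return W, Hm
```

In NMF-EM this update alternates with EM steps for the mixture, so dictionary and mixture parameters are estimated jointly.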

Abstract: Data embedding (DE) or dimensionality reduction techniques are particularly well suited to embedding high-dimensional data into a space that in most cases will have just two dimensions. Low-dimensional space, in which data samples (data points) can more easily be visualized, is also often used for learning methods such as clustering. Sometimes, however, DE will identify dimensions that contribute little in terms of the clustering structures that they reveal. In this paper we look at regularized data embedding by clustering, and we propose a simultaneous learning approach for DE and clustering that reinforces the relationships between these two tasks. Our approach is based on a matrix decomposition technique for learning a spectral DE, a cluster membership matrix, and a rotation matrix that closely maps out the continuous spectral embedding, in order to obtain a good clustering solution. We compare our approach with some traditional clustering methods and perform numerical experiments on a collection of benchmark datasets to demonstrate its potential. PubDate: 2021-03-01

Abstract: Modelling functional data in the presence of spatial dependence is of great practical importance, as exemplified by applications in the fields of demography, economy and geography, and has received much attention recently. However, for the classical scalar-on-function regression (SoFR), with functional covariates and scalar responses, relatively little literature is dedicated to this area, which merits further research. We propose a robust spatial autoregressive scalar-on-function regression by incorporating a spatial autoregressive parameter and a spatial weight matrix into the SoFR to accommodate spatial dependence among individuals. The t-distribution assumption for the error terms makes our model more robust than classical spatial autoregressive models under normal distributions. We estimate the model by first projecting the functional predictor onto a functional space spanned by an orthonormal functional basis and then applying an expectation–maximization algorithm. Simulation studies show that our estimators are efficient, and are superior in scenarios with spatial correlation and heavy-tailed error terms. A real weather dataset demonstrates the superiority of our model over the SoFR in the case of spatial dependence. PubDate: 2021-03-01

Abstract: We propose a robust procedure to estimate a linear regression model with compositional and real-valued explanatory variables. The proposed procedure is designed to be robust against individual outlying cells in the data matrix (cellwise outliers), as well as entire outlying observations (rowwise outliers). Cellwise outliers are first filtered and then imputed by robust estimates. Afterwards, rowwise robust compositional regression is performed to obtain model coefficient estimates. Simulations show that the procedure generally outperforms a traditional rowwise-only robust regression method (MM-estimator). Moreover, our procedure yields better or comparable results to recently proposed cellwise robust regression methods (shooting S-estimator, 3-step regression) while it is preferable for interpretation through the use of appropriate coordinate systems for compositional data. An application to bio-environmental data reveals that the proposed procedure—compared to other regression methods—leads to conclusions that are best aligned with established scientific knowledge. PubDate: 2021-02-24 DOI: 10.1007/s11634-021-00436-9
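The cellwise filtering step can be sketched per column: flag cells whose robust z-score, based on the median and the median absolute deviation (MAD), exceeds a cutoff, and impute them with the column median. This is a simplified stand-in for the paper's filter-and-impute procedure (the 3.5 cutoff is a common convention, not necessarily the authors' choice):

```python
import statistics

def filter_and_impute(column, cutoff=3.5):
    """Flag cells whose robust z-score exceeds the cutoff and impute
    them with the column median."""
    med = statistics.median(column)
    mad = statistics.median(abs(x - med) for x in column)
    scale = 1.4826 * mad  # consistency factor for Gaussian data
    out = []
    for x in column:
        z = 0.0 if scale == 0 else abs(x - med) / scale
        out.append(med if z > cutoff else x)
    return out
```

Rowwise robust regression on the filtered-and-imputed matrix then handles whole outlying observations that cellwise screening cannot catch.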

Abstract: Principal component regression (PCR) is a two-stage procedure: the first stage performs principal component analysis (PCA) and the second stage builds a regression model whose explanatory variables are the principal components obtained in the first stage. Since PCA is performed using only explanatory variables, the principal components have no information about the response variable. To address this problem, we present a one-stage procedure for PCR based on a singular value decomposition approach. Our approach is based upon two loss functions, which are a regression loss and a PCA loss from the singular value decomposition, with sparse regularization. The proposed method enables us to obtain principal component loadings that include information about both explanatory variables and a response variable. An estimation algorithm is developed by using the alternating direction method of multipliers. We conduct numerical studies to show the effectiveness of the proposed method. PubDate: 2021-02-08 DOI: 10.1007/s11634-020-00435-2
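The two-stage baseline that the one-stage proposal improves on is easy to state: extract the leading principal component of the (centred) predictors, then regress the response on the component scores. A sketch using power iteration, restricted to one component (our naming):

```python
def first_pc(X, n_iter=200):
    """Leading principal component direction via power iteration on X^T X.
    Columns of X are assumed to be centred."""
    p = len(X[0])
    v = [1.0] * p
    for _ in range(n_iter):
        Xv = [sum(row[j] * v[j] for j in range(p)) for row in X]   # X v
        w = [sum(X[i][j] * Xv[i] for i in range(len(X))) for j in range(p)]  # X^T X v
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    return v

def pcr_one_component(X, y):
    """Two-stage PCR with a single component: project onto the first PC,
    then simple least squares of y on the scores."""
    v = first_pc(X)
    scores = [sum(row[j] * v[j] for j in range(len(v))) for row in X]
    beta = sum(s * t for s, t in zip(scores, y)) / sum(s * s for s in scores)
    return [beta * vj for vj in v]  # coefficients back in the original space
```

Because `first_pc` sees only X, the direction ignores y entirely; the one-stage SVD-based method above instead couples the PCA loss and the regression loss so the loadings reflect both.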

Abstract: Dimensionality reduction algorithms are powerful mathematical tools for data analysis and visualization. In many pattern recognition applications, a feature extraction step is often required to mitigate the curse of dimensionality, a collection of negative effects caused by an arbitrary increase in the number of features in classification tasks. Principal component analysis (PCA) is a classical statistical method that creates new features as linear combinations of the original ones through the eigenvectors of the covariance matrix. In this paper, we propose PCA-KL, a parametric dimensionality reduction algorithm for unsupervised metric learning based on the computation of the entropic covariance matrix, a surrogate for the covariance matrix of the data obtained in terms of the relative entropy between local Gaussian distributions instead of the usual Euclidean distance between the data points. Numerical experiments with several real datasets show that the proposed method is capable of producing better-defined clusters and higher classification accuracy in comparison to regular PCA and several manifold learning algorithms, making PCA-KL a promising alternative for unsupervised metric learning. PubDate: 2021-01-07 DOI: 10.1007/s11634-020-00434-3
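The entropic ingredient relies on the relative entropy between local Gaussians, which is closed-form in the univariate case: \(KL(N(\mu_1,\sigma_1^2)\,\Vert\,N(\mu_2,\sigma_2^2)) = \log(\sigma_2/\sigma_1) + (\sigma_1^2 + (\mu_1-\mu_2)^2)/(2\sigma_2^2) - 1/2\). Since KL is asymmetric, a symmetrized version is the natural surrogate distance (the univariate restriction is our simplification of the paper's multivariate setting):

```python
import math

def kl_gaussian(mu1, s1, mu2, s2):
    """KL divergence KL( N(mu1, s1^2) || N(mu2, s2^2) ), closed form."""
    return math.log(s2 / s1) + (s1 ** 2 + (mu1 - mu2) ** 2) / (2 * s2 ** 2) - 0.5

def sym_kl(mu1, s1, mu2, s2):
    """Symmetrized KL, usable as a surrogate distance between local models."""
    return kl_gaussian(mu1, s1, mu2, s2) + kl_gaussian(mu2, s2, mu1, s1)
```

Replacing Euclidean point-to-point distances with such divergences between locally fitted Gaussians is what turns the ordinary covariance matrix into the entropic covariance matrix that PCA-KL diagonalizes.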

Abstract: We introduce a latent subspace model which facilitates model-based clustering of functional data. Flexible clustering is attained by imposing joint generalized hyperbolic distributions on projections of basis expansion coefficients into group-specific subspaces. The model acquires parsimony by assuming these subspaces are of relatively low dimension. Parameter estimation is done through a multicycle ECM algorithm. Applications to simulated and real datasets illustrate competitive clustering capabilities and demonstrate the model’s general applicability. PubDate: 2021-01-07 DOI: 10.1007/s11634-020-00432-5

Abstract: A model is proposed to analyze longitudinal data where two response variables are available, one of which is a binary indicator of selection and the other is continuous and observed only if the first is equal to 1. The model also accounts for individual covariates and may be considered as a bivariate finite mixture growth model as it is based on three submodels: (i) a probit model for the selection variable; (ii) a linear model for the continuous variable; and (iii) a multinomial logit model for the class membership. To suitably address endogeneity, the first two components rely on correlated errors as in a standard selection model. The proposed approach is applied to the analysis of the dynamics of household portfolio choices based on an unbalanced panel dataset of Italian households over the 1998–2014 period. For this dataset, we identify three latent classes of households with specific investment behaviors and we assess the effect of individual characteristics on households’ portfolio choices. Our empirical findings also confirm the need to jointly model risky asset market participation and the conditional portfolio share to properly analyze investment behaviors over the life-cycle. PubDate: 2020-12-29 DOI: 10.1007/s11634-020-00433-4