Advances in Data Analysis and Classification
Journal Prestige (SJR): 1.09 | Citation Impact (CiteScore): 1 | Number of Followers: 51
Hybrid journal (may contain Open Access articles)
ISSN (Print): 1862-5355 | ISSN (Online): 1862-5347
Published by Springer-Verlag
• Editorial for issue 2/2018
• Pages: 173 - 177
PubDate: 2018-06-01
DOI: 10.1007/s11634-018-0328-z
Issue No: Vol. 12, No. 2 (2018)

• Probabilistic clustering via Pareto solutions and significance tests
• Authors: María Teresa Gallegos; Gunter Ritter
Pages: 179 - 202
Abstract: The present paper proposes a new strategy for probabilistic (often called model-based) clustering. It is well known that local maxima of mixture likelihoods can be used to partition an underlying data set. However, local maxima are rarely unique. Therefore, it remains to select the reasonable solutions, and in particular the desired one. Credible partitions are usually recognized by separation (and cohesion) of their clusters. We use here the p values provided by the classical tests of Wilks, Hotelling, and Behrens–Fisher to single out those solutions that are well separated by location. It has been shown that reasonable solutions to a clustering problem are related to Pareto points in a plot of scale balance vs. model fit of all local maxima. We briefly review this theory and propose as solutions all well-fitting Pareto points in the set of local maxima separated by location in the above sense. We also design a new iterative, parameter-free cutting plane algorithm for the multivariate Behrens–Fisher problem.
PubDate: 2018-06-01
DOI: 10.1007/s11634-016-0278-2
Issue No: Vol. 12, No. 2 (2018)
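The selection step described in the abstract can be sketched with standard tools: run EM from many random starts to collect local maxima, then screen each solution with a two-sample Hotelling test of location between its clusters. This is only a rough stand-in for the paper's Wilks/Hotelling/Behrens-Fisher machinery; the use of scikit-learn's `GaussianMixture`, the number of starts, and the significance level are illustrative choices, not the authors' implementation.

```python
import numpy as np
from scipy import stats
from sklearn.mixture import GaussianMixture

def hotelling_pvalue(X1, X2):
    """Two-sample Hotelling T^2 test for equality of mean vectors."""
    n1, n2, d = len(X1), len(X2), X1.shape[1]
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    # Pooled covariance of the two samples
    S = ((n1 - 1) * np.cov(X1.T) + (n2 - 1) * np.cov(X2.T)) / (n1 + n2 - 2)
    t2 = (n1 * n2 / (n1 + n2)) * (m1 - m2) @ np.linalg.solve(S, m1 - m2)
    # F-transformation of the T^2 statistic
    f = (n1 + n2 - d - 1) / (d * (n1 + n2 - 2)) * t2
    return stats.f.sf(f, d, n1 + n2 - d - 1)

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(5, 1, (100, 2))])

# Collect local maxima from many random EM starts; keep only solutions
# whose clusters are significantly separated by location.
solutions = []
for seed in range(20):
    gm = GaussianMixture(n_components=2, n_init=1, random_state=seed).fit(X)
    labels = gm.predict(X)
    p = hotelling_pvalue(X[labels == 0], X[labels == 1])
    solutions.append((gm.lower_bound_, p))
separated = [s for s in solutions if s[1] < 0.01]
```

In the paper's setting one would additionally plot scale balance against model fit for the retained maxima and keep the well-fitting Pareto points.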

• Eigenvalues and constraints in mixture modeling: geometric and computational issues
• Authors: Luis Angel García-Escudero; Alfonso Gordaliza; Francesca Greselin; Salvatore Ingrassia; Agustín Mayo-Iscar
Pages: 203 - 233
Abstract: This paper presents a review of the use of eigenvalue restrictions for constrained parameter estimation in mixtures of elliptical distributions under the likelihood approach. The restrictions serve a twofold purpose: to avoid convergence to degenerate solutions and to reduce the onset of uninteresting (spurious) local maximizers, related to complex likelihood surfaces. The paper shows how the constraints may play a key role in the theory of Euclidean data clustering. The aim here is to provide a reasoned survey of the constraints and their applications, considering the contributions of many authors and spanning the literature of the last 30 years.
PubDate: 2018-06-01
DOI: 10.1007/s11634-017-0293-y
Issue No: Vol. 12, No. 2 (2018)

• A data driven equivariant approach to constrained Gaussian mixture modeling
• Authors: Roberto Rocci; Stefano Antonio Gattone; Roberto Di Mari
Pages: 235 - 260
Abstract: Maximum likelihood estimation of Gaussian mixture models with different class-specific covariance matrices is known to be problematic. This is due to the unboundedness of the likelihood, together with the presence of spurious maximizers. Existing methods to bypass this obstacle are based on the fact that unboundedness is avoided if the eigenvalues of the covariance matrices are bounded away from zero. This can be done by imposing constraints on the covariance matrices, i.e. by incorporating a priori information on the covariance structure of the mixture components. The present work introduces a constrained approach, where the class conditional covariance matrices are shrunk towards a pre-specified target matrix $$\varvec{\varPsi }$$. Data-driven choices of the matrix $$\varvec{\varPsi }$$, when a priori information is not available, and the optimal amount of shrinkage are investigated. Then, constraints based on a data-driven $$\varvec{\varPsi }$$ are shown to be equivariant with respect to linear affine transformations, provided that the method used to select the target matrix is also equivariant. The effectiveness of the proposal is evaluated on the basis of a simulation study and an empirical example.
PubDate: 2018-06-01
DOI: 10.1007/s11634-016-0279-1
Issue No: Vol. 12, No. 2 (2018)
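The core shrinkage step admits a very small sketch: each class covariance is pulled toward a common target, which bounds its eigenvalues away from zero and so keeps the mixture likelihood bounded. The function name, the identity target, and the fixed shrinkage weight below are illustrative stand-ins; the paper studies data-driven choices of both the target and the amount of shrinkage.

```python
import numpy as np

def shrink_covariances(covs, target, lam):
    # Shrink each class-conditional covariance toward a common target matrix;
    # lam in [0, 1] controls the amount of shrinkage (lam=1 returns the target).
    return [(1.0 - lam) * S + lam * target for S in covs]

rng = np.random.default_rng(1)
S1 = np.cov(rng.normal(size=(50, 3)).T)
S2 = np.cov(rng.normal(size=(8, 3)).T)   # ill-conditioned: n close to d
target = np.eye(3)                       # e.g. identity when no prior info
S1s, S2s = shrink_covariances([S1, S2], target, lam=0.3)
# Since S2 is positive semi-definite, every eigenvalue of S2s is at least
# lam times the smallest eigenvalue of the target, i.e. bounded away from 0.
```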

• Clustering of imbalanced high-dimensional media data
• Authors: Šárka Brodinová; Maia Zaharieva; Peter Filzmoser; Thomas Ortner; Christian Breiteneder
Pages: 261 - 284
Abstract: Media content in large repositories usually exhibits multiple groups of strongly varying sizes. Media of potential interest often form notably smaller groups. Such media groups differ so much from the remaining data that it may be worthwhile to look at them in more detail. In contrast, media with popular content appear in larger groups. Identifying groups of varying sizes is addressed by clustering of imbalanced data. Clustering highly imbalanced media groups is additionally challenged by the high dimensionality of the underlying features. In this paper, we present the imbalanced clustering (IClust) algorithm designed to reveal group structures in high-dimensional media data. IClust employs an existing clustering method in order to find an initial set of a large number of potentially highly pure clusters which are then successively merged. The main advantage of IClust is that the number of clusters does not have to be pre-specified and that no specific assumptions about the cluster or data characteristics need to be made. Experiments on real-world media data demonstrate that in comparison to existing methods, IClust is able to better identify media groups, especially groups of small sizes.
PubDate: 2018-06-01
DOI: 10.1007/s11634-017-0292-z
Issue No: Vol. 12, No. 2 (2018)
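The overcluster-then-merge idea can be illustrated with a toy sketch. Note this is not the authors' IClust: the k-means initial clustering, the center-distance merge rule, and the threshold are all simplifying assumptions made only to show the shape of the approach.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (200, 5)),   # large "popular" group
               rng.normal(4, 0.3, (15, 5))])   # small group of interest

# Step 1: deliberately overcluster into many small, potentially pure clusters.
km = KMeans(n_clusters=10, n_init=10, random_state=0).fit(X)
clusters = {i: X[km.labels_ == i] for i in range(10)}

# Step 2: successively merge the pair of clusters with the closest centers
# until no pair is closer than a threshold.
def merge_closest(clusters, threshold):
    while len(clusters) > 1:
        keys = list(clusters)
        centers = {k: clusters[k].mean(axis=0) for k in keys}
        a, b = min(((a, b) for i, a in enumerate(keys) for b in keys[i + 1:]),
                   key=lambda ab: np.linalg.norm(centers[ab[0]] - centers[ab[1]]))
        if np.linalg.norm(centers[a] - centers[b]) > threshold:
            break
        clusters[a] = np.vstack([clusters[a], clusters[b]])
        del clusters[b]
    return clusters

final = merge_closest(clusters, threshold=2.0)
```

The number of final clusters falls out of the merge threshold rather than being fixed in advance, which is the property the abstract emphasizes.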

• Clusterwise analysis for multiblock component methods
• Authors: Stéphanie Bougeard; Hervé Abdi; Gilbert Saporta; Ndèye Niang
Pages: 285 - 313
Abstract: Multiblock component methods are applied to data sets in which several blocks of variables are measured on the same set of observations, with the goal of analyzing the relationships between these blocks of variables. In this article, we focus on multiblock component methods that integrate the information found in several blocks of explanatory variables in order to describe and explain one set of dependent variables. In the following, multiblock PLS and multiblock redundancy analysis are chosen as particular cases of multiblock component methods in which one set of variables is explained by a set of predictor variables organized into blocks. Because these multiblock techniques assume that the observations come from a homogeneous population, they will provide suboptimal results when the observations actually come from different populations. A strategy to palliate this problem—presented in this article—is to use a technique such as clusterwise regression in order to identify homogeneous clusters of observations. This approach creates two new methods that provide clusters that have their own sets of regression coefficients. This combination of clustering and regression improves the overall quality of the prediction and facilitates the interpretation. In addition, the minimization of a well-defined criterion—by means of a sequential algorithm—ensures that the algorithm converges monotonically. Finally, the proposed method is distribution-free and can be used when the explanatory variables outnumber the observations within clusters. The proposed clusterwise multiblock methods are illustrated with a simulation study and a (simulated) example from marketing.
PubDate: 2018-06-01
DOI: 10.1007/s11634-017-0296-8
Issue No: Vol. 12, No. 2 (2018)

• Asymptotic comparison of semi-supervised and supervised linear discriminant functions for heteroscedastic normal populations
• Authors: Kenichi Hayashi
Pages: 315 - 339
Abstract: It has been reported that using unlabeled data together with labeled data to construct a discriminant function works successfully in practice. However, theoretical studies have implied that unlabeled data can sometimes adversely affect the performance of discriminant functions. Therefore, it is important to know what situations call for the use of unlabeled data. In this paper, asymptotic relative efficiency is presented as the measure for comparing analyses with and without unlabeled data under the heteroscedastic normality assumption. The linear discriminant function maximizing the area under the receiver operating characteristic curve is considered. Asymptotic relative efficiency is evaluated to investigate when and how unlabeled data contribute to improving discriminant performance under several conditions. The results show that asymptotic relative efficiency depends mainly on the heteroscedasticity of the covariance matrices and the stochastic structure of observing the labels of the cases.
PubDate: 2018-06-01
DOI: 10.1007/s11634-016-0266-6
Issue No: Vol. 12, No. 2 (2018)

• Local generalized quadratic distance metrics: application to the k-nearest neighbors classifier
• Authors: Karim Abou-Moustafa; Frank P. Ferrie
Pages: 341 - 363
Abstract: Finding the set of nearest neighbors for a query point of interest appears in a variety of algorithms for machine learning and pattern recognition. Examples include k nearest neighbor classification, information retrieval, case-based reasoning, manifold learning, and nonlinear dimensionality reduction. In this work, we propose a new approach for determining a distance metric from the data for finding such neighboring points. For a query point of interest, our approach learns a generalized quadratic distance (GQD) metric based on the statistical properties in a “small” neighborhood for the point of interest. The locally learned GQD metric captures information such as the density, curvature, and the intrinsic dimensionality for the points falling in this particular neighborhood. Unfortunately, learning the GQD parameters under such a local learning mechanism is a challenging problem with a high computational overhead. To address these challenges, we estimate the GQD parameters using the minimum volume covering ellipsoid (MVCE) for a set of points. The advantage of the MVCE is two-fold. First, the MVCE together with the local learning approach approximate the functionality of a well known robust estimator for covariance matrices. Second, computing the MVCE is a convex optimization problem which, in addition to having a unique global solution, can be efficiently solved using a first order optimization algorithm. We validate our metric learning approach on a large variety of datasets and show that the proposed metric has promising results when compared with five algorithms from the literature for supervised metric learning.
PubDate: 2018-06-01
DOI: 10.1007/s11634-017-0286-x
Issue No: Vol. 12, No. 2 (2018)
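The MVCE subproblem named in the abstract can be solved with Khachiyan's first-order algorithm, a common choice for this convex problem. The abstract does not state which solver the authors use, so the implementation and tolerances below are illustrative.

```python
import numpy as np

def mvce(P, tol=1e-5, max_iter=2000):
    """Khachiyan's algorithm: minimum volume ellipsoid covering the rows of P.

    Returns (A, c) such that (x - c)^T A (x - c) <= 1 for every row x of P.
    """
    n, d = P.shape
    Q = np.vstack([P.T, np.ones(n)])          # lift points to dimension d+1
    u = np.full(n, 1.0 / n)                   # weights over the points
    for _ in range(max_iter):
        X = Q @ (u[:, None] * Q.T)            # weighted scatter, (d+1)x(d+1)
        M = np.einsum('ij,ji->i', Q.T @ np.linalg.inv(X), Q)
        j = int(np.argmax(M))                 # point furthest outside
        step = (M[j] - d - 1.0) / ((d + 1.0) * (M[j] - 1.0))
        new_u = (1.0 - step) * u
        new_u[j] += step
        if np.linalg.norm(new_u - u) < tol:
            u = new_u
            break
        u = new_u
    c = P.T @ u                               # ellipsoid center
    A = np.linalg.inv(P.T @ (u[:, None] * P) - np.outer(c, c)) / d
    return A, c

rng = np.random.default_rng(0)
pts = rng.normal(size=(30, 2))
A, c = mvce(pts)
vals = np.einsum('ij,jk,ik->i', pts - c, A, pts - c)  # all approximately <= 1
```

Each iteration only shifts weight toward the point furthest outside the current ellipsoid, which is what makes the method cheap enough for the per-query local neighborhoods the paper considers.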

• Unsupervised classification of children’s bodies using currents
• Authors: Sonia Barahona; Ximo Gual-Arnau; Maria Victoria Ibáñez; Amelia Simó
Pages: 365 - 397
Abstract: Object classification according to their shape and size is of key importance in many scientific fields. This work focuses on the case where the size and shape of an object is characterized by a current. A current is a mathematical object which has been proved relevant to the modeling of geometrical data, like submanifolds, through integration of vector fields along them. As a consequence of the choice of a vector-valued reproducing kernel Hilbert space (RKHS) as a test space for integrating manifolds, it is possible to consider that shapes are embedded in this Hilbert Space. A vector-valued RKHS is a Hilbert space of vector fields; therefore, it is possible to compute a mean of shapes, or to calculate a distance between two manifolds. This embedding enables us to consider size-and-shape clustering algorithms. These algorithms are applied to a 3D database obtained from an anthropometric survey of the Spanish child population with a potential application to online sales of children’s wear.
PubDate: 2018-06-01
DOI: 10.1007/s11634-017-0283-0
Issue No: Vol. 12, No. 2 (2018)

• A semiparametric Bayesian joint model for multiple mixed-type outcomes: an application to acute myocardial infarction
• Authors: Alessandra Guglielmi; Francesca Ieva; Anna Maria Paganoni; Fernando A. Quintana
Pages: 399 - 423
Abstract: We propose a Bayesian semiparametric regression model to represent mixed-type multiple outcomes concerning patients affected by Acute Myocardial Infarction. Our approach is motivated by data coming from the ST-Elevation Myocardial Infarction (STEMI) Archive, a multi-center observational prospective clinical study planned as part of the Strategic Program of Lombardy, Italy. We specifically consider a joint model for a variable measuring treatment time and in-hospital and 60-day survival indicators. One of our main motivations is to understand how the various hospitals differ in terms of the variety of information collected as part of the study. To do so we postulate a semiparametric random effects model that incorporates dependence on a location indicator that is used to explicitly differentiate among hospitals in or outside the city of Milano. The model is based on the two parameter Poisson-Dirichlet prior, also known as the Pitman-Yor process prior. We discuss the resulting posterior inference, including sensitivity analysis, and a comparison with the particular sub-model arising when a Dirichlet process prior is assumed.
PubDate: 2018-06-01
DOI: 10.1007/s11634-016-0273-7
Issue No: Vol. 12, No. 2 (2018)

• D-trace estimation of a precision matrix using adaptive Lasso penalties
• Authors: Vahe Avagyan; Andrés M. Alonso; Francisco J. Nogales
Pages: 425 - 447
Abstract: The accurate estimation of a precision matrix plays a crucial role in the current age of high-dimensional data explosion. To deal with this problem, one of the most prominent and commonly used techniques is $$\ell _1$$ norm (Lasso) penalization for a given loss function. This approach guarantees the sparsity of the precision matrix estimate for properly selected penalty parameters. However, the $$\ell _1$$ norm penalization often fails to control the bias of the obtained estimator because of its overestimation behavior. In this paper, we introduce two adaptive extensions of the recently proposed $$\ell _1$$ norm penalized D-trace loss minimization method. They aim at reducing the bias produced in the estimator. Extensive numerical results, using both simulated and real datasets, show the advantage of our proposed estimators.
PubDate: 2018-06-01
DOI: 10.1007/s11634-016-0272-8
Issue No: Vol. 12, No. 2 (2018)
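For context, the non-adaptive ℓ1 idea can be sketched with scikit-learn's graphical lasso, which penalizes the Gaussian likelihood rather than the D-trace loss the paper minimizes; the reweighting in the final comment follows the generic adaptive-Lasso recipe and is not the authors' exact estimator.

```python
import numpy as np
from sklearn.covariance import GraphicalLasso

rng = np.random.default_rng(0)
true_prec = np.eye(5)
true_prec[0, 1] = true_prec[1, 0] = 0.4   # one true conditional dependency
cov = np.linalg.inv(true_prec)
X = rng.multivariate_normal(np.zeros(5), cov, size=500)

# ell_1-penalized sparse precision-matrix estimate (graphical lasso)
model = GraphicalLasso(alpha=0.05).fit(X)
prec = model.precision_

# An adaptive penalty would reweight each entry by the inverse of an initial
# estimate, shrinking likely-zero entries harder to reduce bias on large ones.
weights = 1.0 / (np.abs(prec) + 1e-6)
```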

• Studying crime trends in the USA over the years 2000–2012
• Authors: Volodymyr Melnykov; Xuwen Zhu
Abstract: Studying crime trends and tendencies is an important problem that helps to identify socioeconomic patterns and relationships of crucial significance. Finite mixture models are famous for their flexibility in modeling heterogeneity in data. A novel approach designed to account for skewness in the distributions of matrix observations is proposed and applied to United States crime data collected between 2000 and 2012. Then, the model is further extended by incorporating explanatory variables. A step-by-step model development demonstrates the differences and improvements associated with every stage of the process. Results obtained by the final model are illustrated and thoroughly discussed. Multiple interesting conclusions have been drawn based on the developed model and the obtained model-based clustering partition.
PubDate: 2018-06-23
DOI: 10.1007/s11634-018-0326-1

• Investigating consumers’ store-choice behavior via hierarchical variable selection
• Authors: Toshiki Sato; Yuichi Takano; Takanobu Nakahara
Abstract: This paper is concerned with a store-choice model for investigating consumers’ store-choice behavior based on scanner panel data. Our store-choice model enables us to evaluate the effects of the consumer/product attributes not only on the consumer’s store choice but also on his/her purchase quantity. Moreover, we adopt a mixed-integer optimization (MIO) approach to selecting the best set of explanatory variables with which to construct the store-choice model. We devise two MIO models for hierarchical variable selection in which the hierarchical structure of product categories is used to enhance the reliability and computational efficiency of the variable selection. We assess the effectiveness of our MIO models through computational experiments on actual scanner panel data. These experiments are focused on the consumer’s choice among three types of stores in Japan: convenience stores, drugstores, and (grocery) supermarkets. The computational results demonstrate that our method has several advantages over the common methods for variable selection, namely, the stepwise method and $$L_1$$-regularized regression. Furthermore, our analysis reveals that convenience stores are most strongly chosen for gift cards and garbage disposal permits, drugstores are most strongly chosen for products that are specific to drugstores, and supermarkets are most strongly chosen for health food products by women with families.
PubDate: 2018-06-15
DOI: 10.1007/s11634-018-0327-0

• Unifying data units and models in (co-)clustering
• Abstract: Statisticians are already aware that any task (exploration, prediction) involving a modeling process is largely dependent on the measurement units for the data, to the extent that it should be impossible to provide a statistical outcome without specifying the couple (unit, model). In this work, this general principle is formalized with a particular focus on model-based clustering and co-clustering in the case of possibly mixed data types (continuous and/or categorical and/or counting features), and this opportunity is used to revisit what the related data units are. Such a formalization allows us to highlight three important points: (i) the couple (unit, model) is not identifiable, so that different unit/model interpretations of the same whole modeling process are always possible; (ii) combining different “classical” units with different “classical” models offers an interesting opportunity for a cheap, wide and meaningful expansion of the whole modeling process family designed by the couple (unit, model); (iii) if necessary, this couple, up to the non-identifiability property, can be selected by any traditional model selection criterion. Experiments on real data sets illustrate in detail the practical benefits arising from these three points.
PubDate: 2018-05-25
DOI: 10.1007/s11634-018-0325-2

• Finite mixture biclustering of discrete type multivariate data
• Authors: Daniel Fernández; Richard Arnold; Shirley Pledger; Ivy Liu; Roy Costilla
Abstract: Many of the methods which deal with clustering in matrices of data are based on mathematical techniques such as distance-based algorithms or matrix decomposition and eigenvalues. In general, it is not possible to use statistical inference or to assess the appropriateness of a model via information criteria with these techniques because there is no underlying probability model. This article summarizes some recent model-based methodologies for matrices of binary, count, and ordinal data, which are modelled under a unified statistical framework using finite mixtures to group the rows and/or columns. The model parameter can be constructed from a linear predictor of parameters and covariates through link functions. This likelihood-based one-mode and two-mode fuzzy clustering provides maximum likelihood estimation of parameters and the option of using likelihood information criteria for model comparison. Additionally, a Bayesian approach is presented in which the parameters and the number of clusters are estimated simultaneously from their joint posterior distribution. Visualization tools focused on ordinal data, the fuzziness of the clustering structures, and analogies of various standard plots used in the multivariate analysis are presented. Finally, a set of future extensions is enumerated.
PubDate: 2018-05-15
DOI: 10.1007/s11634-018-0324-3

• Directional co-clustering
• Authors: Aghiles Salah; Mohamed Nadif
Abstract: Co-clustering addresses the problem of simultaneously clustering both dimensions of a data matrix. When dealing with high-dimensional sparse data, co-clustering turns out to be more beneficial than one-sided clustering, even if one is interested in clustering along one dimension only. Aside from being high dimensional and sparse, some datasets, such as document-term matrices, exhibit directional characteristics, and the $$L_2$$ normalization of such data, so that it lies on the surface of a unit hypersphere, is useful. Popular co-clustering assumptions such as Gaussian or Multinomial are inadequate for this type of data. In this paper, we extend the scope of co-clustering to directional data. We present the diagonal block mixture of von Mises–Fisher distributions (dbmovMFs), a co-clustering model which is well suited for directional data lying on a unit hypersphere. By estimating the model parameters under the maximum likelihood (ML) and classification ML approaches, we develop a class of EM algorithms for fitting dbmovMFs to data. Extensive experiments on several real-world datasets confirm the advantage of our approach and demonstrate the effectiveness of our algorithms.
PubDate: 2018-04-30
DOI: 10.1007/s11634-018-0323-4
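A hard-assignment special case of the movMF family is spherical k-means, sketched below. The EM algorithms in the paper additionally estimate concentrations and mixing proportions, and the farthest-point initialization here is an illustrative choice, not the authors'.

```python
import numpy as np

def spherical_kmeans(X, k, n_iter=50):
    # Hard-assignment limit of a von Mises-Fisher mixture: rows are
    # L2-normalized, centroids stay on the unit sphere, and points are
    # assigned by cosine similarity.
    X = X / np.linalg.norm(X, axis=1, keepdims=True)
    C = np.empty((k, X.shape[1]))
    C[0] = X[0]
    for j in range(1, k):                      # farthest-point initialization
        sims = np.max(X @ C[:j].T, axis=1)
        C[j] = X[np.argmin(sims)]
    for _ in range(n_iter):
        labels = np.argmax(X @ C.T, axis=1)    # nearest centroid in cosine
        for j in range(k):
            m = X[labels == j].sum(axis=0)
            if np.linalg.norm(m) > 0:
                C[j] = m / np.linalg.norm(m)   # normalized mean direction
    return labels, C

rng = np.random.default_rng(1)
docs = np.vstack([rng.normal([5, 0, 0], 0.5, (50, 3)),
                  rng.normal([0, 5, 0], 0.5, (50, 3))])
labels, C = spherical_kmeans(docs, k=2)
```

The centroid update (normalize the cluster sum) is exactly the ML estimate of a von Mises-Fisher mean direction, which is why this serves as the degenerate case of the model.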

• Variable selection in model-based clustering and discriminant analysis with a regularization approach
• Authors: Gilles Celeux; Cathy Maugis-Rabusseau; Mohammed Sedki
Abstract: Several methods for variable selection have been proposed in model-based clustering and classification. These make use of backward or forward procedures to define the roles of the variables. Unfortunately, such stepwise procedures are slow, and the resulting algorithms are inefficient when analyzing large data sets with many variables. In this paper, we propose an alternative regularization approach for variable selection in model-based clustering and classification. In our approach, the variables are first ranked using a lasso-like procedure in order to avoid slow stepwise algorithms. Thus, the variable selection methodology of Maugis et al. (Comput Stat Data Anal 53:3872–3882, 2009b) can be efficiently applied to high-dimensional data sets.
PubDate: 2018-04-11
DOI: 10.1007/s11634-018-0322-5
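The ranking stage can be illustrated in a regression setting with the Lasso path: variables are ordered by the penalty level at which they first enter the model. This is a generic stand-in under a known response; the paper's lasso-like procedure ranks variables for clustering, where no response is observed.

```python
import numpy as np
from sklearn.linear_model import lasso_path

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = 3.0 * X[:, 2] - 2.0 * X[:, 7] + rng.normal(scale=0.5, size=200)

# coefs has shape (n_features, n_alphas), with alphas in decreasing order
alphas, coefs, _ = lasso_path(X, y)

# Index of the first (largest) alpha at which each coefficient is nonzero;
# variables that never enter get the worst possible rank.
entry = np.array([np.argmax(np.abs(coefs[j]) > 0) if np.any(coefs[j])
                  else coefs.shape[1]
                  for j in range(X.shape[1])])
ranking = np.argsort(entry)    # informative variables enter the path first
```

Ranking by path entry is a single convex fit, which is what replaces the slow stepwise search over variable subsets.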

• Linear components of quadratic classifiers
• Authors: José R. Berrendero; Javier Cárcamo
Abstract: We obtain a decomposition of any quadratic classifier in terms of products of hyperplanes. These hyperplanes can be viewed as relevant linear components of the quadratic rule (with respect to the underlying classification problem). As an application, we introduce the associated multidirectional classifier; a piecewise linear classification rule induced by the approximating products. Such a classifier is useful to determine linear combinations of the predictor variables with ability to discriminate. We also show that this classifier can be used as a tool to reduce the dimension of the data and helps identify the most important variables to classify new elements. Finally, we illustrate with a real data set the use of these linear components to construct oblique classification trees.
PubDate: 2018-04-07
DOI: 10.1007/s11634-018-0321-6

• A bivariate index vector for measuring departure from double symmetry in square contingency tables
• Authors: Shuji Ando; Kouji Tahata; Sadao Tomizawa
Abstract: For square contingency tables, a double symmetry model having a matrix structure that combines both symmetry and point symmetry was proposed. Also, an index which represents the degree of departure from double symmetry was proposed. However, this index cannot simultaneously characterize the degree of departure from symmetry and the degree of departure from point symmetry. For measuring the degree of departure from double symmetry, the present paper proposes a bivariate index vector that can simultaneously characterize the degree of departure from symmetry and the degree of departure from point symmetry.
PubDate: 2018-03-26
DOI: 10.1007/s11634-018-0320-7
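The two kinds of departure can be illustrated with naive sum-of-squares summaries on a probability table; these hypothetical measures are not the paper's model-based index vector, but they show the two structures being compared: symmetry matches p_ij with p_ji, and point symmetry matches p_ij with the cell reflected through the table's center.

```python
import numpy as np

def departure_from_symmetry(N):
    # Zero iff p_ij = p_ji for all i, j
    P = N / N.sum()
    return np.sum((P - P.T) ** 2) / 2

def departure_from_point_symmetry(N):
    # Zero iff p_ij equals the cell rotated 180 degrees about the center
    P = N / N.sum()
    return np.sum((P - np.rot90(P, 2)) ** 2) / 2

N = np.array([[10, 5, 2],
              [5, 20, 4],
              [2, 4, 30]])   # symmetric but not point-symmetric
```

A double symmetry model requires both quantities to vanish; the paper's bivariate index characterizes the two departures separately rather than collapsing them into one number.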

• An efficient random forests algorithm for high dimensional data classification
• Authors: Qiang Wang; Thanh-Tung Nguyen; Joshua Z. Huang; Thuy Thi Nguyen
Abstract: In this paper, we propose a new random forest (RF) algorithm to deal with high dimensional data for classification, using a subspace feature sampling method and feature value searching. The new subspace sampling method maintains the diversity and randomness of the forest and enables one to generate trees with a lower prediction error. A greedy technique is used to handle cardinal categorical features for efficient node splitting when building decision trees in the forest. This allows trees to handle very high cardinality while reducing the computational time of building the RF model. Extensive experiments on high dimensional real data sets, including standard machine learning data sets and image data sets, have been conducted. The results demonstrate that the proposed approach for learning RFs significantly reduces prediction errors and outperforms most existing RFs when dealing with high-dimensional data.
PubDate: 2018-03-21
DOI: 10.1007/s11634-018-0318-1
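As a baseline for comparison, a standard random forest already performs per-node subspace feature sampling via `max_features`; the paper's subspace sampling method and feature-value search differ from this, so the sketch below only sets the stage, using a synthetic high-dimensional dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# High-dimensional synthetic data: many features, few informative ones
X, y = make_classification(n_samples=300, n_features=200, n_informative=10,
                           random_state=0)

# Baseline RF: each node split considers a random subspace of sqrt(p) features
rf = RandomForestClassifier(n_estimators=100, max_features='sqrt',
                            random_state=0).fit(X, y)
acc = rf.score(X, y)   # training accuracy
```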

JournalTOCs
School of Mathematical and Computer Sciences
Heriot-Watt University
Edinburgh, EH14 4AS, UK
Email: journaltocs@hw.ac.uk
Tel: +00 44 (0)131 4513762
Fax: +00 44 (0)131 4513327