Abstract: Publication date: September 2019. Source: Journal of Multivariate Analysis, Volume 173. Author(s): Luis A. Arteaga-Molina, Juan M. Rodríguez-Poo. In this paper, local empirical likelihood-based inference for nonparametric categorical varying coefficient panel data models with fixed effects under cross-sectional dependence is investigated. First, we show that the naive empirical likelihood ratio is asymptotically standard chi-squared using a nonparametric version of Wilks’ theorem. The ratio is self-scale invariant, so no plug-in estimate of the limiting variance is needed. As a by-product, we also propose an empirical maximum likelihood estimator of the categorical varying coefficient model and obtain the asymptotic distribution of this estimator. We illustrate the proposed technique in an application that reports estimates of strike activity in 17 OECD countries for the period 1951–85.

Abstract: Publication date: September 2019. Source: Journal of Multivariate Analysis, Volume 173. Author(s): Viktor Bengs, Matthias Eulert, Hajo Holzmann. We construct uniform and point-wise asymptotic confidence sets for the single edge in an otherwise smooth image function, based on rotated differences of two one-sided kernel estimators. Using methods from M-estimation, we show consistency of the estimators of the location, slope, and height of the edge function and develop a uniform linearization of the contrast process. The uniform confidence bands then rely on a Gaussian approximation of the score process together with anti-concentration results for suprema of Gaussian processes, while the point-wise bands are based on asymptotic normality. The finite-sample performance of the proposed point-wise methods is investigated in a simulation study. An illustration on real-world image processing is also given.

Abstract: Publication date: September 2019. Source: Journal of Multivariate Analysis, Volume 173. Author(s): Mateo Díaz, Adolfo J. Quiroz, Mauricio Velasco. For data living in a manifold M ⊆ R^m and a point p ∈ M, we consider a statistic U_{k,n} which estimates the variance of the angle between pairs (X_i − p, X_j − p) of vectors, for data points X_i, X_j near p, and we evaluate this statistic as a tool for estimating the intrinsic dimension of M at p. Consistency of the local dimension estimator is established and the asymptotic distribution of U_{k,n} is found under minimal regularity assumptions. The performance of the proposed methodology is compared against state-of-the-art methods on simulated data and real datasets.
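A toy version of the angle-variance idea can be sketched as follows. This is an illustrative re-implementation, not the authors' U_{k,n} statistic or its asymptotic calibration: the Monte Carlo reference table over candidate dimensions, the neighbourhood size k, and the assumption that p is a row of X are all our own simplifications.

```python
import numpy as np

def angle_variance(vectors):
    """Sample variance of the pairwise angles between a set of vectors."""
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    cosines = np.clip(unit @ unit.T, -1.0, 1.0)     # guard against rounding
    i, j = np.triu_indices(len(unit), k=1)          # each unordered pair once
    return np.var(np.arccos(cosines[i, j]))

def estimate_dimension(X, p, k=20, max_dim=10, n_mc=500, seed=0):
    """Estimate the local intrinsic dimension at p (assumed to be a row of X):
    compare the observed angle variance among the k nearest neighbours of p
    with Monte Carlo reference values for uniform directions on S^(d-1)."""
    rng = np.random.default_rng(seed)
    order = np.argsort(np.linalg.norm(X - p, axis=1))
    observed = angle_variance(X[order[1:k + 1]] - p)   # skip p itself
    refs = {d: angle_variance(rng.standard_normal((n_mc, d)))
            for d in range(1, max_dim + 1)}
    return min(refs, key=lambda d: abs(refs[d] - observed))
```

The intuition matching the abstract: directions from p to its neighbours behave like uniform directions on a sphere whose dimension is the intrinsic dimension of M at p, and the variance of pairwise angles decreases as that dimension grows.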

Abstract: Publication date: September 2019. Source: Journal of Multivariate Analysis, Volume 173. Author(s): Xinyi Li, Li Wang, Dan Nettleton. The additive partially linear model (APLM) combines the flexibility of nonparametric regression with the parsimony of regression models, and has become a popular tool in multivariate nonparametric regression for alleviating the “curse of dimensionality”. A natural question raised in practice is the choice of structure in the nonparametric part, i.e., whether the continuous covariates enter the model in linear or nonparametric form. In this paper, we present a comprehensive framework for simultaneous sparse model identification and learning for ultra-high-dimensional APLMs where both the linear and nonparametric components are possibly larger than the sample size. We propose a fast and efficient two-stage procedure. In the first stage, we decompose the nonparametric functions into a linear part and a nonlinear part. The nonlinear functions are approximated by constant spline bases, and a triple penalization procedure is proposed to select nonzero components using the adaptive group LASSO. In the second stage, we refit the data with the selected covariates using higher-order polynomial splines, and apply spline-backfitted local-linear smoothing to obtain asymptotic normality for the estimators. The procedure is shown to be consistent for model structure identification: it can identify zero, linear, and nonlinear components correctly and efficiently. Inference can be made on both linear coefficients and nonparametric functions. We conduct simulation studies to evaluate the performance of the method and apply the proposed method to a dataset on the Shoot Apical Meristem (SAM) of maize genotypes for illustration.

Abstract: Publication date: September 2019. Source: Journal of Multivariate Analysis, Volume 173. Author(s): Serge Darolles, Gaëlle Le Fol, Yang Lu, Ran Sun. We propose a new family of bivariate nonnegative integer-autoregressive (BINAR) models for count process data. We first generalize the existing BINAR(1) model by allowing for dependent thinning operators and arbitrary innovation distributions. The extended family allows for intuitive interpretation, as well as tractable aggregation and stationarity properties. We then introduce higher order BINAR(p) and BINAR(∞) dynamics to accommodate more flexible serial dependence patterns. So far, the literature has regarded such models as computationally intractable. We show that the extended BINAR family allows for closed-form predictive distributions at any horizon and for any value of p, which significantly facilitates non-linear forecasting and likelihood-based estimation. Finally, a BINAR(∞) model with memory persistence is applied to open-ended mutual fund purchase and redemption order counts.
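The building block of such models, binomial thinning, is easy to simulate. The sketch below covers only the simplest bivariate special case with independent thinning operators and Poisson innovations, not the dependent-thinning, arbitrary-innovation family the paper develops:

```python
import numpy as np

def simulate_binar1(T, alphas, lams, seed=0):
    """Simulate a basic bivariate INAR(1) process
        X_{t,i} = alpha_i o X_{t-1,i} + eps_{t,i},
    where 'o' is binomial thinning (each of the X_{t-1,i} counts survives
    independently with probability alpha_i) and eps_{t,i} ~ Poisson(lam_i).
    The two components are simulated independently here for simplicity."""
    rng = np.random.default_rng(seed)
    x = np.zeros((T, 2), dtype=np.int64)
    for t in range(1, T):
        for i in range(2):
            survivors = rng.binomial(x[t - 1, i], alphas[i])  # thinning
            x[t, i] = survivors + rng.poisson(lams[i])        # innovation
    return x
```

For Poisson innovations, the stationary marginal of component i is Poisson with mean lam_i / (1 − alpha_i), which gives a quick sanity check on long simulated paths.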

Abstract: Publication date: September 2019. Source: Journal of Multivariate Analysis, Volume 173. Author(s): Asanka Gunawardana, Frank Konietschke. We develop purely nonparametric multiple inference methods for general multivariate data that neither assume any specific data distribution nor identical covariance matrices across the treatment groups. Continuous, discrete, and even ordered categorical (ordinal) data can be analyzed with these procedures in a unified way. To test hypotheses formulated in terms of purely nonparametric treatment effects, we derive pseudo-rank-based multiple contrast tests and simultaneous confidence intervals. Moreover, the simultaneous confidence intervals are compatible with the multiple comparisons. The small-sample performance of the procedures is examined in a simulation study, which indicates that the proposed procedures (i) control the family-wise error rate quite accurately and (ii) have substantially higher power under non-normality than mean-based parametric competing methods. Application of the proposed tests is demonstrated by analyzing a real data set.

Abstract: Publication date: September 2019. Source: Journal of Multivariate Analysis, Volume 173. Author(s): Rounak Dey, Seunggeun Lee. With the development of high-throughput technologies, principal component analysis (PCA) in the high-dimensional regime is of great interest. Most of the existing theoretical and methodological results for high-dimensional PCA are based on the spiked population model in which all the population eigenvalues are equal except for a few large ones. Due to the presence of local correlation among features, however, this assumption may not be satisfied in many real-world datasets. To address this issue, we investigate the asymptotic behavior of PCA under the generalized spiked population model. Based on our theoretical results, we propose a series of methods for the consistent estimation of population eigenvalues, angles between the sample and population eigenvectors, correlation coefficients between the sample and population principal component (PC) scores, and the shrinkage bias adjustment for the predicted PC scores. Using numerical experiments and real data examples from the genetics literature, we show that our methods can greatly reduce bias and improve prediction accuracy.
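The upward bias of sample eigenvalues that such methods must correct can be seen already in the standard (non-generalized) spiked model. The sketch and first-order formula below are textbook random-matrix facts used here only for illustration; they are not the paper's estimators for the generalized model.

```python
import numpy as np

def top_sample_eigenvalue(n, p, spike, seed=0):
    """Largest eigenvalue of the sample covariance when the population
    covariance is diag(spike, 1, ..., 1), i.e., one spiked eigenvalue."""
    rng = np.random.default_rng(seed)
    Z = rng.standard_normal((n, p))
    Z[:, 0] *= np.sqrt(spike)       # inject the spike along one coordinate
    S = Z.T @ Z / n                 # sample covariance matrix
    return np.linalg.eigvalsh(S)[-1]

# Classical spiked-model prediction, with gamma = p / n: for a spike
# ell > 1 + sqrt(gamma), the top sample eigenvalue converges to
#     ell * (1 + gamma / (ell - 1)),
# which strictly exceeds ell, so the naive sample eigenvalue is biased up.
```

For example, with n = 2000, p = 400 (gamma = 0.2) and spike 5, the predicted limit is 5 × (1 + 0.2/4) = 5.25 rather than 5, and the bulk of the remaining eigenvalues stays below the Marchenko–Pastur edge (1 + √0.2)².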

Abstract: Publication date: September 2019. Source: Journal of Multivariate Analysis, Volume 173. Author(s): Feifei Chen, Simos G. Meintanis, Lixing Zhu. We propose three new characterizations and corresponding distance-based weighted test criteria for the two-sample problem, and for testing symmetry and independence with multivariate data. All quantities have the common feature of involving characteristic functions, and it is seen that these quantities are intimately related to some earlier methods, thereby generalizing them. The connection rests on a special choice of the weight function involved. Equivalent expressions of the distances in terms of densities are given, as well as a Bayesian interpretation of the weight function involved. The asymptotic behavior of the tests is investigated both under the null hypothesis and under alternatives, and affine invariant versions of the test criteria are suggested. Numerical studies are conducted to examine the performance of the criteria. It is shown that the normal weight function, hitherto the most often used, is seriously suboptimal. The procedures are biased in the sense that the corresponding test criteria degenerate in high dimension, and hence a bias correction is required as the dimension tends to infinity.

Abstract: Publication date: September 2019. Source: Journal of Multivariate Analysis, Volume 173. Author(s): Thanh Mai Pham Ngoc. This paper considers nonparametric density estimation with directional data. A new rule is proposed for bandwidth selection for kernel density estimation. The procedure is automatic, fully data-driven, and adaptive to the degree of smoothness of the density. An oracle inequality and optimal rates of convergence for the L2 error are derived. These theoretical results are illustrated with simulations.
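For the circular (2-dimensional direction) special case, a kernel density estimator with a data-driven concentration parameter can be sketched as follows. This uses a von Mises kernel and a generic leave-one-out likelihood cross-validation rule over a fixed candidate grid, which is a standard device and not the adaptive, oracle-backed selection rule of the paper:

```python
import numpy as np

def vm_kde(grid, data, kappa):
    """von Mises kernel density estimate on the circle; the concentration
    kappa acts as an inverse bandwidth (larger kappa = less smoothing)."""
    diffs = grid[:, None] - data[None, :]
    return np.exp(kappa * np.cos(diffs)).mean(axis=1) / (2 * np.pi * np.i0(kappa))

def cv_kappa(data, kappas):
    """Pick kappa from a candidate list by leave-one-out likelihood
    cross-validation: maximize the sum of log leave-one-out densities."""
    n = len(data)
    best, best_ll = kappas[0], -np.inf
    for kappa in kappas:
        K = np.exp(kappa * np.cos(data[:, None] - data[None, :]))
        np.fill_diagonal(K, 0.0)                      # drop the i = j terms
        loo = K.sum(axis=1) / ((n - 1) * 2 * np.pi * np.i0(kappa))
        ll = np.log(loo).sum()
        if ll > best_ll:
            best, best_ll = kappa, ll
    return best
```

Note `np.i0` is NumPy's modified Bessel function of the first kind, order zero, which normalizes the von Mises kernel so each estimate integrates to one over the circle.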

Abstract: Publication date: September 2019. Source: Journal of Multivariate Analysis, Volume 173. Author(s): Shigeyuki Hamori, Kaiji Motegi, Zheng Zhang. This paper investigates the estimation of semiparametric copula models with data missing at random. The maximum pseudo-likelihood estimation of Genest et al. (1995) is infeasible if there are missing data. We propose a class of calibration estimators for the nonparametric marginal distributions and the copula parameters of interest by balancing the empirical moments of covariates between the observed and whole groups. Our proposed estimators do not require estimation of the missingness mechanism, and they enjoy stable performance even when the sample size is small. We prove that our estimators satisfy consistency and asymptotic normality. We also provide a consistent estimator for the asymptotic variance. We show via extensive simulations that our proposed method dominates existing alternatives.

Abstract: Publication date: September 2019. Source: Journal of Multivariate Analysis, Volume 173. Author(s): Lea Petrella, Valentina Raponi. This paper proposes a maximum likelihood approach to jointly estimate marginal conditional quantiles of multivariate response variables in a linear regression framework. We consider a slight reparameterization of the multivariate asymmetric Laplace distribution proposed by Kotz et al. (2001) and exploit its location–scale mixture representation to implement a new EM algorithm for estimating model parameters. The idea is to extend the link between the asymmetric Laplace distribution and the well-known univariate quantile regression model to a multivariate context, i.e., when a multivariate dependent variable is concerned. The approach accounts for association among multiple responses and studies how the relationship between responses and explanatory variables can vary across different quantiles of the marginal conditional distribution of the responses. A penalized version of the EM algorithm is also presented to tackle the problem of variable selection. The validity of our approach is analyzed in a simulation study, where we also provide evidence on the efficiency gain of the proposed method compared to estimation obtained by separate univariate quantile regressions. A real data application examines the main determinants of financial distress in a sample of Italian firms.

Abstract: Publication date: September 2019. Source: Journal of Multivariate Analysis, Volume 173. Author(s): Kelly Ramsay, Stéphane Durocher, Alexandre Leblanc. We study depth measures for multivariate data defined by integrating univariate depth measures: specifically, integrated dual (ID) depth, introduced by Cuevas and Fraiman (2009), which integrates univariate simplicial depth, and integrated rank-weighted (IRW) depth, which integrates univariate Tukey depth. We build on the results of Cuevas and Fraiman (2009) to show that IRW depth shares many depth properties with ID depth. Further, we provide additional results on exact computation, decrease along rays, continuity, and breakdown point that apply to both ID and IRW depth. We also establish asymptotic normality and consistency of the sample IRW depths. Lastly, we demonstrate the use of this depth measure with real and simulated datasets, calculating robust location estimators and dd-plots.
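Because IRW depth only needs univariate Tukey depths of one-dimensional projections, a Monte Carlo approximation of the sample version is straightforward. The sketch below, with Gaussian-generated uniform directions and a fixed number of projections, is our illustrative reading of the construction, not the authors' exact formulation:

```python
import numpy as np

def irw_depth(x, X, n_dirs=500, seed=0):
    """Monte Carlo approximation of integrated rank-weighted depth:
    average, over random unit directions u, the univariate Tukey
    (halfspace) depth of the projection u.x within the projected sample u.X."""
    rng = np.random.default_rng(seed)
    U = rng.standard_normal((n_dirs, X.shape[1]))
    U /= np.linalg.norm(U, axis=1, keepdims=True)   # uniform directions
    proj_X = X @ U.T                                # shape (n, n_dirs)
    proj_x = x @ U.T                                # shape (n_dirs,)
    below = (proj_X <= proj_x).mean(axis=0)         # empirical cdf at u.x
    above = (proj_X >= proj_x).mean(axis=0)         # survival function at u.x
    return np.minimum(below, above).mean()          # average Tukey depth
```

As expected of a depth function, a central point of a symmetric cloud gets depth near 1/2 while far-away points get depth near 0, which is what makes the measure usable for robust location estimation and dd-plots.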

Abstract: Publication date: September 2019. Source: Journal of Multivariate Analysis, Volume 173. Author(s): Liang Liang, Yanyuan Ma, Raymond J. Carroll. Case-control studies are popular epidemiological designs for detecting gene–environment interactions in the etiology of complex diseases, where the genetic susceptibility and environmental exposures may often be reasonably assumed independent in the source population. Various papers have presented analytical methods exploiting gene–environment independence to achieve better efficiency, all of which require either a rare disease assumption or a distributional assumption on the genetic variables. We relax both assumptions. We construct a semiparametric estimator in case-control studies exploiting gene–environment independence, while the distributions of genetic susceptibility and environmental exposures are both unspecified and the disease rate is assumed unknown and is not required to be close to zero. The resulting estimator is semiparametric efficient, and its superiority over prospective logistic regression, the usual analysis in case-control studies, is demonstrated in various numerical illustrations.

Abstract: Publication date: September 2019. Source: Journal of Multivariate Analysis, Volume 173. Author(s): I-Ping Tu, Su-Yun Huang, Dai-Ni Hsieh. Tensor data, such as image sets, movie data, gene–environment interactions, or gene–gene interactions, have become a popular data format in many fields. Multilinear Principal Component Analysis (MPCA) has been recognized as an efficient dimension reduction method for tensor data analysis. However, a satisfactory rank selection method for general applications of MPCA is not yet available. For example, both the Akaike Information Criterion (AIC) and the Bayesian Information Criterion (BIC), arguably two of the most commonly used model selection methods, require stricter model assumptions when applied to rank selection in MPCA. In this paper, we propose a rank selection rule for MPCA based on the minimum risk criterion and Stein’s unbiased risk estimate (SURE). We derive a neat formula under minimal model assumptions for MPCA. It is composed of a residual sum of squares for model fitting and a penalty on the model complexity, referred to as the generalized degrees of freedom (GDF). We allocate each term in the GDF to either the number of parameters used in the model or the complexity of separating the signal from the noise. Compared with AIC, BIC, and their modifications, this criterion achieves higher accuracy in a thorough simulation study. Importantly, it has potential for more general application because it makes fewer model assumptions.

Abstract: Publication date: September 2019. Source: Journal of Multivariate Analysis, Volume 173. Author(s): Dominique Fourdrinier, Éric Marchand, William E. Strawderman. Let X, Y, U be independently distributed as X ∼ N_d(θ, σ²I_d), Y ∼ N_d(cθ, σ²I_d), and U⊤U ∼ σ²χ²_k, or more generally spherically symmetrically distributed with density η^{d+k∕2} f{η(‖x−θ‖² + ‖u‖² + ‖y−cθ‖²)}, with unknown parameters θ ∈ R^d and η = 1∕σ² > 0, known density f, and c ∈ R+. Based on observing X = x, U = u, we consider the problem of obtaining a predictive density q̂(⋅; x, u) for Y, as measured by the expected Kullback–Leibler loss. A benchmark procedure is the minimum risk equivariant density q̂_MRE, which is generalized Bayes with respect to the prior π(θ, η) = 1∕η. In dimension d ≥ 3, we obtain improvements on q̂_MRE, and further show that the dominance holds simultaneously for all f subject to finite moment and finite risk conditions. We also obtain that the Bayes predictive den...

Abstract: Publication date: September 2019. Source: Journal of Multivariate Analysis, Volume 173. Author(s): Marina Meilă. If we have found a “good” clustering C of a data set, can we prove that C is not far from the (unknown) best clustering C_opt of these data? Perhaps surprisingly, the answer to this question is sometimes yes. This paper gives spectral bounds on the distance d(C, C_opt) for the case when “goodness” is measured by a quadratic cost, such as the squared distortion of K-means clustering or the Normalized Cut criterion of spectral clustering. The bounds exist only if the data admit a “good”, low-cost clustering. The results in this paper are non-asymptotic and model-free, in the sense that no assumptions are made on the data generating process. The bounds do not depend on undefined constants, and can be computed tractably from the data.

Abstract: Publication date: September 2019. Source: Journal of Multivariate Analysis, Volume 173. Author(s): Hyokyoung G. Hong, Qi Zheng, Yi Li. Forward regression, a classical variable screening method, has been widely used for model building when the number of covariates is relatively low. However, forward regression is seldom used in high-dimensional settings because of its cumbersome computation and unknown theoretical properties. Some recent works have shown that forward regression, coupled with an extended Bayesian information criterion (EBIC)-based stopping rule, can consistently identify all relevant predictors in high-dimensional linear regression settings. However, these results are based on the residual sum of squares from linear models, and it is unclear whether forward regression can be applied to more general regression settings, such as Cox proportional hazards models. We introduce a forward variable selection procedure for Cox models. It selects important variables sequentially according to the increment of partial likelihood, with an EBIC stopping rule. To our knowledge, this is the first study that investigates partial likelihood-based forward regression in high-dimensional survival settings and establishes selection consistency results. We show that, if the dimension of the true model is finite, forward regression can discover all relevant predictors within a finite number of steps and that their order of entry is determined by the size of the increment in partial likelihood. As partial likelihood is not a regular density-based likelihood, we develop some new theoretical results on partial likelihood and use them to establish the desired sure screening properties. The practical utility of the proposed method is examined via extensive simulations and an analysis of a subset of the Boston Lung Cancer Survival Cohort study, a hospital-based study for identifying biomarkers related to lung cancer patients’ survival.

Abstract: Publication date: May 2019. Source: Journal of Multivariate Analysis, Volume 171. Author(s): Mao Ye, Zhao-Hua Lu, Yimei Li, Xinyuan Song. Heterogeneous longitudinal data have become prevalent in medical, biological, and social studies. This paper proposes a finite mixture of varying coefficient models for handling heterogeneous populations. Each component of the mixture is modeled by a varying coefficient mixed-effect model that characterizes the longitudinal relations among variables. The identifiability of the mixture model is studied. Regression splines with equally spaced knots are used to approximate the varying coefficient functions, and a nested expectation maximization algorithm is developed to obtain the maximum likelihood estimates. We propose a penalized likelihood method based on the smoothly clipped absolute deviation (SCAD) penalty for component selection in finite mixtures of varying coefficient models. A modified BIC-type criterion based on the SCAD penalty, the BICSCAD, is proposed for selecting the penalty parameter and the spline space simultaneously. The asymptotic properties of parameter estimation and component selection consistency are studied under mild conditions. Simulation studies are conducted to illustrate the component selection, parameter estimation, and inference of the proposed method. The model is then applied to a heterogeneous longitudinal data set from a study of the treatment effect on the use of heroin in the California Civil Addict Program.

Abstract: Publication date: May 2019. Source: Journal of Multivariate Analysis, Volume 171. Author(s): S. Rao Jammalamadaka, György H. Terdik. Fourier analysis, and the representation of circular distributions in terms of their Fourier coefficients, is commonly discussed and used for model-free inference, such as testing uniformity and symmetry, when dealing with 2-dimensional directions. However, a similar discussion for spherical distributions, which are used to model 3-dimensional directional data, is not readily available in the literature in terms of their harmonics. This paper, in what we believe is the first such attempt, looks at probability distributions on a unit sphere through the perspective of spherical harmonics, analogous to the Fourier analysis for distributions on a unit circle. A representation of any continuous spherical density in terms of spherical harmonics is given, and such series expansions are provided for some commonly used spherical distributions, as well as for two new spherical distributions that are introduced. Through the prism of harmonic analysis, one can look at the mean direction, dispersion, and various forms of symmetry for these models in a nonparametric setting. Aspects of distribution-free inference, such as estimation and large-sample tests for various symmetries, are provided, with each type of symmetry characterized through its harmonics. The paper concludes with a real-data example analyzing longitudinal sunspot activity.

Abstract: Publication date: May 2019. Source: Journal of Multivariate Analysis, Volume 171. Author(s): Harry Joe, Haijun Li. Skew-elliptical distributions constitute a large class of multivariate distributions that account for both skewness and a variety of tail properties. This class has simpler representations in terms of densities rather than cumulative distribution functions, and the tail density approach has previously been developed to study tail properties when multivariate densities have more tractable forms. The special skew-elliptical structure allows for derivations of specific forms of the tail densities for those skew-elliptical copulas that admit probability density functions, under heavy- and light-tail conditions on the density generators. The tail densities of skew-elliptical copulas are explicit and depend only on the tail properties of the underlying density generator and conditions on the skewness parameters. In the heavy-tail case, the skewness parameters affect the tail densities of skew-elliptical copulas more profoundly than in the light-tail case, whereas in the latter case the tail densities of skew-elliptical copulas are simply proportional to the tail densities of symmetric elliptical copulas. Various examples, including tail densities of skew-normal and skew-t distributions, are given.

Abstract: Publication date: May 2019. Source: Journal of Multivariate Analysis, Volume 171. Author(s): Koji Tsukuda, Shun Matsuura. Hypothesis testing for the proportionality of covariance matrices is a classical statistical problem and has been widely studied in the literature. However, there have been few treatments of this test in high-dimensional settings, especially for the case where the number of variables is larger than the sample size, despite high-dimensional statistical inference having recently received considerable attention. This paper studies hypothesis testing for the proportionality of two covariance matrices in the high-dimensional setting: m, n ≍ p^δ for some δ ∈ (1∕2, 1), where m and n denote the sample sizes and p denotes the number of variables. A test statistic is proposed and its asymptotic distribution is derived under multivariate normality. The non-asymptotic performance of the proposed test procedure is numerically examined.

Abstract: Publication date: May 2019. Source: Journal of Multivariate Analysis, Volume 171. Author(s): Jingwei Wu, Hanxiang Peng, Wanzhu Tu. By optimizing index functions against different outcomes, we propose a multivariate single-index model (SIM) for the development of medical indices that simultaneously work with multiple outcomes. Fitting a multivariate SIM is not fundamentally different from fitting a univariate SIM, as the former can be written as a sum of multiple univariate SIMs with appropriate indicator functions. What have not been carefully studied are the theoretical properties of the parameter estimators. Because of the lack of asymptotic results, no formal inference procedure has been made available for multivariate SIMs. In this paper, we examine the asymptotic properties of the multivariate SIM parameter estimators. We show that, under mild regularity conditions, the estimators of the multivariate SIM parameters are indeed √n-consistent and asymptotically normal. We conduct a simulation study to investigate the finite-sample performance of the corresponding estimation and inference procedures. To illustrate its use in practice, we construct an index measure of urine electrolyte markers for assessing the risk of hypertension in individual subjects.

Abstract: Publication date: May 2019. Source: Journal of Multivariate Analysis, Volume 171. Author(s): Vinnie Ko, Nils Lid Hjort. This article is concerned with inference in parametric copula setups, where both the marginals and the copula have parametric forms. For such models, two-stage maximum likelihood estimation, often referred to as inference functions for margins, is used as an attractive alternative to the full maximum likelihood estimation strategy. Previous studies of the two-stage maximum likelihood estimator have largely been based on the assumption that the chosen parametric model captures the true model that generated the data. We study the impact of dropping this true model assumption, both theoretically and numerically. We first show that the two-stage maximum likelihood estimator is consistent for a well-defined least false parameter value, different from the analogous least false parameter associated with the full maximum likelihood procedure. Then we demonstrate limiting normality of the full vector of estimators, with concise matrix notation for the variance matrices involved. Along with consistent estimators for these, we have built model-robust machinery for inference in parametric copula models. The special case where the parametric model is assumed to hold corresponds to situations studied earlier in the literature, with simpler formulas for the variance matrices. As a numerical illustration, we perform a set of simulations and also analyze five-dimensional Norwegian precipitation data. We find that the variance of the copula parameter estimate can either increase or decrease when the true model assumption is dropped. In addition, we observe that the two-stage maximum likelihood estimator remains highly efficient when the true model assumption is dropped and the model-robust asymptotic variance formulas are used. We also discover that using highly misspecified models can lead to situations where the asymptotic variance of the two-stage maximum likelihood estimator is lower than that of the full maximum likelihood estimator. Our results are also used to analyze the mean squared error properties of both the full and the two-stage maximum likelihood estimators of any focus parameter.

Abstract: Publication date: May 2019. Source: Journal of Multivariate Analysis, Volume 171. Author(s): Hui Zhao, Dayu Sun, Gang Li, Jianguo Sun. This paper discusses regression analysis of incomplete event history studies with a focus on simultaneous estimation and variable selection. Such studies are commonly performed in areas such as medical studies and the social sciences, and a great deal of literature has been devoted to their analysis, except for the problem considered here (Sun and Zhao, 2013). We develop a new method, which will be referred to as a broken adaptive ridge regression approach. We establish its asymptotic properties, including the oracle property and the clustering effect. We also report simulation results that indicate that the proposed method performs well in practice, and better than existing methods. In addition, an application is provided.

Abstract: Publication date: May 2019. Source: Journal of Multivariate Analysis, Volume 171. Author(s): Liliana Forzani, Daniela Rodriguez, Ezequiel Smucler, Mariela Sued. We consider model-based sufficient dimension reduction for generalized linear models and prove the consistency and asymptotic normality of the prediction estimator studied empirically for the normal case by Adragni and Cook (2009) when a sample version of the sufficient dimension reduction is used. Moreover, we provide a formula for the prediction that does not require explicitly computing the reduction.

Abstract: Publication date: May 2019. Source: Journal of Multivariate Analysis, Volume 171. Author(s): Mengyan Li, Yanyuan Ma, Runze Li. Covariate measurement error is a common problem. Improper treatment of measurement errors may affect the quality of estimation and the accuracy of inference. Extensive literature exists on homoscedastic measurement error models, but little research exists on heteroscedastic measurement error. In this paper, we consider a general parametric regression model allowing for a covariate measured with heteroscedastic error. We allow both the variance function of the measurement errors and the conditional density function of the error-prone covariate given the error-free covariates to be completely unspecified. We treat the variance function using B-spline approximation and propose a semiparametric estimator based on efficient score functions to deal with the heteroscedasticity of the measurement error. The resulting estimator is consistent and enjoys good inference properties. Its finite-sample performance is demonstrated through simulation studies and a real data example.

Abstract: Publication date: May 2019. Source: Journal of Multivariate Analysis, Volume 171. Author(s): Burak Alparslan Eroğlu. In this paper, we propose a wavelet-based cointegration test for fractionally integrated time series. The proposed test is nonparametric and asymptotically invariant to different forms of short-run dynamics. The use of wavelets allows us to take advantage of the wavelet-based bootstrapping method known as wavestrapping. In this regard, we introduce a new wavestrapping algorithm for multivariate time series processes, designed specifically for cointegration tests. Monte Carlo simulations indicate that this new wavestrapping procedure can alleviate the severe size distortions generally observed in cointegration tests applied to time series whose innovations possess highly negative moving average roots.

Abstract: Publication date: May 2019. Source: Journal of Multivariate Analysis, Volume 171. Author(s): Guangren Yang, Ling Zhang, Runze Li, Yuan Huang. The varying-coefficient Cox model is flexible and useful for modeling the dynamic changes of regression coefficients in survival analysis. In this paper, we study feature screening for varying-coefficient Cox models with ultrahigh-dimensional covariates. The proposed screening procedure is based on the joint partial likelihood of all predictors, and is thus different from the marginal screening procedures available in the literature. In order to carry out the new procedure, we propose an effective algorithm and establish its ascent property. We further prove that the proposed procedure possesses the sure screening property: with probability tending to 1, the selected variable set includes the actual active predictors. We conduct simulations to evaluate the finite-sample performance of the proposed procedure and compare it with marginal screening procedures. A genomic data set is used for illustration purposes.

Abstract: Publication date: May 2019. Source: Journal of Multivariate Analysis, Volume 171. Author(s): Jia Zhang, Haoming Shi, Lemeng Tian, Fengjun Xiao. In this paper, we propose a penalized generalized empirical likelihood (PGEL) approach based on the smoothed moment functions of Anatolyev (2005) and Smith (1997, 2004) for parameter estimation and variable selection in the growing (high) dimensional weakly dependent time series setting. The dimensions of the parameters and moment restrictions are both allowed to grow with the sample size at some moderate rates. The asymptotic properties of the estimators of the smoothed generalized empirical likelihood (SGEL) and its penalized version (SPGEL) are then obtained by properly restricting the degree of data dependence. It is shown that the SPGEL estimator maintains the oracle property despite the existence of data dependence and growing (high) dimensionality. We finally present simulation results and a real data analysis to illustrate the finite-sample performance and applicability of our proposed method.

Abstract: Publication date: May 2019. Source: Journal of Multivariate Analysis, Volume 171. Author(s): Jianyu Liu, Guan Yu, Yufeng Liu. Linear discriminant analysis (LDA) is a well-known classification technique that has enjoyed great success in practical applications. Despite its effectiveness for traditional low-dimensional problems, extensions of LDA are necessary in order to classify high-dimensional data. Many variants of LDA have been proposed in the literature. However, most of these methods do not fully incorporate the structure information among predictors when such information is available. In this paper, we introduce a new high-dimensional LDA technique, namely graph-based sparse LDA (GSLDA), that utilizes the graph structure among the features. In particular, we use the regularized regression formulation for penalized LDA techniques, and propose to impose a structure-based sparse penalty on the discriminant vector β. The graph structure can be either given or estimated from the training data. Moreover, we explore the relationship between the within-class feature structure and the overall feature structure. Based on this relationship, we further propose a variant of GSLDA that effectively utilizes unlabeled data, which can be abundant in the semi-supervised learning setting. With the new regularization, we can obtain a sparse estimate of β and more accurate and interpretable classifiers than many existing methods. Both the selection consistency of the β estimate and the convergence rate of the classifier are established, and the resulting classifier achieves the Bayes error rate asymptotically. Finally, we demonstrate the competitive performance of the proposed GSLDA on both simulated and real data studies.

Abstract: Publication date: May 2019. Source: Journal of Multivariate Analysis, Volume 171. Author(s): Young-Geun Choi, Johan Lim, Anindya Roy, Junyong Park. This paper is concerned with the positive definiteness (PDness) problem in covariance matrix estimation. For high-dimensional data, many regularized estimators have been proposed under structural assumptions on the true covariance matrix, including sparsity. They were shown to be asymptotically consistent and rate-optimal in estimating the true covariance matrix and its structure. However, many of them do not take into account the PDness of the estimator and produce a non-PD estimate. To achieve PDness, researchers considered additional regularizations (or constraints) on eigenvalues, which make both the asymptotic analysis and computation much harder. In this paper, we propose a simple modification of the regularized covariance matrix estimator to make it PD while preserving the support. We revisit the idea of linear shrinkage and propose to take a convex combination between the first-stage estimator (the regularized covariance matrix without PDness) and a given form of diagonal matrix. The proposed modification, which we call the FSPD (Fixed Support and Positive Definiteness) estimator, is shown to preserve the asymptotic properties of the first-stage estimator if the shrinkage parameters are carefully selected. It has a closed-form expression and its computation is optimization-free, unlike existing PD sparse estimators. In addition, the FSPD is generic in the sense that it can be applied to any non-PD matrix, including the precision matrix. The FSPD estimator is numerically compared with other sparse PD estimators to understand its finite-sample properties as well as its computational gain. It is also applied to two multivariate procedures relying on the covariance matrix estimator – the linear minimax classification problem and the Markowitz portfolio optimization problem – and is shown to improve substantially the performance of both procedures.
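The linear-shrinkage construction described in this abstract can be sketched in a few lines. The diagonal target mu·I, the eigenvalue floor eps, and the function name are illustrative assumptions of this sketch, not the authors' exact choices; the key points it shows are that the combination has a closed form (no optimization) and that off-diagonal zeros are only rescaled, so the support is preserved:

```python
import numpy as np

def fspd(sigma_hat, eps=1e-2):
    # Hypothetical sketch of the FSPD idea: convex combination of the
    # first-stage estimator and a diagonal target mu * I, using the
    # smallest alpha that lifts the minimum eigenvalue to eps.
    p = sigma_hat.shape[0]
    mu = np.trace(sigma_hat) / p          # diagonal target (assumes mu > eps)
    lam_min = np.linalg.eigvalsh(sigma_hat).min()
    if lam_min >= eps:
        return sigma_hat                  # already sufficiently PD
    # solves alpha * lam_min + (1 - alpha) * mu = eps
    alpha = (mu - eps) / (mu - lam_min)
    return alpha * sigma_hat + (1 - alpha) * mu * np.eye(p)
```

Because the off-diagonal entries are merely multiplied by alpha, zero entries of the first-stage estimator stay zero, which is the "fixed support" part of the construction.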

Abstract: Publication date: May 2019. Source: Journal of Multivariate Analysis, Volume 171. Author(s): Tung Duy Luu, Jalal Fadili, Christophe Chesneau. In this paper, we consider a high-dimensional nonparametric regression model with fixed design and iid random errors. We propose an estimator by exponential weighted aggregation with a group-analysis sparsity and a prior on the weights. We prove that our estimator satisfies a sharp group-analysis sparse oracle inequality with a small remainder term that ensures its good theoretical performance. We also propose a forward–backward proximal Langevin Monte Carlo algorithm to sample from the target distribution (which is neither smooth nor log-concave) and derive its convergence guarantees. In turn, this enables us to implement our estimator and validate it with numerical experiments.

Abstract: Publication date: May 2019. Source: Journal of Multivariate Analysis, Volume 171. Author(s): Nobuhiro Taneichi, Yuri Sekiya, Jun Toyama. Tests of the hypothesis of conditional independence in J×K×L contingency tables are considered. An expression approximating the null distribution of the test statistics is derived. Using this expression, transformed statistics are obtained which converge to a chi-square limiting distribution faster than the original statistics do. Simulations are used to compare the transformed statistics with the original ones, and transformed statistics based on a Bartlett-type adjustment are proposed. Together with earlier work, this covers tests of hierarchical loglinear models in three-way tables, except for a model having three interaction terms.

Abstract: Publication date: May 2019. Source: Journal of Multivariate Analysis, Volume 171. Author(s): Maria Umlauft, Marius Placzek, Frank Konietschke, Markus Pauly. Repeated measures designs are frequently used for planning experiments in the life or social sciences. Typical examples include the comparison of different treatments over time, where both factor levels may possess an additional structure. For such designs, the statistical analysis typically consists of several steps. If the global null is rejected, multiple comparisons are performed. Usually, general factorial repeated measures designs are inferred by classical linear mixed models. Common underlying assumptions, such as normality or variance homogeneity, are, however, often not met in practice. Furthermore, when dealing with, e.g., ordinal or ordered categorical data, means are no longer meaningful to describe an effect and other effect sizes should be used. To this end, we develop multiple contrast tests for nonparametric treatment effects in general factorial repeated measures designs in this paper and equip them with a novel, asymptotically correct wild bootstrap approach. Because regulatory authorities require the calculation of confidence intervals, this work also provides simultaneous confidence intervals for linear contrasts and for the ratio of different contrasts in meaningful effects. Extensive simulations are conducted to corroborate the theoretical findings. Finally, the analysis of two datasets exemplifies the applicability of the novel procedures.

Abstract: Publication date: May 2019. Source: Journal of Multivariate Analysis, Volume 171. Author(s): X. Jessie Jeng, Xiongzhi Chen. We propose a ranking and selection procedure to prioritize relevant predictors and control false discovery proportion (FDP) in variable selection. Our procedure utilizes a new ranking method built upon the de-sparsified Lasso estimator. We show that the new ranking method achieves the optimal order of minimum non-zero effects in ranking relevant predictors ahead of irrelevant ones. Adopting the new ranking method, we develop a variable selection procedure to asymptotically control FDP at a user-specified level. We show that our procedure can consistently estimate the FDP of variable selection as long as the de-sparsified Lasso estimator is asymptotically normal. In simulations, our procedure compares favorably to existing methods in ranking efficiency and FDP control when the regression model is relatively sparse.

Abstract: Publication date: May 2019. Source: Journal of Multivariate Analysis, Volume 171. Author(s): Natalie Neumeyer, Marek Omelka, Šárka Hudecová. This paper is concerned with modeling the dependence structure of two (or more) time series in the presence of a (possibly multivariate) covariate which may include past values of the time series. We assume that the covariate influences only the conditional mean and the conditional variance of each of the time series, but the distribution of the standardized innovations is not influenced by the covariate and is stable in time. The joint distribution of the time series is then determined by the conditional means, the conditional variances and the marginal distributions of the innovations, which we estimate nonparametrically, and the copula of the innovations, which represents the dependency structure. We consider a nonparametric and a semiparametric estimator based on the estimated residuals. We show that under suitable assumptions, these copula estimators are asymptotically equivalent to estimators that would be based on the unobserved innovations. The theoretical results are illustrated by simulations and a real data example.

Abstract: Publication date: May 2019. Source: Journal of Multivariate Analysis, Volume 171. Author(s): Mehrdad Naderi, Wen-Liang Hung, Tsung-I Lin, Ahad Jamalizadeh. This paper presents a new finite mixture model based on the multivariate normal mean–variance mixture of Birnbaum–Saunders (NMVBS) distribution. We develop a computationally analytical EM algorithm for model fitting. Due to the dependence of this algorithm on initial values and the number of mixing components, a learning-based EM algorithm and an extended variant are proposed. Numerical simulations show that the proposed algorithms allow for better clustering performance and classification accuracy than some competing approaches. The effectiveness and prominence of the proposed methodology are also shown through an application to an extrasolar planet dataset.

Abstract: Publication date: May 2019. Source: Journal of Multivariate Analysis, Volume 171. Author(s): Youming Liu, Cong Wu. This paper considers point-wise estimation of density functions under the local anisotropic Hölder condition by the wavelet method. A linear wavelet estimate is first introduced and shown to be optimal. A data-driven version is provided for adaptivity, and the influence of the dimension is reduced under the independence structure of the estimated density.

Abstract: Publication date: May 2019. Source: Journal of Multivariate Analysis, Volume 171. Author(s): Jakob Raymaekers, Peter Rousseeuw. The well-known spatial sign covariance matrix (SSCM) carries out a radial transform which moves all data points to a sphere, followed by computing the classical covariance matrix of the transformed data. Its popularity stems from its robustness to outliers, fast computation, and applications to correlation and principal component analysis. In this paper we study more general radial functions. It is shown that the eigenvectors of the generalized SSCM are still consistent and the ranks of the eigenvalues are preserved. The influence function of the resulting scatter matrix is derived, and it is shown that its asymptotic breakdown value is as high as that of the original SSCM. A simulation study indicates that the best results are obtained when the inner half of the data points are not transformed and points lying far away are moved to the center.
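The radial-transform-then-covariance recipe in this abstract is easy to make concrete. The sketch below is illustrative, not the paper's implementation: the coordinate-wise median center and the function names are assumptions, and the default radial function g(r) = 1 (every point moved to the unit sphere) recovers the classical SSCM, while other choices of `radial` give the generalized version:

```python
import numpy as np

def generalized_sscm(X, radial=None):
    # Sketch of a generalized SSCM: center the data, move each point to
    # radius g(r) while keeping its direction, then take the classical
    # covariance of the transformed points.
    if radial is None:
        radial = lambda r: np.ones_like(r)   # classical spatial sign
    Z = X - np.median(X, axis=0)             # assumed robust center
    r = np.linalg.norm(Z, axis=1)
    safe = np.where(r > 0, r, 1.0)           # guard against points at the center
    T = Z * (radial(safe) / safe)[:, None]   # new radius g(r), same direction
    return T.T @ T / len(X)
```

For instance, a radial function that leaves small radii unchanged and caps large ones mimics the paper's finding that not transforming the inner half of the data works well.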

Abstract: Publication date: May 2019. Source: Journal of Multivariate Analysis, Volume 171. Author(s): M.C. Jones, Éric Marchand. In this article, we develop a sum and share decomposition to model multivariate discrete distributions, and more specifically multivariate count data that can be divided into a number of distinct categories. From a Poisson mixture model for the sum and a multinomial mixture model for the shares, a rich ensemble of properties, examples and relationships arises. As a main example, a seemingly new multivariate model involving a negative binomial sum and Pólya shares is considered, previously seen only in the bivariate case, for which we present two contrasting applications. For other choices of the distribution of the sum, natural but novel discrete multivariate Liouville distributions emerge; an important special case of these is that of Schur constant distributions. Analogies and interactions with related continuous distributions are to the fore throughout.
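The sum-and-share construction, with the abstract's main example of a negative binomial sum and Pólya shares, can be simulated directly. This is a minimal sketch under assumed parameter names (r, p for the negative binomial, alpha for the Dirichlet mixing that yields Pólya/Dirichlet-multinomial shares), not the authors' code:

```python
import numpy as np

def sum_and_share_sample(rng, n, r, p, alpha):
    # Draw each total from a negative binomial "sum" distribution, then
    # split it across categories with Dirichlet-multinomial (Polya) shares.
    alpha = np.asarray(alpha, dtype=float)
    totals = rng.negative_binomial(r, p, size=n)
    out = np.empty((n, alpha.size), dtype=np.int64)
    for i, t in enumerate(totals):
        probs = rng.dirichlet(alpha)     # mixing step behind the Polya shares
        out[i] = rng.multinomial(t, probs)
    return out
```

Swapping the negative binomial for another mixed-Poisson sum distribution gives the other members of the family the abstract mentions.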

Abstract: Publication date: May 2019. Source: Journal of Multivariate Analysis, Volume 171. Author(s): Gwangsu Kim, Taeryon Choi. We study the asymptotic properties of nonparametric Bayesian structural equation models (SEMs). Under mild conditions, when adjusting nonparametric error distributions, the posteriors of Bayesian SEMs achieve the optimal convergence rate up to log n terms in the nonparametric means and nonlinear relationships of the latent variables. Furthermore, we consider quantile regressions of the error and latent variables in Bayesian SEMs, and we show posterior consistency in Bayesian quantile regression. The theoretical results are validated using simulation studies.

Abstract: Publication date: May 2019. Source: Journal of Multivariate Analysis, Volume 171. Author(s): Yoonseok Lee, Debasri Mukherjee, Aman Ullah. This paper considers multivariate local linear least squares estimation of panel data models when fixed effects are present. One-step estimation of the local marginal effect is of prime interest. A within-group nonparametric estimator is developed, where the fixed effects are eliminated by subtracting individual-specific locally weighted time averages, i.e., the local-within-transformation. It is shown that the local-within-transformation-based estimator satisfies the standard properties of the local linear estimator. In comparison, nonparametric estimators based on the conventional (global) within-transformation or first difference result in estimators which are biased, even in large samples. The new estimator is used to examine the nonlinear relationship between income and nitrogen-oxide level (i.e., the environmental Kuznets curve) based on US state-level panel data.
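The fixed-effect elimination step described here has a simple form for one individual's series. This sketch assumes a Gaussian kernel and illustrative argument names (the paper's estimator embeds this step inside a full local linear regression); the point it demonstrates is that subtracting the locally weighted time average cancels any additive fixed effect exactly:

```python
import numpy as np

def local_within(y, x, x0, h):
    # Local-within transformation for one individual: subtract the
    # kernel-weighted time average of y around the evaluation point x0.
    # An additive individual fixed effect cancels exactly.
    k = np.exp(-0.5 * ((x - x0) / h) ** 2)   # Gaussian kernel (assumption)
    return y - np.sum(k * y) / np.sum(k)
```

Shifting y by any constant leaves the transformed series unchanged, which is exactly the invariance that removes the fixed effect.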

Abstract: Publication date: May 2019. Source: Journal of Multivariate Analysis, Volume 171. Author(s): Shen Zhang, Peixin Zhao, Gaorong Li, Wangli Xu. In this paper, we propose a nonparametric independence screening method for sparse ultra-high dimensional generalized varying coefficient models with longitudinal data. Our method, called NIS-GEE, combines the ideas of sure independence screening (SIS) in sparse ultra-high dimensional generalized linear models and varying coefficient models with the marginal generalized estimating equation (GEE) method, taking into account both the marginal correlation between response and covariates and the within-subject correlation for variable screening. The corresponding iterative algorithm is introduced to enhance the performance of the proposed NIS-GEE method. Furthermore, it is shown that, under some regularity conditions, the proposed NIS-GEE method enjoys the sure screening properties. Simulation studies and a real data analysis are used to assess the performance of the proposed method.

Abstract: Publication date: May 2019. Source: Journal of Multivariate Analysis, Volume 171. Author(s): Tonglin Zhang, Jorge Mateu. This article introduces the concept of substationarity for spatial point processes (SPPs). Substationarity, a notion not previously studied in the literature, means that the distribution of an SPP can only be invariant under location shifts within a linear subspace of the domain. This notion lies theoretically between stationarity and nonstationarity. To formally propose the approach, the article provides the definition of substationarity and develops estimation of the first-order intensity function, including the linear subspace. As the subspace may be unknown, we recommend using a parametric method to estimate it and a nonparametric one to estimate the first-order intensity function given the linear subspace; the approach is thus semiparametric. The simulation study shows that the estimators of both the linear subspace and the first-order intensity function are reliable. In an application to a Canadian forest wildfire data set, the article concludes that substationarity of wildfire occurrences may be assumed along the longitude, indicating that latitude is a more important factor than longitude in Canadian forest wildfire studies.

Abstract: Publication date: May 2019. Source: Journal of Multivariate Analysis, Volume 171. Author(s): Ngoc M. Tran, Petra Burdejová, Maria Ospienko, Wolfgang K. Härdle. Principal component analysis (PCA) is a widely used dimension reduction tool in high-dimensional data analysis. In risk quantification in finance, climatology and many other applications, however, the interest lies in capturing the tail variations rather than variation around the mean. To this end, we develop Principal Expectile Analysis (PEC), which generalizes PCA for expectiles. It can be seen as a dimension reduction tool for extreme-value theory, where fluctuations in the τ-expectile level of the data are approximated by a low-dimensional subspace. We provide algorithms based on iterative least squares, derive bounds on their convergence time, and compare their performance through simulations. We apply the algorithms to a Chinese weather dataset and fMRI data from an investment decision study.
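To see the iterative-least-squares idea behind expectiles in its simplest, univariate form: the τ-expectile minimizes an asymmetrically weighted squared loss, and iterating the weighted-mean fixed point converges to it. This one-dimensional sketch is an illustration of the loss being generalized, not the paper's multivariate PEC algorithm:

```python
import numpy as np

def expectile(x, tau, n_iter=100):
    # tau-expectile via iteratively reweighted least squares:
    # observations above the current value get weight tau, those
    # below get 1 - tau, and the weighted mean is re-solved.
    e = np.mean(x)
    for _ in range(n_iter):
        w = np.where(x > e, tau, 1.0 - tau)
        e = np.sum(w * x) / np.sum(w)
    return e
```

At τ = 0.5 the weights are symmetric and the expectile reduces to the mean; PEC replaces this scalar fit with a low-dimensional subspace fit under the same asymmetric loss.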

Abstract: Publication date: Available online 19 March 2019. Source: Journal of Multivariate Analysis. Author(s): T. Nagler, C. Bumann, C. Czado. Vine copulas allow the construction of flexible dependence models for an arbitrary number of variables using only bivariate building blocks. The number of parameters in a vine copula model increases quadratically with the dimension, which poses challenges in high-dimensional applications. To alleviate the computational burden and risk of overfitting, we propose a modified Bayesian information criterion (BIC) tailored to sparse vine copula models. We argue that this criterion can consistently distinguish between the true and alternative models under less stringent conditions than the classical BIC. The criterion suggested here can further be used to select the hyper-parameters of sparse model classes, such as truncated and thresholded vine copulas. We present a computationally efficient implementation and illustrate the benefits of the proposed concepts with a case study where we model the dependence in a large portfolio.

Abstract: Publication date: Available online 18 March 2019. Source: Journal of Multivariate Analysis. Author(s): Dulal K. Bhaumik, Rachel K. Nordgren. The standard approach for prediction of multiple correlated outcome measures overpredicts the unknown observation in the linear model setup if associated covariate measures follow a certain distribution. A nonempty confidence region is desired when some covariate measures are missing and must be estimated. This article develops a methodology for prediction and proposes a shrinkage predictor with a smaller risk compared to the one based on the maximum likelihood estimate. It also provides an algorithm for constructing a nonempty confidence region for unknown covariates. The proposed methodology is shown to perform well in terms of maintaining a smaller risk in prediction and the coverage probability in calibration. Results are illustrated with a recent behavioral science dataset.

Abstract: Publication date: Available online 18 March 2019. Source: Journal of Multivariate Analysis. Author(s): Stefan Birr, Tobias Kley, Stanislav Volgushev. Finding parametric models that accurately describe the dependence structure of observed data is a central task in the analysis of time series. Classical frequency domain methods provide a popular set of tools for fitting and diagnostics of time series models, but their applicability is seriously impacted by the limitations of covariances as a measure of dependence. Motivated by recent developments of frequency domain methods that are based on copulas instead of covariances, we propose a novel graphical tool to assess the quality of time series models for describing dependencies that go beyond linearity. We provide a theoretical justification of our approach and show in simulations that it can successfully distinguish between subtle differences in time series dynamics, including non-linear dynamics which result from GARCH and EGARCH models. We also demonstrate the utility of the proposed tools through an application to modeling returns of the S&P 500 stock market index.

Abstract: Publication date: Available online 8 March 2019. Source: Journal of Multivariate Analysis. Author(s): Bouchra R. Nasri, Bruno N. Rémillard. In this paper, we propose an intuitive way to couple several dynamic time series models even when there are no innovations. This extends previous work for modeling dependence between innovations of stochastic volatility models. We consider time-dependent and time-independent copula models and we study the asymptotic behavior of some empirical processes constructed from pseudo-observations, as well as the behavior of maximum pseudo-likelihood estimators of the associated copula parameters. The results show that even if the univariate dynamic models depend on unknown parameters, the limiting behavior of many processes of interest does not depend on the estimation errors. One can perform tests for change points on the full distribution, the margins or the copula, as if the parameters of the dynamic models were known. This is also true for some parametric models of time-dependent copulas. This interesting property makes it possible to construct consistent tests of specification for the dependence models, without having to consider the dynamic time series models. Monte Carlo simulations are used to demonstrate the power of the proposed goodness-of-fit test in finite samples. An application to financial data is given.

Abstract: Publication date: Available online 27 February 2019. Source: Journal of Multivariate Analysis. Author(s): Pavel Krupskii, Harry Joe. We propose three methods for estimating the joint tail probabilities based on a d-variate copula with dimension d≥2. For the first two methods, we use two different tail expansions of the copula which are valid under mild regularity conditions. We estimate the coefficients of these expansions using the maximum likelihood approach with appropriate data beyond a threshold in the tail. For the third method, we propose a family of tail-weighted measures of multivariate dependence and use these measures to estimate the coefficients of the second tail expansion using regression. This expansion is then used to estimate the joint tail probabilities when the empirical probabilities cannot be used because of lack of data in the tail. The three proposed methods can also be used to estimate tail dependence coefficients of a multivariate copula. Simulation studies are used to indicate when the methods give more accurate estimates of the tail probabilities and tail dependence coefficients. We apply the proposed methods to analyze tail properties of a data set of financial returns.

Abstract: Publication date: Available online 26 February 2019. Source: Journal of Multivariate Analysis. Author(s): Marius Hofert, Wayne Oldford, Avinash Prasad, Mu Zhu. A framework for quantifying dependence between random vectors is introduced. Using the notion of a collapsing function, random vectors are summarized by single random variables, referred to as collapsed random variables. Measures of association computed from the collapsed random variables are then used to measure the dependence between random vectors. To this end, suitable collapsing functions are presented. Furthermore, the notions of a collapsed distribution function and collapsed copula are introduced and investigated for certain collapsing functions. This investigation yields a multivariate extension of the Kendall distribution and its corresponding Kendall copula, for which some properties and examples are provided. In addition, non-parametric estimators for the collapsed measures of association are provided along with their corresponding asymptotic properties. Finally, data applications to bioinformatics and finance are presented along with a general graphical assessment of independence between groups of random variables.
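The collapsing idea itself is compact enough to sketch. This toy version makes assumptions well beyond the abstract: the component-wise maximum as the collapsing function and Spearman's rho as the measure of association are illustrative choices, and tied values are assumed absent; the paper's framework studies suitable collapsing functions and estimators far more carefully:

```python
import numpy as np

def collapsed_association(X, Y, collapse=np.max):
    # Summarize each (n x d) sample of a random vector by one collapsed
    # variable per observation, then compute a rank correlation
    # (Spearman's rho) between the collapsed variables.
    cx = collapse(X, axis=1)
    cy = collapse(Y, axis=1)
    rx = np.argsort(np.argsort(cx))      # ranks (continuous data, no ties)
    ry = np.argsort(np.argsort(cy))
    return np.corrcoef(rx, ry)[0, 1]
```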

Abstract: Publication date: Available online 16 February 2019. Source: Journal of Multivariate Analysis. Author(s): Jeffrey Näf, Marc S. Paolella, Paweł Polak. A mean–variance heterogeneous tails mixture distribution is proposed for modeling financial asset returns. It captures, along with the obligatory leptokurtosis, different tail behavior among the assets. Its construction allows for joint maximum likelihood estimation of all model parameters via an expectation–maximization algorithm and thus is applicable in high dimensions. A useful and unique feature of the model is that the tail behavior of the individual assets is driven by asset-specific news effects. In the bivariate iid case, the model corresponds to the standard CAPM, but enriched with a filter for capturing the news impact associated with both the market and asset excess returns. An empirical application using a portfolio of highly tail-heterogeneous cryptocurrencies and realistic transaction costs shows superior out-of-sample portfolio performance compared to numerous competing models. A model extension to capture asset-specific asymmetry is also discussed.

Abstract: Publication date: Available online 13 February 2019. Source: Journal of Multivariate Analysis. Author(s): Elisa Perrone, Liam Solus, Caroline Uhler. The space of discrete copulas admits a representation as a convex polytope, and this has been exploited in entropy-copula methods used in hydrology and climatology. In this paper, we focus on the class of component-wise convex copulas, i.e., ultramodular copulas, which describe the joint behavior of stochastically decreasing random vectors. We show that the family of ultramodular discrete copulas and its generalization to component-wise convex discrete quasi-copulas also admit representations as polytopes. In doing so, we draw connections to the Birkhoff polytope, the alternating sign matrix polytope, and their generalizations, thereby unifying and extending results on these polytopes from both the statistics and the discrete geometry literature.

Abstract: Publication date: Available online 28 January 2019. Source: Journal of Multivariate Analysis. Author(s): Anna Castañer, M. Mercè Claramunt, Claude Lefèvre, Stéphane Loisel. In this paper, we introduce a new multivariate dependence model that generalizes the standard Schur-constant model. The difference is that the random vector considered is partially exchangeable, instead of exchangeable, whence the term partially Schur-constant. Its advantage is to allow some heterogeneity of marginal distributions and a more flexible dependence structure, which broadens the scope of potential applications. We first show that the associated joint survival function is a monotonic multivariate function. Next, we derive two distributional representations that provide an intuitive understanding of the underlying dependence. Several other properties are obtained, including correlations within and between subvectors. As an illustration, we explain how such a model could be applied to risk management for insurance networks.

Abstract: Publication date: Available online 28 January 2019. Source: Journal of Multivariate Analysis. Author(s): Janina Engel, Andrea Pagano, Matthias Scherer. A flexible probabilistic approach for constructing realistic topologies of interbank networks is presented. This constitutes a challenging task, since information on bilateral interbank activities is classified as confidential and the number of banks in most European countries is substantial. First, we analyze what information on European interbank liabilities is publicly available. Second, we present an approach for the reconstruction of network topologies satisfying known characteristics through an exponential random graph model (ERGM), which incorporates the available information as side conditions. Third, we conduct a case study calibrating the model to the Italian and the German interbank market. Samples of both models are then analyzed with respect to different network statistics. The relevance of the presented results stems from the urgent need for realistic instances of possible adjacency matrices as input in technical studies on the stability of interbank networks. Various such studies exist; however, most of them rely on toy models for the analyzed adjacency matrices.

Abstract: Publication date: Available online 27 November 2018. Source: Journal of Multivariate Analysis. Author(s): Marie-Pier Côté, Christian Genest. Many copula families, including the classes of Archimedean, elliptical and Liouville copulas, may be written as the survival copula of a random vector R×(Y1,Y2), where R is a strictly positive random variable independent of the random vector (Y1,Y2). A unified framework is presented for studying the dependence structure underlying this stochastic representation, which is called the background risk model. Formulas for the copula, Kendall’s tau and tail dependence coefficients are obtained and special cases are detailed. The usefulness of the construction for model building is illustrated with an extension of Archimedean copulas with completely monotone generators, based on the Farlie–Gumbel–Morgenstern copula. In particular, explicit expressions for the distribution and the Tail-Value-at-Risk of the aggregated risk RY1+RY2 are available in a generalization of the widely used multivariate Pareto-II model.