Abstract: This article proposes a Bayesian approach to estimating the spectral density of a stationary time series using a prior based on a mixture of P-spline distributions. Our proposal is motivated by the B-spline Dirichlet process prior of Edwards et al. (Stat Comput 29(1):67–78, 2019. https://doi.org/10.1007/s11222-017-9796-9) in combination with Whittle’s likelihood and aims at reducing the high computational complexity of its posterior computations. The strength of the B-spline Dirichlet process prior over the Bernstein–Dirichlet process prior of Choudhuri et al. (J Am Stat Assoc 99(468):1050–1059, 2004. https://doi.org/10.1198/016214504000000557) lies in its ability to estimate spectral densities with sharp peaks and abrupt changes, owing to the flexibility of B-splines with a variable number and location of knots. Here, we suggest using the P-splines of Eilers and Marx (Stat Sci 11(2):89–121, 1996. https://doi.org/10.1214/ss/1038425655), which combine a B-spline basis with a discrete penalty on the basis coefficients. In addition to equidistant knots, a novel strategy for a more expedient placement of knots is proposed that makes use of the information provided by the periodogram about the steepness of the spectral power distribution. We demonstrate in a simulation study and two real case studies that this approach retains the flexibility of the B-splines and, thanks to the new data-driven knot allocation scheme, achieves a similar ability to accurately estimate peaks while significantly reducing the computational costs. PubDate: 2021-01-20
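A minimal sketch of the Whittle approximation underpinning the approach above: the log-likelihood of a spectral density f evaluated at the Fourier frequencies is approximated by \(-\sum_j [\log f(\lambda_j) + I(\lambda_j)/f(\lambda_j)]\), with I the periodogram. Here f is modelled as the exponential of a B-spline expansion; the basis dimension, equidistant knots, and the commented P-spline roughness penalty are illustrative assumptions, and the Bayesian mixture prior and MCMC sampler of the paper are not reproduced.

```r
## Whittle log-likelihood for f(lambda) = exp(B(lambda) %*% beta),
## with B a B-spline basis evaluated at the Fourier frequencies.
library(splines)

whittle_loglik <- function(x, beta) {
  n <- length(x)
  m <- floor((n - 1) / 2)
  lambda <- 2 * pi * (1:m) / n                    # Fourier frequencies in (0, pi)
  I <- abs(fft(x)[2:(m + 1)])^2 / (2 * pi * n)    # periodogram ordinates
  B <- bs(lambda, df = length(beta), intercept = TRUE)
  f <- as.vector(exp(B %*% beta))                 # positive spectral density
  -sum(log(f) + I / f)
  ## A P-spline version would subtract a roughness penalty such as
  ## tau * sum(diff(beta, differences = 2)^2) in the log-posterior.
}

set.seed(1)
x <- arima.sim(list(ar = 0.7), n = 512)
whittle_loglik(x, beta = rep(0, 10))

## A maximum-Whittle-likelihood fit of the basis coefficients:
opt <- optim(rep(0, 10), function(b) -whittle_loglik(x, b), method = "BFGS")
```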
Abstract: The popularity of the classical general linear model (CGLM) is attributable mostly to its ease of fitting and validation; however, the CGLM is inappropriate for correlated observations. In this paper we explore linear models for correlated observations with an exchangeable structure (Arnold in J Am Stat Assoc 74:194–199, 1979). For the case of \(N>1\) repeated measures observations having site-dependent or site-independent covariates, the maximum likelihood estimates (MLEs) of the model’s parameters are derived, likelihood ratio tests are obtained for relevant model building hypotheses, and some Monte Carlo simulation studies are performed to illuminate important aspects of the models and their tests of hypotheses. For the case of site-independent covariates, closed-form solutions exist for the MLEs and exact tests can be constructed for the model building hypotheses. Simulations revealed that these exact tests remain robust in the presence of moderate skewness or outliers. However, these fortuitous closed-form solutions vanish for the case of site-dependent covariates. To ameliorate this deficiency, some Monte Carlo simulations are performed to estimate the bias of these MLEs, the probability of a multimodal likelihood, and the suitability of the limiting chi-squared approximation for the model building hypotheses. These simulations reveal that the estimated biases of the slope parameters are negligible for sample size combinations (n, N) as small as (2, 6). Likewise, this sample size combination resulted in an estimated probability of a multimodal likelihood of only about 1%, which decreased drastically as either n or N increased. Moreover, the limiting \(\chi ^2\) distributional assumption appears to hold reasonably well for a sample size of \(N=100\), regardless of the value of n. Finally, we provide examples of fitting our model and conducting tests of hypotheses using two medical datasets. PubDate: 2021-01-20
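The exchangeable structure referred to above has a simple closed form worth seeing concretely: for n sites, the covariance matrix is \(\Sigma = \sigma^2[(1-\rho)I_n + \rho J_n]\), with eigenvalues \(\sigma^2(1+(n-1)\rho)\) (once) and \(\sigma^2(1-\rho)\) (with multiplicity n-1). The sketch below constructs this matrix and draws exchangeable observation vectors; it illustrates only the model's covariance assumption, not the authors' estimation code.

```r
## Exchangeable covariance: Sigma = sigma2 * ((1 - rho) * I_n + rho * J_n).
exch_sigma <- function(n, sigma2, rho)
  sigma2 * ((1 - rho) * diag(n) + rho * matrix(1, n, n))

S <- exch_sigma(4, sigma2 = 2, rho = 0.3)
eigen(S)$values            # 2 * (1 + 3 * 0.3) = 3.8, then 2 * 0.7 = 1.4 (x3)

set.seed(1)
y <- MASS::mvrnorm(200, mu = rep(0, 4), Sigma = S)  # exchangeable rows
round(cov(y), 2)           # sample covariance mimics the exchangeable pattern
```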
Abstract: When a survey study concerns sensitive issues such as political orientation, sexual orientation, or income, respondents may not be willing to reply truthfully, which leads to biased results. To protect the respondents’ privacy and improve their willingness to provide true answers, Warner (J Am Stat Assoc 60:63–69, 1965) proposed the randomized response (RR) technique, in which respondents select a question by means of a random device in order to maintain privacy. Huang (Stat Neerl 58:75–82, 2004) extended the RR design of Warner (1965) to propose a two-stage RR design. Not only can this method estimate the population proportion of persons with a sensitive characteristic, but it can also estimate the honest answer rate in the first stage. This work develops a covariate extension of the two-stage RR design of Huang (2004) by applying logistic regression to investigate the effects of covariates on a sensitive characteristic and an honest response. Simulation experiments are conducted to study the finite-sample performance of the maximum likelihood estimators of the logistic regression parameters. The proposed methodology is applied to analyze survey data on the sexuality of freshmen at Feng Chia University in Taiwan in 2016. PubDate: 2021-01-18
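A useful reference point for the two-stage design above is Warner's original RR estimator, sketched below: each respondent answers the sensitive question with probability p and its complement otherwise, so \(P(\text{yes}) = p\pi + (1-p)(1-\pi)\) and the MLE is \(\hat{\pi} = (\hat{\lambda} + p - 1)/(2p - 1)\). The second stage and the paper's logistic-regression covariate extension are not reproduced.

```r
## Warner (1965) randomized response: MLE of the sensitive proportion pi.
warner_pi_hat <- function(yes, n, p) {
  lam <- yes / n                       # observed proportion of "yes" answers
  (lam + p - 1) / (2 * p - 1)
}

set.seed(1)
pi_true <- 0.2; p <- 0.7; n <- 1000
answer_sensitive <- runif(n) < p       # random device selects the question
has_trait <- runif(n) < pi_true
yes <- sum(ifelse(answer_sensitive, has_trait, !has_trait))
warner_pi_hat(yes, n, p)               # close to 0.2
```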
Abstract: In regression models with a grouping structure among the explanatory variables, variable selection at both the group level and the individual-variable level within groups is important to improve model accuracy and interpretability. In this article, we propose a hierarchical bi-level variable selection approach for censored survival data in the linear part of a partially linear additive hazards model where the covariates are naturally grouped. The proposed method is capable of conducting simultaneous group selection and individual variable selection within the selected groups. Computational algorithms are developed, and the asymptotic rates and selection consistency of the proposed estimators are established. Simulation results indicate that our proposed method outperforms several existing penalties, such as the LASSO, SCAD, and adaptive LASSO. Application of the proposed method is illustrated with the Mayo Clinic primary biliary cirrhosis (PBC) data. PubDate: 2021-01-15
Abstract: We consider a computationally simple orthogonal projection method to implement the information criterion of Bai and Ng (Econometrica 70:191–221, 2002) for selecting the factor dimension in panel interactive effects models, which bypasses issues arising from the joint estimation of the slope coefficients and the factor structure. Our simulations show that it performs well in cases where the method can be implemented. PubDate: 2021-01-15
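As background, the criterion being implemented selects the factor number k by penalizing the average squared residual V(k) after extracting k principal components; one common variant is \(IC_{p2}(k) = \log V(k) + k\,\frac{N+T}{NT}\log\min(N,T)\). The sketch below applies it to a pure factor model; the ICp2 form is quoted from memory of Bai and Ng (2002), and the paper's orthogonal projection step that handles the slope coefficients is omitted.

```r
## Bai-Ng ICp2 on a T x N panel: choose k minimizing the criterion.
bai_ng_icp2 <- function(X, kmax = 8) {
  Tn <- nrow(X); Nn <- ncol(X)
  Xc <- scale(X, scale = FALSE)
  sv <- svd(Xc)
  ic <- sapply(1:kmax, function(k) {
    fit <- sv$u[, 1:k, drop = FALSE] %*%
           (sv$d[1:k] * t(sv$v[, 1:k, drop = FALSE]))  # rank-k approximation
    V <- sum((Xc - fit)^2) / (Nn * Tn)                 # avg squared residual
    log(V) + k * ((Nn + Tn) / (Nn * Tn)) * log(min(Nn, Tn))
  })
  which.min(ic)
}

set.seed(1)
Fm <- matrix(rnorm(200 * 3), 200, 3)   # T = 200, r = 3 factors
Lm <- matrix(rnorm(50 * 3), 50, 3)     # N = 50 loadings
X  <- Fm %*% t(Lm) + matrix(rnorm(200 * 50), 200, 50)
bai_ng_icp2(X)                         # typically recovers k = 3
```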
Abstract: We propose an importance sampling algorithm whose proposal distribution is obtained from a variational approximation. This method combines the strengths of both importance sampling and variational methods. On one hand, it avoids the bias of the variational method; on the other hand, the variational approximation provides a way to design the proposal distribution for the importance sampling algorithm. Theoretical justification of the proposed method is provided. Numerical results show that using the variational approximation as the proposal can improve the performance of importance sampling and sequential importance sampling. PubDate: 2021-01-13
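A minimal sketch of the idea: fit (or assume) a Gaussian variational approximation to the target, then use it as the proposal in self-normalized importance sampling, which removes the variational bias at the cost of weighting. The Gamma target and the Gaussian proposal parameters are illustrative assumptions; the paper's variational fitting procedure is not reproduced.

```r
## Self-normalized importance sampling with a Gaussian proposal that stands
## in for a variational approximation to the target.
set.seed(1)
log_target <- function(x) dgamma(x, shape = 3, rate = 2, log = TRUE)  # toy target

mu_q <- 1.5; sd_q <- 0.9               # assumed variational fit of the target
x  <- rnorm(1e5, mu_q, sd_q)
lw <- log_target(x) - dnorm(x, mu_q, sd_q, log = TRUE)
lw[!is.finite(lw)] <- -Inf             # proposals outside the target's support
w  <- exp(lw - max(lw)); w <- w / sum(w)   # self-normalized weights
sum(w * x)                             # IS estimate of E[X]; true value is 1.5
```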
Abstract: This study considers the problem of multiple change-point detection. For this problem, we develop an objective Bayesian multiple change-point detection procedure in a normal model with heterogeneous variances. Our Bayesian procedure is based on a combination of binary segmentation and the idea of the screening and ranking algorithm (Niu and Zhang in Ann Appl Stat 6:1306–1326, 2012). Using the screening and ranking algorithm, we can overcome the drawbacks of binary segmentation, namely its inability to detect a small segment of structural change in the middle of a large segment, or segments of structural change with small jump magnitudes. To address this problem, we propose a detection procedure based on a Bayesian model selection framework in which no subjective input is required. We construct intrinsic priors for which the Bayes factors and model selection probabilities are well defined. We find that for large sample sizes, our method based on Bayes factors with intrinsic priors is consistent. Moreover, we compare the behavior of the proposed multiple change-point detection procedure with existing methods through a simulation study and two real data examples. PubDate: 2021-01-12
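For contrast with the Bayesian procedure above, the sketch below shows plain binary segmentation with a CUSUM statistic for mean changes, the classical device whose weaknesses (short middle segments, small jumps) motivate the screening and ranking step. The threshold, minimum segment length, and mean-change setting are illustrative choices; the Bayes factors and intrinsic priors of the paper are not implemented.

```r
## Generic binary segmentation: split at the maximal CUSUM point, recurse.
binseg <- function(x, lo = 1, hi = length(x), thresh = 4, min_len = 5) {
  n <- hi - lo + 1
  if (n < 2 * min_len) return(integer(0))
  y <- x[lo:hi]
  k <- min_len:(n - min_len)
  cusum <- sapply(k, function(j)
    sqrt(j * (n - j) / n) * abs(mean(y[1:j]) - mean(y[(j + 1):n])))
  if (max(cusum) / sd(y) < thresh) return(integer(0))
  cp <- lo + k[which.max(cusum)] - 1
  sort(c(cp, binseg(x, lo, cp, thresh, min_len),
             binseg(x, cp + 1, hi, thresh, min_len)))
}

set.seed(1)
x <- c(rnorm(100, 0), rnorm(100, 2), rnorm(100, 0))
binseg(x)   # should find change points near 100 and 200
```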
Abstract: We consider a variety of tests for goodness-of-fit in parametric Cox proportional hazards (PH) and accelerated failure time (AFT) models in the presence of Type-II right censoring. The testing procedures considered can be divided into two categories: an approach that transforms the data to a complete sample, and an approach using test statistics that can directly accommodate Type-II right censoring. The powers of the proposed tests are compared through a Monte Carlo study for various scenarios. It is found that both approaches are useful for testing exponentiality if the censoring proportion in a data set is lower than 30%, but it is recommended to use the approach that first transforms the sample to a complete sample when one encounters higher censoring proportions. PubDate: 2021-01-08
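The first category above (transforming to a complete sample) has a classical instance for the exponential case, sketched below: under exponentiality, the normalized spacings \(D_i = (n-i+1)(X_{(i)} - X_{(i-1)})\) of a Type-II censored ordered sample are i.i.d. exponential, yielding a complete sample of size r to which any complete-sample test can be applied. The correlation check at the end is only a quick illustrative diagnostic, not one of the paper's test statistics.

```r
## Transform a Type-II censored exponential sample to a complete one
## via normalized spacings.
set.seed(1)
n <- 50; r <- 35                            # 30% censoring
x <- sort(rexp(n, rate = 1))[1:r]           # Type-II censored ordered sample
d <- (n - seq_len(r) + 1) * diff(c(0, x))   # normalized spacings, length r

## Quick diagnostic: exponential probability-plot correlation on d.
u <- -log(1 - (seq_len(r) - 0.5) / r)       # exponential plotting positions
cor(sort(d), u)                             # near 1 under exponentiality
```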
Abstract: A partial least squares regression is proposed for estimating the function-on-function regression model in which a functional response and multiple functional predictors consist of random curves with quadratic and interaction effects. The direct estimation of a function-on-function regression model is usually an ill-posed problem. To overcome this difficulty, in practice, the functional data, which belong to an infinite-dimensional space, are generally projected into a finite-dimensional space of basis functions, and the function-on-function regression model is converted to a multivariate regression model for the basis expansion coefficients. In the estimation phase of the proposed method, the functional variables are approximated by a finite-dimensional basis function expansion. We show that the partial least squares regression constructed via a functional response, multiple functional predictors, and quadratic/interaction terms of the functional predictors is equivalent to the partial least squares regression constructed using basis expansions of the functional variables. From the partial least squares regression of the basis expansions of the functional variables, we provide an explicit formula for the partial least squares estimate of the coefficient function of the function-on-function regression model. Because the true forms of the models are generally unspecified, we propose a forward procedure for model selection. The finite-sample performance of the proposed method is examined using several Monte Carlo experiments and two empirical data analyses, and the results compare favorably with an existing method. PubDate: 2021-01-06
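The basis-expansion step described above can be made concrete with a small sketch: curves observed on a grid are reduced to B-spline coefficients by least squares, after which the functional regression becomes an ordinary multivariate one. For brevity the sketch uses a scalar response and plain least squares as a stand-in for the PLS step; the quadratic/interaction terms and the forward selection procedure are omitted.

```r
## Reduce observed curves to B-spline coefficients, then regress on them.
library(splines)
set.seed(1)
grid <- seq(0, 1, length.out = 101)
B <- bs(grid, df = 7, intercept = TRUE)            # common basis, 7 functions
n <- 60
Xc <- matrix(rnorm(n * 7), n, 7)                   # true curve coefficients
X  <- Xc %*% t(B) + matrix(rnorm(n * 101, sd = 0.1), n, 101)  # observed curves
Xc_hat <- X %*% B %*% solve(crossprod(B))          # LS recovery of coefficients
y <- Xc[, 1] - 2 * Xc[, 3] + rnorm(n, sd = 0.2)    # scalar response, for brevity
coef(lm(y ~ Xc_hat))                               # multivariate regression
```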
Abstract: The presence of outliers or contamination in process control data adversely affects the construction of quality control limits. Therefore, more attention should be paid to robust methods that describe the majority of the data. The main focus of this study is to construct robust R charts by using ranked set sampling (RSS) and median ranked set sampling (MRSS) designs under contaminated skewed distributions such as the Marshall–Olkin bivariate Weibull and bivariate lognormal distributions. Three robust methods are introduced: robust modified Shewhart, robust modified weighted variance, and robust modified skewness correction. An extensive simulation study is presented to analyse the effect of the different sampling designs and robust modified R charts on chart performance. These methods are evaluated in terms of their Type I risks (p). The simulation study shows that, under contamination, the Shewhart, weighted variance, and skewness correction charts based on classical estimators are affected by outliers under simple random sampling and RSS designs, which inflates the p values in the process, whereas charts under the MRSS design are not affected by outliers. Under contamination, the proposed robust modified skewness correction method under MRSS performs best in all cases. Moreover, a real-data application is conducted to demonstrate the superiority of the proposed methods. PubDate: 2021-01-05
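The two sampling designs named above are easy to state concretely: in one RSS cycle of set size m, m independent sets of m units are drawn, each set is ranked, and the i-th smallest unit of the i-th set is measured; MRSS instead measures the median of every set. The sketch below assumes perfect ranking and an odd set size, and does not implement the control charts themselves.

```r
## One cycle of RSS and MRSS of set size m from a distribution rdist.
rss_cycle <- function(m, rdist = rnorm) {
  sets <- matrix(rdist(m * m), nrow = m)        # m sets of m units (rows)
  sapply(1:m, function(i) sort(sets[i, ])[i])   # i-th order stat of set i
}
mrss_cycle <- function(m, rdist = rnorm) {
  sets <- matrix(rdist(m * m), nrow = m)
  apply(sets, 1, function(s) sort(s)[ceiling(m / 2)])  # median of each set
}

set.seed(1)
rss_cycle(5)
mrss_cycle(5)
```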
Abstract: We derive an explicit representation for the transition law of a p-tempered \(\alpha\)-stable process of Ornstein–Uhlenbeck type and use it to develop a methodology for simulation. Our results apply in both the univariate and multivariate cases. Special attention is given to the case where \(p\le \alpha\), which is more complicated and requires developing the new class of so-called incomplete gamma distributions. PubDate: 2021-01-04
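The simulation methodology rests on sampling transitions directly from an explicit transition law. The sketch below shows that strategy in its simplest special case, the Gaussian OU process, where the transition law is normal with known mean and variance; the paper's contribution is the analogous (much harder) representation for the tempered stable case, which is not reproduced here.

```r
## Exact transition-law simulation of a Gaussian OU process:
## X_{t+h} | X_t = x ~ N(x * exp(-l*h), s2 * (1 - exp(-2*l*h)) / (2*l)).
ou_path <- function(x0, l, s2, h, steps) {
  x <- numeric(steps + 1); x[1] <- x0
  m <- exp(-l * h)
  v <- s2 * (1 - exp(-2 * l * h)) / (2 * l)
  for (t in 1:steps) x[t + 1] <- m * x[t] + rnorm(1, 0, sqrt(v))
  x
}

set.seed(1)
plot(ou_path(3, l = 1, s2 = 1, h = 0.05, steps = 400), type = "l",
     ylab = "X_t", main = "Exact OU path (Gaussian special case)")
```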
Abstract: Several researchers have addressed the problem of testing the homogeneity of the scale parameters of several independent inverse Gaussian distributions based on the likelihood ratio test. However, only approximations of the distribution function of the test statistic are available in the literature. In this note, we present, in closed form, the exact distribution of the likelihood ratio test statistic for testing the equality of the scale parameters of several independent inverse Gaussian populations. To this end, we apply the inverse Mellin transform and the Jacobi polynomial expansion to the moments of the likelihood ratio test statistic. We also propose an approximate method based on the Jacobi polynomial expansion. Finally, we apply an accurate numerical method, based on inversion of the characteristic function, to obtain a near-exact approximation of the distribution of the likelihood ratio test statistic. The proposed methods are illustrated via numerical and real data examples. PubDate: 2021-01-03
Abstract: The skip-lot sampling plan (SkSP) is employed in supply chains to decrease the amount of inspection required for submitted lots once a succession of lots of excellent quality has been demonstrated. As only a fraction of the lots is examined, the cost of inspection is reduced. With the current abundance of high-yield products, however, most SkSP schemes have been used for attributes testing, which does not fully realize the SkSP’s economic advantages. Thus, on the basis of the process capability index Cpk, a variables SkSP with single sampling as the reference plan (Cpk-SkSP-2) was developed. To manage the lot’s quality with risks tolerable and agreeable to both the supplier and the buyer, the Cpk-SkSP-2 uses acceptance probabilities computed from the exact sampling distribution of the Cpk estimator at the specified quality standards, rather than from asymptotic approximations. Furthermore, the equilibrium probability of acceptance for the Cpk-SkSP-2 was derived via a Markov chain technique. These treatments enable minimization of the average number of samples required, rendering more reliable and optimal plan parameters for the inspection of products with a low fraction of defectives. The results are compared with variables Cpk-based single sampling plans. Finally, a graphical user interface was built on the basis of the proposed Cpk-SkSP-2 procedures and methodologies to facilitate data input, plan selection, criteria computation, and decision-making in practice. PubDate: 2021-01-03
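For orientation, the long-run (Markov-chain equilibrium) acceptance probability of the classical SkSP-2 scheme with skipping fraction f and clearance number i, given a reference plan that accepts a lot with probability P, is commonly written as \(P_a = [fP + (1-f)P^i]/[f + (1-f)P^i]\). This formula is quoted from the standard skip-lot literature rather than from the paper, and the exact Cpk-based computation of P is not reproduced.

```r
## Equilibrium acceptance probability of SkSP-2 (standard form).
sksp2_pa <- function(P, f, i) (f * P + (1 - f) * P^i) / (f + (1 - f) * P^i)

P <- seq(0.5, 1, by = 0.05)             # reference-plan acceptance probabilities
round(sksp2_pa(P, f = 0.5, i = 4), 3)   # OC values for f = 0.5, i = 4
```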
Abstract: Exponential regression models with censored data are widely used in practice. In the modeling process, situations arise where the covariates are not directly observed but are observed only after being contaminated, in a multiplicative manner, by unknown functions of an observable confounder. Outlier detection is a fundamental and important problem in applied statistics. In this paper, we use a nonparametric regression method to adjust the covariates and recast outlier detection as a high-dimensional regularized regression problem in the covariate-adjusted exponential regression model with censored data. We propose a smoothly clipped absolute deviation (SCAD) penalized likelihood method to detect possible outliers, a feature of which is that outlier detection and estimation of the regression coefficients are handled simultaneously. The coordinate descent algorithm is employed to facilitate computation. Simulation studies are conducted to evaluate the finite-sample performance of our proposed method. An application to a German breast cancer study demonstrates the utility of the proposed method in practice. PubDate: 2021-01-02
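The SCAD penalty at the heart of the method above has a closed form due to Fan and Li (2001), reproduced in the sketch below; a common device for casting outlier detection as regularized regression, which this abstract appears to use, is to give each observation its own mean-shift parameter and penalize those parameters. Only the penalty function itself is implemented here.

```r
## SCAD penalty of Fan and Li (2001); a = 3.7 is the conventional choice.
scad <- function(theta, lambda, a = 3.7) {
  t <- abs(theta)
  ifelse(t <= lambda,
         lambda * t,                                        # lasso-like near 0
         ifelse(t <= a * lambda,
                (2 * a * lambda * t - t^2 - lambda^2) / (2 * (a - 1)),
                (a + 1) * lambda^2 / 2))                    # constant in the tails
}

curve(scad(x, lambda = 1), -6, 6, ylab = "SCAD penalty")    # flat tails reduce bias
```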
Abstract: The Khmaladze martingale transformation provides an asymptotically distribution-free method for goodness-of-fit testing. Its use is not restricted to testing for normality; it can also be applied to location-scale families of distributions such as the logistic and Cauchy distributions. Despite its merits, however, the Khmaladze martingale transformation has not enjoyed the popularity it deserves because it is computationally expensive: it entails complex and time-consuming computations, including optimization, integration of a fractional function, and matrix inversion. To overcome these computational challenges, this paper proposes a fast algorithm for the Khmaladze martingale transformation method. To that end, the proposed algorithm is equipped with a novel strategy, named integration-in-advance, which rigorously exploits the structure of the Khmaladze martingale transformation. PubDate: 2020-12-01
Abstract: In this paper, a partially linear varying-coefficient model with measurement errors in the nonparametric component and missing response variables is studied. Two estimators for the parameter vector and the nonparametric function are proposed based on the locally corrected profile least squares method. The first estimator is constructed using the complete-case data only, and the other using an imputation technique. Both proposed estimators of the parametric component are shown to be asymptotically normal, and the estimators of the nonparametric function are proved to achieve the optimal strong convergence rate of the usual nonparametric regression. Simulation studies are conducted to compare the behavior of these estimators, and the results confirm that the estimator based on the imputation technique performs better than the complete-case estimator in finite samples. Finally, an application to a real data set is presented. PubDate: 2020-12-01
Abstract: In the big data setting, working data sets are often distributed across multiple machines, whereas classical statistical methods are typically developed for estimation and inference on a single machine. We employ a novel parallel quasi-likelihood method for generalized linear models that makes the variances of the different sub-estimators relatively similar. Estimates are obtained from projection subsets of the data and later combined using suitably chosen weights. We also show that the proposed method yields better asymptotic efficiency than the simple average. Furthermore, simulation examples show that the proposed method can significantly improve statistical inference. PubDate: 2020-12-01
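One concrete candidate for the "suitably chosen weights" mentioned above is inverse-variance weighting of the sub-estimators, sketched below on simulated sub-estimates with unequal subset sizes; the paper's projection-based construction and its quasi-likelihood weights may differ from this illustration.

```r
## Combine K sub-estimators by inverse-variance weights vs. the simple average.
set.seed(1)
beta_true <- 2
K <- 10
n_k <- sample(50:500, K)                     # unequal subset sizes
beta_hat <- rnorm(K, beta_true, 1 / sqrt(n_k))  # sub-estimates, var ~ 1/n_k
v_hat <- 1 / n_k                             # (estimated) sub-estimator variances
w <- (1 / v_hat) / sum(1 / v_hat)            # inverse-variance weights
c(weighted = sum(w * beta_hat), simple = mean(beta_hat))
```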
Abstract: Data containing many zeros are common in statistical applications, for example in survey data. A confidence interval based on the traditional normal approximation may have poor coverage probability, especially when the nonzero values are highly skewed and the sample size is small or only moderately large. The empirical likelihood (EL), a powerful nonparametric method, was proposed to construct confidence intervals in such a scenario. However, the traditional empirical likelihood suffers from an under-coverage problem: the coverage probability of EL-based confidence intervals falls below the nominal level, especially for small sample sizes. In this paper, we investigate the numerical performance of three modified versions of the EL: the adjusted empirical likelihood, the transformed empirical likelihood, and the transformed adjusted empirical likelihood, for data with various sample sizes and various proportions of zero values. The asymptotic distributions of the likelihood-type statistics are established to be the standard chi-squared distribution. Simulations are conducted to compare coverage probabilities with other existing methods under different distributions. A real data example illustrates the procedure of constructing confidence intervals. PubDate: 2020-12-01
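The basic (unadjusted) empirical likelihood ratio for a mean \(\mu\), on which all three modifications above build, can be sketched compactly: maximize \(\prod_i n w_i\) subject to \(\sum_i w_i = 1\) and \(\sum_i w_i(x_i - \mu) = 0\), which via a Lagrange multiplier gives \(-2\log R(\mu) = 2\sum_i \log(1 + \lambda(x_i - \mu))\). The adjustment and transformation steps of the paper are not implemented.

```r
## Unadjusted empirical log-likelihood ratio for a mean mu.
el_logratio <- function(x, mu) {
  z <- x - mu
  if (min(z) >= 0 || max(z) <= 0) return(Inf)   # mu outside the convex hull
  g  <- function(lam) mean(z / (1 + lam * z))   # estimating equation in lambda
  lo <- (1 / length(x) - 1) / max(z) + 1e-10    # keep all weights positive
  hi <- (1 / length(x) - 1) / min(z) - 1e-10
  lam <- uniroot(g, c(lo, hi))$root
  2 * sum(log(1 + lam * z))                     # ~ chi-squared(1) under H0
}

set.seed(1)
x <- ifelse(runif(200) < 0.4, 0, rexp(200, 1))  # 40% zeros, skewed nonzeros
el_logratio(x, mu = mean(x))                    # ~ 0 at the sample mean
el_logratio(x, mu = 0.9)
```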
Abstract: Word alignment has many applications in natural language processing (NLP) tasks. As far as we are aware, there is no word alignment package in the R environment. In this paper we introduce word.alignment, a new R package that implements a statistical word alignment model as an unsupervised learning method. It uses IBM Model 1 as the machine translation model, relying on the EM algorithm and a Viterbi search to find the best alignment. It also provides symmetric alignment using three heuristic methods: union, intersection, and grow-diag. In addition, it can automatically build a bilingual dictionary by applying an innovative rule; the generated dictionary is suitable for a number of NLP tasks. The package also provides functions for measuring the quality of a word alignment by comparing it with a gold standard alignment according to five metrics. It is easy to install and runs on the most widely used platforms. It is also easy to use, and we show that its results are almost always better than those of some other word alignment tools. Finally, some examples illustrating the use of word.alignment are provided. PubDate: 2020-12-01
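A toy sketch of the package's core ingredient, EM for IBM Model 1 translation probabilities, is given below on a tiny parallel corpus; the NULL word, the Viterbi search, symmetrization, and the dictionary rule are all omitted, and sentences are assumed to contain no repeated tokens.

```r
## EM for IBM Model 1 translation probabilities t(f | e).
em_ibm1 <- function(pairs, iters = 20) {
  e_voc <- unique(unlist(lapply(pairs, `[[`, "e")))
  f_voc <- unique(unlist(lapply(pairs, `[[`, "f")))
  t_ef <- matrix(1 / length(f_voc), length(e_voc), length(f_voc),
                 dimnames = list(e_voc, f_voc))   # uniform initialization
  for (it in 1:iters) {
    cnt <- t_ef * 0
    for (p in pairs) {
      for (f in p$f) {
        norm <- sum(t_ef[p$e, f])                 # E-step: posterior alignment
        cnt[p$e, f] <- cnt[p$e, f] + t_ef[p$e, f] / norm
      }
    }
    t_ef <- cnt / rowSums(cnt)                    # M-step: renormalize over f
  }
  t_ef
}

pairs <- list(list(e = c("the", "house"), f = c("das", "haus")),
              list(e = c("the", "book"),  f = c("das", "buch")),
              list(e = c("a", "book"),    f = c("ein", "buch")))
round(em_ibm1(pairs), 2)   # t("das" | "the") should dominate its row
```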
Abstract: We improve instability-based methods for the selection of the number of clusters k in cluster analysis by developing a corrected clustering distance that corrects for the unwanted influence of the distribution of cluster sizes on cluster instability. We show that our corrected instability measure outperforms current instability-based measures across the whole sequence of possible k, overcoming limitations of current instability-based methods for large k. We also compare, for the first time, model-based and model-free approaches to determining cluster instability and find their performance to be comparable. We make our method available in the R package cstab. PubDate: 2020-12-01
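A bare-bones, model-free version of the uncorrected instability criterion can be sketched as follows: cluster two bootstrap samples, and on the points they share, measure how often pairs of points are co-clustered in one solution but not the other. The correction for cluster sizes, which is the paper's contribution, is not implemented here; see the cstab package for that.

```r
## Uncorrected bootstrap cluster instability for a given k (model-free).
instability <- function(X, k, B = 20) {
  n <- nrow(X)
  mean(replicate(B, {
    i1 <- sample(n, replace = TRUE); i2 <- sample(n, replace = TRUE)
    c1 <- kmeans(X[i1, ], k, nstart = 5)$cluster
    c2 <- kmeans(X[i2, ], k, nstart = 5)$cluster
    shared <- intersect(i1, i2)                   # points in both samples
    l1 <- c1[match(shared, i1)]; l2 <- c2[match(shared, i2)]
    m <- outer(l1, l1, "==") != outer(l2, l2, "==")  # co-clustering disagreement
    mean(m[upper.tri(m)])
  }))
}

set.seed(1)
X <- rbind(matrix(rnorm(100, 0), , 2), matrix(rnorm(100, 4), , 2))
sapply(2:5, function(k) instability(X, k))   # expect the minimum at k = 2
```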