Abstract: A flexible semiparametric class of models is introduced that offers an alternative to classical regression models for count data, such as the Poisson and negative binomial models, as well as to more general models that account for excess zeros but are likewise based on fixed distributional assumptions. The model lets the data themselves determine the distribution of the response variable but, in its basic form, uses a parametric term that specifies the effect of explanatory variables. In addition, an extended version is considered in which the effects of covariates are specified nonparametrically. The proposed model and traditional models are compared in simulations and in several real data applications from the health and social sciences. PubDate: 2021-03-01

Abstract: We introduce a new class of robust M-estimators for performing simultaneous parameter estimation and variable selection in high-dimensional regression models. We first explain the motivation for the key ingredient of our procedures, which is inspired by regularization methods used in wavelet thresholding in noisy signal processing. The derived penalized estimation procedures are shown to enjoy the oracle property both in the classical finite-dimensional case and in the high-dimensional case, where the number of variables p is not fixed but can grow with the sample size n, and to achieve optimal asymptotic rates of convergence. A fast accelerated proximal gradient algorithm of coordinate descent type is proposed and implemented for computing the estimates; it proves surprisingly efficient in solving the corresponding regularization problems, including the ultra high-dimensional case where \(p \gg n\). Finally, an extensive simulation study and several real data analyses compare the proposed procedures with recent existing M-estimation procedures and demonstrate their utility and advantages. PubDate: 2021-03-01
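
The proximal step of the penalty is the computational core of such procedures. As a hedged illustration (not the authors' algorithm: the penalty here is assumed to be a plain lasso-type L1 term, whereas the paper's thresholding-inspired penalty may differ), a soft-thresholding operator and one proximal gradient update can be sketched as:

```python
def soft_threshold(z, t):
    """Proximal operator of t*|.|: shrink z toward zero by t."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0

def prox_grad_step(beta, grad, step, lam):
    """One proximal gradient update for an L1-penalized objective:
    gradient descent on the smooth loss, then componentwise shrinkage."""
    return [soft_threshold(b - step * g, step * lam) for b, g in zip(beta, grad)]
```

An accelerated variant would add a FISTA-style momentum extrapolation between successive updates.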

Abstract: In linear time series analysis, the incorporation of the moving-average term in autoregressive models yields parsimony while retaining flexibility; in particular, the first-order autoregressive moving-average model, ARMA(1,1), is notable since it retains good approximating capability with just two parameters. In the same spirit, we assess empirically whether a similar result holds for threshold processes. First, we show that the first-order threshold autoregressive moving-average process, TARMA(1,1), exhibits complex, high-dimensional behaviour with parsimony, by comparing it with threshold autoregressive processes, TAR(p), with possibly large autoregressive order p. Second, we study the descriptive power of the TARMA(1,1) model with respect to the class of autoregressive models, seen as universal approximators: in several situations, the TARMA(1,1) model outperforms AR(p) models even when p is large. Lastly, we analyze two real-world data sets: the sunspot number and the male US unemployment rate time series. In both cases, we show that TARMA models provide a better fit than the best TAR models proposed in the literature. PubDate: 2021-03-01
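
To fix ideas, a two-regime TARMA(1,1) recursion can be simulated directly. The threshold r, delay of one, and regime coefficients below are illustrative choices for a sketch, not values estimated in the paper:

```python
import random

def simulate_tarma11(n, r=0.0, lower=(0.5, 0.3), upper=(-0.5, -0.3),
                     sigma=1.0, seed=1):
    """Simulate X_t = phi*X_{t-1} + e_t + theta*e_{t-1}, where the pair
    (phi, theta) switches between two regimes according to whether the
    previous observation lies below or above the threshold r."""
    rng = random.Random(seed)
    x, e_prev, out = 0.0, 0.0, []
    for _ in range(n):
        phi, theta = lower if x <= r else upper
        e = rng.gauss(0.0, sigma)
        x = phi * x + e + theta * e_prev
        e_prev = e
        out.append(x)
    return out
```

With both regimes stable (|phi| < 1), the simulated path stays bounded in distribution despite the regime switching.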

Abstract: The expectation–maximisation algorithm is employed to perform maximum likelihood estimation in a wide range of situations, including regression analysis based on clusterwise regression models. A disadvantage of using this algorithm is that it is unable to provide an assessment of the sample variability of the maximum likelihood estimator. This inability stems from the fact that the algorithm does not require deriving an analytical expression for the Hessian matrix, thus preventing a direct evaluation of the asymptotic covariance matrix of the estimator. A solution to this problem is developed for linear regression analysis through a multivariate Gaussian clusterwise regression model. Two estimators of the asymptotic covariance matrix of the maximum likelihood estimator are proposed. In practical applications their use makes it possible to avoid resorting to bootstrap techniques and general-purpose mathematical optimisers. The performance of these estimators is evaluated on small simulated and real datasets; the results illustrate their usefulness and effectiveness in practical applications. From a theoretical point of view, under suitable conditions, the proposed estimators are shown to be consistent. PubDate: 2021-03-01

Abstract: Several methods have been devised to mitigate the effects of outlier values on survey estimates. If outliers are a concern when estimating population quantities, it is even more necessary to pay attention to them in a small area estimation (SAE) context, where the sample size is usually very small and the estimation is often model based. In this paper we pursue two goals: the first is to review recent developments in outlier robust SAE. In particular, we focus on the use of partial bias corrections when outlier robust fitted values under a working model generate biased predictions from sample data containing representative outliers. We then propose an outlier robust bootstrap MSE estimator for M-quantile based small area predictors that relies on a bounded-block-bootstrap approach. We illustrate these methods through model based and design based simulations and on a survey data set that exhibits many of the outlier characteristics observed in business surveys. PubDate: 2021-03-01

Abstract: Multistage ranking models, including the popular Plackett–Luce distribution (PL), rely on the assumption that the ranking process is performed sequentially, by assigning the positions from the top to the bottom one (forward order). A recent contribution to the ranking literature relaxed this assumption with the addition of the discrete-valued reference order parameter, yielding the novel Extended Plackett–Luce model (EPL). Inference on the EPL and its generalization into a finite mixture framework was originally addressed from the frequentist perspective. In this work, we propose Bayesian estimation of the EPL in order to address more directly and efficiently the inference on the additional discrete-valued parameter and the assessment of its estimation uncertainty, possibly uncovering potential idiosyncratic drivers in the formation of preferences. We overcome initial difficulties in employing a standard Gibbs sampling strategy to approximate the posterior distribution of the EPL by combining the data augmentation procedure and the conjugacy of the Gamma prior distribution with a tuned joint Metropolis–Hastings step within Gibbs. The effectiveness and usefulness of the proposal are illustrated with applications to simulated and real datasets. PubDate: 2021-03-01
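
For reference, the standard Plackett–Luce probability of a complete ranking under the forward order (positions assigned top to bottom) is a product of sequential choice probabilities; the EPL generalizes this by letting the reference-order parameter permute the position-assignment sequence. A minimal sketch of the forward-order case:

```python
def pl_prob(ranking, support):
    """Probability of a full ranking (best-to-worst tuple of item labels)
    under the Plackett-Luce model with positive support parameters: at
    each stage, an item is chosen with probability proportional to its
    support among the items not yet ranked."""
    remaining = sum(support[i] for i in ranking)
    p = 1.0
    for item in ranking:
        p *= support[item] / remaining
        remaining -= support[item]
    return p
```

Summing pl_prob over all permutations of the item set returns 1, which is a handy correctness check.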

Abstract: This paper introduces a temporal bivariate area-level linear mixed model with independent time effects for estimating small area socioeconomic indicators. The model is fitted by using the residual maximum likelihood method. Empirical best linear unbiased predictors of these indicators are derived. An approximation to the matrix of mean squared errors (MSE) is given and four MSE estimators are proposed. The first MSE estimator is a plug-in version of the MSE approximation. The remaining MSE estimators rely on parametric bootstrap procedures. Three simulation experiments designed to analyze the behavior of the fitting algorithm, the predictors and the MSE estimators are carried out. An application to real data from the 2005 and 2006 Spanish living conditions survey illustrates the introduced statistical methodology. The target is the estimation of 2006 poverty proportions and gaps by province and sex. PubDate: 2021-03-01

Abstract: This paper deals with simultaneous prediction for time series models. In particular, it presents a simple procedure which gives well-calibrated simultaneous prediction intervals with coverage probability close to the target nominal value. Although the exact computation of the proposed intervals is usually not feasible, an approximation can be easily attained by means of a suitable bootstrap simulation procedure. This new predictive solution is much simpler to compute than those already proposed in the literature, which are based on asymptotic calculations. Applications of the bootstrap-calibrated procedure to AR, MA and ARCH models are presented. PubDate: 2021-03-01
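
A stripped-down version of the bootstrap idea for an AR(1) model (residual resampling to approximate the one-step predictive distribution; the calibration step, which tunes the nominal level until the bootstrap coverage matches the target, is omitted) might look like:

```python
import random

def ar1_bootstrap_pi(x, level=0.95, B=2000, seed=0):
    """One-step-ahead bootstrap prediction interval for an AR(1) model
    fitted by least squares. Residual resampling gives a discrete
    approximation to the predictive distribution; quantiles of the
    simulated forecasts form the interval."""
    rng = random.Random(seed)
    num = sum(a * b for a, b in zip(x[1:], x[:-1]))
    den = sum(a * a for a in x[:-1])
    phi = num / den
    resid = [a - phi * b for a, b in zip(x[1:], x[:-1])]
    sims = sorted(phi * x[-1] + rng.choice(resid) for _ in range(B))
    alpha = 1.0 - level
    lo = sims[int(B * alpha / 2)]
    hi = sims[int(B * (1 - alpha / 2)) - 1]
    return lo, hi
```

Calibration would wrap this in an outer loop that re-estimates coverage on bootstrap replicates of the series and adjusts alpha accordingly.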

Abstract: In this work we propose a new class of long-memory models with a time-varying fractional parameter. In particular, the dynamics of the long-memory coefficient, d, are specified through a stochastic recurrence equation driven by the score of the predictive likelihood, as suggested by Creal et al. (J Appl Econom 28:777–795, 2013) and Harvey (Dynamic models for volatility and heavy tails: with applications to financial and economic time series, Cambridge University Press, Cambridge, 2013). We demonstrate the validity of the proposed model through a Monte Carlo experiment and an application to two real time series. PubDate: 2021-03-01
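
The score-driven updating mechanism of Creal et al. can be illustrated on a case far simpler than the fractional parameter d: below, a time-varying Gaussian location parameter follows the recurrence f_{t+1} = omega + beta*f_t + alpha*s_t, with s_t the score of the predictive likelihood. This is a generic sketch of the mechanism only, not the authors' long-memory filter:

```python
def gas_filter(y, omega=0.0, alpha=0.1, beta=0.9, sigma=1.0):
    """Score-driven (GAS) recursion for a time-varying Gaussian mean:
    f_{t+1} = omega + beta*f_t + alpha*s_t, where the score of the
    Gaussian predictive likelihood is s_t = (y_t - f_t) / sigma^2."""
    f = omega / (1.0 - beta)  # start at the unconditional level
    path = []
    for obs in y:
        path.append(f)
        score = (obs - f) / sigma ** 2
        f = omega + beta * f + alpha * score
    return path
```

On a constant series the filtered parameter converges geometrically to the fixed point of the recursion, which shows the self-correcting role of the score term.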

Abstract: In this paper, we propose a model to describe the mutual interactions among the lifecycles of three substitute products acting simultaneously in a common market, thus competing for the same customers or cooperating to supply demand. To date, the literature only describes models for two competitors; therefore, the present work represents the first attempt at creating and implementing a model for three actors. The new model is applied to real data in the energy context, and its performance is compared to that of current models for two competitors. On the datasets examined, the new model shows a relevant improvement in forecasting performance, that is, in forecasting accuracy and in the width of the prediction confidence bands. PubDate: 2021-03-01

Abstract: This article is concerned with the Bayesian optimal design problem for multi-factor nonlinear models. In particular, the Bayesian \(\varPsi _q\)-optimality criterion proposed by Dette et al. (Stat Sinica 17:463–480, 2007) is considered. It is shown that product-type designs are optimal for additive multi-factor nonlinear models, with or without a constant term, when the proposed sufficient conditions are satisfied. Several application examples using exponential growth models with multiple variables illustrate optimal designs based on the considered Bayesian \(\varPsi _q\)-optimality criterion. PubDate: 2021-03-01

Abstract: The bivariate Fay–Herriot model is an area-level linear mixed model that can be used for estimating the domain means of two correlated target variables. Under this model, the dependent variables are direct estimators calculated from survey data and the auxiliary variables are true domain means obtained from external data sources. Administrative registers do not always provide good auxiliary variables, so statisticians sometimes take them from alternative surveys, and they are therefore measured with error. We introduce a variant of the bivariate Fay–Herriot model that takes into account the measurement error of the auxiliary variables and we give fitting algorithms to estimate the model parameters. Based on the new model, we introduce empirical best predictors of domain means and we propose a parametric bootstrap procedure for estimating the mean squared error. We finally give an application to estimate poverty proportions and gaps in the Spanish Living Condition Survey, with auxiliary information from the Spanish Labour Force Survey. PubDate: 2021-03-01
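
The predictor underlying such models is, at its core, a shrinkage combination of the direct estimator and a regression-synthetic part; the paper's bivariate, measurement-error version extends this idea. The basic univariate composite step, as a sketch:

```python
def fh_eblup(direct, synthetic, psi, sigma2_u):
    """Fay-Herriot composite predictor for one domain: shrink the direct
    estimator toward the regression-synthetic value with weight
    gamma = sigma2_u / (sigma2_u + psi), where psi is the sampling
    variance of the direct estimator and sigma2_u the model variance."""
    gamma = sigma2_u / (sigma2_u + psi)
    return gamma * direct + (1.0 - gamma) * synthetic
```

When the direct estimator is precise (psi near 0) the predictor stays close to it; for noisy domains it borrows strength from the synthetic part.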

Abstract: Experimental results are often interpreted through statistical tests, where the alternative hypothesis represents the theory to be evinced; if the experimental results lead to the rejection of the null hypothesis, the theory is supported by empirical evidence. In these cases, the reproducibility of this empirical evidence can be measured by the Reproducibility Probability (RP) of the test, which coincides with the probability of rejecting the null hypothesis. The terminology "Reproducibility" Probability stems from the fact that it is usually computed when an experiment provides a significant result, to evaluate the probability that a further identical and independent experiment confirms the statistical significance. In recent literature, some RP estimators have been proposed. They are useful for two reasons: they allow us to evaluate the reliability of the obtained statistical significance, and some estimates can be used as a test statistic, owing to the so-called "RP-testing" decision rule (reject the null hypothesis if and only if the RP estimate is greater than 1/2). Unfortunately, the usually adopted RP estimators are affected by a high mean squared error. In this paper, a new class of RP estimators is introduced and examined with the aim of improving estimation precision. Specifically, the performance of the new RP estimators was compared with that of the existing estimators, and an average reduction of about 30% in the mean squared error was observed. Moreover, the best-performing new estimator still allows the use of the RP-testing decision rule. Hence, this work achieves the double goal of improving Reproducibility Probability estimation while preserving RP-testing. PubDate: 2021-03-01
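
For a one-sided Z-test, the standard plug-in RP estimate (the rejection probability evaluated at the observed effect) makes the RP-testing equivalence concrete: the estimate exceeds 1/2 exactly when the observed statistic exceeds the critical value. This sketch shows the classical plug-in estimator, not the new class proposed in the paper:

```python
from statistics import NormalDist

def rp_estimate(z_obs, alpha=0.05):
    """Plug-in reproducibility-probability estimate for a one-sided
    Z-test: Phi(z_obs - z_{1-alpha}), i.e. the probability that an
    identical replicate rejects, assuming the true standardized effect
    equals the observed one."""
    nd = NormalDist()
    z_crit = nd.inv_cdf(1.0 - alpha)
    return nd.cdf(z_obs - z_crit)
```

Since Phi is increasing, rp_estimate(z) > 1/2 if and only if z exceeds the critical value, which is exactly the RP-testing rule.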

Abstract: The generalized Pareto distribution (GPD) is a family of continuous distributions used to model the tail of a distribution above a threshold u. Despite the advantages of the GPD representation, its shape and scale parameters do not correspond directly to the expected value, which complicates the interpretation of regression models specified using the GPD. This study proposes a linear regression model in which the response variable follows a GPD, using a new parametrization indexed by mean and precision parameters. The main advantage of the new parametrization is the straightforward interpretation of the regression coefficients in terms of the expectation of the positive real-valued response variable, as is usual in the context of generalized linear models. Furthermore, we propose a model for extreme values in which the GPD parameters (mean and precision) are defined through a dynamic linear regression model. The novelty of the study lies in the time variation of the mean and precision parameters of the resulting distribution. Parameter estimation for these new models is performed under the Bayesian paradigm. Simulations are conducted to analyze the performance of the proposed models. Finally, the models are applied to environmental (temperature) datasets, illustrating their capabilities in challenging cases in extreme value theory. PubDate: 2021-03-01
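
One way to index the exceedance distribution by its mean (a sketch of the idea; the paper's mean-precision parametrization may differ in detail): for shape xi < 1 the GPD mean is sigma/(1 - xi), so setting sigma = mu*(1 - xi) yields a density parametrized directly by mu:

```python
import math

def gpd_pdf_mean(y, mu, xi):
    """Density of a GPD exceedance with mean mu > 0 and shape xi < 1,
    obtained by substituting sigma = mu * (1 - xi) so that E[Y] = mu."""
    sigma = mu * (1.0 - xi)
    if y < 0.0 or sigma <= 0.0:
        return 0.0
    if abs(xi) < 1e-12:  # exponential limit as xi -> 0
        return math.exp(-y / sigma) / sigma
    z = 1.0 + xi * y / sigma
    if z <= 0.0:
        return 0.0
    return z ** (-1.0 / xi - 1.0) / sigma
```

Under this substitution a regression structure placed on mu acts directly on the expected exceedance, as in a generalized linear model.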

Abstract: We discuss and characterise connections between frequentist, confidence distribution and objective Bayesian inference, when considering higher-order asymptotics, matching priors, and confidence distributions based on pivotal quantities. The focus is on testing precise or sharp null hypotheses on a scalar parameter of interest. Moreover, we illustrate that applying these procedures requires little additional effort compared to standard first-order theory. In this respect, using the R software, we show how to perform the computations in practice through three examples concerning data from inter-laboratory studies, stress–strength reliability, and a growth curve from dose–response data. PubDate: 2021-03-01

Abstract: This study considers interval-valued time series data. To characterize such data, we propose an auto-interval-regressive (AIR) model using the order statistics from normal distributions. Furthermore, to better capture the heteroscedasticity in volatility, we design a heteroscedastic volatility AIR (HVAIR) model. We derive the likelihood functions of the AIR and HVAIR models to obtain the maximum likelihood estimator. Monte Carlo simulations are then conducted to evaluate our methods of estimation and confirm their validity. A real data example from the S&P 500 Index is used to demonstrate our method. PubDate: 2021-03-01

Abstract: Applying quantile regression to count data presents logical and practical complications which are usually solved by artificially smoothing the discrete response variable through jittering. In this paper, we present an alternative approach in which the quantile regression coefficients are modeled by means of (flexible) parametric functions. The proposed method avoids jittering and presents numerous advantages over standard quantile regression in terms of computation, smoothness, efficiency, and ease of interpretation. Estimation is carried out by minimizing a “simultaneous” version of the loss function of ordinary quantile regression. Simulation results show that the described estimators are similar to those obtained with jittering, but are often preferable in terms of bias and efficiency. To exemplify our approach and provide guidelines for model building, we analyze data from the US National Medical Expenditure Survey. All the necessary software is implemented in the existing R package qrcm. PubDate: 2021-02-17
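
The building block of the "simultaneous" objective is the pinball (check) loss of ordinary quantile regression, summed over a grid of quantile levels with the coefficients modeled as functions of p. A minimal sketch, assuming an illustrative linear-in-p specification for each coefficient (the qrcm package supports more flexible bases):

```python
def pinball(residual, p):
    """Quantile-regression check loss rho_p(u) = u * (p - 1{u < 0})."""
    return residual * (p - (1.0 if residual < 0 else 0.0))

def simultaneous_loss(y, x, theta, p_grid):
    """Average pinball loss over a grid of quantile levels, with the
    conditional quantile function modeled parametrically in p as
    Q(p | x) = (theta0 + theta1*p) + (theta2 + theta3*p) * x."""
    t0, t1, t2, t3 = theta
    total = 0.0
    for p in p_grid:
        a, b = t0 + t1 * p, t2 + t3 * p
        total += sum(pinball(yi - (a + b * xi), p) for yi, xi in zip(y, x))
    return total / len(p_grid)
```

Minimizing this single objective over theta fits all quantile levels at once, which is what enforces smoothness across p and avoids jittering the discrete response.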

Abstract: In this paper, we tackle the problem of splitting a long (potentially time-consuming) questionnaire into two parts, where each participant responds to only a fraction of the questions and all respondents answer a common portion of questions. We propose a method that fits regression models to the two independent samples (questionnaires) in the survey. Each sample includes the common response variable Y and common covariate x, while two vectors of specific covariates z and w are recorded such that no single sampling unit has answered both z and w. This corresponds to the problem of statistical matching, which we tackle under the assumption of conditional independence. In the statistical matching context, we use a macro approach to estimate the parameters of a regression model. This means that, under the assumption of conditional independence, we can estimate the joint distribution of all variables of interest from the available data. We exploit this here by fitting three regression models with the same response variable for each model. Combining the three models allows us to obtain a prediction model with all covariates in common. We compare the performance of our proposed method in simulation studies as well as in a real data example. Our method gives better results than commonly used alternative methods. The proposed routine is easy to apply in practice and requires neither a model for the covariates themselves nor an imputation model for the missing covariate vectors z and w. PubDate: 2021-02-16

Abstract: Estimating the size of a hard-to-count population is a challenging matter, in particular when only few observations of the population to be estimated are available. The matter gets even more complex when one-inflation occurs. This situation is illustrated with the help of two examples: the size of a dice snake population in Graz (Austria) and the number of flare stars in the Pleiades. The paper discusses how one-inflation can be easily handled in likelihood approaches, and how variances and confidence intervals can be obtained by means of a semi-parametric bootstrap. A Bayesian approach is presented as well, and all approaches result in similar estimates of the hidden size of the population. Finally, a simulation study shows that the unconditional likelihood approach as well as the Bayesian approach using Jeffreys' prior perform favorably. PubDate: 2021-01-30
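
Ignoring one-inflation, the basic untruncation step behind such size estimates is: fit a zero-truncated Poisson to the observed positive counts, then divide by the estimated detection probability, N-hat = n / (1 - exp(-lambda-hat)). A sketch of that baseline (the paper's one-inflated likelihood modifies this step):

```python
import math

def zt_poisson_lambda(mean_obs, tol=1e-10):
    """Solve m = lambda / (1 - exp(-lambda)) for lambda by bisection.
    The left side is the mean of a zero-truncated Poisson; it is
    increasing in lambda, so a root exists whenever mean_obs > 1."""
    lo, hi = 1e-12, max(mean_obs, 1.0) + 50.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if mid / (1.0 - math.exp(-mid)) < mean_obs:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def population_size(counts):
    """Estimate total population size from the positive counts only:
    N = n / P(count > 0) under the fitted (untruncated) Poisson."""
    n = len(counts)
    lam = zt_poisson_lambda(sum(counts) / n)
    return n / (1.0 - math.exp(-lam))
```

One-inflation breaks this baseline because excess ones inflate the apparent detection probability, which is exactly what the paper's likelihood corrects.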

Abstract: Repeated measures designs are widely used in practice to increase power, reduce sample size, and increase efficiency in data collection. Correlation between repeated measurements is one of the first research questions that needs to be addressed in a repeated-measures study. In addition to an estimate of the correlation, a confidence interval should be computed and reported for statistical inference. The asymptotic interval based on the delta method is traditionally calculated due to its simplicity. However, this interval is often criticized for its unsatisfactory performance with regard to coverage and interval width. The bootstrap can be utilized to reduce the interval width; widely used bootstrap intervals include the percentile interval, the bias-corrected interval, and the bias-corrected and accelerated interval. Wilcox (Comput Stat Data Anal 22:89–98, 1996) suggested a modified percentile interval with the interval levels adjusted by sample size so that the coverage probability is close to the nominal level. For a study with repeated measures, parameters other than the sample size also affect the coverage probability. For these reasons, we propose modifying the percentiles in the percentile interval, based on simulation studies, to guarantee the coverage probability. We analyze the correlation between imaging volumes and memory scores from the Alzheimer's Disease Neuroimaging Initiative (ADNI) study to illustrate the application of the considered intervals. The proposed interval is exact with the coverage probability guaranteed, and is recommended for use in practice. PubDate: 2021-01-21
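
A plain percentile bootstrap interval for the Pearson correlation is sketched below; the proposal replaces the fixed alpha/2 and 1 - alpha/2 levels with simulation-adjusted percentiles, a step this sketch omits:

```python
import math
import random

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length samples."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def boot_percentile_ci(x, y, level=0.95, B=1000, seed=0):
    """Percentile bootstrap confidence interval for the correlation:
    resample (x_i, y_i) pairs with replacement and take empirical
    quantiles of the bootstrap correlations."""
    rng = random.Random(seed)
    n = len(x)
    rs = []
    for _ in range(B):
        idx = [rng.randrange(n) for _ in range(n)]
        rs.append(pearson([x[i] for i in idx], [y[i] for i in idx]))
    rs.sort()
    alpha = 1.0 - level
    return rs[int(B * alpha / 2)], rs[int(B * (1 - alpha / 2)) - 1]
```

Adjusting which order statistics of rs are returned, as the paper proposes, is what tunes the coverage toward the nominal level.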