Abstract: Repeated measures designs are widely used in practice to increase power, reduce sample size, and improve efficiency in data collection. The correlation between repeated measurements is one of the first research questions that needs to be addressed in a repeated-measures study. In addition to a point estimate of the correlation, a confidence interval should be computed and reported for statistical inference. The asymptotic interval based on the delta method is traditionally calculated because of its simplicity, but it is often criticized for unsatisfactory coverage and interval width. The bootstrap can be used to reduce the interval width; widely used bootstrap intervals include the percentile interval, the bias-corrected interval, and the bias-corrected and accelerated interval. Wilcox (Comput Stat Data Anal 22:89–98, 1996) suggested a modified percentile interval whose levels are adjusted by sample size so that the coverage probability is close to the nominal level. In a study with repeated measures, parameters other than the sample size also affect the coverage probability. For these reasons, we propose modifying the percentiles of the percentile interval, guided by simulation studies, to guarantee the coverage probability. We analyze the correlation between imaging volumes and memory scores from the Alzheimer’s Disease Neuroimaging Initiative (ADNI) study to illustrate the application of the considered intervals. The proposed interval is exact, with the coverage probability guaranteed, and is recommended for use in practice. PubDate: 2021-01-21
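As an illustration of the kind of interval being adjusted, the following is a minimal base-R sketch of an ordinary bootstrap percentile interval for a correlation. It uses the usual alpha/2 and 1 - alpha/2 quantiles rather than the simulation-adjusted percentiles proposed in the paper, and the data are simulated stand-ins, not the ADNI measurements.

```r
## Minimal sketch (base R): ordinary bootstrap percentile CI for a correlation.
## The paper adjusts the percentiles via simulation; here the usual
## alpha/2 and 1 - alpha/2 quantiles are used for illustration only.
set.seed(1)
n <- 40
x <- rnorm(n)                       # e.g. imaging volume (simulated stand-in)
y <- 0.5 * x + rnorm(n, sd = 0.8)   # e.g. memory score (simulated stand-in)

boot_cor <- replicate(2000, {
  idx <- sample.int(n, replace = TRUE)   # resample subjects with replacement
  cor(x[idx], y[idx])
})

alpha <- 0.05
quantile(boot_cor, c(alpha / 2, 1 - alpha / 2))  # percentile interval
```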
Abstract: Consider a data set as a body of evidence that might confirm or disconfirm a hypothesis about a parameter value. If the posterior probability of the hypothesis is high enough, then the truth of the hypothesis is accepted for some purpose such as reporting a new discovery. In that way, the posterior probability measures the sufficiency of the evidence for accepting the hypothesis. It would only follow that the evidence is relevant to the hypothesis if the prior probability were not already high enough for acceptance. A measure of the relevancy of the evidence is the Bayes factor, since it is the ratio of the posterior odds to the prior odds. Measures of the sufficiency of the evidence and measures of the relevancy of the evidence are not mutually exclusive. An example falling in both classes is the likelihood ratio statistic, perhaps based on a pseudolikelihood function that eliminates nuisance parameters. There is a sense in which the likelihood ratio statistic measures both the sufficiency of the evidence and its relevancy. That result is established by representing the likelihood ratio statistic in terms of a conditional possibility measure that satisfies logical coherence rather than probabilistic coherence. PubDate: 2021-01-01
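The sufficiency/relevancy distinction rests on the identity "Bayes factor = posterior odds / prior odds". A minimal base-R worked example of that arithmetic follows; all numbers are illustrative, not taken from the paper.

```r
## Minimal sketch (base R): the Bayes factor as the ratio of posterior odds
## to prior odds.  All numbers are illustrative.
prior_H     <- 0.10                        # prior probability of the hypothesis
posterior_H <- 0.70                        # posterior probability after seeing the data

prior_odds     <- prior_H / (1 - prior_H)
posterior_odds <- posterior_H / (1 - posterior_H)

bayes_factor <- posterior_odds / prior_odds   # relevancy of the evidence
bayes_factor                                  # 21: the evidence multiplies the odds by 21
```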
Abstract: We propose a semiparametric P-Spline model to deal with spatial panel data. This model includes a non-parametric spatio-temporal trend, a spatial lag of the dependent variable, and a time series autoregressive noise. Specifically, we consider a spatio-temporal ANOVA model, disaggregating the trend into spatial and temporal main effects, as well as second- and third-order interactions between them. Algorithms based on spatial anisotropic penalties are used to estimate all the parameters in a closed form without the need for multidimensional optimization. Monte Carlo simulations and an empirical analysis of regional unemployment in Italy show that our model represents a valid alternative to parametric methods aimed at disentangling strong and weak cross-sectional dependence when both spatial and temporal heterogeneity are smoothly distributed. PubDate: 2020-12-01
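For readers who want a feel for the ANOVA-type trend decomposition, the following is a rough R sketch using mgcv tensor-product smooths as a stand-in for the P-spline trend. It covers only the nonparametric spatio-temporal trend (one interaction shown for brevity); the spatial lag of the dependent variable and the autoregressive noise of the full model are not included, and the data frame and variable names ('panel_df', 'unemp', 'lon', 'lat', 'year') are hypothetical.

```r
## Minimal sketch (R, package mgcv): a spatio-temporal ANOVA-type trend with
## main effects and a space-time interaction, as a stand-in for the paper's
## P-spline trend.  The spatial lag and AR noise terms of the full model are
## NOT included; 'panel_df', 'unemp', 'lon', 'lat', 'year' are hypothetical.
library(mgcv)

fit <- gam(unemp ~ s(lon, lat)                        # spatial main effect
                  + s(year)                           # temporal main effect
                  + ti(lon, lat, year, d = c(2, 1)),  # space-time interaction
           data = panel_df, method = "REML")
summary(fit)
```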
Abstract: This paper proposes a new aggregated classification scheme aimed at supporting the implementation of semantic text analysis methods in contexts characterized by the presence of rare text categories. The proposed approach starts from the aggregate supervised text classifier developed by Hopkins and King and moves forward by relying on rare event sampling methods. In detail, it enables the analyst to enlarge the number of estimated sentiment categories, both preserving the estimation accuracy and reducing the working time required to unconditionally increase the size of the training set. The approach is applied to study the daily evolution of the web reputation of one of the most recent mega-events taking place in Europe: Expo Milano. The corpus consists of more than one million tweets, in both Italian and English, discussing the event. The analysis provides an interesting portrayal of the evolution of the Expo stakeholders’ opinions over time and allows the identification of the main drivers of the Expo reputation. The algorithm will be implemented as a running option in the next release of the R package ReadMe. PubDate: 2020-12-01
Abstract: Classification of a disease often depends on more than one test, and the tests can be interrelated. Under the incorrect assumption of independence, test results based on dependent biomarkers can lead to conflicting disease classifications. We develop a copula-based method for this purpose that takes the dependency into account and leads to a unique decision. We first construct the joint probability distribution of the biomarkers considering Frank’s, Clayton’s and Gumbel’s copulas. We then develop the classification method and perform a comprehensive simulation. Using simulated data sets, we study the statistical properties of the joint probability distributions and determine the joint threshold with maximum classification accuracy. Our simulation study results show that parameter estimates for the copula-based bivariate distributions are not biased. We observe that the thresholds for disease classification converge to a stationary distribution across different choices of copulas. We also observe that the classification accuracy decreases with increasing values of the dependence parameter of the copulas. Finally, we illustrate our method with a real data example, where we identify the joint threshold of the Apolipoprotein B to Apolipoprotein A1 ratio and the total cholesterol to high-density lipoprotein ratio for the classification of myocardial infarction. We conclude that the copula-based method works well in identifying the joint threshold of two dependent biomarkers for an outcome classification. Our method is flexible and allows modeling broad classes of bivariate distributions that take dependency into account. The threshold may allow clinicians to uniquely classify individuals at risk of developing the disease and to plan for early intervention. PubDate: 2020-12-01
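To show how a copula couples two marginal distributions into a joint probability at a pair of thresholds, here is a minimal base-R sketch using the Clayton copula (one of the three families considered). The dependence parameter, margins and thresholds are illustrative assumptions, not the paper's estimates, and the threshold-search procedure itself is not reproduced.

```r
## Minimal sketch (base R): joint probabilities for two dependent biomarkers
## via a Clayton copula, C(u, v) = (u^-theta + v^-theta - 1)^(-1/theta).
## Margins, theta and thresholds are illustrative, not the paper's estimates.
clayton_cdf <- function(u, v, theta) (u^(-theta) + v^(-theta) - 1)^(-1 / theta)

theta <- 2                                   # dependence parameter (theta > 0)
x_thr <- 0.9                                 # threshold for biomarker 1 (e.g. ApoB/ApoA1)
y_thr <- 5.0                                 # threshold for biomarker 2 (e.g. TC/HDL)

u <- pnorm(x_thr, mean = 0.8, sd = 0.2)      # marginal CDF of biomarker 1 at its threshold
v <- pnorm(y_thr, mean = 4.5, sd = 1.0)      # marginal CDF of biomarker 2 at its threshold

## P(X > x_thr, Y > y_thr): probability that both biomarkers exceed their thresholds
p_joint_exceed <- 1 - u - v + clayton_cdf(u, v, theta)
p_joint_exceed
```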
Abstract: The simplex is the geometrical locus of D-dimensional positive data with constant sum, called compositions. A possible distribution for compositions is the Dirichlet. In Dirichlet models, there are no scale parameters and the D shapes are assumed dependent on auxiliary variables. This peculiar feature makes Dirichlet models difficult to apply and to interpret. Here, we propose a generalization of the Dirichlet, called the simplicial generalized Beta (SGB) distribution. It includes an overall shape parameter, a scale composition and the D Dirichlet shapes. The SGB is flexible enough to accommodate many practical situations. SGB regression models are applied to data from the United Kingdom Time Use Survey. The R package SGB makes the methods accessible to users. PubDate: 2020-12-01
Abstract: Motivated by the need to introduce design improvements to the Internet network to make it robust to high traffic volume anomalies, we analyze statistical properties of the time separation between arrivals of consecutive anomalies in the Internet2 network. Using several statistical techniques, we demonstrate that for all unidirectional links in Internet2, these interarrival times have distributions whose tail probabilities decay like a power law. These heavy-tailed distributions have varying tail indexes, which in some cases imply infinite variance. We establish that the interarrival times can be modeled as independent and identically distributed random variables, and propose a model for their distribution. These findings allow us to use the tools of renewal theory, which in turn allow us to estimate the distribution of the waiting time for the arrival of the next anomaly. We show that the waiting time is stochastically substantially longer than the time between the arrivals, and may in some cases have infinite expected value. All our findings are tabulated and displayed in the form of suitable graphs, including the relevant density estimates. PubDate: 2020-12-01
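As a concrete illustration of estimating a power-law tail index from interarrival times, here is a minimal base-R sketch of the Hill estimator. The simulated data and the choice of k are illustrative only; the paper's full diagnostic toolkit is not reproduced.

```r
## Minimal sketch (base R): Hill estimator of the tail index alpha for
## heavy-tailed interarrival times.  Data and k are illustrative only;
## alpha < 2 would imply infinite variance, alpha < 1 infinite mean.
set.seed(2)
interarrival <- (runif(500))^(-1 / 1.5)      # simulated Pareto-type data, alpha = 1.5

hill <- function(x, k) {
  xs <- sort(x, decreasing = TRUE)           # upper order statistics
  1 / mean(log(xs[1:k] / xs[k + 1]))         # Hill estimate based on the top k observations
}

hill(interarrival, k = 50)
```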
Abstract: Multi-state models are considered in the field of survival analysis for modelling illnesses that evolve through several stages over time. Multi-state models can be developed by applying several techniques, such as non-parametric, semi-parametric and stochastic processes, particularly Markov processes. When the development of an illness is being analysed, its progression is tracked periodically. Medical reviews take place at discrete times, and a panel data analysis can be performed. In this paper, a discrete-time piecewise non-homogeneous Markov process is constructed for modelling and analysing a multi-state illness with a general number of states. The model is built, and relevant measures, such as the survival function, transition probabilities, mean total times spent in a group of states and the conditional probability of state change, are determined. A likelihood function is built to estimate the parameters and the general number of cut-points included in the model. Time-dependent covariates are introduced, the results are obtained in matrix algebraic form and the algorithms are shown. The model is applied to analyse the behaviour of breast cancer. A study of the relapse and survival times of 300 breast cancer patients who underwent mastectomy is presented. The results of this paper are implemented computationally with MATLAB and R. PubDate: 2020-12-01
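For intuition about the discrete-time building block, the following base-R sketch computes the maximum likelihood estimate of a homogeneous transition matrix from an observed state sequence (row-wise proportions of transition counts). The paper's piecewise non-homogeneous model with covariates and cut-points is considerably richer; the state sequence below is simulated.

```r
## Minimal sketch (base R): ML estimate of a discrete-time Markov transition
## matrix from panel data (row-wise proportions of observed transitions).
## The paper's model is piecewise non-homogeneous with covariates; this is
## only the homogeneous building block, on a simulated sequence of states 1..3.
set.seed(3)
states <- sample(1:3, size = 400, replace = TRUE)   # stand-in observed state sequence

trans_counts <- table(from = head(states, -1), to = tail(states, -1))
P_hat <- prop.table(trans_counts, margin = 1)       # row-stochastic estimate
round(P_hat, 3)
```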
Abstract: The problem of detecting a major change point in a stochastic process is often of interest in applications, in particular when the effects on the process of modifications of some external variables must be identified. We propose a modification of the classical Pearson \(\chi ^2\) test to detect the presence of such a major change point in the transition probabilities of an inhomogeneous discrete-time Markov chain taking values in a finite space. The test can also be applied in the presence of large identically distributed samples of the Markov chain under study, which need not be independent. The test is based on the maximum likelihood estimate of the size of the ’right’ experimental unit, i.e. the number of units that must be aggregated to filter out the small-scale variability of the transition probabilities. We apply our test both to simulated data and to a real dataset, to study the impact on farmland uses of the new Common Agricultural Policy, which entered into force in the EU in 2015. PubDate: 2020-12-01
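To fix ideas, here is a minimal base-R sketch of the classical Pearson chi-square homogeneity test applied to the transition counts out of one origin state, before versus after a candidate change point. This is the textbook test, not the paper's modified statistic with aggregated experimental units; the state sequence and the candidate change point tau are simulated/illustrative.

```r
## Minimal sketch (base R): Pearson chi-square homogeneity test of the
## transition distribution out of one origin state, before vs. after a
## candidate change point tau.  This is the classical test, not the paper's
## modification; states and tau are simulated/illustrative.
set.seed(4)
states <- c(sample(1:3, 200, replace = TRUE, prob = c(.6, .3, .1)),
            sample(1:3, 200, replace = TRUE, prob = c(.2, .3, .5)))
tau <- 200
from <- head(states, -1); to <- tail(states, -1)
period <- ifelse(seq_along(from) <= tau, "before", "after")

origin <- 1                                   # test transitions leaving state 1
tab <- table(period[from == origin], to[from == origin])
chisq.test(tab)                               # a small p-value suggests a change
```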
Abstract: We (a) establish the probability mass function of the interpoint distance (IPD) between random vectors drawn from the multivariate power series family of distributions (MPSD); (b) obtain the distribution of the IPD within one sample and across two samples from this family; (c) determine the distribution of the MPSD Euclidean norm and of the distance from fixed points in \({\mathbb {Z}}^d\); and (d) provide the distribution of the IPDs of vectors drawn from a mixture of MPSD distributions. We also present a method for testing the homogeneity of MPSD mixtures using the sample IPDs. PubDate: 2020-12-01
Abstract: In this paper, we consider the asymptotic properties of the nearest neighbors estimation for long memory functional data. Under some regularity assumptions, we investigate the asymptotic normality and the uniform consistency of the nearest neighbors estimators for the nonparametric regression models when the explanatory variable and the errors are of long memory and the explanatory variable takes values in some abstract functional space. The finite sample performance of the proposed estimator is discussed through simulation studies. PubDate: 2020-12-01
Abstract: This paper focuses on studying the relationships among a set of categorical (ordinal) variables collected in a contingency table. Besides the marginal and conditional (in)dependencies, thoroughly analyzed in the literature, we consider the context-specific independencies holding only in a subspace of the outcome space of the conditioning variables. To this end we consider hierarchical multinomial marginal models and provide several original results about the representation of context-specific independencies through these models. The theoretical results are supported by an application concerning the innovation degree of Italian enterprises. PubDate: 2020-12-01
Abstract: This paper applies the modified least absolute shrinkage and selection operator (LASSO) to the regression model with dependent disturbances, in particular long-memory disturbances. Assuming that the norms of the different columns of the regression matrix may be of different orders in the observation length n, we introduce a modified LASSO estimator in which the tuning parameter \(\lambda\) is a vector rather than a scalar. When the dimension of the parameter vector is fixed, we derive the asymptotic distribution of the modified LASSO estimators under certain regularity conditions. When the dimension of the parameter vector increases with n, the consistency of the probability of correct selection under the penalty parameters is shown under certain regularity conditions. Simulation studies are also presented. PubDate: 2020-12-01
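In standard software, column-specific penalty levels can be mimicked with per-coefficient penalty factors. The following R sketch uses glmnet's penalty.factor argument as a rough analogue of a vector-valued tuning parameter; it is ordinary LASSO on simulated independent errors, not the paper's estimator or its long-memory asymptotics, and the weights are illustrative.

```r
## Minimal sketch (R, package glmnet): column-specific penalization as a
## rough analogue of a vector-valued tuning parameter.  This is ordinary
## LASSO with per-coefficient penalty factors, not the paper's estimator
## for long-memory disturbances; data are simulated for illustration.
library(glmnet)
set.seed(5)
n <- 200; p <- 6
X <- matrix(rnorm(n * p), n, p)
y <- X %*% c(2, 0, 0, 1, 0, 0) + rnorm(n)

w <- c(1, 1, 1, 2, 2, 2)            # heavier penalty on the last three columns
fit <- glmnet(X, y, alpha = 1, penalty.factor = w)
coef(fit, s = 0.1)                  # coefficients at a chosen overall lambda
```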
Abstract: Lack of independence in the residuals from linear regression motivates the use of random effect models in many applied fields. We start from the one-way ANOVA model and extend it to a general class of one-factor Bayesian mixed models, discussing several correlation structures for the within-group residuals. All the considered group models are parametrized in terms of a single correlation (hyper-)parameter controlling the shrinkage towards the case of independent (iid) residuals. We derive a penalized complexity (PC) prior for the correlation parameter of a generic group model. This prior has desirable properties from a practical point of view: (i) it ensures appropriate shrinkage to the iid case; (ii) it depends on a scaling parameter whose choice only requires a prior guess on the proportion of total variance explained by the grouping factor; (iii) it is defined on a distance scale common to all group models, so the scaling parameter can be chosen in the same manner regardless of the adopted group model. We show the benefit of using these PC priors in a case study in community ecology where different group models are compared. PubDate: 2020-12-01
Abstract: In this paper we propose a sufficient dimension reduction algorithm based on differences of inverse medians. The classic methodology based on inverse means within each slice was recently extended to inverse medians in order to robustify existing methodology in the presence of outliers. Our effort is focused on using differences between inverse medians in pairs of slices. We demonstrate that our method outperforms existing methods in the presence of outliers. We also propose a second algorithm which is not affected by the ordering of slices when the response variable is categorical with no underlying ordering of its values. PubDate: 2020-12-01
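To convey the general slicing idea, the following base-R sketch slices the response, computes coordinate-wise medians of the predictors within each slice, forms differences between the slice medians, and extracts directions from their span. This is a plausible generic illustration under those assumptions, not the authors' exact algorithm, and the data are simulated.

```r
## Minimal sketch (base R): a generic illustration of slicing-based dimension
## reduction using differences of within-slice coordinate-wise medians.
## A plausible stand-in for the idea, NOT the authors' exact algorithm.
set.seed(6)
n <- 300; p <- 5
X <- matrix(rnorm(n * p), n, p)
y <- X[, 1] + 0.5 * X[, 2]^2 + rnorm(n, sd = 0.2)

H <- 5                                            # number of slices
slice <- cut(rank(y), breaks = H, labels = FALSE) # slice the response
med <- t(sapply(1:H, function(h) apply(X[slice == h, , drop = FALSE], 2, median)))

## differences of inverse medians between consecutive slices
D <- diff(med)                                    # (H-1) x p matrix
directions <- eigen(crossprod(D))$vectors[, 1:2]  # leading e.d.r. directions
round(directions, 3)
```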
Abstract: Structural change in any time series is practically unavoidable, and thus correctly detecting breakpoints plays a pivotal role in statistical modelling. This research considers segmented autoregressive models with exogenous variables and asymmetric GARCH errors, in the GJR-GARCH and exponential GARCH specifications, which use the leverage effect to capture asymmetric responses to positive and negative shocks. The proposed models incorporate the skew Student-t distribution and demonstrate the advantages of the fat-tailed skew Student-t distribution over other distributions when structural changes appear in financial time series. We employ Bayesian Markov Chain Monte Carlo methods to make inferences about the locations of structural change points and the model parameters, and we use the deviance information criterion to determine the optimal number of breakpoints via a sequential approach. Our models can accurately detect the number and locations of structural change points in simulation studies. For the real data analysis, we examine the impacts of daily gold returns and the VIX on S&P 500 returns during 2007–2019. The proposed methods are able to integrate structural changes through the model parameters and to capture the variability of a financial market more efficiently. PubDate: 2020-11-26
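As a single-regime building block of the specifications discussed here, the following R sketch fits an AR(1)-GJR-GARCH(1,1) model with skew Student-t innovations using the rugarch package. The Bayesian change-point machinery of the paper is not reproduced, and 'sp500_ret' is a hypothetical vector of daily returns.

```r
## Minimal sketch (R, package rugarch): a single-regime AR(1)-GJR-GARCH(1,1)
## model with skew Student-t innovations, i.e. one building block of the
## paper's segmented specification.  The Bayesian change-point machinery is
## not reproduced; 'sp500_ret' is a hypothetical vector of daily returns.
library(rugarch)

spec <- ugarchspec(
  variance.model     = list(model = "gjrGARCH", garchOrder = c(1, 1)),
  mean.model         = list(armaOrder = c(1, 0), include.mean = TRUE),
  distribution.model = "sstd"                  # skew Student-t
)
fit <- ugarchfit(spec, data = sp500_ret)
fit
```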
Abstract: A measure of interrater absolute agreement for ordinal scales is proposed, capitalizing on the dispersion index for ordinal variables proposed by Giuseppe Leti. The procedure makes it possible to overcome the limits affecting traditional measures of interrater agreement in different fields of application. An unbiased estimator of the proposed measure is introduced and its sampling properties are investigated. In order to construct confidence intervals for interrater absolute agreement, both asymptotic results and bootstrapping methods are used and their performance is evaluated. Simulated data are employed to demonstrate the accuracy and practical utility of the new procedure for assessing agreement. Finally, an application to a real case is provided. PubDate: 2020-11-24
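For orientation, the base-R sketch below computes Leti's dispersion index for an ordinal variable in its commonly stated form, d = 2 * sum over categories of F_i(1 - F_i), where the F_i are cumulative relative frequencies. The agreement measure proposed in the paper builds on this index but is not reproduced here; the ratings are illustrative.

```r
## Minimal sketch (base R): Leti's dispersion index for an ordinal variable,
## d = 2 * sum_i F_i * (1 - F_i) over the cumulative relative frequencies
## (0 for a degenerate distribution, larger for more dispersed ratings).
## Illustrative only; the paper's agreement measure and its estimator are
## not reproduced here.
leti_index <- function(ratings, levels) {
  f <- table(factor(ratings, levels = levels)) / length(ratings)
  F <- cumsum(f)[-length(f)]            # cumulative frequencies F_1, ..., F_{k-1}
  2 * sum(F * (1 - F))
}

leti_index(c(1, 2, 2, 3, 3, 3, 4, 5), levels = 1:5)   # some dispersion
leti_index(rep(3, 8),                 levels = 1:5)   # no dispersion: 0
```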
Abstract: Conditional Autoregressive Value-at-Risk and Conditional Autoregressive Expectile have become two popular approaches for the direct measurement of market risk. Since their introduction, several improvements have been proposed, both in the Bayesian and in the classical framework, to better account for asymmetry and local non-linearity. Here we propose a unified Bayesian Conditional Autoregressive Risk Measures approach based on the Skew Exponential Power distribution. Further, we extend the proposed models using a semiparametric P-spline approximation, providing a flexible way to account for non-linearity. For statistical inference we adapt the MCMC algorithm proposed in Bernardi et al. (2018) to our case. The effectiveness of the whole approach is demonstrated using real data on the daily returns of five stock market indices. PubDate: 2020-11-18
Abstract: The notion of testing for equivalence of two treatments is widely used in clinical trials, pharmaceutical experiments, bioequivalence and quality control. It is traditionally formulated within the intersection–union (IU) principle. According to this principle, the null hypothesis is the set of effect differences \(\delta\) lying outside a suitable equivalence interval and the alternative is the set of \(\delta\) lying inside it. Related solutions in the literature are essentially based on likelihood techniques, which in turn are rather difficult to deal with. A recently published paper goes beyond most of the likelihood limitations by using the IU approach within permutation theory. Another paper, based on Roy's union–intersection (UI) principle within permutation theory, goes beyond some limitations of traditional two-sided tests. The UI approach, effectively a mirror image of the IU approach, assumes a null hypothesis in which \(\delta\) lies inside the equivalence interval and an alternative in which it lies outside. Testing for equivalence can rationally be analyzed by both principles, but since the two differ in the mirror-like roles assigned to the hypotheses under study, they are not strictly comparable. The present paper's main goal is to look into these problems and to provide a comparative analysis of the two approaches, highlighting the related requirements, properties, limitations, difficulties and pitfalls, so as to acquaint practitioners with their correct use in practical contexts. PubDate: 2020-11-10
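As a concrete instance of the IU formulation, the following base-R sketch runs the classical two one-sided tests (TOST) procedure for equivalence of two means. It is the standard parametric version, not the permutation-based procedures discussed in the paper; the equivalence margins and data are illustrative.

```r
## Minimal sketch (base R): two one-sided tests (TOST), the standard
## parametric instance of the intersection-union formulation of equivalence.
## The permutation-based tests discussed in the paper are not reproduced;
## the equivalence margins and data are illustrative.
set.seed(7)
x <- rnorm(30, mean = 10.0)
y <- rnorm(30, mean = 10.1)
lower <- -0.5; upper <- 0.5                 # equivalence interval for delta = mu_x - mu_y

p_lower <- t.test(x, y, mu = lower, alternative = "greater")$p.value  # H0: delta <= lower
p_upper <- t.test(x, y, mu = upper, alternative = "less")$p.value     # H0: delta >= upper

max(p_lower, p_upper)        # IU: conclude equivalence only if both p-values are small
```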
Abstract: There are now many theoretical explanations for why Benford's law of digit bias surfaces in so many diverse fields and data sets. After briefly reviewing some of these, we discuss recurrence relations in detail. As these are discrete analogues of differential equations and model a variety of real-world phenomena, they provide an important source of systems to test for Benfordness. Previous work showed that fixed-depth recurrences with constant coefficients are Benford, modulo some technical assumptions which are usually met; we briefly review that theory and then prove new results extending it to linear recurrence relations with non-constant coefficients. We prove that, for certain families of functions f and g, a sequence generated by a recurrence relation of the form \(a_{n+1} = f(n)a_n + g(n)a_{n-1}\) is Benford for all initial values. The proof proceeds by parameterizing the coefficients to obtain a recurrence relation of lower degree, and then converting to a new parameter space. From there we show that for suitable choices of f and g, where f(n) is nondecreasing and \(g(n)/f(n)^2 \rightarrow 0\) as \(n \rightarrow \infty \), the main term dominates and the behavior is equivalent to equidistribution problems previously studied. We also describe the results of generalizing further to higher-degree recurrence relations and multiplicative recurrence relations with non-constant coefficients, as well as the important case when f and g are values of random variables. PubDate: 2020-10-29
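An empirical check of this kind of result is easy to run. The base-R sketch below simulates the recurrence \(a_{n+1} = f(n)a_n + g(n)a_{n-1}\) for one illustrative choice satisfying the stated conditions (f(n) = n is nondecreasing, g(n) = 1 so g(n)/f(n)^2 tends to 0), rescales by powers of 10 to avoid overflow, and compares the empirical leading-digit frequencies with Benford's law; it is a numerical illustration, not part of the proof.

```r
## Minimal sketch (base R): leading-digit frequencies of a recurrence
## a_{n+1} = f(n) a_n + g(n) a_{n-1} with f(n) = n, g(n) = 1 (so f is
## nondecreasing and g(n)/f(n)^2 -> 0), compared with Benford's law.
## Rescaling both terms by a power of 10 keeps the numbers finite without
## changing leading digits, since the recurrence is linear.
N <- 5000
a_prev <- 1; a_curr <- 1
lead <- integer(N)
for (n in 1:N) {
  a_next <- n * a_curr + 1 * a_prev
  a_prev <- a_curr; a_curr <- a_next
  while (a_curr > 1e100) {                  # common rescaling leaves digits intact
    a_prev <- a_prev / 1e50; a_curr <- a_curr / 1e50
  }
  lead[n] <- floor(a_curr / 10^floor(log10(a_curr)))   # leading digit 1..9
}
cbind(empirical = tabulate(lead, 9) / N,
      benford   = log10(1 + 1 / (1:9)))
```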