AStA Advances in Statistical Analysis
Journal Prestige (SJR): 0.548
Citation Impact (citeScore): 1
Number of Followers: 2
Hybrid journal (It can contain Open Access articles)
ISSN (Print) 1863-818X - ISSN (Online) 1863-8171
Published by Springer-Verlag [2467 journals]
• Left-truncated health insurance claims data: theoretical review and empirical application

Abstract: From the inventory of the health insurer AOK in 2004, we draw a sample of a quarter million people and follow each person’s health claims continuously until 2013. Our aim is to estimate the effect of a stroke on the dementia onset probability for Germans born in the first half of the 20th century. People deceased before 2004 are randomly left-truncated and, in particular, their number is unknown. Filtrations that model the missing data make it possible to circumvent the unknown number of truncated persons by using a conditional likelihood. Dementia onset after 2013 is a fixed right-censoring event. For each observed health history, Jacod’s formula yields its conditional likelihood contribution. Asymptotic normality of the estimated intensities is derived, related to a sample size definition including the number of truncated people. The standard error results from the asymptotic normality and is easily computable, despite the unknown sample size. The claims data reveal that after a stroke, with time measured in years, the intensity of dementia onset increases from 0.02 to 0.07. Using the independence of the two estimated intensities, a 95% confidence interval for their difference is [0.053, 0.057]. The effect halves when we extend the analysis to an age-inhomogeneous model, but does not change further when we additionally adjust for multi-morbidity.
PubDate: 2023-02-02

• Statistical guarantees for sparse deep learning

Abstract: Neural networks are becoming increasingly popular in applications, but our mathematical understanding of their potential and limitations is still limited. In this paper, we further this understanding by developing statistical guarantees for sparse deep learning. In contrast to previous work, we consider different types of sparsity, such as few active connections, few active nodes, and other norm-based types of sparsity. Moreover, our theories cover important aspects that previous theories have neglected, such as multiple outputs, regularization, and $$\ell_{2}$$-loss. The guarantees have a mild dependence on network widths and depths, which means that they support the application of sparse but wide and deep networks from a statistical perspective. Some of the concepts and tools that we use in our derivations are uncommon in deep learning and, hence, might be of additional interest.
PubDate: 2023-01-24

• Addressing non-normality in multivariate analysis using the t-distribution

Abstract: The main aim of this paper is to propose a set of tools for assessing non-normality taking into consideration the class of multivariate t-distributions. Assuming second moment existence, we consider a reparameterized version of the usual t-distribution, so that the scale matrix coincides with the covariance matrix of the distribution. We use the local influence procedure and the Kullback–Leibler divergence measure to propose quantitative methods to evaluate deviations from the normality assumption. In addition, the possible non-normality due to the presence of both skewness and heavy tails is also explored. Our findings based on two real datasets are complemented by a simulation study to evaluate the performance of the proposed methodology on finite samples.
PubDate: 2023-01-21

• Bayesian ridge regression for survival data based on a vine copula-based prior

Abstract: Ridge regression estimators can be interpreted as a Bayesian posterior mean (or mode) when the regression coefficients follow a multivariate normal prior. However, the multivariate normal prior may not give efficient posterior estimates for regression coefficients, especially in the presence of interaction terms. In this paper, vine copula-based priors are proposed for Bayesian ridge estimators under the Cox proportional hazards model. The semiparametric Cox models are built on the posterior density under two likelihoods: Cox’s partial likelihood and the full likelihood under the gamma process prior. The simulations show that the full likelihood is generally more efficient and stable for estimating regression coefficients than the partial likelihood. We also show via simulations and a data example that the Archimedean copula priors (the Clayton and Gumbel copulas) are superior to the multivariate normal prior and the Gaussian copula prior.
PubDate: 2022-12-30
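
The opening claim of the abstract above, that a ridge estimator is a posterior mean under a multivariate normal prior, can be checked numerically in the simplest setting. The following is a minimal sketch for a linear model with Gaussian noise only; it does not implement the Cox model or the vine copula priors of the paper. The simulated data, the variances `sigma2` and `tau2`, and both function names are illustrative assumptions.

```python
import numpy as np

def ridge_estimate(X, y, lam):
    """Ridge estimator: minimiser of ||y - X b||^2 + lam * ||b||^2."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

def posterior_mean(X, y, sigma2, tau2):
    """Posterior mean of b when y ~ N(X b, sigma2 * I) and b ~ N(0, tau2 * I)."""
    p = X.shape[1]
    precision = X.T @ X / sigma2 + np.eye(p) / tau2
    return np.linalg.solve(precision, X.T @ y / sigma2)

# Illustrative simulated data (assumption, not from the paper).
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=50)

sigma2, tau2 = 1.0, 4.0
# The two estimators coincide when the penalty equals sigma2 / tau2.
b_ridge = ridge_estimate(X, y, lam=sigma2 / tau2)
b_bayes = posterior_mean(X, y, sigma2, tau2)
```

Under these assumptions the penalty plays the role of the noise-to-prior variance ratio, which is why a different prior, such as the copula-based priors studied in the paper, leads to a different shrinkage estimator.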

• Hedonic pricing modelling with unstructured predictors: an application to Italian Fashion Industry

Abstract: This study proposes a comparison of hedonic pricing models that use attributes obtained by featurizing text. We collected prices of items sold on the websites of five famous fashion producers in order to estimate hedonic pricing models that leverage the information contained in product descriptions. We mapped product descriptions to a high-dimensional feature space and compared predictive accuracy and variable selection properties of some statistical estimators that leverage sparse modelling, topic modelling and aggregated predictors, to test whether better predictive accuracy comes with an empirically consistent selection of attributes. We call this approach Hedonic Text-Regression modelling. Its novelty is that by using attributes obtained by text-mining of product descriptions, we obtain an estimate of the implicit price of the words contained therein. Empirically, all the proposed models outperformed the traditional hedonic pricing model in terms of predictive accuracy, while also providing consistent variable selection.
PubDate: 2022-12-13

• Imputation-based empirical likelihood inferences for partially nonlinear quantile regression models with missing responses

Abstract: In this paper, we consider confidence interval construction for partially nonlinear models with responses missing at random under the framework of quantile regression. We propose an imputation-based empirical likelihood method to construct statistical inferences for both the unknown parametric vector in the nonlinear function and the nonparametric function, and show that the proposed empirical log-likelihood ratios are both asymptotically chi-squared in theory. Furthermore, the confidence region for the parametric vector and the pointwise confidence interval for the nonparametric function are constructed. Some simulation studies are implemented to assess the performance of the proposed estimation method, and simulation results indicate that the proposed method is workable.
PubDate: 2022-12-01
DOI: 10.1007/s10182-022-00441-z

• Local spatial log-Gaussian Cox processes for seismic data

Abstract: In this paper, we propose the use of advanced and flexible statistical models to describe the spatial displacement of earthquake data. The paper aims to account for the external geological information in the description of complex seismic point processes, through the estimation of models with space-varying parameters. A local version of the Log-Gaussian Cox processes (LGCP) is introduced and applied for the first time, exploiting the inferential tools in Baddeley (Spat Stat 22:261–295, 2017), estimating the model by the local Palm likelihood. We provide methods and approaches accounting for the interaction among points, typically described by LGCP models through the estimation of the covariance parameters of the Gaussian Random Field, which in this local version are allowed to vary in space, providing a more realistic description of the clustering feature of seismic events. Furthermore, we contribute to the framework of diagnostics, outlining suitable methods for the local context and proposing a new step-wise approach addressing the particular case of multiple covariates. Overall, we show that local models provide good inferential results and could serve as the basis for future spatio-temporal local model developments, specific to the description of the complex seismic phenomenon.
PubDate: 2022-12-01
DOI: 10.1007/s10182-022-00444-w

• Assessment of agricultural sustainability in European Union countries: a group-based multivariate trajectory approach

Abstract: Sustainability of agriculture is difficult to measure and assess because it is a multidimensional concept that involves economic, social and environmental aspects and is subject to temporal evolution and geographical differences. Existing studies assessing agricultural sustainability in the European Union (EU) are affected by several shortcomings that limit their relevance for policy makers. Specifically, most of them focus on farm level or cover a small set of countries, and the few exceptions covering a broad set of countries consider only a subset of the sustainable dimensions or rely on cross-sectional data. In this paper, we consider yearly data on 12 indicators (5 for the economic, 3 for the social and 4 for the environmental dimension) measured on 26 EU countries in the period 2004–2018 (15 years), and apply group-based multivariate trajectory modeling to identify groups of countries with common trends of sustainable objectives. An expectation-maximization algorithm is proposed to perform maximum likelihood estimation from incomplete data without relying on an explicit imputation procedure. Our results highlight three groups of countries with distinct strong and weak sustainable objectives. Strong objectives common to all three groups include improvement of productivity, increase of personal income in rural areas, reduction of poverty in rural areas, increase of production of renewable energy, rise of organic farming and reduction of nitrogen balance. In contrast, enhancement of manager turnover and reduction of greenhouse gas emissions are weak objectives common to all three groups of countries. Our findings represent a valuable resource to formulate new schemes for the attribution of subsidies within the Common Agricultural Policy (CAP).
PubDate: 2022-12-01
DOI: 10.1007/s10182-022-00437-9

• Some measures of kurtosis and their inference on large datasets

Abstract: This paper deals with the estimation of kurtosis on large datasets. It aims at overcoming two frequent limitations in applications: first, Pearson’s standardized fourth moment is computed as a unique measure of kurtosis; second, the fact that data might be just samples is neglected, so that the opportunity of using suitable inferential tools, like standard errors and confidence intervals, is discarded. In the paper, some recent indexes of kurtosis are reviewed as alternatives to Pearson’s standardized fourth moment. The asymptotic distribution of their natural estimators is derived, and it is used as a tool to evaluate efficiency and to build confidence intervals. A simulation study is also conducted to provide practical indications about the choice of a suitable index. As a conclusion, researchers are warned against the use of classical Pearson’s index when the sample size is too low and/or the distribution is skewed and/or heavy-tailed. Specifically, the occurrence of heavy tails can deprive Pearson’s index of any meaning or produce unreliable confidence intervals. However, such limitations can be overcome by reverting to the reviewed alternative indexes, relying just on low-order moments.
PubDate: 2022-12-01
DOI: 10.1007/s10182-022-00442-y

• A quantile regression perspective on external preference mapping

Abstract: External preference mapping is widely used in marketing and R&D divisions to understand consumer behaviour. The most common preference map is obtained through a two-step procedure that combines principal component analysis and least squares regression. The standard approach exploits classical regression and therefore focuses on the conditional mean. This paper proposes the use of quantile regression to enrich the preference map by looking at the whole distribution of the consumer preference. The enriched maps highlight possible differences in consumer behaviour with respect to the least or most preferred products. This is pursued by exploring the variability of liking along the principal components as well as focusing on the direction of preference. The use of different aesthetics (colours, shapes, size, arrows) equips the standard preference map with additional information and does not force users to change the standard tool they are used to. The proposed methodology is shown in action on a case study pertaining to yogurt preferences.
PubDate: 2022-12-01
DOI: 10.1007/s10182-022-00440-0

• On the Gaussian representation of the Riesz probability distribution on symmetric matrices

Abstract: The Riesz probability distribution on symmetric matrices represents an important extension of the Wishart distribution. It is defined by its Laplace transform involving the notion of generalized power. Based on the fact that some Wishart distributions are represented by means of the multivariate Gaussian distribution, it is shown that some Riesz probability distributions which are not necessarily Wishart are also represented by means of Gaussian samples with missing data. As a corollary, we deduce a Gaussian representation of the inverse Riesz distribution and we give its expectation. The results are assessed in simulation studies.
PubDate: 2022-12-01
DOI: 10.1007/s10182-022-00436-w

• Scoring predictions at extreme quantiles

Abstract: Prediction of quantiles at extreme tails is of interest in numerous applications. Extreme value modelling provides various competing predictors for this point prediction problem. A common method of assessment of a set of competing predictors is to evaluate their predictive performance in a given situation. However, due to the extreme nature of this inference problem, it is possible that the predicted quantiles are not seen in the historical records, particularly when the sample size is small. This situation poses a problem for the validation of the prediction against its realization. In this article, we propose two non-parametric scoring approaches to assess extreme quantile prediction mechanisms. The proposed assessment methods are based on predicting a sequence of equally extreme quantiles on different parts of the data. We then use the quantile scoring function to evaluate the competing predictors. The performance of the scoring methods is compared with the conventional scoring method, and the superiority of the former methods is demonstrated in a simulation study. The methods are then applied to analyze cyber Netflow data from Los Alamos National Laboratory and daily precipitation data at a station in California available from the Global Historical Climatology Network.
PubDate: 2022-12-01
DOI: 10.1007/s10182-021-00421-9
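
The abstract above evaluates predictors with the quantile scoring function. As a minimal sketch, the standard quantile (pinball) score for a predicted quantile q at level tau and an observation y is (1{y <= q} - tau)(q - y), averaged over observations. This illustrates only the conventional score, not the two non-parametric approaches proposed in the paper; the function name `quantile_score` and the sample values are assumptions made for illustration.

```python
import numpy as np

def quantile_score(pred, obs, tau):
    """Average quantile (pinball) score of a predicted tau-quantile.

    pred: scalar or array of predicted quantiles
    obs:  array of realised observations
    tau:  quantile level in (0, 1)
    Lower scores are better; the score is never negative.
    """
    pred = np.asarray(pred, dtype=float)
    obs = np.asarray(obs, dtype=float)
    indicator = (obs <= pred).astype(float)
    return float(np.mean((indicator - tau) * (pred - obs)))

# Illustrative comparison of two competing predictions of the 0.99 quantile.
observations = np.array([0.2, 1.5, 0.7, 3.1, 0.9, 12.4, 1.1, 0.4])
score_a = quantile_score(10.0, observations, tau=0.99)
score_b = quantile_score(2.0, observations, tau=0.99)
```

With few observations, a prediction at an extreme level is rarely exceeded, so the average score is dominated by the (pred - obs) term for non-exceedances; this is the validation difficulty that the abstract describes.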

• Estimating the Impact of Medical Care Usage on Work Absenteeism by a Trivariate Probit Model with Two Binary Endogenous Explanatory Variables

Abstract: The aim of this paper is to estimate the effects of seeking medical care on missing work. Specifically, our case study explores the question: Does visiting a medical provider cause an employee to miss work? To address this, we employ a model that can consistently estimate the impacts of two endogenous binary regressors. The model is based on three equations connected via a multivariate Gaussian distribution, which makes it possible to model the correlations among the equations, hence accounting for unobserved heterogeneity. Parameter estimation is reliably carried out via a trust region algorithm with analytical derivative information. We find that, observationally, having a curative visit is associated with a nearly 80% increase in the probability of missing work, while having a preventive visit correlates with a smaller 13% increase in the likelihood of missing work. However, after addressing potential endogeneity, neither type of visit appears to significantly relate to missing work. That finding also applies to visits that occur during the previous year. Therefore, we conclude that the observed links between medical usage and absenteeism derive from unobserved heterogeneity, rather than direct causal channels. The modeling framework is available through the R package GJRM.
PubDate: 2022-10-18
DOI: 10.1007/s10182-022-00464-6

• Control charts for measurement error models

Abstract: We consider a linear measurement error model (MEM) with an AR(1) process in the state equation, which is widely used in applied research. This MEM can be equivalently rewritten as an ARMA(1,1) process, where the MA(1) parameter is related to the variance of measurement errors. As the MA(1) parameter is of essential importance for these linear MEMs, it is highly relevant to provide instruments for online monitoring in order to detect its possible changes. In this paper we develop control charts for online detection of such changes, i.e., from AR(1) to ARMA(1,1) and vice versa, as soon as they occur. For this purpose, we elaborate on both cumulative sum (CUSUM) and exponentially weighted moving average (EWMA) control charts and investigate their performance in a Monte Carlo simulation study. The empirical illustration of our approach is conducted based on time series of daily realized volatilities.
PubDate: 2022-10-05
DOI: 10.1007/s10182-022-00462-8
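
The abstract above states that an AR(1) state observed with measurement error can be rewritten as an ARMA(1,1) process whose MA(1) parameter depends on the measurement error variance. The following is a minimal sketch of that relationship, obtained by matching the lag-1 autocorrelation of y_t - phi*y_{t-1}. It assumes mutually independent white-noise errors and the sign convention y_t - phi*y_{t-1} = u_t + theta*u_{t-1}; the function name and parameter names are hypothetical and the paper may use a different parameterisation.

```python
import math

def implied_ma_parameter(phi, var_state_noise, var_meas_error):
    """Invertible MA(1) coefficient theta implied by the model
        x_t = phi * x_{t-1} + eta_t,   y_t = x_t + e_t,
    written as y_t - phi * y_{t-1} = u_t + theta * u_{t-1}.
    """
    if var_meas_error == 0.0 or phi == 0.0:
        return 0.0  # no MA component: y_t is AR(1) (or white noise)
    # w_t = y_t - phi*y_{t-1} = eta_t + e_t - phi*e_{t-1}
    gamma0 = var_state_noise + (1.0 + phi ** 2) * var_meas_error
    gamma1 = -phi * var_meas_error
    r = gamma1 / gamma0  # lag-1 autocorrelation of w_t, |r| <= 1/2
    # Solve theta / (1 + theta^2) = r and keep the root with |theta| < 1.
    return (1.0 - math.sqrt(1.0 - 4.0 * r ** 2)) / (2.0 * r)

# Illustrative values (assumptions): theta moves away from zero as the
# measurement error variance grows.
theta_none = implied_ma_parameter(0.8, 1.0, 0.0)
theta_some = implied_ma_parameter(0.8, 1.0, 0.5)
theta_more = implied_ma_parameter(0.8, 1.0, 2.0)
```

Under these assumptions, a change between AR(1) and ARMA(1,1) corresponds to theta moving between zero and a nonzero value, which is the kind of change the control charts in the paper are designed to detect.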

• Sieve bootstrapping the memory parameter in long-range dependent stationary functional time series

Abstract: We consider a sieve bootstrap procedure to quantify the estimation uncertainty of long-memory parameters in stationary functional time series. We use a semiparametric local Whittle estimator to estimate the long-memory parameter. In the local Whittle estimator, discrete Fourier transform and periodogram are constructed from the first set of principal component scores via a functional principal component analysis. The sieve bootstrap procedure uses a general vector autoregressive representation of the estimated principal component scores. It generates bootstrap replicates that adequately mimic the dependence structure of the underlying stationary process. We first compute the estimated first set of principal component scores for each bootstrap replicate and then apply the semiparametric local Whittle estimator to estimate the memory parameter. By taking quantiles of the estimated memory parameters from these bootstrap replicates, we can nonparametrically construct confidence intervals of the long-memory parameter. As measured by coverage probability differences between the empirical and nominal coverage probabilities at three levels of significance, we demonstrate the advantage of using the sieve bootstrap compared to the asymptotic confidence intervals based on normality.
PubDate: 2022-10-01
DOI: 10.1007/s10182-022-00463-7

• Correction to: Assessment of agricultural sustainability in European Union countries: a group-based multivariate trajectory approach

PubDate: 2022-09-01
DOI: 10.1007/s10182-022-00438-8

• Comment on: On the role of data, statistics and decisions in a pandemic
statistics for climate protection and health—dare (more) progress!

Abstract: In the Corona pandemic, it became strikingly clear how much good-quality statistics are needed, and at the same time how unsuccessful we are at providing such statistics despite the existing technical and methodological possibilities and diverse data sources. It is therefore long overdue to get to the bottom of the causes of these issues and to learn from the findings. This defines a high aspiration: first, a diagnosis should be carried out in which the causes of the deficiencies and their interactions are identified as broadly as possible. Second, such a broad diagnosis should result in a therapy that includes a coherent strategy that can be generalised, i.e. that goes beyond the Corona pandemic.
PubDate: 2022-09-01
DOI: 10.1007/s10182-022-00447-7

• Authors’ response: on the role of data, statistics and decisions in a pandemic

PubDate: 2022-07-30
DOI: 10.1007/s10182-022-00460-w

• Comment “On the role of data, statistics and decisions in a pandemic” by Jahn et al.

Abstract: We comment on the paper by Jahn et al. (On the role of data, statistics and decisions in a pandemic, 2022).
PubDate: 2022-06-18
DOI: 10.1007/s10182-022-00451-x

• Describing a landscape we are yet discovering

PubDate: 2022-06-09
DOI: 10.1007/s10182-022-00449-5

