Authors:Michael Kane; Xun (Tony) Jiang, Simon Urbanek Abstract: Reproducible document standards, like R Markdown, facilitate the programmatic creation of documents whose content is itself programmatically generated. While programmatic content alone may not be sufficient for a rendered document, since it does not include prose (content generated by an author to provide context, a narrative, etc.), programmatic generation can provide substantial efficiencies for structuring and constructing documents. This paper explores the programmatic generation of reproducible documents by distinguishing components that can be created by computational means from those requiring human generation, providing guidelines for the generation of these documents, and identifying a use case in clinical trial reporting. These concepts and the use case are illustrated through the listdown package for the R programming environment, which is currently available on the Comprehensive R Archive Network. PubDate: Wed, 20 Jul 2022 00:00:00 +000
Authors:Flavio Santi; Maria Michela Dickson, Giuseppe Espa, Diego Giuliani Abstract: This paper presents the R package plot3logit, which enables the covariate effects of trinomial regression models to be represented graphically by means of a ternary plot. The aim of the plot is to help interpret regression coefficients in terms of the effects that a change in the values of regressors has on the probability distribution of the dependent variable. Such changes may involve either a single regressor or a group of them (composite changes), and the package permits both cases to be handled in a user-friendly way. Moreover, plot3logit can compute and draw confidence regions of the effects of covariate changes and enables multiple changes and profiles to be represented and compared jointly. Upstream and downstream compatibility makes the package able to work with other R packages or applications other than R. PubDate: Tue, 19 Jul 2022 00:00:00 +000
Authors:Christophe Dutang; Vincent Goulet, Nicholas Langevin Abstract: Actuaries model insurance claim amounts using heavy-tailed probability distributions. They routinely need to evaluate quantities related to these distributions, such as quantiles in the far right tail, moments, or limited moments. Furthermore, actuaries often resort to simulation to solve otherwise intractable risk evaluation problems. The paper discusses our implementation of support functions for the Feller-Pareto distribution for the R package actuar. The Feller-Pareto defines a large family of heavy-tailed distributions encompassing the transformed beta family and many variants of the Pareto distribution. PubDate: Sat, 16 Jul 2022 00:00:00 +000
Authors:Vincent Arel-Bundock Abstract: modelsummary is a package to summarize data and statistical models in R. It supports over one hundred types of models out-of-the-box, and allows users to report the results of those models side-by-side in a table, or in coefficient plots. It makes it easy to execute common tasks such as computing robust standard errors, adding significance stars, and manipulating coefficient and model labels. Beyond model summaries, the package also includes a suite of tools to produce highly flexible data summary tables, such as dataset overviews, correlation matrices, (multi-level) cross-tabulations, and balance tables (also known as "Table 1"). The appearance of the tables produced by modelsummary can be customized using external packages such as kableExtra, gt, flextable, or huxtable; the plots can be customized using ggplot2. Tables can be exported to many output formats, including HTML, LaTeX, Text/Markdown, Microsoft Word, PowerPoint, Excel, RTF, PDF, and image files. Tables and plots can be embedded seamlessly in rmarkdown, knitr, or Sweave dynamic documents. The modelsummary package is designed to be simple, robust, modular, and extensible. PubDate: Mon, 11 Jul 2022 00:00:00 +000
Authors:Marek Gagolewski Abstract: Effective processing of character strings is required at various stages of data analysis pipelines: from data cleansing and preparation, through information extraction, to report generation. Pattern searching, string collation and sorting, normalization, transliteration, and formatting are ubiquitous in text mining, natural language processing, and bioinformatics. This paper discusses and demonstrates how and why stringi, a mature R package for fast and portable handling of string data based on ICU (International Components for Unicode), should be included in each statistician's or data scientist's repertoire to complement their numerical computing and data wrangling skills. PubDate: Mon, 11 Jul 2022 00:00:00 +000
Authors:Benjamin D. Youngman Abstract: This article introduces the R package evgam. The package provides functions for fitting extreme value distributions. These include the generalized extreme value and generalized Pareto distributions. The former can also be fitted through a point process representation. Package evgam supports quantile regression via the asymmetric Laplace distribution, which can be useful for estimating high thresholds, sometimes used to discriminate between extreme and non-extreme values. The main addition of package evgam is to let extreme value distribution parameters have generalized additive model forms, the smoothness of which can be objectively estimated using Laplace's method. Illustrative examples fitting various distributions with various specifications are given. These include daily precipitation accumulations for part of Colorado, US, used to illustrate spatial models, and daily maximum temperatures for Fort Collins, Colorado, US, used to illustrate temporal models. PubDate: Mon, 11 Jul 2022 00:00:00 +000
Authors:Luca Pappalardo; Filippo Simini, Gianni Barlacchi, Roberto Pellungrini Abstract: The last decade has witnessed the emergence of massive mobility datasets, such as tracks generated by GPS devices, call detail records, and geo-tagged posts from social media platforms. These datasets have fostered a vast scientific production on various applications of mobility analysis, ranging from computational epidemiology to urban planning and transportation engineering. One strand of literature addresses data cleaning issues related to raw spatiotemporal trajectories, while a second line of research focuses on discovering the statistical "laws" that govern human movements. A significant effort has also been put into designing algorithms to generate synthetic trajectories able to reproduce, realistically, the laws of human mobility. Last but not least, a line of research addresses the crucial problem of privacy, proposing techniques to perform the re-identification of individuals in a database. A survey of the state of the art shows that no statistical software supports scientists and practitioners in all of the above aspects of mobility data analysis. In this paper, we propose scikit-mobility, a Python library that has the ambition of providing an environment to reproduce existing research, analyze mobility data, and simulate human mobility habits. scikit-mobility is efficient and easy to use as it extends pandas, a popular Python library for data analysis. Moreover, scikit-mobility provides the user with many functionalities, from visualizing trajectories to generating synthetic data, from analyzing statistical patterns to assessing the privacy risk related to the analysis of mobility datasets. PubDate: Mon, 11 Jul 2022 00:00:00 +000
Authors:Andrew O. Finley; Abhirup Datta, Sudipto Banerjee Abstract: This paper describes and illustrates the functionality of the spNNGP R package. The package provides a suite of spatial regression models for Gaussian and non-Gaussian point-referenced outcomes that are spatially indexed. The package implements several Markov chain Monte Carlo (MCMC) and MCMC-free nearest neighbor Gaussian process (NNGP) models for inference about large spatial data. Non-Gaussian outcomes are modeled using an NNGP Pólya-Gamma latent variable. OpenMP parallelization options are provided to take advantage of multiprocessor systems. Package features are illustrated using simulated and real data sets. PubDate: Mon, 11 Jul 2022 00:00:00 +000