Authors: Mark P. J. van der Loo; Edwin de Jonge Abstract: Checking data quality against domain knowledge is a common activity that pervades statistical analysis from raw data to output. The R package validate facilitates this task by capturing and applying expert knowledge in the form of validation rules: logical restrictions on variables, records, or data sets that should be satisfied before they are considered valid input for further analysis. In the validate package, validation rules are objects of computation that can be manipulated, investigated, and confronted with data or versions of a data set. The results of a confrontation are then available for further investigation, summarization or visualization. Validation rules can also be endowed with metadata and documentation and they may be stored or retrieved from external sources such as text files or tabular formats. This data validation infrastructure thus allows for a systematic, user-defined specification of data quality requirements that can be reused for various versions of a data set or by data correction algorithms that are parameterized by validation rules. PubDate: Wed, 31 Mar 2021 00:00:00 +000
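
The idea of rules as objects that are confronted with data can be illustrated with a minimal sketch. It is written in plain Python and is not the validate package's API: the function `confront`, the rule names, and the example records are all hypothetical, chosen only to show named rules being applied to records and tallied.

```python
def confront(records, rules):
    """Apply each named rule to every record and tally the outcomes."""
    summary = {}
    for name, rule in rules.items():
        passes = fails = missing = 0
        for record in records:
            try:
                ok = rule(record)
            except (KeyError, TypeError):
                # A rule that cannot be evaluated (e.g. a missing value)
                # is counted separately instead of as a failure.
                ok = None
            if ok is None:
                missing += 1
            elif ok:
                passes += 1
            else:
                fails += 1
        summary[name] = {"passes": passes, "fails": fails, "missing": missing}
    return summary


# Hypothetical rules: each one is a named logical restriction on a record.
rules = {
    "age_nonnegative": lambda r: r["age"] >= 0,
    "adult_if_employed": lambda r: r["age"] >= 15 if r["employed"] else True,
}

# Hypothetical data set with one invalid age, one rule violation,
# and one missing value.
records = [
    {"age": 34, "employed": True},
    {"age": -2, "employed": False},
    {"age": 12, "employed": True},
    {"age": None, "employed": False},
]

result = confront(records, rules)
```

Because the rules are ordinary values kept separate from the data, the same rule set can be reapplied to a later version of the data set, which is the reuse the abstract describes.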

Authors: Jose Ameijeiras-Alonso; Rosa M. Crujeiras, Alberto Rodriguez-Casal Abstract: In several applied fields, multimodality assessment is a crucial task as a preliminary exploratory tool or for determining the suitability of certain distributions. The goal of this paper is to present the utilities of the R package multimode, which collects different exploratory and testing non-parametric approaches for determining the number of modes and their estimated location. Specifically, some graphical tools (SiZer map, mode tree or mode forest) are provided, allowing for the identification of mode patterns, based on the kernel density estimation. Several formal testing procedures for determining the number of modes are described in this paper and implemented in the multimode package, including methods based on the ideas of the critical bandwidth, the excess mass or using a combination of both. This package also includes a function for estimating the mode locations and different classical data examples that have been considered in mode testing literature. PubDate: Mon, 22 Mar 2021 00:00:00 +000
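
The critical bandwidth idea mentioned in the abstract can be sketched as follows. This is a rough illustration in plain Python, not the multimode package's implementation: it assumes a Gaussian kernel, counts modes on a fixed grid, and finds by bisection an approximation of the smallest bandwidth at which the kernel density estimate has at most k modes. The function names, the grid, and the toy data are assumptions made for this example.

```python
import math


def kde(data, h, grid):
    """Gaussian kernel density estimate with bandwidth h, evaluated on grid."""
    norm = 1.0 / (len(data) * h * math.sqrt(2.0 * math.pi))
    return [
        norm * sum(math.exp(-0.5 * ((g - x) / h) ** 2) for x in data)
        for g in grid
    ]


def count_modes(data, h, grid):
    """Count interior local maxima of the density estimate on the grid."""
    d = kde(data, h, grid)
    return sum(
        1 for i in range(1, len(d) - 1)
        if d[i] > d[i - 1] and d[i] >= d[i + 1]
    )


def critical_bandwidth(data, k, grid, lo=0.01, hi=10.0, tol=1e-3):
    """Bisection for the smallest bandwidth giving at most k modes.

    Assumes the mode count does not increase with h, which holds for the
    Gaussian kernel; the grid makes the result approximate.
    """
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if count_modes(data, mid, grid) <= k:
            hi = mid
        else:
            lo = mid
    return hi


# Toy data: two tight clusters around 0 and 5 (illustrative values only).
data = [-0.1, 0.0, 0.1, 4.9, 5.0, 5.1]
grid = [-3.0 + 0.05 * i for i in range(221)]  # evaluation points from -3 to 8

modes_small_h = count_modes(data, 0.5, grid)
modes_large_h = count_modes(data, 5.0, grid)
h_crit = critical_bandwidth(data, 1, grid)
```

A large critical bandwidth for k modes means heavy smoothing is needed to remove the extra modes, which is evidence against the hypothesis of at most k modes; formal tests calibrate this quantity by resampling, a step this sketch omits.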

Authors: Alexander Lange; Bernhard Dalheimer, Helmut Herwartz, Simone Maxand Abstract: Structural vector autoregressive (SVAR) models are frequently applied to trace the contemporaneous linkages among (macroeconomic) variables back to an interplay of orthogonal structural shocks. Under Gaussianity the structural parameters are unidentified without additional (often external and not data-based) information. In contrast, the often reasonable assumption of heteroskedastic and/or non-Gaussian model disturbances offers the possibility to identify unique structural shocks. We describe the R package svars which implements statistical identification techniques that can be either heteroskedasticity-based or independence-based. Moreover, it includes a rich variety of analysis tools that are well known in the SVAR literature. Next to a comprehensive review of the theoretical background, we provide a detailed description of the associated R functions. Furthermore, a macroeconomic application serves as a step-by-step guide on how to apply these functions to the identification and interpretation of structural VAR models. PubDate: Fri, 19 Mar 2021 00:00:00 +000

Authors: Jin Zhu; Wenliang Pan, Wei Zheng, Xueqin Wang Abstract: The rapid development of modern technology has created many complex datasets in non-linear spaces, while most of the statistical hypothesis tests are only available in Euclidean or Hilbert spaces. To properly analyze the data with more complicated structures, efforts have been made to solve the fundamental test problems in more general spaces (Lyons 2013; Pan, Tian, Wang, and Zhang 2018; Pan, Wang, Zhang, Zhu, and Zhu 2020). In this paper, we introduce a publicly available R package Ball for the comparison of multiple distributions and the test of mutual independence in metric spaces, which extends the test procedures for the equality of two distributions (Pan et al. 2018) and the independence of two random objects (Pan et al. 2020). The Ball package is computationally efficient since several novel algorithms as well as engineering techniques are employed in speeding up the ball test procedures. Two real data analyses and diverse numerical studies have been performed, and the results confirm that the Ball package can detect various distribution differences and complicated dependencies in complex datasets, e.g., directional data and symmetric positive definite matrix data. PubDate: Fri, 19 Mar 2021 00:00:00 +000

Authors: Yun-Hee Choi; Laurent Briollais, Wenqing He, Karen Kopciuk Abstract: FamEvent is a comprehensive R package for simulating and modeling age-at-disease onset in families carrying a rare gene mutation. The package can simulate complex family data for variable time-to-event outcomes under three common family study designs (population, high-risk clinic and multi-stage) with various levels of missing genetic information among family members. Residual familial correlation can be induced through the inclusion of a frailty term or a second gene. Disease-gene carrier probabilities are evaluated assuming Mendelian transmission or empirically from the data. When genetic information on the disease gene is missing, an expectation-maximization algorithm is employed to calculate the carrier probabilities. Penetrance model functions with ascertainment correction adapted to the sampling design provide age-specific cumulative disease risks by sex, mutation status, and other covariates for simulated data as well as real data analysis. Robust standard errors and 95% confidence intervals are available for these estimates. Plots of pedigrees and penetrance functions based on the fitted model provide graphical displays to evaluate and summarize the models. PubDate: Fri, 19 Mar 2021 00:00:00 +000

Authors: Alexander Meier; Claudia Kirch, Haeran Cho Abstract: Time series data, i.e., temporally ordered data, is routinely collected and analysed in many fields of natural science, economy, technology and medicine, where it is of importance to verify the assumption of stochastic stationarity prior to modeling the data. Nonstationarities in the data are often attributed to structural changes with segments between adjacent change-points being approximately stationary. A particularly important, and thus widely studied, problem in statistics and signal processing is to detect changes in the mean at unknown time points. In this paper, we present the R package mosum, which implements elegant and mathematically well-justified procedures for the multiple mean change problem using the moving sum statistics. PubDate: Fri, 19 Mar 2021 00:00:00 +000
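
A moving sum statistic for mean changes can be sketched in a few lines. The version below is a simplified illustration in plain Python, not the mosum package's code: it assumes a known noise standard deviation `sigma`, a single bandwidth `G`, and the common form that compares the sum of the G observations after time k with the sum of the G observations up to time k. The function names and the toy series are hypothetical.

```python
import math


def mosum_statistic(x, G, sigma=1.0):
    """Scaled moving sum statistic for each admissible time point k.

    For k = G, ..., n - G the statistic is
    |sum(x[k:k+G]) - sum(x[k-G:k])| / (sigma * sqrt(2 * G)),
    where k counts the observations to the left of the candidate change.
    """
    n = len(x)
    # Prefix sums make every window sum an O(1) lookup.
    prefix = [0.0]
    for value in x:
        prefix.append(prefix[-1] + value)
    stats = {}
    for k in range(G, n - G + 1):
        right = prefix[k + G] - prefix[k]
        left = prefix[k] - prefix[k - G]
        stats[k] = abs(right - left) / (sigma * math.sqrt(2 * G))
    return stats


def strongest_change(x, G, sigma=1.0):
    """Return the time point with the largest statistic (one candidate only)."""
    stats = mosum_statistic(x, G, sigma)
    return max(stats, key=stats.get)


# Toy series: the mean jumps from 0 to 2 after the 50th observation.
series = [0.0] * 50 + [2.0] * 50
k_hat = strongest_change(series, G=10)
peak = mosum_statistic(series, G=10)[k_hat]
```

In a multiple change-point setting one would compare the statistic with a critical value and take local maxima of the exceedance regions as estimates; this sketch only locates the single largest value and uses a noise-free series to keep the behaviour easy to follow.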

Authors: Thomas Nagler Abstract: Calling multi-threaded C++ code from R has its perils. Since the R interpreter is single-threaded, one must not check for user interruptions or print to the R console from multiple threads. One can, however, synchronize with R from the main thread. The R package RcppThread (current version 1.0.0) contains a header-only C++ library for thread-safe communication with R that exploits this fact. It includes C++ classes for threads, a thread pool, and parallel loops that routinely synchronize with R. This article explains the package's functionality and gives examples of its usage. The synchronization mechanism may also apply to other threading frameworks. Benchmarks suggest that, although synchronization causes overhead, the parallel abstractions of RcppThread are competitive with other popular libraries in typical scenarios encountered in statistical computing. PubDate: Wed, 03 Feb 2021 00:00:00 +000

Authors: Rodney Sparapani; Charles Spanbauer, Robert McCulloch Abstract: In this article, we introduce the R package BART, whose name is an acronym for Bayesian additive regression trees. BART is a Bayesian nonparametric, machine learning, ensemble predictive modeling method for continuous, binary, categorical and time-to-event outcomes. Furthermore, BART is a tree-based, black-box method which fits the outcome to an arbitrary random function, f, of the covariates. The BART technique is relatively computationally efficient as compared to its competitors, but large sample sizes can be demanding. Therefore, the BART package includes efficient state-of-the-art implementations for continuous, binary, categorical and time-to-event outcomes that can take advantage of modern off-the-shelf hardware and software multi-threading technology. The BART package is written in C++ for both programmer and execution efficiency. The BART package takes advantage of multi-threading via forking as provided by the parallel package and OpenMP when available and supported by the platform. The ensemble of binary trees produced by a BART fit can be stored and re-used later via the R predict function. In addition to being an R package, the installed BART routines can be called directly from C++. The BART package provides the tools for your BART toolbox. PubDate: Thu, 14 Jan 2021 02:07:31 +000

Authors: Michael W. Robbins; Steven Davenport Abstract: The R package microsynth has been developed for implementation of the synthetic control methodology for comparative case studies involving micro- or meso-level data. The methodology implemented within microsynth is designed to assess the efficacy of a treatment or intervention within a well-defined geographic region that is itself a composite of several smaller regions (where data are available at the more granular level for comparison regions as well). The effect of the intervention on one or more time-varying outcomes is evaluated by determining a synthetic control region that resembles the treatment region across pre-intervention values of the outcome(s) and time-invariant covariates and that is a weighted composite of many untreated comparison regions. The microsynth procedure includes functionality that enables its user to (1) calculate weights for synthetic control, (2) tabulate results for statistical inferences, and (3) create time series plots of outcomes for treatment and synthetic control. In this article, microsynth is described in detail and its application is illustrated using data from a drug market intervention in Seattle, WA. PubDate: Thu, 14 Jan 2021 02:07:31 +000

Authors: Samuel L. Brilleman; Rory Wolfe, Margarita Moreno-Betancur, Michael J. Crowther Abstract: The simsurv R package allows users to simulate survival (i.e., time-to-event) data from standard parametric distributions (exponential, Weibull, and Gompertz), two-component mixture distributions, or a user-defined hazard function. Baseline covariates can be included under a proportional hazards assumption. Clustered event times, for example individuals within a family, are also easily accommodated. Time-dependent effects (i.e., nonproportional hazards) can be included by interacting covariates with linear time or a user-defined function of time. Under a user-defined hazard function, event times can be generated for a variety of complex models such as flexible (spline-based) baseline hazards, models with time-varying covariates, or joint longitudinal-survival models. PubDate: Thu, 14 Jan 2021 02:07:31 +000
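
For the standard parametric case, simulating event times under proportional hazards can be sketched with the inverse transform method. The code below is a plain Python illustration, not the simsurv package's interface: it assumes a Weibull baseline hazard of the form lam * gamma * t^(gamma - 1), a single binary covariate `trt` with log hazard ratio `beta`, and administrative censoring at `maxt`. All function and field names are hypothetical.

```python
import math
import random


def weibull_ph_time(u, lam, gamma, linpred=0.0):
    """Invert the survival function S(t) = exp(-lam * exp(linpred) * t**gamma).

    Setting S(T) = u and solving for T gives the event time for a
    uniform draw u in (0, 1].
    """
    return (-math.log(u) / (lam * math.exp(linpred))) ** (1.0 / gamma)


def simulate(n, lam, gamma, beta, maxt, seed=1):
    """Simulate n censored event times with one binary covariate."""
    rng = random.Random(seed)
    rows = []
    for i in range(n):
        trt = i % 2                   # alternate treatment assignment
        u = 1.0 - rng.random()        # uniform on (0, 1], avoids log(0)
        t = weibull_ph_time(u, lam, gamma, beta * trt)
        rows.append({
            "id": i + 1,
            "trt": trt,
            "eventtime": min(t, maxt),   # censor at the end of follow-up
            "status": int(t <= maxt),    # 1 = event observed, 0 = censored
        })
    return rows


# Illustrative parameter values only.
rows = simulate(n=200, lam=0.1, gamma=1.5, beta=-0.5, maxt=5.0)
```

The closed-form inversion only works when the cumulative hazard can be inverted analytically; for a user-defined hazard function, as the abstract describes, the cumulative hazard would instead have to be integrated and inverted numerically.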

Authors: Andreas Hill; Alexander Massey, Daniel Mandallaz Abstract: Forest inventories provide reliable evidence-based information to assess the state and development of forests over time. They typically consist of a random sample of plot locations in the forest that are assessed individually by field crews. Due to the high costs of these terrestrial campaigns, remote sensing information available in high quantity and at low cost is frequently incorporated in the estimation process in order to reduce inventory costs or improve estimation precision. With respect to this objective, the application of multiphase forest inventory methods (e.g., double- and triple-sampling regression estimators) has proved to be efficient. While these methods have been successfully applied in practice, the availability of open-source software has been rare if not non-existent. The R package forestinventory provides a comprehensive set of global and small area regression estimators for multiphase forest inventories under simple and cluster sampling. The implemented methods have been demonstrated in various scientific studies ranging from small to large scale forest inventories, and can be used for post-stratification, regression and regression within strata. This article gives an extensive review of the mathematical theory of this family of design-based estimators, puts them into a common framework of forest inventory scenarios and demonstrates their application in the R environment. PubDate: Thu, 14 Jan 2021 02:07:31 +000