Journal of Statistical Software
Journal Prestige (SJR): 13.802; Citation Impact (CiteScore): 16; Open Access journal; ISSN (Print) 1548-7660; ISSN (Online) 1548-7660; Published by American Statistical Association
- Hierarchical Clustering with Contiguity Constraint in R
Authors: Guillaume Guénard; Pierre Legendre
Abstract: This article presents a new implementation of hierarchical clustering for the R language that allows one to apply spatial or temporal contiguity constraints during the clustering process. The need for a contiguity constraint arises, for instance, when one wants to partition a map into different domains of similar physical conditions, identify discontinuities in time series, group regional administrative units with respect to their performance, and so on. To increase computational efficiency, we programmed the core functions in plain C. The result is a new R function, constr.hclust, which is distributed in package adespatial. The program implements the general agglomerative hierarchical clustering algorithm described by Lance and Williams (1966; 1967), with the particularity of allowing only clusters that are contiguous in geographic space or along time to fuse at any given step. Contiguity can be defined with respect to space or time. Information about spatial contiguity is provided by a connection network among sites, with edges describing the links between connected sites. Clustering with a temporal contiguity constraint is also known as chronological clustering. Information on temporal contiguity can be implicitly provided as the rank positions of observations in the time series. The implementation was modeled on that found in the hierarchical clustering function hclust of the standard R package stats (R Core Team 2022). We transcribed that function from Fortran to C and added the functionality to apply constraints when running the function. The implementation is efficient. It is limited mainly by input/output access as massive amounts of memory are potentially needed to store copies of the dissimilarity matrix and update its elements when analyzing large problems. We provided R computer code for plotting results for numbers of clusters.
PubDate: Sun, 04 Sep 2022 00:00:00 +000
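The temporal (chronological) variant of the constraint described in the abstract above can be illustrated with a small Python sketch. This is not the constr.hclust code and does not use the Lance and Williams update: the function name `chronological_clustering` and the merge criterion (absolute difference between segment means) are assumptions chosen only to show that fusions are restricted to neighbouring segments.

```python
def chronological_clustering(series, n_clusters):
    """Merge adjacent segments of a time series until n_clusters remain.

    Only segments that are neighbours in time may fuse, which is the
    temporal contiguity constraint. The merge criterion (absolute
    difference of segment means) is an illustrative assumption.
    """
    # Start with one segment per observation, kept in time order.
    segments = [[x] for x in series]
    while len(segments) > n_clusters:
        means = [sum(s) / len(s) for s in segments]
        # Distances are computed only between segments adjacent in time.
        gaps = [abs(means[i + 1] - means[i]) for i in range(len(means) - 1)]
        i = gaps.index(min(gaps))
        # Fuse the closest pair of neighbouring segments.
        segments[i:i + 2] = [segments[i] + segments[i + 1]]
    return segments


series = [1.0, 1.2, 0.9, 5.0, 5.3, 4.8, 1.1, 1.0]
print(chronological_clustering(series, 3))
```

In this toy series the first and last segments have similar values, yet they stay separate because they are not contiguous in time; an unconstrained clustering would be free to group them.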
- Spbsampling: An R Package for Spatially Balanced Sampling
Authors: Francesco Pantalone; Roberto Benedetti, Federica Piersimoni
Abstract: The basic idea underpinning the theory of spatially balanced sampling is that units closer to each other provide less information about a target of inference than units farther apart. Therefore, it should be desirable to select a sample well spread over the population of interest, or a spatially balanced sample. This situation is easily understood in, among many others, environmental, geological, biological, and agricultural surveys, where usually the main feature of the population is to be geo-referenced. Since traditional sampling designs generally do not exploit the spatial features and since it is desirable to take into account the information regarding spatial dependence, several sampling designs have been developed in order to achieve this objective. In this paper, we present the R package Spbsampling, which provides functions in order to perform three specific sampling designs that pursue the aforementioned purpose. In particular, these sampling designs achieve spatially balanced samples using a summary index of the distance matrix. In this sense, the applicability of the package is much wider, as a distance matrix can be defined for units according to variables different than geographical coordinates.
PubDate: Wed, 24 Aug 2022 00:00:00 +000
- Blang: Bayesian Declarative Modeling of General Data Structures and Inference via Algorithms Based on Distribution Continua
Authors: Alexandre Bouchard-Côté; Kevin Chern, Davor Cubranic, Sahand Hosseini, Justin Hume, Matteo Lepur, Zihui Ouyang, Giorgio Sgarbi
Abstract: Consider a Bayesian inference problem where a variable of interest does not take values in a Euclidean space. These "non-standard" data structures are in reality fairly common. They are frequently used in problems involving latent discrete factor models, networks, and domain-specific problems such as sequence alignments and reconstructions, pedigrees, and phylogenies. In principle, Bayesian inference should be particularly well-suited in such scenarios, as the Bayesian paradigm provides a principled way to obtain confidence assessment for random variables of any type. However, much of the recent work on making Bayesian analysis more accessible and computationally efficient has focused on inference in Euclidean spaces. In this paper, we introduce Blang, a domain-specific language and library aimed at bridging this gap. Blang allows users to perform Bayesian analysis on arbitrary data types while using a declarative syntax similar to the popular family of probabilistic programming languages, BUGS. Blang is augmented with intuitive language additions to create data types of the user's choosing. To perform inference at scale on such arbitrary state spaces, Blang leverages recent advances in sequential Monte Carlo and non-reversible Markov chain Monte Carlo methods.
PubDate: Tue, 23 Aug 2022 00:00:00 +000
- exuber: Recursive Right-Tailed Unit Root Testing with R
Authors: Kostas Vasilopoulos; Efthymios Pavlidis, Enrique Martínez-García
Abstract: This paper introduces the R package exuber for testing and date-stamping periods of mildly explosive dynamics (exuberance) in time series. The package computes test statistics for the supremum augmented Dickey-Fuller test (SADF) of Phillips, Wu, and Yu (2011), the generalized SADF (GSADF) of Phillips, Shi, and Yu (2015a,b), and the panel GSADF proposed by Pavlidis, Yusupova, Paya, Peel, Martínez-García, Mack, and Grossman (2016); generates finite-sample critical values based on Monte Carlo and bootstrap methods; and implements the corresponding date-stamping procedures. The recursive least-squares algorithm that we introduce in our implementation of these techniques utilizes the matrix inversion lemma and in that way achieves significant speed improvements. We illustrate the speed gains in a simulation experiment, and provide illustrations of the package using artificial series and a panel on international house prices.
PubDate: Fri, 19 Aug 2022 00:00:00 +000
- Automatic Identification and Forecasting of Structural Unobserved Components Models with UComp
Authors: Diego J. Pedregal
Abstract: UComp is a powerful library for building unobserved components models, useful for forecasting and other important operations, such as de-trending, cycle analysis, seasonal adjustment, signal extraction, etc. One of the most outstanding features that makes UComp unique among its class of related software implementations is that models may be built automatically by identification algorithms (three versions are available). These algorithms select the best model among many possible combinations. Another relevant feature is that it is coded in C++, opening the door to link it to different popular and widely used environments, like R, MATLAB, Octave, Python, etc. The implemented models for the components are more general than the usual ones in the field of unobserved components modeling, including different types of trend, cycle, seasonal and irregular components, input variables and outlier detection. The automatic character of the algorithms required the development of many complementary algorithms to control performance and make it applicable to as many different time series as possible. The library is open source and available in different formats in public repositories. The performance of the library is illustrated working on real data in several varied examples.
PubDate: Wed, 17 Aug 2022 00:00:00 +000
- Robust Mediation Analysis: The R Package robmed
Authors: Andreas Alfons; Nüfer Y. Ateş, Patrick J. F. Groenen
Abstract: Mediation analysis is one of the most widely used statistical techniques in the social, behavioral, and medical sciences. Mediation models make it possible to study how an independent variable affects a dependent variable indirectly through one or more intervening variables, which are called mediators. The analysis is often carried out via a series of linear regressions, in which case the indirect effects can be computed as products of coefficients from those regressions. Statistical significance of the indirect effects is typically assessed via a bootstrap test based on ordinary least-squares estimates. However, this test is sensitive to outliers or other deviations from normality assumptions, which poses a serious threat to empirical testing of theory about mediation mechanisms. The R package robmed implements a robust procedure for mediation analysis based on the fast-and-robust bootstrap methodology for robust regression estimators, which yields reliable results even when the data deviate from the usual normality assumptions. Various other procedures for mediation analysis are included in package robmed as well. Moreover, robmed introduces a new formula interface that allows users to specify mediation models with a single formula, and provides various plots for diagnostics or visual representation of the results.
PubDate: Wed, 17 Aug 2022 00:00:00 +000
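The product-of-coefficients computation mentioned in the abstract above can be sketched in a few lines of Python. This is a plain ordinary least-squares point estimate only, not the robust procedure or the interface of robmed; the helper names (`slope`, `slopes2`) and the toy data are assumptions made for illustration, and no bootstrap test is included.

```python
def mean(v):
    return sum(v) / len(v)


def centered(v):
    mu = mean(v)
    return [vi - mu for vi in v]


def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


def slope(x, y):
    """OLS slope of y on a single regressor x (intercept included)."""
    xc, yc = centered(x), centered(y)
    return dot(xc, yc) / dot(xc, xc)


def slopes2(x1, x2, y):
    """OLS slopes of y on two regressors x1 and x2 (intercept included)."""
    a, b, c = centered(x1), centered(x2), centered(y)
    s11, s22, s12 = dot(a, a), dot(b, b), dot(a, b)
    s1y, s2y = dot(a, c), dot(b, c)
    # Solve the 2x2 normal equations by Cramer's rule.
    det = s11 * s22 - s12 * s12
    return (s22 * s1y - s12 * s2y) / det, (s11 * s2y - s12 * s1y) / det


# Toy data (hypothetical): x is the independent variable, m the mediator.
x = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
m = [1.0, 1.0, 4.0, 6.0, 7.0, 11.0]
y = [3 * mi + xi for mi, xi in zip(m, x)]  # y = 3*m + 1*x, no noise

a = slope(x, m)                  # effect of x on the mediator
b, direct = slopes2(m, x, y)     # effect of m on y, and direct effect of x
indirect = a * b                 # product-of-coefficients indirect effect
print(a, b, direct, indirect)
```

Here the first regression (m on x) gives the coefficient a, the second regression (y on m and x) gives b and the direct effect, and the indirect effect is their product a * b.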
- irtplay: An R Package for Unidimensional Item Response Theory Modeling
Authors: Hwanggyu Lim; Craig S. Wells
Abstract: Item response theory (IRT) is a general framework in which mathematical models are formulated to explain the relationship between an examinee's observable response on an item and the latent ability measured by a test. Applications of IRT models and related statistical methods are commonly found in educational and psychological research. An important step in applying IRT models to test data is estimating the IRT model parameters. Accordingly, the successful application of IRT rests on satisfactory statistical techniques and software for accurately estimating the model parameters. The irtplay R package was developed to provide users with a user-friendly experience and convenience when analyzing test data using unidimensional IRT models. The package can be used to fit the IRT models to a mixture of dichotomous and polytomous item data using marginal maximum likelihood estimation via the expectation-maximization algorithm, calibrate pretest items, and estimate examinees' latent ability parameters. In addition, the package provides practical tools that conveniently enable users to conduct many analyses related to IRT such as evaluating IRT model-data fit, analyzing differential item functioning, computing asymptotic variance-covariance matrices of item parameter estimates, calculating the conditional probability distribution of observed scores using the Lord and Wingersky (1984) formula, and importing item and ability parameter estimates from the output of popular IRT software. The main features of the irtplay package are illustrated using three data examples.
PubDate: Tue, 16 Aug 2022 00:00:00 +000
- HighFrequencyCovariance: A Julia Package for Estimating Covariance Matrices Using High Frequency Financial Data
Authors: Stuart Baumann; Margaryta Klymak
Abstract: High frequency data typically exhibit asynchronous trading and microstructure noise, which can bias the covariances estimated by standard estimators. While a number of specialized estimators have been proposed, they have had limited availability in open source software. HighFrequencyCovariance is the first Julia package that implements specialized estimators for volatility, correlation and covariance using high frequency financial data. It also implements complementary algorithms for matrix regularization. This paper presents the issues associated with exploiting high frequency financial data and describes the volatility, covariance and regularization algorithms that have been implemented. We then demonstrate the use of the package using foreign exchange market tick data to estimate the covariance of the exchange rates between different currencies. We also perform a Monte Carlo experiment, which shows the accuracy gains that are possible over simpler covariance estimation techniques.
PubDate: Mon, 15 Aug 2022 00:00:00 +000
- Bambi: A Simple Interface for Fitting Bayesian Linear Models in Python
Authors: Tomás Capretto; Camen Piho, Ravin Kumar, Jacob Westfall, Tal Yarkoni, Osvaldo A Martin
Abstract: The popularity of Bayesian statistical methods has increased dramatically in recent years across many research areas and industrial applications. This is the result of a variety of methodological advances with faster and cheaper hardware as well as the development of new software tools. Here we introduce an open source Python package named Bambi (BAyesian Model Building Interface) that is built on top of the PyMC probabilistic programming framework and the ArviZ package for exploratory analysis of Bayesian models. Bambi makes it easy to specify complex generalized linear hierarchical models using a formula notation similar to those found in R. We demonstrate Bambi's versatility and ease of use with a few examples spanning a range of common statistical models including multiple regression, logistic regression, and mixed-effects modeling with crossed group specific effects. Additionally we discuss how automatic priors are constructed. Finally, we conclude with a discussion of our plans for the future development of Bambi.
PubDate: Mon, 15 Aug 2022 00:00:00 +000
- On the Programmatic Generation of Reproducible Documents
Authors: Michael Kane; Xun (Tony) Jiang, Simon Urbanek
Abstract: Reproducible document standards, like R Markdown, facilitate the programmatic creation of documents whose content is itself programmatically generated. While programmatic content alone may not be sufficient for a rendered document since it does not include prose (content generated by an author to provide context, a narrative, etc.), programmatic generation can provide substantial efficiencies for structuring and constructing documents. This paper explores the programmatic generation of reproducible documents by distinguishing components that can be created by computational means from those requiring human generation, providing guidelines for the generation of these documents, and identifying a use case in clinical trial reporting. These concepts and use case are illustrated through the listdown package for the R programming environment, which is currently available on the Comprehensive R Archive Network.
PubDate: Wed, 20 Jul 2022 00:00:00 +000
- plot3logit: Ternary Plots for Interpreting Trinomial Regression Models
Authors: Flavio Santi; Maria Michela Dickson, Giuseppe Espa, Diego Giuliani
Abstract: This paper presents the R package plot3logit which enables the covariate effects of trinomial regression models to be represented graphically by means of a ternary plot. The aim of the plot is to aid the interpretation of regression coefficients in terms of the effects that a change in values of regressors has on the probability distribution of the dependent variable. Such changes may involve either a single regressor, or a group of them (composite changes), and the package permits both cases to be handled in a user-friendly way. Moreover, plot3logit can compute and draw confidence regions of the effects of covariate changes and enables multiple changes and profiles to be represented and compared jointly. Upstream and downstream compatibility makes the package able to work with other R packages or applications other than R.
PubDate: Tue, 19 Jul 2022 00:00:00 +000
- Feller-Pareto and Related Distributions: Numerical Implementation and Actuarial Applications
Authors: Christophe Dutang; Vincent Goulet, Nicholas Langevin
Abstract: Actuaries model insurance claim amounts using heavy-tailed probability distributions. They routinely need to evaluate quantities related to these distributions such as quantiles in the far right tail, moments or limited moments. Furthermore, actuaries often resort to simulation to solve otherwise intractable risk evaluation problems. The paper discusses our implementation of support functions for the Feller-Pareto distribution for the R package actuar. The Feller-Pareto defines a large family of heavy-tailed distributions encompassing the transformed beta family and many variants of the Pareto distribution.
PubDate: Sat, 16 Jul 2022 00:00:00 +000
- Learning Base R (2nd Edition)
Authors: James E. Helmreich
PubDate: Wed, 13 Jul 2022 00:00:00 +000
- Python and R for the Modern Data Scientist
Authors: Christopher J. Lortie
PubDate: Wed, 13 Jul 2022 00:00:00 +000
- modelsummary: Data and Model Summaries in R
Authors: Vincent Arel-Bundock
Abstract: modelsummary is a package to summarize data and statistical models in R. It supports over one hundred types of models out-of-the-box, and allows users to report the results of those models side-by-side in a table, or in coefficient plots. It makes it easy to execute common tasks such as computing robust standard errors, adding significance stars, and manipulating coefficient and model labels. Beyond model summaries, the package also includes a suite of tools to produce highly flexible data summary tables, such as dataset overviews, correlation matrices, (multi-level) cross-tabulations, and balance tables (also known as "Table 1"). The appearance of the tables produced by modelsummary can be customized using external packages such as kableExtra, gt, flextable, or huxtable; the plots can be customized using ggplot2. Tables can be exported to many output formats, including HTML, LaTeX, Text/Markdown, Microsoft Word, Powerpoint, Excel, RTF, PDF, and image files. Tables and plots can be embedded seamlessly in rmarkdown, knitr, or Sweave dynamic documents. The modelsummary package is designed to be simple, robust, modular, and extensible.
PubDate: Mon, 11 Jul 2022 00:00:00 +000
- stringi: Fast and Portable Character String Processing in R
Authors: Marek Gagolewski
Abstract: Effective processing of character strings is required at various stages of data analysis pipelines: from data cleansing and preparation, through information extraction, to report generation. Pattern searching, string collation and sorting, normalization, transliteration, and formatting are ubiquitous in text mining, natural language processing, and bioinformatics. This paper discusses and demonstrates how and why stringi, a mature R package for fast and portable handling of string data based on ICU (International Components for Unicode), should be included in each statistician's or data scientist's repertoire to complement their numerical computing and data wrangling skills.
PubDate: Mon, 11 Jul 2022 00:00:00 +000
- evgam: An R Package for Generalized Additive Extreme Value Models
Authors: Benjamin D. Youngman
Abstract: This article introduces the R package evgam. The package provides functions for fitting extreme value distributions. These include the generalized extreme value and generalized Pareto distributions. The former can also be fitted through a point process representation. Package evgam supports quantile regression via the asymmetric Laplace distribution, which can be useful for estimating high thresholds, sometimes used to discriminate between extreme and non-extreme values. The main addition of package evgam is to let extreme value distribution parameters have generalized additive model forms, the smoothness of which can be objectively estimated using Laplace's method. Illustrative examples fitting various distributions with various specifications are given. These include daily precipitation accumulations for part of Colorado, US, used to illustrate spatial models, and daily maximum temperatures for Fort Collins, Colorado, US, used to illustrate temporal models.
PubDate: Mon, 11 Jul 2022 00:00:00 +000
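The link between quantile estimation and the asymmetric Laplace distribution mentioned in the abstract above can be illustrated with a short Python sketch. With the scale held fixed, the negative log-likelihood of an asymmetric Laplace distribution is proportional to the summed check (pinball) loss, so minimising that loss yields a quantile. The function names and data below are hypothetical, the model is a constant only (no smooth terms), and nothing here reflects the evgam interface.

```python
def check_loss(residual, tau):
    """Check (pinball) loss for quantile level tau.

    Up to a scale factor this is the negative log-density kernel of an
    asymmetric Laplace distribution.
    """
    return residual * (tau - (1.0 if residual < 0 else 0.0))


def constant_quantile(data, tau):
    """Estimate the tau-quantile as the constant minimising the check loss.

    A minimiser of the summed check loss can always be found among the
    observed values, so only those candidates are searched.
    """
    return min(data, key=lambda q: sum(check_loss(y - q, tau) for y in data))


# Hypothetical data: pick a high threshold at the 0.85 quantile level.
data = [2.0, 7.0, 1.0, 9.0, 4.0, 3.0, 8.0, 6.0, 5.0, 10.0]
threshold = constant_quantile(data, 0.85)
print(threshold)
```

A threshold estimated this way could then separate extreme from non-extreme values, which is the use the abstract describes.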
- scikit-mobility: A Python Library for the Analysis, Generation, and Risk Assessment of Mobility Data
Authors: Luca Pappalardo; Filippo Simini, Gianni Barlacchi, Roberto Pellungrini
Abstract: The last decade has witnessed the emergence of massive mobility datasets, such as tracks generated by GPS devices, call detail records, and geo-tagged posts from social media platforms. These datasets have fostered a vast scientific production on various applications of mobility analysis, ranging from computational epidemiology to urban planning and transportation engineering. One strand of literature addresses data cleaning issues related to raw spatiotemporal trajectories, while a second line of research focuses on discovering the statistical "laws" that govern human movements. A significant effort has also been put into designing algorithms to generate synthetic trajectories able to reproduce, realistically, the laws of human mobility. Last but not least, a line of research addresses the crucial problem of privacy, proposing techniques to perform the re-identification of individuals in a database. A survey of the state of the art shows that there is no statistical software that can support scientists and practitioners with all of the above-mentioned aspects of mobility data analysis. In this paper, we propose scikit-mobility, a Python library that has the ambition of providing an environment to reproduce existing research, analyze mobility data, and simulate human mobility habits. scikit-mobility is efficient and easy to use as it extends pandas, a popular Python library for data analysis. Moreover, scikit-mobility provides the user with many functionalities, from visualizing trajectories to generating synthetic data, from analyzing statistical patterns to assessing the privacy risk related to the analysis of mobility datasets.
PubDate: Mon, 11 Jul 2022 00:00:00 +000
- spNNGP R Package for Nearest Neighbor Gaussian Process Models
Authors: Andrew O. Finley; Abhirup Datta, Sudipto Banerjee
Abstract: This paper describes and illustrates the functionality of the spNNGP R package. The package provides a suite of spatial regression models for Gaussian and non-Gaussian point-referenced outcomes that are spatially indexed. The package implements several Markov chain Monte Carlo (MCMC) and MCMC-free nearest neighbor Gaussian process (NNGP) models for inference about large spatial data. Non-Gaussian outcomes are modeled using an NNGP Pólya-Gamma latent variable. OpenMP parallelization options are provided to take advantage of multiprocessor systems. Package features are illustrated using simulated and real data sets.
PubDate: Mon, 11 Jul 2022 00:00:00 +000