Abstract: Publication date: Available online 9 November 2020. Source: Handbook of Statistics. Author(s): Tae Jin Lee, Masayuki Kakehashi, Arni S.R. Srinivasa Rao
Abstract: Publication date: Available online 13 August 2020. Source: Handbook of Statistics. Author(s): Sireesh Saride, Pranav R.T. Peddinti, B. Munwar Basha
Abstract: Publication date: 2020. Source: Handbook of Statistics, Volume 42. Author(s): Jianghao Chu, Tae-Hwy Lee, Aman Ullah. Abstract: Freund and Schapire (1997) introduced “Discrete AdaBoost” (DAB), which has been mysteriously effective for high-dimensional binary classification or binary prediction. In an effort to understand this mystery, Friedman, Hastie, and Tibshirani (FHT, 2000) show that DAB can be understood as statistical learning that builds an additive logistic regression model via Newton-like updating minimization of the “exponential loss.” From this statistical point of view, FHT proposed three modifications of DAB, namely, Real AdaBoost (RAB), LogitBoost (LB), and Gentle AdaBoost (GAB). All of DAB, RAB, LB, and GAB solve for the logistic regression via different algorithmic designs and different objective functions. The RAB algorithm uses class probability estimates to construct real-valued contributions of the weak learner, LB is an adaptive Newton algorithm based on stagewise optimization of the Bernoulli likelihood, and GAB is an adaptive Newton algorithm based on stagewise optimization of the exponential loss. The authors of FHT also published an influential textbook, The Elements of Statistical Learning (ESL, 2001 and 2008). A companion book, An Introduction to Statistical Learning (ISL) by James et al. (2013), was published with applications in R. However, neither ESL nor ISL (e.g., sections 4.5 and 4.6) covers these four AdaBoost algorithms, although FHT provided some simulation and empirical studies comparing these methods. Given numerous potential applications, we believe it would be useful to collect the R libraries of these AdaBoost algorithms, as well as more recently developed extensions to AdaBoost for probability prediction, with examples and illustrations. Therefore, the goal of this chapter is to do just that, i.e., (i) to provide a user guide to these alternative AdaBoost algorithms with a step-by-step tutorial in R (in a way similar to ISL, e.g., section 4.6), (ii) to compare AdaBoost with alternative machine learning classification tools such as the deep neural network (DNN), logistic regression with LASSO, and SIM-RODEO, and (iii) to demonstrate empirical applications in economics, such as prediction of business cycle turning points and directional prediction of stock price indexes. We revisit Ng (2014), who used DAB for prediction of business cycle turning points, by comparing the results from RAB, LB, GAB, DNN, logistic regression, and SIM-RODEO.
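As an illustrative sketch of the kind of R tutorial this chapter describes (not the chapter's own code), the four boosting variants can be fitted with the CRAN package ada; the simulated data, object names, and tuning choices below are assumptions made only for the example.

library(ada)

set.seed(1)
n <- 500; p <- 10
x <- matrix(rnorm(n * p), n, p)
y <- factor(ifelse(rowSums(x[, 1:3]) + rnorm(n) > 0, 1, -1))
dat <- data.frame(y = y, x)

# Discrete AdaBoost (DAB): exponential loss, discrete (+/-1) weak-learner output
fit_dab <- ada(y ~ ., data = dat, loss = "exponential", type = "discrete", iter = 100)
# Real AdaBoost (RAB): real-valued contributions built from class-probability estimates
fit_rab <- ada(y ~ ., data = dat, loss = "exponential", type = "real", iter = 100)
# Gentle AdaBoost (GAB): Newton-style steps on the exponential loss
fit_gab <- ada(y ~ ., data = dat, loss = "exponential", type = "gentle", iter = 100)
# LogitBoost-style fit (LB): Bernoulli (logistic) loss
fit_lb  <- ada(y ~ ., data = dat, loss = "logistic", type = "gentle", iter = 100)

# Class labels and class-probability predictions on the training sample
head(predict(fit_dab, dat))
head(predict(fit_rab, dat, type = "probs"))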
Abstract: Publication date: Available online 16 September 2019. Source: Handbook of Statistics. Author(s): Naveen Naidu Narisetty. Abstract: High-dimensional data, where the number of features or covariates can even be larger than the number of independent samples, are ubiquitous and are encountered on a regular basis by statistical scientists both in academia and in industry. A majority of the classical research in statistics dealt with settings where the number of covariates is small. Due to modern advancements in data storage and computational power, the high-dimensional data revolution has come to occupy a significant place in mainstream statistical research. In gene expression datasets, for instance, it is not uncommon to encounter datasets with observations on at most a few hundred independent samples (subjects) and with information on tens or hundreds of thousands of genes for each sample. An important and common question that arises quickly is: “Which of the available covariates are relevant to the outcome of interest?” This concerns the problem of variable selection (and more generally, model selection) in statistics and data science. This chapter will provide an overview of some of the most well-known model selection methods along with some of the more recent methods. While frequentist methods will be discussed, Bayesian approaches will be given a more elaborate treatment. The frequentist framework for model selection is primarily based on penalization, whereas the Bayesian framework relies on prior distributions for inducing shrinkage and sparsity. The chapter treats the Bayesian framework in the light of objective and empirical Bayesian viewpoints, as the priors in the high-dimensional setting are typically not based entirely on subjective prior beliefs. An important practical aspect of high-dimensional model selection methods is computational scalability, which will also be discussed.
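A minimal sketch of the penalization route to variable selection, using the CRAN package glmnet on simulated p > n data (the data-generating choices are assumptions for illustration, not from the chapter):

library(glmnet)

set.seed(1)
n <- 100; p <- 1000                           # more covariates than samples
x <- matrix(rnorm(n * p), n, p)
beta <- c(rep(2, 5), rep(0, p - 5))           # only the first 5 covariates matter
y <- as.vector(x %*% beta + rnorm(n))

cv_fit <- cv.glmnet(x, y, alpha = 1)          # LASSO with cross-validated penalty
coefs <- as.vector(coef(cv_fit, s = "lambda.min"))[-1]
which(coefs != 0)                             # indices of the selected covariates
# A Bayesian analogue would place shrinkage or spike-and-slab priors on beta and
# report marginal posterior inclusion probabilities instead of a single sparse fit.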
Abstract: Publication date: Available online 3 September 2019. Source: Handbook of Statistics. Author(s): Hongyan Xu. Abstract: With recent developments in biotechnology, especially next-generation sequencing in genomics, there has been an explosion of genomic data. The data are big in terms of both volume and diversity. These big data contain much more information but also pose unprecedented challenges in data analysis. In this article, we discuss the big data challenges and opportunities in genomics research. We also discuss possible solutions to these challenges, which can serve as a basis for future research.
Abstract: Publication date: Available online 9 March 2019. Source: Handbook of Statistics. Author(s): Eric Ghysels, Virmantas Kvedaras, Vaidotas Zemlys-Balevičius. Abstract: Mixed data sampling (MIDAS) regressions are now commonly used to deal with time series data sampled at different frequencies. This chapter focuses on single-equation MIDAS regression models involving stationary processes with the dependent variable observed at a lower frequency than the explanatory ones. We discuss in detail nonlinear and semiparametric MIDAS regression models, topics not covered in prior work. Moreover, fitting the theme of the handbook, we also elaborate on the R package midasr, which implements these regression models, using simulated and empirical examples. In the theory part, a stylized model is introduced in order to discuss specific issues relevant to the construction of MIDAS models, such as the use or nonuse of functional constraints on parameters, the types of constraints and their choice, and the selection of the lag order. We introduce various new MIDAS regression models, including quasi-linear MIDAS, models with nonparametric smoothing of weights, logistic smooth transition and min–mean–max effects MIDAS, and semiparametric specifications.
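As a brief illustration in the spirit of the midasr documentation (the simulated data and starting values are assumptions, not the chapter's empirical example), a restricted MIDAS regression with a normalized exponential Almon lag can be estimated as follows:

library(midasr)

set.seed(1001)
n <- 250                                    # number of low-frequency (e.g., quarterly) observations
x <- rnorm(3 * n)                           # high-frequency (e.g., monthly) regressor
w <- nealmon(p = c(1, -0.5), d = 8)         # exponential Almon weights on 8 high-frequency lags
y <- 2 + mls(x, 0:7, 3) %*% w + rnorm(n)    # simulated low-frequency response

# Restricted MIDAS regression imposing the nealmon functional constraint on the weights
fit_r <- midas_r(y ~ mls(x, 0:7, 3, nealmon), start = list(x = c(1, -0.5)))
summary(fit_r)

# Unrestricted MIDAS (U-MIDAS) drops the functional constraint on the lag weights
fit_u <- midas_u(y ~ mls(x, 0:7, 3))
coef(fit_u)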
Abstract: Publication date: Available online 7 March 2019. Source: Handbook of Statistics. Author(s): Roberto S. Mariano, Suleyman Ozmucur. Abstract: Recognizing the need to utilize high-frequency indicators for more up-to-date forecasts, this chapter surveys alternative modeling approaches to combining mixed-frequency data for forecasting purposes. The models covered in this chapter include data-parsimonious (but more computer-demanding) models such as the mixed-frequency dynamic latent factor model (MF-DLFM) as well as more data-intensive ones like the current quarterly model (CQM) and mixed data sampling (MIDAS) regressions. In all these models, the fact that the data set is of mixed frequencies raises technical issues in the estimation and forecasting phases of the exercise. In the case of MF-DLFM, the presence of unobserved common factors introduces further complications in implementing the estimation and simulation strategy based on the derived observable state-space formulation of the model. The alternative models are estimated and constructed using Philippine data, to forecast GDP growth and inflation in the Philippines. For this numerical exercise, 10 monthly indicators are used for quarterly real GDP and 9 monthly indicators for the quarterly GDP deflator. The whole empirical analysis is implemented in R, starting from using R to access Philippine data from domestic and international data sources and to analyze the statistical properties of Philippine real GDP and the GDP deflator, and culminating in the estimation of the alternative forecasting models, where numerous variations of MIDAS are explored. As the next step in this research, it would be particularly important to compare the forecasting performance of the alternative procedures surveyed in this chapter. A more comprehensive study of this type will be presented in a future sequel to this chapter. Indicative comparison results reported recently point to the potentially superior performance of MF-DLFM for forecasting GDP growth, while for forecasting inflation, the performance of MF-DLFM is not significantly better than MIDAS. More work is required for a definitive conclusion on this issue, in particular further analysis and empirical applications that expand the performance comparison to the wider span of alternative forecasting models and variations of MF-DLFM and MIDAS surveyed in this chapter. Dynamic simulations for multiperiod forecasting should also be considered, along with further refinements of the estimated models, especially the dynamic latent factor models, and extension of the analysis to other countries, especially in Southeast Asia.
Abstract: Publication date: Available online 7 March 2019. Source: Handbook of Statistics. Author(s): Matthieu Stigler. Abstract: The notion of cointegration describes the case when two or more variables are each nonstationary, yet there exists a combination of these variables which is stationary. This statistical definition leads to a rich economic interpretation, where the variables can be thought of as sharing a stable relationship, and deviations from a long-run equilibrium are corrected. Implicit in the definition, however, is the requirement that any small deviation from the long-run equilibrium needs to be corrected instantaneously and symmetrically. Threshold cointegration relaxes the linear and instantaneous adjustment assumption by allowing the adjustment to occur only after the deviations exceed a critical threshold. Likewise, it can accommodate asymmetric adjustment, where positive and negative deviations are not necessarily corrected in the same way. This flexible framework can be used to model economic phenomena such as transaction costs, stickiness of prices, or asymmetry in agents’ reactions. In this chapter, I survey the concept of threshold cointegration and show how to use this model within R with the package tsDyn. In Section 1, I briefly review the concepts of stationarity and cointegration, then explain the concept of threshold cointegration. In Section 2, I discuss in detail the econometrics of threshold cointegration, presenting the main tests and estimators. I then describe the package tsDyn in Section 3, and show how to use it with an empirical application on the term structure of interest rates in Section 4.
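An illustrative sketch with the tsDyn package, using its bundled zeroyld interest-rate data; the grid sizes and bootstrap replications below are arbitrary choices for the example, not those of the chapter's application.

library(tsDyn)

data(zeroyld)                                  # two interest-rate series bundled with tsDyn

# Linear VECM as a benchmark
fit_lin <- VECM(zeroyld, lag = 1, estim = "ML")

# Threshold VECM with one threshold (two regimes); the cointegrating vector and
# the threshold are searched over a grid
fit_tvecm <- TVECM(zeroyld, lag = 1, nthresh = 1, ngridBeta = 50, ngridTh = 50, plot = FALSE)
summary(fit_tvecm)

# Hansen and Seo (2002) test of linear cointegration against threshold cointegration
TVECM.HStest(zeroyld, lag = 1, nboot = 100)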
Abstract: Publication date: Available online 26 February 2019. Source: Handbook of Statistics. Author(s): Giancarlo Ferrara. Abstract: The production function is usually assumed to specify the maximum output obtainable from a given set of inputs, describing the boundary or frontier of the output obtainable from each feasible combination of inputs; it relates the production process of individual units to the efficient frontier of the production possibilities. The measure of the distance of each unit from this frontier is the most immediate way to assess its (in)efficiency. However, the production function is generally not known; only a set of observations on each production unit is available, and it is therefore essential to develop techniques to estimate the production frontier. Starting from the packages already developed in the R environment, this work introduces the methodological aspects of stochastic frontier models, including a brief introduction to their extensions in the presence of contextual variables and spatial external factors, and compares standard stochastic frontier analysis with its semiparametric counterpart. Some simulation studies and an empirical application to agricultural data illustrate the different techniques.
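A minimal stochastic frontier sketch using the CRAN package frontier and its bundled front41Data (one of several R packages available for this purpose; the Cobb-Douglas specification below follows the package's standard example, not the chapter's application):

library(frontier)

data(front41Data)    # cross-section of firms with output, capital, and labour

# Cobb-Douglas stochastic production frontier: log output on log inputs
fit <- sfa(log(output) ~ log(capital) + log(labour), data = front41Data)
summary(fit)

# Technical efficiency scores, i.e., how far each unit lies from the estimated frontier
head(efficiencies(fit))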
Abstract: Publication date: Available online 18 February 2019. Source: Handbook of Statistics. Author(s): Hrishikesh Vinod, Honey Karun, Lekha S. Chakraborty. Abstract: A typical macroeconomic regression of private corporate investment on public infrastructure investment, the rate of interest, private credit, capital flows, and the output gap involves a mixture of stationary and nonstationary variables. Moreover, the available data series (2011–16) for estimating the regression for India is too short for asymptotic statistical inference. Hence we use the maximum entropy (ME) bootstrap from the R package “meboot” to confirm the positive role of public infrastructure investment. This significant result has policy implications for the current debate on whether public investment “crowds in” rather than “crowds out” private corporate investment in India. We use another R package, “generalCorr,” to study whether the right-hand-side variables “approximately cause” private investment, or are subject to the endogeneity problem. While finding evidence that public infrastructure spending encourages private investment in India, we highlight new R tools for estimation and inference in many macroeconomic regressions.
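A minimal sketch of the maximum entropy bootstrap step with the meboot package; the series below are simulated stand-ins, since the chapter's Indian macroeconomic data are not reproduced here.

library(meboot)

set.seed(1)
n <- 24                                      # deliberately short sample
pubinv  <- cumsum(rnorm(n, mean = 0.5))      # hypothetical nonstationary regressor
privinv <- 1 + 0.6 * pubinv + rnorm(n)       # hypothetical dependent variable

# Maximum entropy bootstrap ensembles preserve the shape and dependence of each series
ens_y <- meboot(privinv, reps = 999)$ensemble
ens_x <- meboot(pubinv,  reps = 999)$ensemble

# Re-estimate the slope on each replicate to form a bootstrap confidence interval
slopes <- sapply(seq_len(999), function(j) coef(lm(ens_y[, j] ~ ens_x[, j]))[2])
quantile(slopes, c(0.025, 0.975))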
Abstract: Publication date: Available online 11 January 2019. Source: Handbook of Statistics. Author(s): Robin C. Sickles, Wonho Song, Valentin Zelenyuk. Abstract: Our chapter details a wide variety of approaches used in estimating productivity and efficiency based on methods developed to estimate frontier production using stochastic frontier analysis (SFA) and data envelopment analysis (DEA). The estimators utilize panel, single cross-section, and time series data sets. The R programs include the following approaches to estimating firm efficiency: time-invariant fixed effects, correlated random effects, and uncorrelated random effects panel stochastic frontier estimators; time-varying fixed effects, correlated random effects, and uncorrelated random effects estimators; semiparametric efficient panel frontier estimators; factor models for cross-sectional and time-varying efficiency; bootstrapping methods to develop confidence intervals for index number-based productivity estimates and their decompositions; and DEA and Free Disposable Hull estimators. The chapter provides the professional researcher, analyst, statistician, and regulator with the most up-to-date efficiency modeling methods in the easily accessible open-source programming language R.
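As a small illustration of the nonparametric side of this toolkit, DEA and Free Disposable Hull efficiency scores can be computed with the CRAN package Benchmarking; the single-input, single-output simulated data are an assumption for the example and not the chapter's own programs.

library(Benchmarking)

set.seed(1)
n <- 50
x <- matrix(runif(n, 1, 10), ncol = 1)                       # single input
y <- matrix(x^0.7 * exp(-abs(rnorm(n, 0, 0.2))), ncol = 1)   # output lying below the frontier

e_dea <- dea(x, y, RTS = "vrs", ORIENTATION = "in")  # DEA, variable returns to scale, input-oriented
e_fdh <- dea(x, y, RTS = "fdh", ORIENTATION = "in")  # Free Disposable Hull
summary(eff(e_dea))
summary(eff(e_fdh))

# dea.boot() in the same package can be used to bootstrap confidence intervals
# for the efficiency scores.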
Abstract: Publication date: Available online 3 January 2019. Source: Handbook of Statistics. Author(s): Arpita Mukherjee, Weijia Peng, Norman R. Swanson, Xiye Yang. Abstract: In recent years, the field of financial econometrics has seen tremendous gains in the amount of data available for use in modeling and prediction. Much of this data is very high frequency, even “tick-based,” and hence falls into the category of what might be termed “big data.” The availability of such data, particularly data available at high frequency on an intra-day basis, has spurred numerous theoretical advances in the areas of volatility/risk estimation and modeling. In this chapter, we discuss key advances of this kind, beginning with a survey of numerous nonparametric estimators of integrated volatility. Thereafter, we discuss testing for jumps using these estimators. Finally, we discuss recent advances in testing for co-jumps. Such co-jumps are important for a number of reasons. For example, the presence of co-jumps, in contexts where data have been partitioned into continuous and discontinuous (jump) components, is indicative of (near) instantaneous transmission of financial shocks across different sectors and companies in the markets, and hence represents a type of systemic risk. Additionally, the presence of co-jumps across sectors suggests that if jumps can be predicted in one sector, then such predictions may contain useful information for modeling variables such as returns and volatility in another sector. As an illustration of the methods discussed in this chapter, we carry out an empirical analysis of DOW and NASDAQ stock price returns.
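As a base R sketch of two of the nonparametric estimators in question (simulated one-minute returns; the jump size and sampling frequency are arbitrary assumptions): realized variance picks up both the continuous and jump components, while bipower variation is robust to the jump, so their difference is the usual ingredient of the jump tests discussed above.

set.seed(1)
m <- 390                                       # one-minute returns in a trading day
sigma <- 0.01 / sqrt(m)                        # per-interval diffusion volatility
r <- rnorm(m, 0, sigma)                        # continuous part of intraday returns
r[200] <- r[200] + 0.005                       # add a single jump

rv <- sum(r^2)                                 # realized variance
bv <- (pi / 2) * sum(abs(r[-1]) * abs(r[-m]))  # bipower variation
c(RV = rv, BV = bv, jump_component = max(rv - bv, 0))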