Authors: Peida Zhan*, Kaiwen Man*, Stefanie A. Wind, Jonathan Malone
Abstract: Respondents' problem-solving behaviors reflect complicated cognitive processes that are frequently and systematically tied to one another. Biometric data, such as visual fixation counts (FCs), an important eye-tracking indicator, can be combined with other variables that reflect different aspects of problem-solving behavior to quantify variability in that behavior. To provide comprehensive feedback and accurate diagnosis from such multimodal data, the present study proposes a multimodal joint cognitive diagnosis model that accounts for latent attributes, latent ability, processing speed, and visual engagement by simultaneously modeling response accuracy (RA), response times, and FCs. We used two simulation studies to test the feasibility of the proposed model. Findings suggest that the parameters of the proposed model can be well recovered and that modeling FCs, in addition to RA and response times, can increase both the comprehensiveness of feedback on problem-solving-related cognitive characteristics and the accuracy of knowledge-structure diagnosis. An empirical example demonstrates the applicability and benefits of the proposed model. We discuss the implications of our findings for research and practice.
Citation: Journal of Educational and Behavioral Statistics (Ahead of Print)
PubDate: 2022-07-29T06:59:48Z
DOI: 10.3102/10769986221111085
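A minimal data-generating sketch of multimodal RA/RT/FC data, under simplified, illustrative assumptions (a DINA-type response kernel, lognormal response times, Poisson fixation counts); this is not the authors' joint model, and every parameter name and value below is hypothetical:

```r
# Illustrative joint generation of response accuracy (RA), response times (RT),
# and fixation counts (FC); not the article's multimodal joint CDM.
set.seed(1)
N <- 500; J <- 10; K <- 3                        # persons, items, attributes

Q <- matrix(rbinom(J * K, 1, 0.5), J, K)         # hypothetical Q-matrix
Q[rowSums(Q) == 0, 1] <- 1                       # ensure each item measures something
alpha <- matrix(rbinom(N * K, 1, 0.6), N, K)     # latent attribute profiles
tau   <- rnorm(N, 0, 0.3)                        # latent processing speed
omega <- rnorm(N, 0, 0.3)                        # latent visual engagement

guess <- rep(0.2, J); slip <- rep(0.1, J)        # DINA-type guessing / slipping
beta_t <- rnorm(J, 1.0, 0.2)                     # item time intensities (log seconds)
beta_f <- rnorm(J, 1.5, 0.2)                     # item log fixation rates

eta <- sapply(1:J, function(j) {                 # N x J mastery indicators
  req <- which(Q[j, ] == 1)
  as.integer(rowSums(alpha[, req, drop = FALSE]) == length(req))
})

p  <- eta * (1 - matrix(slip, N, J, byrow = TRUE)) +
      (1 - eta) * matrix(guess, N, J, byrow = TRUE)
RA <- matrix(rbinom(N * J, 1, p), N, J)                          # response accuracy
RT <- exp(matrix(beta_t, N, J, byrow = TRUE) - tau +
          matrix(rnorm(N * J, 0, 0.3), N, J))                    # lognormal response times
FC <- matrix(rpois(N * J, exp(matrix(beta_f, N, J, byrow = TRUE) + omega)),
             N, J)                                               # fixation counts
str(list(RA = RA, RT = RT, FC = FC))
```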
Authors: Weimeng Wang, Yang Liu, Hongyun Liu
Abstract: Differential item functioning (DIF) occurs when the probability of endorsing an item differs across groups for individuals with the same latent trait level. The presence of DIF items may jeopardize the validity of an instrument; it is therefore crucial to identify DIF items in routine operations of educational assessment. While DIF detection procedures based on item response theory (IRT) have been widely used, a majority of IRT-based DIF tests assume predefined anchor (i.e., DIF-free) items. Not only is this assumption strong, but violations of it may also lead to erroneous inferences, for example, an inflated Type I error rate. We propose a general framework for defining the effect sizes of DIF without a priori knowledge of anchor items. In particular, we quantify DIF by item-specific residuals from a regression model fitted to the true item parameters in the respective groups. Moreover, the null distribution of the proposed test statistic based on a robust estimator can be derived analytically or approximated numerically even when there is a mix of DIF and non-DIF items, which yields asymptotically justified statistical inference. The Type I error rate and power of the proposed procedure are evaluated and compared with those of conventional likelihood-ratio DIF tests in a Monte Carlo experiment. The simulation study shows promising results for controlling the Type I error rate while maintaining power to detect DIF items. Even when there is a mix of DIF and non-DIF items, the true and false alarm rates can be well controlled when a robust regression estimator is used.
Citation: Journal of Educational and Behavioral Statistics (Ahead of Print)
PubDate: 2022-07-19T05:08:49Z
DOI: 10.3102/10769986221109208
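A minimal sketch of the core idea, assuming known (true) item difficulties in two groups: regress one group's difficulties on the other's with a robust estimator and treat the item-specific residuals as DIF effect sizes. The data, flagging rule, and threshold are illustrative, not the article's test statistic or its asymptotic null distribution.

```r
# Quantify DIF as residuals from a robust regression linking group-wise
# item difficulties (illustrative values and flagging rule).
library(MASS)

set.seed(2)
J <- 40
b1 <- rnorm(J)                          # group-1 item difficulties
b2 <- 0.2 + 1.0 * b1                    # group-2 difficulties on a shifted scale
dif_items <- 1:5
b2[dif_items] <- b2[dif_items] + 0.6    # a few items carry genuine DIF
b2 <- b2 + rnorm(J, 0, 0.05)            # estimation noise

fit <- rlm(b2 ~ b1)                     # robust fit downweights DIF items
dif_effect <- resid(fit)                # item-specific DIF effect sizes
flag <- abs(dif_effect) / mad(dif_effect) > 2.5
which(flag)                             # items screened as potential DIF
```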
Authors: Dylan Molenaar, Mariana Cúri, Jorge L. Bazán
Abstract: Bounded continuous data are encountered in many applications of item response theory, including the measurement of mood, personality, and response times, and in the analysis of summed item scores. Although different item response theory models exist to analyze such bounded continuous data, most models assume the data lie in an open interval and cannot accommodate data in a closed interval. As a result, ad hoc transformations are needed to prevent scores on the bounds of the observed variables. To motivate the present study, we demonstrate in real and simulated data that this practice of fitting open-interval models to closed-interval data can substantially affect parameter estimates, even in cases with only 5% of the responses on one of the bounds of the observed variables. To address this problem, we propose a zero and one inflated item response theory modeling framework for bounded continuous responses in the closed interval. We illustrate how four existing models for bounded responses from the literature can be accommodated in the framework. The resulting zero and one inflated item response theory models are studied in a simulation study and a real data application to investigate parameter recovery, model fit, and the consequences of fitting an incorrect distribution to the data. We find that neglecting the bounded nature of the data biases parameters and that misspecifying the exact distribution may affect the results, depending on the data-generating model.
Citation: Journal of Educational and Behavioral Statistics (Ahead of Print)
PubDate: 2022-07-15T07:14:54Z
DOI: 10.3102/10769986221108455
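A minimal sketch of a zero-and-one inflated distribution for a single bounded item score: point masses at 0 and 1 plus a beta density on the interior. The parameterization and values are illustrative and do not reproduce any of the four models accommodated in the article's framework.

```r
# Zero-and-one inflated beta sketch for one item: simulate and fit by ML.
set.seed(3)
n  <- 1000
p0 <- 0.05; p1 <- 0.08                  # probabilities of scoring exactly 0 or 1
a  <- 2; b <- 3                         # beta shape parameters for interior scores

u <- runif(n)
y <- ifelse(u < p0, 0,
     ifelse(u < p0 + p1, 1, rbeta(n, a, b)))

# Log-likelihood mixing the point masses and the continuous component:
loglik <- function(par, y) {
  p0 <- plogis(par[1]); p1 <- plogis(par[2]) * (1 - p0)
  a  <- exp(par[3]);    b  <- exp(par[4])
  sum(ifelse(y == 0, log(p0),
      ifelse(y == 1, log(p1),
             log(1 - p0 - p1) + dbeta(y, a, b, log = TRUE))))
}
optim(c(-2, -2, 0, 0), loglik, y = y, control = list(fnscale = -1))$par
```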
Authors: Su-Pin Hung, Hung-Yu Huang
Abstract: To address response style or bias in rating scales, forced-choice items are often used to request that respondents rank their attitudes or preferences among a limited set of options. The rating scales used by raters to render judgments on ratees' performance also contribute to rater bias or errors; consequently, forced-choice items have recently been employed for raters to rate how a ratee performs on certain defined traits. This study develops forced-choice ranking models (FCRMs) for data analysis when performance is evaluated by external raters or experts in a forced-choice ranking format. The proposed FCRMs consider different degrees of rater leniency/severity when modeling the selection probability in the generalized unfolding item response theory framework. They include an additional topic facet when multiple tasks are evaluated and incorporate variations in leniency parameters to capture the interactions between ratees and raters. The simulation results indicate that the parameters of the new models can be satisfactorily recovered and that better parameter recovery is associated with more item blocks, larger sample sizes, and a complete ranking design. A technological creativity assessment is presented as an empirical example to demonstrate the applicability and implications of the new models.
Citation: Journal of Educational and Behavioral Statistics (Ahead of Print)
PubDate: 2022-07-07T07:44:00Z
DOI: 10.3102/10769986221104207
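A toy illustration of an unfolding-style selection probability within one forced-choice block, with a rater leniency shift. The proximity function, softmax normalization, and numbers are assumptions for illustration only and are not the FCRMs' actual selection model.

```r
# Unfolding-style sketch: statements closer to the (leniency-shifted) perceived
# trait level are more likely to be ranked first within a block.
theta   <- 1.0                         # ratee's latent trait level
delta   <- c(-0.5, 0.8, 1.6)           # statement locations within one block
lenient <- 0.3                         # rater leniency shifts the perceived level

closeness <- -(theta + lenient - delta)^2          # proximity, larger is closer
prob_top_rank <- exp(closeness) / sum(exp(closeness))
round(prob_top_rank, 3)                # probability each statement is ranked first
```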
Authors: Wendy Chan, Larry Vernon Hedges
Abstract: Multisite field experiments using the (generalized) randomized block design, which assigns treatments to individuals within sites, are common in education and the social sciences. Under this design, there are two possible estimands of interest, and they differ based on whether sites or blocks have fixed or random effects. When the average treatment effect is assumed to be identical across sites, it is common in classical experimental design to omit site-by-treatment interactions and "pool" them into the error term. However, prior work has not addressed the consequences of pooling when site-by-treatment interactions are not zero. This study assesses the impact of pooling on inference in the presence of nonzero site-by-treatment interactions. We derive the small-sample distributions of the test statistics for treatment effects under pooling and illustrate the impacts on rejection rates when interactions are not zero. We use the results to offer recommendations to researchers conducting studies based on the multisite design.
Citation: Journal of Educational and Behavioral Statistics (Ahead of Print)
PubDate: 2022-07-05T05:45:32Z
DOI: 10.3102/10769986221104800
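A quick simulation sketch of the setting: generate multisite data with nonzero site-by-treatment interactions, fit a model that pools those interactions into the error term, and record the rejection rate for the treatment effect. Sample sizes and variance components are illustrative; the article derives the exact small-sample distributions rather than relying on simulation.

```r
# Rejection rate of the pooled analysis when site-by-treatment effects are
# nonzero but the average treatment effect is zero (illustrative simulation).
set.seed(4)
n_sites <- 20; n_per_cell <- 10; sd_interaction <- 0.3

one_rep <- function() {
  site  <- factor(rep(1:n_sites, each = 2 * n_per_cell))
  treat <- rep(rep(0:1, each = n_per_cell), n_sites)
  site_eff <- rnorm(n_sites)
  inter    <- rnorm(n_sites, 0, sd_interaction)     # site-by-treatment effects
  y <- site_eff[site] + treat * inter[site] + rnorm(length(treat))

  pooled <- lm(y ~ treat + site)                    # interactions pooled into error
  summary(pooled)$coefficients["treat", "Pr(>|t|)"] < 0.05
}
mean(replicate(500, one_rep()))   # empirical rejection rate under a zero average effect
```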
Authors: J. R. Lockwood, Katherine E. Castellano, Daniel F. McCaffrey
Abstract: Many states and school districts in the United States use standardized test scores to compute annual measures of student achievement progress and then use school-level averages of these growth measures for various reporting and diagnostic purposes. These aggregate growth measures can vary consequentially from year to year for the same school, complicating their use and interpretation. We develop a method, based on the theory of empirical best linear prediction, to improve the accuracy and stability of aggregate growth measures by pooling information across grades, years, and tested subjects for individual schools. We demonstrate the performance of the method using both simulation and an application to 6 years of annual growth measures from a large, urban school district. We provide code for implementing the method in the schoolgrowth package for the R environment.
Citation: Journal of Educational and Behavioral Statistics (Ahead of Print)
PubDate: 2022-06-28T05:23:55Z
DOI: 10.3102/10769986221101624
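A generic empirical-Bayes shrinkage sketch conveying why pooling stabilizes noisy school-level growth measures. The variance-component estimate and weights below are deliberately crude and illustrative; the schoolgrowth package implements the article's full empirical best linear prediction across grades, years, and subjects, not this simplification.

```r
# Stabilize noisy school growth measures by shrinking toward the overall mean.
set.seed(5)
n_schools <- 50
true_growth <- rnorm(n_schools, 0, 0.15)           # latent school growth effects
se  <- runif(n_schools, 0.05, 0.30)                # school-specific sampling SEs
obs <- true_growth + rnorm(n_schools, 0, se)       # observed (noisy) growth measures

tau2 <- max(var(obs) - mean(se^2), 0)              # crude between-school variance estimate
w <- tau2 / (tau2 + se^2)                          # reliability weights
shrunk <- w * obs + (1 - w) * mean(obs)            # pooled (stabilized) growth measures

c(raw    = mean((obs    - true_growth)^2),
  shrunk = mean((shrunk - true_growth)^2))         # shrinkage usually reduces MSE
```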
Authors: Benjamin W. Domingue, Klint Kanopka, Ben Stenhaug, Michael J. Sulik, Tanesia Beverly, Matthieu Brinkhuis, Ruhan Circi, Jessica Faul, Dandan Liao, Bruce McCandliss, Jelena Obradović, Chris Piech, Tenelle Porter, Project iLEAD Consortium, James Soland, Jon Weeks, Steven L. Wise, Jason Yeatman
Abstract: The speed–accuracy trade-off (SAT) suggests that time constraints reduce response accuracy. Its relevance in observational settings, where response time (RT) may not be constrained but respondent speed may still vary, is unclear. Using 29 data sets from cognitive tasks, we apply a flexible method for identification of the SAT (which we test in extensive simulation studies) to probe whether the SAT holds. We find inconsistent relationships between time and accuracy; marginal increases in time use for an individual do not necessarily predict increases in accuracy. Additionally, the speed–accuracy relationship may depend on the underlying difficulty of the interaction. We also consider the analysis of items and individuals; of particular interest is the observation that respondents who exhibit more within-person variation in response speed are typically of lower ability. We further find that RT is typically a weak predictor of response accuracy. Our findings document a range of empirical phenomena that should inform future modeling of RTs collected in observational settings.
Citation: Journal of Educational and Behavioral Statistics (Ahead of Print)
PubDate: 2022-06-09T05:27:25Z
DOI: 10.3102/10769986221099906
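A simple sketch of the kind of question probed here: within each respondent, does spending more time than usual on an item predict higher accuracy? The data-generating values and the person-by-person logistic probe below are illustrative; the article's identification method is more flexible than this.

```r
# Within-person time-accuracy slopes and the RT-variability vs. accuracy pattern.
set.seed(6)
N <- 200; J <- 40
theta <- rnorm(N)                                   # person ability
b     <- rnorm(J)                                   # item difficulty
speed <- rnorm(N, 0, 0.3)                           # person speed

logrt   <- outer(-speed, rnorm(J, 1, 0.2), "+") +
           matrix(rnorm(N * J, 0, 0.4), N, J)       # log response times
logrt_c <- logrt - rowMeans(logrt)                  # person-centered time use
acc <- matrix(rbinom(N * J, 1,
              plogis(outer(theta, b, "-") + 0.3 * logrt_c)), N, J)

# One logistic regression per respondent: accuracy on person-centered log RT.
slopes <- sapply(1:N, function(i)
  coef(glm(acc[i, ] ~ logrt_c[i, ], family = binomial))[2])
summary(slopes)

# More within-person RT variation vs. overall accuracy (cf. the ability finding).
cor(apply(logrt, 1, sd), rowMeans(acc))
```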
Authors: Jin Liu
Abstract: Longitudinal data analysis has been widely employed to examine between-individual differences in within-individual changes. One challenge of such analyses is that the rate-of-change is only available indirectly when change patterns are nonlinear with respect to time. Latent change score models (LCSMs), which can be employed to investigate the change in rate-of-change at the individual level, have been developed to address this challenge. We extend an existing LCSM with the Jenss–Bayley growth curve and propose a novel expression for change scores that allows for (1) unequally spaced study waves and (2) individual measurement occasions around each wave. We also extend the existing model to estimate the individual ratio of growth acceleration, which largely determines the trajectory shape and is viewed as the most important parameter in the Jenss–Bayley model. We evaluate the proposed model through a simulation study and a real-world data analysis. The simulation study demonstrates that the proposed model can estimate the parameters unbiasedly and precisely and achieves the target confidence interval coverage. It also shows that the proposed model with the novel expression for change scores outperforms the existing model. An empirical example using longitudinal reading scores shows that the model can estimate the individual ratio of growth acceleration and generate individual rates-of-change in practice. We also provide the corresponding code for the proposed model.
Citation: Journal of Educational and Behavioral Statistics (Ahead of Print)
PubDate: 2022-06-07T05:26:39Z
DOI: 10.3102/10769986221099919
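A sketch of individual trajectories and rates-of-change under one common parameterization of the Jenss–Bayley curve; the parameterization, its parameter names, and all values are assumptions for illustration, not the article's model specification. Under the form used here, the acceleration one time unit apart scales by exp(gamma), which is one way to read an "acceleration ratio."

```r
# Assumed parameterization: y(t) = eta0 + eta1 * t + eta2 * (exp(gamma * t) - 1),
# with gamma < 0, so y''(t) = eta2 * gamma^2 * exp(gamma * t) and the
# acceleration ratio across one time unit is exp(gamma).
set.seed(7)
N <- 100
eta0  <- rnorm(N, 50, 5)        # intercepts
eta1  <- rnorm(N, 2.5, 0.4)     # slopes of the linear asymptote
eta2  <- rnorm(N, -8, 1)        # exponential-component loadings
gamma <- rnorm(N, -0.7, 0.1)    # log of the individual acceleration ratio

t_obs <- t(replicate(N, sort(runif(6, 0, 6))))    # individual measurement occasions

traj <- eta0 + eta1 * t_obs + eta2 * (exp(gamma * t_obs) - 1)
rate <- eta1 + eta2 * gamma * exp(gamma * t_obs)  # individual rate-of-change dy/dt
round(rate[1:3, ], 2)
```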
Authors: Peter Z. Schochet
First page: 367
Abstract: This article develops new closed-form variance expressions for power analyses for commonly used difference-in-differences (DID) and comparative interrupted time series (CITS) panel data estimators. The main contribution is to incorporate variation in treatment timing into the analysis. The power formulas also account for other key design features that arise in practice: autocorrelated errors, unequal measurement intervals, and clustering due to the unit of treatment assignment. We consider power formulas for both cross-sectional and longitudinal models and allow for covariates. An illustrative power analysis provides guidance on appropriate sample sizes. The key finding is that accounting for treatment timing increases required sample sizes. Further, DID estimators have considerably more power than standard CITS and interrupted time series (ITS) estimators. An accompanying Shiny R dashboard performs the sample size calculations for the considered estimators.
Citation: Journal of Educational and Behavioral Statistics (Ahead of Print)
PubDate: 2022-02-08T09:21:21Z
DOI: 10.3102/10769986211070625
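A simulation-based power check for the simplest two-group, two-period DID design. It deliberately ignores variation in treatment timing, autocorrelation, and clustering, which are exactly the features the article's closed-form variance expressions (and the Shiny dashboard) account for; the effect sizes and sample sizes are illustrative.

```r
# Monte Carlo power for a basic two-period DID estimator (illustrative only).
set.seed(8)
did_power <- function(n_per_group, effect, sigma = 1, reps = 1000, alpha = 0.05) {
  rejections <- replicate(reps, {
    g    <- rep(0:1, each = 2 * n_per_group)          # comparison vs. treated group
    post <- rep(rep(0:1, each = n_per_group), 2)      # pre vs. post period
    y <- 0.3 * g + 0.2 * post + effect * g * post + rnorm(length(g), 0, sigma)
    fit <- lm(y ~ g * post)
    summary(fit)$coefficients["g:post", "Pr(>|t|)"] < alpha
  })
  mean(rejections)
}
did_power(n_per_group = 100, effect = 0.4)
```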
Authors: Ernesto San Martín, Jorge González
First page: 406
Abstract: The nonequivalent groups with anchor test (NEAT) design is widely used in test equating. Under this design, two groups of examinees are administered different test forms, each containing a subset of common items. Because test takers from different groups are assigned only one test form, missing score data emerge by design, rendering some of the score distributions unavailable. The partially observed score data formally lead to an identifiability problem, which has not been recognized as such in the equating literature and has instead been approached from different perspectives, each making different assumptions in order to estimate the unidentified score distributions. In this article, we formally specify the statistical model underlying the NEAT design and unveil the lack of identifiability of the parameters of interest that compose the equating transformation. We use the theory of partial identification to show alternatives to the traditional practices that have been proposed to identify the score distributions when conducting equating under the NEAT design.
Citation: Journal of Educational and Behavioral Statistics (Ahead of Print)
PubDate: 2022-04-29T08:58:22Z
DOI: 10.3102/10769986221090609
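A small sketch of the equipercentile equating transformation phi(x) = G^{-1}(F_X(x)), computed on fully observed synthetic scores. Under the NEAT design these score distributions are only partially observed in any one population, which is the identifiability problem the article formalizes; the sketch only shows the transformation the unidentified distributions feed into, with hypothetical data.

```r
# Equipercentile equating from form X to form Y on synthetic, fully observed scores.
set.seed(9)
x <- rbinom(2000, 40, 0.55)            # scores on form X (hypothetical)
y <- rbinom(2000, 40, 0.60)            # scores on form Y (hypothetical)

Fx <- ecdf(x)
equate_x_to_y <- function(score) as.numeric(quantile(y, Fx(score), type = 8))
equate_x_to_y(c(15, 20, 25, 30))       # equated Y-scale equivalents
```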
Authors: Douglas G. Bonett
First page: 438
Abstract: The limitations of Cohen's κ are reviewed, and an alternative G-index is recommended for assessing nominal-scale agreement. Maximum likelihood estimates, standard errors, and confidence intervals for a two-rater G-index are derived for one-group and two-group designs. A new G-index of agreement for multirater designs is proposed. Statistical inference methods for some important special cases of the multirater design are also derived. G-index meta-analysis methods are proposed and can be used to combine and compare agreement across two or more populations. Closed-form sample-size formulas to achieve desired confidence interval precision are proposed for two-rater and multirater designs. R functions are given for all results.
Citation: Journal of Educational and Behavioral Statistics (Ahead of Print)
PubDate: 2022-04-29T08:56:19Z
DOI: 10.3102/10769986221088561
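A sketch contrasting Cohen's kappa with a two-rater G-index on a hypothetical 2x2 agreement table, using one common multicategory generalization G = (q * p_o - 1) / (q - 1), which reduces to 2 * p_o - 1 for two categories. The simple Wald interval below is only an approximation and is not the article's ML-based interval or its multirater extensions.

```r
# Cohen's kappa vs. a G-index of agreement on a 2x2 rater-by-rater table.
tab <- matrix(c(45,  5,
                10, 40), nrow = 2, byrow = TRUE)   # hypothetical counts
n  <- sum(tab)
po <- sum(diag(tab)) / n                           # observed agreement
pe <- sum(rowSums(tab) * colSums(tab)) / n^2       # chance agreement (kappa)
q  <- nrow(tab)

kappa <- (po - pe) / (1 - pe)
G     <- (q * po - 1) / (q - 1)

se_G  <- (q / (q - 1)) * sqrt(po * (1 - po) / n)   # SE of G (linear in po)
c(kappa = kappa, G = G,
  G_lower = G - 1.96 * se_G, G_upper = G + 1.96 * se_G)
```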
Authors: Youmi Suk, Peter M. Steiner, Jee-Seon Kim, Hyunseung Kang
First page: 459
Abstract: Regression discontinuity (RD) designs are commonly used for program evaluation with continuous treatment assignment variables. In practice, however, treatment assignment is frequently based on ordinal variables. In this study, we propose an RD design with an ordinal running variable to assess the effects of extended time accommodations (ETA) for English-language learners (ELLs). ETA eligibility is determined by the ordinal ELL English-proficiency categories of National Assessment of Educational Progress data. We discuss the identification and estimation of the average treatment effect (ATE), the intent-to-treat effect, and the local ATE at the cutoff. We also propose a series of sensitivity analyses to probe the robustness of the effect estimates to the choice of scaling functions and cutoff scores and to remaining confounding.
Citation: Journal of Educational and Behavioral Statistics (Ahead of Print)
PubDate: 2022-04-27T08:57:05Z
DOI: 10.3102/10769986221090275
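A toy illustration of the scaling-function sensitivity idea: assign two different hypothetical numeric scalings to the ordinal categories and refit the same simple RD-style contrast at the cutoff. The data, eligibility rule, and estimator are invented for illustration and are not the NAEP ELL data or the article's ATE, intent-to-treat, or local ATE estimators.

```r
# Re-estimate a simple RD contrast under two scalings of an ordinal running variable.
set.seed(11)
n <- 2000
cat_level <- sample(1:5, n, replace = TRUE)         # ordinal proficiency categories
eligible  <- as.integer(cat_level <= 2)             # eligibility below a cutoff category
y <- 0.3 * eligible + 0.15 * cat_level + rnorm(n)   # outcome with a true effect of 0.3

rd_est <- function(scaling) {
  r <- scaling[cat_level] - (scaling[2] + scaling[3]) / 2   # center at the cutoff gap
  coef(lm(y ~ eligible + r + eligible:r))["eligible"]
}
c(linear = rd_est(1:5),
  spread = rd_est(c(1, 2, 4, 7, 11)))               # estimates under two scalings
```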
Authors: Wim J. van der Linden
First page: 485
Abstract: Two independent statistical tests of item compromise are presented, one based on the test takers' responses and the other on their response times (RTs) on the same items. The tests can be used to monitor an item in real time during online continuous testing but are also applicable as part of a post hoc forensic analysis. The two test statistics are simple, intuitive quantities: the sums of the responses and of the RTs observed for the test takers on the item. Common features of the tests are ease of interpretation and computational simplicity. Both tests are uniformly most powerful under the assumption of known ability and speed parameters for the test takers. Examples of power functions for items with realistic parameter values suggest maximum power for 20–30 test takers with item preknowledge for the response-based test and 10–20 test takers for the RT-based test.
Citation: Journal of Educational and Behavioral Statistics (Ahead of Print)
PubDate: 2022-05-12T06:36:34Z
DOI: 10.3102/10769986221094789
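A sketch of the two sum statistics for a single monitored item, assuming known person parameters as in the article's setting. The plain normal-approximation standardization and all numerical values below are illustrative, not the article's exact derivations or critical values.

```r
# Flag an item via the sums of responses and of log RTs, given known parameters.
set.seed(10)
n <- 60
p_i   <- plogis(rnorm(n, 0, 1) - 0.2)       # known correct-response probabilities on the item
mu_i  <- 4.0 - rnorm(n, 0, 0.3)             # known expected log RTs on the item
sd_rt <- 0.4                                # known log RT standard deviation

resp  <- rbinom(n, 1, p_i)                  # observed responses
logrt <- rnorm(n, mu_i, sd_rt)              # observed log RTs

z_resp <- (sum(resp) - sum(p_i)) / sqrt(sum(p_i * (1 - p_i)))   # unusually many correct?
z_rt   <- (sum(logrt) - sum(mu_i)) / (sd_rt * sqrt(n))          # unusually fast responding?
c(p_resp = pnorm(z_resp, lower.tail = FALSE),
  p_rt   = pnorm(z_rt))                     # small p_rt flags suspiciously short RTs
```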