Authors: Benjamin W. Domingue, Klint Kanopka, Ben Stenhaug, Michael J. Sulik, Tanesia Beverly, Matthieu Brinkhuis, Ruhan Circi, Jessica Faul, Dandan Liao, Bruce McCandliss, Jelena Obradović, Chris Piech, Tenelle Porter, Project iLEAD Consortium, James Soland, Jon Weeks, Steven L. Wise, Jason Yeatman
Abstract: The speed–accuracy trade-off (SAT) suggests that time constraints reduce response accuracy. Its relevance in observational settings—where response time (RT) may not be constrained but respondent speed may still vary—is unclear. Drawing on 29 data sets from cognitive tasks, we use a flexible method for identification of the SAT (which we test in extensive simulation studies) to probe whether the SAT holds. We find inconsistent relationships between time and accuracy; marginal increases in time use for an individual do not necessarily predict increases in accuracy. Additionally, the speed–accuracy relationship may depend on the underlying difficulty of the interaction. We also consider the analysis of items and individuals; of particular interest is the observation that respondents who exhibit more within-person variation in response speed are typically of lower ability. We further find that RT is typically a weak predictor of response accuracy. Our findings document a range of empirical phenomena that should inform future modeling of RTs collected in observational settings.
Citation: Journal of Educational and Behavioral Statistics, Ahead of Print. PubDate: 2022-06-09T05:27:25Z. DOI: 10.3102/10769986221099906
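As a reading aid only, the following is a minimal sketch of the within-person logic described in the abstract: regressing accuracy on person-mean-centered log response time in simulated data. It is not the authors' flexible identification method, and every variable name and number below is a placeholder.

```python
# Hedged sketch (not the authors' method): does spending marginally more time on an
# item predict a correct response *within* the same respondent? Person-mean centering
# separates momentary time use from a respondent's overall speed.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n_persons, n_items = 50, 20
df = pd.DataFrame({
    "person": np.repeat(np.arange(n_persons), n_items),
    "log_rt": rng.normal(2.0, 0.5, n_persons * n_items),
})
df["accuracy"] = rng.binomial(1, 0.7, len(df))  # toy outcome with no built-in SAT

# Within-person centered log RT: deviation from each respondent's own average speed.
df["log_rt_within"] = df["log_rt"] - df.groupby("person")["log_rt"].transform("mean")

# A fuller model would also adjust for person ability; this keeps the sketch minimal.
fit = smf.logit("accuracy ~ log_rt_within", data=df).fit(disp=False)
print(fit.params["log_rt_within"])  # sign and size indicate the within-person SAT
```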
Authors: Jin Liu
Abstract: Longitudinal data analysis has been widely employed to examine between-individual differences in within-individual changes. One challenge of such analyses is that the rate-of-change is only available indirectly when change patterns are nonlinear with respect to time. Latent change score models (LCSMs), which can be employed to investigate the change in rate-of-change at the individual level, have been developed to address this challenge. We extend an existing LCSM with the Jenss–Bayley growth curve and propose a novel expression for change scores that allows for (1) unequally spaced study waves and (2) individual measurement occasions around each wave. We also extend the existing model to estimate the individual ratio of the growth acceleration (which largely determines the trajectory shape and is viewed as the most important parameter in the Jenss–Bayley model). We evaluate the proposed model through a simulation study and a real-world data analysis. The simulation study demonstrates that the proposed model estimates the parameters without bias and with precision, and that confidence interval coverage reaches the target level. The simulation study also shows that the proposed model with the novel expression for the change scores outperforms the existing model. An empirical example using longitudinal reading scores shows that the model can estimate the individual ratio of the growth acceleration and generate individual rate-of-change estimates in practice. We also provide the corresponding code for the proposed model.
Citation: Journal of Educational and Behavioral Statistics, Ahead of Print. PubDate: 2022-06-07T05:26:39Z. DOI: 10.3102/10769986221099919
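For readers unfamiliar with the growth curve named in the abstract, here is a minimal sketch of one common parameterization of the classical Jenss–Bayley form and its derivative, which shows why the rate-of-change is only available indirectly when change is nonlinear in time. The parameterization and values are assumptions for illustration, not the article's latent change score specification.

```python
# Hedged sketch: one common Jenss-Bayley parameterization, y(t) = a + b*t - exp(c + d*t)
# with d < 0, i.e., an exponential deviation that decays toward a linear asymptote.
# Rate-of-change is the derivative dy/dt = b - d*exp(c + d*t), which varies over time,
# so it cannot be read off a single slope parameter.
import numpy as np

def jenss_bayley(t, a, b, c, d):
    """Expected score at time t under the classical curve (d < 0)."""
    return a + b * t - np.exp(c + d * t)

def jenss_bayley_rate(t, a, b, c, d):
    """Instantaneous rate-of-change at time t (derivative of the curve)."""
    return b - d * np.exp(c + d * t)

t = np.linspace(0, 5, 6)  # study waves; unequally spaced occasions could be used instead
print(jenss_bayley(t, a=50, b=2.0, c=3.0, d=-0.8))
print(jenss_bayley_rate(t, a=50, b=2.0, c=3.0, d=-0.8))
```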
Authors: Wim J. van der Linden
Abstract: Two independent statistical tests of item compromise are presented, one based on the test takers’ responses and the other on their response times (RTs) on the same items. The tests can be used to monitor an item in real time during online continuous testing but are also applicable as part of a post hoc forensic analysis. The two test statistics are simple, intuitive quantities: the sum of the responses and the sum of the RTs observed for the test takers on the item. Common features of the tests are ease of interpretation and computational simplicity. Both tests are uniformly most powerful under the assumption of known ability and speed parameters for the test takers. Examples of power functions for items with realistic parameter values suggest maximum power for 20–30 test takers with item preknowledge for the response-based test and 10–20 test takers for the RT-based test.
Citation: Journal of Educational and Behavioral Statistics, Ahead of Print. PubDate: 2022-05-12T06:36:34Z. DOI: 10.3102/10769986221094789
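A minimal sketch of the two sums the abstract describes as test statistics, assuming item-level response and RT vectors are available; the flagging step is a placeholder, since the article derives the actual critical values from known ability and speed parameters for the test takers.

```python
# Hedged sketch: compute the two monitoring sums for one item. The article's tests
# compare each sum to a critical value derived from its null distribution; those
# critical values are not reproduced here.
import numpy as np

def item_monitoring_sums(responses, response_times):
    """Return (sum of 0/1 responses, sum of RTs) for the test takers on one item."""
    responses = np.asarray(responses)
    response_times = np.asarray(response_times)
    return responses.sum(), response_times.sum()

resp = [1, 1, 0, 1, 1, 1, 0, 1]                        # responses on the monitored item
rts = [12.3, 9.8, 15.1, 8.7, 7.9, 10.2, 14.6, 9.1]     # response times in seconds
score_sum, rt_sum = item_monitoring_sums(resp, rts)
print(score_sum, rt_sum)  # unusually high score sum or low RT sum suggests compromise
```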
Authors: Ernesto San Martín, Jorge González
Abstract: The nonequivalent groups with anchor test (NEAT) design is widely used in test equating. Under this design, two groups of examinees are administered different test forms, with each test form containing a subset of common items. Because test takers from different groups are assigned only one test form, missing score data emerge by design, rendering some of the score distributions unavailable. The partially observed score data formally lead to an identifiability problem, which has not been recognized as such in the equating literature and has instead been considered from different perspectives, all of them making different assumptions in order to estimate the unidentified score distributions. In this article, we formally specify the statistical model underlying the NEAT design and unveil the lack of identifiability of the parameters of interest that compose the equating transformation. We use the theory of partial identification to show alternatives to the traditional practices that have been proposed to identify the score distributions when conducting equating under the NEAT design.
Citation: Journal of Educational and Behavioral Statistics, Ahead of Print. PubDate: 2022-04-29T08:58:22Z. DOI: 10.3102/10769986221090609
Authors: Douglas G. Bonett
Abstract: The limitations of Cohen’s κ are reviewed and an alternative G-index is recommended for assessing nominal-scale agreement. Maximum likelihood estimates, standard errors, and confidence intervals for a two-rater G-index are derived for one-group and two-group designs. A new G-index of agreement for multirater designs is proposed. Statistical inference methods for some important special cases of the multirater design also are derived. G-index meta-analysis methods are proposed and can be used to combine and compare agreement across two or more populations. Closed-form sample-size formulas to achieve desired confidence interval precision are proposed for two-rater and multirater designs. R functions are given for all results.
Citation: Journal of Educational and Behavioral Statistics, Ahead of Print. PubDate: 2022-04-29T08:56:19Z. DOI: 10.3102/10769986221088561
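As a hedged illustration, the sketch below computes one common form of the two-rater G-index, chance-corrected agreement with chance probability 1/q for q nominal categories, together with a simple Wald interval; the article's maximum likelihood estimators, standard errors, and confidence intervals may differ from this simplification.

```python
# Hedged sketch: two-rater G-index as G = (q*p_a - 1)/(q - 1), where p_a is the
# observed proportion of agreement and q is the number of nominal categories.
# The Wald interval is illustrative only, not the interval derived in the article.
import numpy as np
from scipy.stats import norm

def g_index(ratings1, ratings2, q, alpha=0.05):
    r1, r2 = np.asarray(ratings1), np.asarray(ratings2)
    n = len(r1)
    p_a = np.mean(r1 == r2)                              # observed agreement proportion
    g = (q * p_a - 1) / (q - 1)                          # chance-corrected agreement
    se = (q / (q - 1)) * np.sqrt(p_a * (1 - p_a) / n)    # delta-method standard error
    z = norm.ppf(1 - alpha / 2)
    return g, (g - z * se, g + z * se)

rater1 = [1, 2, 2, 3, 1, 2, 3, 3, 1, 2]
rater2 = [1, 2, 2, 3, 2, 2, 3, 1, 1, 2]
print(g_index(rater1, rater2, q=3))
```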
Authors: Youmi Suk, Peter M. Steiner, Jee-Seon Kim, Hyunseung Kang
Abstract: Regression discontinuity (RD) designs are commonly used for program evaluation with continuous treatment assignment variables, but in practice treatment assignment is frequently based on ordinal variables. In this study, we propose an RD design with an ordinal running variable to assess the effects of extended time accommodations (ETA) for English-language learners (ELLs). ETA eligibility is determined by the ordinal ELL English-proficiency categories of National Assessment of Educational Progress data. We discuss the identification and estimation of the average treatment effect (ATE), the intent-to-treat effect, and the local ATE at the cutoff. We also propose a series of sensitivity analyses to probe the robustness of the effect estimates to the choices of scaling functions and cutoff scores and to remaining confounding.
Citation: Journal of Educational and Behavioral Statistics, Ahead of Print. PubDate: 2022-04-27T08:57:05Z. DOI: 10.3102/10769986221090275
Authors: Peter Z. Schochet
Abstract: This article develops new closed-form variance expressions for power analyses for commonly used difference-in-differences (DID) and comparative interrupted time series (CITS) panel data estimators. The main contribution is to incorporate variation in treatment timing into the analysis. The power formulas also account for other key design features that arise in practice: autocorrelated errors, unequal measurement intervals, and clustering due to the unit of treatment assignment. We consider power formulas for both cross-sectional and longitudinal models and allow for covariates. An illustrative power analysis provides guidance on appropriate sample sizes. The key finding is that accounting for treatment timing increases required sample sizes. Further, DID estimators have considerably more power than standard CITS and interrupted time series (ITS) estimators. An available Shiny R dashboard performs the sample size calculations for the considered estimators.
Citation: Journal of Educational and Behavioral Statistics, Ahead of Print. PubDate: 2022-02-08T09:21:21Z. DOI: 10.3102/10769986211070625
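For orientation, here is a textbook power calculation for the simplest two-group, two-period DID with independent cross sections; it deliberately omits the treatment-timing variation, autocorrelation, unequal intervals, and clustering that the article's closed-form variance expressions incorporate, and it is not the Shiny dashboard's calculation.

```python
# Hedged sketch: power for a 2-group x 2-period cross-sectional DID with four
# independent cells of size n and outcome variance sigma^2, so Var(DID) = 4*sigma^2/n.
# Shown only to fix the basic normal-approximation power formula.
import numpy as np
from scipy.stats import norm

def did_power(effect, sigma, n_per_cell, alpha=0.05):
    se = np.sqrt(4 * sigma**2 / n_per_cell)   # standard error of the DID estimator
    z_crit = norm.ppf(1 - alpha / 2)
    # Two-sided power of the z-test for the DID effect.
    return norm.cdf(effect / se - z_crit) + norm.cdf(-effect / se - z_crit)

print(did_power(effect=0.25, sigma=1.0, n_per_cell=200))
```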
Authors: Wim J. van der Linden
Abstract: The current literature on test equating generally defines it as the process necessary to obtain score comparability between different test forms. This definition is in contrast with Lord’s foundational paper, which viewed equating as the process required to obtain comparability of measurement scales between forms. The distinction between the notions of scale and score is not trivial. The difference is explained by connecting these notions with standard statistical concepts such as the probability experiment, sample space, and random variable. The probability experiment underlying equating test forms with random scores immediately gives us the equating transformation as a function mapping the scale of one form into that of the other, and thus supports the point of view taken by Lord. However, both Lord’s view and the current literature appear to rely on the idea of an experiment with random examinees, which implies a different notion of test scores. It is shown how an explicit choice between the two experiments is not just important for our theoretical understanding of key notions in test equating but also has important practical consequences.
Citation: Journal of Educational and Behavioral Statistics, Ahead of Print. First page: 353. PubDate: 2022-02-08T09:23:00Z. DOI: 10.3102/10769986211072308
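To make the phrase "a function mapping the scale of one form into that of the other" concrete, here is a sketch of the standard equipercentile transformation computed from empirical score distributions; it illustrates the scale-to-scale view only and does not settle the article's discussion of the two probability experiments.

```python
# Hedged sketch: equipercentile equating phi(x) = F_Y^{-1}(F_X(x)), approximated with
# empirical distributions from toy data. Operational equating would use smoothed or
# continuized distributions rather than raw empirical ones.
import numpy as np

def equipercentile_equate(x_points, scores_x, scores_y):
    """Map points on form X's scale to form Y's scale by matching cumulative proportions."""
    scores_x = np.sort(scores_x)
    # Empirical CDF of form X evaluated at the requested score points.
    p = np.searchsorted(scores_x, x_points, side="right") / len(scores_x)
    # Empirical quantile function of form Y at those cumulative proportions.
    return np.quantile(scores_y, np.clip(p, 0, 1))

rng = np.random.default_rng(1)
form_x = rng.normal(50, 10, 1000)   # toy observed scores on form X
form_y = rng.normal(55, 12, 1000)   # toy observed scores on form Y
print(equipercentile_equate(np.array([40.0, 50.0, 60.0]), form_x, form_y))
```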
Authors: Sandip Sinharay
Abstract: Takers of educational tests often receive proficiency levels instead of, or in addition to, scaled scores. For example, proficiency levels are reported for the Advanced Placement (AP®) and U.S. Medical Licensing examinations. Technical difficulties and other unforeseen events occasionally lead to missing item scores and hence to incomplete data on these tests. Reporting proficiency levels to examinees with incomplete data requires estimation of their performance on the missing part and essentially involves imputation of missing data. In this article, six approaches from the literature on missing data analysis are brought to bear on the problem of reporting proficiency levels to examinees with incomplete data. Data from several large-scale educational tests are used to compare the performance of the six approaches with that of the approach operationally used to report proficiency levels for these tests. A multiple imputation approach based on chained equations is shown to lead to the most accurate reporting of proficiency levels for data that are missing at random or completely at random, while the model-based approach of Holman and Glas performed best for data that are missing not at random. Several recommendations are made on the reporting of proficiency levels to examinees with incomplete data.
Citation: Journal of Educational and Behavioral Statistics, Ahead of Print. First page: 263. PubDate: 2021-10-25T03:39:27Z. DOI: 10.3102/10769986211051379
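As a rough analogue of the chained-equations approach mentioned in the abstract, the sketch below imputes missing 0/1 item scores with scikit-learn's IterativeImputer and maps imputed total scores to proficiency levels; the simulated data, cut scores, imputation engine, and classification rule are all placeholders, not the operational procedure or the specific methods compared in the article.

```python
# Hedged sketch: MICE-style imputation of missing item scores, followed by a
# placeholder mapping of total scores to proficiency levels.
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(0)
item_scores = rng.binomial(1, 0.6, size=(500, 30)).astype(float)
item_scores[rng.random(item_scores.shape) < 0.1] = np.nan   # 10% missing item scores

imputations = []
for m in range(5):  # several imputations to reflect uncertainty about the missing part
    imputer = IterativeImputer(max_iter=10, sample_posterior=True, random_state=m)
    completed = np.clip(np.round(imputer.fit_transform(item_scores)), 0, 1)
    imputations.append(completed.sum(axis=1))               # total score per examinee

# Simplification: average the imputed total scores, then classify with placeholder cuts.
cut_scores = [10, 16, 22]
levels = np.digitize(np.mean(imputations, axis=0), cut_scores)
print(np.bincount(levels))                                   # examinees per proficiency level
```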
Authors: Kuan-Yu Jin, Yi-Jhen Wu, Hui-Fang Chen
Abstract: For surveys of complex issues that entail multiple steps, multiple reference points, and nongradient attributes (e.g., social inequality), this study proposes a new multiprocess model that integrates ideal-point and dominance approaches into a treelike structure (IDtree). In the IDtree, an ideal-point approach describes an individual’s attitude, and a dominance approach then describes their tendency to use extreme response categories. Evaluation of IDtree performance on two empirical data sets showed that the IDtree fit these data better than other models. Furthermore, simulation studies showed satisfactory parameter recovery for the IDtree. Thus, the IDtree model sheds light on the response processes of a multistage structure.
Citation: Journal of Educational and Behavioral Statistics, Ahead of Print. First page: 297. PubDate: 2021-12-10T03:06:47Z. DOI: 10.3102/10769986211057160
Authors: Yunxiao Chen, Yi-Hsuan Lee, Xiaoou Li
Abstract: In standardized educational testing, test items are reused in multiple test administrations. To ensure the validity of test scores, the psychometric properties of items should remain unchanged over time. In this article, we consider the sequential monitoring of test items, in particular the detection of abrupt changes to their psychometric properties, where a change can be caused by, for example, leakage of the item or a change in the corresponding curriculum. We propose a statistical framework for the detection of abrupt changes in individual items. This framework consists of (1) a multistream Bayesian change point model describing sequential changes in items, (2) a compound risk function quantifying the risk in sequential decisions, and (3) sequential decision rules that control the compound risk. Throughout the sequential decision process, the proposed decision rule balances the trade-off between two sources of errors: the false detection of prechange items and the nondetection of postchange items. An item-specific monitoring statistic is proposed based on an item response theory model that eliminates the confounding from the examinee population, which changes over time. Sequential decision rules and their theoretical properties are developed under two settings: the oracle setting, where the Bayesian change point model is completely known, and a more realistic setting, where some parameters of the model are unknown. Simulation studies are conducted under settings that mimic real operational tests.
Citation: Journal of Educational and Behavioral Statistics, Ahead of Print. First page: 322. PubDate: 2021-12-14T02:40:34Z. DOI: 10.3102/10769986211059085
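As a toy single-stream analogue of the sequential Bayesian monitoring described in the abstract, the sketch below updates the posterior probability that a change has already occurred under a geometric prior on the change time and flags when it crosses a threshold; the article's multistream model, compound risk control, and IRT-based monitoring statistic are not implemented here, and the Gaussian pre/post distributions are assumptions for illustration.

```python
# Hedged sketch: single-stream Bayesian change point monitoring with a geometric prior
# on the change time and Gaussian pre- and post-change observation models.
import numpy as np
from scipy.stats import norm

def bayes_change_monitor(xs, rho=0.01, mu_pre=0.0, mu_post=1.0, threshold=0.95):
    """Return the first time the posterior P(change <= n | data) exceeds the threshold."""
    p = 0.0
    for n, x in enumerate(xs, start=1):
        prior = p + (1 - p) * rho                       # change may newly occur at step n
        num = prior * norm.pdf(x, loc=mu_post)          # post-change likelihood
        den = num + (1 - prior) * norm.pdf(x, loc=mu_pre)
        p = num / den                                   # posterior probability of a change
        if p > threshold:
            return n, p
    return None, p

rng = np.random.default_rng(2)
xs = np.concatenate([rng.normal(0, 1, 60), rng.normal(1, 1, 40)])  # true change at t = 61
print(bayes_change_monitor(xs))
```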