When can Multidimensional Item Response Theory (MIRT) Models be a Solution for Differential Item Functioning (DIF)? A Monte Carlo Simulation Study

Author: Liaw, Yuan-Ling
Advisor: Sanders, Elizabeth A.
Date: 2015 (deposited 2015-09-29)
File: Liaw_washington_0250E_15172.pdf
URI: http://hdl.handle.net/1773/33755
Description: Thesis (Ph.D.)--University of Washington, 2015

Abstract

The present study was designed to examine whether multidimensional item response theory (MIRT) models might be useful in controlling for differential item functioning (DIF) when estimating primary ability, or whether traditional (and simpler) unidimensional item response theory (UIRT) models with DIF items removed are sufficient for accurately estimating primary ability. Researchers have argued that the leading cause of DIF is the inclusion of "multidimensional" test items: tests thought to be unidimensional (one latent, unobserved construct or trait measured per item) are actually measuring at least one latent trait besides the one of interest. Additionally, most "problem" DIF is likely due to items measuring multiple traits that are noncompensatory in nature: to get an item correct, an examinee needs a sufficient amount of all relevant traits, so strength on one trait cannot compensate for weakness on another. However, few studies have conducted empirical research on MIRT models; of the few that have, none examined the use of MIRT models for the purpose of controlling for DIF, and none empirically compared the performance of compensatory and noncompensatory MIRT models. The present study contributes new information on the performance of these methodologies for multidimensional test items by addressing the following main research question: How accurately do UIRT and MIRT models calibrate the primary ability estimate (θ1) for focal and reference groups?

The data in this simulation study were generated for a test with 40 items and 2,000 examinees, assuming a two-parameter logistic (2PL), two-dimensional, noncompensatory model. Five conditions were manipulated:

- between-dimension correlation (0 and 0.3);
- reference-to-focal group size balance (1:1 and 9:1);
- primary dimension discrimination level (0.5 and 0.8);
- secondary dimension discrimination level (0.2 and 0.5); and
- percentage of DIF items (0%, 10%, 20%, and 30%; all DIF favored the reference group).

Five model approaches were then applied for IRT calibration, with results saved and averaged for each condition:

1. UIRTd: UIRT, no items removed from analysis.
2. UIRTnds: UIRT, after removing DIF-detected items (Mantel–Haenszel with the standard criterion, p ≤ .05).
3. UIRTndl: UIRT, after removing DIF-detected items (Mantel–Haenszel with a liberal criterion, p ≤ .10).
4. MIRTc: compensatory MIRT, no items removed from analysis.
5. MIRTnc: noncompensatory MIRT, no items removed from analysis.

The impact of these modeling approaches and manipulated conditions on the accuracy of primary ability estimates was the focus of the investigation. Accuracy was judged by bias, calculated in the usual way as the mean difference between the estimated primary ability θ̂1 and the true primary ability θ1 across the 500 replications: bias = (1/500) Σr (θ̂1,r − θ1). Analyses of variance (ANOVAs) on model-derived mean ability estimates were then used to identify main effects and simple interactions among modeling approaches and conditions.
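To make the data-generating step concrete, here is a minimal sketch in Python (NumPy only) of response generation under a noncompensatory 2PL MIRT model, in which per-dimension success probabilities multiply, so a deficit on either trait lowers the chance of a correct response. The difficulty values and variable names are illustrative assumptions; only the design settings (40 items, 2,000 examinees, and the correlation and discrimination levels) come from the study.

```python
import numpy as np

rng = np.random.default_rng(0)
n_items, n_examinees = 40, 2000
rho = 0.3          # between-dimension correlation (0 or 0.3 in the study)
a1, a2 = 0.8, 0.5  # primary / secondary discriminations (study levels)

# Correlated abilities: theta1 is the trait of interest, theta2 the
# nuisance trait responsible for multidimensionality-driven DIF.
cov = [[1.0, rho], [rho, 1.0]]
theta = rng.multivariate_normal([0.0, 0.0], cov, size=n_examinees)

# Illustrative per-dimension difficulties (not the study's values).
b = rng.normal(0.0, 1.0, size=(n_items, 2))

# Noncompensatory 2PL: per-dimension 2PL probabilities multiply.
p1 = 1.0 / (1.0 + np.exp(-a1 * (theta[:, [0]] - b[:, 0])))
p2 = 1.0 / (1.0 + np.exp(-a2 * (theta[:, [1]] - b[:, 1])))
p = p1 * p2                                        # shape (2000, 40)

responses = (rng.random(p.shape) < p).astype(int)  # 0/1 response matrix
# DIF items would then be created by, e.g., shifting difficulties upward
# for the focal group on the chosen 10-30% of items (omitted here).
```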
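The UIRTnds and UIRTndl approaches screen items with the Mantel–Haenszel procedure before calibration. The sketch below implements the standard MH chi-square for a single item, stratifying examinees by total score; the function name, the use of the continuity correction, and matching on raw total score are assumptions about implementation details the abstract does not specify.

```python
import numpy as np
from scipy.stats import chi2

def mh_dif(item, group, total):
    """Mantel-Haenszel chi-square for one item.
    item: 0/1 responses; group: 1 = reference, 0 = focal;
    total: the matching variable (here, raw total test score)."""
    A = EA = VA = 0.0
    for s in np.unique(total):
        k = total == s
        r = item[k & (group == 1)]        # reference responses in stratum
        f = item[k & (group == 0)]        # focal responses in stratum
        n_r, n_f = len(r), len(f)
        N = n_r + n_f
        if N < 2 or n_r == 0 or n_f == 0:
            continue                      # stratum carries no information
        m1 = r.sum() + f.sum()            # correct responses in stratum
        A += r.sum()                      # observed reference-correct count
        EA += n_r * m1 / N                # expected count under no DIF
        VA += n_r * n_f * m1 * (N - m1) / (N ** 2 * (N - 1))
    if VA == 0:
        return 0.0, 1.0
    stat = (abs(A - EA) - 0.5) ** 2 / VA  # continuity-corrected MH statistic
    return stat, chi2.sf(stat, df=1)      # (chi-square, p-value), df = 1

# Flag items at the standard (.05) or liberal (.10) alpha, as in the study:
# flagged = [j for j in range(40) if mh_dif(X[:, j], grp, X.sum(1))[1] <= .05]
```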
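Bias as defined above reduces to a one-line computation per group once the estimated and true primary abilities are in hand. A sketch with stand-in arrays (the estimates here are simulated placeholders, not output from an actual IRT calibration):

```python
import numpy as np

rng = np.random.default_rng(1)
R, n = 500, 2000                     # replications, examinees
theta1_true = rng.normal(size=n)     # generating values of theta1
theta1_hat = theta1_true + rng.normal(0.0, 0.3, size=(R, n))  # stand-ins

focal = rng.random(n) < 0.5          # True = focal (1:1 balance condition)
bias_focal = (theta1_hat[:, focal] - theta1_true[focal]).mean()
bias_ref = (theta1_hat[:, ~focal] - theta1_true[~focal]).mean()
print(f"focal bias {bias_focal:+.3f}, reference bias {bias_ref:+.3f}")
```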
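Finally, a sketch of how the ANOVA step could be run with statsmodels; the data frame, its column names, and the two factors shown (modeling approach and percentage of DIF items) are illustrative stand-ins for the full five-factor design.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

rng = np.random.default_rng(2)
n_rows = 4000  # toy stand-in for replication-by-cell results
df = pd.DataFrame({
    "approach": rng.choice(["UIRTd", "UIRTnds", "UIRTndl", "MIRTc", "MIRTnc"],
                           n_rows),
    "dif_pct": rng.choice([0, 10, 20, 30], n_rows),
    "bias": rng.normal(0.0, 0.1, n_rows),
})
# Main effects of approach and DIF percentage plus their interaction.
fit = smf.ols("bias ~ C(approach) * C(dif_pct)", data=df).fit()
print(anova_lm(fit, typ=2))
```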
As expected, the ANOVA results showed that the UIRTd model (no items removed from analysis) yielded the worst bias of all the models: the focal group's primary ability was consistently underestimated and the reference group's primary ability was consistently overestimated. The UIRTnds and UIRTndl models (DIF-detected items removed from analysis, one at the standard alpha level and the other at the liberal alpha level) led to the smallest bias, and the two MIRT models (MIRTc and MIRTnc) led to slightly more bias than the two DIF-cleaned UIRT models, but these differences were not significant (i.e., the only model that differed from the others was the UIRT model that ignored DIF entirely). In other words, the simpler UIRT approach works as well as the more complex MIRT approaches, but only for researchers willing to remove DIF items prior to calibration; for those with limited item pools, the MIRT approaches work just as well without removing any items.

Keywords: differential item functioning; educational measurement; item response theory; Monte Carlo simulation
Subjects: Educational psychology; Education - Seattle
Rights: Copyright is held by the individual authors.
Format: application/pdf
Language: en-US
Type: Thesis