Posted by Austin Fossey
My last post introduced the concept of differential item functioning. Today, I would like to introduce two common methods for detecting DIF in a classical test theory framework: the Mantel-Haenszel method and the logistic regression method.
I will not go into the details of these two methods, but if you would like to know more, there are many great online resources. I also recommend de Ayala’s book, The Theory and Practice of Item Response Theory, for a great, easy-to-read chapter discussing these two methods.
The Mantel-Haenszel method determines whether or not there is a relationship between group membership and item performance, after accounting for participants’ abilities (as represented by total scores). The magnitude of the DIF is represented with a log odds estimate, known as αMH. In addition to the log odds ratio, we can calculate the Cochran-Mantel-Haenszel (CMH) statistic, which follows a chi squared distribution. CMH shows whether or not the observed DIF is significant, though there is no sense of magnitude as there is with αMH.
Unfortunately, the Mantel-Haenszel method is only consistent when investigating uniform DIF. If non-uniform DIF may be present, we can use logistic regression to investigate the presence of DIF. To do this, we run two logistic regression models where item performance is regressed on total scores (to account for the participants’ abilities) and group membership. One of the models will also include an interaction term between test score and group membership. We then can compare the fit of the two models. If the model with the interaction term fits better, then there is non-uniform DIF. If the model with no interaction term shows that group membership is a significant predictor of item performance, then there is uniform DIF. Otherwise, we can conclude that there is no DIF present.
Just because we find a statistical presence of DIF does not necessarily mean that we need to panic. In Odds Ratio, Delta, ETS Classification, and Standardization Measures of DIF Magnitude for Binary Logistic Regression, Monahan, McHorney, Stump, & Perkins note that it is useful to flag items based on the effect size of the DIF.
Both the Mantel-Haenszel method and the logistic regression method can be used to generate standardized effect sizes. Monahan et al. provide three categories of effect sizes: A, B, and C. These category labels are often generated in DIF or item calibration software, and we interpret them as follows: Level A is negligible levels of DIF, level B is slight to moderate levels of DIF, and level C is moderate to large levels of DIF. Flagging rules vary by organization, but it is common for test developers to only review items that fall into levels B and C.