Item Analysis – Differential Item Functioning (DIF)

Posted by Austin Fossey

We have talked about item difficulty and item discrimination in a classical test theory framework, and we have discussed how these indices can be used to flag items that may potentially affect the validity of the inferences we make about the assessment results. Another area of item performance that is often used for item retention decisions is item bias, commonly referred to as differential item functioning (DIF).

DIF studies are generally implemented to see how the performances of two groups compare on a single item in an assessment (though studies can be done with more than two groups). One group is typically referred to as the reference group, and the other is the focal group. The focal group is the group that theory or previous research suggests may be disadvantaged by the item.

One simple method that some practitioners use is based on the four-fifths rule, which is detailed in the Uniform Guidelines on Employee Selection Procedures. This method involves comparing the correct response rates (p values) for the two groups. If the ratio of the lower p value to the higher p value is less than 0.80, then the item may be adversely impacting the group with the lower p value. For example, if 50% of males answer an item correctly and 75% of females answer an item correctly, then 0.50/0.75 ≈ 0.67 < 0.80, so we may be concerned that the item is adversely affecting the response patterns of males.
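The calculation above can be sketched in a few lines of Python. The function names and the default threshold argument are illustrative assumptions, not part of the Uniform Guidelines or any particular software package.

```python
def four_fifths_ratio(p_a, p_b):
    """Return the ratio of the lower p value to the higher p value."""
    low, high = sorted((p_a, p_b))
    if high == 0:
        raise ValueError("At least one group must have a nonzero p value.")
    return low / high


def flags_adverse_impact(p_a, p_b, threshold=0.80):
    """True if the ratio falls below the four-fifths threshold."""
    return four_fifths_ratio(p_a, p_b) < threshold


# The example from the text: 50% of males and 75% of females answer correctly.
ratio = four_fifths_ratio(0.50, 0.75)
print(round(ratio, 2))                    # 0.67
print(flags_adverse_impact(0.50, 0.75))   # True
```

Sorting the two p values first means the caller does not need to know in advance which group has the lower rate.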

The four-fifths rule is attractive because it is easy to calculate, but it is prone to sampling error and misinterpretation. Continuing with our example, what if the population of males on average actually knows less about the content than the population of females? Then we would expect to see large differences in p values for the two groups because this reflects the actual differences in ability in the population.

In Differential Item Functioning (eds. Holland & Wainer, 1993), Angoff explains that DIF occurs when an item displays different statistical properties between groups after those groups have been matched on a measure of proficiency. To put it another way, we first need to account for differences in the groups’ abilities, and then see if there are still differences in item performance.
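To make the idea of matching concrete, here is a minimal sketch using made-up data. The two score bands ("low" and "high"), the counts, and the helper names are all hypothetical; a real study would match on total score or another proficiency measure and use far more examinees.

```python
from collections import defaultdict


def make_records(group, score_band, n_correct, n_total):
    """Build hypothetical (group, score_band, item_correct) rows."""
    return [(group, score_band, 1 if i < n_correct else 0) for i in range(n_total)]


def p_values(records, key):
    """Proportion answering the item correctly, grouped by key(record)."""
    tallies = defaultdict(lambda: [0, 0])  # key -> [n_correct, n_total]
    for record in records:
        tally = tallies[key(record)]
        tally[0] += record[2]
        tally[1] += 1
    return {k: correct / total for k, (correct, total) in tallies.items()}


# Hypothetical data: the focal group is concentrated in the low score band.
records = (
    make_records("reference", "low", 1, 4)
    + make_records("focal", "low", 3, 12)
    + make_records("reference", "high", 9, 12)
    + make_records("focal", "high", 3, 4)
)

overall = p_values(records, key=lambda r: r[0])          # ignores proficiency
matched = p_values(records, key=lambda r: (r[1], r[0]))  # within each score band
print(overall)
print(matched)
```

In this invented data set the overall p values are 0.625 for the reference group and 0.375 for the focal group, a ratio of 0.60 that the four-fifths rule would flag. Within each score band, however, both groups have the same p value (0.25 in the low band and 0.75 in the high band), so once the groups are matched on proficiency the item shows no difference.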

There are many ways to investigate DIF while accounting for participants’ abilities, and your decision may be influenced by whether or not you are using item response theory (IRT) for your student model, whether you have missing data, and whether or not the DIF is uniform or non-uniform.

Uniform DIF indicates that one group is (on average) always at a disadvantage when responding to the item. If we were to create item characteristic curves for the two groups, they would not intersect. Non-uniform DIF means that one group has an advantage for some proficiency levels, but is at a disadvantage at other proficiency levels. In this scenario, the two item characteristic curves would intersect.
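The distinction can be illustrated with a small sketch. It assumes a two-parameter logistic (2PL) form for the item characteristic curve, which is one common choice and not necessarily the model behind any given assessment; the discrimination (a) and difficulty (b) values are invented for illustration.

```python
import math


def icc(theta, a, b):
    """2PL item characteristic curve: probability of a correct response."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def curves_cross(ref_params, focal_params, thetas):
    """True if each group is favored somewhere on the proficiency grid."""
    diffs = [icc(t, *ref_params) - icc(t, *focal_params) for t in thetas]
    return any(d > 0 for d in diffs) and any(d < 0 for d in diffs)


# Proficiency grid from -3.0 to 3.0 in steps of 0.1.
thetas = [x / 10 for x in range(-30, 31)]

# Same discrimination, harder for the focal group: curves never intersect.
uniform_crosses = curves_cross((1.0, 0.0), (1.0, 0.5), thetas)

# Different discrimination, same difficulty: curves intersect at theta = 0.
non_uniform_crosses = curves_cross((1.5, 0.0), (0.7, 0.0), thetas)

print(uniform_crosses)      # False
print(non_uniform_crosses)  # True
```

Under this assumed model, a shift in difficulty alone produces uniform DIF, while a difference in discrimination produces curves that cross and therefore non-uniform DIF.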

Item characteristic curves demonstrating examples of uniform and non-uniform DIF.

In my next post, I will introduce two common methods for detecting uniform and non-uniform DIF: the Mantel-Haenszel method and logistic regression. Unlike the four-fifths rule, these methods account for participants’ abilities (as represented by total scores) before making inferences about each group’s performance on an item.

Understanding Assessment Validity: New Perspectives

Posted by Greg Pope

In my last post I discussed specific aspects of construct validity. I’m capping off this series with a discussion of modern views and thinking on validity.

Recently my former graduate supervisor, Dr. Bruno D. Zumbo at the University of British Columbia, wrote a fascinating chapter in the new book, The Concept of Validity: Revisions, New Directions and Applications, edited by Dr. Robert W. Lissitz. Bruno’s chapter, “Validity as Contextualized and Pragmatic Explanation, and its Implications for Validation Practice,” provides a great modern perspective on validity.

The chapter has two aims: to provide an overview of what Bruno considers to be the concept of validity, and to discuss the implications for the process of validation.

Something I really liked about the chapter was its focus on why we conduct psychometric analyses that dig into how our assessments perform. As Bruno discusses, the real purpose of all the psychometric analysis we do is to support or provide evidence for the claims that we make about the validity of the assessment measures we gather. For example, the reason we would do a differential item functioning (DIF) analysis, in which we check that test questions are not biased against or toward a certain group, is not only to protect test developers against lawsuits but also to weed out invalidity and help us establish the inferential limits of assessment results.

Bruno drives home the point that examining validity is an ongoing process of validation. One doesn’t just do a validity study or two and then stop: validation is an ongoing process in which multilevel construct validation occurs and procedures are tied in to program evaluation and assessment quality processes.

I would highly recommend that people interested in diving more into the theoretical and practical details of validity check out this book, which includes chapters from many highly respected psychometrics and testing industry experts.

I hope that this series on validity has been useful and interesting! Stay tuned for more psychometric tidbits in upcoming posts.


Editor’s Note: Greg will be doing a presentation at the Questionmark Users Conference on Conducting Validity Studies within Your Organization. The conference will take place in Miami March 14 – 17. Learn more at