Item Analysis – Two Methods for Detecting DIF

Posted by Austin Fossey

My last post introduced the concept of differential item functioning. Today, I would like to introduce two common methods for detecting DIF in a classical test theory framework: the Mantel-Haenszel method and the logistic regression method.

I will not go into the details of these two methods, but if you would like to know more, there are many great online resources. I also recommend de Ayala’s book, The Theory and Practice of Item Response Theory, for a great, easy-to-read chapter discussing these two methods.

Mantel-Haenszel


The Mantel-Haenszel method determines whether or not there is a relationship between group membership and item performance, after accounting for participants’ abilities (as represented by total scores). The magnitude of the DIF is represented with a common odds ratio estimate, known as αMH; its natural log is often reported as a log odds ratio. In addition to the odds ratio, we can calculate the Cochran-Mantel-Haenszel (CMH) statistic, which follows a chi-squared distribution. CMH shows whether or not the observed DIF is statistically significant, though it gives no sense of magnitude as αMH does.
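To make the two statistics concrete, here is a minimal Python sketch of the standard Mantel-Haenszel formulas. It assumes participants have already been grouped into strata by total score, with one 2x2 table of counts per stratum. The function name `mantel_haenszel`, the tuple layout, and the example counts are all hypothetical, and the CMH statistic uses the usual 0.5 continuity correction, which is a choice rather than a requirement.

```python
import math

def mantel_haenszel(strata):
    """Return (alpha_MH, CMH chi-square, p-value) for one item.

    strata: iterable of (ref_correct, ref_incorrect, focal_correct,
    focal_incorrect) counts, one tuple per total-score level.
    """
    num = den = 0.0
    observed = expected = variance = 0.0
    for a, b, c, d in strata:
        n = a + b + c + d
        if n < 2:
            continue  # a stratum this small carries no information
        num += a * d / n
        den += b * c / n
        n_ref, n_focal = a + b, c + d
        m_correct, m_incorrect = a + c, b + d
        observed += a
        expected += n_ref * m_correct / n
        variance += n_ref * n_focal * m_correct * m_incorrect / (n * n * (n - 1))
    alpha_mh = num / den  # common odds ratio; 1.0 means no DIF
    cmh = (abs(observed - expected) - 0.5) ** 2 / variance
    p_value = math.erfc(math.sqrt(cmh / 2.0))  # chi-square tail, 1 df
    return alpha_mh, cmh, p_value

# Hypothetical counts for three total-score strata (low, middle, high).
strata = [(12, 18, 6, 24), (20, 10, 14, 16), (27, 3, 22, 8)]
alpha_mh, cmh, p_value = mantel_haenszel(strata)
print(f"alpha_MH={alpha_mh:.2f}  CMH={cmh:.2f}  p={p_value:.4f}")
```

An αMH above 1 here means the odds of a correct response favor the reference group after matching on total score. The sketch does not guard against a zero denominator, which can occur with very sparse strata.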

Logistic Regression

Unfortunately, the Mantel-Haenszel method is only consistent when investigating uniform DIF. If non-uniform DIF may be present, we can use logistic regression instead. To do this, we run two logistic regression models in which item performance is regressed on total score (to account for the participants’ abilities) and group membership. One of the models also includes an interaction term between total score and group membership. We can then compare the fit of the two models. If the model with the interaction term fits significantly better, then there is non-uniform DIF. If it does not, but the model with no interaction term shows that group membership is a significant predictor of item performance, then there is uniform DIF. Otherwise, we can conclude that there is no evidence of DIF.
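The model comparison described above can be sketched in plain Python. This is an illustration under several assumptions rather than a production routine: the response data are made up, the helper names (`solve`, `predict`, `fit_logistic`, `lr_test`) are hypothetical, total scores are centered, and the models are compared with likelihood-ratio tests on one degree of freedom. Testing group membership by comparing against a score-only model is one of several reasonable ways to check whether group is a significant predictor.

```python
import math

def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x

def predict(beta, xi):
    """Predicted probability of a correct response for one row of predictors."""
    return 1.0 / (1.0 + math.exp(-sum(b * v for b, v in zip(beta, xi))))

def fit_logistic(rows, y, iterations=25):
    """Fit by Newton-Raphson; returns (coefficients, log-likelihood).
    An intercept column is prepended to each row of predictors."""
    X = [[1.0] + list(r) for r in rows]
    k = len(X[0])
    beta = [0.0] * k
    for _ in range(iterations):
        grad = [0.0] * k
        hess = [[0.0] * k for _ in range(k)]
        for xi, yi in zip(X, y):
            p = predict(beta, xi)
            for i in range(k):
                grad[i] += (yi - p) * xi[i]
                for j in range(k):
                    hess[i][j] += p * (1.0 - p) * xi[i] * xi[j]
        beta = [b + s for b, s in zip(beta, solve(hess, grad))]
    loglik = sum(math.log(predict(beta, xi) if yi else 1.0 - predict(beta, xi))
                 for xi, yi in zip(X, y))
    return beta, loglik

def lr_test(ll_small, ll_big):
    """Likelihood-ratio statistic and p-value for one added parameter (1 df)."""
    g2 = max(0.0, 2.0 * (ll_big - ll_small))
    return g2, math.erfc(math.sqrt(g2 / 2.0))

# Hypothetical data: correct answers out of 10 at each centered total score.
reference = {-2: 2, -1: 4, 0: 6, 1: 8, 2: 9}
focal = {-2: 1, -1: 2, 0: 4, 1: 6, 2: 8}
score, group, y = [], [], []
for g, counts in ((0.0, reference), (1.0, focal)):
    for s, n_correct in counts.items():
        for i in range(10):
            score.append(float(s))
            group.append(g)
            y.append(1 if i < n_correct else 0)

_, ll_score = fit_logistic([[s] for s in score], y)
b_main, ll_main = fit_logistic([[s, g] for s, g in zip(score, group)], y)
_, ll_inter = fit_logistic([[s, g, s * g] for s, g in zip(score, group)], y)

print("interaction (non-uniform DIF): G2=%.3f p=%.4f" % lr_test(ll_main, ll_inter))
print("group effect (uniform DIF):    G2=%.3f p=%.4f" % lr_test(ll_score, ll_main))
```

In practice a statistics package would handle the fitting; the hand-rolled Newton-Raphson loop is here only to keep the sketch dependency-free.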

Finding statistically significant DIF does not necessarily mean that we need to panic. In “Odds Ratio, Delta, ETS Classification, and Standardization Measures of DIF Magnitude for Binary Logistic Regression,” Monahan, McHorney, Stump, and Perkins note that it is useful to flag items based on the effect size of the DIF.

Both the Mantel-Haenszel method and the logistic regression method can be used to generate standardized effect sizes. Monahan et al. describe three categories of effect sizes: A, B, and C. These category labels are often generated by DIF or item calibration software, and we interpret them as follows: level A indicates negligible DIF, level B indicates slight to moderate DIF, and level C indicates moderate to large DIF. Flagging rules vary by organization, but it is common for test developers to review only items that fall into levels B and C.
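As an illustration of how such labels can be derived, the sketch below converts a Mantel-Haenszel odds ratio to the ETS delta scale (MH D-DIF = -2.35 ln αMH) and applies the commonly cited cut points of 1.0 and 1.5. It is deliberately simplified: it uses magnitude only, whereas the full ETS rules also require significance tests, and the function names are hypothetical. Check the criteria your own software or organization uses before relying on these labels.

```python
import math

def mh_d_dif(alpha_mh):
    """ETS delta-scale effect size: -2.35 * ln(alpha_MH)."""
    return -2.35 * math.log(alpha_mh)

def ets_category(alpha_mh):
    """Magnitude-only A/B/C label (assumption: significance conditions omitted)."""
    size = abs(mh_d_dif(alpha_mh))
    if size < 1.0:
        return "A"  # negligible DIF
    if size < 1.5:
        return "B"  # slight to moderate DIF
    return "C"      # moderate to large DIF

for alpha in (1.0, 1.7, 3.0):
    print(alpha, round(mh_d_dif(alpha), 2), ets_category(alpha))
```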

Item Analysis – Differential Item Functioning (DIF)

Posted by Austin Fossey

We have talked about item difficulty and item discrimination in a classical test theory framework, and we have discussed how these indices can be used to flag items that may potentially affect the validity of the inferences we make about the assessment results. Another area of item performance that is often used for item retention decisions is item bias, commonly referred to as differential item functioning (DIF).

DIF studies are generally implemented to see how the performances of two groups compare on a single item in an assessment (though studies can be done with more than two groups). One group is typically referred to as the reference group, and the other is the focal group. The focal group is the group that theory or previous research suggests may be disadvantaged by the item.

One simple method that some practitioners use is based on the four-fifths rule, which is detailed in the Uniform Guidelines on Employee Selection Procedures. This method involves comparing the correct response rates (p values) for the two groups. If the ratio of the lower p value to the higher p value is less than 0.80, then the item may be adversely impacting the group with the lower p value. For example, if 50% of males answer an item correctly and 75% of females answer an item correctly, then 0.50/0.75 ≈ 0.67 < 0.80, so we may be concerned that the item is adversely affecting the response patterns of males.
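The arithmetic in this example is simple enough to sketch in a few lines of Python. The function names and the default threshold argument are illustrative only.

```python
def four_fifths_ratio(p_group_a, p_group_b):
    """Ratio of the lower correct-response rate (p value) to the higher one."""
    lower, higher = sorted((p_group_a, p_group_b))
    if higher == 0:
        raise ValueError("at least one group must have a nonzero p value")
    return lower / higher

def flags_adverse_impact(p_group_a, p_group_b, threshold=0.80):
    """True if the ratio falls below the four-fifths threshold."""
    return four_fifths_ratio(p_group_a, p_group_b) < threshold

# The example from the text: 50% of males and 75% of females answer correctly.
print(round(four_fifths_ratio(0.50, 0.75), 2), flags_adverse_impact(0.50, 0.75))
# prints: 0.67 True
```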

The four-fifths rule is attractive because it is easy to calculate, but it is prone to sampling error and misinterpretation. Continuing with our example, what if the population of males on average actually knows less about the content than the population of females? Then we would expect to see large differences in p values for the two groups because this reflects the actual differences in ability in the population.

In Differential Item Functioning (eds. Holland & Wainer, 1993), Angoff explains that DIF occurs when an item displays different statistical properties between groups after those groups have been matched on a measure of proficiency. To put it another way, we need to first account for differences in the groups’ abilities, and then see if there are still differences in the item performance.

There are many ways to investigate DIF while accounting for participants’ abilities, and your decision may be influenced by whether you are using item response theory (IRT) for your student model, whether you have missing data, and whether the DIF is uniform or non-uniform.

Uniform DIF indicates that one group is (on average) always at a disadvantage when responding to the item. If we were to create item characteristic curves for the two groups, they would not intersect. Non-uniform DIF means that one group has an advantage for some proficiency levels, but is at a disadvantage at other proficiency levels. In this scenario, the two item characteristic curves would intersect.
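The difference between the two patterns can be illustrated numerically. The sketch below assumes a two-parameter logistic (2PL) form for the item characteristic curve, with made-up discrimination (a) and difficulty (b) values for each group. Shifting only the difficulty gives curves that never cross (uniform DIF), while changing the discrimination makes them cross (non-uniform DIF). The function names and parameter values are hypothetical.

```python
import math

def icc(theta, a, b):
    """2PL item characteristic curve: probability of a correct response."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def curves_cross(ref, focal, grid):
    """True if the reference-minus-focal difference changes sign on the grid."""
    diffs = [icc(t, *ref) - icc(t, *focal) for t in grid]
    return min(diffs) < 0 < max(diffs)

grid = [x / 10 for x in range(-40, 41)]  # proficiency from -4.0 to 4.0

# Same discrimination, harder for the focal group: curves never intersect.
uniform = curves_cross(ref=(1.0, 0.0), focal=(1.0, 0.5), grid=grid)

# Different discrimination, same difficulty: curves intersect.
non_uniform = curves_cross(ref=(1.5, 0.0), focal=(0.7, 0.0), grid=grid)

print(uniform, non_uniform)  # prints: False True
```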

Figure: Item characteristic curves demonstrating examples of uniform and non-uniform DIF.

In my next post, I will introduce two common methods for detecting uniform and non-uniform DIF: the Mantel-Haenszel method and logistic regression. Unlike the four-fifths rule, these methods account for participants’ abilities (as represented by total scores) before making inferences about each group’s performance on an item.

How should we measure an organization’s level of psychometric expertise?


Posted by Greg Pope

A colleague recently asked for my opinion on an organization’s level of knowledge, experience, and sophistication in applying psychometrics to their assessment program. I came to realize that it was difficult to summarize in words, which got me thinking about why. I concluded that it was because there currently is not a common language to describe how advanced an organization is regarding the psychometric expertise they have and the rigour they apply to their assessment program. I thought maybe if there were such a common vocabulary, it would make conversations like the one I had a whole lot easier.

I thought it might be fun (and perhaps helpful) to come up with a proposed first cut of a shared vocabulary around the levels of psychometric expertise. I wanted to keep it simple, yet effective in allowing people to quickly and easily communicate about where an organization would fall in terms of their level of psychometric sophistication. I thought it might make sense to break it out by areas (I thought of seven) and assign points according to the expertise/rigour an organization has/applies. Not all areas are always led by psychometricians directly, but usually psychometricians play a role.

1.    Item and test level psychometric analysis

  • Classical Test Theory (CTT) and/or Item Response Theory (IRT)
  • Pre hoc analysis (beta testing analysis)
  • Ad hoc analysis (actual assessment)
  • Post hoc analysis (regular reviews over time)

2.    Psychometric analysis of bias and dimensionality

  • Factor analysis or principal component analysis to evaluate dimensionality
  • Differential Item Functioning (DIF) analysis to ensure that items are performing similarly across groups (e.g., gender, race, age, etc.)

3.    Form assembly processes

  • Blueprinting
  • Expert review of forms or item banks
  • Fixed forms, computerized adaptive testing (CAT), automated test assembly

4.    Equivalence of scores and performance standards

  • Standard setting
  • Test equating
  • Scaling scores

5.    Test security

  • Test security plan in place
  • Regular security audits are conducted
  • Statistical analyses are conducted regularly (e.g., collusion and plagiarism detection analysis)

6.    Validity studies

  • Validity studies conducted on new assessment programs and ongoing programs
  • Industry experts review and provide input on study design and findings
  • Improvements are made to the program if required as a result of studies

7.    Reporting

  • Provide information clearly and meaningfully to all stakeholders (e.g., students, parents, instructors, etc.)
  • High quality supporting documentation designed for non-experts (interpretation guides)
  • Frequently reviewed by assessment industry experts and improved as required

Expertise/rigour points
0.    None: Not rigorous, no expertise whatsoever within the organization
1.    Some: Some rigour, marginal expertise within the organization
2.    Full: Highly rigorous, organization has a large amount of experience

So an organization that has decades of expertise in each area would be at the top level of 14 (7 areas x 2 for expertise/rigour in each area = 14). An elementary school doing simple formative assessment would probably be at the lowest level (7 areas x 0 expertise/rigour = 0). I have provided some examples of how organizations might fall into various ranges in the illustration below.
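The scoring arithmetic can be sketched as follows. The area names are taken from the list above; the function name, the dictionary layout, and the example ratings are hypothetical.

```python
AREAS = (
    "Item and test level psychometric analysis",
    "Psychometric analysis of bias and dimensionality",
    "Form assembly processes",
    "Equivalence of scores and performance standards",
    "Test security",
    "Validity studies",
    "Reporting",
)

def expertise_score(ratings):
    """Sum the 0/1/2 expertise/rigour points across all seven areas."""
    missing = [area for area in AREAS if area not in ratings]
    if missing:
        raise ValueError(f"missing ratings for: {missing}")
    for area in AREAS:
        if ratings[area] not in (0, 1, 2):
            raise ValueError(f"rating for {area!r} must be 0, 1, or 2")
    return sum(ratings[area] for area in AREAS)

# Hypothetical organization: full rigour in the first area, some in the rest.
ratings = {area: 1 for area in AREAS}
ratings["Item and test level psychometric analysis"] = 2
print(expertise_score(ratings), "out of", 2 * len(AREAS))  # prints: 8 out of 14
```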

There are obviously lots of caveats and considerations here. One thing to keep in mind is that not all organizations need to have full expertise in all areas. For example, an elementary school that administers formative tests to facilitate learning doesn’t need to have 20 psychometricians working for them doing DIF analysis and equipercentile test equating. Their organization being low on the scale is expected. Another consideration is expense: To achieve the highest level requires a major investment (and maintaining an army of psychometricians isn’t cheap!). Therefore, one would expect an organization that is conducting high stakes testing where people’s lives or futures are at stake based on assessment scores to be at the highest level. It’s also important to remember that some areas are more basic than others and are a starting place. For example, it would be pretty rare for an organization to have a great deal of expertise in the psychometric analysis of bias and dimensionality but no expertise in item and test analysis.

I would love to get feedback on this idea and start a dialog. Does this seem roughly on target? Would it be useful? Is something similar out there that is better that I don’t know about? Or am I just plain out to lunch? Please feel free to comment to me directly or on this blog.

On a related note, Questionmark CEO Eric Shepherd has given considerable thought to the concept of an “Assessment Maturity Model,” which focuses on a broader assessment context. Interested readers should check out: