Item Analysis – Two Methods for Detecting DIF

Posted by Austin Fossey

My last post introduced the concept of differential item functioning. Today, I would like to introduce two common methods for detecting DIF in a classical test theory framework: the Mantel-Haenszel method and the logistic regression method.

I will not go into the details of these two methods, but if you would like to know more, there are many great online resources. I also recommend de Ayala’s book, The Theory and Practice of Item Response Theory, for a great, easy-to-read chapter discussing these two methods.


Mantel-Haenszel

The Mantel-Haenszel method determines whether or not there is a relationship between group membership and item performance, after accounting for participants’ abilities (as represented by total scores). The magnitude of the DIF is represented with a common odds ratio estimate, known as αMH, which is often reported on the log scale. In addition to the odds ratio, we can calculate the Cochran-Mantel-Haenszel (CMH) statistic, which follows a chi-squared distribution with one degree of freedom. CMH shows whether or not the observed DIF is statistically significant, though it gives no sense of magnitude as αMH does.
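
As a rough illustration of the calculation described above, here is a minimal Python sketch. It assumes participants have already been grouped into strata by total score, and that each stratum is summarized as a 2x2 table of counts. The function name, the argument layout, and the use of the 0.5 continuity correction are illustrative choices, not taken from any particular DIF package.

```python
import math

def mantel_haenszel(strata):
    """Return (alpha_mh, log_alpha_mh, cmh_chi2) for one item.

    strata: list of (A, B, C, D) counts, one tuple per total-score level, where
    A/B = reference group correct/incorrect and C/D = focal group correct/incorrect.
    """
    num = den = 0.0
    sum_a = sum_expected = sum_var = 0.0
    for a, b, c, d in strata:
        n = a + b + c + d
        if n < 2:
            continue  # a stratum this small carries no information
        # Pieces of the common odds ratio estimate
        num += a * d / n
        den += b * c / n
        # Pieces of the CMH statistic: observed, expected, and variance of A
        n_ref, n_foc = a + b, c + d
        m_correct, m_incorrect = a + c, b + d
        sum_a += a
        sum_expected += n_ref * m_correct / n
        sum_var += n_ref * n_foc * m_correct * m_incorrect / (n * n * (n - 1))
    alpha_mh = num / den
    # Continuity-corrected chi-squared statistic with 1 degree of freedom
    cmh_chi2 = (abs(sum_a - sum_expected) - 0.5) ** 2 / sum_var
    return alpha_mh, math.log(alpha_mh), cmh_chi2

# Hypothetical counts for two score levels
alpha, log_alpha, chi2 = mantel_haenszel([(15, 5, 10, 10), (18, 2, 14, 6)])
```

An αMH near 1 (log odds ratio near 0) suggests the item behaves similarly for both groups at matched ability levels, and a CMH value above roughly 3.84 would be significant at the 0.05 level for one degree of freedom.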

Logistic Regression

Unfortunately, the Mantel-Haenszel method is only consistent when investigating uniform DIF. If non-uniform DIF may be present, we can use logistic regression to investigate the presence of DIF. To do this, we run two logistic regression models where item performance is regressed on total score (to account for the participants’ abilities) and group membership. One of the models also includes an interaction term between total score and group membership. We can then compare the fit of the two models. If the model with the interaction term fits significantly better, then there is non-uniform DIF. If the model with no interaction term shows that group membership is a significant predictor of item performance, then there is uniform DIF. Otherwise, we can conclude that there is no DIF present.
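
The model comparison above can be sketched in plain Python. This is a minimal illustration under several assumptions: the item is scored 0/1, group membership is coded 0/1, the models are fit with a hand-rolled Newton-Raphson routine (in practice you would likely use a statistics package), and group membership is tested with a likelihood ratio against a score-only model. All function names are hypothetical.

```python
import math

def _sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def _solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x

def max_log_likelihood(X, y, iterations=25):
    """Fit a logistic regression by Newton-Raphson and return its log-likelihood."""
    k = len(X[0])
    beta = [0.0] * k
    for _ in range(iterations):
        grad = [0.0] * k
        hess = [[0.0] * k for _ in range(k)]
        for row, yi in zip(X, y):
            p = _sigmoid(sum(b * v for b, v in zip(beta, row)))
            for i in range(k):
                grad[i] += (yi - p) * row[i]
                for j in range(k):
                    hess[i][j] += p * (1.0 - p) * row[i] * row[j]
        beta = [b + s for b, s in zip(beta, _solve(hess, grad))]
    ll = 0.0
    for row, yi in zip(X, y):
        p = _sigmoid(sum(b * v for b, v in zip(beta, row)))
        ll += math.log(p) if yi == 1 else math.log(1.0 - p)
    return ll

def logistic_dif(scores, groups, responses):
    """Likelihood ratio statistics (each with 1 df) for uniform and non-uniform DIF."""
    score_only = [[1.0, s] for s in scores]
    with_group = [[1.0, s, g] for s, g in zip(scores, groups)]
    with_interaction = [[1.0, s, g, s * g] for s, g in zip(scores, groups)]
    ll0, ll1, ll2 = (max_log_likelihood(m, responses)
                     for m in (score_only, with_group, with_interaction))
    return {"uniform_g2": 2.0 * (ll1 - ll0), "nonuniform_g2": 2.0 * (ll2 - ll1)}
```

Each statistic can be compared with the chi-squared critical value for one degree of freedom (about 3.84 at the 0.05 level). The sketch does no checking for perfectly separated data, where the Newton-Raphson steps would not converge.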

Just because we find a statistical presence of DIF does not necessarily mean that we need to panic. In “Odds Ratio, Delta, ETS Classification, and Standardization Measures of DIF Magnitude for Binary Logistic Regression,” Monahan, McHorney, Stump, & Perkins note that it is useful to flag items based on the effect size of the DIF.

Both the Mantel-Haenszel method and the logistic regression method can be used to generate standardized effect sizes. Monahan et al. provide three categories of effect sizes: A, B, and C. These category labels are often generated by DIF or item calibration software, and we interpret them as follows: level A indicates negligible DIF, level B indicates slight to moderate DIF, and level C indicates moderate to large DIF. Flagging rules vary by organization, but it is common for test developers to review only items that fall into levels B and C.
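
To make the A/B/C labels concrete, here is a simplified sketch based on the ETS delta scale, where the Mantel-Haenszel odds ratio is converted with delta = -2.35 × ln(αMH). It assumes the commonly cited magnitude thresholds of 1.0 and 1.5. The full ETS rules also involve significance tests, which this sketch deliberately leaves out, so treat it as an approximation rather than the rule any particular software applies.

```python
import math

def mh_delta(alpha_mh):
    """Convert a Mantel-Haenszel odds ratio to the ETS delta scale."""
    return -2.35 * math.log(alpha_mh)

def ets_category(alpha_mh):
    """Magnitude-only A/B/C label (significance tests intentionally omitted)."""
    size = abs(mh_delta(alpha_mh))
    if size < 1.0:
        return "A"  # negligible DIF
    if size < 1.5:
        return "B"  # slight to moderate DIF
    return "C"      # moderate to large DIF
```

Because the label depends on the absolute value of delta, an odds ratio of 2.0 and an odds ratio of 0.5 land in the same category. Only the direction of the DIF differs.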

Conceptual Assessment Framework: Building the Evidence Model

Posted by Austin Fossey

In my previous posts, I introduced the student model and the task model—two of the three sections of the Conceptual Assessment Framework (CAF) in Evidence-Centered Design (ECD).

The student and task models are linked by the evidence model. The evidence model has two components: the evaluation / evidence identification component and the measurement model / evidence accumulation component (e.g., Design and Discovery in Educational Assessment: Evidence-Centered Design, Psychometrics, and Educational Data Mining; Mislevy, Behrens, Dicerbo, & Levy, 2012).

The evaluation component defines how we identify and collect evidence in the responses or work products produced by the participant in the context of the task model. For the evaluation component, we must ask ourselves what is it we are looking for as evidence about the participant’s ability, and how will we store that evidence?

In a multiple choice item, the evaluation component is simply whether or not the participant selected the item key, but evidence identification can be more complex. Consider drag-and-drop items where you may need to track the options the participant chose as well as their order. In hot spot items, the evaluation component consists of capturing the coordinates of the participant’s selection in relation to a set of item key coordinates.

Some simulation assessments will collect information about the context of the participant’s response (i.e., was it the correct response given the state of the simulation at that moment?), and others consider aspects of the participant’s response patterns, such as sequence and efficiency (i.e., what order did the participant perform the response steps, and were there any extraneous steps?).

In the measurement model component, we define how evidence is scored and how those scores are aggregated into measures that can be used in the student model.

In a multiple choice assessment using Classical Test Theory (CTT), the measurement model may be simple: if the participant selects the item key, we award one point, then create an overall score measure by adding up the points. Partial credit scoring is another option for a measurement model. Raw scores may be transformed into a percentage score, which is the aggregation method used for many assessments built with Questionmark. Questionmark also provides a Scoring Tool for external measurement models, such as rubric scoring of essay items.
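
The simple CTT case can be sketched in a few lines of Python. The function and the item identifiers are hypothetical, and the sketch assumes one point per keyed response with no partial credit.

```python
def score_attempt(responses, keys):
    """Score one multiple choice attempt.

    responses: dict of item_id -> option the participant selected
    keys:      dict of item_id -> keyed (correct) option
    Returns (points per item, raw score, percentage score).
    """
    # Evidence identification: did the participant select the item key?
    points = {item: 1 if responses.get(item) == key else 0
              for item, key in keys.items()}
    # Evidence accumulation: sum the points, then express as a percentage
    raw = sum(points.values())
    percent = 100.0 * raw / len(keys)
    return points, raw, percent

keys = {"q1": "B", "q2": "D", "q3": "A", "q4": "C"}
responses = {"q1": "B", "q2": "A", "q3": "A"}  # q4 left unanswered
points, raw, percent = score_attempt(responses, keys)
```

An unanswered item is treated as incorrect here, which is a design choice rather than a requirement of CTT.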

Measurement models can also be more complex depending on the assessment design. Item Response Theory (IRT) is another commonly used measurement model that provides probabilistic estimates of participants’ abilities based on each participant’s response pattern and the difficulty and discrimination of the items. Some simulation assessments also use logical scoring trees, regression models, Bayes Nets, network analyses or a combination of these methods to score work products and aggregate results.


Example of a simple evidence model structure showing the relationships between evidence identification and accumulation.

Using Questionmark’s OData API to Create a Response Matrix

Posted by Austin Fossey

A response matrix is a table of data in which each row represents a participant’s assessment attempt and each column represents an item. The cells show the score that each participant received for each item – valuable information that can help you with psychometric analysis.

The Questionmark OData API enables you to create this and other custom data files by giving you flexible, direct access to raw item-level response data.

You can already see participants’ item-level response data in Questionmark reports, but the Questionmark reports group data together for one assessment at a time.

If you have a large-scale assessment design with multiple equated forms, you may want to generate a matrix that shows response data for common items that are used across the forms.

The example below shows a response matrix created with OData in Microsoft Excel 2013 using the PowerPivot add-in. The cells in a response matrix are coded with the score that the participant received for each item (e.g., 1 = correct and 0 = incorrect). (If an item was not delivered to a participant, the cell will be returned blank, though you can impute other values as needed.)
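
If you prefer scripting to a spreadsheet, the same pivot can be sketched in Python. This assumes the item-level data have already been retrieved and flattened into (attempt, item, score) records. The record layout and names are hypothetical and do not reflect the actual OData feed structure.

```python
def build_response_matrix(records):
    """Pivot item-level records into a response matrix.

    records: list of (attempt_id, item_id, score) tuples
    Returns (item_ids, rows), where rows maps attempt_id -> list of scores in
    item_ids order, with None for items not delivered in that attempt.
    """
    item_ids = sorted({item for _, item, _ in records})
    by_attempt = {}
    for attempt, item, score in records:
        by_attempt.setdefault(attempt, {})[item] = score
    rows = {attempt: [scores.get(item) for item in item_ids]
            for attempt, scores in by_attempt.items()}
    return item_ids, rows

records = [
    ("attempt1", "item1", 1), ("attempt1", "item2", 0),
    ("attempt2", "item2", 1), ("attempt2", "item3", 1),
]
item_ids, rows = build_response_matrix(records)
```

Here None plays the role of the blank cell for an item that was not delivered.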


You can use OData to create a response matrix that can be used for form equating or as input files for item calibration in Item Response Theory (IRT) software. These data are also helpful if you want to check a basic item-level calculation, like the p-value for the item across all assessments. (Note that item-total correlations can only be calculated if the total score has been equated for all forms.)
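
As a small example of the basic item-level check mentioned above, the following sketch computes each item's p-value from a response matrix. It assumes dichotomous 0/1 scores and uses None for items that were not delivered. The data layout is hypothetical.

```python
def item_p_values(item_ids, rows):
    """Proportion correct per item, counting only attempts where it was delivered.

    item_ids: list of item identifiers, in column order
    rows:     dict of attempt_id -> list of scores (1, 0, or None if not delivered)
    """
    p_values = {}
    for col, item in enumerate(item_ids):
        delivered = [row[col] for row in rows.values() if row[col] is not None]
        p_values[item] = sum(delivered) / len(delivered) if delivered else None
    return p_values

rows = {
    "attempt1": [1, 0, None],
    "attempt2": [None, 1, 1],
    "attempt3": [1, 1, 0],
}
p_values = item_p_values(["item1", "item2", "item3"], rows)
```

Skipping undelivered items, rather than counting them as incorrect, keeps the p-value comparable across forms that share the item.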

Visit Questionmark’s website for more information about the OData API. (If you are a Questionmark software support plan customer, you can get step-by-step instructions for using OData to create a response matrix in the Premium Videos section of the Questionmark Learning Café.)

Standard Setting: Bookmark Method Overview

Posted by Austin Fossey

In my last post, I spoke about using the Angoff Method to determine cut scores in a criterion-referenced assessment. Another commonly used method is the Bookmark Method. While both can be applied to a criterion-referenced assessment, Bookmark is often used in large-scale assessments with multiple forms or vertical score scales, such as some state education tests.

In their chapter entitled “Setting Performance Standards” in Educational Measurement (4th ed.), Ronald Hambleton and Mary Pitoniak describe many commonly used standard setting procedures. Hambleton and Pitoniak classify the Bookmark as an “item mapping method,” which means that standard setters are presented with an ordered item booklet that is used to map the relationship between item difficulty and participant performance.

In Bookmark, item difficulty must be determined a priori. Note that the Angoff Method does not require us to have item statistics for the standard setting to take place, but we usually will have the item statistics to use as impact data. With Bookmark, item difficulty must be calculated with an item response theory (IRT) model before the standard setting.

Once the items’ difficulty parameters have been established, the psychometricians will assemble the items into an ordered item booklet. Each item gets its own page in the booklet, and the items are ordered from easiest to hardest, such that the hardest item is on the last page.

Each rater receives an ordered item booklet. The raters go through the entire booklet once to read every item. They then go back through and place a bookmark between the two items in the booklet that represent the cut point for what minimally qualified participants should know and be able to do.

Psychometricians will often ask raters to place the bookmark at the item where 67% of minimally qualified participants will get the item right. 67% is called the response probability, and it is an easy value for raters to use because they just pick the item where about two-thirds of minimally qualified participants will get the item right. Other response probabilities can be used (e.g., 50% of minimally qualified participants), and Hambleton and Pitoniak describe some of the issues around this decision in more detail.
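
To show how a response probability maps onto the score scale, here is a sketch that assumes a two-parameter logistic IRT model on the logistic metric (no 1.7 scaling constant), with hypothetical discrimination (a) and difficulty (b) values. Solving P(θ) = RP for θ gives θ = b + ln(RP / (1 - RP)) / a. Ordering the booklet by this location is one common approach. Other programs may order by difficulty alone.

```python
import math

def rp_location(a, b, rp=0.67):
    """Ability at which a participant has probability rp of answering correctly (2PL)."""
    return b + math.log(rp / (1.0 - rp)) / a

def ordered_item_booklet(items, rp=0.67):
    """items: dict of item_id -> (a, b). Returns item ids from easiest to hardest."""
    return sorted(items, key=lambda item: rp_location(*items[item], rp=rp))

# Hypothetical item parameters
items = {"x": (1.0, 0.5), "y": (1.0, -1.0), "z": (2.0, 0.0)}
booklet = ordered_item_booklet(items)
```

With a response probability of 50%, the location is simply the item's difficulty parameter. With 67%, each item's location shifts upward by about 0.71 / a.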

After each rater has placed a bookmark, the process is similar to Angoff. The item difficulties corresponding to each bookmark are averaged, the raters discuss the result, impact data can be reviewed, and then raters reset their bookmarks before the final cut score is determined. I have also seen larger programs break raters into groups of five people, and each group has its own discussion before bringing its recommended cut score to the larger group. This cuts down on discussion time and keeps any one rater from hijacking the whole group.
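
The averaging step might look like the sketch below. It assumes each rater's bookmark is recorded as a page number in the ordered item booklet and that the cut score is the mean of the scale locations of the bookmarked items. Programs differ on exactly which item a bookmark refers to, so the page-to-item convention here is an assumption.

```python
from statistics import mean

def cut_score(bookmarks, locations):
    """Average the scale locations of the items that raters bookmarked.

    bookmarks: dict of rater -> page number (1-based) of the bookmarked item
    locations: list of item locations in booklet order (easiest first)
    """
    return mean(locations[page - 1] for page in bookmarks.values())

# Hypothetical booklet of five items and three raters
locations = [-1.2, -0.4, 0.1, 0.6, 1.3]
round_one = cut_score({"rater1": 3, "rater2": 4, "rater3": 4}, locations)
```

The same function can be rerun after raters adjust their bookmarks, which makes it easy to see how much the recommended cut score moved between rounds.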

The same process can be followed if we have more than two classifications for the assessment. For example, instead of Pass and Fail, we may have Novice, Proficient, and Advanced. We would need to determine what makes a participant Advanced instead of Proficient, but the same response probability should be used when placing the bookmarks for these two categories.

How should we measure an organization’s level of psychometric expertise?


Posted by Greg Pope

A colleague recently asked for my opinion on an organization’s level of knowledge, experience, and sophistication applying psychometrics to their assessment program. I came to realize that it was difficult to summarize in words, which got me thinking why. I concluded that it was because there currently is not a common language to describe how advanced an organization is regarding the psychometric expertise they have and the rigour they apply to their assessment program. I thought maybe if there were such a common vocabulary, it would make conversations like the one I had a whole lot easier.

I thought it might be fun (and perhaps helpful) to come up with a proposed first cut of a shared vocabulary around the levels of psychometric expertise. I wanted to keep it simple, yet effective in allowing people to quickly and easily communicate about where an organization would fall in terms of their level of psychometric sophistication. I thought it might make sense to break it out by areas (I thought of seven) and assign points according to the expertise/rigour an organization contains/applies. Not all areas are always led by psychometricians directly, but usually psychometricians play a role.

1.    Item and test level psychometric analysis

  • Classical Test Theory (CTT) and/or Item Response Theory (IRT)
  • Pre hoc analysis (beta testing analysis)
  • Ad hoc analysis (actual assessment)
  • Post hoc analysis (regular reviews over time)

2.    Psychometric analysis of bias and dimensionality

  • Factor analysis or principal component analysis to evaluate dimensionality
  • Differential Item Functioning (DIF) analysis to ensure that items are performing similarly across groups (e.g., gender, race, age, etc.)

3.    Form assembly processes

  • Blueprinting
  • Expert review of forms or item banks
  • Fixed forms, computerized adaptive testing (CAT), automated test assembly

4.    Equivalence of scores and performance standards

  • Standard setting
  • Test equating
  • Scaling scores

5.    Test security

  • Test security plan in place
  • Regular security audits are conducted
  • Statistical analyses are conducted regularly (e.g., collusion and plagiarism detection analysis)

6.    Validity studies

  • Validity studies conducted on new assessment programs and ongoing programs
  • Industry experts review and provide input on study design and findings
  • Improvements are made to the program if required as a result of studies

7.    Reporting

  • Provide information clearly and meaningfully to all stakeholders (e.g., students, parents, instructors, etc.)
  • High quality supporting documentation designed for non-experts (interpretation guides)
  • Frequently reviewed by assessment industry experts and improved as required

Expertise/rigour points
0.    None: Not rigorous, no expertise whatsoever within the organization
1.    Some: Some rigour, marginal expertise within the organization
2.    Full: Highly rigorous, organization has a large amount of experience

So an organization that has decades of expertise in each area would be at the top level of 14 (7 areas x 2 for expertise/rigour in each area = 14). An elementary school doing simple formative assessment would probably be at the lowest level (7 areas x 0 expertise/rigour = 0). I have provided some examples of how organizations might fall into various ranges in the illustration below.
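
The arithmetic is simple enough to sketch in Python. The area names are shortened versions of the seven areas listed above, and the function name is made up for illustration.

```python
AREAS = [
    "item and test analysis",
    "bias and dimensionality",
    "form assembly",
    "equivalence of scores and standards",
    "test security",
    "validity studies",
    "reporting",
]

def expertise_level(ratings):
    """Sum the 0 (none), 1 (some), or 2 (full) rating given to each of the seven areas."""
    for area in AREAS:
        if ratings[area] not in (0, 1, 2):
            raise ValueError(f"rating for {area!r} must be 0, 1, or 2")
    return sum(ratings[area] for area in AREAS)

# An organization with full expertise in every area reaches the top level of 14
top_level = expertise_level({area: 2 for area in AREAS})
```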

There are obviously lots of caveats and considerations here. One thing to keep in mind is that not all organizations need to have full expertise in all areas. For example, an elementary school that administers formative tests to facilitate learning doesn’t need to have 20 psychometricians working for them doing DIF analysis and equipercentile test equating. Their organization being low on the scale is expected. Another consideration is expense: To achieve the highest level requires a major investment (and maintaining an army of psychometricians isn’t cheap!). Therefore, one would expect an organization that is conducting high stakes testing where people’s lives or futures are at stake based on assessment scores to be at the highest level. It’s also important to remember that some areas are more basic than others and are a starting place. For example, it would be pretty rare for an organization to have a great deal of expertise in the psychometric analysis of bias and dimensionality but no expertise in item and test analysis.

I would love to get feedback on this idea and start a dialog. Does this seem roughly on target? Would it be useful? Is something similar out there that is better that I don’t know about? Or am I just plain out to lunch? Please feel free to comment to me directly or on this blog.

On a related note, Questionmark CEO Eric Shepherd has given considerable thought to the concept of an “Assessment Maturity Model,” which focuses on a broader assessment context. Interested readers should check out:

How the sample of participants being tested affects item analysis information


Posted by Greg Pope

Ever think about who took the test when you are interpreting your item analysis report? Maybe you should! Classical Test Theory (CTT) item analysis information is very much based on the sample of participants who took the test.

Hold on a second, what is a sample? What is the difference between a sample and a population? Well, a sample is a selection from a population. If your population is composed of all 1.5 million people in the United States who will write a college entrance exam in a year, a sample of this population could be 1,000 people selected based on certain criteria (e.g., age, gender, ethnicity, etc.). If we were to beta test questions that we hope to include on an upcoming college entrance exam, it is usually not possible or practical to beta test all 1.5 million people in the population, so one or more representative samples are selected to beta test the questions.

As I mentioned, the sample of participants taking an assessment has an impact on the difficulty and discrimination statistics that you will obtain in your CTT item analysis. For example, if you administered the college entrance exam beta test to a sample of gifted students who are the best and brightest, the Item Analysis Report is going to come back showing that all your questions are easy (p-values close to 1) and you probably won’t get very high discrimination statistics. However, we know that the population of people taking college entrance exams is not all composed of the best and brightest, so this sample is not an accurate representation of the population (we say the sample is not representative). It would not be wise to try to build the actual college entrance exam form from the beta test results from only this one sample of bright students because the item statistics would not reflect the population of students that will be tested.
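
A toy calculation can make the sample dependence visible. The sketch below assumes, purely for illustration, that the chance of a correct answer follows a simple logistic curve in the gap between a participant's ability and the item's difficulty. The ability values are invented, and the function returns the p-value we would expect from each sample for the same item.

```python
import math

def expected_p_value(abilities, difficulty):
    """Expected proportion correct for one item in a given sample of abilities."""
    probabilities = [1.0 / (1.0 + math.exp(-(theta - difficulty)))
                     for theta in abilities]
    return sum(probabilities) / len(probabilities)

representative_sample = [-2.0, -1.0, 0.0, 1.0, 2.0]  # spread around the item difficulty
gifted_sample = [1.5, 2.0, 2.5, 3.0]                 # all well above the item difficulty

p_representative = expected_p_value(representative_sample, difficulty=0.0)
p_gifted = expected_p_value(gifted_sample, difficulty=0.0)
```

The item itself never changes, yet it looks moderately difficult in the representative sample and easy in the gifted one.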

Using strong sampling methods will help ensure that the statistics you get are appropriate. Typing in a search word like “Sampling” in your favorite online book store will yield numerous suggestions for some fun reading on this subject. If you don’t have the time or inclination to do some light reading on sampling methods in your spare time, start with the obvious: think about the target population of test takers who are going to take the test and, if you are beta testing questions, try to obtain samples that reflect that population. In a previous blog post I talked more about beta testing.

As an aside, Item Response Theory (IRT) advocates will be quick to point out that IRT doesn’t have the same sample dependency challenges as CTT. I’ll discuss that at another time!