Develop Better Tests with Item Analysis [New eBook]

Posted by Chloe Mendonca

Item analysis is probably the most important tool for increasing test effectiveness. To write items that accurately and reliably measure what they're intended to measure, you need to examine participant responses to each item. You can use this information to improve test items and to identify unfair or biased items.
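As a minimal illustration of what an item analysis examines (the scored response matrix below is invented for the example), here is a sketch that computes two classical statistics for each item: its difficulty (the p value, or proportion answering correctly) and its item-total point-biserial discrimination:

```python
import numpy as np

# Hypothetical scored responses: rows = participants, columns = items (1 = correct).
responses = np.array([
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [1, 1, 1, 1],
    [0, 0, 0, 1],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
])

total = responses.sum(axis=1)

# Item difficulty: proportion of participants answering each item correctly.
difficulty = responses.mean(axis=0)

# Item-total correlation (point-biserial): correlation between an item's scores
# and the total score; low or negative values flag items worth reviewing.
discrimination = np.array([
    np.corrcoef(responses[:, i], total)[0, 1]
    for i in range(responses.shape[1])
])

print(difficulty)
print(discrimination)
```

These two statistics are the starting point of most classical item analysis reports, including the ones discussed in the eBook.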

So what’s the process for conducting an item analysis? What should you be looking for? How do you determine if a question is “good enough”?

Questionmark has just published a new eBook, “Item Analysis Analytics,” which answers these questions. The eBook shares many examples of the varying statistics that you may come across in your own analyses.

Download this eBook to learn about these aspects of analytics:

  • the basics of classical test theory and item analysis
  • the process of conducting an item analysis
  • essential things to look for in a typical item analysis report
  • whether a question “makes the grade” in terms of psychometric quality

This eBook is available as a PDF and ePUB suitable for viewing on a variety of mobile devices and eReaders.

I hope you enjoy reading it!

G Theory and Reliability for Assessments with Randomly Selected Items

Posted by Austin Fossey

One of our webinar attendees recently emailed me to ask if there is a way to calculate reliability when items are randomly selected for delivery in a classical test theory (CTT) model.

As with so many things, the answer comes from Lee Cronbach—but it’s not Cronbach’s Alpha. In 1963, Cronbach, along with Goldine Gleser and Nageswari Rajaratnam, published a paper on generalizability theory, which is often called G theory for brevity or to sound cooler. G theory is a very powerful set of tools, but today I am focusing on one aspect of it: the generalizability coefficient, which describes the degree to which observed scores might generalize to a broader set of measurement conditions. This is helpful when the conditions of measurement will change for different participants, as is the case when we use different items, different raters, different administration dates, etc.

In G theory, measurement conditions are called facets. A facet might include items, test forms, administration occasions, or human raters. Facets can be random (i.e., they are a sample of a much larger population of potential facets), or they might be fixed, such as a condition that is controlled by the researcher. The hypothetical set of conditions across all possible facets is called, quite grandly, the universe of generalization. A participant’s average measurement across the universe of generalization is called their universe score, which is similar to a true score in CTT, except that we no longer need to assume that all measurements in the universe of generalization are parallel.

In CTT, the concept of reliability is defined as the ratio of true score variance to observed score variance. Observed scores are just true scores plus measurement error, so as measurement error decreases, reliability increases toward 1.00.

The generalizability coefficient is defined as the ratio of universe score variance to expected score variance, which is similar to the concept of reliability in CTT. The generalizability coefficient is made of variance components, which differ depending on the design of the study, and which can be derived from an analysis of variance (ANOVA) summary table. We will not get into the math here, but I recommend Linda Crocker and James Algina’s Introduction to Classical and Modern Test Theory for a great introduction and easy-to-follow examples of how to calculate generalizability coefficients under multiple conditions. For now, let’s return to our randomly selected items.

In his chapter in Educational Measurement, 4th Edition, Edward Haertel illustrated the overlaps between G theory and CTT reliability measures. When all participants see the same items, the generalizability coefficient is made up of the variance components for the participants and for the residual scores, and it yields the exact same value as Cronbach’s Alpha. If the researcher wants to use the generalizability coefficient to generalize to an assessment with more or fewer items, then the result is the same as the Spearman-Brown formula.

But when our participants are each given a random set of items, they are no longer receiving parallel assessments. The generalizability coefficient has to be modified to include a variance component for the items, and the observed score variance is now a function of three things:

  • Error variance.
  • Variance in the item mean scores.
  • Variance in the participants’ universe scores.

Note that error variance is not the same as measurement error in CTT. In the case of a randomly generated assessment, the error variance includes measurement error and an extra component that reflects the lack of perfect correlation between the items’ measurements.

For those of you randomly selecting items, this makes a difference! Cronbach’s Alpha may yield low or even meaningless results (e.g., negative values) when items are randomly selected. In an example dataset, 1,000 participants answered the same 200 items. For this assessment, Cronbach’s Alpha is equivalent to the generalizability coefficient: 0.97. But if each of those participants had answered 50 randomly selected items from the same set, Cronbach’s Alpha would no longer be appropriate. If we tried to use it anyway, we would see a depressing number: 0.50. The generalizability coefficient, however, is 0.65: still too low, but better than the alpha value.
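To make the variance components concrete, here is a sketch on a simulated fully crossed p × i response matrix (not the dataset above; all parameters are invented). It estimates the person, item, and residual variance components from a two-way ANOVA, then builds two coefficients: the same-items coefficient, which reproduces Cronbach's Alpha exactly (Hoyt's ANOVA identity), and the coefficient for a design where each participant draws items at random, where the item variance component joins the error term:

```python
import numpy as np

rng = np.random.default_rng(7)
n_p, n_i = 200, 40                       # participants x items, fully crossed
ability = rng.normal(0, 1, (n_p, 1))
easiness = rng.normal(0, 0.7, (1, n_i))
x = (ability + easiness + rng.normal(0, 1.2, (n_p, n_i)) > 0).astype(float)

# Two-way ANOVA mean squares for the persons-by-items design.
grand = x.mean()
ms_p = n_i * ((x.mean(axis=1) - grand) ** 2).sum() / (n_p - 1)   # persons
ms_i = n_p * ((x.mean(axis=0) - grand) ** 2).sum() / (n_i - 1)   # items
ss_res = ((x - x.mean(axis=1, keepdims=True)
             - x.mean(axis=0, keepdims=True) + grand) ** 2).sum()
ms_res = ss_res / ((n_p - 1) * (n_i - 1))

# Variance components (negative estimates clamped to zero).
var_res = ms_res
var_p = max((ms_p - ms_res) / n_i, 0.0)
var_i = max((ms_i - ms_res) / n_p, 0.0)

# Same items for everyone: the G coefficient equals Cronbach's Alpha.
g_same = var_p / (var_p + var_res / n_i)
alpha = 1 - ms_res / ms_p                # Hoyt's ANOVA form of Alpha

# Randomly drawn items: item-mean variance enters the error term.
g_random = var_p / (var_p + (var_i + var_res) / n_i)

print(alpha, g_same, g_random)
```

Note that in this crossed-data sketch, the random-draw coefficient is lower than the same-items coefficient for a fixed number of items, because item difficulty variance now contributes to error; the 0.65 vs. 0.50 comparison above instead contrasts the properly specified coefficient with Alpha misapplied to sparse, non-parallel data.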

Finally, it is important to report your results accurately. According to the Standards for Educational and Psychological Testing, you can report generalizability coefficients as reliability evidence if it is appropriate for the design of the assessment, but it is important not to use these terms interchangeably. Generalizability is a distinct concept from reliability, so make sure to label it as a generalizability coefficient, not a reliability coefficient. Also, the Standards require us to document the sources of variance that are included (and excluded) from the calculation of the generalizability coefficient. Readers are encouraged to refer to the Standards’ chapter on reliability and precision for more information.

Item Analysis – Two Methods for Detecting DIF

Posted by Austin Fossey

My last post introduced the concept of differential item functioning (DIF). Today, I would like to introduce two common methods for detecting DIF in a classical test theory framework: the Mantel-Haenszel method and the logistic regression method.

I will not go into the details of these two methods, but if you would like to know more, there are many great online resources. I also recommend de Ayala’s book, The Theory and Practice of Item Response Theory, for a great, easy-to-read chapter discussing these two methods.


Mantel-Haenszel

The Mantel-Haenszel method determines whether there is a relationship between group membership and item performance after accounting for participants’ abilities (as represented by total scores). The magnitude of the DIF is represented by a common odds ratio estimate, known as α_MH. In addition to the odds ratio, we can calculate the Cochran-Mantel-Haenszel (CMH) statistic, which follows a chi-squared distribution. CMH indicates whether the observed DIF is statistically significant, though it gives no sense of magnitude as α_MH does.
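As a sketch of the mechanics (all counts below are invented): stratify participants by total score, build one 2 × 2 table of group × correct/incorrect per stratum, then pool across strata for the common odds ratio and the CMH statistic:

```python
import numpy as np

# One 2x2 table per total-score stratum: rows = (reference, focal),
# columns = (correct, incorrect). Counts are hypothetical.
tables = np.array([
    [[30, 10], [20, 20]],
    [[35,  5], [25, 15]],
], dtype=float)

a = tables[:, 0, 0]                       # reference-group correct counts
n = tables.sum(axis=(1, 2))
row1 = tables[:, 0, :].sum(axis=1)
row2 = tables[:, 1, :].sum(axis=1)
col1 = tables[:, :, 0].sum(axis=1)
col2 = tables[:, :, 1].sum(axis=1)

# Mantel-Haenszel common odds ratio (alpha_MH), pooled across strata.
alpha_mh = (tables[:, 0, 0] * tables[:, 1, 1] / n).sum() / \
           (tables[:, 0, 1] * tables[:, 1, 0] / n).sum()

# Cochran-Mantel-Haenszel chi-squared statistic (1 df, continuity-corrected).
expected = row1 * col1 / n
variance = row1 * row2 * col1 * col2 / (n ** 2 * (n - 1))
cmh = (abs((a - expected).sum()) - 0.5) ** 2 / variance.sum()

print(alpha_mh, cmh)
```

With these hypothetical counts, the pooled odds ratio is well above 1 and the CMH statistic exceeds the chi-squared critical value, so the item would be flagged for review.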

Logistic Regression

Unfortunately, the Mantel-Haenszel method is only consistent when investigating uniform DIF. If non-uniform DIF may be present, we can use logistic regression instead. To do this, we run two logistic regression models in which item performance is regressed on total scores (to account for the participants’ abilities) and group membership. One of the models also includes an interaction term between test score and group membership. We then compare the fit of the two models. If the model with the interaction term fits better, there is non-uniform DIF. If the model without the interaction term shows that group membership is a significant predictor of item performance, there is uniform DIF. Otherwise, we can conclude that no DIF is present.
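The model comparison can be sketched with plain NumPy using a small Newton-Raphson logistic fit on simulated data (the 1.2-logit uniform DIF effect and all other parameters are invented). Each nested comparison yields a likelihood-ratio statistic on one degree of freedom:

```python
import numpy as np

def fit_logit(X, y, iters=25):
    """Newton-Raphson logistic regression; returns coefficients and log-likelihood."""
    beta = np.zeros(X.shape[1])
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        W = p * (1 - p)
        beta += np.linalg.solve((X * W[:, None]).T @ X, X.T @ (y - p))
    p = 1.0 / (1.0 + np.exp(-X @ beta))
    return beta, (y * np.log(p) + (1 - y) * np.log(1 - p)).sum()

rng = np.random.default_rng(0)
n = 2000
score = rng.normal(0, 1, n)               # stand-in for total score
group = rng.integers(0, 2, n).astype(float)
p_true = 1.0 / (1.0 + np.exp(-(score - 1.2 * group)))   # uniform DIF built in
y = (rng.random(n) < p_true).astype(float)

ones = np.ones(n)
_, ll_base = fit_logit(np.column_stack([ones, score]), y)
_, ll_group = fit_logit(np.column_stack([ones, score, group]), y)
_, ll_inter = fit_logit(np.column_stack([ones, score, group, score * group]), y)

lr_uniform = 2 * (ll_group - ll_base)      # tests uniform DIF, chi2(1 df)
lr_nonuniform = 2 * (ll_inter - ll_group)  # tests non-uniform DIF, chi2(1 df)

print(lr_uniform, lr_nonuniform)
```

Because the simulated DIF is a pure difficulty shift, the group term improves fit dramatically while the interaction term adds little, which is the signature of uniform DIF under this procedure.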

Just because we find a statistical presence of DIF does not necessarily mean that we need to panic. In Odds Ratio, Delta, ETS Classification, and Standardization Measures of DIF Magnitude for Binary Logistic Regression, Monahan, McHorney, Stump, & Perkins note that it is useful to flag items based on the effect size of the DIF.

Both the Mantel-Haenszel method and the logistic regression method can be used to generate standardized effect sizes. Monahan et al. describe three categories of effect sizes: A, B, and C. These category labels are often generated by DIF or item calibration software, and we interpret them as follows: level A indicates negligible DIF, level B slight to moderate DIF, and level C moderate to large DIF. Flagging rules vary by organization, but it is common for test developers to review only items that fall into levels B and C.
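A common way to put α_MH on the ETS delta scale and assign a letter category can be sketched as follows. This is simplified: the full ETS rule also conditions on statistical significance, which is omitted here, so treat the thresholds as illustrative:

```python
import math

def ets_category(alpha_mh: float) -> str:
    """Classify MH DIF magnitude on the ETS delta scale (significance tests omitted)."""
    delta = -2.35 * math.log(alpha_mh)   # MH delta-DIF transformation
    if abs(delta) < 1.0:
        return "A"   # negligible DIF
    if abs(delta) <= 1.5:
        return "B"   # slight to moderate DIF
    return "C"       # moderate to large DIF

for odds_ratio in (1.0, 1.7, 0.5):
    print(odds_ratio, ets_category(odds_ratio))
```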

Item Analysis – Differential Item Functioning (DIF)

Posted by Austin Fossey

We have talked about item difficulty and item discrimination in a classical test theory framework, and we have discussed how these indices can be used to flag items that may potentially affect the validity of the inferences we make about the assessment results. Another area of item performance that is often used for item retention decisions is item bias, commonly referred to as differential item functioning (DIF).

DIF studies are generally implemented to see how the performances of two groups compare on a single item in an assessment (though studies can be done with more than two groups). One group is typically referred to as the reference group, and the other is the focal group. The focal group is the group that theory or previous research suggests may be disadvantaged by the item.

One simple method that some practitioners use is based on the four-fifths rule, which is detailed in the Uniform Guidelines for Employee Selection Procedures. This method involves comparing the correct response rates (p values) for the two groups. If the ratio of the smaller p value to the larger p value is less than 0.80, then the item may be adversely impacting the group with the lower p value. For example, if 50% of males answer an item correctly and 75% of females answer it correctly, then 0.50/0.75 ≈ 0.67 < 0.80, so we may be concerned that the item is adversely affecting the response patterns of males.
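The arithmetic is simple enough to script (the function name and arguments are just illustrative):

```python
def four_fifths_flag(p_group_a: float, p_group_b: float) -> bool:
    """Flag an item when the ratio of the smaller to the larger p value falls below 0.80."""
    ratio = min(p_group_a, p_group_b) / max(p_group_a, p_group_b)
    return ratio < 0.80

# The example above: 0.50/0.75 is about 0.67, so the item is flagged.
print(four_fifths_flag(0.50, 0.75))
```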

The four-fifths rule is attractive because it is easy to calculate, but it is prone to sampling error and misinterpretation. Continuing with our example, what if the population of males on average actually knows less about the content than the population of females? Then we would expect to see large differences in p values for the two groups because this reflects the actual differences in ability in the population.

In Differential Item Functioning (eds. Holland & Wainer, 1993), Angoff explains that DIF is occurring when an item displays different statistical properties between groups after those groups have been matched on a measure of proficiency. To put it another way, we need to first account for differences in the groups’ abilities, and then see if there are still differences in the item performance.

There are many ways to investigate DIF while accounting for participants’ abilities, and your choice of method may be influenced by whether you are using item response theory (IRT) for your student model, whether you have missing data, and whether the DIF is uniform or non-uniform.

Uniform DIF indicates that one group is (on average) always at a disadvantage when responding to the item. If we were to create item characteristic curves for the two groups, they would not intersect. Non-uniform DIF means that one group has an advantage for some proficiency levels, but is at a disadvantage at other proficiency levels. In this scenario, the two item characteristic curves would intersect.
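This can be illustrated with two-parameter logistic item characteristic curves (the parameter values are invented): a pure difficulty shift between groups produces uniform DIF, so the curves never cross, while a slope difference produces non-uniform DIF, so the curves cross:

```python
import numpy as np

def icc(theta, a, b):
    """Two-parameter logistic item characteristic curve."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

theta = np.linspace(-4, 4, 161)
ref = icc(theta, a=1.0, b=0.0)           # reference group's curve

# Uniform DIF: same slope, shifted difficulty -> one group always disadvantaged.
uniform_gap = ref - icc(theta, a=1.0, b=0.5)

# Non-uniform DIF: different slope -> the advantage flips across proficiency levels.
nonuniform_gap = ref - icc(theta, a=1.8, b=0.0)

print((uniform_gap > 0).all())                                   # gap keeps one sign
print((nonuniform_gap > 0).any() and (nonuniform_gap < 0).any()) # gap changes sign
```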

Item characteristic curves demonstrating examples of uniform and non-uniform DIF.

In my next post, I will introduce two common methods for detecting uniform and non-uniform DIF: the Mantel-Haenszel method and logistic regression. Unlike the four-fifths rule, these methods account for participants’ abilities (as represented by total scores) before making inferences about each group’s performance on an item.

Item Analysis Report Revisited

Posted by Austin Fossey

If you are a fanatical follower of our Questionmark blog, then you already know that we have written more than a dozen articles relating to item analysis in a Classical Test Theory framework. So you may ask, “Austin, why does Questionmark write so much about item analysis statistics? Don’t you ever get out?”

Item analysis statistics are some of the easiest-to-use indicators of item quality, and they are tools that any test developer should be using in their work. By helping people understand these tools, we can help them get the most out of our technologies. And yes, I do get out. I went out to get some coffee once last April.

So why are we writing about item analysis statistics again? Since publishing many of the original blog articles about item analysis, Questionmark has built a new version of the Item Analysis Report in Questionmark Analytics, adding filtering capabilities beyond those of the original Question Statistics Report in Enterprise Reporter.

In my upcoming posts, I will revisit the concepts of item difficulty, item-total score correlation, and high-low discrimination in the context of the Item Analysis Report in Analytics. I will also provide an overview of item reliability and how it would be used operationally in test development.

Screenshot of the Item Analysis Report (Summary View) in Questionmark Analytics

A New Workshop on Interpreting Item and Test Analyses

Posted by Joan Phaup

Item and test analyses bring the most value when they are interpreted in an organizational context. Questionmark Analytics and Psychometrics Manager Greg Pope’s upcoming workshop on this subject will help participants make the most effective use of the valuable information they get from test and item analysis. The workshop will combine classical test theory with hands-on learning, using Questionmark reporting tools to analyze exemplar assessments and test questions. Attendees are welcome to bring their own item and test analysis reports to discuss during the session.

I spent a few minutes with Greg the other day, asking him for more details about this workshop, which will take place the morning of Tuesday, March 15th, one of two workshops preceding the Questionmark 2011 Users Conference.

Q: What value do organizations get from item analysis and test analysis reports?

A: Item and test analysis reports provide invaluable psychometric information regarding the performance of assessments and of their building blocks, items. Creating assessments composed of questions that all perform well benefits both the organization funding the assessment program and the participant taking the assessment. The organization benefits by providing assessments that are valid and reliable (and therefore legally defensible), and it may be able to use fewer questions on assessments to achieve the same measurement power. Organizations and participants can have confidence that the scores obtained from the assessments reflect, to a high degree, what participants know and can do. Item and test analyses allow organizations to know which questions are performing well, which questions are not, and most importantly, WHY.

Q: What are the challenges in using these reports effectively?

A: I think the main challenges center around a psychological barrier to entry. Many people feel anxiety at the thought of having to read and interpret something they have likely had little to no exposure to in their lives. Psychometrics is a specialized area, to be sure, but applying its basic foundations does not need to be akin to summiting Everest. I feel strongly that it is possible to give people the basic knowledge around item and test analysis in only a few hours and break down the psychological firewalls that often hinder using these reports effectively.

Q: How can individuals and organizations surmount these challenges?

A: I feel a gentle introduction to the subject area, with lots of practical examples in plain English, does the trick nicely. Sometimes psychometricians are accused of being pedantic, whether intentionally or unintentionally, making this information inaccessible to many people. I want to break down these barriers because I feel that the more people who understand and can use psychometrics to improve assessment, the better off we all will be. I have tried to increase people’s understanding through my blog posts, and I am really looking forward to personalizing this approach further in the workshop at the users conference.

Q: How have you structured the workshop?

A: I have structured the workshop to provide some of the basic theory behind item and test analysis and then get hands-on with practical examples in different contexts. When I have done these workshops in the past, I have found that at first people can be skeptical of their own capacity to learn and apply knowledge in this area. However, by the end of the workshops I see people excited and energized by their newfound knowledge, getting really involved in picking apart questions based on the item analysis report information. It is really inspiring for me to see people walk away with newfound confidence and motivation to apply what they have learned when they get back to their jobs.

Q: What do you want people to take away with them from this session?

A: I want people to take away a newfound comfort level with the basics of psychometrics so that they can go back to their desks, run their item and test analysis reports, have confidence that they know how to identify good and bad items, and do something with that knowledge to improve the quality of their assessments.

You can sign up for this workshop at the same time you register for the conference (remembering that this Friday, January 21st, is the last day for earlybird savings). If you’re already registered for the conference, email to arrange for participation in the workshop. Click here to see the conference schedule.