Item analysis: Selecting items for the test form – Part 1
Regular readers of our blog know that we ran an initial series on item analysis way back in the day, and then I did a second item analysis series building on that a couple of years ago, and then I discussed item analysis in our item development series, and then we had an amazing webinar about item analysis, and then I named my goldfish Item Analysis and wrote my senator requesting that our state bird be changed to an item analysis. So today, I would like to talk about . . . item analysis.
But don’t worry, this is actually a new topic for the blog.
Interested in learning more about item analysis? I will be presenting a series of workshops on this topic at the 2016 Questionmark Conference: Shaping the Future of Assessment in Miami, April 12-15.
Today, I am writing about the use of item statistics for item selection. I was surprised to find, from feedback we got from many of our webinar participants, that a lot of people do not look at their item statistics until after the test has been delivered. This is a great practice (so keep it up), but if you can try out the questions as unscored field test items before making your final test form, you can use the item analysis statistics to build a better instrument.
When building a test form, item statistics can help us in two ways.
- They can help us identify items that are poorly written, miskeyed, or irrelevant to the construct.
- They can help us select the items that will yield the most reliable instrument, and thus a more accurate score.
In the early half of the 20th century, it was common belief that good test instruments should have a mix of easy, medium, and hard items, but this thinking began to change after two studies in 1952 by Fred Lord and by Lee Cronbach and Willard Warrington. These researchers (and others since) demonstrated that items with higher discrimination values create instruments whose total scores discriminate better among participants across all ability levels.
Sometimes easy and hard items are useful for measurement, such as in an adaptive aptitude test where we need to measure all abilities with similar precision. But in criterion-referenced assessments, we are often interested in correctly classifying those participants who should pass and those who should fail. If this is our goal, then the best test form will be one with a range of medium-difficulty items that also have high discrimination values.
Discrimination may be the primary statistic used for selecting items, but item reliability is also occasionally useful, as I explained in an earlier post. Item reliability can be used as a tie breaker when we need to choose between two items with the same discrimination, or it can be used to predict the reliability or score variance for a set of items that the test developer wants to use for a test form.
Difficulty is still useful for flagging items, though an item flagged for being too easy or too hard will often have a low discrimination value too. If an easy or hard item has good discrimination, it may be worth reviewing for item flaws or other factors that may have impacted the statistics (e.g., was it given at the end of a timed test that did not give participants enough time to respond carefully).
In my next post, I will share an example from the webinar of how item selection using item discrimination improves the test form reliability, even though the test is shorter. I will also share an example of a flawed item that exhibits poor item statistics.
Interested in learning more about item analysis? I will be presenting a series of workshops on this topic at the 2016 Questionmark Conference: Shaping the Future of Assessment in Miami, April 12-15. I look forward to seeing you there! Click here to register and learn more about this important learning event.