In my last post, I talked about how item discrimination is the primary statistic used for item selection in classical test theory (CTT). In this post, I will share an example from my item analysis webinar.
The assessment below is fake, so there’s no need to write in comments telling me that the questions could be written differently or that the test is too short or that there is not good domain representation or that I should be banished to an island.
Interested in learning more about item analysis? I will be presenting a series of workshops on this topic at the Questionmark Conference 2016: Shaping the Future of Assessment in Miami, April 12-15.
In this example, we have field tested 16 items and collected item statistics from a representative sample of 1,000 participants. In this hypothetical scenario, we have been asked to create an assessment that has 11 items instead of 16. We will begin by looking at the item discrimination statistics.
Since this test has fewer than 25 items, we will look at the item-rest correlation discrimination. The screenshot below shows the first five items from the summary table in Questionmark’s Item Analysis Report (I have omitted some columns to help display the table within the blog).
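For readers who want to see what sits behind this statistic, here is a minimal Python sketch of an item-rest correlation: the Pearson correlation between the scores on one item and the total score on all of the other items. The function names and the toy response matrix are my own hypothetical illustrations, not data or code from the Item Analysis Report, and the sketch assumes dichotomously scored (0/1) items.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return float("nan")  # undefined when either variable is constant
    return cov / sqrt(var_x * var_y)

def item_rest_correlation(responses, item):
    """Correlate one item's scores with the total of all OTHER items.

    responses: list of participant rows, each a list of 0/1 item scores.
    item: zero-based column index of the item being evaluated.
    """
    item_scores = [row[item] for row in responses]
    rest_scores = [sum(row) - row[item] for row in responses]
    return pearson(item_scores, rest_scores)

# Hypothetical toy data: 6 participants x 4 items (not from the report).
toy = [
    [1, 1, 1, 0],
    [1, 1, 0, 1],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 0, 1],
    [0, 0, 0, 0],
]
item_rest = [item_rest_correlation(toy, j) for j in range(4)]
for j, r in enumerate(item_rest):
    print(f"item {j}: item-rest correlation = {r:.3f}")
```

In this toy data, the first item correlates positively with the rest of the test (about 0.447), while the last item correlates negatively (about -0.302). Excluding the item from its own criterion score matters most on short tests, where each item makes up a noticeable share of the total.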
The test’s reliability (as measured by Cronbach’s Alpha) for all 16 items is 0.58. Note that one would typically need a reliability value of at least 0.70 for low-stakes assessments and a value of 0.90 or higher for high-stakes assessments. When reliability is too low, adding extra items can often help, but removing items with poor discrimination can also improve reliability.
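As a rough illustration of how Cronbach's Alpha is calculated, the sketch below applies the usual formula: alpha = k / (k - 1) * (1 - sum of item variances / variance of total scores), where k is the number of items. The function name and the toy data are hypothetical, and this is a sketch under those assumptions rather than the exact routine the report uses.

```python
from statistics import pvariance

def cronbach_alpha(responses):
    """Cronbach's alpha for a participants-by-items matrix of item scores.

    Assumes at least two items and a nonzero total-score variance.
    Population variance is used throughout; using sample variance for
    both the items and the totals would give the same alpha.
    """
    k = len(responses[0])
    item_variances = [pvariance([row[j] for row in responses]) for j in range(k)]
    total_variance = pvariance([sum(row) for row in responses])
    return (k / (k - 1)) * (1 - sum(item_variances) / total_variance)

# Hypothetical toy data: 6 participants x 4 items (not from the report).
toy = [
    [1, 1, 1, 0],
    [1, 1, 0, 1],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 0, 1],
    [0, 0, 0, 0],
]
alpha = cronbach_alpha(toy)
print(f"alpha = {alpha:.3f}")  # about 0.195 for this toy data
```

The toy test is deliberately unreliable: the item variances add up to nearly as much as the total score variance, which is what happens when items do not hang together.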
If we remove the five items with the lowest item-rest correlation discrimination (items 9, 16, 2, 3, and 13 shown above), the remaining 11 items have an alpha value of 0.67. That is still not high enough for even low-stakes testing, but it illustrates how items with poor discrimination can lower the reliability of an assessment. Low reliability also increases the standard error of measurement, so by increasing the reliability of the assessment, we might also increase the accuracy of the scores.
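The selection step described here can be sketched in a few lines of Python: rank the items by item-rest correlation, drop the weakest ones, and recompute alpha on what remains. Everything below (the function names, the toy data, and the choice to treat an undefined correlation as zero) is a hypothetical illustration rather than the report's own method. The last lines use the standard relationship SEM = SD * sqrt(1 - reliability) to express the standard error of measurement in standard deviation units for the two alpha values quoted in this post.

```python
from math import sqrt
from statistics import pvariance

def pearson(x, y):
    """Pearson correlation; returns 0.0 if undefined (an assumption)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy) if vx > 0 and vy > 0 else 0.0

def item_rest(responses, item):
    """Correlation between one item and the total of the other items."""
    scores = [row[item] for row in responses]
    rest = [sum(row) - row[item] for row in responses]
    return pearson(scores, rest)

def cronbach_alpha(responses):
    """Cronbach's alpha for a participants-by-items score matrix."""
    k = len(responses[0])
    item_vars = [pvariance([row[j] for row in responses]) for j in range(k)]
    total_var = pvariance([sum(row) for row in responses])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)

def drop_weakest_items(responses, n_drop):
    """Remove the n_drop items with the lowest item-rest correlation.

    Items are ranked once, on the full test, and dropped together.
    Returns the kept column indices and the reduced score matrix.
    """
    k = len(responses[0])
    ranked = sorted(range(k), key=lambda j: item_rest(responses, j))
    kept = sorted(ranked[n_drop:])
    reduced = [[row[j] for j in kept] for row in responses]
    return kept, reduced

# Hypothetical toy data: 6 participants x 4 items (not from the report).
toy = [
    [1, 1, 1, 0],
    [1, 1, 0, 1],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 0, 1],
    [0, 0, 0, 0],
]
alpha_before = cronbach_alpha(toy)
kept, reduced = drop_weakest_items(toy, n_drop=1)
alpha_after = cronbach_alpha(reduced)
print(f"kept items {kept}: alpha {alpha_before:.3f} -> {alpha_after:.3f}")

# SEM in standard deviation units for the alpha values quoted in the post.
sem_before = sqrt(1 - 0.58)  # about 0.648 SD
sem_after = sqrt(1 - 0.67)   # about 0.574 SD
print(f"SEM: {sem_before:.3f} SD -> {sem_after:.3f} SD")
```

In the toy data, dropping the one negatively discriminating item raises alpha from about 0.195 to about 0.614. Note that the SEM comparison is in standard deviation units only; the raw score standard deviation also changes when items are removed, so raw-score SEMs for the 16-item and 11-item forms are not directly comparable.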
Notice that these five items have poor item-rest correlation statistics, yet four of those items have reasonable item difficulty indices (items 16, 2, 3, and 13). If we had made selection decisions based on item difficulty alone, we might have chosen to retain these items, though closer inspection would uncover some content issues, as I demonstrated during the item analysis webinar.
For example, consider item 3, which has a difficulty value of 0.418 and an item-rest correlation discrimination value of -0.02. The screenshot below shows the option analysis table from the item detail page of the report.
The option analysis table shows that, when asked about the easternmost state in the United States, many participants selected the key, “Maine,” but 43.3% of our top-performing participants (defined by the upper 27% of scores) selected “Alaska.” This indicates that some of the top-performing participants might be familiar with Pochnoi Point, a spot in Alaska's Aleutian Islands that happens to sit on the other side of the 180th meridian. Sure, that is a technicality, but across the entire sample, 27.8% of the participants chose this option. This item clearly needs to be sent back for revision and clarification before we use it for scored delivery. If we had only looked at the item difficulty statistics, we might never have reviewed this item.
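To make the idea of an option analysis concrete, here is a minimal Python sketch that tabulates the proportion of participants choosing each option overall and within the upper and lower 27% of total scores. The records, the "Florida" distractor, the function name, and the way the group size is rounded are all hypothetical assumptions for illustration; the report may define its groups differently (for example, in how it handles tied scores).

```python
from collections import Counter

def option_analysis(records, options, fraction=0.27):
    """Proportion choosing each option overall and in the score extremes.

    records: list of (total_score, chosen_option) pairs, one per participant.
    options: the answer options to tabulate.
    fraction: share of participants in the upper and lower groups.
    """
    ranked = sorted(records, key=lambda rec: rec[0], reverse=True)
    group_size = max(1, round(len(ranked) * fraction))
    groups = {
        "all": ranked,
        "upper": ranked[:group_size],   # highest total scores
        "lower": ranked[-group_size:],  # lowest total scores
    }
    table = {}
    for name, members in groups.items():
        counts = Counter(choice for _, choice in members)
        table[name] = {opt: counts[opt] / len(members) for opt in options}
    return table

# Hypothetical toy records: (total score, option chosen on this item).
records = [
    (16, "Alaska"), (15, "Alaska"), (14, "Maine"), (12, "Maine"),
    (11, "Florida"), (10, "Maine"), (9, "Alaska"), (7, "Maine"),
    (6, "Florida"), (4, "Maine"),
]
table = option_analysis(records, options=["Maine", "Alaska", "Florida"])
for group, shares in table.items():
    row = ", ".join(f"{opt}: {share:.1%}" for opt, share in shares.items())
    print(f"{group:>5} -> {row}")
```

In this toy example, the keyed option "Maine" is chosen by half of all participants, but "Alaska" is the most popular choice in the upper group and is not chosen at all in the lower group. A distractor that attracts the strongest participants is the pattern to watch for when an item's discrimination is near zero or negative.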