Item Analysis for Beginners – When are very Easy or very Difficult Questions Useful?

Posted by John Kleeman

I’m running a session at the Questionmark user conference next month on Item Analysis for Beginners and thought I’d share the answer to an interesting question in this blog.

Item analysis fragment showing a question with difficulty of 0.998 and discrimination of 0.034

When you run an Item Analysis report, one of the useful statistics you get on a question is its “p-value” or “item difficulty”. This is a number from 0 to 1; the higher the value, the easier the question. An easy question might have a p-value of 0.9 to 1.0, meaning 90% to 100% of participants answer the question correctly. A difficult question might have a p-value of 0.0 to 0.25, meaning no more than 25% of participants answer the question correctly. For example, the report fragment above shows a question with a p-value of 0.998, which means it is very easy and almost everyone gets it right.
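To make the p-value concrete, here is a minimal sketch in Python of how it could be computed from scored responses. The function name and the response data are hypothetical and only for illustration; this is not how the Item Analysis report is implemented.

```python
def item_difficulty(item_scores):
    """Return the p-value: the proportion of participants who got the item right.

    item_scores is a list with 1 for a correct answer and 0 for an incorrect one.
    """
    return sum(item_scores) / len(item_scores)


# Hypothetical data: 500 participants on an easy item, 100 on a hard one.
easy_item = [1] * 499 + [0] * 1
hard_item = [1] * 20 + [0] * 80

easy_p = item_difficulty(easy_item)
hard_p = item_difficulty(hard_item)

print(f"easy item p-value: {easy_p:.3f}")
print(f"hard item p-value: {hard_p:.3f}")
```

With these made-up numbers, the easy item lands in the 0.9 to 1.0 band and the hard item falls below 0.25.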

Whether such questions are appropriate depends on the purpose of the assessment. Most participants will get difficult questions wrong and easy questions right. In general, very easy and very difficult questions are less helpful than other questions in discriminating between participants, so they contribute less when you use the assessment for measurement purposes.

Here are three reasons why you might decide to include very difficult questions in an assessment:

  1. Sometimes your test blueprint requires questions on a topic and the only ones you have available are difficult ones – if so, you need to use them until you can write more.
  2. If a job has high performance needs and you need to filter out a few participants from many, then very difficult questions can be useful. This might apply for example if you are selecting potential astronauts or special forces team members.
  3. If you need to assess a wide range of ability within a single assessment, then you may need some very difficult questions to be able to assess abilities within the top performing participants.

And here are five reasons why you might decide to include very easy questions in an assessment:

  1. Answering questions gives retrieval practice and helps participants remember things in future – so including easy questions still helps reduce people’s forgetting.
  2. In compliance or health and safety, you may choose to include basic questions that almost everyone gets right. This is because if someone gets it wrong, you want to know and be able to intervene.
  3. More broadly, sometimes a test blueprint requires you to cover some topics that almost everyone knows, and for which it’s not practical to write difficult questions.
  4. Easy questions at the start of an assessment can build confidence and reduce test anxiety. See my blog post Ten tips on reducing test anxiety for online test-takers for other ways to deal with test anxiety.
  5. If the purpose of your assessment is to measure someone’s ability to process information quickly and accurately at speed, then including many low difficulty questions that need to be answered in a short time might be appropriate.

If you want to learn more about Item Analysis, search this blog for other articles. You might also find the Questionmark user conference useful: as well as my session on Item Analysis, there are many other useful sessions, including setting the cut-score in a fair, defensible way and identifying knowledge gaps. The conference also gives you the opportunity to learn from and network with other assessment practitioners – I look forward to seeing some of you there.

Develop Better Tests with Item Analysis [New eBook]

Posted by Chloe Mendonca

Item Analysis is probably the most important tool for increasing test effectiveness.  In order to write items that accurately and reliably measure what they’re intended to, you need to examine participant responses to each item. You can use this information to improve test items and identify unfair or biased items.

So what’s the process for conducting an item analysis? What should you be looking for? How do you determine if a question is “good enough”?

Questionmark has just published a new eBook, “Item Analysis Analytics,” which answers these questions. The eBook shares many examples of varying statistics that you may come across in your own analyses.

Download this eBook to learn about these aspects of analytics:

  • the basics of classical test theory and item analysis
  • the process of conducting an item analysis
  • essential things to look for in a typical item analysis report
  • whether a question “makes the grade” in terms of psychometric quality

This eBook is available as a PDF and ePUB suitable for viewing on a variety of mobile devices and eReaders.

I hope you enjoy reading it!

Item analysis: Selecting items for the test form – Part 2

Posted by Austin Fossey

In my last post, I talked about how item discrimination is the primary statistic used for item selection in classical test theory (CTT). In this post, I will share an example from my item analysis webinar.

The assessment below is fake, so there’s no need to write in comments telling me that the questions could be written differently or that the test is too short or that there is not good domain representation or that I should be banished to an island.

Interested in learning more about item analysis? I will be presenting a series of workshops on this topic at the Questionmark Conference 2016: Shaping the Future of Assessment in Miami, April 12-15.

In this example, we have field tested 16 items and collected item statistics from a representative sample of 1,000 participants. In this hypothetical scenario, we have been asked to create an assessment that has 11 items instead of 16. We will begin by looking at the item discrimination statistics.

Since this test has fewer than 25 items, we will look at the item-rest correlation discrimination. The screenshot below shows the first five items from the summary table in Questionmark’s Item Analysis Report (I have omitted some columns to help display the table within the blog).

First five items from the summary table in Questionmark’s Item Analysis Report

The test’s reliability (as measured by Cronbach’s Alpha) for all 16 items is 0.58. Note that one would typically need at least a reliability value of 0.70 for low-stakes assessments and a value of 0.90 or higher for high-stakes assessments. When reliability is too low, adding extra items can often help improve the reliability, but removing items with poor discrimination can also improve reliability.

If we remove the five items with the lowest item-rest correlation discrimination (items 9, 16, 2, 3, and 13 shown above), the remaining 11 items have an alpha value of 0.67. That is still not high enough for even low-stakes testing, but it illustrates how items with poor discrimination can lower the reliability of an assessment. Low reliability also increases the standard error of measurement, so by increasing the reliability of the assessment, we might also increase the accuracy of the scores.
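The effect described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the tiny response matrix is invented (it is not the webinar data), the helper names are hypothetical, and population variances are used throughout.

```python
from statistics import mean, pvariance


def pearson(x, y):
    """Pearson correlation between two equal-length sequences of numbers."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def item_rest_correlation(rows, i):
    """Correlate item i with the total score on all the other items."""
    item = [row[i] for row in rows]
    rest = [sum(row) - row[i] for row in rows]
    return pearson(item, rest)


def cronbach_alpha(rows):
    """Cronbach's Alpha from a participants-by-items matrix of item scores."""
    k = len(rows[0])
    item_vars = [pvariance([row[i] for row in rows]) for i in range(k)]
    total_var = pvariance([sum(row) for row in rows])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)


# Invented scores: 8 participants x 4 items (1 = correct, 0 = incorrect).
# The last item runs against the other three.
rows = [
    [1, 1, 1, 0],
    [1, 1, 1, 0],
    [1, 1, 0, 1],
    [1, 1, 0, 0],
    [1, 0, 0, 1],
    [1, 0, 0, 1],
    [0, 0, 0, 1],
    [0, 0, 0, 1],
]

n_items = len(rows[0])
discrimination = [item_rest_correlation(rows, i) for i in range(n_items)]

# Drop the single item with the lowest item-rest correlation.
worst = min(range(n_items), key=lambda i: discrimination[i])
reduced = [[s for i, s in enumerate(row) if i != worst] for row in rows]

alpha_all = cronbach_alpha(rows)
alpha_reduced = cronbach_alpha(reduced)
print(f"dropped item index: {worst}")
print(f"alpha with all items: {alpha_all:.2f}")
print(f"alpha after dropping it: {alpha_reduced:.2f}")
```

In this toy data, removing the one poorly discriminating item raises alpha even though the form gets shorter, which is the same pattern as going from 16 items to 11 in the example.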

Notice that these five items have poor item-rest correlation statistics, yet four of those items have reasonable item difficulty indices (items 16, 2, 3, and 13). If we had made selection decisions based on item difficulty, we might have chosen to retain these items, though closer inspection would uncover some content issues, as I demonstrated during the item analysis webinar.

For example, consider item 3, which has a difficulty value of 0.418 and an item-rest correlation discrimination value of -0.02. The screenshot below shows the option analysis table from the item detail page of the report.


The option analysis table shows that, when asked about the easternmost state in the United States, many participants are selecting the key, “Maine,” but 43.3% of our top-performing participants (defined by the upper 27% of scores) selected “Alaska.” This indicates that some of the top-performing participants might be familiar with Pochnoi Point, on an Alaskan island which happens to sit on the other side of the 180th meridian. Sure, that is a technicality, but across the entire sample, 27.8% of the participants chose this option. This item clearly needs to be sent back for revision and clarification before we use it for scored delivery. If we had only looked at the item difficulty statistics, we might never have reviewed this item.
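For readers who want to reproduce this kind of table from raw data, here is a rough Python sketch. It assumes each response is a (total score, chosen option) pair and takes the upper group by simple rank, so ties at the cut-off are broken arbitrarily. The data and the "Other" option are invented, and the report may well define the upper group differently.

```python
from collections import Counter


def option_analysis(responses, upper_fraction=0.27):
    """Return {option: (share of all participants, share of upper group)}.

    responses is a list of (total_score, chosen_option) tuples.
    """
    ranked = sorted(responses, key=lambda r: r[0], reverse=True)
    n_upper = max(1, round(len(ranked) * upper_fraction))
    upper = ranked[:n_upper]

    overall_counts = Counter(option for _, option in responses)
    upper_counts = Counter(option for _, option in upper)

    return {
        option: (overall_counts[option] / len(responses),
                 upper_counts[option] / n_upper)
        for option in overall_counts
    }


# Invented data: 100 participants, 27 of them with a high total score.
responses = (
    [(15, "Alaska")] * 12
    + [(15, "Maine")] * 15
    + [(8, "Maine")] * 40
    + [(8, "Alaska")] * 16
    + [(8, "Other")] * 17
)

table = option_analysis(responses)
for option, (overall, upper) in table.items():
    print(f"{option:<7} overall {overall:.1%}  upper group {upper:.1%}")
```

A distractor that attracts a larger share of the upper group than of the whole sample, as "Alaska" does in this made-up data, is a signal to review the item wording.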


Item analysis: Selecting items for the test form – Part 1

Posted by Austin Fossey

Regular readers of our blog know that we ran an initial series on item analysis way back in the day, and then I did a second item analysis series building on that a couple of years ago, and then I discussed item analysis in our item development series, and then we had an amazing webinar about item analysis, and then I named my goldfish Item Analysis and wrote my senator requesting that our state bird be changed to an item analysis. So today, I would like to talk about . . . item analysis.

But don’t worry, this is actually a new topic for the blog.

Interested in learning more about item analysis? I will be presenting a series of workshops on this topic at the 2016 Questionmark Conference: Shaping the Future of Assessment in Miami, April 12-15. 

Today, I am writing about the use of item statistics for item selection. I was surprised to find, from feedback we got from many of our webinar participants, that a lot of people do not look at their item statistics until after the test has been delivered. This is a great practice (so keep it up), but if you can try out the questions as unscored field test items before making your final test form, you can use the item analysis statistics to build a better instrument.

When building a test form, item statistics can help us in two ways.

  • They can help us identify items that are poorly written, miskeyed, or irrelevant to the construct.
  • They can help us select the items that will yield the most reliable instrument, and thus a more accurate score.

In the first half of the 20th century, it was a common belief that good test instruments should have a mix of easy, medium, and hard items, but this thinking began to change after two studies in 1952 by Fred Lord and by Lee Cronbach and Willard Warrington. These researchers (and others since) demonstrated that items with higher discrimination values create instruments whose total scores discriminate better among participants across all ability levels.

Sometimes easy and hard items are useful for measurement, such as in an adaptive aptitude test where we need to measure all abilities with similar precision. But in criterion-referenced assessments, we are often interested in correctly classifying those participants who should pass and those who should fail. If this is our goal, then the best test form will be one with a range of medium-difficulty items that also have high discrimination values.

Discrimination may be the primary statistic used for selecting items, but item reliability is also occasionally useful, as I explained in an earlier post. Item reliability can be used as a tie breaker when we need to choose between two items with the same discrimination, or it can be used to predict the reliability or score variance for a set of items that the test developer wants to use for a test form.

Difficulty is still useful for flagging items, though an item flagged for being too easy or too hard will often have a low discrimination value too. If an easy or hard item has good discrimination, it may be worth reviewing for item flaws or other factors that may have impacted the statistics (e.g., was it given at the end of a timed test that did not give participants enough time to respond carefully).

In my next post, I will share an example from the webinar of how item selection using item discrimination improves the test form reliability, even though the test is shorter. I will also share an example of a flawed item that exhibits poor item statistics.




Field Test Studies: Taking your items for a test drive

Posted by Austin Fossey

In large-scale assessment, a significant amount of work goes into writing items before a participant ever sees them. Items are drafted, edited, reviewed for accuracy, checked for bias, and usually rewritten several times before they are ready to be deployed. Despite all this work, a true test of an item’s performance will come when it is first delivered to participants.

Even though we work so hard to write high-quality items, some bad items may slip past our review committees. To be safe, most large-scale assessment programs will try out their items with a field test.

A field test delivers items to participants under the same conditions used in live testing, but the items do not count toward the participants’ scores. This allows test developers and psychometricians to harvest statistics that can be used in an item analysis to flag poorly performing items.

There are two methods for field testing items. The first method is to embed your new items into an assessment that is already operational. The field test items will not count against the participants’ scores, but the participants will not know which items are scored items and which items are field test items.

The second method is to give participants an assessment that includes only field test items. The participants will not receive a score at the end of the assessment since none of the items have yet been approved to be used for live scoring, though the form may be scored later once the final set of items has been approved for operational use.

In their chapter in Educational Measurement (4th ed.), Schmeiser and Welch explain that embedding the items into an operational assessment is generally preferred. When items are field tested in an operational assessment, participants are more motivated to perform well on the items. The item data are also collected while the operational assessment is being delivered, which can help improve the reliability of the item statistics.

When participants take an assessment that only consists of field test items, they may not be motivated to try as hard as they would in an operational assessment, especially if the assessment will not be scored. However, field testing a whole form’s worth of items will give you better content coverage with the items so that you have more items that can be reviewed in the item analysis. If field testing an entire form, Schmeiser and Welch suggest using twice as many items as you will need for the operational form. Many items may need to be discarded or rewritten as a result of the item analysis, so you want to make sure you will still have enough to build an operational form at the end of the process.

Since the value of field testing items is to collect item statistics, it is also important to make sure that a representative sample of participants responds to the field test items. If the sample of participant responses is too small or not representative, then the item statistics may not be generalizable to the entire population.

Questionmark’s authoring solutions allow test developers to field test items by setting the item’s status to “Experimental.” The item will still be scored, and the statistics will be generated in the Item Analysis Report, but the item will not count toward the participant’s final score.


Setting an item’s status to “Experimental” in Questionmark Live so that it can be field tested.

Item Analysis Report – Item Reliability

Posted by Austin Fossey

In this series of posts, we have been discussing the statistics that are reported on the Item Analysis Report, including the difficulty index, correlational discrimination, and high-low discrimination.

The final statistic reported on the Item Analysis Report is the item reliability. Item reliability is simply the product of the standard deviation of item scores and a correlational discrimination index (Item-Total Correlation Discrimination in the Item Analysis Report). So item reliability reflects how much the item is contributing to total score variance. As with assessment reliability, higher values represent better reliability.
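As a minimal sketch of that definition, the Python below multiplies each item's standard deviation by its item-total correlation. The response matrix and the function names are hypothetical, and population standard deviations are assumed; the report's own calculation may differ in such details.

```python
from statistics import mean, pstdev


def pearson(x, y):
    """Pearson correlation between two equal-length sequences of numbers."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def item_reliability(rows, i):
    """Item reliability: SD of item scores times the item-total correlation."""
    item = [row[i] for row in rows]
    total = [sum(row) for row in rows]
    return pstdev(item) * pearson(item, total)


# Invented scores: 8 participants x 3 items (1 = correct, 0 = incorrect).
rows = [
    [1, 1, 1],
    [1, 1, 1],
    [1, 1, 0],
    [1, 1, 0],
    [1, 0, 0],
    [1, 0, 0],
    [0, 0, 0],
    [0, 0, 0],
]

reliabilities = [item_reliability(rows, i) for i in range(len(rows[0]))]
for i, value in enumerate(reliabilities):
    print(f"item {i}: item reliability {value:.3f}")
```

In this made-up data the middle item, which half the participants answer correctly, has the largest standard deviation and the highest item reliability.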

Like the other statistics in the Item Analysis Report, item reliability is used primarily to inform decisions about item retention. Crocker and Algina (Introduction to Classical and Modern Test Theory) describe three ways that test developers might use the item reliability index.

1) Choosing Between Two Items in Form Construction

If two items have similar discrimination values, but one item has a higher standard deviation of item scores, then that item will have higher item reliability and will contribute more to the assessment’s reliability. All else being equal, the test developer might decide to retain the item with higher reliability and save the lower reliability item in the bank as backup.

2) Building a Form with a Required Assessment Reliability Threshold

As Crocker and Algina demonstrate, Cronbach’s Alpha can be calculated as a function of the standard deviations of items’ scores and items’ reliabilities. If the test developer desires a certain minimum for the assessment’s reliability (as measured by Cronbach’s Alpha), they can use these two item statistics to build a form that will yield the desired level of internal consistency.
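One way to write that relationship is alpha = k/(k-1) * (1 - sum of item variances / (sum of item reliabilities)^2), where k is the number of items. The Python sketch below applies it to a small invented data set and compares the result with alpha computed directly from the scores. The names are hypothetical and population statistics are assumed.

```python
from statistics import mean, pstdev, pvariance


def pearson(x, y):
    """Pearson correlation between two equal-length sequences of numbers."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def alpha_from_item_stats(item_sds, item_reliabilities):
    """Cronbach's Alpha from item standard deviations and item reliabilities."""
    k = len(item_sds)
    sum_item_variance = sum(sd ** 2 for sd in item_sds)
    total_variance = sum(item_reliabilities) ** 2
    return k / (k - 1) * (1 - sum_item_variance / total_variance)


# Invented scores: 8 participants x 3 items (1 = correct, 0 = incorrect).
rows = [
    [1, 1, 1],
    [1, 1, 1],
    [1, 1, 0],
    [1, 1, 0],
    [1, 0, 0],
    [1, 0, 0],
    [0, 0, 0],
    [0, 0, 0],
]
columns = list(zip(*rows))
totals = [sum(row) for row in rows]

item_sds = [pstdev(col) for col in columns]
item_rels = [sd * pearson(col, totals) for sd, col in zip(item_sds, columns)]
alpha_stats = alpha_from_item_stats(item_sds, item_rels)

# Direct calculation from the score matrix, for comparison.
k = len(columns)
alpha_direct = k / (k - 1) * (
    1 - sum(pvariance(col) for col in columns) / pvariance(totals)
)

print(f"alpha from item statistics: {alpha_stats:.3f}")
print(f"alpha computed directly:    {alpha_direct:.3f}")
```

The two values agree because the sum of the item reliabilities equals the standard deviation of the total scores for the same set of items.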

3) Building a Form with a Required Total Score Variance Threshold

Crocker and Algina explain that the total score variance is equivalent to the square of the sum of item reliability indices, so test developers may continue to add items to a form based on their item reliability values until they meet their desired threshold for total score variance.
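A simple way to picture this third use is a loop that keeps adding the item with the next-highest item reliability until the squared sum reaches the target. The Python sketch below does that with invented item IDs, values, and threshold. It also assumes the field-test item reliabilities stay fixed as the form changes, which is only an approximation, since each value depends on the total score it was correlated with.

```python
def select_items_for_variance(item_reliabilities, target_variance):
    """Add items in descending order of item reliability until the predicted
    total score variance, (sum of item reliabilities) squared, meets the target.

    item_reliabilities maps item IDs to item reliability values.
    Returns the chosen item IDs and the predicted variance.
    """
    chosen = []
    running_sum = 0.0
    ranked = sorted(item_reliabilities.items(), key=lambda kv: kv[1], reverse=True)
    for item_id, reliability in ranked:
        if running_sum ** 2 >= target_variance:
            break
        if reliability <= 0:
            # Items with zero or negative reliability cannot raise the sum.
            break
        chosen.append(item_id)
        running_sum += reliability
    return chosen, running_sum ** 2


# Hypothetical item reliabilities from a field test.
bank = {"Q1": 0.21, "Q2": 0.18, "Q3": 0.15, "Q4": 0.12, "Q5": 0.05}

chosen, predicted_variance = select_items_for_variance(bank, target_variance=0.25)
print("chosen items:", chosen)
print(f"predicted total score variance: {predicted_variance:.4f}")
```

In practice you would also weigh content coverage and the test blueprint, so a sketch like this is a starting point for form construction rather than a complete selection rule.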


Item reliability from Questionmark’s Item Analysis Report (item detail page)