Item Analysis for Beginners – Getting Started

Do you use assessments to make decisions about people? If so, then you should regularly run Item Analysis on your results.  Item Analysis can help find questions which are ambiguous, mis-keyed or which have choices that are rarely chosen. Improving or removing such questions will improve the validity and reliability of your assessment, and so help you use assessment results to make better decisions. If you don’t use Item Analysis, you risk using poor questions that make your assessments less accurate.

Sometimes people can be fearful of Item Analysis because they are worried it involves too much statistics. This blog post introduces Item Analysis for people who are unfamiliar with it, and I promise no maths or stats! I’m also giving a free webinar on Item Analysis with the same promise.

An assessment contains many items (another name for questions) as figuratively shown below. You can use Item Analysis to look at how each item performs within the assessment and flag potentially weak items for review. By keeping only stronger questions in the assessment, the assessment will be more effective.

Picture of a series of items with one marked as being weak

Item Analysis looks at the performance of all your participants on the items, and calculates how easy or hard people find the items (“item difficulty” or “p-value”) and how well the scores on items correlate with or show a relationship with the scores on the assessment as a whole (“item discrimination” or correlation). Some of the problematic questions that Item Analysis can identify are:

  • Questions almost all participants get right, and so which are very easy. You might want to review these to see if they are appropriate for the assessment. See my earlier post Item Analysis for Beginners – When are very Easy or very Difficult Questions Useful? for more information.
  • Questions which are difficult, where a lot of participants get the question wrong. You should check such questions in case they are mis-keyed or ambiguous.
  • Multiple choice questions where some choices are rarely picked. You might want to improve such questions to make the wrong choices more plausible.
  • Questions where there is a poor correlation between participants who get the question right and who do well on the assessment. For example it will flag questions that high performing participants perform poorly on. You should look at such questions in case they are ambiguous, mis-keyed or off-topic.

There is a huge wealth of information available in an Item Analysis report, and assessment experts will delve into the report in detail. But much of the key information in an Item Analysis report is useful to anyone creating and delivering quizzes, tests and exams.

The Questionmark Item Analysis report includes a graph which shows the difficulty of items compared against their discrimination, like in the example below. It flags questions by marking them amber or red if they fall into categories which may need review. For example, in the illustration below, four questions are marked in amber as having low discrimination and so are potentially worth looking at.

Illustration of Questionmark item analysis report showing some questions green and some amber

If you are running an assessment program, and not using Item Analysis regularly, then this throws doubt on the trustworthiness of your results. By using it to identify and improve weak questions you should be able to improve your validity and reliability.

Item Analysis is surprisingly effective in practice. I’m one of the team responsible at Questionmark for managing our data security test which all employees have to take annually to check their understanding of information security and data protection. We recently reviewed the test and ran Item Analysis and very quickly found a question with poor stats where the technology had changed but we’d not updated the wording, and another question where two of the choices could be considered right, which made it hard to answer. It made our review faster and more effective and helped us improve the quality of the test.

If you want to learn a little more about Item Analysis, I’m running a free webinar on the subject “Item Analysis for Beginners” on May 2nd. You can see details and register for the webinar at https://www.questionmark.com/questionmark_webinars. I look forward to seeing some of you there!

 

Item Analysis for Beginners – When are very Easy or very Difficult Questions Useful?

Posted by John Kleeman

I’m running a session at the Questionmark user conference next month on Item Analysis for Beginners and thought I’d share the answer to an interesting question in this blog.

Item analysis fragment showing a question with difficulty of 0.998 and discrimination of 0.034

When you run an Item Analysis report, one of the useful statistics you get on a question is its “p-value” or “item difficulty”. This is a number from 0 to 1, where a higher value means an easier question. An easy question might have a p-value of 0.9 to 1.0, meaning 90% to 100% of participants answer the question correctly. A difficult question might have a p-value of 0.0 to 0.25, meaning 25% or fewer of participants answer the question correctly. For example, the report fragment to the right shows a question with p-value 0.998, which means it is very easy and almost everyone gets it right.
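For readers who like to see the arithmetic, here is a minimal Python sketch of how a p-value could be computed from scored responses. The function names, the "moderate" label and the sample of 998 correct answers out of 1,000 are illustrative assumptions, not part of the Questionmark report; the cut-offs simply reuse the 0.9 and 0.25 figures mentioned above.

```python
def p_value(item_scores):
    """Proportion of participants who answered correctly (1 = correct, 0 = incorrect)."""
    if not item_scores:
        raise ValueError("need at least one response")
    return sum(item_scores) / len(item_scores)


def describe_difficulty(p, easy_from=0.9, difficult_up_to=0.25):
    """Label a p-value using the rough bands from the text (cut-offs are adjustable)."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("a p-value must lie between 0 and 1")
    if p >= easy_from:
        return "easy"
    if p <= difficult_up_to:
        return "difficult"
    return "moderate"  # label assumed here for everything in between


# Hypothetical sample: 998 of 1,000 participants answer correctly.
scores = [1] * 998 + [0] * 2
p = p_value(scores)
print(round(p, 3), describe_difficulty(p))
```

The bands are only a starting point; as the rest of this post explains, whether an easy or difficult question belongs in the assessment depends on its purpose.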

Whether such questions are appropriate depends on the purpose of the assessment. Most participants will get difficult questions wrong and easy questions right. In general, very easy and very difficult questions will not be as helpful as other questions in helping you discriminate between participants and so use the assessment for measurement purposes.

Here are three reasons why you might decide to include very difficult questions in an assessment:

  1. Sometimes your test blueprint requires questions on a topic and the only ones you have available are difficult ones – if so, you need to use them until you can write more.
  2. If a job has high performance needs and you need to filter out a few participants from many, then very difficult questions can be useful. This might apply for example if you are selecting potential astronauts or special forces team members.
  3. If you need to assess a wide range of ability within a single assessment, then you may need some very difficult questions to be able to assess abilities within the top performing participants.

And here are five reasons why you might decide to include very easy questions in an assessment:

  1. Answering questions gives retrieval practice and helps participants remember things in future – so including easy questions still helps reduce people’s forgetting.
  2. In compliance or health and safety, you may choose to include basic questions that almost everyone gets right. This is because if someone gets it wrong, you want to know and be able to intervene.
  3. More broadly, sometimes a test blueprint requires you to cover some topics that almost everyone knows, and it’s not practical to write difficult questions about.
  4. Easy questions at the start of an assessment can build confidence and reduce test anxiety. See my blog post Ten tips on reducing test anxiety for online test-takers for other ways to deal with test anxiety.
  5. If the purpose of your assessment is to measure someone’s ability to process information quickly and accurately at speed, then including many low difficulty questions that need to be answered in a short time might be appropriate.

If you want to learn more about Item Analysis, search this blog for other articles. You might also find the Questionmark user conference useful, since as well as my session on Item Analysis, there are also many other useful sessions including setting the cut-score in a fair, defensible way and identifying knowledge gaps. The conference also gives opportunity to learn and network with other assessment practitioners – I look forward to seeing some of you there.

Develop Better Tests with Item Analysis [New eBook]

Posted by Chloe Mendonca

Item Analysis is probably the most important tool for increasing test effectiveness.  In order to write items that accurately and reliably measure what they’re intended to, you need to examine participant responses to each item. You can use this information to improve test items and identify unfair or biased items.

So what’s the process for conducting an item analysis? What should you be looking for? How do you determine if a question is “good enough”?

Questionmark has just published a new eBook, “Item Analysis Analytics,” which answers these questions. The eBook shares many examples of varying statistics that you may come across in your own analyses.

Download this eBook to learn about these aspects of analytics:

  • the basics of classical test theory and item analysis
  • the process of conducting an item analysis
  • essential things to look for in a typical item analysis report
  • whether a question “makes the grade” in terms of psychometric quality

This eBook is available as a PDF and ePUB suitable for viewing on a variety of mobile devices and eReaders.

I hope you enjoy reading it!

Item analysis: Selecting items for the test form – Part 2

Posted by Austin Fossey

In my last post, I talked about how item discrimination is the primary statistic used for item selection in classical test theory (CTT). In this post, I will share an example from my item analysis webinar.

The assessment below is fake, so there’s no need to write in comments telling me that the questions could be written differently or that the test is too short or that there is not good domain representation or that I should be banished to an island.

Interested in learning more about item analysis? I will be presenting a series of workshops on this topic at the Questionmark Conference 2016: Shaping the Future of Assessment in Miami, April 12-15.

In this example, we have field tested 16 items and collected item statistics from a representative sample of 1,000 participants. In this hypothetical scenario, we have been asked to create an assessment that has 11 items instead of 16. We will begin by looking at the item discrimination statistics.

Since this test has fewer than 25 items, we will look at the item-rest correlation discrimination. The screenshot below shows the first five items from the summary table in Questionmark’s Item Analysis Report (I have omitted some columns to help display the table within the blog).

Screenshot of the summary table in the Item Analysis Report showing the first five items
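As a rough illustration of what an item-rest correlation measures, the sketch below correlates each item's scores with the total of the remaining items, so that an item is not correlated with itself. The data and function names are hypothetical, items are assumed to be scored 1 or 0, and returning 0.0 for an item with no variance is a convention chosen here; this is not a description of how the Questionmark report computes its statistics.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either list has no variance."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)


def item_rest_correlations(responses):
    """responses: one row per participant, one 0/1 score per item."""
    correlations = []
    for i in range(len(responses[0])):
        item = [row[i] for row in responses]
        rest = [sum(row) - row[i] for row in responses]  # total without this item
        correlations.append(pearson(item, rest))
    return correlations


# Hypothetical data: 6 participants, 4 items. The last item tends to be
# answered correctly by participants who do poorly on the other items.
responses = [
    [1, 1, 1, 0],
    [1, 1, 1, 0],
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [0, 0, 0, 1],
    [0, 0, 0, 0],
]
correlations = item_rest_correlations(responses)
for number, r in enumerate(correlations, start=1):
    print(f"item {number}: item-rest correlation = {r:+.2f}")
```

In this made-up data the fourth item comes out with a negative item-rest correlation, which is the kind of value that would flag it for review.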

The test’s reliability (as measured by Cronbach’s Alpha) for all 16 items is 0.58. Note that one would typically need at least a reliability value of 0.70 for low-stakes assessments and a value of 0.90 or higher for high-stakes assessments. When reliability is too low, adding extra items can often help improve the reliability, but removing items with poor discrimination can also improve reliability.
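For reference, Cronbach's Alpha can be computed from the item variances and the variance of the total scores. The sketch below uses the standard formula, alpha = k / (k - 1) × (1 - sum of item variances / total score variance), on a tiny hypothetical data set; the function name and data are illustrative only.

```python
from statistics import pvariance


def cronbach_alpha(responses):
    """responses: one row per participant, one numeric score per item."""
    k = len(responses[0])
    if k < 2:
        raise ValueError("alpha needs at least two items")
    item_variances = [pvariance([row[i] for row in responses]) for i in range(k)]
    total_variance = pvariance([sum(row) for row in responses])
    if total_variance == 0:
        raise ValueError("total scores have no variance")
    return (k / (k - 1)) * (1 - sum(item_variances) / total_variance)


# Hypothetical data: 4 participants, 3 items scored 1 (correct) or 0.
responses = [
    [1, 1, 1],
    [1, 1, 0],
    [1, 0, 0],
    [0, 0, 0],
]
alpha = cronbach_alpha(responses)
print(f"alpha = {alpha:.2f}")
```

Population variances are used for both the items and the totals; using sample variances throughout would give the same ratio and therefore the same alpha.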

If we remove the five items with the lowest item-rest correlation discrimination (items 9, 16, 2, 3, and 13 shown above), the remaining 11 items have an alpha value of 0.67. That is still not high enough for even low-stakes testing, but it illustrates how items with poor discrimination can lower the reliability of an assessment. Low reliability also increases the standard error of measurement, so by increasing the reliability of the assessment, we might also increase the accuracy of the scores.
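The link between reliability and the standard error of measurement can be sketched with the usual classical test theory formula, SEM = SD × sqrt(1 - reliability). The score standard deviation of 2.0 below is a made-up value, and holding it fixed across the 16-item and 11-item forms is a simplification, since a shorter form will normally have a different score spread.

```python
from math import sqrt


def standard_error_of_measurement(score_sd, reliability):
    """Classical test theory SEM for a given score SD and reliability."""
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("reliability is expected to lie between 0 and 1")
    return score_sd * sqrt(1.0 - reliability)


ASSUMED_SD = 2.0  # hypothetical standard deviation of total scores

sem_all_items = standard_error_of_measurement(ASSUMED_SD, 0.58)
sem_selected = standard_error_of_measurement(ASSUMED_SD, 0.67)
print(f"alpha 0.58: SEM = {sem_all_items:.2f}")
print(f"alpha 0.67: SEM = {sem_selected:.2f}")
```

With the score spread held constant, the higher alpha gives the smaller standard error, which is the sense in which better reliability can make scores more accurate.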

Notice that these five items have poor item-rest correlation statistics, yet four of those items have reasonable item difficulty indices (items 16, 2, 3, and 13). If we had made selection decisions based on item difficulty, we might have chosen to retain these items, though closer inspection would uncover some content issues, as I demonstrated during the item analysis webinar.

For example, consider item 3, which has a difficulty value of 0.418 and an item-rest correlation discrimination value of -0.02. The screenshot below shows the option analysis table from the item detail page of the report.

Screenshot of the option analysis table for item 3 from the item detail page of the report

The option analysis table shows that, when asked about the easternmost state in the United States, many participants are selecting the key, “Maine,” but 43.3% of our top-performing participants (defined by the upper 27% of scores) selected “Alaska.” This indicates that some of the top-performing participants might be familiar with Pochnoi Point, which lies on an Alaskan island that happens to sit on the other side of the 180th meridian. Sure, that is a technicality, but across the entire sample, 27.8% of the participants chose this option. This item clearly needs to be sent back for revision and clarification before we use it for scored delivery. If we had only looked at the item difficulty statistics, we might never have reviewed this item.
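An option analysis of this kind can be approximated by comparing how often each choice is selected overall and within the top-scoring group. The sketch below is an assumption-laden illustration: the records are invented (they are not the figures from the report), the third option "Florida" is made up, and ties at the 27% cut-off are simply resolved by sort order.

```python
from collections import Counter


def option_shares(choices):
    """Fraction of participants selecting each option."""
    counts = Counter(choices)
    return {option: count / len(choices) for option, count in counts.items()}


def option_analysis(records, upper_fraction=0.27):
    """records: (total_score, chosen_option) pairs, one per participant."""
    ranked = sorted(records, key=lambda record: record[0], reverse=True)
    n_upper = max(1, round(len(ranked) * upper_fraction))
    return {
        "overall": option_shares([choice for _, choice in ranked]),
        "upper": option_shares([choice for _, choice in ranked[:n_upper]]),
    }


# Hypothetical records for 100 participants answering one item.
records = (
    [(15, "Alaska")] * 12
    + [(14, "Maine")] * 15
    + [(10, "Maine")] * 27
    + [(9, "Alaska")] * 16
    + [(8, "Florida")] * 30
)
analysis = option_analysis(records)
for group, shares in analysis.items():
    print(group, {option: round(share, 3) for option, share in shares.items()})
```

A distractor that attracts a larger share of the upper group than of the sample as a whole, as "Alaska" does in this invented data, is a hint that the item may be ambiguous or mis-keyed.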


Item analysis: Selecting items for the test form – Part 1

Posted by Austin Fossey

Regular readers of our blog know that we ran an initial series on item analysis way back in the day, and then I did a second item analysis series building on that a couple of years ago, and then I discussed item analysis in our item development series, and then we had an amazing webinar about item analysis, and then I named my goldfish Item Analysis and wrote my senator requesting that our state bird be changed to an item analysis. So today, I would like to talk about . . . item analysis.

But don’t worry, this is actually a new topic for the blog.

Interested in learning more about item analysis? I will be presenting a series of workshops on this topic at the 2016 Questionmark Conference: Shaping the Future of Assessment in Miami, April 12-15. 

Today, I am writing about the use of item statistics for item selection. I was surprised to find, from feedback we got from many of our webinar participants, that a lot of people do not look at their item statistics until after the test has been delivered. This is a great practice (so keep it up), but if you can try out the questions as unscored field test items before making your final test form, you can use the item analysis statistics to build a better instrument.

When building a test form, item statistics can help us in two ways.

  • They can help us identify items that are poorly written, miskeyed, or irrelevant to the construct.
  • They can help us select the items that will yield the most reliable instrument, and thus a more accurate score.

In the early half of the 20th century, it was common belief that good test instruments should have a mix of easy, medium, and hard items, but this thinking began to change after two studies in 1952 by Fred Lord and by Lee Cronbach and Willard Warrington. These researchers (and others since) demonstrated that items with higher discrimination values create instruments whose total scores discriminate better among participants across all ability levels.

Sometimes easy and hard items are useful for measurement, such as in an adaptive aptitude test where we need to measure all abilities with similar precision. But in criterion-referenced assessments, we are often interested in correctly classifying those participants who should pass and those who should fail. If this is our goal, then the best test form will be one with a range of medium-difficulty items that also have high discrimination values.

Discrimination may be the primary statistic used for selecting items, but item reliability is also occasionally useful, as I explained in an earlier post. Item reliability can be used as a tie breaker when we need to choose between two items with the same discrimination, or it can be used to predict the reliability or score variance for a set of items that the test developer wants to use for a test form.
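To make the idea concrete, the sketch below assumes the common classical test theory definition of the item reliability index: the item's standard deviation multiplied by its item-total correlation. Under that definition the indices of all items add up to the standard deviation of the total scores, which is why they can be used to predict score variance for a candidate set of items. The data and names are hypothetical, and the earlier post may define the statistic slightly differently.

```python
from statistics import pstdev


def item_reliability_indices(responses):
    """Item SD times item-total correlation, for each item.

    Algebraically this equals cov(item, total) / SD(total), which also
    handles items with no variance (their index is 0).
    """
    n = len(responses)
    totals = [sum(row) for row in responses]
    mean_total = sum(totals) / n
    sd_total = pstdev(totals)
    if sd_total == 0:
        raise ValueError("total scores have no variance")
    indices = []
    for i in range(len(responses[0])):
        item = [row[i] for row in responses]
        mean_item = sum(item) / n
        covariance = sum(
            (x - mean_item) * (t - mean_total) for x, t in zip(item, totals)
        ) / n
        indices.append(covariance / sd_total)
    return indices


# Hypothetical data: 6 participants, 4 items scored 1 or 0.
responses = [
    [1, 1, 1, 0],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 1],
    [1, 0, 0, 1],
    [1, 0, 0, 0],
]
indices = item_reliability_indices(responses)
print([round(value, 3) for value in indices])
print(round(sum(indices), 3), round(pstdev([sum(row) for row in responses]), 3))
```

The last line prints the sum of the indices next to the standard deviation of the total scores, and the two values agree.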

Difficulty is still useful for flagging items, though an item flagged for being too easy or too hard will often have a low discrimination value too. If an easy or hard item has good discrimination, it may be worth reviewing for item flaws or other factors that may have impacted the statistics (e.g., was it given at the end of a timed test that did not give participants enough time to respond carefully).

In my next post, I will share an example from the webinar of how item selection using item discrimination improves the test form reliability, even though the test is shorter. I will also share an example of a flawed item that exhibits poor item statistics.


 

 

Field Test Studies: Taking your items for a test drive

Posted by Austin Fossey

In large-scale assessment, a significant amount of work goes into writing items before a participant ever sees them. Items are drafted, edited, reviewed for accuracy, checked for bias, and usually rewritten several times before they are ready to be deployed. Despite all this work, a true test of an item’s performance will come when it is first delivered to participants.

Even though we work so hard to write high-quality items, some bad items may slip past our review committees. To be safe, most large-scale assessment programs will try out their items with a field test.

A field test delivers items to participants under the same conditions used in live testing, but the items do not count toward the participants’ scores. This allows test developers and psychometricians to harvest statistics that can be used in an item analysis to flag poorly performing items.

There are two methods for field testing items. The first method is to embed your new items into an assessment that is already operational. The field test items will not count against the participants’ scores, but the participants will not know which items are scored items and which items are field test items.

The second method is to give participants an assessment that includes only field test items. The participants will not receive a score at the end of the assessment since none of the items have yet been approved to be used for live scoring, though the form may be scored later once the final set of items has been approved for operational use.

In their chapter in Educational Measurement (4th ed.), Schmeiser and Welch explain that embedding the items into an operational assessment is generally preferred. When items are field tested in an operational assessment, participants are more motivated to perform well on the items. The item data are also collected while the operational assessment is being delivered, which can help improve the reliability of the item statistics.

When participants take an assessment that only consists of field test items, they may not be motivated to try as hard as they would in an operational assessment, especially if the assessment will not be scored. However, field testing a whole form’s worth of items will give you better content coverage with the items so that you have more items that can be reviewed in the item analysis. If field testing an entire form, Schmeiser and Welch suggest using twice as many items as you will need for the operational form. Many items may need to be discarded or rewritten as a result of the item analysis, so you want to make sure you will still have enough to build an operational form at the end of the process.

Since the value of field testing items is to collect item statistics, it is also important to make sure that a representative sample of participants responds to the field test items. If the sample of participant responses is too small or not representative, then the item statistics may not be generalizable to the entire population.
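One way to get a feel for the effect of sample size is the standard error of an observed p-value, sqrt(p × (1 - p) / n). The sketch below assumes a simple random sample and says nothing about representativeness, which has to be judged separately; the function name and sample sizes are illustrative.

```python
from math import sqrt


def p_value_standard_error(p, n):
    """Standard error of an observed proportion correct, assuming random sampling."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie between 0 and 1")
    if n <= 0:
        raise ValueError("n must be positive")
    return sqrt(p * (1.0 - p) / n)


for sample_size in (30, 100, 1000):
    se = p_value_standard_error(0.5, sample_size)
    print(f"n = {sample_size}: standard error of p-value = {se:.3f}")
```

The uncertainty shrinks with the square root of the sample size, so item statistics from a few dozen participants are much less stable than those from a field test with hundreds or thousands of participants.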

Questionmark’s authoring solutions allow test developers to field test items by setting the item’s status to “Experimental.” The item will still be scored, and the statistics will be generated in the Item Analysis Report, but the item will not count toward the participant’s final score.


Setting an item’s status to “Experimental” in Questionmark Live so that it can be field tested.
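The scoring rule for field test items can be illustrated with a small sketch. Everything here is hypothetical: the ItemResult class, the "Normal" status label and the item identifiers are invented for illustration and are not Questionmark's data model or API. The point is only that experimental items are still scored individually, so their statistics can be analysed, while being left out of the final score.

```python
from dataclasses import dataclass


@dataclass
class ItemResult:
    item_id: str
    status: str  # "Normal" or "Experimental" (labels assumed for this sketch)
    points: int  # points the participant earned on the item


def final_score(results):
    """Sum points from operational items only."""
    return sum(r.points for r in results if r.status != "Experimental")


def field_test_results(results):
    """Scored responses to experimental items, kept for item analysis."""
    return [r for r in results if r.status == "Experimental"]


# Hypothetical responses from one participant.
results = [
    ItemResult("Q1", "Normal", 1),
    ItemResult("Q2", "Normal", 0),
    ItemResult("Q3", "Experimental", 1),
    ItemResult("Q4", "Normal", 1),
]
print("final score:", final_score(results))
print("field test items:", [r.item_id for r in field_test_results(results)])
```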