Item Analysis for Beginners – Getting Started

Do you use assessments to make decisions about people? If so, you should regularly run Item Analysis on your results. Item Analysis helps find questions that are ambiguous, mis-keyed or have answer choices that are rarely chosen. Improving or removing such questions improves the validity and reliability of your assessment, which in turn helps you use assessment results to make better decisions. If you don't use Item Analysis, you risk relying on poor questions that make your assessments less accurate.

Sometimes people can be fearful of Item Analysis because they are worried it involves too much statistics. This blog post introduces Item Analysis for people who are unfamiliar with it, and I promise no maths or stats! I’m also giving a free webinar on Item Analysis with the same promise.

An assessment contains many items (another name for questions), as shown schematically below. You can use Item Analysis to look at how each item performs within the assessment and to flag potentially weak items for review. By keeping only the stronger questions, the assessment becomes more effective.

Picture of a series of items with one marked as being weak

Item Analysis looks at the performance of all your participants on the items. It calculates how easy or hard people find each item ("item difficulty" or "p-value") and how well scores on each item correlate with scores on the assessment as a whole ("item discrimination", an item-total correlation). For anyone curious, there is a short code sketch after the list below showing how these two statistics can be calculated. Some of the problematic questions Item Analysis can identify are:

  • Questions that almost all participants get right, and so are very easy. You might want to review these to check they are appropriate for the assessment. See my earlier post Item Analysis for Beginners – When are very Easy or very Difficult Questions Useful? for more information.
  • Questions which are difficult, where a lot of participants get the question wrong. You should check such questions in case they are mis-keyed or ambiguous.
  • Multiple choice questions where some choices are rarely picked. You might want to improve such questions to make the wrong choices more plausible.
  • Questions where there is a poor correlation between getting the question right and doing well on the assessment overall, for example questions that high-performing participants tend to get wrong. You should look at such questions in case they are ambiguous, mis-keyed or off-topic.
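
Despite my no-maths promise, here is an entirely optional sketch for the curious, showing roughly how those two statistics could be computed from a table of right/wrong item scores. This is my own illustration in Python with made-up data, not Questionmark's implementation.

```python
import numpy as np

# Made-up data: rows are participants, columns are items; 1 = correct, 0 = incorrect.
scores = np.array([
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [1, 1, 1, 1],
    [0, 1, 0, 0],
    [1, 1, 0, 1],
])

# Item difficulty (p-value): proportion of participants who answered each item correctly.
p_values = scores.mean(axis=0)

# Item discrimination: correlation between scores on one item and scores on the
# rest of the assessment (excluding that item, so it is not correlated with itself).
def item_discrimination(scores, item):
    rest_of_test = scores.sum(axis=1) - scores[:, item]
    return np.corrcoef(scores[:, item], rest_of_test)[0, 1]

for i in range(scores.shape[1]):
    print(f"Item {i + 1}: difficulty = {p_values[i]:.2f}, "
          f"discrimination = {item_discrimination(scores, i):.2f}")
```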

There is a wealth of information in an Item Analysis report, and assessment experts will delve into it in detail. But much of the key information is useful to anyone creating and delivering quizzes, tests and exams.

The Questionmark Item Analysis report includes a graph which plots the difficulty of items against their discrimination, like the example below. It flags questions by marking them amber or red if they fall into categories which may need review. For example, in the illustration below, four questions are marked in amber as having low discrimination and so may be worth looking at.

Illustration of Questionmark item analysis report showing some questions green and some amber
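
As a rough illustration of that kind of flagging, a rule might look something like the sketch below. The thresholds here are purely illustrative, chosen by me for this example; they are not Questionmark's actual rules.

```python
def flag_item(p_value, discrimination):
    """Illustrative traffic-light flagging of an item; thresholds are examples only."""
    if discrimination < 0.0:
        return "red"    # negative discrimination often means a mis-keyed or ambiguous item
    if discrimination < 0.15 or p_value > 0.95 or p_value < 0.25:
        return "amber"  # low discrimination or an extreme p-value is worth a review
    return "green"      # nothing obviously wrong

print(flag_item(p_value=0.90, discrimination=0.05))  # amber: low discrimination
print(flag_item(p_value=0.60, discrimination=0.35))  # green
```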

If you are running an assessment program and are not using Item Analysis regularly, there is reason to doubt the trustworthiness of your results. By using it to identify and improve weak questions, you should be able to improve your validity and reliability.

Item Analysis is surprisingly effective in practice. I'm one of the team at Questionmark responsible for managing our data security test, which all employees take annually to check their understanding of information security and data protection. When we recently reviewed the test, we ran Item Analysis and very quickly found a question with poor statistics where the technology had changed but we hadn't updated the wording, and another question where two of the choices could be considered right, which made it hard to answer. Item Analysis made our review faster and more effective and helped us improve the quality of the test.

If you want to learn a little more about Item Analysis, I’m running a free webinar on the subject “Item Analysis for Beginners” on May 2nd. You can see details and register for the webinar at https://www.questionmark.com/questionmark_webinars. I look forward to seeing some of you there!

 

Psychometrics 101: Sample size and question difficulty (p-values)


Posted by Greg Pope

With just a week to go before the Questionmark Users Conference, here's a little taste of the presentation I will be doing on psychometrics. I will also be running a session on Item Analysis and Test Analysis.

So, let’s talk about sample size and question difficulty!

How does the number of participants that take a question relate to the robustness and stability of the question difficulty statistic (p-value)? Basically, the smaller the number of participants tested, the less robust and stable the statistic. So if 30 participants take a question and the p-value that appears in the Item Analysis Report is 0.600, the range into which the theoretical "true" p-value (the value you would see if all participants in the world took the question) could fall 95% of the time is between 0.425 and 0.775. This means that if another 30 participants were tested, you could get a p-value on the Item Analysis Report anywhere from 0.425 to 0.775 (a 95% confidence range). The takeaway is that if high-stakes decisions are being made using p-values (e.g., whether to drop a question from a certification exam), the more participants that can be tested the better, as the results will be more robust. Similarly, if you are conducting beta testing and want to know which questions to include in your test form based on the beta test results, the more participants you can beta test, the more confidence you will have in the stability of the statistics. Below is a graph that illustrates this relationship.

Chart showing how sample size influences the confidence range around a p-value
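
To make the arithmetic concrete, here is a small sketch that reproduces the 0.425 to 0.775 range quoted above using the normal approximation to the binomial. That approximation is my assumption for illustration; the report itself may compute the range differently.

```python
import math

def p_value_confidence_range(p, n, z=1.96):
    """Approximate 95% confidence range for an observed p-value from n participants."""
    half_width = z * math.sqrt(p * (1 - p) / n)
    return p - half_width, p + half_width

low, high = p_value_confidence_range(0.600, 30)
print(f"n = 30:  {low:.3f} to {high:.3f}")   # roughly 0.425 to 0.775

# Testing ten times as many participants narrows the range considerably.
low, high = p_value_confidence_range(0.600, 300)
print(f"n = 300: {low:.3f} to {high:.3f}")   # roughly 0.545 to 0.655
```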

This relationship between sample size and stability applies to other common statistics used in psychometrics. For example, the item-total correlation (point-biserial correlation coefficient) can vary a great deal when small sample sizes are used to calculate it. In the example below, we see that an observed correlation of 0 can actually vary by over 0.8 (plus or minus).

Chart showing how sample size influences the confidence range around an item-total correlation
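
Here is a similar sketch for the correlation case, using the common Fisher z approximation. That is again my own choice for illustration, so the exact numbers will differ from the chart, but it shows how wide the 95% range around an observed correlation of 0 becomes as the sample shrinks.

```python
import math

def correlation_confidence_range(r, n, z=1.96):
    """Approximate 95% confidence range for an observed correlation via Fisher's z."""
    z_r = math.atanh(r)                   # transform r to Fisher z
    half_width = z / math.sqrt(n - 3)     # standard error of z is 1 / sqrt(n - 3)
    return math.tanh(z_r - half_width), math.tanh(z_r + half_width)

for n in (10, 30, 100, 500):
    low, high = correlation_confidence_range(0.0, n)
    print(f"n = {n:3d}: observed r = 0.0 could plausibly lie between {low:+.2f} and {high:+.2f}")
```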