Item Analysis for Beginners – Getting Started

Do you use assessments to make decisions about people? If so, you should regularly run Item Analysis on your results. Item Analysis can help find questions that are ambiguous, mis-keyed or that have choices that are rarely chosen. Improving or removing such questions will improve the validity and reliability of your assessment, and so help you use assessment results to make better decisions. If you don’t use Item Analysis, you risk relying on poor questions that make your assessments less accurate.

Sometimes people are wary of Item Analysis because they worry it involves too much statistics. This blog post introduces Item Analysis for people who are unfamiliar with it, and I promise no maths or stats! I’m also giving a free webinar on Item Analysis with the same promise.

An assessment contains many items (another name for questions) as figuratively shown below. You can use Item Analysis to look at how each item performs within the assessment and flag potentially weak items for review. By keeping only stronger questions in the assessment, the assessment will be more effective.

Picture of a series of items with one marked as being weak

Item Analysis looks at the performance of all your participants on the items, and calculates how easy or hard people find each item (“item difficulty” or “p-value”) and how well scores on the item correlate with scores on the assessment as a whole (“item discrimination” or correlation). Some of the problematic questions that Item Analysis can identify are:

  • Questions almost all participants get right, and so which are very easy. You might want to review these to see if they are appropriate for the assessment. See my earlier post Item Analysis for Beginners – When are very Easy or very Difficult Questions Useful? for more information.
  • Questions which are difficult, where a lot of participants get the question wrong. You should check such questions in case they are mis-keyed or ambiguous.
  • Multiple choice questions where some choices are rarely picked. You might want to improve such questions to make the wrong choices more plausible.
  • Questions where there is a poor correlation between getting the question right and doing well on the assessment overall. For example, Item Analysis will flag questions that high-performing participants get wrong. You should look at such questions in case they are ambiguous, mis-keyed or off-topic.
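
To make the two statistics concrete, here is a minimal Python sketch that computes item difficulty (the p-value, i.e. the proportion of participants answering correctly) and item discrimination (the item-total correlation) from a small made-up set of scored responses. The data and names are purely illustrative, not from any real report:

```python
# Hypothetical scored responses: rows = participants, columns = items
# (1 = correct, 0 = wrong).
responses = [
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [1, 1, 1, 1],
    [0, 0, 0, 1],
    [1, 1, 0, 0],
]

n_items = len(responses[0])
totals = [sum(row) for row in responses]  # each participant's total score

def mean(xs):
    return sum(xs) / len(xs)

def pearson(xs, ys):
    """Pearson correlation between two score lists."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy) if sx and sy else 0.0

for i in range(n_items):
    item_scores = [row[i] for row in responses]
    difficulty = mean(item_scores)                 # "p-value": proportion correct
    discrimination = pearson(item_scores, totals)  # item-total correlation
    print(f"Item {i + 1}: p = {difficulty:.2f}, discrimination = {discrimination:.2f}")
```

In a real report the item’s own score is usually excluded from the total (a “corrected” item-total correlation) and distractor statistics are examined too, but the underlying idea is the same.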

There is a huge wealth of information available in an Item Analysis report, and assessment experts will delve into the report in detail. But much of the key information in an Item Analysis report is useful to anyone creating and delivering quizzes, tests and exams.

The Questionmark Item Analysis report includes a graph which plots the difficulty of items against their discrimination, like the example below. It flags questions by marking them amber or red if they fall into categories which may need review. For example, in the illustration below, four questions are marked in amber as having low discrimination, and so are potentially worth looking at.

Illustration of Questionmark item analysis report showing some questions green and some amber

If you are running an assessment program and not using Item Analysis regularly, this throws doubt on the trustworthiness of your results. By using it to identify and improve weak questions, you should be able to improve your validity and reliability.

Item Analysis is surprisingly effective in practice. I’m one of the team responsible at Questionmark for managing our data security test which all employees have to take annually to check their understanding of information security and data protection. We recently reviewed the test and ran Item Analysis and very quickly found a question with poor stats where the technology had changed but we’d not updated the wording, and another question where two of the choices could be considered right, which made it hard to answer. It made our review faster and more effective and helped us improve the quality of the test.

If you want to learn a little more about Item Analysis, I’m running a free webinar on the subject “Item Analysis for Beginners” on May 2nd. You can see details and register for the webinar on the Questionmark website. I look forward to seeing some of you there!


7 actionable steps for making your assessments more trustable

Posted by John Kleeman

Questionmark has recently published a white paper on trustable assessment, and we blog about this topic frequently. See Reliability and validity are the keys to trust and The key to reliability and validity is authoring for some recent blog posts about the white paper.

But what can you do today if you want to make your assessments more trustable? Obviously you can read the white paper! But here are seven actionable steps that, if you’re not doing them already, you could take today or at least reasonably quickly to improve your assessments.

1. Organize questions in an item bank with topic structure

If you are already using Questionmark software, you are likely doing this. Putting questions in an item bank structured by hierarchical topics gives you an easy management view of all questions and assessments under development. It allows you to use the same question in multiple assessments, easily add and retire questions, and easily search questions, for example to find the ones that need updating when laws change or a product is retired.

2. Use questions that apply knowledge in the job context

It is better to ask questions that check how people can apply knowledge in the job context than just to find out whether they have specific knowledge. See my earlier post Test above knowledge: Use scenario questions for some tips on this. If you currently just test on knowledge and not on how to apply that knowledge, make today the day that you start to change!

3. Have your subject matter experts directly involved in authoring

Especially in an area where there is rapid change, you need subject matter experts directly involved in authoring and reviewing questions. Whether you use Questionmark Live or another system, start involving them.

4. Set a pass score fairly

Setting a pass score fairly is critical to being able to trust an assessment’s results. See Is a compliance test better with a higher pass score? and Standard Setting: A Keystone to Legal Defensibility for some starting points on setting good pass scores. And if you don’t think you’re following good practice, start to change.

5. Use topic scoring and feedback

As Austin Fossey explained in his ground-breaking post Is There Value in Reporting Subscores?, you do need to check whether it is sensible to report topic scores. But in most cases, topic scores and topic feedback can be very useful and actionable – they direct people to where there are problems or where improvement is needed.
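
As a sketch of what topic scoring involves, here is how per-topic percentages could be aggregated from individual item scores. The topic names, question IDs and data are invented for illustration:

```python
# Hypothetical item-to-topic mapping and one participant's scored responses
# (1 = correct, 0 = wrong).
topics = {
    "Q1": "Passwords", "Q2": "Passwords",
    "Q3": "Phishing",  "Q4": "Phishing", "Q5": "Phishing",
}
scores = {"Q1": 1, "Q2": 0, "Q3": 1, "Q4": 1, "Q5": 0}

# Aggregate (earned, possible) per topic to drive feedback.
topic_totals = {}
for item, topic in topics.items():
    earned, possible = topic_totals.get(topic, (0, 0))
    topic_totals[topic] = (earned + scores[item], possible + 1)

for topic, (earned, possible) in topic_totals.items():
    print(f"{topic}: {100 * earned / possible:.0f}%")
```

Feedback can then be attached to each topic, directing the participant to the areas where improvement is needed.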

6. Define a participant code of conduct

If people cheat, it makes assessment results much less trustable. As I explained in my post What is the best way to reduce cheating?, setting up a participant code of conduct (or honesty code) is an easy and effective way of reducing cheating. What can you do today to encourage your test takers to believe your program is fair and be on your side in reducing cheating?

7. Run item analysis and weed out poor items

This is something that all Questionmark users could do today. Run an item analysis report (it takes just a minute or two from our interfaces) and look at the questions that are flagged as needing review (usually amber or red). Review them to check appropriateness, and either retire them from your pool or improve them.

Questionmark item analysis report


Many of you will probably be doing all the above and more, but I hope that for some of you this post could be a spur to action to make your assessments more trustable. Why not start today?

The key to reliability and validity is authoring

Posted by John Kleeman

In my earlier post I explained how reliability and validity are the keys to trustable assessments results. A reliable assessment means that it is consistent and a valid assessment means that it measures what you need it to measure.

The key to validity and reliability starts with the authoring process. If you do not have a repeatable, defensible process for authoring questions and assessments, then however good the other parts of your process are, you will not have valid and reliable assessments.

The critical value that Questionmark brings is its structured authoring processes, which enable effective planning, authoring and reviewing of questions and assessments and make them more likely to be valid.

Questionmark’s white paper “Assessment Results You Can Trust” suggests 18 key authoring measures for making trustable assessments – here are three of the most important.

Organize items in an item bank with topic structure

There are huge benefits to using an assessment management system with an item bank that structures items by hierarchical topics as this facilitates:

  • An easy management view of all items and assessments under development
  • Mapping of topics to relevant organizational areas of importance
  • Clear references from items to topics
  • Use of the same item in multiple assessments
  • Simple addition of new items within a topic
  • Easy retiring of items when they are no longer needed
  • Version history maintained for legal defensibility
  • Search capabilities to identify questions that need updating when laws change or a product is retired

Some stand-alone e-Learning creation tools and some LMSs do not provide you with an item bank and require you to insert questions individually within an assessment. If you only have a handful of assessments, or you rarely need to update assessments, such systems can work; but for anyone with more than a few assessments, you need an item bank to be able to make effective assessments.
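
To illustrate the idea (this is a hypothetical sketch, not how Questionmark stores items), an item bank with hierarchical topic paths, retirement flags and tag search might look like this; all names and questions are invented:

```python
from dataclasses import dataclass, field

@dataclass
class Item:
    stem: str
    tags: set = field(default_factory=set)
    retired: bool = False  # retired items stay for audit but drop out of searches

# Hierarchical topic paths act as keys, e.g. "Compliance/Data Protection".
item_bank = {
    "Compliance/Data Protection": [
        Item("Which law governs EU personal data?", {"GDPR", "law"}),
        Item("How long may backups be retained?", {"retention"}),
    ],
    "Compliance/Security": [
        Item("What makes a strong password?", {"passwords"}),
    ],
}

def search(bank, tag):
    """Find active items carrying a tag, e.g. to update them when a law changes."""
    return [(topic, item.stem)
            for topic, items in bank.items()
            for item in items
            if tag in item.tags and not item.retired]

print(search(item_bank, "GDPR"))
```

The payoff of structuring items this way is exactly the list above: one search finds every question affected by a change, wherever it is used.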

An authoring tool subject matter experts can use directly

One of the critical factors in making successful items is to get effective input from subject matter experts (SMEs), as they are usually more knowledgeable and better able to construct and review questions than learning technology specialists or general trainers.

If you can use a system like Questionmark Live to harvest or “crowdsource” items from SMEs and have learning or assessment specialists review them, your items will be of better quality.

Easy collaboration for item reviewers to help make items more valid

Items will be more valid if they have been properly reviewed. They will also be more defensible if the past changes are auditable. A track-changes capability, like that shown in the example screenshot below, is invaluable to aid the review process. It allows authors to see what changes are being proposed and to check they make sense.

Screenshot of track changes functionality in Questionmark Live

These three capabilities (an item bank, an authoring tool SMEs can access directly, and easy collaboration with “track changes”) are critical for obtaining reliable, valid and therefore trustable assessments.

For more information on how to make trustable assessments, see our white paper “Assessment Results You can Trust”.

How much do you know about defensible assessments?

Posted by Julie Delazyn

This quiz is a re-post from a very popular blog entry published by John Kleeman.

Readers told us that it was instructive and engaging to take quizzes on using assessments, and we like to listen to you! So here is the second quiz in a previously published series of quizzes on assessment topics. This one was authored in conjunction with Neil Bachelor of Pure Questions. You can see the first quiz on Cut Scores here.

As always, we regard resources like this quiz as a way of contributing to the ongoing process of learning about assessment. In that spirit, please enjoy the quiz below and feel free to comment if you have any suggestions to improve the questions or the feedback.

Is a longer test likely to be more defensible than a shorter one? Take the quiz and find out. Be sure to look for your feedback after you have completed it!

NOTE: Some people commented on the first quiz that they were surprised to lose marks for getting questions wrong. This quiz uses True/False questions and it is easy to guess at answers, so we’ve set it to subtract a point for each question you get wrong, to illustrate that this is possible. Negative scoring like this encourages you to answer “Don’t Know” rather than guess; this is particularly helpful in diagnostic tests where you want participants to be as honest as possible about what they do or don’t think they know.
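
The scoring rule described in the note can be sketched as follows: +1 for a correct answer, -1 for a wrong answer, and 0 for “Don’t Know”. The answers and key below are invented for illustration:

```python
def score_response(answer, key):
    """Negative scoring: +1 correct, -1 wrong, 0 for an honest 'Don't Know'."""
    if answer == "Don't Know":
        return 0
    return 1 if answer == key else -1

answers = ["True", "Don't Know", "False", "True"]
key     = ["True", "False",      "True",  "True"]

total = sum(score_response(a, k) for a, k in zip(answers, key))
print(total)  # 1 + 0 - 1 + 1 = 1
```

With this rule, a participant who guesses blindly on a True/False question expects to gain nothing on average, which is why it discourages guessing.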

Reliability and validity are the keys to trust


Not reliable

John Kleeman HeadshotPosted by John Kleeman

How can you trust assessment results? The two keys are reliability and validity.

Reliability explained

An assessment is reliable if it measures the same thing consistently and reproducibly. If you were to deliver an assessment with high reliability to the same participant on two occasions, you would be very likely to reach the same conclusions about the participant’s knowledge or skills. A test with poor reliability might result in very different scores across the two instances.

An unreliable assessment does not measure anything consistently and cannot be used for any trustable measure of competency. It is useful visually to think of a dartboard: in the diagram to the right, darts have landed all over the board; they are not reliably in any one place.

In order for an assessment to be reliable, there needs to be a predictable authoring process, effective beta testing of items, trustworthy delivery to all the devices used to give the assessment, good-quality post-assessment reporting and effective analytics.
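
As a rough illustration of the idea (not how reliability coefficients are calculated in practice), you could correlate scores from two administrations of the same test to the same participants; a high correlation suggests consistent measurement. The scores below are invented:

```python
def pearson(xs, ys):
    """Pearson correlation between two score lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Hypothetical scores for five participants on two administrations.
first_attempt  = [72, 85, 60, 90, 78]
second_attempt = [70, 88, 62, 91, 75]

print(f"test-retest correlation: {pearson(first_attempt, second_attempt):.2f}")
```

In practice, single-administration measures such as internal-consistency coefficients are more commonly reported, since re-testing the same people is rarely feasible.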

Validity explained


Reliable but not valid

Being reliable is not good enough on its own. The darts in the dartboard in the figure to the right are in the same place, but not in the right place. A test can be reliable but not measure what it is meant to measure. For example, you could have a reliable assessment that tested for skill in word processing, but this would not be valid if used to test machine operators, as writing is not one of the key tasks in their jobs.

An assessment is valid if it measures what it is supposed to measure. So if you are measuring competence in a job role, a valid assessment must align with the knowledge, skills and abilities required to perform the tasks expected of a job role. In order to show that an assessment is valid, there must be some formal analysis of the tasks in a job role and the assessment must be structured to match those tasks. A common method of performing such analysis is a job task analysis, which surveys subject matter experts or people in the job role to identify the importance of different tasks.

Assessments must be reliable AND valid

Trustable assessments must be reliable AND valid.


Reliable and valid

The darts in the figure to the right are in the same place and at the right place.

When you are constructing an assessment for competence, you are looking for it to consistently measure the competence required for the job.

 Comparison with blood tests

It is helpful to consider what happens if you go to the doctor with an illness. The doctor goes through a process of discovery, analysis, diagnosis and prescription. As part of the discovery process, sometimes the doctor will order a blood test to identify if a particular condition is present, which can diagnose the illness or rule out a diagnosis.

It takes time and resources to do a blood test, but it can be an invaluable piece of information. A great deal of effort goes into making sure that blood tests are both reliable (consistent) and valid (measure what they are supposed to measure). For example, just like exam results, blood samples are labelled carefully, as shown in the picture, to ensure that patient identification is retained.

A blood test that was not reliable would be dangerous—a doctor might think that a disease is not present when it is. Furthermore, a reliable blood test used for the wrong purpose is not useful—for example, there is no point in having a test for blood glucose level if the doctor is trying to see if a heart attack is imminent.

The blood test results are a single piece of information that helps the doctor make the diagnosis in conjunction with other data from the doctor’s discovery process.

In exactly the same way, a test of competence is an important piece of information to determine if someone is competent in their job role.

Using the blood test metaphor, it is easy to understand the personnel and organizational risks that can result from making decisions based on untrustworthy results. If an organization assesses someone’s knowledge, skill or competence for health and safety or regulatory compliance purposes, it needs to ensure the assessments are designed correctly and run consistently, which means they must be reliable and valid.

For assessments to be reliable and valid, it is necessary that you follow structured processes at each step from planning through authoring to delivery and reporting. These processes are explained in our new white paper “Assessment Results You can Trust” and I’ll be sharing some of the content in future articles in this blog.

For fuller information, you can download the white paper.

Standard Setting: A Keystone to Legal Defensibility

Posted by Austin Fossey

Since the last Questionmark Users Conference, I have heard several clients discuss new measures at their companies requiring them to provide evidence of the legal defensibility of their assessments. Legal defensibility and validity are closely intertwined, but they are not synonymous. An assessment can be legally defensible, yet still have flaws that impact its validity. The distinction between the two is often the difference between how you developed the instrument versus how well you developed the instrument.

Regardless of whether you are concerned with legal defensibility or validity, careful attention should be paid to the evaluative component of your assessment program. What if someone asks, “What does this score mean?” How do you answer? How do you justify your response? The answers to these questions impact how your stakeholders will interpret and use the results, and this may have consequences for your participants. Many factors go into supporting the legal defensibility and validity of assessment results, but one could argue that the keystone is the standard-setting process.

Standard setting is the process of dividing score scales so that scores can be interpreted and actioned (AERA, APA, NCME, 2014). The dividing points between sections of the scales are called “cut scores,” and in criterion-referenced assessment, they typically correspond to performance levels that are defined a priori. These cut scores and their corresponding performance levels help test users make the cognitive leap from a participant’s response pattern to what can be a complex inference about the participant’s knowledge, skills, and abilities.
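
As a simple sketch of how cut scores divide a score scale, here is one way a raw score could be mapped to a performance level. The cut scores and level names are invented for illustration, not taken from any standard-setting study:

```python
from bisect import bisect_right

# Each cut score is the minimum score needed to reach the next level up.
cut_scores = [50, 70, 85]
levels = ["Fail", "Pass", "Merit", "Distinction"]

def classify(score):
    """Map a raw score to its performance level via the cut scores."""
    return levels[bisect_right(cut_scores, score)]

for s in (42, 70, 90):
    print(s, classify(s))
```

The hard part of standard setting is not this mapping but justifying where the cut scores sit, which is what methods such as Angoff or Bookmark studies are for.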

In their chapter in Educational Measurement (4th Ed.), Hambleton and Pitoniak explain that standard-setting studies need to consider many factors, and that they also can have major implications for participants and test users. For this reason, standard-setting studies are often rigorous, well-documented projects.

At this year’s Questionmark Users Conference, I will be delivering a session that introduces the basics of standard setting. We will discuss standard-setting methods for criterion-referenced and norm-referenced assessments, and we will touch on methods used in both large-scale assessments and in classroom settings. This will be a useful session for anyone who is working on documenting the legal defensibility of their assessment program or who is planning their first standard-setting study and wants to learn about different methods that are available. Participants are encouraged to bring their own questions and stories to share with the group.

Register today for the full conference, but if you cannot make it, make sure to catch the live webcast!