Six tips to increase reliability in competence tests and exams

Posted by John Kleeman

Reliability (how consistent an assessment is in measuring something) is a vital criterion on which to judge a test, exam or quiz. This blog post explains what reliability is, why it matters and gives a few tips on how to increase it when using competence tests and exams within regulatory compliance and other work settings.

What is reliability?

An assessment is reliable if it measures the same thing consistently and reproducibly.

If you were to deliver an assessment with high reliability to the same participant on two occasions, you would be very likely to reach the same conclusions about the participant’s knowledge or skills. A test with poor reliability might result in very different scores across the two instances.

It’s useful to think of a kitchen scale. If the scale is reliable, then when you put a bag of flour on it today and the same bag of flour on it tomorrow, it will show the same weight. But if the scale is not working properly and is not reliable, it could give you a different weight each time.

Why does reliability matter?

Just like a kitchen scale that doesn’t work, an unreliable assessment does not measure anything consistently and cannot provide a trustable measure of competence.

As well as reliability, it’s also important that an assessment is valid, i.e. measures what it is supposed to. Continuing the kitchen scale metaphor, a scale might consistently show the wrong weight; in such a case, the scale is reliable but not valid. To learn more about validity, see my earlier post Six tips to increase content validity in competence tests and exams.

How can you increase the reliability of your assessments?

Here are six practical tips to help increase the reliability of your assessment:

  1. Use enough questions to assess competence. Although you need a sensible balance to avoid tests being too long, reliability increases with test length. In their excellent book, Criterion-Referenced Test Development, Shrock and Coscarelli suggest a rule of thumb of 4-6 questions per objective, with more for critical objectives. You can also get guidance from an earlier post on this blog, How many questions do I need on my assessment?
  2. Have a consistent environment for participants. For test results to be consistent, it’s important that the test environment is consistent – try to ensure that all participants have the same amount of time to take the test and a similar environment. For example, if some participants are taking the test in a hurry in a public and noisy place and others are taking it at leisure in their office, this could impact reliability.
  3. Ensure participants are familiar with the assessment user interface. If a participant is new to the user interface or the question types, then they may not show their true competence due to the unfamiliarity. It’s common to provide practice tests to participants to allow them to become familiar with the assessment user interface. This can also reduce test anxiety which also influences reliability.
  4. If using human raters, train them well. If you are using human raters, for example in grading essays or in observational assessments that check practical skills, make sure to define your scoring rules very clearly and as objectively as possible. Train your observers/raters, review their performance, give practice sessions and provide exemplars.
  5. Measure reliability. There are a number of ways of doing this, but the most common is to calculate what is called “Cronbach’s Alpha”, which measures internal consistency reliability (the higher it is, the better); a minimal calculation sketch follows this list. It’s particularly useful if all questions on the assessment measure the same construct. You can easily calculate this for Questionmark assessments using our Test Analysis Report.
  6. Conduct regular item analysis to weed out ambiguous or poor-performing questions. Item analysis is an automated way of flagging weak questions for review and improvement. If questions are developed through sound procedures, well crafted and unambiguously worded, they are more likely to discriminate well and so contribute to a reliable test. Running regular item analysis is the best way to identify poorly performing questions. If you want to learn more about item analysis, I recently gave a webinar on “Item Analysis for Beginners”, and you can access the recording of this here.
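To make tip 5 a little more concrete, here is a minimal sketch of how Cronbach’s Alpha is calculated from a participant-by-item score matrix. The response data are invented for illustration, and this is just the textbook formula, not the calculation engine behind the Test Analysis Report:

```python
# Minimal illustration of Cronbach's Alpha (tip 5); the data are made up.
# Rows = participants, columns = items; 1 = correct, 0 = incorrect.

def cronbachs_alpha(scores):
    """Internal consistency reliability from a participant-by-item score matrix."""
    k = len(scores[0])                                  # number of items
    def variance(values):                               # population variance
        mean = sum(values) / len(values)
        return sum((v - mean) ** 2 for v in values) / len(values)
    item_variances = [variance([row[i] for row in scores]) for i in range(k)]
    total_variance = variance([sum(row) for row in scores])
    return (k / (k - 1)) * (1 - sum(item_variances) / total_variance)

responses = [
    [1, 1, 1, 0, 1],
    [1, 0, 1, 0, 1],
    [0, 0, 1, 0, 0],
    [1, 1, 1, 1, 1],
    [0, 1, 0, 0, 0],
]
print(round(cronbachs_alpha(responses), 2))  # prints 0.74; higher is better
```

Notice how the formula depends on the number of items: adding more good questions (tip 1) generally pushes Alpha higher.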


I hope this blog post reminds you why reliability matters and gives some ideas on how to improve reliability. There is lots more information on how to improve reliability and write better assessments on the Questionmark website – check out our resources at www.questionmark.com/learningresources.

7 actionable steps for making your assessments more trustable

Posted by John Kleeman

Questionmark has recently published a white paper on trustable assessment, and we blog about this topic frequently. See Reliability and validity are the keys to trust and The key to reliability and validity is authoring for some recent blog posts about the white paper.

But what can you do today if you want to make your assessments more trustable? Obviously you can read the white paper! But here are seven actionable steps that, if you’re not doing them already, you could take today or at least reasonably quickly to improve your assessments.

1. Organize questions in an item bank with topic structure

If you are using Questionmark software, you are likely doing this already. But putting questions in an item bank structured by hierarchical topics gives you an easy management view of all questions and assessments under development. It allows you to use the same question in multiple assessments, easily add and retire questions, and easily search questions, for example to find the ones that need updating when laws change or a product is retired.
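As a rough sketch only – the class and field names below are hypothetical and not Questionmark’s data model – here is one way to picture how a topic-structured item bank makes reuse, searching and retirement straightforward:

```python
# Hypothetical item bank keyed by hierarchical topic paths; an illustration of
# why topic structure helps, not Questionmark's implementation.

from dataclasses import dataclass, field

@dataclass
class Item:
    item_id: str
    topic_path: tuple          # e.g. ("Compliance", "Data Privacy")
    stem: str
    status: str = "active"     # "active" or "retired"

@dataclass
class ItemBank:
    items: dict = field(default_factory=dict)

    def add(self, item):
        self.items[item.item_id] = item

    def retire(self, item_id):
        self.items[item_id].status = "retired"

    def search_by_topic(self, *prefix):
        """Find active items whose topic path starts with the given prefix,
        e.g. everything under ("Compliance",) when a law changes."""
        return [i for i in self.items.values()
                if i.status == "active" and i.topic_path[:len(prefix)] == prefix]

bank = ItemBank()
bank.add(Item("Q1", ("Compliance", "Data Privacy"), "Which records may be shared?"))
bank.add(Item("Q2", ("Product", "Model X"), "How do you reset a Model X?"))
bank.retire("Q2")                                               # product retired
print([i.item_id for i in bank.search_by_topic("Compliance")])  # ['Q1']
```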

2. Use questions that apply knowledge in the job context

It is better to ask questions that check how people can apply knowledge in the job context than just to find out whether they have specific knowledge. See my earlier post Test above knowledge: Use scenario questions for some tips on this. If you currently just test on knowledge and not on how to apply that knowledge, make today the day that you start to change!

3. Have your subject matter experts directly involved in authoring

Especially in an area where there is rapid change, you need subject matter experts directly involved in authoring and reviewing questions. Whether you use Questionmark Live or another system, start involving them.

4. Set a pass score fairly

Setting a pass score fairly is critical to being able to trust an assessment’s results. See Is a compliance test better with a higher pass score? and Standard Setting: A Keystone to Legal Defensibility for some starting points on setting good pass scores. And if you don’t think you’re following good practice, start to change.
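If you are looking for a concrete starting point, one widely used approach is an Angoff-style standard-setting study; the sketch below, with made-up judge ratings, shows the basic arithmetic (it is one common method, not the only defensible one):

```python
# Hedged sketch of an Angoff-style standard-setting calculation. Each judge
# estimates, per item, the probability that a minimally competent participant
# answers correctly; the cut score is the average of the judges' summed
# estimates. Ratings below are invented for a 5-item test.

angoff_ratings = {
    "Judge 1": [0.6, 0.8, 0.5, 0.9, 0.7],
    "Judge 2": [0.7, 0.7, 0.4, 0.8, 0.6],
    "Judge 3": [0.5, 0.9, 0.6, 0.9, 0.8],
}

judge_cut_scores = [sum(ratings) for ratings in angoff_ratings.values()]
cut_score = sum(judge_cut_scores) / len(judge_cut_scores)
print(f"Recommended pass score: {cut_score:.1f} out of 5 ({cut_score / 5:.0%})")
```

With these particular ratings the cut score comes out at about 69 percent – close to, but not automatically, the traditional 70.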

5. Use topic scoring and feedback

As Austin Fossey explained in his ground-breaking post Is There Value in Reporting Subscores?, you do need to check whether it is sensible to report topic scores. But in most cases, topic scores and topic feedback can be very useful and actionable – they direct people to where there are problems or where improvement is needed.

6. Define a participant code of conduct

If people cheat, it makes assessment results much less trustable. As I explained in my post What is the best way to reduce cheating?, setting up a participant code of conduct (or honesty code) is an easy and effective way of reducing cheating. What can you do today to encourage your test takers to believe your program is fair and be on your side in reducing cheating?

7. Run item analysis and weed out poor items

This is something that all Questionmark users could do today. Run an item analysis report – it takes just a minute or two from our interfaces – and look at the questions that are flagged as needing review (usually amber or red). Review them to check their appropriateness, and either improve them or retire them from your pool.

Questionmark item analysis report
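If you are curious what sits behind the flags, here is a rough sketch of two classic item statistics – item difficulty (the proportion answering correctly) and item-total discrimination. The data and the review threshold are invented for illustration and are not the rules used by Questionmark’s report:

```python
# Rough sketch of the statistics behind item analysis flags; the data and the
# review threshold are illustrative only.

def item_statistics(scores):
    """scores: participant-by-item matrix of 1 (correct) / 0 (incorrect)."""
    n, k = len(scores), len(scores[0])
    totals = [sum(row) for row in scores]
    stats = []
    for i in range(k):
        item = [row[i] for row in scores]
        rest = [totals[p] - item[p] for p in range(n)]   # total excluding this item
        difficulty = sum(item) / n
        discrimination = correlation(item, rest)
        stats.append((i + 1, difficulty, discrimination))
    return stats

def correlation(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy) if sx and sy else 0.0

responses = [
    [1, 1, 0, 1], [1, 0, 0, 1], [0, 0, 1, 0], [1, 1, 0, 1], [0, 1, 1, 0],
]
for number, difficulty, discrimination in item_statistics(responses):
    flag = "review" if discrimination < 0.2 else "ok"    # illustrative threshold
    print(f"Q{number}: p={difficulty:.2f} r={discrimination:.2f} -> {flag}")
```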


Many of you will probably be doing all the above and more, but I hope that for some of you this post could be a spur to action to make your assessments more trustable. Why not start today?

Is There Value in Reporting Subscores?

Posted by Austin Fossey

The decision to report subscores (reported as Topic Scores in Questionmark’s software) can be a difficult one, and test developers often need to respond to demands from stakeholders who want to bleed as much information out of an instrument as they can. High-stakes test development is lengthy and costly, and the instruments themselves consume and collect a lot of data that can be valuable for instruction or business decisions. It makes sense that stakeholders want to get as much mileage as they can out of the instrument.

It can be anticlimactic when all of the development work results in just one score or a simple pass/fail decision. But that is after all what many instruments are designed to do. Many assessment models assume unidimensionality, so a single score or classification representing the participant’s ability is absolutely appropriate. Nevertheless, organizations often find themselves in the position of trying to wring out more information. What are my participants’ strengths and weaknesses? How effective were my instructors? There are many ways in which people will try to repurpose an assessment.

The question of whether or not to report subscores certainly falls under this category. Test blueprints often organize the instrument around content areas (e.g., Topics), and these lend themselves well to calculating subscores for each of the content areas. From a test user perspective, these scores are easy to interpret, and they are considered valuable because they show content areas where participants perform well or poorly, and because it is believed that this information can help inform instruction.

But how useful are these subscores? In their article, A Simple Equation to Predict a Subscore’s Value, Richard Feinberg and Howard Wainer explain that there are two criteria that must be met to justify reporting a subscore:

  • The subscore must be reliable.
  • The subscore must contain information that is sufficiently different from the information that is contained by the assessment’s total score.

If a subscore (or any score) is not reliable, there is no value in reporting it. The subscore will lack precision, and any decisions made on an unreliable score might not be valid. There is also little value if the subscore does not provide any new information: if the subscores are effectively redundant with the total score, then there is no need to report them. The flip side of the problem is that if subscores do not correlate with the total score, the assessment may not be unidimensional, in which case it may not make sense to report the total score. These are the problems that test developers wrestle with when they lie awake at night.
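As a rough, informal illustration of those two criteria only – this is not Feinberg and Wainer’s equation, which you will find in their article – the sketch below checks some made-up topic scores for reliability and for redundancy with the total score:

```python
# Illustration only: NOT Feinberg and Wainer's equation. It informally checks
# the two criteria for made-up topic scores: is the subscore reliable, and is
# it not simply redundant with the total score?

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy) if sx and sy else 0.0

# Hypothetical data: in practice the reliabilities would come from a report
# such as Questionmark's Test Analysis Report; the values here are made up.
subscores = {"Topic A": [8, 6, 9, 4, 7], "Topic B": [5, 5, 6, 5, 5]}
subscore_reliability = {"Topic A": 0.81, "Topic B": 0.42}
totals = [20, 15, 22, 11, 18]

for topic, scores in subscores.items():
    r_with_total = pearson(scores, totals)
    reliable = subscore_reliability[topic] >= 0.7      # illustrative cut-off
    distinct = abs(r_with_total) < 0.95                # not just the total again
    verdict = "worth considering" if reliable and distinct else "probably not worth reporting"
    print(f"{topic}: reliability={subscore_reliability[topic]}, "
          f"r(total)={r_with_total:.2f} -> {verdict}")
```

In this invented example one topic fails on reliability and the other adds nothing beyond the total score, so neither would be worth reporting.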

Excerpt from Questionmark’s Test Analysis Report showing low reliability of three topic scores.

As you might have guessed from the title of their article, Feinberg and Wainer have proposed a simple, empirically based equation for determining whether or not a subscore should be reported. The equation yields a value that Sandip Sinharay and Shelby Haberman called the Value Added Ratio (VAR). If a subscore on an assessment has a VAR value greater than one, they suggest that this justifies reporting it; subscores with VAR values less than one should not be reported. I encourage interested readers to check out Feinberg and Wainer’s article (which is less than two pages, so you can handle it) for the formula and step-by-step instructions for its application.


Workshop on Test Development Fundamentals: Q&A with Melissa Fein

Posted by Joan Phaup

We will be packing three days of intensive learning and networking into the Questionmark 2014 Users Conference in San Antonio March 4 – 7.

From Bryan Chapman’s keynote on Transforming Open Data into Meaning and Action to case studies, best practice advice, discussions, demos and instruction in the use of Questionmark technologies, there will be plenty to learn!

Even before the conference starts, some delegates will be immersed in pre-conference workshops. This year we’re offering one full-day workshop and two half-day workshops.

Here’s the line-up: Rick Ault’s Boot Camp, Melissa Fein’s workshop on criterion-referenced test development, and Mary Lorenz’s item writing workshop.

Today’s conversation is with Melissa Fein, an industrial-organizational psychology consultant and the author of Test Development: Fundamentals for Certification and Evaluation.

Melissa’s workshop will help participants create effective criterion-referenced tests (CRT). It’s designed for people involved in everything from workplace testing and training program evaluation to certifications and academic testing.

What would you say is the most prevalent misconception about CRT?
…that a passing score should be 70 percent. The cutoff for passing might end up being 70 percent, but that needs to be determined through a standard-setting process. Often people decide on 70 percent simply because 70 percent is traditional.

What is the most important thing to understand about CRT?
It’s crucial to understand how to produce and interpret scores in a way that is fair to all examinees and to those who interpret and use the scores in making decisions, such as hiring people, promoting people, and awarding grades. Scores are imperfect by nature; they will never be perfect. Our goal is to produce quality scores given the limitations that we face.

How does CRT differ in the worlds of workplace testing, training, certification and academic assessment?
The process used to identify testing objectives differs across these contexts. However, there are more similarities than differences in developing CRTs for workplace testing, training, certification and academic assessment. The principles underlying the construction of quality assessments — such as validity, reliability, and standard setting — don’t differ.

When is CRT the most appropriate choice, as opposed to norm-referenced testing?
Anytime test scores are being compared to a standard, you want to use criterion-referenced testing. With norm-referenced tests, you just want to compare one examinee’s scores with another’s. If you had police officers who have to pass fitness standards – maybe they have to run a mile in a certain amount of time – you would use CRT. But if the officers are running a benefit 5K race, that’s norm-referenced. You just want to find out who comes in first, second and third.

I understand you will be covering testing enigmas during the workshop. What do you have in mind?
Testing enigmas reflect best practices that seem to defy common sense until you look more closely. The biggest enigma occurs in standard setting. When most people think of setting standards for certifications, they like to think of a maximally proficient person. When I ask them to think of a minimally competent person, they think I’m pulling the rug out from under them! But in standard setting, you are trying to determine the difference between passing and failing, so you are looking to identify the minimally competent person: you want to define the line that distinguishes the minimally competent person from someone who is not competent.

What do you hope people will take away from their morning with you?
I hope people will walk away with at least one new idea that they can apply to their testing program. I also hope that they walk away knowing that something they are already doing is a good idea – that the workshop validates something they are doing in their test development work. Sometimes we don’t know why we do certain things, so it’s good to get some reassurance.


Click here to read a conversation with Rick Ault about Boot Camp. My next post will be a Q&A with item writing workshop instructor Mary Lorenz.

You will save $100 if you register for the conference by January 30th. You can add a workshop to your conference registration or choose your workshop later.

Putting Theory Into Practice — Item Writing Guide, Part 3

Posted by Doug Peterson

In part 1 of this series we looked at the importance of fairness, validity and reliability in assessment items. In part 2 we looked at the different parts of an item and discussed some basic requirements for writing a good stimulus and good distractors.

Now it’s time to put all of this into practice. I’d like to present some poorly written items, understand what’s wrong with them, and look at how they could be improved. I’ll be the first to admit that these examples tend to be a little over-the-top, but I’ve never been known for my subtlety (!), and a little exaggeration helps make the problems I’m pointing out a little more clear. Let’s start with a simple Yes/No question.

Even if the stimulus of this item didn’t contain nonsense words, it would still be impossible to answer. Why? Because the stimulus basically asks two questions – should you beedle the diggle OR should you zix the frondle.

The stimulus is confusing because it is not clear and concise. Can you only take one of the two actions, and is the question asking which one? Or is it asking if you should take either of the two actions? An item like this is not fair to the test-taker because it doesn’t allow them to display their knowledge. We can fix this item by splitting it out into two questions.

  • Yes or No: When loading your snarkleblaster, should you beedle the diggle?
  • Yes or No: When loading your snarkleblaster, should you zix the frondle?

(And as long as we’re looking at a Yes/No question, bear in mind that a True/False or Yes/No question has a 50% chance of being answered correctly simply by guessing. It’s better to have at least 4 choices to reduce the probability of guessing the correct answer. For more thoughts on this, read my post Are True/False Questions Useless?)
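As a back-of-the-envelope illustration of why the guess rate matters, the sketch below estimates the chance of reaching 7 out of 10 purely by guessing with 2-option versus 4-option items; the test length and pass mark are made up for the example:

```python
# Back-of-the-envelope illustration: probability of scoring at least `pass_mark`
# out of `n_questions` purely by random guessing (binomial distribution),
# comparing 2-option and 4-option items. Test length and pass mark are made up.

from math import comb

def p_pass_by_guessing(n_questions, pass_mark, p_correct):
    return sum(comb(n_questions, k) * p_correct**k * (1 - p_correct)**(n_questions - k)
               for k in range(pass_mark, n_questions + 1))

for options, p in ((2, 0.5), (4, 0.25)):
    chance = p_pass_by_guessing(n_questions=10, pass_mark=7, p_correct=p)
    print(f"{options}-option items: chance of 7/10 by guessing = {chance:.1%}")
```

With 10 two-option questions, roughly one participant in six could reach 7 out of 10 by guessing alone; with four options per question the chance drops to well under one percent.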

Let’s take a look at another one:

You don’t have to be an expert on child-rearing to get the answer to this question. Choices a, c and d are ludicrous, and all that’s left is the correct choice.

This item is not fair to the stakeholders in the testing process because it gives away the answer and doesn’t differentiate between test-takers who have the required knowledge and those who don’t. Make sure all of the distractors are reasonable within the context of the question.

So what might be some plausible, yet still incorrect, distractors?


How about:

  • Engage your child in vigorous exercise to “wear them out”.
  • Raise your voice and reprimand your child if they get out of bed.
  • Discuss evacuation plans in case there is a fire or tornado during the night.

We’ll continue our review of poorly written items in the next post in this series. Until then, feel free to leave your comments below.

On-demand or on-premise: Which is better for talent management?

Posted by John Kleeman

Is it better to run assessment, learning and other talent management software on-demand, in the Cloud? Or is it wiser to run software on-premise, within your organization’s firewall?

I recently wrote about this on the SAP community and received a lot of feedback; I‘d like to share the topic with readers of the Questionmark Blog.

In this post I will share 6 reasons why the Cloud is usually better. And in my next post I’ll give you 4 reasons why it may not be.

1. On-demand gives you access to innovation and use of mobile devices


A critical advantage of on-demand deployment – or software as a service (SaaS) – is that you get the latest version of the software. Most providers upgrade all their customers at the same time to the latest version, and you get bug fixes, feature improvements, security fixes and innovation as part of the service. With on-premise, you are in control of when you install updates. But due to the resources required to upgrade, it’s commonplace to upgrade only once every year or two, and therefore to be several versions behind an on-demand system. Support for the latest mobile devices is an obvious casualty.

To quote Ed Cohen of SuccessFactors:

“If you look at the rate of innovation that can occur with a SaaS product as against a company maintaining a behind the firewall instance of something, it becomes super important for learning and talent.”


2. Deployment is easier with on-demand and allows quick pilots

An on-premise system needs setup of servers and software installation. This takes planning, time and resources, whereas an on-demand system can usually be deployed within hours of ordering it. An on-demand system is also easier to scale up and expand. You can start small with one project and add users or departments as needed.

3. On-demand requires less corporate IT bandwidth

This is often the strongest reason to go on-demand in the learning and assessment space. Corporate IT departments are typically overloaded, and talent management software is not their top priority. This creates a bottleneck, which in turn delays deployment.

On-demand still needs the involvement of corporate IT, but you can usually make headway and provide improved functionality quicker than when deploying on-premise.

4. You don’t need to worry about scalability with on-demand

With an on-premise solution, you have to scale servers to cope with the busiest times (e.g. an end-of-year deadline, exam season or a compliance milestone). But if you use on-demand software, you delegate this to the Cloud provider, who will usually be able to expand to handle your highest load.

5. On-demand is easier to make secure

Both on-premise and on-demand can be very secure, but achieving a high level of security is expensive and involves constant vigilance. Unless your organization invests heavily in security, a Cloud provider will usually deliver higher security than a typical on-premise solution.

This point is well described by SAP’s Prashanth Padmanabhan in his blog article Why Do We Keep Our Valuables In A Bank Locker?. He states:

“… one of the SAP – SuccessFactors Hybrid customers announced publicly that their own security audit found that SuccessFactors cloud infrastructure was more secure than their own fire wall.”

And the respected UK Universities and Colleges Information Systems Association says in its Cloud briefing paper:

“In practice, data is probably more secure in cloud services than can be provided by in house solutions.”

6. On-demand is usually more reliable

Usually, provided your users have good Internet connectivity, an on-demand system will also be more reliable and have higher uptime.

Stylized picture of bridge

Unless you invest heavily in your on-premise infrastructure, a professionally maintained on-demand server is likely to provide a higher level of 24/7 availability and uptime than a locally maintained system. A professional system is likely to have redundancy in every component and will not fail if a piece of hardware fails, whereas it may not be cost-effective to have such redundancy in an on-premise system. Redundancy makes sure, just like in a bridge over a river, that if one piece fails, the rest of the bridge survives.

In a follow-up post, I’ll explain some reasons why on-premise can be better.