7 actionable steps for making your assessments more trustable

Posted by John Kleeman

Questionmark has recently published a white paper on trustable assessment, and we blog about this topic frequently. See Reliability and validity are the keys to trust and The key to reliability and validity is authoring for some recent blog posts about the white paper.

But what can you do today if you want to make your assessments more trustable? Obviously you can read the white paper! But here are seven actionable steps that, if you're not taking them already, you could start today (or at least reasonably quickly) to improve your assessments.

1. Organize questions in an item bank with topic structure

If you are using Questionmark software, you are likely doing this already. But putting questions in an item bank structured by hierarchical topics gives you an easy management view of all questions and assessments under development. It lets you use the same question in multiple assessments, add and retire questions easily, and search questions quickly, for example to find the ones that need updating when laws change or a product is retired.

2. Use questions that apply knowledge in the job context

It is better to ask questions that check how people can apply knowledge in the job context than just to find out whether they have specific knowledge. See my earlier post Test above knowledge: Use scenario questions for some tips on this. If you currently just test on knowledge and not on how to apply that knowledge, make today the day that you start to change!

3. Have your subject matter experts directly involved in authoring

Especially in an area where there is rapid change, you need subject matter experts directly involved in authoring and reviewing questions. Whether you use Questionmark Live or another system, start involving them.

4. Set a pass score fairly

Setting a pass score fairly is critical to being able to trust an assessment’s results. See Is a compliance test better with a higher pass score? and Standard Setting: A Keystone to Legal Defensibility for some starting points on setting good pass scores. And if you don’t think you’re following good practice, start to change.

5. Use topic scoring and feedback

As Austin Fossey explained in his ground-breaking post Is There Value in Reporting Subscores?, you do need to check whether it is sensible to report topic scores. But in most cases, topic scores and topic feedback can be very useful and actionable – they direct people to where there are problems or where improvement is needed.

6. Define a participant code of conduct

If people cheat, it makes assessment results much less trustable. As I explained in my post What is the best way to reduce cheating?, setting up a participant code of conduct (or honesty code) is an easy and effective way of reducing cheating. What can you do today to encourage your test takers to believe your program is fair and be on your side in reducing cheating?

7. Run item analysis and weed out poor items

This is something that all Questionmark users could do today. Run an item analysis report (it takes just a minute or two from our interfaces) and look at the questions that are flagged as needing review (usually amber or red). Review them to check their appropriateness, and either retire them from your pool or improve them.

Questionmark item analysis report
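For a feel of what an item analysis computes under the hood, here is a minimal classical-test-theory sketch (illustrative only; not Questionmark's actual report logic, and the response data and flagging thresholds are made up) that flags questions by difficulty and item-total discrimination:

```python
# Minimal classical item analysis sketch (illustrative only, not
# Questionmark's implementation). Rows = participants, columns = items;
# 1 = correct, 0 = incorrect. All data here is hypothetical.
responses = [
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [1, 1, 1, 1],
    [0, 0, 0, 1],
    [1, 1, 0, 1],
]

def pearson(x, y):
    """Pearson correlation of two equal-length score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy) if sx and sy else 0.0

def item_analysis(matrix):
    n = len(matrix)
    totals = [sum(row) for row in matrix]
    results = []
    for j in range(len(matrix[0])):
        item = [row[j] for row in matrix]
        p = sum(item) / n  # difficulty: proportion answering correctly
        # Discrimination: correlation between the item and the rest-of-test
        # score (item removed so it doesn't inflate the correlation).
        rest = [totals[i] - item[i] for i in range(n)]
        disc = pearson(item, rest)
        # Hypothetical flagging rule: too easy, too hard, or low discrimination.
        flag = "review" if p < 0.2 or p > 0.9 or disc < 0.2 else "ok"
        results.append((j, round(p, 2), round(disc, 2), flag))
    return results

for item_id, p, disc, flag in item_analysis(responses):
    print(f"item {item_id}: difficulty={p} discrimination={disc} {flag}")
```

In this toy data, the last item is answered correctly by everyone, so it gets flagged: it tells you nothing about who knows the material.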


Many of you will probably be doing all the above and more, but I hope that for some of you this post could be a spur to action to make your assessments more trustable. Why not start today?

Is There Value in Reporting Subscores?

Posted by Austin Fossey

The decision to report subscores (reported as Topic Scores in Questionmark’s software) can be a difficult one, and test developers often need to respond to demands from stakeholders who want to bleed as much information out of an instrument as they can. High-stakes test development is lengthy and costly, and the instruments themselves consume and collect a lot of data that can be valuable for instruction or business decisions. It makes sense that stakeholders want to get as much mileage as they can out of the instrument.

It can be anticlimactic when all of the development work results in just one score or a simple pass/fail decision. But that is, after all, what many instruments are designed to do. Many assessment models assume unidimensionality, so a single score or classification representing the participant's ability is absolutely appropriate. Nevertheless, organizations often find themselves in the position of trying to wring out more information. What are my participants' strengths and weaknesses? How effective were my instructors? There are many ways in which people will try to repurpose an assessment.

The question of whether or not to report subscores certainly falls under this category. Test blueprints often organize the instrument around content areas (e.g., Topics), and these lend themselves well to calculating subscores for each of the content areas. From a test user perspective, these scores are easy to interpret, and they are considered valuable because they show content areas where participants perform well or poorly, and because it is believed that this information can help inform instruction.

But how useful are these subscores? In their article, A Simple Equation to Predict a Subscore’s Value, Richard Feinberg and Howard Wainer explain that there are two criteria that must be met to justify reporting a subscore:

  • The subscore must be reliable.
  • The subscore must contain information that is sufficiently different from the information that is contained by the assessment’s total score.

If a subscore (or any score) is not reliable, there is no value in reporting it. The subscore will lack precision, and any decisions made on an unreliable score might not be valid. There is also little value if the subscore does not provide any new information. If the subscores are effectively redundant to the total score, then there is no need to report them. The flip side of the problem is that if subscores do not correlate with the total score, then the assessment may not be unidimensional, and then it may not make sense to report the total score. These are the problems that test developers wrestle with when they lie awake at night.
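The two criteria above can be checked empirically. Here is a rough sketch (with made-up response data; this is not Feinberg and Wainer's VAR formula) of estimating a topic's subscore reliability with Cronbach's alpha and its redundancy with the total score via correlation:

```python
# Sketch of the two subscore criteria: reliability (Cronbach's alpha)
# and redundancy with the total score (correlation). Hypothetical data;
# NOT the Feinberg-Wainer VAR computation.
import math
import statistics

def cronbach_alpha(items):
    """Reliability estimate. items: one list of scores per item,
    all over the same participants."""
    k = len(items)
    totals = [sum(scores) for scores in zip(*items)]
    total_var = statistics.pvariance(totals)
    if k < 2 or total_var == 0:
        return 0.0
    item_var_sum = sum(statistics.pvariance(col) for col in items)
    return (k / (k - 1)) * (1 - item_var_sum / total_var)

def pearson(x, y):
    mx, my = statistics.mean(x), statistics.mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / math.sqrt(
        sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y)
    )

# Hypothetical 0/1 responses: 6 participants; topic A has 4 items,
# topic B has 3 items. Each inner list is one item's scores.
topic_a = [[1, 1, 0, 1, 0, 1],
           [1, 1, 0, 1, 0, 0],
           [1, 0, 0, 1, 1, 1],
           [1, 1, 0, 1, 0, 1]]
topic_b = [[1, 1, 1, 0, 1, 0],
           [1, 0, 1, 0, 1, 1],
           [0, 1, 1, 0, 1, 0]]

subscore_a = [sum(s) for s in zip(*topic_a)]
total = [sum(s) for s in zip(*(topic_a + topic_b))]

print(f"Topic A alpha: {cronbach_alpha(topic_a):.2f}")               # criterion 1
print(f"Topic A vs total score r: {pearson(subscore_a, total):.2f}")  # criterion 2
```

A low alpha would argue against reporting the topic score at all, while a correlation near 1.0 would suggest the subscore adds little beyond the total score.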

Excerpt from Questionmark’s Test Analysis Report showing low reliability of three topic scores.

As you might have guessed from the title of their article, Feinberg and Wainer have proposed a simple, empirically based equation for determining whether or not a subscore should be reported. The equation yields a value that Sandip Sinharay and Shelby Haberman called the Value Added Ratio (VAR). If a subscore on an assessment has a VAR value greater than one, they suggest that reporting it is justified; subscores with VAR values less than one should not be reported. I encourage interested readers to check out Feinberg and Wainer's article (which is less than two pages, so you can handle it) for the formula and step-by-step instructions for its application.


Workshop on Test Development Fundamentals: Q&A with Melissa Fein

Posted by Joan Phaup

We will be packing three days of intensive learning and networking into the Questionmark 2014 Users Conference in San Antonio March 4 – 7.

From Bryan Chapman’s keynote on Transforming Open Data into Meaning and Action to case studies, best practice advice, discussions, demos and instruction in the use of Questionmark technologies, there will be plenty to learn!

Even before the conference starts, some delegates will be immersed in pre-conference workshops. This year we're offering one full-day workshop and two half-day workshops.


Today’s conversation is with Melissa Fein, an industrial-organizational psychology consultant and the author of Test Development: Fundamentals for Certification and Evaluation.

Melissa’s workshop will help participants create effective criterion-referenced tests (CRT). It’s designed for people involved in everything from workplace testing and training program evaluation to certifications and academic testing.

What would you say is the most prevalent misconception about CRT?
…that a passing score should be 70 percent. The cutoff for passing might end up being 70 percent, but that needs to be determined through a standard-setting process. Often people decide on 70 percent simply because it is traditional.

What is the most important thing to understand about CRT?
It’s crucial to understand how to produce and interpret scores in a way that is fair to all examinees and to those who interpret and use the scores in making decisions, such as hiring people, promoting people, and awarding grades. Scores are imperfect by nature; our goal is to produce quality scores given the limitations that we face.

How does CRT differ in the worlds of workplace testing, training, certification and academic assessment?
The process used to identify testing objectives differs for these different contexts. However, there are more similarities than differences in developing CRTs for workplace testing, training, certification and academic assessment. The principles underlying the construction of quality assessments, such as validity, reliability, and standard setting, don’t differ.

When is CRT the most appropriate choice, as opposed to norm-referenced testing?
Anytime test scores are being compared to a standard, you want to use criterion-referenced testing. With norm-referenced tests, you just want to compare one examinee’s scores with another’s. If you had police officers who have to pass fitness standards (maybe they have to run a mile in a certain amount of time), you would use CRT. But if the officers are running a benefit 5K race, that’s norm-referenced. You just want to find out who comes in first, second and third.

I understand you will be covering testing enigmas during the workshop. What do you have in mind?
Testing enigmas reflect best practices that seem to defy common sense until you look more closely. The biggest enigma occurs in standard setting. When most people think of setting standards for certifications, they like to think of a maximally proficient person. When I ask them to think of a minimally competent person, they think I’m pulling the rug out from under them! But in standard setting, you are trying to determine the difference between passing and failing, so you are looking to identify the minimally competent person: you want to define the line that distinguishes the minimally competent person from someone who is not competent.
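To make that "minimally competent person" idea concrete, here is a sketch of how a simple Angoff-style cut-score calculation might look. The judge ratings are hypothetical and the numbers are invented for illustration; they are not from Melissa's workshop:

```python
# Sketch of a simple Angoff-style standard-setting calculation
# (hypothetical ratings). Each judge estimates, per item, the
# probability that a *minimally competent* candidate answers correctly.
angoff_ratings = {
    "judge_1": [0.60, 0.75, 0.40, 0.85, 0.55],
    "judge_2": [0.65, 0.70, 0.50, 0.80, 0.60],
    "judge_3": [0.55, 0.80, 0.45, 0.90, 0.50],
}

# Each judge's recommended raw cut score is the sum of their ratings;
# the panel's cut score is the average across judges.
judge_cuts = [sum(ratings) for ratings in angoff_ratings.values()]
cut_score = sum(judge_cuts) / len(judge_cuts)
n_items = len(next(iter(angoff_ratings.values())))

print(f"Recommended cut score: {cut_score:.2f} of {n_items} items "
      f"({100 * cut_score / n_items:.0f}%)")
```

Notice that the panel's judgment about a minimally competent candidate, not tradition, determines the cut score, and it need not come out at 70 percent.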

What do you hope people will take away from their morning with you?
I hope people will walk away with at least one new idea that they can apply to their testing program. I also hope that they walk away knowing that something they are already doing is a good idea – that the workshop validates something they are doing in their test development work. Sometimes we don’t know why we do certain things, so it’s good to get some reassurance.

Click here to read a conversation with Rick Ault about Boot Camp. My next post will be a Q&A with item writing workshop instructor Mary Lorenz.

You will save $100 if you register for the conference by January 30th. You can add a workshop to your conference registration or choose your workshop later.

Putting Theory Into Practice — Item Writing Guide, Part 3

Posted by Doug Peterson

In part 1 of this series we looked at the importance of fairness, validity and reliability in assessment items. In part 2 we looked at the different parts of an item and discussed some basic requirements for writing a good stimulus and good distractors.

Now it’s time to put all of this into practice. I’d like to present some poorly written items, examine what’s wrong with them, and look at how they could be improved. I’ll be the first to admit that these examples tend to be a little over-the-top, but I’ve never been known for my subtlety (!), and a little exaggeration makes the problems I’m pointing out a little clearer. Let’s start with a simple Yes/No question.

Even if the stimulus of this item didn’t contain nonsense words, it would still be impossible to answer. Why? Because the stimulus asks two questions: should you beedle the diggle, or should you zix the frondle?

The stimulus is confusing because it is not clear and concise. Can you only take one of the two actions, and is the question asking which one? Or is it asking if you should take either of the two actions? An item like this is not fair to the test-taker because it doesn’t allow them to display their knowledge. We can fix this item by splitting it out into two questions.

  • Yes or No: When loading your snarkleblaster, should you beedle the diggle?
  • Yes or No: When loading your snarkleblaster, should you zix the frondle?

(And as long as we’re looking at a Yes/No question, bear in mind that a True/False or Yes/No question has a 50% chance of being answered correctly simply by guessing. It’s better to have at least 4 choices to reduce the probability of guessing the correct answer. For more thoughts on this, read my post Are True/False Questions Useless?)
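To put rough numbers on that guessing risk, here is a quick sketch (the 70% pass mark and 10-item test length are hypothetical) comparing the chance of passing by pure guessing with two-option versus four-option items:

```python
# Probability of passing a test by pure random guessing, modeled as a
# binomial distribution. Pass mark and test length are hypothetical.
import math

def prob_pass_by_guessing(n_items, n_choices, pass_mark=0.7):
    """Chance of reaching the pass mark by guessing every item."""
    p = 1 / n_choices                      # chance of guessing one item right
    need = math.ceil(pass_mark * n_items)  # items needed to pass
    return sum(
        math.comb(n_items, k) * p**k * (1 - p) ** (n_items - k)
        for k in range(need, n_items + 1)
    )

for choices in (2, 4):
    chance = prob_pass_by_guessing(10, choices)
    print(f"{choices}-option items: {chance:.1%} chance of passing by guessing")
```

With 10 two-option (True/False) items, pure guessing passes roughly 17% of the time; moving to four options per item drops that below 1%.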

Let’s take a look at another one:

You don’t have to be an expert on child-rearing to get the answer to this question. Choices a, c and d are ludicrous, and all that’s left is the correct choice.

This item is not fair to the stakeholders in the testing process because it gives away the answer and doesn’t differentiate between test-takers who have the required knowledge and those who don’t. Make sure all of the distractors are reasonable within the context of the question.

So what might be some plausible, yet still incorrect, distractors?


How about:

  • Engage your child in vigorous exercise to “wear them out”.
  • Raise your voice and reprimand your child if they get out of bed.
  • Discuss evacuation plans in case there is a fire or tornado during the night.

We’ll continue our review of poorly written items in the next post in this series. Until then, feel free to leave your comments below.

On-demand or on-premise: Which is better for talent management?

Posted by John Kleeman

Is it better to run assessment, learning and other talent management software on-demand, in the Cloud? Or is it wiser to run software on-premise, within your organization’s firewall?

I recently wrote about this on the SAP community and received a lot of feedback; I’d like to share the topic with readers of the Questionmark Blog.

In this post I will share 6 reasons why the Cloud is usually better. And in my next post I’ll give you 4 reasons why it may not be.

1. On-demand gives you access to innovation and use of mobile devices


A critical advantage of on-demand deployment, or software as a service (SaaS), is that you get the latest version of the software. Most providers upgrade all their customers to the latest version at the same time, and you get bug fixes, feature improvements, security fixes and innovation as part of the service. With on-premise, you are in control of when you install updates; but due to the resources an upgrade requires, it’s commonplace to upgrade only once every year or two, and therefore to be several versions behind an on-demand system. Support for the latest mobile devices is an obvious casualty.

To quote Ed Cohen of SuccessFactors:

“If you look at the rate of innovation that can occur with a SaaS product as against a company maintaining a behind the firewall instance of something, it becomes super important for learning and talent.”


2. Deployment is easier with on-demand and allows quick pilots

An on-premise system needs setup of servers and software installation. This takes planning, time and resources, whereas an on-demand system can usually be deployed within hours of ordering it. An on-demand system is also easier to scale up and expand. You can start small with one project and add users or departments as needed.

3. On-demand requires less corporate IT bandwidth

This is often the strongest reason to go on-demand in the learning and assessment space. Corporate IT departments are typically overloaded, and talent management software is not their top priority. This creates a bottleneck, which in turn delays deployment.

On-demand still needs the involvement of corporate IT, but you can usually make headway and provide improved functionality quicker than when deploying on-premise.

4. You don’t need to worry about scalability with on-demand

With an on-premise solution, you have to scale servers to cope with the busiest times (e.g. an end-of-year deadline, exam season or a compliance milestone). But if you use on-demand software, you delegate this to the Cloud provider, who will usually be able to expand to handle your highest load.

5. On-demand is easier to make secure

Both on-premise and on-demand can be very secure, but achieving a high level of security is expensive and involves constant vigilance. Unless you invest heavily in security, Cloud providers will usually provide higher security than a typical on-premise solution.

This point is well described by SAP’s Prashanth Padmanabhan in his blog article Why Do We Keep Our Valuables In A Bank Locker?. He states:

“… one of the SAP – SuccessFactors Hybrid customers announced publicly that their own security audit found that SuccessFactors cloud infrastructure was more secure than their own fire wall.”

And the respected UK Universities and Colleges Information Systems Association says in its Cloud briefing paper:

“In practice, data is probably more secure in cloud services than can be provided by in house solutions.”

6. On-demand is usually more reliable

Usually, provided your users have good Internet connectivity, an on-demand system will also be more reliable and have higher uptime.


Unless you invest heavily in your on-premise infrastructure, a professionally maintained on-demand server is likely to provide a higher level of 24/7 availability and uptime than a locally maintained system. A professional system is likely to have redundancy in every component and will not fail if a piece of hardware fails, whereas it may not be cost-effective to have such redundancy in an on-premise system. Redundancy makes sure, just like in a bridge over a river, that if one piece fails, the rest of the bridge survives.

In a follow-up post, I’ll explain some reasons why on-premise can be better.

Planning the Test – Test Design & Delivery Part 1

Posted By Doug Peterson

A lot more goes into planning a test than just writing a few questions. Reliability and validity should be established right from the start. An assessment’s results are considered reliable if they are dependable, repeatable, and consistent. The assessment is deemed to be valid if it measures the specific knowledge and skills that it is meant to measure. Take a look at the following graphic from the Questionmark white paper, Assessments through the Learning Process.

An assessment can be consistent, meaning that a participant will receive the same basic score over multiple deliveries of the assessment, and that participants with similar knowledge levels will receive similar scores, yet not be valid if it doesn’t measure what it’s supposed to measure (Figure 1). This assessment contains well-written questions, but the questions don’t actually measure the desired knowledge, skill or attitude. An example would be a geometry exam that contains questions about European history. They could be absolutely excellent questions, very well-written with a perfect level of difficulty … but they don’t measure the participant’s knowledge of geometry.

If an assessment is not reliable, it can’t be valid (Figure 2). If five participants with similar levels of knowledge receive five very different scores, the questions are poorly written and probably confusing or misleading. In this situation, there’s no way the assessment can be considered to be measuring what it’s supposed to be measuring.
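As a rough numeric illustration of that point (the scores below are hypothetical, and this is a red-flag check rather than a formal reliability analysis), compare the score spread among similar-ability participants on two forms of an assessment:

```python
# Score spread among five participants with *similar* knowledge levels
# (hypothetical percentage scores). A wide spread in this situation is
# a warning sign of unreliable, possibly confusing, questions.
import statistics

reliable_form = [78, 81, 80, 76, 79]    # consistent: tight spread
unreliable_form = [95, 42, 70, 88, 55]  # inconsistent: wide spread

for name, scores in (("reliable form", reliable_form),
                     ("unreliable form", unreliable_form)):
    print(f"{name}: mean={statistics.mean(scores):.0f}, "
          f"stdev={statistics.stdev(scores):.1f}")
```

The first form's scores cluster within a few points, as we would expect for equally knowledgeable participants; the second form's scatter suggests the questions are measuring something other than the intended knowledge.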

Figure 3 represents the goal of assessment writing – an assessment made up of well-written questions that deliver consistent scores AND accurately measure the knowledge they are meant to measure. In this situation, our geometry exam would contain well-written questions about geometry, and a participant who passes with flying colors would, indeed, possess a high level of knowledge about geometry.

For an assessment to be valid, the assessment designer needs to know not just the specific purpose of the assessment (e.g., geometry knowledge); they must also understand the target population of participants. Understanding the target population helps the designer ensure that the assessment assesses what it is supposed to assess and not extraneous information. Some things to take into account:

  • Job qualifications
  • Local laws/regulations
  • Company policies
  • Language localization
  • Reading level
  • Geographic dispersion
  • Comfort with technology

For example, let’s say you’re developing an assessment that will be used in several different countries. You don’t want to include American slang in a test being delivered in France; at that point you’re not measuring subject matter knowledge, you’re measuring knowledge of American slang. Another example would be if you were developing an assessment to be taken by employees whose positions only require minimal reading ability. Using “fancy words” and complicated sentence structure would not be appropriate; the test should be written at the level of the participants to ensure that their knowledge of the subject matter is being tested, and not their reading comprehension skills.

In my next installment, we’ll take a look at identifying content areas to be tested.