Is There Value in Reporting Subscores?

Posted by Austin Fossey

The decision to report subscores (reported as Topic Scores in Questionmark’s software) can be a difficult one, and test developers often need to respond to demands from stakeholders who want to extract as much information from an instrument as they can. High-stakes test development is lengthy and costly, and the instruments themselves collect a lot of data that can be valuable for instruction or business decisions. It makes sense that stakeholders want to get as much mileage as they can out of the instrument.

It can be anticlimactic when all of the development work results in just one score or a simple pass/fail decision. But that is, after all, what many instruments are designed to do. Many assessment models assume unidimensionality, so a single score or classification representing the participant’s ability is absolutely appropriate. Nevertheless, organizations often find themselves in the position of trying to wring out more information. What are my participants’ strengths and weaknesses? How effective were my instructors? There are many ways in which people will try to repurpose an assessment.

The question of whether or not to report subscores certainly falls under this category. Test blueprints often organize the instrument around content areas (e.g., Topics), and these lend themselves well to calculating subscores for each of the content areas. From a test user perspective, these scores are easy to interpret, and they are considered valuable because they show content areas where participants perform well or poorly, and because it is believed that this information can help inform instruction.

But how useful are these subscores? In their article, “A Simple Equation to Predict a Subscore’s Value,” Richard Feinberg and Howard Wainer explain that two criteria must be met to justify reporting a subscore:

  • The subscore must be reliable.
  • The subscore must contain information that is sufficiently different from the information that is contained by the assessment’s total score.

If a subscore (or any score) is not reliable, there is no value in reporting it. The subscore will lack precision, and any decisions made on an unreliable score might not be valid. There is also little value if the subscore does not provide any new information. If the subscores are effectively redundant to the total score, then there is no need to report them. The flip side of the problem is that if subscores do not correlate with the total score, then the assessment may not be unidimensional, and then it may not make sense to report the total score. These are the problems that test developers wrestle with when they lie awake at night.

Excerpt from Questionmark’s Test Analysis Report showing low reliability of three topic scores.

As you might have guessed from the title of their article, Feinberg and Wainer have proposed a simple, empirically based equation for determining whether or not a subscore should be reported. The equation yields a value that Sandip Sinharay and Shelby Haberman called the Value Added Ratio (VAR). If a subscore on an assessment has a VAR value greater than one, they suggest that reporting it is justified; subscores with VAR values less than one should not be reported. I encourage interested readers to check out Feinberg and Wainer’s article (which is less than two pages, so you can handle it) for the formula and step-by-step instructions for its application.
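To make the two criteria more concrete, here is a rough, illustrative sketch in Python. It is not Feinberg and Wainer’s equation; it simply estimates each topic’s reliability with Cronbach’s alpha and checks redundancy via the correlation between each subscore and the rest of the test. The data, topic split, and variable names are all hypothetical.

```python
# Illustrative sketch of the two criteria above; this is NOT Feinberg and
# Wainer's equation. Rows are participants, columns are items scored 0/1.
import numpy as np

def cronbach_alpha(item_scores: np.ndarray) -> float:
    """Internal-consistency reliability estimate for a set of items."""
    k = item_scores.shape[1]                        # number of items
    item_vars = item_scores.var(axis=0, ddof=1)     # per-item variances
    total_var = item_scores.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Hypothetical data: 200 participants, 20 items split across two topics.
# (Random data, so expect low values; a real response matrix goes here.)
rng = np.random.default_rng(42)
responses = (rng.random((200, 20)) > 0.4).astype(int)
topics = {"Topic A": responses[:, :10], "Topic B": responses[:, 10:]}
total = responses.sum(axis=1)

for name, items in topics.items():
    alpha = cronbach_alpha(items)        # criterion 1: is the subscore reliable?
    rest = total - items.sum(axis=1)     # total minus this topic's items,
                                         # to avoid part-whole inflation
    r = np.corrcoef(items.sum(axis=1), rest)[0, 1]  # criterion 2: redundancy
    print(f"{name}: alpha = {alpha:.2f}, correlation with rest = {r:.2f}")
```

A subscore with low alpha fails the first criterion; a subscore that correlates very highly with the rest of the test adds little information and fails the second.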

 

Intro to high-stakes assessment

Posted by Lance

Hello, and welcome to my first blog post for Questionmark. I joined Questionmark in May of 2014 but have just recently become Product Owner for Authoring. That means I oversee the tools we build to help people like you write their questions and assessments.

My professional background in assessment is mostly in the realm of high-stakes testing. That means I’ve worked with organizations that license, certify, or otherwise credential individuals. These include medical/nursing boards, driving standards organizations, software/hardware companies, financial services, and all sorts of sectors where determining competency is important.

With that in mind, I thought I’d kick off my blogging career at Questionmark with a series of posts on the topic of high-stakes assessment.

Now that I’ve riveted your attention with that awesome and not-at-all tedious opening, you’re naturally chomping at the bit to learn more, right?

Right?

Read on!

High-stakes assessment defined

I think of a high-stakes assessment as having the following traits:

It strongly influences or determines an individual’s ability to practice a profession

Organizations that administer high-stakes assessments operate along a continuum of influence. For example, certifications from SAP or other IT organizations are typically viewed as desirable by employers and may be used as a differentiator when hiring or setting compensation, but are not necessarily required for employment. At the other end of the continuum are organizations that actually determine a person’s ability to practice a profession: for example, you must be licensed by a state bar association to practice law in a US state. In between these extremes lie many shades of influence. The key concept here is that the influence is real, ranging from affecting hiring and promotion decisions to flat-out determining whether a person can be hired or continue to work in their chosen profession.

It awards credentials that belong to the individual

This is all about scope and ownership. These credentialing organizations almost always award a license/certification to the individual. If you get certified by SAP, that certification is yours even if your employer paid for it.

Speaking of scope, that certificate represents skills that are employer-neutral, and in the case of most IT certifications, the skills are generally unbounded by region as well. A certification acquired in the United States means the same thing in Canada, in Russia, in Mongolia, in Indonesia… you get the point.

Stan Lee

Okay, so these organizations influence who can work in professions and who can’t. Big whoop, right? Right! It really is a big whoop.

As Stan Lee has told us repeatedly, “Excelsior!”

Hmmm. That’s not the quote I wanted.

I meant, as Stan Lee has told us repeatedly, “With great power comes great responsibility!”*

And these orgs do have great power. They also, in many cases, have powerful members. For example, medical boards in the United States certify elite medical professionals. In all cases, these orgs are simultaneously making determinations about the public good and people’s livelihoods. As a result, they tend to take the process very seriously.

Ok… But what does it all mean?

Glad you asked. Stay tuned for my next post to find out.

Till then, Excelsior!

* So, it turns out that some guy named “Voltaire” said this first. But really, who’s had a bigger impact on the world? Voltaire – if that’s even his real name – or Stan Lee? :)

The key to reliability and validity is authoring

Posted by John Kleeman

In my earlier post I explained how reliability and validity are the keys to trustable assessment results. An assessment is reliable if it measures consistently, and valid if it measures what you need it to measure.

The key to validity and reliability starts with the authoring process. If you do not have a repeatable, defensible process for authoring questions and assessments, then however good the other parts of your process are, you will not have valid and reliable assessments.

The critical value that Questionmark brings is its structured authoring process, which enables effective planning, authoring, and reviewing of questions and assessments and makes them more likely to be valid.

Questionmark’s white paper “Assessment Results You Can Trust” suggests 18 key authoring measures for making trustable assessments – here are three of the most important.

Organize items in an item bank with topic structure

There are huge benefits to using an assessment management system with an item bank that structures items by hierarchical topics, as this facilitates:

  • An easy management view of all items and assessments under development
  • Mapping of topics to relevant organizational areas of importance
  • Clear references from items to topics
  • Use of the same item in multiple assessments
  • Simple addition of new items within a topic
  • Easy retiring of items when they are no longer needed
  • Version history maintained for legal defensibility
  • Search capabilities to identify questions that need updating when laws change or a product is retired

Some standalone e-Learning creation tools and some LMSs do not provide an item bank and require you to insert questions individually within an assessment. If you only have a handful of assessments or rarely need to update them, such systems can work, but anyone with more than a few assessments needs an item bank to build effective assessments.
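As a minimal sketch of what “an item bank with topic structure” can look like in data terms (illustrative only, and not Questionmark’s actual data model), the Python below supports hierarchical topics, item reuse, retirement, and search, several of the benefits listed above:

```python
# Minimal, hypothetical sketch of an item bank with hierarchical topics.
from dataclasses import dataclass, field

@dataclass
class Item:
    item_id: str
    stem: str
    topic_path: tuple[str, ...]          # e.g. ("Compliance", "Data Privacy")
    retired: bool = False                # retired items are kept, not deleted
    versions: list[str] = field(default_factory=list)  # past stems, for audit

@dataclass
class ItemBank:
    items: dict[str, Item] = field(default_factory=dict)

    def add(self, item: Item) -> None:
        self.items[item.item_id] = item

    def in_topic(self, *path: str) -> list[Item]:
        """All active items at or below a topic in the hierarchy."""
        return [i for i in self.items.values()
                if i.topic_path[:len(path)] == path and not i.retired]

    def search(self, text: str) -> list[Item]:
        """Find items that may need updating, e.g. when a law changes."""
        return [i for i in self.items.values() if text.lower() in i.stem.lower()]

bank = ItemBank()
bank.add(Item("Q1", "Which regulation governs EU data privacy?",
              ("Compliance", "Data Privacy")))
print([i.item_id for i in bank.in_topic("Compliance")])  # ['Q1']
```

The point of the structure is that one item can be referenced from many assessments, and a topic query or text search finds it wherever it is used.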

An authoring tool subject matter experts can use directly

One of the critical factors in making successful items is to get effective input from subject matter experts (SMEs), as they are usually more knowledgeable and better able to construct and review questions than learning technology specialists or general trainers.

If you can use a system like Questionmark Live to harvest or “crowdsource” items from SMEs and have learning or assessment specialists review them, your items will be of better quality.

Easy collaboration for item reviewers to help make items more valid

Items will be more valid if they have been properly reviewed. They will also be more defensible if the past changes are auditable. A track-changes capability, like that shown in the example screenshot below, is invaluable to aid the review process. It allows authors to see what changes are being proposed and to check they make sense.

Screenshot of track changes functionality in Questionmark Live

These three capabilities – an item bank, an authoring tool SMEs can use directly, and easy collaboration with “track changes” – are critical for obtaining reliable, valid, and therefore trustable assessments.

For more information on how to make trustable assessments, see our white paper “Assessment Results You Can Trust”.

Item Development – Summary and Conclusions

Austin Fossey-42Posted by Austin Fossey

This post concludes my series on item development in large-scale assessment. I’ve discussed some key processes in developing items, including drafting items, reviewing items, editing items, and conducting an item analysis. The goal of this process is to fine-tune a set of items so that test developers have an item pool from which they can build forms for scored assessment while being confident about the quality, reliability, and validity of the items. While the series covered a variety of topics, there are a couple of key themes that were relevant to almost every step.

First, documentation is critical, and even though it seems like extra work, it does pay off. Documenting your item development process helps keep things organized and helps you reproduce processes should you need to conduct development again. Documentation is also important for organization and accountability. As noted in the posts about content review and bias review, checklists can help ensure that committee members consider a minimal set of criteria for every item, but they also provide you with documentation of each committee member’s ratings should the item ever be challenged. All of this documentation can be thought of as validity evidence—it helps support your claims about the results and refute rebuttals about possible flaws in the assessment’s content.

The other key theme is the importance of recruiting qualified and representative subject matter experts (SMEs). SMEs should be qualified to participate in their assigned task, but diversity is also an important consideration. You may want to select item writers with a variety of experience levels, or content experts who have different backgrounds. Your bias review committee should be made up of experts who can help identify both content and response bias across the demographic areas that are pertinent to your population. Where possible, it is best to keep your SME groups independent so that you do not have the same people responsible for different parts of the development cycle. As always, be sure to document the relevant demographics and qualifications of your SMEs, even if you need to keep their identities anonymous.

This series is an introduction to organizing an item development cycle, but I encourage readers to refer to the resources mentioned in the articles for more information. This series also served as the basis for a session at the 2015 Questionmark Users Conference, which Questionmark customers can watch in the Premium section of the Learning Café.

You can link back to all of the posts in this series by clicking on the links below, and if you have any questions, please comment below!

Item Development – Managing the Process for Large-Scale Assessments

Item Development – Training Item Writers

Item Development – Five Tips for Organizing Your Drafting Process

Item Development – Benefits of editing items before the review process

Item Development – Organizing a content review committee (Part 1)

Item Development – Organizing a content review committee (Part 2)

Item Development – Organizing a bias review committee (Part 1)

Item Development – Organizing a bias review committee (Part 2)

Item Development – Conducting the final editorial review

Item Development – Planning your field test study

Item Development – Psychometric review

How much do you know about defensible assessments?

Posted by Julie Delazyn

This quiz is a re-post from a very popular blog entry published by John Kleeman.

Readers told us that it was instructive and engaging to take quizzes on using assessments, and we like to listen to you! So here is the second quiz in a previously published series of quizzes on assessment topics. This one was authored in conjunction with Neil Bachelor of Pure Questions. You can see the first quiz, on Cut Scores, here.

As always, we regard resources like this quiz as a way of contributing to the ongoing process of learning about assessment. In that spirit, please enjoy the quiz below and feel free to comment if you have any suggestions to improve the questions or the feedback.

Is a longer test likely to be more defensible than a shorter one? Take the quiz and find out. Be sure to look for your feedback after you have completed it!

NOTE: Some people commented on the first quiz that they were surprised to lose marks for getting questions wrong. This quiz uses True/False questions and it is easy to guess at answers, so we’ve set it to subtract a point for each question you get wrong, to illustrate that this is possible. Negative scoring like this encourages you to answer “Don’t Know” rather than guess; this is particularly helpful in diagnostic tests where you want participants to be as honest as possible about what they do or don’t think they know.
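For readers curious how such a rule works mechanically, here is a minimal sketch of negative scoring in Python (a hypothetical function and answer encoding, not the actual quiz configuration):

```python
# Illustrative negative-scoring rule for True/False questions, as described
# above: +1 for a correct answer, -1 for a wrong answer, 0 for "Don't Know".
def score_quiz(answers: list[str], key: list[str]) -> int:
    score = 0
    for given, correct in zip(answers, key):
        if given == "dont_know":
            continue                      # no penalty for admitting uncertainty
        score += 1 if given == correct else -1
    return score

# With random guessing on True/False items, the expected gain per guess is
# zero, so guessing no longer pays and "Don't Know" becomes the honest choice.
print(score_quiz(["true", "dont_know", "false"], ["true", "false", "false"]))  # 2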

Reliability and validity are the keys to trust


Not reliable

Posted by John Kleeman

How can you trust assessment results? The two keys are reliability and validity.

Reliability explained

An assessment is reliable if it measures the same thing consistently and reproducibly. If you were to deliver an assessment with high reliability to the same participant on two occasions, you would be very likely to reach the same conclusions about the participant’s knowledge or skills. A test with poor reliability might produce very different scores across the two instances. An unreliable assessment does not measure anything consistently and cannot be used as a trustable measure of competency.

It is useful to think visually of a dartboard: in the diagram to the right, darts have landed all over the board; they are not reliably in any one place.

For an assessment to be reliable, there needs to be a predictable authoring process, effective beta testing of items, trustworthy delivery to all the devices used to give the assessment, good-quality post-assessment reporting, and effective analytics.
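As a simple, hypothetical illustration of the “same participant on two occasions” idea, a basic test-retest reliability estimate is just the correlation between the two sets of scores:

```python
# Sketch: test-retest reliability as the correlation between scores from two
# administrations of the same assessment (made-up data for illustration).
import numpy as np

first_attempt  = np.array([78, 85, 62, 90, 71, 88, 55, 80])
second_attempt = np.array([80, 83, 65, 91, 69, 86, 58, 79])

reliability = np.corrcoef(first_attempt, second_attempt)[0, 1]
print(f"Test-retest reliability estimate: {reliability:.2f}")  # close to 1.0
```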

Validity explained


Reliable but not valid

Being reliable is not good enough on its own. The darts in the dartboard in the figure to the right are in the same place, but not in the right place. A test can be reliable but not measure what it is meant to measure. For example, you could have a reliable assessment that tested for skill in word processing, but this would not be valid if used to test machine operators, as writing is not one of the key tasks in their jobs.

An assessment is valid if it measures what it is supposed to measure. So if you are measuring competence in a job role, a valid assessment must align with the knowledge, skills and abilities required to perform the tasks expected of a job role. In order to show that an assessment is valid, there must be some formal analysis of the tasks in a job role and the assessment must be structured to match those tasks. A common method of performing such analysis is a job task analysis, which surveys subject matter experts or people in the job role to identify the importance of different tasks.

Assessments must be reliable AND valid

Trustable assessments must be reliable AND valid.


Reliable and valid

The darts in the figure to the right are in the same place and at the right place.

When you are constructing an assessment for competence, you are looking for it to consistently measure the competence required for the job.

Comparison with blood tests

It is helpful to consider what happens if you go to the doctor with an illness. The doctor goes through a process of discovery, analysis, diagnosis and prescription. As part of the discovery process, sometimes the doctor will order a blood test to identify if a particular condition is present, which can diagnose the illness or rule out a diagnosis.

It takes time and resources to do a blood test, but it can be an invaluable piece of information. A great deal of effort goes into making sure that blood tests are both reliable (consistent) and valid (measure what they are supposed to measure). For example, just like exam results, blood samples are labelled carefully, as shown in the picture, to ensure that patient identification is retained.

A blood test that was not reliable would be dangerous: a doctor might think that a disease is not present when it is. Furthermore, a reliable blood test used for the wrong purpose is not useful; for example, there is no point in testing blood glucose levels if the doctor is trying to see whether a heart attack is imminent.

The blood test results are a single piece of information that helps the doctor make the diagnosis in conjunction with other data from the doctor’s discovery process.

In exactly the same way, a test of competence is an important piece of information to determine if someone is competent in their job role.

Using the blood test metaphor, it is easy to understand the personnel and organizational risks that can result from making decisions based on untrustworthy results. If an organization assesses someone’s knowledge, skill or competence for health and safety or regulatory compliance purposes, it needs to ensure that the assessment is designed correctly and runs consistently; in other words, the assessment must be reliable and valid.

For assessments to be reliable and valid, you need to follow structured processes at each step, from planning through authoring to delivery and reporting. These processes are explained in our new white paper “Assessment Results You Can Trust”, and I’ll be sharing some of the content in future articles on this blog.

For fuller information, you can download the white paper here.
