Is There Value in Reporting Subscores?

Posted by Austin Fossey

The decision to report subscores (reported as Topic Scores in Questionmark’s software) can be a difficult one, and test developers often need to respond to demands from stakeholders who want to bleed as much information out of an instrument as they can. High-stakes test development is lengthy and costly, and the instruments themselves consume and collect a lot of data that can be valuable for instruction or business decisions. It makes sense that stakeholders want to get as much mileage as they can out of the instrument.

It can be anticlimactic when all of the development work results in just one score or a simple pass/fail decision. But that is, after all, what many instruments are designed to do. Many assessment models assume unidimensionality, so a single score or classification representing the participant’s ability is absolutely appropriate. Nevertheless, organizations often find themselves in the position of trying to wring out more information. What are my participants’ strengths and weaknesses? How effective were my instructors? There are many ways in which people will try to repurpose an assessment.

The question of whether or not to report subscores certainly falls under this category. Test blueprints often organize the instrument around content areas (e.g., Topics), and these lend themselves well to calculating subscores for each of the content areas. From a test user perspective, these scores are easy to interpret, and they are considered valuable because they show content areas where participants perform well or poorly, and because it is believed that this information can help inform instruction.

But how useful are these subscores? In their article, “A Simple Equation to Predict a Subscore’s Value,” Richard Feinberg and Howard Wainer explain that there are two criteria that must be met to justify reporting a subscore:

  • The subscore must be reliable.
  • The subscore must contain information that is sufficiently different from the information that is contained by the assessment’s total score.

If a subscore (or any score) is not reliable, there is no value in reporting it. The subscore will lack precision, and any decisions based on an unreliable score might not be valid. There is also little value if the subscore does not provide any new information. If the subscores are effectively redundant with the total score, then there is no need to report them. The flip side of the problem is that if subscores do not correlate with the total score, then the assessment may not be unidimensional, and then it may not make sense to report the total score. These are the problems that test developers wrestle with when they lie awake at night.

Excerpt from Questionmark’s Test Analysis Report showing low reliability of three topic scores.

As you might have guessed from the title of their article, Feinberg and Wainer have proposed a simple, empirically based equation for determining whether or not a subscore should be reported. The equation yields a value that Sandip Sinharay and Shelby Haberman called the Value Added Ratio (VAR). If a subscore on an assessment has a VAR value greater than one, then they suggest that this justifies reporting it. Subscores with VAR values less than one should not be reported. I encourage interested readers to check out Feinberg and Wainer’s article (which is less than two pages, so you can handle it) for the formula and step-by-step instructions for its application.
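For readers who want a feel for what a VAR value expresses, here is a minimal Python sketch. It does not reproduce Feinberg and Wainer’s equation, which is given in their article. Instead it computes the ratio directly, assuming VAR is defined as the PRMSE (proportional reduction in mean squared error) of the observed subscore divided by the PRMSE of the observed total score as a predictor of the true subscore, in the style of Haberman’s approach. Cronbach’s Alpha stands in for subscore reliability, which is also an assumption, and the function names and the tiny data set are made up for illustration.

```python
from statistics import mean, pvariance


def cronbach_alpha(rows):
    """Cronbach's Alpha for a list of participant rows of item scores."""
    k = len(rows[0])
    item_vars = [pvariance(col) for col in zip(*rows)]
    total_var = pvariance([sum(row) for row in rows])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)


def pcov(a, b):
    """Population covariance of two equal-length score lists."""
    ma, mb = mean(a), mean(b)
    return sum((x - ma) * (y - mb) for x, y in zip(a, b)) / len(a)


def value_added_ratio(rows, sub_idx):
    """Assumed definition: VAR = PRMSE(subscore) / PRMSE(total score)."""
    sub_rows = [[row[i] for i in sub_idx] for row in rows]
    s = [sum(r) for r in sub_rows]   # observed subscores
    x = [sum(r) for r in rows]       # observed total scores
    rel_s = cronbach_alpha(sub_rows)
    var_s, var_x = pvariance(s), pvariance(x)

    true_var_s = rel_s * var_s
    # The subscore's error is part of the total's error, so remove it
    # from the observed covariance to get cov(true subscore, total).
    cov_true = pcov(s, x) - (1 - rel_s) * var_s

    prmse_s = rel_s
    prmse_x = cov_true ** 2 / (true_var_s * var_x)
    return prmse_s / prmse_x


# Hypothetical data: items 0-1 form one topic, items 2-3 another,
# and the two topics are unrelated to each other.
responses = [
    [1, 1, 1, 1],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 0, 0],
]
print(value_added_ratio(responses, [0, 1]))
```

In this contrived data set the topic is perfectly consistent internally and unrelated to the rest of the test, so the ratio comes out above one. When the "subscore" is simply the whole test, the ratio is exactly one, which matches the intuition that a redundant subscore adds nothing.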


Conference Close-up: Perfecting the Test Through Question Analysis

Posted by Joan Phaup

Neelov Kar

Neelov Kar, Project Management Program Owner for Dell Services (previously Perot Systems), is getting ready to attend the Questionmark Users Conference in Miami this month. He will be delivering a case study about how he and his team have used statistical analysis to improve their test questions. I spent some time talking with Neelov the other day and wanted to share what I learned from him.

Q: Tell me a little about your company.

A: We are a one-stop shop for IT Services and have people working all over the world, in 183 countries.

Q: What does your work entail?

A: I’m the project management program owner, so I am in charge of all the project management courses we offer. I help identify which courses are appropriate for people to take, based on training needs analysis, and I work with our project management steering committee to work out what courses we need to develop. Then we prioritize the requirements, design and develop the courses, pilot them and finally implement them as a regular course. As a Learning and Development department we also look after leadership courses and go through a similar process for those. I moved into this role about a year ago. Prior to that I was leading the evaluation team, and it was during my time on that team that we began using Questionmark.

Q: How do you use online assessments?

A: We use Questionmark Perception for Level 2 assessment of our project management and leadership courses. We started with a hosted version of Questionmark Perception and it was I who actually internalized the tool. We offer leadership courses and project management courses internally within the organization across all geographies. Some of the project management courses already had tests, so we converted those to Questionmark.  We started designing the end-of-course assessments for our newly introduced leadership and project management courses once we started using Perception.

Q: What will you be talking about during your conference presentation?

A:  Last year we introduced a new course named P3MM Fundamentals, and because it was a new course we had to pilot the course with some of our senior members. In the pilot we asked the students to take the end-of-course test, and we found that many people had trouble passing the test. So we analyzed the results and refined the questions based on the responses. Analysis of results within Perception — particularly the Assessment Overview Report, Question Statistics Report, Test Analysis Report and Item Analysis Report — helped us in identifying the bad questions. We also saw that there were things we could do to improve the instruction within the course in order to better prepare people for the test. Using the Questionmark reports, we really perfected the test. This course has been going for over a year now, and it’s pretty stable. Now, every time we launch a course we do a pilot, administer the test and then use the Questionmark tools to analyze the questions to find out if we are doing justice to the people who are taking the test.

Q: What are you looking forward to at the conference?

A: I want to find out what Perception version 5 offers and how we can use it for our benefit.  Also, I saw that there are quite a few good papers to be presented, so I’m looking forward to attending those. And I want to get involved in the discussion about the future of SCORM.

Neelov’s is just one of 11 case studies to be presented at the conference, which will also include technical training, best practice presentations, peer discussions and more. Online registration for the conference ends on Tuesday, March 9th, so if you would like to attend, be sure to sign up soon!

Psychometrics 101: How do I know if an assessment is reliable? (Part 3)


Posted by Greg Pope

Following up on my posts last week on reliability, I thought I would finish up on this theme by explaining the internal consistency reliability measure: Cronbach’s Alpha.

Cronbach’s Alpha produces the same results as the Kuder-Richardson Formula 20 (KR-20) internal consistency reliability for dichotomously scored questions (right/wrong, 1/0), but Cronbach’s Alpha also allows for the analysis of polytomously scored questions (partial credit, 0 to 5). This is why Questionmark products (e.g., Test Analysis Report, RMS) use Cronbach’s Alpha rather than KR-20.
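To make that equivalence concrete, here is a minimal Python sketch using the standard textbook formulas. It is not how any Questionmark report is implemented; the function names and the small response matrices are made up for illustration. It assumes population variances throughout, which is what makes the two formulas agree for 1/0 items.

```python
from statistics import pvariance


def cronbach_alpha(rows):
    """Alpha = k/(k-1) * (1 - sum of item variances / total-score variance)."""
    k = len(rows[0])
    item_vars = [pvariance(col) for col in zip(*rows)]
    total_var = pvariance([sum(row) for row in rows])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)


def kr20(rows):
    """KR-20 = k/(k-1) * (1 - sum of p*q / total-score variance), 1/0 items only."""
    k, n = len(rows[0]), len(rows)
    p_values = [sum(col) / n for col in zip(*rows)]
    sum_pq = sum(p * (1 - p) for p in p_values)
    total_var = pvariance([sum(row) for row in rows])
    return k / (k - 1) * (1 - sum_pq / total_var)


# Hypothetical right/wrong responses: one row per participant, one column per question.
dichotomous = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
    [1, 1, 0, 1],
]
print(cronbach_alpha(dichotomous), kr20(dichotomous))  # the two values agree

# Hypothetical partial-credit responses (0 to 5): Alpha still applies, KR-20 does not.
polytomous = [
    [5, 4, 5],
    [3, 3, 4],
    [2, 1, 2],
    [4, 5, 3],
    [0, 1, 1],
]
print(cronbach_alpha(polytomous))
```

The two functions differ only in how the item variance is obtained: for a 1/0 item, the population variance is exactly p × q, so the formulas coincide.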

People sometimes ask me about KR-21. This is a quick and dirty reliability estimate formula that never produces higher values than KR-20 and almost always produces lower ones. KR-21 assumes that all questions have equal difficulty (p-value) to make hand calculations easier. This assumption of all questions having the same difficulty is usually not very close to reality, where questions on an assessment generally have a range of difficulty. This is why few people in the industry use KR-21 over KR-20 or Cronbach’s Alpha.
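The effect of the equal-difficulty shortcut is easy to see in a small Python sketch. This uses the standard textbook forms of KR-20 and KR-21 with population variances; the function names and sample data are hypothetical. KR-21 needs only the number of questions, the mean total score and the variance of total scores, which is why it is convenient for hand calculation.

```python
from statistics import mean, pvariance


def kr20(rows):
    """KR-20 uses each question's own difficulty (p-value)."""
    k, n = len(rows[0]), len(rows)
    p_values = [sum(col) / n for col in zip(*rows)]
    sum_pq = sum(p * (1 - p) for p in p_values)
    total_var = pvariance([sum(row) for row in rows])
    return k / (k - 1) * (1 - sum_pq / total_var)


def kr21(rows):
    """KR-21 replaces the per-question p-values with the average difficulty."""
    k = len(rows[0])
    totals = [sum(row) for row in rows]
    m = mean(totals)
    total_var = pvariance(totals)
    return k / (k - 1) * (1 - m * (k - m) / (k * total_var))


# Hypothetical 1/0 responses with a range of question difficulty.
varied = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
    [1, 1, 0, 1],
]
print(kr20(varied), kr21(varied))  # KR-21 comes out lower here

# Hypothetical responses where every question has the same p-value (0.5).
equal = [
    [1, 0],
    [0, 1],
    [1, 1],
    [0, 0],
]
print(kr20(equal), kr21(equal))  # identical when difficulties are equal
```

The more the question difficulties spread out, the further KR-21 falls below KR-20.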

My colleagues and I generally consider Cronbach’s Alpha values of 0.90 or greater to be excellent and acceptable for high-stakes tests, while values of 0.70 to 0.90 are considered acceptable/good and appropriate for medium-stakes tests. Values below 0.5 are generally considered unacceptable. With this said, in low-stakes testing situations it may not be possible to obtain high internal consistency reliability coefficient values. In this context one might be better off evaluating the performance of an assessment on an item-by-item basis rather than focusing on the overall assessment reliability value.
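As a rough illustration, these rules of thumb could be encoded in a small Python helper. The function name and the wording of the labels are made up, and the handling of the boundaries (for example, treating exactly 0.90 as excellent) is an assumption. Values from 0.5 up to 0.7 are not covered by the guidelines above, so the sketch reports them as such rather than guessing.

```python
def describe_alpha(alpha):
    """Map a Cronbach's Alpha value to the rule-of-thumb bands described above."""
    if alpha >= 0.90:
        return "excellent: acceptable for high-stakes tests"
    if alpha >= 0.70:
        return "acceptable/good: appropriate for medium-stakes tests"
    if alpha < 0.50:
        return "unacceptable"
    # Assumption: the guidelines give no label for this range.
    return "no guideline given for values from 0.5 up to 0.7"


for value in (0.93, 0.78, 0.62, 0.41):
    print(value, "->", describe_alpha(value))
```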