Is There Value in Reporting Subscores?

Posted by Austin Fossey

The decision to report subscores (reported as Topic Scores in Questionmark's software) can be a difficult one, and test developers often need to respond to demands from stakeholders who want to squeeze as much information out of an instrument as they can. High-stakes test development is lengthy and costly, and the instruments themselves collect a lot of data that can be valuable for instruction or business decisions. It makes sense that stakeholders want to get as much mileage as they can out of the instrument.

It can be anticlimactic when all of the development work results in just one score or a simple pass/fail decision. But that is, after all, what many instruments are designed to do. Many assessment models assume unidimensionality, so a single score or classification representing the participant's ability is absolutely appropriate. Nevertheless, organizations often find themselves in the position of trying to wring out more information. What are my participants' strengths and weaknesses? How effective were my instructors? There are many ways in which people will try to repurpose an assessment.

The question of whether or not to report subscores certainly falls into this category. Test blueprints often organize the instrument around content areas (e.g., Topics), and these lend themselves well to calculating subscores for each of the content areas. From a test user perspective, these scores are easy to interpret, and they are considered valuable because they show content areas where participants perform well or poorly, and because it is believed that this information can help inform instruction.

But how useful are these subscores? In their article, "A Simple Equation to Predict a Subscore's Value," Richard Feinberg and Howard Wainer explain that there are two criteria that must be met to justify reporting a subscore:

  • The subscore must be reliable.
  • The subscore must contain information that is sufficiently different from the information that is contained by the assessment’s total score.

If a subscore (or any score) is not reliable, there is no value in reporting it. The subscore will lack precision, and any decisions made on an unreliable score might not be valid. There is also little value if the subscore does not provide any new information. If the subscores are effectively redundant with the total score, then there is no need to report them. The flip side of the problem is that if the subscores do not correlate with the total score, then the assessment may not be unidimensional, and it may not make sense to report the total score. These are the problems that test developers wrestle with when they lie awake at night.
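As a rough illustration of these two checks, here is a minimal sketch in Python. It assumes a hypothetical participants-by-items matrix of dichotomous (0/1) item scores and hypothetical topic assignments, estimates the topic's reliability with Cronbach's alpha, and estimates the disattenuated correlation between the subscore and the remainder score (the total minus the topic's items). A very low reliability fails the first criterion; a disattenuated correlation close to one suggests the subscore is largely redundant with the total score.

```python
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """Cronbach's alpha for an (n_participants x n_items) score matrix."""
    item_vars = items.var(axis=0, ddof=1)
    total_var = items.sum(axis=1).var(ddof=1)
    k = items.shape[1]
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

def subscore_checks(responses: np.ndarray, topic_cols: list[int]) -> dict:
    """Reliability of a topic subscore and its disattenuated correlation
    with the remainder score (all other items), matching the two criteria."""
    topic = responses[:, topic_cols]
    rest = np.delete(responses, topic_cols, axis=1)
    alpha_sub = cronbach_alpha(topic)
    alpha_rem = cronbach_alpha(rest)
    r_obs = np.corrcoef(topic.sum(axis=1), rest.sum(axis=1))[0, 1]
    # Correct the observed correlation for unreliability in both scores.
    r_true = r_obs / np.sqrt(alpha_sub * alpha_rem)
    return {"alpha_sub": alpha_sub, "disattenuated_r": r_true}

# Hypothetical example: 200 participants, 40 dichotomous items driven by one ability factor;
# items 0-9 form one topic.
rng = np.random.default_rng(0)
ability = rng.normal(size=(200, 1))
difficulty = rng.normal(scale=0.5, size=(1, 40))
prob = 1.0 / (1.0 + np.exp(-(ability - difficulty)))
data = (rng.random((200, 40)) < prob).astype(int)
print(subscore_checks(data, topic_cols=list(range(10))))
```

This is only a screening heuristic, not Feinberg and Wainer's equation; their article gives the actual formula and worked steps.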

[Figure: Excerpt from Questionmark's Test Analysis Report showing low reliability of three topic scores.]

As you might have guessed from the title of their article, Feinberg and Wainer have proposed a simple, empirically based equation for determining whether or not a subscore should be reported. The equation yields a value that Sandip Sinharay and Shelby Haberman called the Value-Added Ratio (VAR). If a subscore on an assessment has a VAR value greater than one, they suggest that reporting it is justified; subscores with VAR values less than one should not be reported. I encourage interested readers to check out Feinberg and Wainer's article (which is less than two pages, so you can handle it) for the formula and step-by-step instructions for its application.
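For illustration only, applying the reporting rule once VAR values are in hand is straightforward; the topic names and VAR values below are hypothetical, not computed with Feinberg and Wainer's formula.

```python
# Hypothetical VAR values for four topic scores; only subscores with VAR > 1 are reported.
var_by_topic = {"Topic A": 1.31, "Topic B": 0.87, "Topic C": 1.05, "Topic D": 0.62}
reportable = [topic for topic, var in var_by_topic.items() if var > 1]
print(reportable)  # ['Topic A', 'Topic C']
```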


4 Responses to “Is There Value in Reporting Subscores?”

  1. Onno Tomson says:

    Austin,
    Thanks for sharing this … very interesting!
    A topic could indeed stand for a content area, but content areas can also be linked to metatag values assigned to questions. This approach sometimes results in an interesting new view of your item bank and/or a much more efficient item bank.
    A topic is also metadata linked to a question.
    So I would like to expand "reporting on the topic level" to "reporting on the metadata level."
    Is this something QM will facilitate in the near future?

  2. Gail Watson says:

    Good morning, Austin! All of our assessments (high stakes, required for promotion) are based on 2-4 sub-topics, and in our assessment feedback I have included scores on each sub-topic. This is to assist someone who fails: they can see their weakest topics and focus their study for the next attempt.

    I do a sub-topic report for SMEs quarterly. Since the numbers are in the feedback, I can easily extract them and put them into a pivot table to display averages. One thing I have emphasized is something I do not see in your blog: I display the assessment average next to each of the sub-topic averages, and I treat it as a red flag if any sub-topic average is significantly lower than the assessment average. Is this approach flawed?

    We take a detailed look at assessment averages in another report I do, by the way. Thus, if the assessment average itself is problematic, that is a separate issue.

  3. Austin Fossey says:

    Hi Onno,

    Thank you for your kind response! Your point about the value of adding metadata to item banks is well taken, though one might argue that there is no difference between topic data and metatags: both are metadata about the item. From my point of view, the question then is not whether it is appropriate to report on metadata, but whether it is appropriate to score on metadata.

    In their chapter in the Handbook of Test Development, Raymond and Neustel addressed this question in their discussion of different types of test specifications (a.k.a. blueprints), noting that some specifications may have multiple dimensions, such as classifying items by process (e.g., cognitive levels) and by content (e.g., topics). While scoring could certainly be done for any dimension, the test developer must ultimately decide whether doing so is useful. Raymond and Neustel observed that test users often prefer that scores be reported along content dimensions, as this is easier to interpret. This is the blueprint structure implied in Questionmark's software.

    Of course, metadata can take any form that the test developer wants. There are examples in K12 language assessment where items are cross-classified along two content dimensions, and the test designers have chosen to score and report on both content dimensions. A similar scoring model was implemented by one of Questionmark's clients through a custom reporting system they developed.

    There are other outlier use cases too. Some higher education institutions attach metatag data to assessment content and use these data for institutional research and accreditation reporting. Though this can be thought of as a form of scoring, it does not relate to the assessment construct or to inferences about the participant, so the same considerations may not apply.

    In terms of simply exporting or reporting metadata related to content (for scoring purposes or otherwise), this is not likely something that will be available in Questionmark’s OnDemand software in the short term, though we are looking into the possibility of exposing these data through an API in the future, and we have already added some backend infrastructure to make this possible. If we develop this capability, we will announce the project in our Quarterly Online Product Briefings.

    Sincerely,

    Austin Fossey
    Reporting and Analytics Manager
    Questionmark

  4. Austin Fossey says:

    Hi Gail,

    The logic of your approach is sound, and I can think of several assessment programs that have implemented similar reporting rules. The major obstacle is how to determine whether a sample's average subscore is significantly lower than the average total score. While some people may think a t-test would work, it is not statistically sound because the subscores are a subset of the total scores; i.e., the total scores are partly made up of the subscores. (I made the mistake of proposing the same comparison at a previous job and was swiftly and sternly corrected by an angry mob of psychometricians.)

    The way I have seen other organizations approach this question is to determine whether a mean subscore is higher or lower than expected, given the sample's performance on the rest of the assessment. This can be done with a t-test, and you do not have the same data dependencies: you are essentially comparing one set of items on the assessment to the rest of the items, with no overlap. The examples I have seen were done with IRT, which is preferable because it allows the researcher to account for differences in difficulty between content areas, so keep this in mind if you use classical test theory for this comparison.
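    As a rough classical test theory sketch of that comparison (not the IRT analyses I mentioned, and with entirely hypothetical data and topic assignments), one could run a paired t-test on each participant's proportion-correct for the topic's items versus the remaining items:

```python
import numpy as np
from scipy import stats

def topic_vs_rest_ttest(responses: np.ndarray, topic_cols: list[int]):
    """Paired t-test of per-participant proportion-correct on a topic's items
    versus proportion-correct on all remaining items (no overlapping items)."""
    topic = responses[:, topic_cols]
    rest = np.delete(responses, topic_cols, axis=1)
    return stats.ttest_rel(topic.mean(axis=1), rest.mean(axis=1))

# Hypothetical data: 150 participants, 30 dichotomous items, items 0-7 form one topic.
rng = np.random.default_rng(1)
data = (rng.random((150, 30)) < 0.7).astype(int)
result = topic_vs_rest_ttest(data, topic_cols=list(range(8)))
print(result.statistic, result.pvalue)
```

    Note that the pairing handles the dependency within participants, but, as I said above, it does not adjust for the topic's items simply being harder or easier than the rest of the form.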

    The example above is useful when you are working with a sample of students, but I have also seen use cases where researchers want to identify statistical strengths and weaknesses for individual students. I worked on a K12 assessment where this was accomplished with similar logic: did the student perform higher or lower than expected on a topic area, given his or her performance on the other topic areas? This was a statewide assessment, so we had the population's scores. We conducted regression analyses, predicting each topic's subscore from the other topics' subscores. If a student's observed subscore in a topic was significantly higher or lower than predicted by the regression analysis, it was classified as a relative strength or weakness, respectively. There were nice benefits to this approach, such as the fact that a low-scoring student could still have a relative strength and a high-scoring student could still find relative weaknesses to work on, but as you might expect, most students did not have a relative strength or weakness because their subscores were so closely aligned.
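    Here is a minimal sketch of that regression logic, with hypothetical subscores and an assumed flagging threshold of about two standard deviations on the residuals (the actual program's scoring rules were more involved):

```python
import numpy as np

def relative_strengths_weaknesses(subscores: np.ndarray, z_cut: float = 2.0) -> np.ndarray:
    """For each topic (column), predict its subscore from the other topics' subscores
    with ordinary least squares, then flag standardized residuals beyond +/- z_cut.
    Returns +1 (relative strength), -1 (relative weakness), or 0 for each student x topic."""
    n_students, n_topics = subscores.shape
    flags = np.zeros((n_students, n_topics), dtype=int)
    for t in range(n_topics):
        y = subscores[:, t]
        X = np.column_stack([np.ones(n_students), np.delete(subscores, t, axis=1)])
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        residuals = y - X @ beta
        z = residuals / residuals.std(ddof=X.shape[1])
        flags[:, t] = np.where(z > z_cut, 1, np.where(z < -z_cut, -1, 0))
    return flags

# Hypothetical example: 500 students, four topic subscores on a 0-20 scale.
rng = np.random.default_rng(2)
ability = rng.normal(10, 3, size=(500, 1))
subs = np.clip(ability + rng.normal(0, 2, size=(500, 4)), 0, 20)
flags = relative_strengths_weaknesses(subs)
print((flags == 1).sum(), (flags == -1).sum())  # counts of flagged strengths and weaknesses
```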

    Both of these examples address how one might compare subscores for the purpose of identifying relative strengths and weaknesses; however, I think Feinberg and Wainer’s point takes precedence, in that there may be no benefit to making these comparisons at a subscore level if the subscores themselves are not meaningful. If the subscores are redundant to the total score, then comparing topic scores for strengths and weaknesses might be just as fruitful as comparing the scores for any random subset of items on the assessment.

    This is just a statistical perspective though. In the field, there are certainly other forces at work. It could very well be that reporting a low subscore to a student will help him or her get motivated to study up on that area. Though the subscore itself may not have any information not included in the total score, that sense of having a focused area of improvement might help a student feel empowered—“If I just work on this one area, I will pass!” This may just be a placebo effect, but if students show progress, then it is not something to be dismissed.

    Sincerely,

    Austin Fossey
    Reporting and Analytics Manager
    Questionmark
