Posted by Austin Fossey
I had the privilege of meeting with an organization that is reporting subscores to show how their employees are improving across multiple areas of their domain, as determined by an assessment given before and after training. They have developed some slick reports to show these scores, including the participant’s first score, second score (after training is complete), and the change in those scores.
At first glance, these reports are pretty snazzy and seem to suggest huge improvements resulting from the training, but looks can be deceiving. I immediately noticed one participant had made a subscore gain of 25%, which sounds impressive—like he or she is suddenly 25% better at the tasks in that domain—but here is the fine print: that subscore was measured with only four items. To put it another way, that 25% improvement means that the participant answered one more item correctly. Other subscores were similarly underrepresented—most with four or fewer items in their topic.
In a previous post, I reported on an article by Richard Feinberg and Howard Wainer about how to determine if a subscore is worth reporting. My two loyal readers (you know who you are) may recall that a reported subscore has to be reliable, and it must contain information that is sufficiently different from the information contained in the assessment’s total score (AKA “orthogonality”).
In an article titled Comments on “A Note on Subscores” by Samuel A. Livingston, Sandip Sinharay and Shelby Haberman defended against a critique that their previous work (which informed Feinberg and Wainer’s proposed Value Added Ratio (VAR) metric) indicated that subscores should never be reported when examining changes across administrations. Sinharay and Haberman explained that in these cases, one should examine the suitability of the change scores, not the subscores themselves. One may then find that the change scores are suitable for reporting.
A change score is the difference in scores from one administration to the next. If a participant gets a subscore of 12 on their first assessment and a subscore of 30 on their next assessment, their change score for that topic is 18. This can then be thought of as the subscore of interest, and one can then evaluate whether or not this change score is suitable for reporting.
Change scores are also used to determine if a change in scores is statistically significant for a group of participants. If we want to know whether a group of participants is performing statistically better on an assessment after completing training (at a total score or subscore level), we do not compare average scores on the two tests. Instead, we look to see if the group’s change scores across the two tests are significantly greater than zero. This is typically analyzed with a dependent samples t-test.
The reliability, orthogonality, and significance of changes in subscores are statistical concerns, but scores must be interpretable and actionable to make a claim about the validity of the assessment. This raises the concern of domain representation. Even if the statistics are fine, a subscore cannot be meaningful if the items do not sufficiently represent the domain they are supposed to measure. Making an inference about a participant’s ability in a topic based on only four items is preposterous—you do not need to know anything about statistics to come to that conclusion.
To address the concern of domain representation, high-stakes assessment programs that report subscores will typically set a minimum for the number of items that are needed to sufficiently represent a topic before a subscore is reported. For example, one program I worked for required (perhaps somewhat arbitrarily) a minimum of eight items in a topic before generating a subscore. If this domain representation criterion is met, one can presumably use methods like the VAR to then determine if the subscores meet the statistical criteria for reporting.