An argument against using negative item scores in CTT

Last year, a client asked for my opinion about whether or not to use negative scores on test items. For example, if a participant answers an item correctly, they would get one point, but if they answer the item incorrectly, they would lose one point. This means the item would be scored dichotomously as [-1,1] instead of the more traditional [0,1].

I believe that negative item scores are really useful if the goal is to confuse and mislead participants. They are not appropriate for most classical test theory (CTT) assessment designs, because they do not add measurement value, and they are difficult to interpret.

Interested in learning more about classical test theory and applying item analysis concepts? Join Psychometrician Austin Fossey for a free 75 minute online workshop — Item Analysis: Concepts and Practice — Tuesday, June 23, 2015  *space is limited

Measurement value of negative item scores

Changing the item scoring format from [0,1] to [-1,1] does not change anything about your ability to measure participants—after all, the dichotomous scores are just symbols. You are simply using a different total score scale.

Consider a 60-item assessment made up of dichotomously scored items. If the items are scored [0,1], the total score scale ranges from 0 to 60 points. If scored [-1,1], the score range doubles, now ranging from -60 to 60 points.
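To see why nothing substantive changes, note that a participant who answers c items correctly scores c points under [0,1] but c - (60 - c) = 2c - 60 points under [-1,1]: a simple linear transformation of the number correct. A toy check in Python (the function name is mine, just for illustration):

```python
def total_score(correct, n_items=60, low=0, high=1):
    """Total test score when each item is scored dichotomously [low, high]."""
    wrong = n_items - correct
    return correct * high + wrong * low

# The [-1,1] total is a linear transformation (2c - 60) of the number correct c:
for c in (0, 30, 40, 60):
    assert total_score(c, low=-1, high=1) == 2 * total_score(c, low=0, high=1) - 60

print(total_score(40, low=-1, high=1))  # 40 right, 20 wrong -> 40 - 20 = 20
```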

From a statistical standpoint, nothing has changed. The item-total discrimination statistics will be the same under both designs, as will the assessment’s reliability. The standard error of measurement will double, but that is to be expected because the score range has doubled. Thus there is no change in the precision of scores or in misclassification rates. How you score the items does not matter, as long as every item is scored dichotomously on the same two-point scale.
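This invariance is easy to verify empirically. The sketch below (a hypothetical simulation of my own, not the WinGen data shown later) rescores the same simulated responses from [0,1] to [-1,1] and checks that Cronbach's alpha is unchanged while the standard error of measurement doubles:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical simulation: 1,000 participants, 60 dichotomous items.
ability = rng.normal(0, 1, size=(1000, 1))
difficulty = rng.normal(0, 1, size=(1, 60))
p_correct = 1 / (1 + np.exp(-(ability - difficulty)))
scored_01 = (rng.random((1000, 60)) < p_correct).astype(float)  # [0,1] scoring
scored_neg = 2 * scored_01 - 1                                  # [-1,1] scoring

def cronbach_alpha(items):
    """Internal-consistency reliability for an (n_participants, n_items) matrix."""
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1).sum()
    total_var = items.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - item_vars / total_var)

alpha_01 = cronbach_alpha(scored_01)
alpha_neg = cronbach_alpha(scored_neg)

# SEM = SD of total scores * sqrt(1 - reliability)
sem_01 = scored_01.sum(axis=1).std(ddof=1) * np.sqrt(1 - alpha_01)
sem_neg = scored_neg.sum(axis=1).std(ddof=1) * np.sqrt(1 - alpha_neg)

print(np.isclose(alpha_01, alpha_neg))   # True: reliability is unchanged
print(np.isclose(sem_neg, 2 * sem_01))   # True: the SEM exactly doubles
```

Because [-1,1] scoring is just the linear transformation 2x - 1 applied to every [0,1] item score, variance-based statistics are either unchanged (correlations, alpha) or scale with the score range (standard deviation, SEM).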

The figure below illustrates the score distributions for 1,000 normally distributed assessment scores that were simulated using WinGen. This sample’s item responses were scored with three different models: [-1,1], [0,1], and [0,2]. While this shifts and stretches the distribution of scores onto different scales, there is no change in reliability or in the standard error of measurement (as a percentage of the score range).

Distribution and assessment statistics for 1,000 simulated test scores with items dichotomously scored three ways: [-1,1], [0,1], and [0,2]

Interpretation issues of negative item scores

If the item scores do not make a difference statistically, and they are just symbols, then why not use negative scores? Remember that an item is a mechanism for collecting and quantifying evidence to support the student model, so how we score our items (and the assessment as a whole) plays a big role in how people interpret the participant’s performance.

Consider an item scored [0,1]. In a CTT model, a score of 1 represents accumulated evidence about the presence or magnitude of a construct, whereas a score of 0 suggests that no evidence was found in the response to this item.

Now suppose we took the same item and scored it [-1,1]. A score of 1 still suggests accumulated evidence, but now we are also changing the total score based on wrong answers. The interpretation is that we have collected evidence about the absence of the construct. To put it another way, the test designer is claiming to have positive evidence that the participant does not know something.

This is not an easy claim to make. In psychometrics, we can attempt to measure the presence of a hypothetical construct, but it is difficult to make a claim that a construct is not there. We can only make inferences about what we observe, and I argue that it is very difficult to build an evidentiary model for someone not knowing something.

Furthermore, negative scores negate evidence we have collected in other items. If a participant gets one item right and earns a point but then loses that point on the next item, we have essentially canceled out the information about the participant from a total score perspective. By using negative scores in a CTT model, we also introduce the possibility that someone can get a negative score on the whole test, but what would a negative score mean? This lack of interpretability is one major reason people do not use negative scores.

Consider a participant who answers 40 items correctly on the 60-item assessment I mentioned earlier. When scored [0,1], the raw score (40 points) corresponds to the number of correct responses provided by the participant. This scale is useful for calculating percentage scores (40/60 = 67% correct), setting cut scores, and supporting the interpretation of the participant’s performance.

When the same items are scored [-1,1], the participant’s score is more difficult to interpret. The participant answered 40 questions correctly and 20 incorrectly, earning 40 points and losing 20, for a raw score of only 20. The maximum score on the assessment is still 60 points, yet the raw score of 20 corresponds to a correct response rate of 67%, not 33%, because 20 points sits 67% of the way along the range from -60 to 60 points.
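Recovering the familiar percent-correct interpretation from a [-1,1] raw score requires rescaling against the full score range. A small helper (hypothetical, for illustration only) makes the conversion explicit:

```python
def percent_correct(raw_score, n_items, low, high):
    """Map a raw score on an n_items test scored [low, high] to percent correct."""
    min_score = n_items * low
    max_score = n_items * high
    return (raw_score - min_score) / (max_score - min_score)

# The same performance (40 of 60 items correct) under the two scoring models:
print(percent_correct(40, 60, 0, 1))    # [0,1] scoring: 40/60, about 0.67
print(percent_correct(20, 60, -1, 1))   # [-1,1] scoring: (20 + 60)/120, about 0.67
```

Both calls return the same proportion, but under [-1,1] scoring the participant (or a stakeholder reading the score report) has to do this rescaling mentally, which is exactly the interpretability burden the post describes.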

There are times when items need to be scored differently from other items on the assessment. Polytomous items clearly need different scoring models (though similar interpretive arguments could be leveled against people who try to score items in fractions of points), and there are times when an item may need to be weighted differently from other items. (We’ll discuss that in my next post.) Some item response theory (IRT) assessments, like the SAT, use negative points to correct for guessing, but this should only be done if you can demonstrate improved model fit and you have a theory and evidence to justify doing so. In general, when using CTT, negative item scores only serve to muddy the water.
