An argument against using negative item scores in CTT

Posted by Austin Fossey

Last year, a client asked for my opinion about whether or not to use negative scores on test items. For example, if a participant answers an item correctly, they would get one point, but if they answer the item incorrectly, they would lose one point. This means the item would be scored dichotomously [-1,1] instead of in the more traditional way [0,1].

I believe that negative item scores are really useful if the goal is to confuse and mislead participants. They are not appropriate for most classical test theory (CTT) assessment designs, because they do not add measurement value, and they are difficult to interpret.

Interested in learning more about classical test theory and applying item analysis concepts? Join Psychometrician Austin Fossey for a free 75 minute online workshop — Item Analysis: Concepts and Practice — Tuesday, June 23, 2015  *space is limited

Measurement value of negative item scores

Changing the item scoring format from [0,1] to [-1,1] does not change anything about your ability to measure participants—after all, the dichotomous scores are just symbols. You are simply using a different total score scale.

Consider a 60-item assessment made up of dichotomously scored items. If the items are scored [0,1], the total score scale ranges from 0 to 60 points. If scored [-1,1], the score range doubles, now ranging from -60 to 60 points.

From a statistical standpoint, nothing has changed. The item-total discrimination statistics will be the same under both designs, as will the assessment’s reliability. The standard error of measurement will double, but that is to be expected because the score range has doubled. Thus there is no change in the precision of scores or misclassification rates. How you score the items does not matter as long as they are scored dichotomously on the same scale.
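This invariance is easy to verify directly. The sketch below (a minimal simulation, not tied to any particular product or to WinGen) simulates dichotomous responses, scores them both ways, and checks that Cronbach's alpha is identical while the standard error of measurement simply doubles with the score range:

```python
import math
import random
import statistics

random.seed(1)
n_items, n_people = 60, 1000

# Simulate dichotomous item responses with a simple ability model.
responses = []
for _ in range(n_people):
    ability = random.gauss(0, 1)
    p_correct = 1 / (1 + math.exp(-ability))
    responses.append([1 if random.random() < p_correct else 0
                      for _ in range(n_items)])

def totals(resp, wrong, right):
    """Total scores when items are scored [wrong, right]."""
    return [sum(right if x else wrong for x in row) for row in resp]

def alpha(resp, wrong, right):
    """Cronbach's alpha: k/(k-1) * (1 - sum of item variances / total variance)."""
    k = len(resp[0])
    item_vars = sum(
        statistics.pvariance([right if row[j] else wrong for row in resp])
        for j in range(k))
    total_var = statistics.pvariance(totals(resp, wrong, right))
    return k / (k - 1) * (1 - item_vars / total_var)

a_01 = alpha(responses, 0, 1)    # items scored [0,1]
a_neg = alpha(responses, -1, 1)  # items scored [-1,1]

# Reliability is unchanged; the SEM doubles only because the SD doubles.
sem_01 = statistics.pstdev(totals(responses, 0, 1)) * (1 - a_01) ** 0.5
sem_neg = statistics.pstdev(totals(responses, -1, 1)) * (1 - a_neg) ** 0.5
print(abs(a_01 - a_neg) < 1e-9, abs(sem_neg - 2 * sem_01) < 1e-9)  # True True
```

Rescoring each item [0,1] → [-1,1] is the linear map 2x − 1, so every item variance (and the total variance) is multiplied by 4; the ratio inside alpha is untouched, and the SEM inherits the doubled standard deviation.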

The figure below illustrates the score distributions for 1,000 normally distributed assessment scores that were simulated using WinGen. This sample’s item responses have been scored with three different models: [-1,1], [0,1], and [0,2]. While this shifts and stretches the distribution of scores onto different scales, there is no change in reliability or the standard error of measurement (as a percentage of the score range).

Distribution and assessment statistics for 1,000 simulated test scores with items dichotomously scored three ways: [-1,1], [0,1], and [0,2]

Interpretation issues of negative item scores

If the item scores do not make a difference statistically, and they are just symbols, then why not use negative scores? Remember that an item is a mechanism for collecting and quantifying evidence to support the student model, so how we score our items (and the assessment as a whole) plays a big role in how people interpret the participant’s performance.

Consider an item scored [0,1]. In a CTT model, a score of 1 represents accumulated evidence about the presence or magnitude of a construct, whereas a score of 0 suggests that no evidence was found in the response to this item.

Now suppose we took the same item and scored it [-1,1]. A score of 1 still suggests accumulated evidence, but now we are also changing the total score based on wrong answers. The interpretation is that we have collected evidence about the absence of the construct. To put it another way, the test designer is claiming to have positive evidence that the participant does not know something.

This is not an easy claim to make. In psychometrics, we can attempt to measure the presence of a hypothetical construct, but it is difficult to make a claim that a construct is not there. We can only make inferences about what we observe, and I argue that it is very difficult to build an evidentiary model for someone not knowing something.

Furthermore, negative scores negate evidence we have collected in other items. If a participant gets one item right and earns a point but then loses that point on the next item, we have essentially canceled out the information about the participant from a total score perspective. By using negative scores in a CTT model, we also introduce the possibility that someone can get a negative score on the whole test, but what would a negative score mean? This lack of interpretability is one major reason people do not use negative scores.
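The cancellation is easy to see in a two-item toy example (a minimal sketch):

```python
# One correct and one incorrect answer.
pattern = [1, 0]  # 1 = correct, 0 = incorrect

score_trad = sum(1 if x else 0 for x in pattern)   # scored [0,1]
score_neg  = sum(1 if x else -1 for x in pattern)  # scored [-1,1]

print(score_trad, score_neg)  # 1 0
```

Under [0,1] the correct answer contributes a point of evidence; under [-1,1] the wrong answer wipes that evidence out of the total score entirely.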

Consider a participant who answers 40 items correctly on the 60-item assessment I mentioned earlier. When scored [0,1], the raw score (40 points) corresponds to the number of correct responses provided by the participant. This scale is useful for calculating percentage scores (40/60 = 67% correct), setting cut scores, and supporting the interpretation of the participant’s performance.

When the same items are scored [-1,1], the participant’s score is more difficult to interpret. The participant answered 40 questions correctly, but they only get a score of 20. They know the maximum score on the assessment is 60 points, yet their raw score of 20 corresponds to a correct response rate of 67%, not 33%, since 20 points falls 67% of the way along the range from -60 to 60 points.
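The arithmetic behind that interpretation can be written as a small helper (the function name here is just illustrative):

```python
def percent_of_range(raw, min_score, max_score):
    """Express a raw score as a percentage of the total score range."""
    return 100 * (raw - min_score) / (max_score - min_score)

# 40 of 60 items correct, items scored [0,1]: raw = 40 on a 0-to-60 scale.
print(round(percent_of_range(40, 0, 60), 1))    # 66.7
# Same responses scored [-1,1]: raw = 40 - 20 = 20 on a -60-to-60 scale.
print(round(percent_of_range(20, -60, 60), 1))  # 66.7
```

Both scoring schemes encode the same 67% performance, but only the [0,1] raw score states it transparently.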

There are times when items need to be scored differently from other items on the assessment. Polytomous items clearly need different scoring models (though similar interpretive arguments could be leveled against people who try to score items in fractions of points), and there are times when an item may need to be weighted differently from other items. (We’ll discuss that in my next post.)

Some item response theory (IRT) assessments, like the SAT, use negative points to correct for guessing, but this should only be done if you can demonstrate improved model fit and you have a theory and evidence to justify doing so. In general, when using CTT, negative item scores only serve to muddy the water.


4 Responses to An argument against using negative item scores in CTT

  1. Michael Murphy says:


    I have never used a (-1,1) format. I’ve used a (0,1) and a (-1,0) format, and they both have their purpose. In our technical training arenas, if the number of questions divides evenly into 100, you can use the (0,1) format and get a percentage no matter how you do the math. If you use (-1,0), you can start with 100 points and subtract, so the test taker gets the benefit of the “extra point” (e.g. 30 questions at 3.3 points each, leaving one point left over).

    I advocate dispensing with recorded written test scores completely, except in the case of aptitude, placement, and CLEP-style tests.

    The purpose of the test is for the instructor to use as a tool to see if students have achieved cognitive objectives. Once the instructor is satisfied, just move on. The only purpose for the numbers after that is for data analysis. That will never happen, of course.


  2. Ted Villella says:

    I have several multiple response items in an assessment, and I noticed that if the user selects an incorrect response along with all five of the correct responses, they still achieve the maximum score. This does not seem right to me, so I plan to make the selection of the incorrect responses worth -1. I do not think I can allow a user to make an incorrect choice and not be penalized for it. I am new to Questionmark and still learning how to construct my assessments. Any suggestions?

  3. Austin Fossey says:

    Hi Ted,

    Great example! Given your item design, that may be the best way to approach the incorrect responses. Most test developers avoid this issue by specifying (and enforcing) how many options the participant can select in an item. This can be done with Questionmark’s authoring tools, and it is generally considered best practice for multiple response items. Item writers then provide guidance to the participant in the stem, ending with phrases like “choose three” instead of “choose all that apply.” This strategy avoids the scenario you are describing of having to take away points from a participant who did know all of the correct answers (and just selected extra wrong answers too), and it also helps to avoid issues of response bias for participants who are not sure how many options to pick. I generally encourage my clients to design their multiple response items with this format.

    There are certainly specific cases where one may want a participant to choose answers without any guidance of how many options to select, but more reliable results may be attained by splitting these into multiple items rather than penalizing participants with negative scores in a single multiple response item.

    If the item design does necessitate that the participant select responses without guidance on how many to choose, then your strategy of taking away points will help discriminate between participants who know the precise correct response pattern, but it may not discriminate between other response patterns, and the item may yield negative item scores that can negate points earned on other items.



  4. Rick Ault says:

    In the case of Multiple Response Questions where you do not want to limit the number of choices, but still prevent points from being awarded when an incorrect answer is chosen, wrong answers can be set to score the entire question as a zero regardless of how many correct answers were chosen.

    To set up a question like this, you will need to use the question editor available in Authoring Manager. You will want to arrange your outcomes in the question so that the correct answers are the first outcomes evaluated, and accumulate the points for each correct selection made. Then, for the outcomes that evaluate if a wrong answer is also chosen, you simply score them as a zero, and ensure the “accumulate” check box is turned off for the particular outcome. This way, if a wrong answer is chosen, the accumulated points are overwritten with a zero.
