# Simpson’s Paradox and the Steelyard Graph

If you work with assessment statistics or just about any branch of social science, you may be familiar with Simpson’s paradox—the idea that data trends between subgroups change or disappear when the subgroups are aggregated. There are hundreds of examples of Simpson’s paradox (and I encourage you to search some on the internet for kicks), but here is a simple example for the sake of illustration.

Let us say that I am looking to get trained as a certified window washer so that I can wash windows on Boston’s skyscrapers. Two schools in my area offer training, and both had 300 students graduate last year. Graduates from School A had an average certification test score of 70.7%, and graduates from School B had an average score of 69.0%. Ignoring for the moment whether these differences are significant, as a student I will likely choose School A due to its higher average test scores.

But here is where the paradox happens. Consider now that I have a crippling fear of heights, which may be a hindrance for my window-washing aspirations. It turns out that School A and School B also track test scores for their graduates based on whether or not they have a fear of heights. The table below reports the average scores for these phobic subgroups.

Notice anything? The average score for people with and without a fear of heights in School B is higher than the same groups in School A. The paradox is that School A has a higher average test score overall, yet School B can boast better average test scores for students with a fear of heights and students without a fear of heights. School B’s overall average is lower because they simply had more students with a fear of heights. If we want to test the significance of these differences, we can do so with ANOVA.

Gaviria and González-Barbera’s Steelyard Graph

Simpson’s paradox occurs in many different fields, but it is sometimes difficult to explain to stakeholders. Tables (like the one above) are often used to
illustrate the subgroup differences, but in the Fall 2014 issue of Educational Measurement, José-Luis Gaviria and Coral González-Barbera from the Universidad Complutense de Madrid won the publication’s data visualization contest with their Steelyard Graph, which illustrates Simpson’s Paradox with a graph resembling a steelyard balance. The publication’s visual editor, ETS’s Katherine Furgol Castellano, wrote the discussion piece for the Steelyard Graph, praising Gaviria and González-Barbera for the simplicity of the approach and the novel yet astute strategy of representing averages with balanced levers.

The figure below illustrates the same data from the table above using Gaviria and González-Barbera’s Steelyard Graph approach. The size of the squares corresponds to the number of students, the location on the lever indicates the average subgroup score, and the triangular fulcrum represents the school’s overall average score. Notice how clear it is that the subgroups in School B have higher average scores than their counterparts in School A. The example below has only two subgroups, but the same approach can be used for more subgroups.

Example of Gaviria and González-Barbera’s Steelyard Graph to visualize Simpson’s paradox for subgroups’ average test scores.

Making a Decision when Faced with Simpson’s Paradox

When one encounters Simpson’s paradox, decision-making can be difficult, especially if there are no theories to explain why the relational pattern is different at a subgroup level. This is why exploratory analysis often must be driven by and interpreted through a lens of theory. One could come up with arbitrary subgroups that reverse the aggregate relationships, even though there is no theoretical grounding for doing so. On the other hand, relevant subgroups may remain unidentified by researchers, though the aggregate relationship may still be sufficient for decision-making.

For example, as a window-washing student seeing the phobic subgroups’ performances, I might decide that School B is the superior school for teaching the trade, regardless of which subgroup a student belongs to. This decision is based on a theory that a fear of heights may impact performance on the certification assessment, in which case School B does a better job at preparing both subgroups for their assessments. If that theory is not tenable, it may be that School A is really the better choice, but as an acrophobic would-be window washer, I will likely choose School B after seeing this graph . . . as long as the classroom is located on the ground floor.

# An argument against using negative item scores in CTT

Last year, a client asked for my opinion about whether or not to use negative scores on test items. For example, if a participant answers an item correctly, they would get one point, but if they answer the item incorrectly, they would lose one point. This means the item would be scored dichotomously [-1,1] instead of in the more traditional way [0,1].

I believe that negative item scores are really useful if the goal is to confuse and mislead participants. They are not appropriate for most classical test theory (CTT) assessment designs, because they do not add measurement value, and they are difficult to interpret.

Interested in learning more about classical test theory and applying item analysis concepts? Join Psychometrician Austin Fossey for a free 75 minute online workshop — Item Analysis: Concepts and Practice — Tuesday, June 23, 2015  *space is limited

Measurement value of negative item scores

Changing the item scoring format from [0,1] to [-1,1] does not change anything about your ability to measure participants—after all, the dichotomous scores are just symbols. You are simply using a different total score scale.

Consider a 60-item assessment made up of dichotomously scored items. If the items are scored [0,1], the total score scale ranges from 0 to 60 points. If scored [-1,1], the score range doubles, now ranging from -60 to 60 points.

From a statistical standpoint, nothing has changed. The item-total discrimination statistics will be the same under both designs, as will the assessment’s reliability. The standard error of measurement will double, but that is to be expected because the score range has doubled. Thus there is no change in the precision of scores or misclassification rates. How you score the items does not matter as long as they are scored dichotomously on the same scale.

The figure below illustrates the score distributions for 1,000 normally distributed assessment scores that were simulated using WinGen. This sample’s item responses have been scored with three different models: [-1,1], [0,1], and [0,2]. While this shifts and stretches the distribution of scores on to different scales, there is no change in reliability or the standard error of measurement (as a percentage of the score range).

Distribution and assessment statistics for 1,000 simulated test scores with items dichotomously scored three ways: [-1,1], [0,1], and [0,2]

Interpretation issues of negative item scores

If the item scores do not make a difference statistically, and they are just symbols, then why not use negative scores? Remember that an item is a mechanism for collecting and quantifying evidence to support the student model, so how we score our items (and the assessment as a whole) plays a big role in how people interpret the participant’s performance.

Consider an item scored [0,1]. In a CTT model, a score of 1 represents accumulated evidence about the presence or magnitude of a construct, whereas a score of 0 suggests that no evidence was found in the response to this item.

Now suppose we took the same item and scored it [-1,1]. A score of 1 still suggests accumulated evidence, but now we are also changing the total score based on wrong answers. The interpretation is that we have collected evidence about the absence of the construct. To put it another way, the test designer is claiming to have positive evidence that the participants does not know something.

This is not an easy claim to make. In psychometrics, we can attempt to measure the presence of a hypothetical construct, but it is difficult to make a claim that a construct is not there. We can only make inferences about what we observe, and I argue that it is very difficult to build an evidentiary model for someone not knowing something.

Furthermore, negative scores negate evidence we have collected in other items. If a participant gets one item right and earns a point but then loses that point on the next item, we have essentially canceled out the information about the participant from a total score perspective. By using negative scores in a CTT model, we also introduce the possibility that someone can get a negative score on the whole test, but what would a negative score mean? This lack of interpretability is one major reason people do not use negative scores.

Consider a participant who answers 40 items correctly on the 60-item assessment I mentioned earlier. When scored [0,1], the raw score (40 points) corresponds to the number of correct responses provided by the participant. This scale is useful for calculating percentage scores (40/60 = 67% correct), setting cut scores, and supporting the interpretation of the participant’s performance.

When the same items are scored [1,-1], the participant’s score is more difficult to interpret. The participant answered 40 questions correctly, but they only get a score of 20. They know the maximum score on the assessment is 60 points, yet their raw score of 20 corresponds to a correct response rate of 67%, not 33%, since 20 points corresponds to 67% of the range between -60 to 60 points.

There are times when items need to be scored differently from other items on the assessment. Polytomous items clearly need different scoring models (though similar interpretive arguments could be leveled against people who try to score items in fractions of points), and there are times when an item may need to be weighted differently from other items. (We’ll discuss that in my next post.)Some item response theory (IRT) assessments like the SAT use negative points to correct for guessing, but this should only be done if you can demonstrate improved model fit and you have a theory and evidence to justify doing so. In general, when using CTT, negative item scores only serve to muddy the water.

Interested in learning more about classical test theory and item statistics? Psychometrician Austin Fossey will be delivering a free 75 minute online workshop — Item Analysis: Concepts and Practice Tuesday, June 23, 2015 11:00 AM – 12:15 PM EDT  *spots are limited

# Standard Setting: Bookmark Method Overview

Posted by Austin Fossey

In my last post, I spoke about using the Angoff Method to determine cut scores in a criterion-referenced assessment. Another commonly used method is the Bookmark Method. While both can be applied to a criterion-referenced assessment, Bookmark is often used in large-scale assessments with multiple forms or vertical score scales, such as some state education tests.

In their chapter entitled “Setting Performance Standards” in Educational Measurement (4th ed.), Ronald Hambleton and Mary Pitoniak discuss describe many commonly used standard setting procedures. Hambleton and Pitoniak classify the Bookmark as an “item mapping method,” which means that standard setters are presented with an ordered item booklet that is used to map the relationship between item difficulty and participant performance.

In Bookmark, item difficulty must be determined a priori. Note that the Angoff Method does not require us to have item statistics for the standard setting to take place, but we usually will have the item statistics to use as impact data. With Bookmark, item difficulty must be calculated with an item response theory (IRT) model before the standard setting.

Once the items’ difficulty parameters have been established, the psychometricians will assemble the items into an ordered item booklet. Each item gets its own page in the booklet, and the items are ordered from easiest to hardest, such that the hardest item is on the last page.

Each rater receives an ordered item booklet. The raters go through the entire booklet once to read every item. They then go back through and place a bookmark between the two items in the booklet that represent the cut point for what minimally qualified participants should know and be able to do.

Psychometricians will often ask raters to place the bookmark at the item where 67% of minimally qualified participants will get the item right. 67% is called the response probability, and it is an easy value for raters to use because they just pick the item where about two-thirds of minimally qualified participants will get the item right. Other response probabilities can be used (e.g., 50% of minimally qualified participants), and Hambleton and Pitoniak describe some of the issues around this decision in more detail.

After each rater has placed a bookmark, the process is similar to Angoff. The item difficulties corresponding to each bookmark are averaged, the raters discuss the result, impact data can be reviewed, and then raters re-set their bookmark before the final cut score is determined. I have also seen larger  programs break raters into groups of five people, and each group has their own discussion before bringing their recommended cut score to the larger group. This cuts down on discussion time and keeps any one rater from hijacking the whole group.

The same process can be followed if we have more than two classifications for the assessment. For example, instead of Pass and Fail, we may have Novice, Proficient, and Advanced. We would need to determine what makes a participant Advanced instead of Proficient, but the same response probability should be used when placing the bookmarks for these two categories.