Standard Setting – How Much Does the Ox Weigh?

Posted by Austin Fossey

At the Questionmark 2013 Users Conference, I had an enjoyable debate with one of our clients about the merits and pitfalls of the assumptions underlying standard setting.

We tend to use methods like Angoff or the Bookmark Method to set standards for high-stakes assessments, and we treat the resulting cut scores as fact, but how can we be sure that the results of the standard setting reflect reality?

In his book, The Wisdom of Crowds, James Surowiecki recounts a story about Sir Francis Galton visiting a fair in 1906. Galton observed a game where people could guess the weight of an ox, and whoever was closest would win a prize.

Because guessing the weight of an ox was considered to be a lot of fun in 1906, hundreds of people lined up and wrote down their best guess. Galton got his hands on their written responses and took them home. He found that while no one guess was exactly right, the crowd’s mean guess was pretty darn good: only one pound off from the true weight of the ox.

We cannot expect any individual’s recommended cut score in a standard setting session to be spot on, but if we select a representative sample of experts and provide them with relevant information about the construct and impact data, we have a good basis for suggesting that their aggregated ratings are a faithful representation of the true cut score.
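
This intuition is easy to simulate. In the minimal Python sketch below, each standard setter's recommended cut score is modeled as the true cut plus independent random error; the panel size, true cut, and error spread are all invented for illustration:

```python
import random

random.seed(1)

TRUE_CUT = 70.0     # hypothetical "true" cut score on a 100-point test
N_RATERS = 15       # an assumed panel size
ERROR_SD = 8.0      # assumed spread of individual judgment error

# Each rater's recommendation is the true cut plus independent error.
recommendations = [random.gauss(TRUE_CUT, ERROR_SD) for _ in range(N_RATERS)]

panel_mean = sum(recommendations) / N_RATERS
worst_miss = max(abs(r - TRUE_CUT) for r in recommendations)

print(f"Panel mean: {panel_mean:.1f} (off by {abs(panel_mean - TRUE_CUT):.1f})")
print(f"Worst individual rater was off by {worst_miss:.1f}")
```

Under these assumptions the mean's error shrinks roughly with the square root of the panel size, which is why the aggregate beats even a good individual judge, provided the errors are independent rather than shared biases.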

This is the nature of educational measurement: our certainty about our inferences depends on the amount and quality of the data we have. Just as we infer something about a student’s true abilities from their responses to carefully selected items on a test, we have to infer something about the true cut score from our subject matter experts’ responses to carefully constructed dialogues in the standard setting process.

We can also verify cut scores through validity studies, thus strengthening the case for our stakeholders. So take heart—your standard setters as a group have a pretty good estimate of the weight of that ox.

Standard Setting: Methods for establishing cut scores


Posted by Greg Pope

My last post offered an introduction to standard setting; today I’d like to go into more detail about establishing cut scores. Many standard setting methods are used to set cut scores, and they are generally split into two types: a) question-centered approaches and b) participant-centered approaches. A few of the most popular methods, with very brief descriptions of each, are provided below. For more detailed information on standard setting procedures and methods, see the book Setting Performance Standards: Concepts, Methods, and Perspectives, edited by Gregory Cizek.

  • Modified Angoff method (question-centered): Subject matter experts (SMEs) are generally briefed on the Angoff method and allowed to take the test with the performance levels in mind. SMEs are then asked to estimate, for each question, the proportion of borderline or “minimally acceptable” participants they would expect to answer the question correctly. The estimates are generally in p-value form (e.g., 0.6 for item 1: 60% of borderline passing participants would get this question correct). Several rounds are generally conducted, with SMEs allowed to modify their estimates given different types of information (e.g., actual participant performance on each question, other SMEs’ estimates, etc.). The final cut score is then determined (e.g., by averaging the estimates or taking the median); a small worked sketch of this aggregation step follows the list. This method is generally used with multiple-choice questions.
  • I like a dichotomous modified Angoff approach where, instead of using p-value type statistics, SMEs are asked simply to provide a 0 or 1 for each question (“0” if a borderline acceptable participant would get the question wrong and “1” if a borderline acceptable participant would get it right).
  • Nedelsky method (question-centered): SMEs make decisions on a question-by-question basis regarding which of the question distracters they feel borderline participants would be able to eliminate as incorrect. This method is generally used with multiple-choice questions only.
  • Bookmark method (question-centered): Questions are ordered by difficulty (e.g., by Item Response Theory b-parameters or Classical Test Theory p-values) from easiest to hardest. SMEs then place a “bookmark” where each performance level (e.g., cut score) should fall (“As the test gets harder, at what point would a participant on the boundary of the performance level stop getting questions correct?”); a sketch of how a bookmark placement maps to a cut score follows the list. This method can be used with virtually any question type (e.g., multiple-choice, multiple-response, matching, etc.).
  • Borderline groups method (participant-centered): A description is prepared for each performance category. SMEs are asked to submit a list of participants whose performance on the test should be close to the performance standard (borderline). The test is administered to these borderline groups and the median test score is used as the cut score. This method can be used with virtually any question type (e.g., multiple-choice, multiple response, essay, etc.).
  • Contrasting groups method (participant-centered): SMEs are asked to categorize the participants in their classes according to the performance category descriptions. The test is administered to all of the categorized participants, and the score distributions of the categorized groups are compared. The cut score is located where the distributions of the contrasting groups intersect (see the final sketch after this list). This method can be used with virtually any question type (e.g., multiple-choice, multiple-response, essay, etc.).
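
To make the modified Angoff arithmetic concrete, here is a minimal Python sketch of the aggregation step described in the first bullet. The ratings are invented, and averaging is only one of the aggregation choices mentioned above:

```python
# Modified-Angoff aggregation sketch (illustrative data, not from the post).
# Rows: SMEs; columns: items. Each entry is the SME's estimate of the
# probability that a borderline participant answers the item correctly.
ratings = [
    [0.6, 0.8, 0.4, 0.7, 0.5],   # SME 1
    [0.5, 0.9, 0.5, 0.6, 0.4],   # SME 2
    [0.7, 0.8, 0.3, 0.7, 0.6],   # SME 3
]

n_smes = len(ratings)
n_items = len(ratings[0])

# Each SME's implied raw cut score is the sum of their item estimates:
# the expected number-correct score for a borderline participant.
sme_cuts = [sum(row) for row in ratings]

# Aggregate across the panel; averaging is one common choice, the
# median (as noted above) is another.
cut_score = sum(sme_cuts) / n_smes

print(f"Per-SME cut scores: {sme_cuts}")
print(f"Panel cut score: {cut_score:.2f} out of {n_items} points")
```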
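
The Bookmark method's final step—turning a bookmark placement into a cut score—can also be sketched. The 2PL item parameters and the RP67 response-probability convention below are assumptions made for illustration, not prescribed by the method description above:

```python
import math

# Illustrative 2PL item parameters (a = discrimination, b = difficulty);
# these numbers are made up for the sketch.
items = [
    {"name": "Q3", "a": 1.2, "b": -1.0},
    {"name": "Q1", "a": 0.9, "b": -0.2},
    {"name": "Q5", "a": 1.1, "b": 0.4},
    {"name": "Q2", "a": 1.0, "b": 0.9},
    {"name": "Q4", "a": 1.3, "b": 1.6},
]

# Step 1: build the ordered item booklet, easiest to hardest.
booklet = sorted(items, key=lambda item: item["b"])

# Step 2: suppose the panel places its bookmark on the 4th item, meaning
# a borderline participant should master everything before it.
bookmark = booklet[3]

# Step 3: map the bookmarked item to a theta cut score using a response
# probability criterion (RP67, i.e. a 2/3 chance of success, is a common
# convention). For the 2PL model, P(theta) = RP solves to:
RP = 2 / 3
theta_cut = bookmark["b"] + math.log(RP / (1 - RP)) / bookmark["a"]

print(f"Bookmarked item: {bookmark['name']}, theta cut: {theta_cut:.2f}")
```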
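
Finally, a sketch of the contrasting groups comparison. Rather than fitting distributions, this version simply scans candidate cut scores for the point that misclassifies the fewest SME-categorized participants, which coincides with where two unimodal score distributions cross. All scores are invented:

```python
# Contrasting-groups sketch: pick the cut that best separates the two
# SME-categorized groups. Scores are illustrative only.
masters = [78, 82, 85, 88, 90, 91, 93]        # categorized as "competent"
non_masters = [55, 60, 64, 70, 72, 75, 80]    # categorized as "not yet competent"

def misclassified(cut):
    # Non-masters at or above the cut, plus masters below it.
    return (sum(1 for s in non_masters if s >= cut)
            + sum(1 for s in masters if s < cut))

# Scan every candidate cut in the observed score range; where the two
# score distributions cross, misclassification is at its minimum.
candidates = range(min(non_masters), max(masters) + 1)
cut = min(candidates, key=misclassified)

print(f"Cut score: {cut} ({misclassified(cut)} misclassified)")
```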

I hope this was helpful, and I look forward to talking more about an exciting psychometric topic soon!