Standard Setting: A Keystone to Legal Defensibility

Posted by Austin Fossey

Since the last Questionmark Users Conference, I have heard several clients discuss new measures at their companies requiring them to provide evidence of the legal defensibility of their assessments. Legal defensibility and validity are closely intertwined, but they are not synonymous. An assessment can be legally defensible yet still have flaws that impact its validity. The distinction between the two often comes down to the difference between how you developed the instrument and how well you developed it.

Regardless of whether you are concerned with legal defensibility or validity, careful attention should be paid to the evaluative component of your assessment program. What if someone asks, “What does this score mean?” How do you answer? How do you justify your response? The answers to these questions affect how your stakeholders will interpret and use the results, and this may have consequences for your participants. Many factors go into supporting the legal defensibility and validity of assessment results, but one could argue that the keystone is the standard-setting process.

Standard setting is the process of dividing score scales so that scores can be interpreted and actioned (AERA, APA, NCME, 2014). The dividing points between sections of the scales are called “cut scores,” and in criterion-referenced assessment, they typically correspond to performance levels that are defined a priori. These cut scores and their corresponding performance levels help test users make the cognitive leap from a participant’s response pattern to what can be a complex inference about the participant’s knowledge, skills, and abilities.

In their chapter in Educational Measurement (4th Ed.), Hambleton and Pitoniak explain that standard-setting studies need to consider many factors, and that they also can have major implications for participants and test users. For this reason, standard-setting studies are often rigorous, well-documented projects.

At this year’s Questionmark Users Conference, I will be delivering a session that introduces the basics of standard setting. We will discuss standard-setting methods for criterion-referenced and norm-referenced assessments, and we will touch on methods used in both large-scale assessments and in classroom settings. This will be a useful session for anyone who is working on documenting the legal defensibility of their assessment program or who is planning their first standard-setting study and wants to learn about different methods that are available. Participants are encouraged to bring their own questions and stories to share with the group.

Register today for the full conference, but if you cannot make it, make sure to catch the live webcast!

Standard Setting: Compromise and Normative Methods

Posted by Austin Fossey

We have discussed the Angoff and Bookmark methods of standard setting, which are two commonly used methods, but there are many more. I would again refer the interested reader to Hambleton and Pitoniak’s chapter in Educational Measurement (4th ed.) for descriptions of other criterion-referenced methods.

Though criterion-referenced assessment is the typical standard-setting scenario, cut scores may also be determined for normative assessments. In these cases, the cut score is often not set to make an inference about the participant, but instead set to help make an operational decision.

A common example of a normative standard is when the pass rate is set based on information that is unrelated to participants’ performance. A company may decide to hire the ten highest-scoring candidates, not because the other candidates are unqualified, but because there are only ten open positions. Of course, if the candidate pool is weak overall, even the ten highest performers may still turn out to be lousy employees.

We may also set normative standards based on risk tolerance. You may recall from our post about criterion validity that test developers may use a secondary measure that they expect to correlate with performance on the assessment. An employer may wish to set a cut score that minimizes Type I errors (false positives) because of the risk they carry. For example, the ability to fly a plane safely may correlate strongly with aviation test scores, but because of the risk involved in letting an unqualified person fly a plane, we may want to set the cut score high even though we will exclude some qualified pilots.
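
The trade-off above can be sketched with a small simulation. This is a minimal illustration, not anyone's operational procedure: the candidate pool, the correlation between ability and test score, and the 5% tolerance are all hypothetical, and the cut score is simply raised until the Type I error rate among passers falls below the tolerance.

```python
import random

random.seed(42)

# Hypothetical candidate pool: a latent "safe flying" ability plus a noisy
# test score that correlates with it (coefficients chosen arbitrarily).
candidates = []
for _ in range(10_000):
    ability = random.gauss(0, 1)                       # true, unobserved
    score = 0.8 * ability + 0.6 * random.gauss(0, 1)   # observed test score
    candidates.append((score, ability > 0.0))          # True = qualified

def type_one_rate(cut):
    """Share of passers who are actually unqualified (false positives)."""
    passers = [qualified for score, qualified in candidates if score >= cut]
    if not passers:
        return 0.0
    return 1 - sum(passers) / len(passers)

# Raise the cut score until false positives among passers are tolerable.
cut = -2.0
while type_one_rate(cut) > 0.05:
    cut += 0.05

print(f"cut score: {cut:.2f}, Type I rate: {type_one_rate(cut):.3f}")
```

In a real program the criterion data would come from a validation study rather than a simulation, and the tolerance would reflect the program’s actual risk analysis.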

Normative Standard Setting with Secondary Criterion Measure

The opposite scenario may occur as well. If Type I errors have little risk, an employer may set the cut score low to make sure that all qualified candidates are identified. Unqualified candidates who happen to pass may be identified for additional training through subsequent assessments or workplace observation.

If we decide to use a normative approach to standard setting, we need to be sure that there is justification for it, and the cut score should not be used to classify individuals. A normative standard by its nature implies that not everyone will pass the assessment, regardless of their individual abilities, which is why it would be inappropriate for most cases in education or certification assessment.

Hambleton and Pitoniak also describe one final class of standard-setting methods called compromise methods. Compromise methods combine the judgment of the standard setters with information about the political realities of different pass rates. One example is the Hofstee Method, where standard setters define the highest acceptable cut score (1), the lowest acceptable cut score (2), the highest acceptable fail rate (3), and the lowest acceptable fail rate (4). These are plotted against a curve of participants’ score data, and the intersection is used as the cut score.
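
As a sketch of how that intersection can be found numerically, the snippet below scans for the point where the observed fail-rate curve crosses the line running from (lowest acceptable cut, highest acceptable fail rate) to (highest acceptable cut, lowest acceptable fail rate). The score data and all four constraint values are hypothetical.

```python
import random

random.seed(1)

# Hypothetical score distribution on a 0-100 percentage scale.
scores = [random.gauss(70, 10) for _ in range(2_000)]

# Hofstee's four judge-supplied constraints (hypothetical values).
c_min, c_max = 60.0, 75.0   # lowest / highest acceptable cut score
f_min, f_max = 0.05, 0.30   # lowest / highest acceptable fail rate

def fail_rate(cut):
    """Observed proportion of participants scoring below the cut."""
    return sum(s < cut for s in scores) / len(scores)

def hofstee_line(cut):
    """Line from (c_min, f_max) down to (c_max, f_min)."""
    t = (cut - c_min) / (c_max - c_min)
    return f_max + t * (f_min - f_max)

# Scan for the first cut score where the rising fail-rate curve meets
# the falling Hofstee line; that crossing is the compromise cut score.
cut = c_min
while cut <= c_max and fail_rate(cut) < hofstee_line(cut):
    cut += 0.1

print(f"Hofstee compromise cut score: {cut:.1f}")
```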

Hofstee Method Example (adapted from Educational Measurement, Ed. Brennan, 2006)

Standard Setting: Bookmark Method Overview

Posted by Austin Fossey

In my last post, I spoke about using the Angoff Method to determine cut scores in a criterion-referenced assessment. Another commonly used method is the Bookmark Method. While both can be applied to a criterion-referenced assessment, Bookmark is often used in large-scale assessments with multiple forms or vertical score scales, such as some state education tests.

In their chapter entitled “Setting Performance Standards” in Educational Measurement (4th ed.), Ronald Hambleton and Mary Pitoniak describe many commonly used standard-setting procedures. Hambleton and Pitoniak classify the Bookmark as an “item mapping method,” which means that standard setters are presented with an ordered item booklet that is used to map the relationship between item difficulty and participant performance.

In Bookmark, item difficulty must be determined a priori. Note that the Angoff Method does not require us to have item statistics for the standard setting to take place, but we usually will have the item statistics to use as impact data. With Bookmark, item difficulty must be calculated with an item response theory (IRT) model before the standard setting.

Once the items’ difficulty parameters have been established, the psychometricians will assemble the items into an ordered item booklet. Each item gets its own page in the booklet, and the items are ordered from easiest to hardest, such that the hardest item is on the last page.

Each rater receives an ordered item booklet. The raters go through the entire booklet once to read every item. They then go back through and place a bookmark between the two items in the booklet that represent the cut point for what minimally qualified participants should know and be able to do.

Psychometricians will often ask raters to place the bookmark at the item where 67% of minimally qualified participants would get the item right. This value (67%) is called the response probability, and it is an easy value for raters to use because they just pick the item where about two-thirds of minimally qualified participants would answer correctly. Other response probabilities can be used (e.g., 50% of minimally qualified participants), and Hambleton and Pitoniak describe some of the issues around this decision in more detail.

After each rater has placed a bookmark, the process is similar to Angoff. The item difficulties corresponding to each bookmark are averaged, the raters discuss the result, impact data can be reviewed, and then raters re-set their bookmarks before the final cut score is determined. I have also seen larger programs break raters into groups of five people, and each group has its own discussion before bringing its recommended cut score to the larger group. This cuts down on discussion time and keeps any one rater from hijacking the whole group.
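
A minimal sketch of that arithmetic, assuming a Rasch model (so the 67% response probability maps onto the theta scale as an offset of ln 2) and using hypothetical difficulty parameters and bookmark placements:

```python
import math

# Hypothetical Rasch difficulty (b) parameters, one item per booklet page,
# already ordered from easiest to hardest.
ordered_b = [-1.6, -1.1, -0.7, -0.3, 0.0, 0.4, 0.8, 1.1, 1.5, 2.0]

# Hypothetical bookmarks from five raters: each value is the (1-based)
# position of the last item a minimally qualified participant should
# answer correctly at the chosen response probability.
bookmarks = [5, 6, 5, 7, 6]

# With a response probability of 2/3, a Rasch examinee answers an item of
# difficulty b correctly with probability 2/3 at theta = b + ln(2).
rp = 2 / 3
offset = math.log(rp / (1 - rp))  # ln(2), about 0.693

# Each rater's implied cut on the theta scale, then the panel average.
rater_cuts = [ordered_b[pos - 1] + offset for pos in bookmarks]
cut_theta = sum(rater_cuts) / len(rater_cuts)
print(f"panel cut score on the theta scale: {cut_theta:.2f}")
```

In practice the theta-scale cut would then be translated onto the reporting score scale, and the offset depends on the IRT model and the chosen response probability.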

The same process can be followed if we have more than two classifications for the assessment. For example, instead of Pass and Fail, we may have Novice, Proficient, and Advanced. We would need to determine what makes a participant Advanced instead of Proficient, but the same response probability should be used when placing the bookmarks for these two categories.

Standard Setting – How Much Does the Ox Weigh?

Posted by Austin Fossey

At the Questionmark 2013 Users Conference, I had an enjoyable debate with one of our clients about the merits and pitfalls underlying the assumptions of standard setting.

We tend to use methods like Angoff or the Bookmark Method to set standards for high-stakes assessments, and we treat the resulting cut scores as fact, but how can we be sure that the results of the standard setting reflect reality?

In his book, The Wisdom of Crowds, James Surowiecki recounts a story about Sir Francis Galton visiting a fair in 1906. Galton observed a game where people could guess the weight of an ox, and whoever was closest would win a prize.

Because guessing the weight of an ox was considered to be a lot of fun in 1906, hundreds of people lined up and wrote down their best guess. Galton got his hands on their written responses and took them home. He found that while no one guess was exactly right, the crowd’s mean guess was pretty darn good: only one pound off from the true weight of the ox.

We cannot expect any individual’s recommended cut score in a standard setting session to be spot on, but if we select a representative sample of experts and provide them with relevant information about the construct and impact data, we have a good basis for suggesting that their aggregated ratings are a faithful representation of the true cut score.

This is the nature of educational measurement: our certainty about our inferences depends on the amount and quality of the data we have. Just as we infer something about a student’s true abilities based on their responses to carefully selected items on a test, we have to infer something about the true cut score based on our subject matter experts’ responses to carefully constructed dialogues in the standard-setting process.

We can also verify cut scores through validity studies, thus strengthening the case for our stakeholders. So take heart—your standard setters as a group have a pretty good estimate on the weight of that ox.

Standard Setting: Methods for Establishing Cut Scores


Posted by Greg Pope

My last post offered an introduction to standard setting; today I’d like to go into more detail about establishing cut scores. There are many standard setting methods used to set cut scores. These methods are generally split into two types: a) question-centered approaches and b) participant-centered approaches. A few of the most popular methods, with very brief descriptions of each, are provided below. For more detailed information on standard setting procedures and methods see the book, Setting Performance Standards: Concepts, Methods, and Perspectives, edited by Gregory Cizek and Robert Sternberg.

  • Modified Angoff method (question-centered): Subject matter experts (SMEs) are generally briefed on the Angoff method and allowed to take the test with the performance levels in mind. SMEs are then asked to provide estimates for each question of the proportion of borderline or “minimally acceptable” participants that they would expect to get the question correct. The estimates are generally in p-value type form (e.g., 0.6 for item 1: 60% of borderline passing participants would get this question correct). Several rounds are generally conducted with SMEs allowed to modify their estimates given different types of information (e.g., actual participant performance information on each question, other SME estimates, etc.). The final determination of the cut score is then made (e.g., by averaging estimates or taking the median). This method is generally used with multiple-choice questions.
  • I like a dichotomous modified Angoff approach where, instead of using p-value type statistics, SMEs are asked to simply provide a 0 or 1 for each question (“0” if a borderline acceptable participant would get the question wrong and “1” if they would get it right).
  • Nedelsky method (question-centered): SMEs make decisions on a question-by-question basis regarding which of the question distracters they feel borderline participants would be able to eliminate as incorrect. This method is generally used with multiple-choice questions only.
  • Bookmark method (question-centered): Questions are ordered by difficulty (e.g., Item Response Theory b-parameters or Classical Test Theory p-values) from easiest to hardest. SMEs make “bookmark” determinations of where performance levels (e.g., cut scores) should fall (“As the test gets harder, where would a participant on the boundary of the performance level no longer be able to answer questions correctly?”). This method can be used with virtually any question type (e.g., multiple-choice, multiple-response, matching, etc.).
  • Borderline groups method (participant-centered): A description is prepared for each performance category. SMEs are asked to submit a list of participants whose performance on the test should be close to the performance standard (borderline). The test is administered to these borderline groups and the median test score is used as the cut score. This method can be used with virtually any question type (e.g., multiple-choice, multiple response, essay, etc.).
  • Contrasting groups method (participant-centered): SMEs are asked to categorize the participants in their classes according to the performance category descriptions. The test is administered to all of the categorized participants and the test score distributions for each of the categorized groups are compared. Where the distributions of the contrasting groups intersect is where the cut score would be located. This method can be used with virtually any question type (e.g., multiple-choice, multiple response, essay, etc.).
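
To make the Modified Angoff arithmetic concrete, here is a minimal sketch with hypothetical ratings from three SMEs on a five-question test. Each SME’s implied cut score is the expected raw score of a borderline participant (the sum of that SME’s estimates), and the panel cut score is the mean of those values:

```python
# Hypothetical Modified Angoff ratings: each row is one SME's estimated
# probability that a borderline participant answers each of five
# questions correctly.
ratings = [
    [0.6, 0.7, 0.5, 0.8, 0.4],
    [0.5, 0.8, 0.6, 0.7, 0.5],
    [0.7, 0.6, 0.5, 0.9, 0.4],
]

# Each SME's implied cut score: the expected raw score of a borderline
# participant on this five-question test.
sme_cuts = [sum(row) for row in ratings]

# The panel cut score is typically the mean (the median is also common).
cut_score = sum(sme_cuts) / len(sme_cuts)
print(f"recommended cut score: {cut_score:.2f} out of 5")
```

A real study would add the discussion rounds and impact-data reviews described above before the ratings are finalized.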

I hope this was helpful and I am looking forward to talking more about an exciting psychometric topic soon!

Standard Setting: An Introduction


Posted by Greg Pope

Standard setting was a topic of considerable interest to attendees at the Questionmark 2010 Users Conference in March. We had some great discussions about standard-setting methods and practical applications in some of the sessions I was leading, so I thought I would share some details about this topic here.

Standard setting is generally used in summative, criterion-referenced contexts. It is the process of setting a “pass/fail” score that distinguishes participants who have the minimum acceptable level of competence in an area from those who do not. For example, in a crane operation certification course, participants would be expected to have a certain level of knowledge and skills to operate a crane successfully and safely. In addition to a practical test (e.g., operation of a crane in a safe environment), candidates may also be required to take a crane certification exam on which they would need to achieve a certain minimum score in order to be allowed to operate a crane. If the crane certification exam requires a pass score of 75% or higher, anything below 75% means the candidate would need to take the course again. Cut scores do not only refer to pass/fail benchmarks: organizations may have several cut scores within an assessment that differentiate between “Advanced,” “Acceptable,” and “Failed” levels.
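
As a small illustration of how multiple cut scores partition a score scale, here is a sketch using hypothetical 75% and 90% boundaries (echoing the crane example’s pass mark):

```python
# Hypothetical cut scores on a percentage scale, checked from highest down.
CUTS = [(90.0, "Advanced"), (75.0, "Acceptable")]

def classify(score):
    """Map a percentage score to a performance level using the cut scores."""
    for cut, level in CUTS:
        if score >= cut:
            return level
    return "Failed"

print(classify(92), classify(80), classify(60))  # → Advanced Acceptable Failed
```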

Cut scores are very common in high- and medium-stakes assessment programs; well-established processes for setting these cut scores and maintaining them across administrations are available. Generally, one would first build/develop the assessment with the cut score in mind. This would entail selecting questions that represent the proportionate topic areas being covered, ensuring an appropriate distribution of question difficulty, and selecting more questions in the cut score range to maximize the “measurement information” near the cut score.
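
The phrase “measurement information” has a precise meaning in item response theory. Under a Rasch model, an item’s Fisher information is I(θ) = P(θ)(1 − P(θ)), which peaks when the item’s difficulty matches the ability level of interest, so items with difficulty near the cut score measure most precisely there. A quick sketch (the difficulty values are hypothetical):

```python
import math

def rasch_info(theta, b):
    """Fisher information of a Rasch item with difficulty b at ability theta."""
    p = 1 / (1 + math.exp(-(theta - b)))
    return p * (1 - p)

cut_theta = 0.5  # hypothetical cut score on the ability (theta) scale

# An item whose difficulty sits at the cut score is maximally informative
# there; a much easier item contributes far less information at the cut.
at_cut = rasch_info(cut_theta, b=0.5)
easier = rasch_info(cut_theta, b=-1.5)
print(f"{at_cut:.3f} {easier:.3f}")  # → 0.250 0.105
```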

Once a test form is built, it would undergo formal standard-setting procedures to set or confirm the cut score(s). Here is a general overview of a typical Modified Angoff standard-setting process:

Figure: a typical Modified Angoff standard-setting process

Stay tuned for my next post on this topic, in which I will describe some standard setting methods for establishing cut scores.