Standard Setting: Bookmark Method Overview

Posted by Austin Fossey

In my last post, I spoke about using the Angoff Method to determine cut scores in a criterion-referenced assessment. Another commonly used method is the Bookmark Method. While both can be applied to a criterion-referenced assessment, Bookmark is often used in large-scale assessments with multiple forms or vertical score scales, such as some state education tests.

In their chapter entitled “Setting Performance Standards” in Educational Measurement (4th ed.), Ronald Hambleton and Mary Pitoniak describe many commonly used standard setting procedures. They classify the Bookmark as an “item mapping method,” meaning that standard setters are presented with an ordered item booklet that maps the relationship between item difficulty and participant performance.

In Bookmark, item difficulty must be determined a priori. Note that the Angoff Method does not require us to have item statistics for the standard setting to take place, but we usually will have the item statistics to use as impact data. With Bookmark, item difficulty must be calculated with an item response theory (IRT) model before the standard setting.

Once the items’ difficulty parameters have been established, the psychometricians will assemble the items into an ordered item booklet. Each item gets its own page in the booklet, and the items are ordered from easiest to hardest, such that the hardest item is on the last page.

Each rater receives an ordered item booklet. The raters go through the entire booklet once to read every item. They then go back through and place a bookmark between the two items in the booklet that represent the cut point for what minimally qualified participants should know and be able to do.

Psychometricians often ask raters to place the bookmark at the item where 67% of minimally qualified participants would get the item right. This value is called the response probability, and 67% is easy for raters to use because they simply pick the item that about two-thirds of minimally qualified participants would answer correctly. Other response probabilities can be used (e.g., 50% of minimally qualified participants), and Hambleton and Pitoniak describe the issues around this decision in more detail.
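Under an IRT model, the response probability translates a bookmarked item’s difficulty directly into a cut score on the ability scale. A minimal sketch, assuming a Rasch (1PL) model; the difficulty value is illustrative:

```python
import math

def bookmark_cut_theta(b_bookmark, rp=0.67):
    """Ability (theta) at which a participant answers the bookmarked item
    correctly with probability rp, under a Rasch model:
    P(correct) = 1 / (1 + exp(-(theta - b)))."""
    return b_bookmark + math.log(rp / (1.0 - rp))

# With a 67% response probability, a bookmarked item of difficulty b = 0.5
# implies a cut score roughly 0.71 logits above the item's difficulty
theta_cut = bookmark_cut_theta(0.5, rp=0.67)
```

Note that a 50% response probability places the cut exactly at the bookmarked item’s difficulty, which is one reason the choice of response probability matters.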

After each rater has placed a bookmark, the process is similar to Angoff. The item difficulties corresponding to each bookmark are averaged, the raters discuss the result, impact data can be reviewed, and then raters re-set their bookmarks before the final cut score is determined. I have also seen larger programs break raters into groups of five, with each group holding its own discussion before bringing its recommended cut score to the larger group. This cuts down on discussion time and keeps any one rater from hijacking the whole group.
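The aggregation step above can be sketched in a few lines. The item difficulties and bookmark placements below are invented for illustration; a real program would use the booklet’s estimated b-parameters:

```python
import statistics

# b-parameters of the ordered item booklet, easiest to hardest (illustrative)
item_difficulties = [-1.8, -1.2, -0.6, -0.1, 0.4, 0.9, 1.5, 2.1]

# Each rater's bookmark: the index of the first item past their cut point
bookmarks = [4, 5, 4, 3, 5]

# Map each bookmark to the difficulty of the last item before it,
# then average across raters for the recommended cut
rater_cuts = [item_difficulties[b - 1] for b in bookmarks]
recommended_cut = statistics.mean(rater_cuts)
```

In practice the averaged difficulty would then be converted to the reporting scale, and the discussion and re-rating rounds would repeat before the cut is finalized.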

The same process can be followed if we have more than two classifications for the assessment. For example, instead of Pass and Fail, we may have Novice, Proficient, and Advanced. We would need to determine what makes a participant Advanced instead of Proficient, but the same response probability should be used when placing the bookmarks for these two categories.

Standard Setting: Angoff Method Considerations

Posted by Austin Fossey

In my last few posts, I spoke about validity. You may recall that our big takeaway was that validity has to do with the inferences we make about assessment results.

With many of our customers conducting criterion-referenced assessments, I’d like to use my next few posts to talk about setting standards that guide inferences about outcomes (e.g., pass and fail).

I’ll start by discussing the Angoff Method – about which Questionmark has some great resources including a recorded webinar on the subject. I encourage you to use this as a reference if you plan on using this method to set standards. Just to summarize, there are five key steps in the Angoff Method:

  1. Select the raters.
  2. Take the assessment.
  3. Rate the items.
  4. Review the ratings.
  5. Determine the cut score.

Expert Ratings Spreadsheet Example

Using the Angoff Method to Set Cut Scores (Wheaton & Parry, 2012)

Some psychometricians repeat steps 3-5 in a modified version of this method. In my experience, when raters compare their results, their second ratings often regress to the mean. If the assessment has been field tested, the psychometrician may also use the first round of ratings to tell the raters how many participants would have passed based on their recommended cut score (impact data).

Whether or not you choose to do a second round of rating depends on your preference. A second round means that your raters’ results may be biased by the group’s ratings and impact data, but it also serves to rein in outliers that might skew the group’s recommended cut score. This latter problem can also be mitigated by having a large number of representative raters, as discussed in a previous post.
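To make the mechanics concrete, here is a small sketch of how a modified Angoff cut score and its impact data might be computed. All ratings and field-test scores are invented for illustration:

```python
# Each row: one rater's Angoff estimates (probability that a minimally
# qualified participant answers each item correctly); values illustrative
ratings = [
    [0.70, 0.55, 0.80, 0.60],
    [0.65, 0.60, 0.75, 0.70],
    [0.75, 0.50, 0.85, 0.65],
]

# A rater's recommended raw cut is the sum of their item estimates;
# the panel's recommended cut is the mean across raters
rater_cuts = [sum(r) for r in ratings]
panel_cut = sum(rater_cuts) / len(rater_cuts)

# Impact data: share of field-test participants who would pass this cut
field_test_scores = [1.5, 2.0, 2.4, 2.8, 3.0, 3.3, 3.6]
pass_rate = sum(s >= panel_cut for s in field_test_scores) / len(field_test_scores)
```

Showing raters a pass rate like this between rounds is exactly the kind of impact data that can pull outlying ratings back toward a defensible consensus.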

As part of step 3, psychometricians train raters to rate items, and the toughest part is defining a minimally qualified participant. I find that raters often make the mistake of discussing what a minimally qualified participant should be able to do rather than what they can do. Taking the time to nail down this definition will help to calibrate the group and temper that one overzealous rater who insists that participants get 100% of items correct to pass the assessment.

The definition of a minimally qualified participant depends on the observable variables you are measuring to make inferences about the construct. If your assessment has a blueprint, your psychometrician may guide the raters in a discussion of what a minimally qualified participant is able to do in each content area.

For example, if the assessment has a content area about ingredients in a peanut butter sandwich, there may be a brief discussion to confirm that a minimally qualified participant knows how to unscrew the lid of the peanut butter jar.

This example is silly (and delicious), but this level of detail is valuable when two of your raters disagree about what a minimally qualified participant is able to do. Resolving these disagreements before rating items helps to ensure that differences in ratings are a result of raters’ opinions and not artifacts of misunderstandings about what it means to be minimally qualified.

Standard Setting – How Much Does the Ox Weigh?

Posted by Austin Fossey

At the Questionmark 2013 Users Conference, I had an enjoyable debate with one of our clients about the merits and pitfalls underlying the assumptions of standard setting.

We tend to use methods like Angoff or the Bookmark Method to set standards for high-stakes assessments, and we treat the resulting cut scores as fact, but how can we be sure that the results of the standard setting reflect reality?

In his book, The Wisdom of Crowds, James Surowiecki recounts a story about Sir Francis Galton visiting a fair in 1906. Galton observed a game where people could guess the weight of an ox, and whoever was closest would win a prize.

Because guessing the weight of an ox was considered to be a lot of fun in 1906, hundreds of people lined up and wrote down their best guess. Galton got his hands on their written responses and took them home. He found that while no one guess was exactly right, the crowd’s mean guess was pretty darn good: only one pound off from the true weight of the ox.

We cannot expect any individual’s recommended cut score in a standard setting session to be spot on, but if we select a representative sample of experts and provide them with relevant information about the construct and impact data, we have a good basis for suggesting that their aggregated ratings are a faithful representation of the true cut score.
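The statistical intuition here can be illustrated with a toy simulation; the true weight and error model below are invented for the sketch, not Galton’s data:

```python
import random
import statistics

random.seed(7)
true_weight = 1198  # pounds; illustrative value

# Each fairgoer guesses with substantial individual error
guesses = [true_weight + random.gauss(0, 80) for _ in range(800)]

crowd_estimate = statistics.mean(guesses)
crowd_error = abs(crowd_estimate - true_weight)
worst_individual_error = max(abs(g - true_weight) for g in guesses)
# The crowd mean lands far closer to the truth than the typical guess
```

The same logic is why we average a panel’s ratings rather than trust any single rater’s recommended cut.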

This is the nature of educational measurement—our certainty about our inferences depends on the amount and quality of the data we have. Just as we infer something about a student’s true abilities from their responses to carefully selected items on a test, we infer something about the true cut score from our subject matter experts’ responses to carefully constructed dialogues in the standard setting process.

We can also verify cut scores through validity studies, thus strengthening the case for our stakeholders. So take heart—your standard setters as a group have a pretty good estimate on the weight of that ox.

Webinar: Using the Angoff method to set cut scores

Posted by Joan Phaup

How do you set appropriate pass/fail scores for competency tests?

We learned a lot about this during this year’s Questionmark Users Conference from two customers who have used the Angoff method for setting cut scores and think it’s a practical answer to this question.

Alan H. Wheaton and James R. Parry, who are involved respectively in curriculum management and test development for a large government agency, regard the Angoff method as a systematic, effective approach to establishing pass/fail scores for advancement tests. They will share their experiences and lessons learned during a Questionmark Customers Online webinar at 1 p.m. Eastern Time on Thursday, May 31.

Click here to sign up for Using the Angoff Method to Set Cut Scores, and plan to join us for an hour at the end of this month. The webinar will explain a five-step process for implementing the Angoff method as a way to improve the defensibility of your tests.

The Future Looks Bright

Posted by Jim Farrell

Snapshot from a “Future Solutions” focus group

Our Users Conferences are a time for us to celebrate our accomplishments and look forward to the challenges that lie ahead. This year’s conference was full of amazing sessions presented by Questionmark staff and customers. Our software is being used to solve complex business problems, and from a product standpoint it is very exciting to bring these real-life scenarios to our development teams to inspire them.

So where do we go from here? The Conference is our chance to stand in front of our customers and get feedback on our roadmap. We also held smaller “Future Solutions” focus groups to get feedback from our customers on what we have done and what we could do in the future to help them. In the best of times, these are an affirmation that we are on the right path. This was definitely one of those years.

One of our Future Solutions sessions focused on authoring. During that session, Doug Peterson and I laid out the future of Questionmark Live. This included an aggressive delivery cycle that will bring future releases at a rapid pace. Stay tuned for videos on new features available soon.

Ok…enough about us. This conference is really about our customers. The panel and peer discussion strand of this year’s conference had some of the most interesting topics. John Kleeman has already mentioned the security panel with our friends from Pearson Vue, ProctorU, Innovative Exams and Shenandoah University.

Another session that stood out was a peer discussion on test defensibility using the Angoff method to set cut scores. This conversation was very interesting to me as someone who once had to create defensible assessments. I am eager to see organizations utilize Angoff because you not only want legally defensible assessments; you also want to define levels of competency for a role and be able to determine how those levels can predict future performance.

For those of you who do not know, the Angoff method is a way for Subject Matter Experts (SMEs) to grade the probability of a marginal student getting a question right. Attendees at this conference session were provided a handout that includes a seven-step flowchart guiding them in the design, development and implementation of the Angoff method.

If you are interested in Angoff and setting test scores I highly recommend reading Criterion-Referenced Test Development written by our good friends Sharon Shrock and Bill Coscarelli.

We really hope to see everyone at the 2013 Users Conference in Baltimore March 3 – 6. (I am hoping we may even get a chance to visit the beautiful Camden Yards!)

Standard Setting: Methods for establishing cut scores


Posted by Greg Pope

My last post offered an introduction to standard setting; today I’d like to go into more detail about establishing cut scores. There are many standard setting methods used to set cut scores. These methods are generally split into two types: a) question-centered approaches and b) participant-centered approaches. A few of the most popular methods, with very brief descriptions of each, are provided below. For more detailed information on standard setting procedures and methods see the book, Setting Performance Standards: Concepts, Methods, and Perspectives, edited by Gregory Cizek and Robert Sternberg.

  • Modified Angoff method (question-centered): Subject matter experts (SMEs) are generally briefed on the Angoff method and allowed to take the test with the performance levels in mind. SMEs are then asked to provide estimates for each question of the proportion of borderline or “minimally acceptable” participants that they would expect to get the question correct. The estimates are generally in p-value type form (e.g., 0.6 for item 1: 60% of borderline passing participants would get this question correct). Several rounds are generally conducted with SMEs allowed to modify their estimates given different types of information (e.g., actual participant performance information on each question, other SME estimates, etc.). The final determination of the cut score is then made (e.g., by averaging estimates or taking the median). This method is generally used with multiple-choice questions.
  • I like a dichotomous modified Angoff approach where, instead of using p-value type statistics, SMEs are asked simply to provide a 0 or 1 for each question (“0” if a borderline acceptable participant would get the question wrong and “1” if they would get it right).
  • Nedelsky method (question-centered): SMEs make decisions on a question-by-question basis regarding which of the question distracters they feel borderline participants would be able to eliminate as incorrect. This method is generally used with multiple-choice questions only.
  • Bookmark method (question-centered): Questions are ordered by difficulty (e.g., Item Response Theory b-parameters or Classical Test Theory p-values) from easiest to hardest. SMEs make “bookmark” determinations of where performance levels (e.g., cut scores) should be (“As the test gets harder, where would a participant on the boundary of the performance level not be able to get any more questions correct?”) This method can be used with virtually any question type (e.g., multiple-choice, multiple-response, matching, etc.).
  • Borderline groups method (participant-centered): A description is prepared for each performance category. SMEs are asked to submit a list of participants whose performance on the test should be close to the performance standard (borderline). The test is administered to these borderline groups and the median test score is used as the cut score. This method can be used with virtually any question type (e.g., multiple-choice, multiple response, essay, etc.).
  • Contrasting groups method (participant-centered): SMEs are asked to categorize the participants in their classes according to the performance category descriptions. The test is administered to all of the categorized participants and the test score distributions for each of the categorized groups are compared. Where the distributions of the contrasting groups intersect is where the cut score would be located. This method can be used with virtually any question type (e.g., multiple-choice, multiple response, essay, etc.).
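As a concrete sketch of the contrasting groups method, the snippet below picks the cut score that best separates two SME-categorized score distributions. The scores are invented, and minimizing misclassifications is one common way to operationalize “where the distributions intersect”:

```python
# Illustrative test scores for SME-categorized groups
non_masters = [42, 45, 48, 50, 52, 55, 57, 58, 60, 63]
masters = [55, 58, 60, 62, 64, 66, 68, 70, 73, 75]

def contrasting_groups_cut(low_group, high_group, candidates):
    """Choose the cut score minimizing misclassifications: low-group
    members at or above the cut plus high-group members below it."""
    def misclassified(cut):
        return (sum(s >= cut for s in low_group)
                + sum(s < cut for s in high_group))
    return min(candidates, key=misclassified)

cut = contrasting_groups_cut(non_masters, masters, range(40, 80))
```

With real data the distributions would be smoothed or modeled before locating the intersection, but the idea is the same: the cut falls where the two groups overlap least.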

I hope this was helpful and I am looking forward to talking more about an exciting psychometric topic soon!