Standard Setting: Bookmark Method Overview

Posted by Austin Fossey

In my last post, I spoke about using the Angoff Method to determine cut scores in a criterion-referenced assessment. Another commonly used method is the Bookmark Method. While both can be applied to a criterion-referenced assessment, Bookmark is often used in large-scale assessments with multiple forms or vertical score scales, such as some state education tests.

In their chapter entitled “Setting Performance Standards” in Educational Measurement (4th ed.), Ronald Hambleton and Mary Pitoniak describe many commonly used standard setting procedures. Hambleton and Pitoniak classify Bookmark as an “item mapping method,” meaning that standard setters are presented with an ordered item booklet that is used to map the relationship between item difficulty and participant performance.

In Bookmark, item difficulties must be determined a priori: they are calibrated with an item response theory (IRT) model before the standard setting takes place. Note that the Angoff Method does not require item statistics for the standard setting to occur, though we usually have them on hand to use as impact data.
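
For readers who like to see the numbers, here is a minimal sketch of what an IRT difficulty parameter means under the one-parameter (Rasch) model. The model choice and the parameter values below are illustrative assumptions on my part, not part of the Bookmark procedure itself.

```python
import math

def rasch_probability(theta: float, b: float) -> float:
    """P(correct) for a participant of ability theta on an item of
    difficulty b under the Rasch (1PL) model."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

# A harder item (higher b) yields a lower probability of success
# for the same participant ability.
for b in (-1.0, 0.0, 1.0):
    print(f"b = {b:+.1f}: P(correct | theta = 0) = {rasch_probability(0.0, b):.2f}")
```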

Once the items’ difficulty parameters have been established, the psychometricians will assemble the items into an ordered item booklet. Each item gets its own page in the booklet, and the items are ordered from easiest to hardest, such that the hardest item is on the last page.
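
As a rough sketch of the booklet assembly step, suppose we have calibrated Rasch difficulties for a handful of items (the item IDs and values here are made up):

```python
# Hypothetical item difficulty estimates (Rasch b parameters) keyed by item ID.
item_difficulties = {
    "ITEM-07": -1.42,
    "ITEM-03": -0.55,
    "ITEM-12":  0.18,
    "ITEM-01":  0.94,
    "ITEM-09":  1.67,
}

# Order items from easiest (lowest b) to hardest (highest b);
# page 1 holds the easiest item, the last page the hardest.
ordered_booklet = sorted(item_difficulties.items(), key=lambda pair: pair[1])

for page, (item_id, b) in enumerate(ordered_booklet, start=1):
    print(f"Page {page}: {item_id} (b = {b:+.2f})")
```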

Each rater receives an ordered item booklet. The raters go through the entire booklet once to read every item. They then go back through and place a bookmark between the two items that bracket the cut point for what minimally qualified participants should know and be able to do.

Psychometricians will often ask raters to place the bookmark at the item where 67% of minimally qualified participants will get the item right. This percentage is called the response probability, and 67% is a convenient value because raters simply pick the item that about two-thirds of minimally qualified participants would answer correctly. Other response probabilities can be used (e.g., 50% of minimally qualified participants), and Hambleton and Pitoniak describe some of the issues around this decision in more detail.
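
To see how a bookmark translates into a cut score, note that under the Rasch model (an assumption for this sketch) a response probability RP pins down the ability level at which P(correct) = RP for the bookmarked item: theta_cut = b + ln(RP / (1 − RP)). The difficulty value below is hypothetical:

```python
import math

def theta_cut_from_bookmark(b: float, response_probability: float = 0.67) -> float:
    """Ability level at which a minimally qualified participant has the
    chosen response probability of answering the bookmarked item
    correctly, assuming the Rasch model."""
    return b + math.log(response_probability / (1.0 - response_probability))

# If a rater's bookmark falls on an item with difficulty b = 0.18,
# the implied cut score on the theta scale is:
print(f"{theta_cut_from_bookmark(0.18):.2f}")  # roughly 0.89
```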

After each rater has placed a bookmark, the process is similar to Angoff. The item difficulties corresponding to the bookmarks are averaged, the raters discuss the result, impact data can be reviewed, and the raters then reset their bookmarks before the final cut score is determined. I have also seen larger programs break raters into groups of five, where each group holds its own discussion before bringing its recommended cut score to the larger group. This cuts down on discussion time and keeps any one rater from hijacking the whole group.
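
Aggregation is then straightforward. In the sketch below (all bookmark placements are hypothetical), each rater's bookmarked item difficulty is shifted to the theta scale with the RP67 constant and then averaged; because the shift is the same for every rater under the Rasch model, this is equivalent to averaging the bookmarked difficulties first.

```python
import math
import statistics

# Hypothetical difficulties of the items five raters bookmarked.
bookmarked_difficulties = [0.18, 0.35, -0.05, 0.42, 0.21]

# Shift each bookmark to the theta scale using RP67, then average.
rp_shift = math.log(0.67 / 0.33)  # about 0.71
theta_cuts = [b + rp_shift for b in bookmarked_difficulties]
print(f"Recommended cut score (theta): {statistics.mean(theta_cuts):.2f}")
```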

The same process can be followed if we have more than two classifications for the assessment. For example, instead of Pass and Fail, we may have Novice, Proficient, and Advanced. We would need to determine what makes a participant Advanced instead of Proficient, but the same response probability should be used when placing the bookmarks for these two categories.
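
Extending the sketch to multiple cut points simply means collecting one set of bookmarks per classification boundary, with the same response probability applied to each (again, all values here are hypothetical):

```python
import statistics

# Hypothetical bookmark placements: each rater marks one item difficulty
# per cut point (Proficient and Advanced), using the same response probability.
bookmarks = {
    "Proficient": [-0.30, -0.12, 0.05, -0.21, -0.08],
    "Advanced":   [0.95, 1.10, 0.88, 1.25, 1.02],
}

for level, difficulties in bookmarks.items():
    cut = statistics.mean(difficulties)
    print(f"{level} cut (mean bookmarked difficulty): {cut:+.2f}")
```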
