Standard Setting – How Much Does the Ox Weigh?

Posted by Austin Fossey

At the Questionmark 2013 Users Conference, I had an enjoyable debate with one of our clients about the merits and pitfalls underlying the assumptions of standard setting.

We tend to use methods like Angoff or the Bookmark Method to set standards for high-stakes assessments, and we treat the resulting cut scores as fact, but how can we be sure that the results of the standard setting reflect reality?

In his book, The Wisdom of Crowds, James Surowiecki recounts a story about Sir Francis Galton visiting a fair in 1906. Galton observed a game where people could guess the weight of an ox, and whoever was closest would win a prize.

Because guessing the weight of an ox was considered to be a lot of fun in 1906, hundreds of people lined up and wrote down their best guess. Galton got his hands on their written responses and took them home. He found that while no one guess was exactly right, the crowd’s mean guess was pretty darn good: only one pound off from the true weight of the ox.

We cannot expect any individual’s recommended cut score in a standard setting session to be spot on, but if we select a representative sample of experts and provide them with relevant information about the construct and impact data, we have a good basis for suggesting that their aggregated ratings are a faithful representation of the true cut score.
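To make the aggregation idea concrete, here is a minimal sketch of how a panel's Angoff-style ratings might be pooled into one recommended cut score. The ratings are invented for illustration, and the function name `panel_cut_score` is hypothetical; a real standard setting study would involve more experts, more items, and usually more than one round of ratings.

```python
from math import sqrt
from statistics import mean, stdev

# Hypothetical Angoff ratings: each row is one expert, and each value is
# that expert's estimate of the probability that a minimally competent
# candidate answers the item correctly (4 experts, 5 items).
ratings = [
    [0.60, 0.70, 0.50, 0.80, 0.40],
    [0.70, 0.60, 0.50, 0.90, 0.50],
    [0.50, 0.80, 0.40, 0.70, 0.50],
    [0.60, 0.70, 0.60, 0.80, 0.30],
]

def panel_cut_score(ratings):
    """Return (mean cut score, standard error of the mean) for a panel."""
    # Each expert's recommended cut score is the sum of their item ratings.
    expert_cuts = [sum(row) for row in ratings]
    cut = mean(expert_cuts)
    # The standard error indicates how much the experts disagree.
    se = stdev(expert_cuts) / sqrt(len(expert_cuts))
    return cut, se

cut, se = panel_cut_score(ratings)
print(f"Recommended cut score: {cut:.2f} out of 5 (SE {se:.2f})")
```

No single row has to be right. As with the ox, the claim is only that the panel mean is a reasonable estimate, and the standard error gives a rough sense of how far to trust it.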

This is the nature of education measurement—our certainty about our inferences is dependent on the amount of data we have and the quality of that data. Just as we infer something about a student’s true abilities based on their responses to carefully selected items on a test, we have to infer something about the true cut score based on our subject matter experts’ responses to carefully constructed dialogues in the standard setting process.

We can also verify cut scores through validity studies, thus strengthening the case for our stakeholders. So take heart—your standard setters as a group have a pretty good estimate on the weight of that ox.

Assessment types and their uses: summative assessments

Posted by Julie Delazyn

To use assessments effectively, it’s important to understand their context and uses within the learning process.

Over the past few weeks I have written about diagnostic assessments, formative assessments and needs assessments. My last post in this series is about summative assessments.

Typical uses:

  • Measuring or certifying knowledge, skills and aptitudes (KSAs)
  • Providing a quantitative grade and making a judgment about a person’s knowledge, skills and achievement
  • Determining whether the examinee meets the predetermined standard for specialized expertise
  • Determining a participant’s level of performance at a particular time


Examples:

  • Licensing exams
  • Certification tests
  • Pre-employment tests
  • Academic entrance exams
  • Post-course tests
  • Exams during study

Stakes: Medium, High


Summative assessments are easy to explain: they sum up the knowledge or the skills of the person taking the test. This type of assessment provides a quantitative grade and makes a judgment about a person’s knowledge, skills and achievement. A typical example would be a certification that a technician must pass in order to install and/or do repairs on a particular piece of machinery. In passing the certification exam, a candidate proves his or her understanding of the machinery.

For more details about assessments and their uses check out the white paper, Assessments Through the Learning Process. You can download it free here, after login. Another good source for testing and assessment terms is our glossary.

Conference Close-up: Using Flash and Captivate Questions with Questionmark

Posted by Joan Phaup

Participants in the annual Questionmark Users Conference bring a lot of enthusiasm about using innovative question types in their assessments. A number of our customers have extensive experience with this and like to share their expertise at the conference. I spoke the other day with Doug Peterson from Verizon Communications and asked him about the case study he will share at the conference about using Flash and Captivate questions within Questionmark Perception.

Here’s a quick wrap-up of our conversation:

Q: What’s your role at Verizon Communications?
A: I have two roles: I develop, maintain and deliver training, now mainly on internet technologies, and I’m responsible for a series of online automated tests for our help center training program. This is a pass/fail curriculum and very high stakes because these tests can affect people’s job status. So we need to be absolutely sure that the tests are well written and well maintained. These used to be written tests that were graded by an instructor. We turned to Questionmark for an objective, unbiased, online, airtight testing system and I oversee that.

Q: How are you using Questionmark Perception?
A: We have a couple of tests for each of the three modules in the training curriculum. We use Questionmark for end-of-lesson reviews as well as the higher stakes tests  that determine whether a person has passed or failed a module. We use scenarios that trainees might encounter in working with a customer. There might be 6 to 8 scenarios in each test and 10 or 12 questions about each scenario. The trainees take these tests right in the classroom, on their classroom computers.  We create individual QM accounts for each student and schedule the tests directly for those accounts.  We schedule them for a specific day and time window.  No one can see the tests except for the students, and they can only access them during the testing window. We had subject matter experts tell us what we needed to cover in the scenarios and what questions we needed to ask about them. They explained what would be a reasonable way to present a question or simulation to test a particular skill. Once we’d created all the scenarios and written all the questions we did an in-depth validation.

Q: What will you be sharing during your case study presentation at the Users Conference?
A: Our call center agents have to use several applications when they get a call from a customer. They’ll have to look up a trouble ticket, get information about the customer and so forth. We need to make sure they know how to use those applications, so we have created Perception questions using Captivate and Flash files with ActionScript that present the application to the student. Then the student needs to work through the application to demonstrate their proficiency with it. We’ve worked out a way to create a highly interactive, very realistic simulation in Flash that captures each student’s actions in using a particular application. It really tracks step by step. Being able to take the individual things from the Flash scenarios makes it so that when we run reports after the test we can easily see if a lot of people are missing something like clicking on a particular button. Then the instructor can go back and make sure the students understand what they are supposed to. We went through a complex process to figure all this out, but it’s given us the ability to create a highly interactive, very realistic simulation in Flash with ActionScript coding and all kinds of logic and still pass back individual point values for different tasks. I’m very proud of the tests we have created and the work we have done. We have some fabulous questions in there that allow the students to show that they really understand applications and know how to do something from start to finish. We learned many tips and tricks along the way and I will be sharing those with the people at my session.

Q: What are you looking forward to at this year’s conference?
A: I really enjoyed the sessions on item analysis and test validity at the 2009 conference, and I am looking forward to learning even more about those subjects this year. And anything about new functionality in Perception Version 5 will be on my list too.

You can attend Doug’s presentations and many others at the conference in Miami March 14 – 17. Early-bird registration ends January 22nd, so sign up soon!

Psychometrics 101: How do I know if an assessment is reliable? (Part 3)


Posted by Greg Pope

Following up on my posts last week on reliability, I thought I would finish up on this theme by explaining the internal consistency reliability measure: Cronbach’s Alpha.

Cronbach’s Alpha produces the same results as the Kuder-Richardson Formula 20 (KR-20) internal consistency reliability for dichotomously scored questions (right/wrong, 1/0), but Cronbach’s Alpha also allows for the analysis of polytomously scored questions (partial credit, 0 to 5). This is why Questionmark products (e.g., Test Analysis Report, RMS) use Cronbach’s Alpha rather than KR-20.
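For readers who like to see the arithmetic, here is a small sketch of both statistics using the standard textbook formulas. It is not Questionmark's implementation, and the response data and function names are made up for illustration. On right/wrong data the two functions return the same value, while only the alpha function makes sense for partial-credit scores.

```python
from statistics import pvariance

def cronbach_alpha(scores):
    """scores: one list of item scores per participant."""
    k = len(scores[0])
    items = list(zip(*scores))  # regroup the scores by item
    item_var = sum(pvariance(item) for item in items)
    total_var = pvariance([sum(row) for row in scores])
    return k / (k - 1) * (1 - item_var / total_var)

def kr20(scores):
    """KR-20, valid only for items scored 0 or 1."""
    k, n = len(scores[0]), len(scores)
    # p is the proportion correct on an item; p * (1 - p) is its variance.
    p_values = [sum(item) / n for item in zip(*scores)]
    pq = sum(p * (1 - p) for p in p_values)
    total_var = pvariance([sum(row) for row in scores])
    return k / (k - 1) * (1 - pq / total_var)

# Hypothetical right/wrong responses: 6 participants, 4 questions.
dichotomous = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
    [1, 1, 1, 1],
]
# Hypothetical partial-credit responses (0 to 5 points per question).
polytomous = [
    [5, 4, 5],
    [3, 3, 4],
    [2, 1, 2],
    [4, 4, 3],
    [1, 0, 1],
]

alpha_d = cronbach_alpha(dichotomous)
kr20_d = kr20(dichotomous)
alpha_p = cronbach_alpha(polytomous)
print(f"Dichotomous: alpha {alpha_d:.4f}, KR-20 {kr20_d:.4f}")
print(f"Polytomous:  alpha {alpha_p:.4f}")
```

The match on dichotomous data is no coincidence: the variance of a 0/1 item is exactly p(1 - p), so the KR-20 formula is the alpha formula with that substitution made.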

People sometimes ask me about KR-21. This is a quick and dirty reliability estimate formula that almost always produces lower values than KR-20. KR-21 assumes that all questions have equal difficulty (p-value) to make hand calculations easier. This assumption of all questions having the same difficulty is usually not very close to reality where questions on an assessment generally have a range of difficulty. This is why few people in the industry use KR-21 over KR-20 or Cronbach’s Alpha.
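The shortcut is easy to see in a sketch. KR-21 needs only the number of questions and the mean and variance of the total scores, whereas KR-20 needs the difficulty of every question. The formulas below are the standard textbook ones, and the response data is invented for illustration.

```python
from statistics import mean, pvariance

def kr20(scores):
    """KR-20: uses each question's own difficulty (p-value)."""
    k, n = len(scores[0]), len(scores)
    p_values = [sum(item) / n for item in zip(*scores)]
    pq = sum(p * (1 - p) for p in p_values)
    total_var = pvariance([sum(row) for row in scores])
    return k / (k - 1) * (1 - pq / total_var)

def kr21(scores):
    """KR-21: assumes every question has the same difficulty."""
    k = len(scores[0])
    totals = [sum(row) for row in scores]
    m = mean(totals)
    return k / (k - 1) * (1 - m * (k - m) / (k * pvariance(totals)))

# Hypothetical right/wrong responses: 6 participants, 4 questions whose
# difficulty ranges from easy (first column) to hard (last column).
responses = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
    [1, 1, 1, 1],
]

kr20_value = kr20(responses)
kr21_value = kr21(responses)
print(f"KR-20: {kr20_value:.4f}  KR-21: {kr21_value:.4f}")
```

With question difficulties spread out like this, KR-21 comes out lower than KR-20. The two agree only when every question has the same p-value.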

My colleagues and I generally consider Cronbach’s Alpha values of 0.90 or greater to be excellent and acceptable for high-stakes tests, while values of 0.70 to 0.90 are considered acceptable to good and appropriate for medium-stakes tests. Values below 0.50 are generally considered unacceptable. That said, in low-stakes testing situations it may not be possible to obtain high internal consistency reliability coefficient values. In that context, one might be better off evaluating the performance of an assessment on an item-by-item basis rather than focusing on the overall assessment reliability value.
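As a rough illustration of what an item-by-item look could involve, the sketch below computes two common item statistics for right/wrong questions: the p-value (proportion correct) and the correlation between each question and the total on the remaining questions. The choice of these two statistics, the function names, and the data are all assumptions for the example, not a description of any particular report.

```python
from statistics import mean, pstdev

def pearson(x, y):
    """Pearson correlation between two equal-length lists."""
    mx, my = mean(x), mean(y)
    cov = mean([(a - mx) * (b - my) for a, b in zip(x, y)])
    # Note: this divides by zero if either list has no variance.
    return cov / (pstdev(x) * pstdev(y))

def item_report(scores):
    """Return (p-value, item-rest correlation) for each 0/1 scored item."""
    totals = [sum(row) for row in scores]
    report = []
    for i in range(len(scores[0])):
        item = [row[i] for row in scores]
        # Correlate the item with the total of the other items so the
        # item does not inflate its own correlation.
        rest = [t - x for t, x in zip(totals, item)]
        report.append((mean(item), pearson(item, rest)))
    return report

# Hypothetical right/wrong responses: 6 participants, 4 questions.
responses = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
    [1, 1, 1, 1],
]

report = item_report(responses)
for number, (p, r) in enumerate(report, start=1):
    print(f"Question {number}: p-value {p:.2f}, item-rest r {r:.2f}")
```

A question with a very low or negative item-rest correlation is a candidate for review even when the overall reliability coefficient cannot be pushed higher.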


White Paper: Delivering Assessments Safely and Securely

Posted by Julie Chazyn

We have just updated our white paper on Delivering Assessments Safely and Securely, which helps people choose security measures that match up with the types of assessments they’re delivering – from low stakes to high stakes.

This new paper takes into account changes in technologies and standards that have taken place in the last few years—as well as new testing environments and methods. We’ve also added some tips to help prevent cheating.

You can download the paper here.