Checklists for Test Development

Posted by Austin Fossey

There are many fantastic books about test development, and there are many standards systems for test development, such as The Standards for Educational and Psychological Testing. There are also principled frameworks for test development and design, such as evidence-centered design (ECD). But it seems that the supply of qualified test developers cannot keep up with the increased demand for high-quality assessment data, leaving many organizations to piece together assessment programs, learning as they go.

As one might expect, this scenario leads to new tools targeted at these rookie test developers—simplified guidance documents, trainings, and resources attempting to idiot-proof test development. As a case in point, Questionmark seeks to distill information from a variety of sources into helpful, easy-to-follow white papers and blog posts. At an even simpler level, there appears to be increased demand for checklists that new test developers can use to guide test development or evaluate assessments.

For example, my colleague, Bart Hendrickx, shared a Dutch article from the Research Center for Examination and Certification (RCEC) at University of Twente describing their Beoordelingssysteem. He explained that this system provides a rubric for evaluating education assessments in areas like representativeness, reliability, and standard setting. The Buros Center for Testing addresses similar needs for users of mental assessments. In the Assessment Literacy section of their website, Buros has documents with titles like “Questions to Ask When Evaluating a Test”—essentially an evaluation checklist (though Buros also provides their own professional ratings of published assessments). There are even assessment software packages that seek to operationalize a test development checklist by creating a rigid workflow that guides the test developer through different steps of the design process.

The benefit of these resources is that they can help guide new test developers through basic steps and considerations as they build their instruments. It is certainly a step up from a company compiling a bunch of multiple choice questions on the fly and setting a cut score of 70% without any backing theory or test purpose. On the other hand, test development is supposed to be an iterative process, and without the flexibility to explore the nuances and complexities of the instrument, the results and the inferences may fall short of their targets. An overly simple, standardized checklist for developing or evaluating assessments may not consider an organization’s specific measurement needs, and the program may be left with considerable blind spots in its validity evidence.

Overall, I am glad to see that more organizations want to improve the quality of their measurements, and it is encouraging to see more training resources to help new test developers tackle the learning curve. Checklists can be a very helpful tool for many applications, and test developers frequently create their own checklists to standardize practices within their organization, such as item reviews.

What do our readers think? Are checklists the way to go? Do you use a checklist from another organization in your test development?





Standard Setting: Methods for establishing cut scores


Posted by Greg Pope

My last post offered an introduction to standard setting; today I’d like to go into more detail about establishing cut scores. There are many standard setting methods used to set cut scores. These methods are generally split into two types: a) question-centered approaches and b) participant-centered approaches. A few of the most popular methods, with very brief descriptions of each, are provided below. For more detailed information on standard setting procedures and methods see the book, Setting Performance Standards: Concepts, Methods, and Perspectives, edited by Gregory Cizek and Robert Sternberg.

  • Modified Angoff method (question-centered): Subject matter experts (SMEs) are generally briefed on the Angoff method and allowed to take the test with the performance levels in mind. SMEs are then asked to provide estimates for each question of the proportion of borderline or “minimally acceptable” participants that they would expect to get the question correct. The estimates are generally in p-value type form (e.g., 0.6 for item 1: 60% of borderline passing participants would get this question correct). Several rounds are generally conducted with SMEs allowed to modify their estimates given different types of information (e.g., actual participant performance information on each question, other SME estimates, etc.). The final determination of the cut score is then made (e.g., by averaging estimates or taking the median). This method is generally used with multiple-choice questions.
  • I like a dichotomous modified Angoff approach where, instead of using p-value type statistics, SMEs are asked to simply provide a 0/1 for each question (“0” if a borderline acceptable participant would get the question wrong and “1” if a borderline acceptable participant would get the question right).
  • Nedelsky method (question-centered): SMEs make decisions on a question-by-question basis regarding which of the question distracters they feel borderline participants would be able to eliminate as incorrect. This method is generally used with multiple-choice questions only.
  • Bookmark method (question-centered): Questions are ordered by difficulty (e.g., Item Response Theory b-parameters or Classical Test Theory p-values) from easiest to hardest. SMEs make “bookmark” determinations of where performance levels (e.g., cut scores) should be (“As the test gets harder, where would a participant on the boundary of the performance level not be able to get any more questions correct?”). This method can be used with virtually any question type (e.g., multiple-choice, multiple-response, matching, etc.).
  • Borderline groups method (participant-centered): A description is prepared for each performance category. SMEs are asked to submit a list of participants whose performance on the test should be close to the performance standard (borderline). The test is administered to these borderline groups and the median test score is used as the cut score. This method can be used with virtually any question type (e.g., multiple-choice, multiple response, essay, etc.).
  • Contrasting groups method (participant-centered): SMEs are asked to categorize the participants in their classes according to the performance category descriptions. The test is administered to all of the categorized participants and the test score distributions for each of the categorized groups are compared. Where the distributions of the contrasting groups intersect is where the cut score would be located. This method can be used with virtually any question type (e.g., multiple-choice, multiple response, essay, etc.).
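
To make the arithmetic behind some of these methods concrete, here is a minimal Python sketch. It is an illustration only, not a production procedure: the function names and all sample numbers are hypothetical, the Nedelsky calculation assumes a borderline participant guesses at random among the options they cannot eliminate, and the contrasting groups function approximates "where the distributions intersect" with the score that misclassifies the fewest participants, which is a stand-in rather than the only way to locate that point.

```python
from statistics import mean, median


def angoff_cut_score(ratings):
    """Modified Angoff. ratings[s][q] is SME s's estimate of the proportion
    of borderline participants who answer question q correctly (use 0/1
    values for the dichotomous variant). Each SME's estimates are summed
    into an expected raw score, then the SME totals are averaged."""
    return mean([sum(sme_ratings) for sme_ratings in ratings])


def nedelsky_cut_score(option_counts, eliminated):
    """Nedelsky. For each question, a borderline participant is assumed to
    guess among the options left after eliminating distracters, so the
    chance of a correct answer is 1 / (options remaining)."""
    total = 0.0
    for n_options, n_eliminated in zip(option_counts, eliminated):
        remaining = n_options - n_eliminated
        if remaining < 1:
            raise ValueError("the correct option cannot be eliminated")
        total += 1 / remaining
    return total


def borderline_groups_cut_score(borderline_scores):
    """Borderline groups. The cut score is the median test score of the
    participants that SMEs identified as borderline."""
    return median(borderline_scores)


def contrasting_groups_cut_score(lower_group, upper_group):
    """Contrasting groups (approximation). Returns the observed score that,
    used as a pass mark, misclassifies the fewest participants: lower-group
    members at or above it plus upper-group members below it."""
    def errors(cut):
        false_passes = sum(score >= cut for score in lower_group)
        false_fails = sum(score < cut for score in upper_group)
        return false_passes + false_fails

    candidates = sorted(set(lower_group) | set(upper_group))
    return min(candidates, key=errors)


if __name__ == "__main__":
    # Two SMEs rating four questions (made-up numbers).
    print(angoff_cut_score([[0.5, 0.75, 0.25, 1.0], [0.5, 0.5, 0.5, 0.5]]))  # 2.25
    # Three four-option questions with 2, 3 and 0 distracters eliminated.
    print(nedelsky_cut_score([4, 4, 4], [2, 3, 0]))  # 1.75
    print(borderline_groups_cut_score([12, 15, 14, 13, 16]))  # 14
    print(contrasting_groups_cut_score([8, 10, 11, 12, 13], [12, 15, 16, 17, 18]))  # 15
```

In practice the Angoff estimates would be revisited over several rounds before the final average (or median) is taken, and a contrasting groups analysis would normally work from smoothed score distributions rather than a handful of raw scores.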

I hope this was helpful and I am looking forward to talking more about an exciting psychometric topic soon!

How to Create a Multiple Choice Question in Questionmark Perception

Want to know how to author a multiple choice question in Questionmark Perception? This software simulation from our Learning Cafe will take you through the process of writing your question stimulus and possible answers, then assigning scores and optional feedback for each answer. In just 3 minutes you’ll know the basics.


This is just one of many simulations available in the Learning Cafe, which has two sections: one about best practices and the other about the workings of Questionmark Perception. Feel free to take advantage of these resources anytime you like.

The Secret of Writing Multiple-Choice Test Items

Posted by Julie Chazyn

I read a very informative blog entry on the CareerTech Testing Center Blog that I thought was worth sharing. It’s about multiple-choice questions: how they are constructed and some tips and tricks to creating them.

I asked its author, Kerry Eades, an Assessment Specialist at the Oklahoma Department of Career and Technology Education (ODCTE), about his reasons for blogging on The Secret of Writing Multiple-Choice Test Items. According to Kerry, CareerTech Testing Center took this lesson out of a booklet they put together as a resource for subject matter experts who write multiple-choice questions for their item banks, as well as for instructors who needed better instruments to create strong in-class assessments for their own classrooms. Kerry points out that the popularity of multiple-choice questions “stems from the fact that they can be designed to measure a variety of learning outcomes.” He says it takes a great deal of time, skill, and adherence to a set of well-recognized rules for item construction to develop a good multiple-choice item.

The CareerTech Testing Center works closely with instructors, program administrators, industry representatives, and credentialing entities to ensure that skills standards and assessments meet Carl Perkins requirements and reflect national standards and local industry needs. Using Questionmark Perception, CareerTech conducts tests for more than 100 career majors, with an online competency assessment system that delivers approximately 75,000 assessments per year.

Check out The Secret of Writing Multiple-Choice Test Items.

For more authoring tips visit Questionmark’s Learning Café.

Some New Questionmark Web Seminars

Posted by Joan Phaup

We have some new web seminars coming up for those of you who want to go into more detail on particular aspects of using Questionmark Perception.

From Item Banking to Content Harvesting: Authoring in Questionmark Perception, May 7th at 3 p.m. EDT

  • This webinar will demonstrate the use of different authoring tools to author questions for use in surveys, quizzes, tests and exams. Questions from the various sources will be assembled into a sample assessment, which will then be taken online. Figure out which tools are the most practical for you and how to make them work together to produce assessments quickly and easily.

Analyzing and Sharing Assessment Results with Questionmark Enterprise Reporter — May 20th at 3 p.m. EDT

  • This session explains each of Perception’s 12 standard reports and the data and statistics they contain. Join us to learn how to use templates to create reports easily and to learn about the various filter options you can use.

We have two seminars scheduled for Thursday, April 16th:

Overview of New Features in Perception v4.4, set for 11 a.m. EDT in the U.S.

  • Includes several live demonstrations and gives you the opportunity to ask questions about them.

Beyond Multiple Choice: Nine Ways to Leverage Technology for Better Assessments at 10 a.m. BST in the U.K.

  • Will explore the role of assessments in measuring people’s knowledge, skills and attitudes. Join us to learn techniques for creating effective assessments that will help improve performance, manage workforce competencies, and ensure regulatory compliance.

We continue with a full schedule of introductory webinars for beginners. You can learn more and register for the webinar of your choice at the following links:

US Webinar Schedule

UK Webinar Schedule