SlideShare Presentation on Creating Better Tests

Posted by Julie Delazyn

Creating strong, defensible assessments is a frequent theme in this blog.

Most recently, Doug Peterson took readers step by step through a ten-part series of posts on Test Design and Delivery. Based on this series, he put together a presentation (now available on SlideShare) called Five Steps to Better Tests: Best Practices for Design and Delivery for the Questionmark Users Conference in March.

The presentation goes through each key step in the test creation process: planning the test, with tips like avoiding bias and stereotyping; creating it and setting passing standards; delivering it while ensuring test security; and finally, evaluating it using item-level data to improve item quality.

Enjoy the presentation below, and mark your calendar for the 2014 Questionmark Users Conference, March 4-7 in San Antonio, Texas.

Evaluating the Test — Test Design & Delivery Part 10

Posted By Doug Peterson

In this, the 10th and final installment of the Test Design and Delivery series, we take a look at evaluating the test. Statistical analysis improves as the number of test takers goes up, but data from even a few attempts can provide useful information. In most cases, we recommend performing analysis on data from at least 100 participants; data from 250 or more is considered more trustworthy.

Analysis falls into two categories: item statistics and analysis (the performance of an individual item), and test analysis (the performance of the test as a whole). Questionmark provides both of these analyses in our Reports and Analytics suites.

Item statistics provide information on things like how many times an item has been presented and how many times each choice has been selected. This information can point out a number of problems:

  • An item that has been presented a lot may need to be retired. There is no hard and fast number as far as how many presentations is too many, but items on a high-stakes test should be changed fairly frequently.
  • If the majority of test-takers are getting the question wrong but they are all selecting the same choice, the wrong choice may be flagged as the correct answer, or the training might be teaching the topic incorrectly.
  • If no choice is being selected a majority of the time, it may indicate that the test-takers are guessing, which could in turn indicate a problem with the training. It could also indicate that no choice is completely correct.
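
As a rough illustration of the checks above, here is a minimal Python sketch that tallies the choices selected for a single item and raises the two warning flags described in the list. The function name, the flag wording and the use of a simple 50% cutoff for "majority" are assumptions made for this example; this is not how Questionmark's reports are implemented.

```python
from collections import Counter

def choice_summary(responses, correct_choice):
    """Summarize how often each choice was selected for one item.

    responses: non-empty list of the choice labels selected (e.g. "A", "B").
    correct_choice: the choice currently flagged as correct.
    """
    counts = Counter(responses)
    total = len(responses)
    top_choice, top_count = counts.most_common(1)[0]
    share_correct = counts[correct_choice] / total
    flags = []
    # Most test-takers are wrong, yet a majority agree on one wrong choice:
    # check the answer key and the training material.
    if share_correct < 0.5 and top_choice != correct_choice and top_count / total > 0.5:
        flags.append("possible miskeyed item or incorrect training")
    # No choice is selected a majority of the time:
    # possible guessing, or no choice is completely correct.
    if top_count / total <= 0.5:
        flags.append("no majority choice: possible guessing")
    return {"presented": total, "counts": dict(counts), "flags": flags}

# Example with made-up data: seven of ten test-takers pick "B", but "A" is keyed as correct.
summary = choice_summary(["B"] * 7 + ["A"] * 2 + ["C"], correct_choice="A")
print(summary["flags"])
```

The "presented" count in the result can be compared against whatever retirement threshold your own program uses, since there is no hard and fast number.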

Item analysis typically provides two key pieces of information: the Difficulty Index and the Point-Biserial Correlation.

  • Difficulty index: P value = % who answered correctly
      • Too high = too easy
      • Too low = too hard, confusing or misleading, problem with content or instruction
  • Point-Biserial Correlation: how well item discriminated between those who did well on the exam and those who did not
      • Positive value = those who got the item correct also did well on the exam, and those who got the item wrong also did poorly on the exam
      • Negative value = those who did well on the test got the item wrong, those who did poorly on the test got the item right
      • +0.10 or above is typically required to keep an item
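
To make the two statistics above concrete, here is a minimal Python sketch that computes them from raw scores. It assumes each item is scored 1 (correct) or 0 (incorrect), and that the total score includes the item being analyzed; the function names and the sample data are invented for illustration.

```python
from statistics import mean, pstdev

KEEP_THRESHOLD = 0.10  # the "+0.10 or above" rule of thumb from the list above

def difficulty_index(item_scores):
    """P value: proportion of test-takers who answered the item correctly."""
    return mean(item_scores)

def point_biserial(item_scores, total_scores):
    """Point-biserial correlation between a 0/1 item score and the total test score.

    Returns None when the statistic is undefined (everyone got the item right,
    everyone got it wrong, or all total scores are identical).
    """
    p = mean(item_scores)
    sd = pstdev(total_scores)  # population standard deviation of total scores
    if p in (0, 1) or sd == 0:
        return None
    mean_correct = mean(t for i, t in zip(item_scores, total_scores) if i == 1)
    mean_incorrect = mean(t for i, t in zip(item_scores, total_scores) if i == 0)
    return (mean_correct - mean_incorrect) / sd * (p * (1 - p)) ** 0.5

# Made-up data: five test-takers, their score on one item and on the whole test.
item = [1, 1, 1, 0, 0]
totals = [9, 8, 7, 4, 2]
r = point_biserial(item, totals)
print(difficulty_index(item), round(r, 2), r >= KEEP_THRESHOLD)
```

Some analyses remove the item's own score from the total before correlating (a "corrected" point-biserial); this sketch does not, so treat the values as approximate for very short tests.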

Test analysis typically comes down to determining a Reliability Coefficient. In other words, does the test measure knowledge consistently – does it produce similar results under consistent conditions? (Please note that this has nothing to do with validity. Reliability does not address whether or not the assessment tests what it is supposed to be testing. Reliability only indicates that the assessment will return the same results consistently, given the same conditions.)

  • Reliability Coefficient: range of 0 – 1.00
  • Acceptable value depends on consequences of testing error
  • If failing means having to take some training again, a lower value might be acceptable
  • If failing means the health and safety of coworkers might be in jeopardy, a high value is required


There are a number of different types of consistency:

  • Test – Retest: repeatability of test scores with the passage of time
  • Alternate / Parallel Form: consistency of score across two or more forms by same test taker
  • Inter-Rater: consistency of test score when rated by different raters
  • Internal Consistency: extent to which items on a test measure the same thing
      • Most common: Kuder Richardson-20 (KR-20) or Coefficient Alpha
      • Items must be single answer (right/wrong)
      • May be low if test measures several different, unrelated objectives
      • Low value can also indicate many very easy or hard items, poorly written items that do not discriminate well, or items that do not test the proper content
  • Mastery Classification Consistency
      • Criterion-referenced tests
      • Not affected by items measuring unrelated objectives
      • 3 common measures:
          • Phi coefficient
          • Agreement coefficient
          • Kappa
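
For readers who want to see how an internal consistency coefficient is calculated, here is a minimal Python sketch of KR-20 using the standard formula KR-20 = (k / (k - 1)) * (1 - sum(p * q) / variance of total scores), where k is the number of items, p is the proportion answering an item correctly and q = 1 - p. It assumes right/wrong (1/0) scoring and uses the population variance of the total scores; the function name and sample data are made up for this example.

```python
from statistics import mean, pvariance

def kr20(score_matrix):
    """KR-20 reliability for right/wrong items.

    score_matrix: one row per test-taker, one 0/1 entry per item.
    Returns None when the coefficient is undefined (fewer than two items,
    or every test-taker has the same total score).
    """
    k = len(score_matrix[0])
    totals = [sum(row) for row in score_matrix]
    var_total = pvariance(totals)
    if k < 2 or var_total == 0:
        return None
    # Sum of p * q across items, where p is the item's difficulty index.
    sum_pq = 0.0
    for j in range(k):
        p = mean(row[j] for row in score_matrix)
        sum_pq += p * (1 - p)
    return k / (k - 1) * (1 - sum_pq / var_total)

# Made-up data: five test-takers, four items.
scores = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]
print(kr20(scores))
```

Whether the resulting value is acceptable still depends on the consequences of a testing error, as described above.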

Doug will share these and other best practices for test design and delivery at the Questionmark Users Conference in Baltimore, March 3-6. The program includes an optional pre-conference workshop on Criterion-Referenced Test Development led by Sharon Shrock and Bill Coscarelli. Click here for conference and workshop registration.

Delivering the Test — Test Design & Delivery Part 8

Posted By Doug Peterson

You’ve done your Job Task Analysis, created a competency model, and used it to develop a Test Content Outline (TCO). You’ve created well-written items that map back to your TCO. You’ve determined how many, and which type of, questions you need for each content area. You have avoided bias and stereotyping, and worked to ensure validity and reliability. You’ve developed your test directions for both the test-taker and the administrator. You’ve set your cutscore.

It’s finally time to deliver the assessment!

Here are some things to think about as you deliver your assessment:


If you’re using pencil and paper tests, you need to make sure the tests are stored in a secure location until test time. Test booklets and answer sheets must be numbered, and the test administrators should complete tracking forms that account for all booklets and answer sheets. Test-takers should be required to provide some form of identification to prove that they are the person who is scheduled for the exam.

Computer-based testing also needs to be secure. One way to increase security is to deliver the assessment in a testing center with a proctor in the room. If the test-takers are distributed across many locations, Questionmark offers Questionmark Secure, which locks down test-takers’ machines and doesn’t allow them to copy questions or switch tasks. Computer-based testing security can/should also include some form of identification and password verification.

Test-Retest Policies

Many times a testing organization will allow someone who fails a test to retest at some point. You also need to account for someone getting sick during the middle of a test, or getting an emergency phone call and having to leave. What if the power goes out in the middle of a computer-based test? You need to determine ahead of time what you will do in situations like these.

If the test is interrupted, will you let the test-taker resume the test (pick up where they left off) or take a new test? A lot of this has to do with the length of the interruption – did the test-taker have time to go off and look up any answers? This is not a consideration if your test doesn’t allow the participant to go back and change answers.

The problem with retesting is that the test-taker has already seen the questions. You should consider not providing individual question feedback if the test-taker fails the test, so that he/she doesn’t know what to go look up between tests. Most organizations require a waiting period between takes so that the questions will not be fresh in the test-taker’s mind.

A lot of the problem with retesting can be alleviated by creating multiple test forms (versions) with different questions. If a test-taker fails on their first attempt and wants to retest, you can give them a different form for the retest. At that point you don’t have to worry that they remembered any questions from the first attempt and went home to look up the answers, because they will be seeing all new questions. If you use multiple forms, you must ensure that the exact same topics are covered in the same depth, with questions having the same level of difficulty.
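
A retest policy like the one described above can be written down as a simple rule. The following Python sketch is one possible way to do it, assuming a fixed waiting period and a list of equivalent forms; the function name, the form labels and the 30-day default are hypothetical values chosen for illustration, not recommendations.

```python
from datetime import date, timedelta

def pick_retest_form(all_forms, forms_seen, last_attempt, today, waiting_days=30):
    """Return a form the test-taker has not seen, or None if no retest is allowed yet.

    waiting_days is a hypothetical policy value; set it to match your own policy.
    """
    # Enforce the waiting period between attempts.
    if today < last_attempt + timedelta(days=waiting_days):
        return None
    # Offer the first form the test-taker has not already taken.
    for form in all_forms:
        if form not in forms_seen:
            return form
    # Every form has been seen; your policy must decide what happens next.
    return None

# Example: the test-taker failed form "A" on January 1 and asks to retest on February 15.
print(pick_retest_form(["A", "B", "C"], {"A"}, date(2024, 1, 1), date(2024, 2, 15)))
```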

In the next post, we’ll take a look at controlling item exposure, limiting opportunities for cheating, and maintaining test integrity and ethics.

Assembling the Test Form — Test Design and Delivery Part 7

Posted By Doug Peterson

In the previous post in this series, we looked at putting together assessment instructions for both the participant and the instructor/administrator. Now it’s time to start selecting the actual questions.

Back in Part 2 we discussed determining how many items needed to be written for each content area covered by the assessment. We looked at writing 3 times as many items as were actually needed, knowing that some would not make it through the review process. Doing this also enables you to create multiple forms of the test, where each form covers the same concepts with equivalent – but different – questions. We also discussed the amount of time a participant needs to answer each question type, as shown in this table:

As you’re putting your assessment together, you have to account for the time required to take the assessment. You have to multiply the number of each question type in the assessment by the values in the table above.

You also need to allow time for:

  • Reading the instructions
  • Reviewing sample items
  • Completing practice items
  • Completing demographic info
  • Taking breaks
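
The arithmetic described above can be sketched in a few lines of Python. The per-item times below are placeholder values invented for this example, not the figures from the timing table; substitute your own. The function and dictionary names are also assumptions.

```python
# Hypothetical minutes per item for each question type; replace with your own table's values.
MINUTES_PER_ITEM = {
    "multiple_choice": 1.0,
    "fill_in_blank": 1.0,
    "matching": 3.0,
    "short_answer": 5.0,
}

def estimated_minutes(item_counts, overhead_minutes=0.0, minutes_per_item=MINUTES_PER_ITEM):
    """Estimate total assessment time.

    item_counts: number of items of each question type, e.g. {"multiple_choice": 30}.
    overhead_minutes: time for instructions, sample and practice items,
    demographic info and breaks.
    """
    answering = sum(count * minutes_per_item[qtype] for qtype, count in item_counts.items())
    return answering + overhead_minutes

# Example: 30 multiple choice items, 2 short-answer items, 10 minutes of overhead.
print(estimated_minutes({"multiple_choice": 30, "short_answer": 2}, overhead_minutes=10))
```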

If you already know the time allowed for your assessment, you may have to work backwards or make some compromises. For example, if you know that you only have one hour for the assessment, and you have a large amount of content to cover, you may want to consider focusing on multiple choice and fill-in-the-blank questions, and staying away from matching and short-answer questions, to maximize the number of questions you can include in the time period allowed.

To select the actual items for the assessment, you may want to consider using a Test Assembly Form, which might look something like this:

The content area is in the first column. The second column shows how many questions are needed for that content area (as calculated back in Part 2). Each item should have a short identifier associated with it, and this is provided in the “Item #” column. The “Keyword” column is just that – one or two words to remind you what the question addresses. The last column lists the item number of an alternate item in case a problem is found with the first selection during assessment review.

As you select items, watch out for two things:

1. Enemy items. This is when one item gives away the answer to another item. Make sure that the stimulus or answer to one item does not answer or give a clue to the answer of another item.

2. Overlap. This is when two questions basically test the same thing. You want to cover all of the content in a given content area, so each question for that content area should cover something unique. If you find that you have several questions assessing the same thing, you may need to write some new questions or you may need to re-calculate how many questions you actually need.
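
If your items are tracked electronically, both checks can be partly automated. The Python sketch below assumes each item record carries an identifier, a keyword (as on the Test Assembly Form) and a list of known enemy items; those field names are hypothetical, and a shared keyword is only a rough signal of overlap that still needs a human review.

```python
from collections import defaultdict
from itertools import combinations

def check_form(items):
    """Flag enemy pairs and possible overlap among the items selected for a form.

    items: list of dicts with hypothetical keys "id", "keyword" and
    "enemies" (a set of ids of items that give away, or are given away by, this item).
    """
    # Enemy items: both members of a known enemy pair were selected.
    enemy_pairs = [
        (a["id"], b["id"])
        for a, b in combinations(items, 2)
        if b["id"] in a["enemies"] or a["id"] in b["enemies"]
    ]
    # Overlap: more than one selected item shares the same keyword.
    by_keyword = defaultdict(list)
    for item in items:
        by_keyword[item["keyword"].lower()].append(item["id"])
    overlaps = {kw: ids for kw, ids in by_keyword.items() if len(ids) > 1}
    return enemy_pairs, overlaps

# Made-up item records for illustration.
selected = [
    {"id": "Q101", "keyword": "Fire exits", "enemies": {"Q207"}},
    {"id": "Q207", "keyword": "Evacuation", "enemies": set()},
    {"id": "Q315", "keyword": "fire exits", "enemies": set()},
]
enemy_pairs, overlaps = check_form(selected)
print(enemy_pairs, overlaps)
```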

Once you have your assessment put together, you need to calculate the cutscore. This topic could easily be another (very lengthy) blog series, and there are many books available on calculating cutscores. I recently read the book, Cutscores: A Manual for Setting Standards of Performance on Educational and Occupational Tests, by Zieky, Perie and Livingston. I found it to be a very good book, considering that the subject matter isn’t exactly “thrill a minute”. The authors discuss 18 different methods for setting cutscores, including which methods to use in various situations and how to carry out a cutscore study. They look at setting cutscores for criterion-referenced assessments (where performance is judged against a set standard) as well as norm-referenced assessments (where the performance of one participant is judged against the performance of the other participants). They also look at pass/fail situations as well as more complex judgments such as dividing participants into basic, proficient and advanced categories.

Preparing to Create the Assessment – Test Design & Delivery Part 6

Posted By Doug Peterson

At this point in the design process, you’ve written all the items for your assessment. Before you assemble them into a test, they need to be reviewed. Be sure to link each item to the Test Content Outline (TCO), then ask a group of subject matter experts (SMEs) to review the questions. This very well could be the same group of SMEs that wrote the questions in the first place, in which case they can simply review each other’s work. There are three main things to look at when reviewing each item:

  • Spelling and grammar
  • Clarity – is it clear what the item is asking? Is the item asking only one question and does it have only one correct answer? Is the item free of any extraneous information, bias, and stereotyping?
  • Connection to TCO – is it legitimate to include this item on this assessment? Does it clearly and directly pertain to the goals of the training?

Once you are confident that you have a complete set of well-written items that tie directly to your TCO, it’s time to start putting the assessment together. In addition to determining which questions from your item bank you want to include (which will be discussed in the next entry in this series), you must also develop test directions for the participant. These directions should include:

  • Purpose of the assessment
  • Amount of time allowed
  • Procedures for asking questions
  • Procedures for completing the assessment
  • Procedures for returning test materials

As part of your participant directions, you may want to consider including sample items, especially if the format is unusual or unfamiliar to the participants. Sample items also help reduce test anxiety. Remember, you want to assess the participant’s true knowledge, which means you don’t want a “stress barrier” getting in the way.

In addition to the participant’s instructions, you also want to put together instructions for the assessment administrator – the instructor or proctor who will be handing the test out and watching over the room while the participants take the assessment. Having a set of written instructions will help ensure consistency when the assessment is given by different administrators in different locations. The instructions should include:

  • The participant’s instructions, which should be read aloud
  • How to handle and document irregularities
  • The administrator’s monitoring responsibilities and methods (e.g., no phone conversations, walk around the room every 10 minutes, etc.)
  • Hardware and software requirements and instructions, if applicable
  • Contact information for technical help

As you develop your assessment, make sure that you are taking into account any local or national laws. For example, American test centers must comply with the Americans with Disabilities Act (ADA). The ADA requires that the test site be accessible to participants in wheelchairs and that compensation be made for certain impaired abilities (e.g., larger print or a screen reader for visually impaired participants). The administrator’s instructions should cover what to do in each case.

Test Design and Delivery: Overview

Posted By Doug Peterson

I had the pleasure of attending an ASTD certification class on Test Design and Delivery in Denver, Colorado, several weeks ago (my wife said it was no big deal, as I’ve been certifiable for a long time now). I’m going to use my blog posts for the next couple of months to pass along the highlights of what I learned.

The content for the class was developed by the good folks at ACT. During our two days together we covered the following topics:

  1. Planning the Test
  2. Creating the Test Items
  3. Creating the Test Form
  4. Delivering the Test
  5. Evaluating the Test

Over the course of this blog series, we’ll take a look at the main points from each topic in the class. We’ll look at all the things that go into writing a test before the first question is crafted, like establishing reliability and validity from the beginning and identifying content areas to be covered (as well as the number of questions needed for each area).

Next we’ll discuss some best practices for writing test items, including increasing the cognitive load and avoiding bias and stereotypes. After that we’ll discuss pulling items together into a test form, including developing instructions and setting passing scores.

The last few blogs will focus on some things you need to look at when delivering a test, like security and controlling item exposure. Then we’ll look at evaluating a test’s performance by examining item-level and test-level data to improve quality and assess reliability.

As we work our way through this series of blogs, be sure to ask questions and share your thoughts in the comments section!

Posts in this series:

  1. Planning the Test
  2. Determining Content
  3. Final Planning Considerations
  4. Writing Test Items
  5. Avoiding Bias and Stereotypes
  6. Preparing to Create the Assessment
  7. Assembling the Test Form
  8. Delivering the Test
  9. Content Protection and Secure Delivery
  10. Evaluating the Test