Evaluating the Test — Test Design & Delivery Part 10

Posted By Doug Peterson

In this, the 10th and final installment of the Test Design and Delivery series, we take a look at evaluating the test. Statistical analysis improves as the number of test takers goes up, but data from even a few attempts can provide useful information. In most cases, we recommend performing analysis on data from at least 100 participants; data from 250 or more is considered more trustworthy.

Analysis falls into two categories: item statistics and analysis (the performance of an individual item), and test analysis (the performance of the test as a whole). Questionmark provides both of these analyses in our Reports and Analytics suites.

Item statistics provide information on things like how many times an item has been presented and how many times each choice has been selected. This information can point out a number of problems:

  • An item that has been presented a lot may need to be retired. There is no hard and fast number as far as how many presentations is too many, but items on a high-stakes test should be changed fairly frequently.
  • If the majority of test-takers are getting the question wrong but they are all selecting the same choice, the wrong choice may be flagged as the correct answer, or the training might be teaching the topic incorrectly.
  • If no choice is being selected a majority of the time, it may indicate that the test-takers are guessing, which could in turn indicate a problem with the training. It could also indicate that no choice is completely correct.
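To make these checks concrete, here is a minimal sketch in Python of how the choice counts for a single item might be screened. The function names, the input format (a non-empty list of the choices selected, one per test-taker) and the 50% "majority" threshold are assumptions made for illustration; this is not how Questionmark's reports are implemented.

```python
from collections import Counter


def choice_distribution(responses):
    """Return {choice: fraction of test-takers who selected it}."""
    counts = Counter(responses)
    total = len(responses)
    return {choice: n / total for choice, n in counts.items()}


def flag_item(responses, keyed_answer):
    """Return warnings for one item, following the rules of thumb above.

    responses    -- non-empty list of selected choices, one per test-taker
    keyed_answer -- the choice currently marked as correct
    """
    dist = choice_distribution(responses)
    top_choice, top_share = max(dist.items(), key=lambda kv: kv[1])
    warnings = []
    if top_share > 0.5 and top_choice != keyed_answer:
        # Most people agree on a "wrong" choice: suspect the key or the training.
        warnings.append("majority chose a distractor")
    if top_share <= 0.5:
        # Nobody agrees: possible guessing, or no choice is fully correct.
        warnings.append("no choice selected by a majority")
    return warnings
```

For example, `flag_item(["B"] * 7 + ["A"] * 3, "A")` reports that the majority chose a distractor, which is a cue to check the answer key before blaming the test-takers.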

Item analysis typically provides two key pieces of information: the Difficulty Index and the Point-Biserial Correlation.

  • Difficulty index: P value = % who answered correctly
      • Too high = too easy
      • Too low = too hard, confusing or misleading, problem with content or instruction
  • Point-Biserial Correlation: how well item discriminated between those who did well on the exam and those who did not
      • Positive value = those who got the item correct also did well on the exam, and those who got the item wrong also did poorly on the exam
      • Negative value = those who did well on the test got the item wrong, those who did poorly on the test got the item right
      • +0.10 or above is typically required to keep an item
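Both statistics are simple to compute by hand. The sketch below assumes each item is scored 1 (correct) or 0 (wrong) and that you have each test-taker's total score; the function names are made up for this example. It uses the textbook point-biserial formula, (mean total of those who got the item right minus mean total of those who got it wrong) divided by the standard deviation of the totals, times the square root of p(1 - p). Note that the total score here includes the item itself; some analyses exclude it, which gives a slightly lower value.

```python
from math import sqrt
from statistics import mean, pstdev


def difficulty_index(item_scores):
    """P value: proportion of test-takers who answered the item correctly."""
    return mean(item_scores)


def point_biserial(item_scores, total_scores):
    """Point-biserial correlation between one 0/1 item and the total score.

    Returns None when it is undefined (everyone got the item right,
    everyone got it wrong, or all total scores are identical).
    """
    p = mean(item_scores)
    sd = pstdev(total_scores)  # population standard deviation of totals
    if p in (0, 1) or sd == 0:
        return None
    right = [t for i, t in zip(item_scores, total_scores) if i == 1]
    wrong = [t for i, t in zip(item_scores, total_scores) if i == 0]
    return (mean(right) - mean(wrong)) / sd * sqrt(p * (1 - p))


# Five test-takers: the three who got the item right also scored highest.
item = [1, 1, 1, 0, 0]
totals = [9, 8, 7, 4, 2]
print(difficulty_index(item))        # proportion correct
print(point_biserial(item, totals))  # positive, so the item discriminates
```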

Test analysis typically comes down to determining a Reliability Coefficient. In other words, does the test measure knowledge consistently – does it produce similar results under consistent conditions? (Please note that this has nothing to do with validity. Reliability does not address whether or not the assessment tests what it is supposed to be testing. Reliability only indicates that the assessment will return the same results consistently, given the same conditions.)

  • Reliability Coefficient: range of 0 – 1.00
  • Acceptable value depends on consequences of testing error
  • If failing means having to take some training again, a lower value might be acceptable
  • If failing means the health and safety of coworkers might be in jeopardy, a high value is required


There are a number of different types of consistency:

  • Test – Retest: repeatability of test scores with the passage of time
  • Alternate / Parallel Form: consistency of score across two or more forms by same test taker
  • Inter-Rater: consistency of test score when rated by different raters
  • Internal Consistency: extent to which items on a test measure the same thing
      • Most common: Kuder Richardson-20 (KR-20) or Coefficient Alpha
      • Items must be single answer (right/wrong)
      • May be low if test measures several different, unrelated objectives
      • Low value can also indicate many very easy or hard items, poorly written items that do not discriminate well, or items that do not test the proper content
  • Mastery Classification Consistency
      • Criterion-referenced tests
      • Not affected by items measuring unrelated objectives
      • 3 common measures:
          • Phi coefficient
          • Agreement coefficient
          • Kappa
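As an illustration of the internal consistency measure, here is a rough Python sketch of KR-20 using the standard formula: k / (k - 1) times (1 minus the sum of p(1 - p) over all items, divided by the variance of the total scores), where k is the number of items and p is each item's proportion correct. The function name and the input layout (one row per test-taker, one 0/1 entry per item) are assumptions for this example, and the population variance is used for the totals.

```python
from statistics import mean, pvariance


def kr20(score_matrix):
    """KR-20 for right/wrong items.

    score_matrix -- list of rows, one per test-taker; each row holds a
                    0 or 1 for every item on the test.
    Returns None when the coefficient is undefined (fewer than two items,
    or every test-taker has the same total score).
    """
    k = len(score_matrix[0])
    totals = [sum(row) for row in score_matrix]
    var_total = pvariance(totals)
    if k < 2 or var_total == 0:
        return None
    sum_pq = 0.0
    for j in range(k):
        p = mean(row[j] for row in score_matrix)  # item difficulty
        sum_pq += p * (1 - p)
    return (k / (k - 1)) * (1 - sum_pq / var_total)


# Four test-takers, three items of increasing difficulty.
scores = [
    [1, 1, 1],
    [1, 1, 0],
    [1, 0, 0],
    [0, 0, 0],
]
print(kr20(scores))
```

With real data you would feed in far more than four rows; as noted at the top of this post, at least 100 participants is the usual recommendation.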

Doug will share these and other best practices for test design and delivery at the Questionmark Users Conference in Baltimore March 3-6. The program includes an optional pre-conference workshop on Criterion-Referenced Test Development led by Sharon Shrock and Bill Coscarelli. Click here for conference and workshop registration.

Content protection and secure delivery: Test Design and Delivery Part 9

Posted By Doug Peterson

Writing good items and putting together valid and reliable assessments can take a lot of time and cost a lot of money. Part of an assessment’s reliability and validity is based on the test-taker not knowing the items ahead of time. For these reasons, it is critical that item exposure be controlled.

This starts during the development process by requiring everyone who develops items or assessments to sign a confidentiality agreement. Developers’ computers should, at the very least, be password-protected, and you should consider data encryption as well.

Care must be taken to prevent question theft once an assessment is assembled and delivered. Do not allow overly generous time limits, which would provide time for a test-taker to go back through the assessment and memorize questions. If your assessment is delivered electronically, consider not allowing backward movement through the test. Be very careful about allowing the use of a “scribble sheet”, as someone might try to write down questions and sneak them out of the test center: be sure to number all scribble sheets and collect them at the end of the assessment.

Computer-based testing makes it very easy to utilize random item selection when the assessment is delivered. While this does mean having to develop more items, it cuts down the number of times any one item is delivered and helps to reduce cheating by presenting different questions in a different order to each test-taker.
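The idea behind random item selection can be sketched in a few lines of Python. This is only an illustration of the concept, not how any particular delivery system does it; the function name, the item IDs and the content-area names are all made up.

```python
import random


def draw_form(item_bank, blueprint, rng=None):
    """Randomly pick items per content area, then shuffle their order.

    item_bank -- {content_area: [item_id, ...]}
    blueprint -- {content_area: number_of_items_needed}
    rng       -- optional random.Random, useful for repeatable tests
    """
    if rng is None:
        rng = random.Random()
    form = []
    for area, needed in blueprint.items():
        # sample() picks without replacement, so no item appears twice.
        form.extend(rng.sample(item_bank[area], needed))
    rng.shuffle(form)
    return form


bank = {
    "safety": ["S1", "S2", "S3", "S4"],
    "procedures": ["P1", "P2", "P3"],
}
print(draw_form(bank, {"safety": 2, "procedures": 1}))
```

Each call produces a different mix and order, while the blueprint keeps the number of questions per content area the same for everyone.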

It is critical to track the number of times an item has been delivered. After a certain number of deliveries, you will want to retire an item and replace it with a new item. The main factor that impacts how many times an item should be exposed is whether the assessment is high-stakes or low-stakes. Items on a high-stakes exam should have a lower maximum number of exposures, but items on a low-stakes exam can have a higher number of exposures.
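Tracking exposure can be as simple as keeping a running count per item. Here is a small hypothetical Python sketch; the class name and the idea of a single `max_exposures` limit per tracker are assumptions for illustration, and the right limit for your own high-stakes or low-stakes exam is something you have to decide yourself.

```python
from collections import Counter


class ExposureTracker:
    """Counts item deliveries and reports items that reached the limit."""

    def __init__(self, max_exposures):
        self.max_exposures = max_exposures
        self.counts = Counter()

    def record_delivery(self, item_ids):
        """Call once per delivered test with the IDs of the items shown."""
        self.counts.update(item_ids)

    def items_to_retire(self):
        """Return the IDs of items delivered max_exposures times or more."""
        return sorted(
            item for item, n in self.counts.items() if n >= self.max_exposures
        )


tracker = ExposureTracker(max_exposures=2)
tracker.record_delivery(["Q1", "Q2"])
tracker.record_delivery(["Q1", "Q3"])
print(tracker.items_to_retire())
```

A high-stakes exam would simply use a tracker with a lower limit than a low-stakes one.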

As long as there have been tests, there have been test-takers who try to cheat. Make sure that you authenticate each examinee to ensure that the person who is supposed to be taking the exam is, in fact, the person taking the exam. Testing centers typically prohibit talking, using notes, and using cell phones during tests. Maintain a minimum amount of space between test-takers, or use carrels to physically separate them.

Test administrators should walk around the room during the test. Unauthorized personnel should not be permitted to enter the room during the test, and the administrator should not leave the room for any reason without first bringing in another administrator.

Computer-based testing presents its own set of security challenges, especially when testing is permitted outside of a secure testing center (e.g., in the test-taker’s home). Questionmark offers the Questionmark Secure client, which locks down test-takers’ machines and doesn’t allow them to copy questions or switch tasks.

Computer-based testing security can/should also include some form of identification and password verification. Additionally, in the last few years, technology has become available that allows for the remote monitoring of test-takers using built-in laptop/tablet cameras or small desktop devices.

Click here for links to a complete listing of posts in this series.

Delivering the Test — Test Design & Delivery Part 8

Posted By Doug Peterson

You’ve done your Job Task Analysis, created a competency model, and used it to develop a Test Content Outline (TCO). You’ve created well-written items that map back to your TCO. You’ve determined how many, and which type of, questions you need for each content area. You have avoided bias and stereotyping, and worked to ensure validity and reliability. You’ve developed your test directions for both the test-taker and the administrator. You’ve set your cutscore.

It’s finally time to deliver the assessment!

Here are some things to think about as you deliver your assessment:


If you’re using pencil and paper tests, you need to make sure the tests are stored in a secure location until test time. Test booklets and answer sheets must be numbered, and the test administrators should complete tracking forms that account for all booklets and answer sheets. Test-takers should be required to provide some form of identification to prove that they are the person who is scheduled for the exam.

Computer-based testing also needs to be secure. One way to increase security is to deliver the assessment in a testing center with a proctor in the room. If the test-takers are distributed across many locations, Questionmark offers Questionmark Secure, which locks down test-takers’ machines and doesn’t allow them to copy questions or switch tasks. Computer-based testing security can/should also include some form of identification and password verification.

Test-Retest Policies

Many times a testing organization will allow someone who fails a test to retest at some point. You also need to account for someone getting sick during the middle of a test, or getting an emergency phone call and having to leave. What if the power goes out in the middle of a computer-based test? You need to determine ahead of time what you will do in situations like these.

If the test is interrupted, will you let the test-taker resume the test (pick up where they left off) or take a new test? A lot of this has to do with the length of the interruption – did the test-taker have time to go off and look up any answers? This is not a consideration if your test doesn’t allow the participant to go back and change answers.

The problem with retesting is that the test-taker has already seen the questions. You should consider not providing individual question feedback if the test-taker fails the test, so that he/she doesn’t know what to go look up between tests. Most organizations require a waiting period between takes so that the questions will not be fresh in the test-taker’s mind.

A lot of the problem with retesting can be alleviated by creating multiple test forms (versions) with different questions. If a test-taker fails on their first attempt and wants to retest, you can give them a different form for the retest. At that point you don’t have to worry that they remembered any questions from the first attempt and went home to look up the answers, because they will be seeing all new questions. If you use multiple forms, you must ensure that the exact same topics are covered in the same depth, with questions having the same level of difficulty.

In the next post, we’ll take a look at controlling item exposure, limiting opportunities for cheating, and maintaining test integrity and ethics.

Assembling the Test Form — Test Design and Delivery Part 7

Posted By Doug Peterson

In the previous post in this series, we looked at putting together assessment instructions for both the participant and the instructor/administrator. Now it’s time to start selecting the actual questions.

Back in Part 2 we discussed determining how many items needed to be written for each content area covered by the assessment. We looked at writing 3 times as many items as were actually needed, knowing that some would not make it through the review process. Doing this also enables you to create multiple forms of the test, where each form covers the same concepts with equivalent – but different – questions. We also discussed the amount of time a participant needs to answer each question type, as shown in this table:

As you’re putting your assessment together, you have to account for the time required to take the assessment. You have to multiply the number of each question type in the assessment by the values in the table above.

You also need to allow time for:

  • Reading the instructions
  • Reviewing sample items
  • Completing practice items
  • Completing demographic info
  • Taking breaks
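The arithmetic looks something like the Python sketch below. Please note that the seconds-per-item values in the example are placeholders invented for illustration, not the figures from the table in Part 2; substitute your own. The function name and the single `overhead_minutes` figure (covering instructions, sample and practice items, demographics and breaks) are also assumptions.

```python
def estimated_minutes(item_counts, seconds_per_item, overhead_minutes=0):
    """Estimate total testing time in minutes.

    item_counts      -- {question_type: number of items on the form}
    seconds_per_item -- {question_type: seconds allowed per item}
    overhead_minutes -- time for instructions, practice items, breaks, etc.
    """
    answering_seconds = sum(
        count * seconds_per_item[qtype] for qtype, count in item_counts.items()
    )
    return answering_seconds / 60 + overhead_minutes


# Placeholder timings, NOT the values from the Part 2 table.
timings = {"multiple_choice": 60, "short_answer": 120}
form = {"multiple_choice": 30, "short_answer": 5}
print(estimated_minutes(form, timings, overhead_minutes=10))
```

If the result is longer than the time you have available, that is your cue to work backwards: change the mix of question types or trim the number of items.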

If you already know the time allowed for your assessment, you may have to work backwards or make some compromises. For example, if you know that you only have one hour for the assessment, and you have a large amount of content to cover, you may want to consider focusing on multiple choice and fill-in-the-blank questions, and stay away from matching and short-answer to maximize the number of questions you can include in the time period allowed.

To select the actual items for the assessment, you may want to consider using a Test Assembly Form, which might look something like this:

The content area is in the first column. The second column shows how many questions are needed for that content area (as calculated back in Part 2). Each item should have a short identifier associated with it, and this is provided in the “Item #” column. The “Keyword” column is just that – one or two words to remind you what the question addresses. The last column lists the item number of an alternate item in case a problem is found with the first selection during assessment review.

As you select items, watch out for two things:

1. Enemy items. This is when one item gives away the answer to another item. Make sure that the stimulus or answer to one item does not answer or give a clue to the answer of another item.

2. Overlap. This is when two questions basically test the same thing. You want to cover all of the content in a given content area, so each question for that content area should cover something unique. If you find that you have several questions assessing the same thing, you may need to write some new questions or you may need to re-calculate how many questions you actually need.
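Spotting enemy items takes human judgment, but once your reviewers have recorded which items are enemies of each other, checking a draft form against that list is easy to automate. Here is a minimal Python sketch; the function name and the idea of storing enemies as pairs of item IDs are assumptions for this example.

```python
from itertools import combinations


def enemy_conflicts(form_items, enemy_pairs):
    """Return pairs of items on the form that reviewers marked as enemies.

    form_items  -- list of item IDs selected for the form
    enemy_pairs -- iterable of (item_id, item_id) pairs; order is ignored
    """
    enemies = {frozenset(pair) for pair in enemy_pairs}
    return [
        (a, b)
        for a, b in combinations(form_items, 2)
        if frozenset((a, b)) in enemies
    ]


print(enemy_conflicts(["Q1", "Q2", "Q3"], [("Q3", "Q1")]))
```

Any pair it reports means one of the two items should be swapped for its alternate from the Test Assembly Form.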

Once you have your assessment put together, you need to calculate the cutscore. This topic could easily be another (very lengthy) blog series, and there are many books available on calculating cutscores. I recently read the book, Cutscores: A Manual for Setting Standards of Performance on Educational and Occupational Tests, by Zieky, Perie and Livingston. I found it to be a very good book, considering that the subject matter isn’t exactly “thrill a minute”. The authors discuss 18 different methods for setting cutscores, including which methods to use in various situations and how to carry out a cutscore study. They look at setting cutscores for criterion-referenced assessments (where performance is judged against a set standard) as well as norm-referenced assessments (where the performance of one participant is judged against the performance of the other participants). They also look at pass/fail situations as well as more complex judgments such as dividing participants into basic, proficient and advanced categories.

Test Design and Delivery: Overview

Posted By Doug Peterson

I had the pleasure of attending an ASTD certification class on Test Design and Delivery in Denver, Colorado, several weeks ago (my wife said it was no big deal, as I’ve been certifiable for a long time now). I’m going to use my blog posts for the next couple of months to pass along the highlights of what I learned.

The content for the class was developed by the good folks at ACT. During our two days together we covered the following topics:

  1. Planning the Test
  2. Creating the Test Items
  3. Creating the Test Form
  4. Delivering the Test
  5. Evaluating the Test

Over the course of this blog series, we’ll take a look at the main points from each topic in the class. We’ll look at all the things that go into writing a test before the first question is crafted, like establishing reliability and validity from the beginning and identifying content areas to be covered (as well as the number of questions needed for each area).

Next we’ll discuss some best practices for writing test items, including increasing the cognitive load and avoiding bias and stereotypes. After that we’ll discuss pulling items together into a test form, including developing instructions and setting passing scores.

The last few posts will focus on some things you need to look at when delivering a test, like security and controlling item exposure. Then we’ll look at evaluating a test’s performance by examining item-level and test-level data to improve quality and assess reliability.

As we work our way through this series of blogs, be sure to ask questions and share your thoughts in the comments section!

Posts in this series:

  1. Planning the Test
  2. Determining Content
  3. Final Planning Considerations
  4. Writing Test Items
  5. Avoiding Bias and Stereotypes
  6. Preparing to Create the Assessment
  7. Assembling the Test Form
  8. Delivering the Test
  9. Content Protection and Secure Delivery
  10. Evaluating the Test


Mobile Delivery: Using Questionmark’s App for Apple® iPhone® and iPad™

Posted by Julie Delazyn

Questionmark’s auto-sensing delivery makes it easy to author an assessment once, and then deliver to many different types of devices. The same Questionmark assessment you deliver to desktop or laptop can also be delivered via standard browsers on many different smartphones and tablets – but today I want to highlight the Questionmark Apps for Apple® iPhone® and iPad™. Plainly put, the Questionmark App equals added convenience. With simple configuration and options for customization it offers organizations an easy way to deliver quizzes and surveys to learners on the move, giving participants one-touch access to assessments that have been assigned to them.

To get the Questionmark App:

  1. Visit the App Store and search for “Questionmark”
  2. Choose either Questionmark’s iPad App or iPhone/iPod touch App
  3. Install the App on your device

Once installed:

  1. Try a Demo and see examples of different assessments for you to try
  2. Configure the app to point it directly to your Perception Server. You only have to configure the application once; it will remember the settings.

To easily configure the App, customers can enter the URL of their Questionmark Perception server or enter a customer number provided by our Customer Care Team.


For more on mobile delivery be sure to attend the Questionmark Users Conference in Los Angeles, March 15-18. View the conference program and register soon!