5 Steps to Better Tests

Posted by Julie Delazyn

Creating fair, valid and reliable tests requires starting off right: with careful planning. From that foundation, you will save time and effort while producing tests that yield trustworthy results.

Five essential steps for producing high-quality tests:

1. Plan: What elements must you consider before crafting the first question? How do you identify key content areas?

2. Create: How do you write items that target the intended cognitive level and avoid bias and stereotyping?

3. Build: How should you build the test form and set accurate pass/fail scores?

4. Deliver: What methods can be implemented to protect test content and discourage cheating?

5. Evaluate: How do you use item-, topic-, and test-level data to assess reliability and improve quality?

Download this complimentary white paper full of best practices for test design, delivery and evaluation.

 

Field Test Studies: Taking your items for a test drive

Posted by Austin Fossey

In large-scale assessment, a significant amount of work goes into writing items before a participant ever sees them. Items are drafted, edited, reviewed for accuracy, checked for bias, and usually rewritten several times before they are ready to be deployed. Despite all this work, a true test of an item’s performance will come when it is first delivered to participants.

Even though we work so hard to write high-quality items, some bad items may slip past our review committees. To be safe, most large-scale assessment programs will try out their items with a field test.

A field test delivers items to participants under the same conditions used in live testing, but the items do not count toward the participants’ scores. This allows test developers and psychometricians to harvest statistics that can be used in an item analysis to flag poorly performing items.

There are two methods for field testing items. The first method is to embed your new items into an assessment that is already operational. The field test items will not count against the participants’ scores, but the participants will not know which items are scored items and which items are field test items.

The second method is to give participants an assessment that includes only field test items. The participants will not receive a score at the end of the assessment since none of the items have yet been approved for live scoring, though the form may be scored later once the final set of items has been approved for operational use.

In their chapter in Educational Measurement (4th ed.), Schmeiser and Welch explain that embedding the items into an operational assessment is generally preferred. When items are field tested in an operational assessment, participants are more motivated to perform well on the items. The item data are also collected while the operational assessment is being delivered, which can help improve the reliability of the item statistics.

When participants take an assessment that consists only of field test items, they may not be motivated to try as hard as they would in an operational assessment, especially if the assessment will not be scored. However, field testing a whole form’s worth of items gives you better content coverage, so you have more items available for review in the item analysis. If field testing an entire form, Schmeiser and Welch suggest using twice as many items as you will need for the operational form. Many items may need to be discarded or rewritten as a result of the item analysis, so you want to make sure you still have enough to build an operational form at the end of the process.

Since the value of field testing items is to collect item statistics, it is also important to make sure that a representative sample of participants responds to the field test items. If the sample of participant responses is too small or not representative, then the item statistics may not be generalizable to the entire population.
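To make the flagging step more concrete, here is a minimal sketch of two classical item statistics often reviewed in this kind of item analysis: item difficulty (proportion correct) and item-total point-biserial discrimination. The data layout, thresholds and function names are illustrative assumptions, not Questionmark’s reporting code.

```python
import numpy as np

def item_statistics(responses):
    """Classical item analysis for a 0/1-scored response matrix.

    responses: 2-D array with one row per participant and one column per
    item (1 = correct, 0 = incorrect). Returns per-item difficulty
    (proportion correct) and item-total point-biserial discrimination.
    """
    responses = np.asarray(responses, dtype=float)
    difficulty = responses.mean(axis=0)
    totals = responses.sum(axis=1)
    discrimination = np.empty(responses.shape[1])
    for j in range(responses.shape[1]):
        # Correlate each item with the total score of the *other* items.
        rest = totals - responses[:, j]
        if responses[:, j].std() == 0 or rest.std() == 0:
            discrimination[j] = float("nan")  # undefined for zero-variance items
        else:
            discrimination[j] = np.corrcoef(responses[:, j], rest)[0, 1]
    return difficulty, discrimination

def flag_items(difficulty, discrimination, p_range=(0.20, 0.95), min_disc=0.15):
    """Return indices of items whose statistics fall outside the
    (illustrative) acceptable ranges."""
    return [j for j, (p, d) in enumerate(zip(difficulty, discrimination))
            if not (p_range[0] <= p <= p_range[1]) or d < min_disc]
```

Items flagged this way would normally go back to a review committee for rewriting or retirement rather than being dropped automatically.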

Questionmark’s authoring solutions allow test developers to field test items by setting the item’s status to “Experimental.” The item will still be scored, and the statistics will be generated in the Item Analysis Report, but the item will not count toward the participant’s final score.

Setting an item’s status to “Experimental” in Questionmark Live so that it can be field tested.

Writing Good Surveys, Part 6: Tips for the form of the survey

Posted by Doug Peterson

In this final installment of the series, we’ll take a look at some tips for the form of the survey itself.

The first suggestion is to avoid labeling sections of questions. Studies have shown that when it is obvious that a series of questions belong to a group, respondents tend to answer all the questions in the group the same way they answer the first question in the group. The same is true with visual formatting, like putting a box around a group of questions or extra space between groups. It’s best to just present all of the questions in a simple, sequentially numbered list.

As much as possible, keep questions at about the same length, and present the same number of questions (roughly, it doesn’t have to be exact) for each topic. Longer questions or more questions on a topic tend to require more reflection by the respondent, and tend to receive higher ratings. I suspect this might have something to do with the respondent feeling like the question or group of questions is more important (or at least more work) because it is longer, possibly making them hesitant to give something “important” a negative rating.

It is important to collect demographic information as part of a survey. However, a suspicion that he or she can be identified can definitely skew a respondent’s answers. Put the demographic information at the end of the survey to encourage honest responses to the preceding questions. Make as much of the demographic information optional as possible, and if the answers are collected and stored anonymously, assure the respondent of this. If you don’t absolutely need a piece of demographic information, don’t ask for it. The more anonymous the respondent feels, the more honest he or she will be.

Group questions with the same response scale together and present them in a matrix format. This reduces the cognitive load on the respondent; the response possibilities do not have to be figured out on each individual question, and the easier it is for respondents to fill out the survey, the more honest and accurate they will be. If you do not use the matrix format, consider listing the response scale choices vertically instead of horizontally. A vertical orientation clearly separates the choices and reduces the chance of accidentally selecting the wrong choice. And regardless of orientation, be sure to place more space between questions than between a question and its response scale.

I hope you’ve enjoyed this series on writing good surveys. I also hope you’ll join us in San Antonio in March 2014 for our annual Users Conference – I’ll be presenting a session on writing assessment and survey items, and I’m looking forward to hearing ideas and feedback from those in attendance!

Is a compliance test better with a higher pass score?

Posted by John Kleeman

Is a test better if it has a higher pass (or cut) score?

For example, if you develop a test to check that people know material for regulatory compliance purposes, is it better if the pass score is 60%, 70%, 80% or 90%? And is your organization safer if your test has a high pass score?

To answer this question, you first need to know the purpose of the test – how the results will be used and what inferences you want to make from them. Most compliance tests are criterion-referenced – that is to say, they measure specific skills, knowledge or competency. Someone who passes the test is competent for the job role, and someone who fails has not demonstrated competence and might need remedial training.

Before considering a pass score, you need to consider whether questions are substitutable, i.e. whether you can balance getting certain questions wrong and others right, and still be competent. It could be that getting particular questions wrong implies lack of competence, even if everything else is answered correctly. (For another way of looking at this, see Comprehensive Assessment Framework: Building the student model.) If a participant performs well on many items but gets a crucial safety question wrong, they still fail the test, as sketched below. See Golden Topics: Making success on key topics essential for passing a test for one way of creating tests that work like this in Questionmark.
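As a toy illustration of non-substitutable scoring (the item identifiers, cut score and rule below are invented for the example and are not the Golden Topics feature itself), a pass decision could require both a minimum overall score and a correct answer on every designated critical item:

```python
def passes(scores, critical_items, cut_fraction=0.8):
    """scores: dict mapping item id -> 1 (correct) or 0 (incorrect).
    The participant must reach the overall cut score AND answer every
    critical item correctly."""
    total_ok = sum(scores.values()) >= cut_fraction * len(scores)
    critical_ok = all(scores[item] == 1 for item in critical_items)
    return total_ok and critical_ok

# Example: strong overall score, but the critical safety item (q5) was missed.
responses = {"q1": 1, "q2": 1, "q3": 1, "q4": 1, "q5": 0}
print(passes(responses, critical_items=["q5"]))  # False despite 80% correct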

But assuming questions are substitutable and that a single pass score for a test is viable, how do you work out what that pass score should be? The table below shows 4 possible outcomes:

                              Pass test              Fail test
Participant competent         Correct decision       Error of rejection
Participant not competent     Error of acceptance    Correct decision

Providing that the test is valid and reliable, a competent participant should pass the test and a not-competent one should fail it.

Clearly, picking a pass score as a number “out of a hat” is not the right way to approach this. For a criterion-referenced test, you need to match the pass score to the way your questions measure competence. If you set the pass score too high, you increase the number of errors of rejection: competent people are rejected, and you will waste time re-training them and having them re-take the test. If you set it too low, you will have too many errors of acceptance: people who are not competent are accepted, with potential consequences for how they do the job.

You need to use informed judgement or statistical techniques to choose a pass score that supports valid inferences about participants’ skills, knowledge or competence in the vast majority of cases. This means the number of errors or misclassifications is tolerable for the intended use case. One technique for doing this is the Angoff method, as described in this SlideShare. Using Angoff, you classify each question by how likely it is that a minimally-competent participant would get it right, and then roll this up to work out the pass score.
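To illustrate the roll-up step, here is a minimal sketch with made-up ratings (three hypothetical raters, five items; these figures are not taken from the SlideShare). Each rater estimates the probability that a minimally-competent participant answers each item correctly; the estimates are averaged per item, and the item averages are summed to give the cut score:

```python
# Hypothetical Angoff ratings: each inner list holds one rater's estimates
# of the probability that a minimally-competent participant answers
# each of five items correctly.
ratings = [
    [0.70, 0.85, 0.60, 0.90, 0.75],   # rater 1
    [0.65, 0.80, 0.55, 0.95, 0.70],   # rater 2
    [0.75, 0.90, 0.50, 0.85, 0.80],   # rater 3
]

n_items = len(ratings[0])

# Average the raters' estimates for each item...
item_means = [sum(r[i] for r in ratings) / len(ratings) for i in range(n_items)]

# ...then sum the item means to get the expected raw score of a
# minimally-competent participant, and express it as a percentage.
cut_raw = sum(item_means)
cut_percent = 100 * cut_raw / n_items
print(f"Raw cut score: {cut_raw:.2f} of {n_items} -> {cut_percent:.1f}%")
```

With these invented numbers, the expected score of a minimally-competent participant is 3.75 out of 5, giving a cut score of 75%.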

Going back to the original question of whether a better test has a higher pass score, what matters is that your test is valid and reliable and that your pass score is set to the appropriate level to measure competency. You want the right pass score, not necessarily the highest pass score.

So what happens if you set your pass score without going through this process? For instance, you say that your test will have an 80% pass score before you design it.  If you do this, you are assuming that on average all the questions in the test will have an 80% chance of being answered correctly by a minimally-competent participant. But unless you have ways of measuring and checking that, you are abandoning logic and trusting to luck.

In general, a lower pass score does not necessarily imply an easier assessment. If the items are very difficult, a low pass score may still yield low pass rates. Pass scores are often set with a consideration for the difficulty of the items, either implicitly or explicitly.
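A quick simulation makes the point. The probabilities, test length and cut scores below are invented for illustration, and every participant is given the same chance of answering an item correctly, which is a deliberate oversimplification:

```python
import numpy as np

rng = np.random.default_rng(0)

def pass_rate(p_correct, n_items, cut_fraction, n_participants=10_000):
    """Share of simulated participants whose score reaches the cut score,
    where each item is answered correctly with probability p_correct."""
    scores = rng.binomial(n_items, p_correct, size=n_participants)
    return (scores >= cut_fraction * n_items).mean()

# Hard items (45% chance of a correct answer) with a 50% pass score:
print(pass_rate(p_correct=0.45, n_items=40, cut_fraction=0.50))

# Easy items (90% chance of a correct answer) with an 80% pass score:
print(pass_rate(p_correct=0.90, n_items=40, cut_fraction=0.80))
```

With the hard items, only about a third of the simulated participants clear the 50% cut, while with the easy items nearly everyone clears the 80% cut: the pass score alone tells you little about how demanding the test is.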

So, is a test better if it has a higher pass score?

The answer is no. A test is best if it has the right pass score. And if one organization has a compliance test where the pass score is 70% and another has a compliance test where the pass score is 80%, this tells you nothing about how good each test is. You need to ask whether the tests are valid and reliable and how the pass scores were determined. There is an issue of “face validity” here: people might find it hard to believe that a test with a very low pass score is fair and reasonable, but in general a higher pass score does not make a better test.

If you want to learn more about setting a pass score, search this blog for articles on “standard setting” or “cut score” or read the excellent book Criterion-Referenced Test Development, by Sharon Shrock and Bill Coscarelli. We’ll also be talking about this and other best practices at our upcoming Users Conferences in Barcelona, November 10-12, and San Antonio, Texas, March 4-7.

Best practices for test design and delivery: Join the webinar

Posted by Joan Phaup

So many people signed up for Doug Peterson’s recent web seminar about best practices for test design and delivery that we’re offering it again in August:

Join us at 11 a.m. Eastern Time on Thursday, August 22, for Five Steps to Better Tests: Best Practices for Design and Delivery

This webinar will give you practical tips for planning tests, creating items, and building, delivering and evaluating tests that yield actionable, meaningful results.

Doug speaks from experience, having spent more than 12 years in workforce development. During that time, he created training materials, taught in the classroom and over the Web, and created many online surveys, quizzes and tests.

The webinar is based on Doug’s 10-part series in this blog about test design and delivery, which is also available as the Questionmark white paper, Five Steps to Better Tests.

Join the webinar for a lively explanation of these five essential steps for effective test design and delivery:

1. Planning the test
2. Creating the test items
3. Creating the test form
4. Delivering the test
5. Evaluating the test

Go to our UK website or our US website for webinar details and free registration.