How can a randomized test be fair to all?
Posted by Joan Phaup
James Parry, who is test development manager at the U.S. Coast Guard Training Center in Yorktown, Virginia, will answer this question during a case study presentation at the Questionmark Users Conference in San Antonio March 4 – 7. He’ll be co-presenting with LT Carlos Schwarzbauer, IT Lead at the USCG Force Readiness Command’s Advanced Distributed Learning Branch.
James and I spoke the other day about why tests created from randomly drawn items can be useful in some cases—but also about their potential pitfalls and some techniques for avoiding them.
When are randomly designed tests an appropriate choice?
There are several reasons to use randomized tests. Randomization is appropriate when you think there’s a possibility of participants sharing the contents of their test with others who have not taken it. Another reason would be a computer lab-style testing environment where you are testing many participants on the same subject at the same time with no blinders between the computers. With randomization, even if participants look at the screens next to them, chances are they won’t see the same items.
How are you using randomly designed tests?
We use randomly generated tests at all three levels of testing: low-, medium- and high-stakes. The low- and medium-stakes tests are used primarily at the schoolhouse level for knowledge- and performance-based quizzes and tests. We are also generating randomized tests for on-site testing using tablet computers or locally installed workstations.
Our most critical use is for our high-stakes enlisted advancement tests, which are administered both on paper and by computer. Participants are permitted to retake this test every 21 days if they do not achieve a passing score. Before we were able to randomize the test, there were only three parallel paper versions. Candidates knew this, so some would “test sample” without studying to get an idea of every possible question. They would take the first version, then the second, and so forth until they passed. With randomization, the word has gotten out that this is not possible anymore.
What are the pitfalls of drawing items randomly from an item bank?
The biggest pitfall is the potential for producing tests that have different levels of difficulty or that don’t present a balance of questions on all the subjects you want to cover. A completely random test can be unfair. Suppose you produce a 50-item randomized test from an entire test item bank of 500 items. Participant “A” might get an easy test, “B” might get a difficult test, and “C” might get a test with 40 items on one topic and only 10 spread across the rest.
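To make the pitfall concrete, here is a minimal Python sketch of a completely unrestricted draw. Everything in it is an illustrative assumption rather than the Coast Guard's actual setup: the five topic names, the three difficulty labels assigned at random, and the helper names `build_bank`, `draw_unrestricted` and `summarize` are all hypothetical.

```python
import random
from collections import Counter

# Hypothetical topics and difficulty labels, used only for illustration.
TOPICS = ("navigation", "safety", "engineering", "communications", "regulations")
LEVELS = ("easy", "moderate", "hard")


def build_bank(n_items=500, seed=0):
    """Build a made-up item bank: topics are spread evenly, difficulty is random."""
    rng = random.Random(seed)
    return [
        {"id": i, "topic": TOPICS[i % len(TOPICS)], "difficulty": rng.choice(LEVELS)}
        for i in range(n_items)
    ]


def draw_unrestricted(bank, n, rng):
    """Draw n distinct items from the whole bank, ignoring topic and difficulty."""
    return rng.sample(bank, n)


def summarize(form):
    """Count how many items of each topic and each difficulty a form contains."""
    topics = Counter(item["topic"] for item in form)
    levels = Counter(item["difficulty"] for item in form)
    return topics, levels


if __name__ == "__main__":
    bank = build_bank()
    for participant in ("A", "B", "C"):
        # A different seed per participant stands in for separate test sessions.
        form = draw_unrestricted(bank, 50, random.Random(participant))
        topics, levels = summarize(form)
        print(participant, dict(topics), dict(levels))
```

Running it prints a different topic and difficulty mix for each participant, even though all three forms come from the same bank, which is the imbalance described above.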
How do you equalize the difficulty levels of your questions?
This is a multi-step process. The item authors have to develop enough items in each topic to provide at least 3 to 5 items for each enabling objective. They have to think outside the box to produce items at several cognitive levels, so that the bank covers a range of difficulty levels. This is the hardest part for them because most are not trained test writers.
Once the items are developed, edited, and approved in workflow, we set up an Angoff rating session to assign a cut score for the entire bank of test items. Based on its Angoff rating, each item is assigned a difficulty level of easy, moderate or hard and given a matching metatag within Questionmark. We use a spreadsheet to calculate the number and percentage of available items at each level of difficulty in each topic. Based on the results, the spreadsheet tells us how many items to select from the database at each difficulty level and from each topic. The test is then designed to match these numbers so that each time it is administered it will be parallel, with the same level of difficulty and the same cut score.
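The selection step described above can be sketched in Python as a stratified draw. This is not the spreadsheet itself, whose inner workings are not described here, but one plausible way to do the same kind of calculation. The assumptions are: each item carries an Angoff rating between 0 and 1 (the judges' average estimate that a minimally qualified candidate answers it correctly), the thresholds that turn a rating into easy, moderate or hard are invented for the example, and quotas are proportional to the bank with largest-remainder rounding. All function names and the demo bank are hypothetical.

```python
import random
from collections import Counter, defaultdict


def classify(angoff, hard_below=0.5, easy_from=0.8):
    """Map an Angoff rating to a difficulty label (thresholds are assumptions)."""
    if angoff >= easy_from:
        return "easy"
    if angoff < hard_below:
        return "hard"
    return "moderate"


def allocate(bank, test_length):
    """Give each (topic, difficulty) cell a quota proportional to its share of the bank."""
    cells = defaultdict(list)
    for item in bank:
        cells[(item["topic"], classify(item["angoff"]))].append(item)
    exact = {cell: test_length * len(items) / len(bank) for cell, items in cells.items()}
    quota = {cell: int(share) for cell, share in exact.items()}
    # Largest-remainder rounding: hand the leftover slots to the biggest fractions.
    leftover = test_length - sum(quota.values())
    by_remainder = sorted(exact, key=lambda c: exact[c] - quota[c], reverse=True)
    for cell in by_remainder[:leftover]:
        quota[cell] += 1
    return cells, quota


def draw_form(bank, test_length, rng):
    """Draw one form: random within each cell, fixed counts across cells."""
    cells, quota = allocate(bank, test_length)
    form = []
    for cell, n in quota.items():
        form.extend(rng.sample(cells[cell], n))
    rng.shuffle(form)
    return form


def mix(form):
    """Count items per (topic, difficulty) cell in a form."""
    return Counter((item["topic"], classify(item["angoff"])) for item in form)


def expected_cut_score(form):
    """Angoff-style raw cut score for a form: the sum of its item ratings."""
    return sum(item["angoff"] for item in form)


def make_demo_bank(n_items=500, n_topics=5, seed=0):
    """Made-up bank with random ratings, only for trying the functions out."""
    rng = random.Random(seed)
    return [
        {"id": i, "topic": f"topic-{i % n_topics}", "angoff": round(rng.uniform(0.3, 0.95), 2)}
        for i in range(n_items)
    ]
```

Two forms drawn with `draw_form(make_demo_bank(), 50, random.Random(1))` and a different seed contain different items but the same count in every topic and difficulty cell. In this sketch the per-form value from `expected_cut_score` can still drift slightly, because ratings vary within a difficulty band; narrower bands would tighten it.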
Is there anything audience members should do to prepare for this session?
Come with an open mind and a willingness to think outside of the box.
How will your session help audience members ensure their randomized tests are fair?
I will give them the tools they need, starting with a quick review of using the Angoff method to set a cut score and then moving on to the inner workings of the spreadsheet that I developed to ensure each test is fair and equal.