How can a randomized test be fair to all?

Posted by Joan Phaup

James Parry, who is test development manager at the U.S. Coast Guard Training Center in Yorktown, Virginia, will answer this question during a case study presentation at the Questionmark Users Conference in San Antonio March 4 – 7. He’ll be co-presenting with LT Carlos Schwarzbauer, IT Lead at the USCG Force Readiness Command’s Advanced Distributed Learning Branch.

James and I spoke the other day about why tests created from randomly drawn items can be useful in some cases—but also about their potential pitfalls and some techniques for avoiding them.

When are randomly designed tests an appropriate choice?

James Parry


There are several reasons to use randomized tests. Randomization is appropriate when you think participants might share the contents of their test with others who have not taken it. Another reason would be a computer lab style testing environment where you are testing many participants on the same subject at the same time with no blinders between the computers. Even if participants look at the screens next to them, chances are they won’t see the same items.

How are you using randomly designed tests?

We use randomly generated tests at all three levels of testing: low-, medium- and high-stakes. The low- and medium-stakes tests are used primarily at the schoolhouse level for knowledge- and performance-based quizzes and tests. We are also generating randomized tests for on-site testing using tablet computers or locally installed workstations.

Our most critical use is for our high-stakes enlisted advancement tests, which are administered both on paper and by computer. Participants are permitted to retake this test every 21 days if they do not achieve a passing score. Before we were able to randomize the test, there were only three parallel paper versions. Candidates knew this, so some would “test sample” without studying to get an idea of every possible question. They would take the first version, then the second, and so forth until they passed. With randomization, word has gotten out that this is no longer possible.

What are the pitfalls of drawing items randomly from an item bank?

The biggest pitfall is the potential for producing tests that have different levels of difficulty or that don’t present a balance of questions on all the subjects you want to cover. A completely random test can be unfair.  Suppose you produce a 50-item randomized test from an entire test item bank of 500 items.   Participant “A” might get an easy test, “B” might get a difficult test and “C” might get a test with 40 items on one topic and 10 on the rest and so on.
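To make the pitfall concrete, here is a small Python sketch of a completely random draw of 50 items from a 500-item bank. The bank itself is invented for illustration: ten equally sized topics and made-up difficulty values between 0.30 and 0.95 are assumptions, not data from the Coast Guard item bank.

```python
import random
from collections import Counter


def simple_random_form(bank, length, rng):
    """Draw `length` distinct items with no regard to topic or difficulty."""
    return rng.sample(bank, length)


rng = random.Random(2013)

# Hypothetical bank: 500 items spread evenly over 10 topics, each with an
# invented difficulty value (higher = easier).
bank = [
    {"id": i, "topic": f"T{i % 10}", "difficulty": rng.uniform(0.30, 0.95)}
    for i in range(500)
]

for participant in "ABC":
    form = simple_random_form(bank, 50, rng)
    per_topic = Counter(item["topic"] for item in form)
    mean_difficulty = sum(item["difficulty"] for item in form) / len(form)
    # Topic counts and mean difficulty drift from one participant to the next.
    print(participant, sorted(per_topic.values()), round(mean_difficulty, 3))
```

Nothing in an unconstrained draw keeps the per-topic counts or the average difficulty the same from one form to the next, which is the unfairness described above.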

How do you equalize the difficulty levels of your questions?

This is a multi-step process. The item authors have to develop enough items in each topic to provide at least 3 to 5 items for each enabling objective. They have to think outside the box to produce items at several cognitive levels so that there will be a variety of possible levels of difficulty. This is the hardest part for them because most are not trained test writers.

Once the items are developed, edited, and approved in workflow, we set up an Angoff rating session to assign a cut score for the entire bank of test items.  Based upon the Angoff score, each item is assigned a difficulty level of easy, moderate or hard and assigned a metatag to match within Questionmark.  We use a spreadsheet to calculate the number and percentage of available items at each level of difficulty in each topic. Based upon the results, the spreadsheet tells how many items to select from the database at each difficulty level and from each topic. The test is then designed to match these numbers so that each time it is administered it will be parallel, with the same level of difficulty and the same cut score.
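The selection logic described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the actual spreadsheet or anything built into Questionmark: the thresholds that turn a mean Angoff rating into an easy, moderate or hard tag, the proportional allocation with largest-remainder rounding, and all function and field names are hypothetical.

```python
import random
from collections import Counter, defaultdict

# Assumed thresholds on the mean Angoff rating (the estimated probability that
# a minimally qualified participant answers correctly). Not from the source.
EASY_AT_LEAST = 0.80
HARD_BELOW = 0.60


def difficulty_tag(angoff):
    """Map a mean Angoff rating to a difficulty tag."""
    if angoff >= EASY_AT_LEAST:
        return "easy"
    if angoff < HARD_BELOW:
        return "hard"
    return "moderate"


def stratum(item):
    """A stratum is one (topic, difficulty tag) cell of the blueprint."""
    return (item["topic"], difficulty_tag(item["angoff"]))


def quotas(bank, test_length):
    """Split test_length across strata in proportion to available items."""
    counts = Counter(stratum(item) for item in bank)
    total = sum(counts.values())
    exact = {s: test_length * n / total for s, n in counts.items()}
    alloc = {s: int(x) for s, x in exact.items()}
    leftover = test_length - sum(alloc.values())
    # Largest-remainder rounding: the strata with the biggest fractional
    # parts receive the remaining slots.
    by_fraction = sorted(exact, key=lambda s: exact[s] - alloc[s], reverse=True)
    for s in by_fraction[:leftover]:
        alloc[s] += 1
    return alloc


def draw_form(bank, alloc, rng):
    """Randomly pick the required number of items from every stratum."""
    by_stratum = defaultdict(list)
    for item in bank:
        by_stratum[stratum(item)].append(item)
    form = []
    for s, k in alloc.items():
        form.extend(rng.sample(by_stratum[s], k))
    rng.shuffle(form)
    return form


def cut_score(items):
    """Angoff cut score as a percentage: the mean of the item ratings."""
    return 100 * sum(item["angoff"] for item in items) / len(items)


# Hypothetical bank: two topics with ten rated items each.
ratings = [0.90, 0.85, 0.90, 0.70, 0.65, 0.75, 0.70, 0.50, 0.55, 0.40]
bank = [
    {"id": f"{topic}{n}", "topic": topic, "angoff": r}
    for topic in "AB"
    for n, r in enumerate(ratings)
]
alloc = quotas(bank, 10)
for seed in (1, 2):
    form = draw_form(bank, alloc, random.Random(seed))
    print(sorted(item["id"] for item in form), round(cut_score(form), 1))
```

Every form drawn this way has the same number of items per topic and per difficulty tag. In this sketch the cut scores of two forms will be close rather than identical, because ratings still vary within a tag; narrower difficulty bands would tighten that spread.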

Is there anything audience members should do to prepare for this session?

Come with an open mind and a willingness to think outside of the box.

How will your session help audience members ensure their randomized tests are fair?

I will give them the tools to use, starting with a quick review of using the Angoff method to set a cut score, and then discuss the inner workings of the spreadsheet that I developed to ensure each test is fair and equal.


See more details about the conference program here and register soon.

7 Responses to How can a randomized test be fair to all?

  1. Monika says:

    Dear Joan,

    Thank you for this post. I see that you only use ‘Angoff difficulty’ in the metatag. However, as we know, experts are not always accurate with their estimates of difficulty. Is it also possible with QMP to use empirical statistics about the items, such as proportion correct and item-rest correlation?

    with regards,

  2. Austin Fossey says:

    Hi Monika,

    I think that one could use item statistics like proportion correct and item-rest correlation to classify items for tagging (and the Questionmark Item Analysis Report is a great way to get these data), but I think there is still an advantage to using the Angoff weights for the final tagging. Mr. Parry may disagree with me, but it seems to me that using the item statistics would be problematic because they are sample dependent, whereas the Angoff weights are set based on the expert panel’s definition of a minimally qualified participant.

    I think in many cases, we could not be certain that the sample of participants is representative of the entire population, so the item statistics may not be generalizable. Even if we were confident that we had a representative sample of participants, the interpretation of the proportion correct would be different from the Angoff weights. The proportion correct would represent difficulty for the population, whereas the Angoff weight would represent the theoretical difficulty for a sample of minimally qualified participants.

    This is why using the item statistics might yield a different application of the tags. An item that is difficult for the general population might be considered medium difficulty for candidates who are minimally qualified, or vice versa. The discrepancies will probably be greater when there is a bigger difference between the hypothetical minimally qualified participant and the average participant taking the test.

    There is no doubt that expert Angoff ratings can be inaccurate–even more so at the item level–but if the Angoff study is conducted well, the experts are trained, and the group of experts is sufficiently large and representative, then I would have some confidence in the correct classification of difficulty by Angoff weights, even if the weights themselves may not be very accurate. Again, Mr. Parry may disagree with me, but I think that if I were creating a Linear-on-the-Fly test using the method he described, I would use the Angoff weights to classify the items, but I may also use the proportion correct and item-rest correlation (or item-total correlation) as impact data to inform the decisions. For example, if I had two items that had medium difficulty Angoff weights, but their p values were .90 and .20, I might question why these items behaved so differently in the participant sample. The item statistics may also be useful for validating the design.
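As a rough illustration of the statistics discussed in this comment, the Python sketch below computes proportion correct (the p value) and item-rest correlation from a matrix of 0/1 item scores, then lists items whose p value sits far from their Angoff weight. The 0.30 tolerance, the function names and the tiny response matrix are assumptions made for the example. Since the two numbers describe different groups (the tested sample versus a minimally qualified participant), a flag is only a prompt to review an item, not evidence that either value is wrong.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation; returns NaN if either variable has no variance."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return float("nan")
    return sxy / sqrt(sxx * syy)


def item_stats(responses):
    """responses: one list of 0/1 item scores per participant."""
    totals = [sum(row) for row in responses]
    stats = []
    for j in range(len(responses[0])):
        scores = [row[j] for row in responses]
        # "Rest" score: total score with the item itself removed.
        rest = [t - s for t, s in zip(totals, scores)]
        stats.append({
            "p": sum(scores) / len(scores),
            "item_rest": pearson(scores, rest),
        })
    return stats


def flag_discrepancies(angoff, stats, tolerance=0.30):
    """Indexes of items whose p value is far from the Angoff weight."""
    return [
        j for j, (a, s) in enumerate(zip(angoff, stats))
        if abs(a - s["p"]) > tolerance
    ]


# Hypothetical data: four participants, three items, all rated 0.70 by the panel.
responses = [
    [1, 1, 0],
    [1, 0, 0],
    [1, 1, 1],
    [0, 1, 0],
]
stats = item_stats(responses)
print(stats)
print(flag_discrepancies([0.70, 0.70, 0.70], stats))
```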



  3. Monika says:

    you again,

    It is true that p-values are sample dependent, but if we calibrate them, they are not. And what about retesting the items? Is it also possible to have a single p-value based on different testing occasions?

    Again, here, we can draw the conclusion that QMP could be kickass software if it allowed IRT analyses 🙂

    I would like to meet Mr. Parry 😉 to disagree with you together 😀

  4. Austin Fossey says:

    Hi Monika,

    Yup, sorry, it is me again. But I agree on most points! Though I believe the calibration would still be sample dependent, the relative difficulties of the items would remain static unless we had an unnaturally homogenous group of participants as our calibration sample. I agree that this could be a much more stable way of tagging items by their observed difficulty.

    I think the decision ultimately comes down to the inference we want to make based on the delivery of the items. All of these ideas are examples of Linear-on-the-Fly Tests (LOFT), and the design of a LOFT can use many different criteria for random selection of items. Mr. Parry is recommending Angoff weights since he wants to control delivery in relation to the definition of a minimally qualified participant, but in your case, you want to control delivery so that all participants have a similar distribution of item difficulty based on how the items perform in the entire population. I think both strategies would work equally well, though the choice may affect how we interpret and communicate the results. I have also seen LOFT designs where items are randomly selected by content standards, and there is no reason we could not use multiple selection dimensions, such as content and difficulty at the same time!

    The only disagreement I have is about your evaluation of Questionmark software’s ability to kick tail. Questionmark has already met and exceeded this threshold, and continues to do so with every release! I think the fact that we are even discussing the use of Questionmark for a LOFT delivery environment is proof enough. I do not disagree that it would be very cool to have IRT-based delivery and scoring, but since CTT and IRT scores correlate so closely, and since many Questionmark clients are measuring for classification rather than aptitude, I would say it is a solid tool set for most measurement applications.

    Don’t worry though! I know all our awesome psychometricians want to get in there and merge your item calibration data into Questionmark, and we have not forgotten! For the time being though, I hope to see you at our Users Conference and User Group Meetings to continue this debate (and we can drag Jim Parry in too, though I should probably give him fair warning)!

    Thanks as always for the thoughtful and perceptive feedback 🙂



  5. Tony Li says:

    Hi Austin,

    Just a quick question after seeing your post. Could you give me a little bit of information about Questionmark’s capacity in terms of delivering LOFT and computer adaptive testing?

  6. Austin Fossey says:

    Hi Tony,

    Questionmark customers can implement a LOFT delivery design using item metatags, and this is what James Parry demonstrated at his presentation at the 2014 Questionmark Users Conference. Mr. Parry shared his LOFT design where items were delivered by Questionmark Topics for subscores and content representation, but items were then randomly selected based on other metadata, which is the basis of a LOFT design. In Mr. Parry’s case, I believe he used Angoff weights as an indicator for item difficulty to select the same number of easy, medium, and difficult items within a topic for each participant. Of course, clients can use whichever metadata they want to structure the random selection in LOFT, though I still recommend classifying items’ content with the Questionmark Topic structure for scoring purposes.

    With regards to computer-adaptive testing (CAT), Questionmark’s delivery system is not yet able to implement a CAT design (in the traditional sense of using IRT parameters for selecting items that will maximize the information estimate). We know that there are a few customer use cases for which CAT would be an appropriate design choice, and we are looking into adding a CAT delivery engine to our product in the future.

    Thanks for your question!


    Austin Fossey

  7. […] are reflected in your test specifications matrix – or blueprint. QuestionMarkPerception has this to say on the […]
