Know what your questions are about before you deliver the test

Posted by Austin Fossey

A few months ago, I had an interesting conversation with an assessment manager at an educational institution—not a Questionmark customer, mind you. Finding nothing else in common, we eventually began discussing assessment design.

At this institution (which will remain anonymous), he admitted that they are often pressed for time in their assessment development cycle. There is not enough time to do all of the item development work they need to do before their students take the assessment. To get around this, their item writers draft all of the items, conduct an editorial review, and then deliver the items. The items are assigned topics after administration, and students’ total scores and topic scores are calculated from there. He asked me if Questionmark software allows test developers to assign topics and calculate topic scores after assessing the students, and I answered truthfully that it does not.

But why not? Is there a reason test developers should not do what is being practiced at this institution? Yes, there are in fact two reasons. Get ready for some psychometric finger-wagging.

Consider what this institution is doing. The items are drafted and subjected to an editorial review, but no one ever classifies the items within a topic until after the test has been administered. Recall what people typically do during a content review prior to administration:

  • Remove items that are not relevant to the domain.
  • Ensure that the blueprint is covered.
  • Check that items are assigned to the correct topic.

If topics are not assigned until after the participants have already tested, we risk the validity of the results and the legal defensibility of the test. If we have delivered items that are not relevant to the domain, we have wasted participants’ time and will need to adjust their total score. Okay, we can manage that by telling the participants ahead of time that some of the test items might not count. But if we have not asked the correct number of questions for a given area of the blueprint, the entire assessment score will be worthless—a threat to validity known as construct underrepresentation or construct deficiency in The Standards for Educational and Psychological Testing.

For example, if we were supposed to deliver 20 items from Topic A, but find out after the fact that only 12 items have been classified as belonging to Topic A, then there is little we can do about it besides rebuilding the test form and making everyone take the test again.
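A check like this is easy to automate once items carry topic assignments. The sketch below is illustrative only (the blueprint numbers and item IDs are hypothetical, not from any real program): it compares the required item count per topic against the items actually classified under each topic, and reports any shortfall before the form goes operational.

```python
from collections import Counter

# Hypothetical blueprint: required item counts per topic.
blueprint = {"Topic A": 20, "Topic B": 15, "Topic C": 10}


def check_blueprint(items, blueprint):
    """Return {topic: shortfall} for every under-represented topic.

    `items` is a list of (item_id, topic) pairs; an empty result means
    the form meets the blueprint's content specifications.
    """
    counts = Counter(topic for _, topic in items)
    return {
        topic: required - counts.get(topic, 0)
        for topic, required in blueprint.items()
        if counts.get(topic, 0) < required
    }


# Example: only 12 items ended up classified under Topic A.
items = (
    [(f"A{i}", "Topic A") for i in range(12)]
    + [(f"B{i}", "Topic B") for i in range(15)]
    + [(f"C{i}", "Topic C") for i in range(10)]
)
shortfalls = check_blueprint(items, blueprint)  # {"Topic A": 8}
```

Run before administration, the shortfall is a to-do list for the item writers; run after administration, as in the scenario above, it is only a diagnosis of a form that already failed to represent the construct.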

The Standards provide helpful guidance in these matters. For this particular case, the Standards point out that:

“The test developer is responsible for documenting that the items selected for the test meet the requirements of the test specifications. In particular, the set of items selected for a new test form . . . must meet both content and psychometric specifications.” (p. 82)

Publications describing best practices for test development also specify that the content must be determined before delivering an operational form. For example, in their chapter in Educational Measurement (4th Edition), Cynthia Schmeiser and Catherine Welch note the importance of conducting a content review of items before field testing, as well as a final content review of a draft test form before it becomes operational.

In Introduction to Classical and Modern Test Theory, Linda Crocker and James Algina make an interesting observation about classroom assessments, noting that students expect to be graded on all of the items they have been asked to answer. Even if notified in advance that some items might not be counted (as one might do in field testing), students may not consider it fair that their score is based on a yet-to-be-determined subset of items that may not fully represent the content that is supposed to be covered.

This is why Questionmark’s software is designed the way it is. When creating an item, item writers must assign an item to a topic, and items can be classified or labeled along other dimensions (e.g., cognitive process) using metatags. Even if an assessment program cannot muster any further content review, at least the item writer has classified items by content area. The person building the test form then has the information they need to make sure that the right questions get asked.
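The design decision described above can be modeled as a simple data-structure constraint. The sketch below is not Questionmark's actual data model; it is a minimal illustration of the idea that a topic is a required field at creation time, so an item cannot exist without a content classification, while metatags remain optional extra dimensions.

```python
from dataclasses import dataclass, field


@dataclass
class Item:
    """Illustrative item record: `topic` has no default, so it must be
    supplied when the item is created."""
    item_id: str
    stem: str
    topic: str  # required content classification
    metatags: dict = field(default_factory=dict)  # optional dimensions


item = Item(
    item_id="ITM-001",
    stem="Which review should precede field testing?",
    topic="Test Development",
    metatags={"cognitive_process": "recall"},
)

# Omitting the topic fails immediately, at authoring time rather than
# after administration:
try:
    Item(item_id="ITM-002", stem="No topic given")
    missing_topic_rejected = False
except TypeError:
    missing_topic_rejected = True
```

Pushing the constraint into the data model means the classification exists before any form is built, which is exactly what the post-hoc approach described earlier gives up.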

We have a responsibility as test developers to treat our participants fairly and ethically. If we are asking them to spend their time taking a test, then we owe them the most useful measurement that we can provide. Participants trust that we know what we are doing. If we postpone critical, basic development tasks like content identification until after participants have already given us their time, we are taking advantage of that trust.

Item Development – Summary and Conclusions

Posted by Austin Fossey

This post concludes my series on item development in large-scale assessment. I’ve discussed some key processes in developing items, including drafting items, reviewing items, editing items, and conducting an item analysis. The goal of this process is to fine-tune a set of items so that test developers have an item pool from which they can build forms for scored assessment while being confident about the quality, reliability, and validity of the items. While the series covered a variety of topics, there are a couple of key themes that were relevant to almost every step.

First, documentation is critical, and even though it seems like extra work, it does pay off. Documenting your item development process helps keep things organized and helps you reproduce processes should you need to conduct development again. Documentation is also important for organization and accountability. As noted in the posts about content review and bias review, checklists can help ensure that committee members consider a minimal set of criteria for every item, but they also provide you with documentation of each committee member’s ratings should the item ever be challenged. All of this documentation can be thought of as validity evidence—it helps support your claims about the results and refute rebuttals about possible flaws in the assessment’s content.

The other key theme is the importance of recruiting qualified and representative subject matter experts (SMEs). SMEs should be qualified to participate in their assigned task, but diversity is also an important consideration. You may want to select item writers with a variety of experience levels, or content experts who have different backgrounds. Your bias review committee should be made up of experts who can help identify both content and response bias across the demographic areas that are pertinent to your population. Where possible, it is best to keep your SME groups independent so that you do not have the same people responsible for different parts of the development cycle. As always, be sure to document the relevant demographics and qualifications of your SMEs, even if you need to keep their identities anonymous.

This series is an introduction to organizing an item development cycle, but I encourage readers to refer to the resources mentioned in the articles for more information. This series also served as the basis for a session at the 2015 Questionmark Users Conference, which Questionmark customers can watch in the Premium section of the Learning Café.

You can link back to all of the posts in this series by clicking on the links below, and if you have any questions, please comment below!

Item Development – Managing the Process for Large-Scale Assessments

Item Development – Training Item Writers

Item Development – Five Tips for Organizing Your Drafting Process

Item Development – Benefits of editing items before the review process

Item Development – Organizing a content review committee (Part 1)

Item Development – Organizing a content review committee (Part 2)

Item Development – Organizing a bias review committee (Part 1)

Item Development – Organizing a bias review committee (Part 2)

Item Development – Conducting the final editorial review

Item Development – Planning your field test study

Item Development – Psychometric review

Item Development – Conducting the final editorial review

Posted by Austin Fossey

Once you have completed your content review and bias review, it is best to conduct a final editorial review.

You may have already conducted an editorial review prior to the content and bias reviews to cull items with obvious item-writing flaws or inappropriate item types—so by the time you reach this second editorial review, your items should only need minor edits.

This is the time to put the final polish on all of your items. If your content review committee and bias review committee were authorized to make changes to the items, go back and make sure they followed your style guide and that they used accurate grammar and spelling. Make sure they did not make any drastic changes that violate your test specifications, such as adding a fourth option to a multiple choice item that should only have three options.

If you have the resources to do so, have professional editors review the items’ content. Ask the editors to identify issues with language, but review their suggestions rather than letting them edit the items directly. Editors may suggest changes that violate your style guide, may not be familiar with language appropriate for your industry, or may propose a change that would drastically alter the item content. Review each suggestion carefully before accepting it.

As with other steps in the item development process, documentation and organization are key. Using item writing software like that provided by Questionmark can help you track revisions, document changes, and confirm that every item has been reviewed.

Do not approve items with a rubber stamp. If an item needs major content revisions, send it back to the item writers and begin the process again. Faulty items can undermine the validity of your assessment and can result in time-consuming challenges from participants. If you have planned ahead, you should have enough extra items to allow for some attrition while retaining enough items to meet your test specifications.
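One common back-of-the-envelope way to plan for that attrition is to inflate the draft pool by the expected loss rate. The numbers below are hypothetical, offered only to show the arithmetic: if you need 20 surviving items and expect to lose about a quarter of drafts across the review stages, you should commission roughly 27.

```python
import math


def items_to_draft(required, attrition_rate):
    """Number of items to draft so that, after expected attrition,
    enough items survive to meet the test specification.

    `attrition_rate` is the expected fraction of drafted items lost
    (an assumed planning figure, not an empirical constant).
    """
    if not 0 <= attrition_rate < 1:
        raise ValueError("attrition_rate must be in [0, 1)")
    return math.ceil(required / (1 - attrition_rate))


items_to_draft(20, 0.25)  # -> 27
```

Actual attrition rates vary by program and are best estimated from your own past development cycles; the point is simply to decide the buffer before drafting begins, not after items start failing review.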

Finally, be sure that you have the appropriate stakeholders sign off on each item. Once the item passes this final editorial review, it should be locked down and considered ready to deliver to participants. Ideally, no changes should be made to items once they are in delivery, as this may impact how participants respond to the item and perform on the assessment. (Some organizations require senior executives to review and approve any requested changes to items that are already in delivery.)

When you are satisfied that the items are perfect, they are ready to be field tested. In the next post, I will talk about item try-outs, selecting a field test sample, assembling field test forms, and delivering the field test.

Check out our white paper: 5 Steps to Better Tests for best practice guidance and practical advice for the five key stages of test and exam development.

Austin Fossey will discuss test development at the 2015 Users Conference in Napa Valley, March 10-13. Register before Jan. 29 and save $100.