Item Development – Planning your field test study

Posted by Austin Fossey

Once the items have passed their final editorial review, they are ready to be delivered to participants, but they are not quite ready to be delivered as scored items. For large-scale assessments, it is best practice to deliver your new items as unscored field test items so that you can gather item statistics for review before using the items to count toward a participant’s score. We discussed field test studies in an earlier post, but today we will focus more on the operational aspects of this task.

If you are embedding field test items in an operational assessment, there is little you need to do to plan for the field test, other than to collect data on your participants to ensure representativeness and to make sure that enough participants respond to each item to yield stable statistics. You can collect data for representativeness by using demographic questions in Questionmark’s authoring tools.

If field testing an entire form, you will need to plan your field test carefully. When an entire form is going to be field tested, Schmeiser and Welch (Educational Measurement, 4th ed.) recommend testing twice as many items as you will need for your operational form.

To check representativeness, you may want to survey your participants in advance to help you select your participant sample. For example, if your participant population is 60% female and 40% male, but your field test sample is 70% male, then that may impact the validity of your field test results. It will be up to you to decide which factors are relevant (e.g., sex, ethnicity, age, level of education, location, level of experience). You can use Questionmark’s authoring tools and reports to deliver and analyze these survey results.
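As a rough illustration of the 60/40 example above, here is a minimal sketch of a chi-square goodness-of-fit check that compares a field test sample against known population proportions. The function name, the sample size of 200, and the choice of a chi-square test are my own assumptions for illustration; this is not a feature of Questionmark's tools, and your organization may prefer a different method.

```python
def chi_square_gof(sample_counts, population_props):
    """Chi-square goodness-of-fit statistic: sample counts vs. population proportions."""
    n = sum(sample_counts.values())
    stat = 0.0
    for group, prop in population_props.items():
        expected = n * prop                    # count we would expect in a representative sample
        observed = sample_counts.get(group, 0)
        stat += (observed - expected) ** 2 / expected
    return stat


# Population proportions from the example in the text.
population = {"female": 0.60, "male": 0.40}

# Hypothetical field test sample of 200 volunteers, 70% male.
sample = {"female": 60, "male": 140}

stat = chi_square_gof(sample, population)

# Critical value for 1 degree of freedom at alpha = 0.05.
CRITICAL_DF1_05 = 3.841

print(f"chi-square = {stat:.1f}; sample differs from population: {stat > CRITICAL_DF1_05}")
```

A statistic above the critical value suggests the sample does not match the population on that factor, in which case you might resample or recruit more participants from the underrepresented group.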

You will also need to entice participants to take your field test. Most people will not want to take a test if they do not have to, but you will likely want to conduct the field test expeditiously. You may want to offer an incentive to test, but that incentive should not bias the results.

For example, I worked on a certification assessment where the assessment cost participants several hundred dollars. To incentivize participation in the field test study of multiple new forms, we offered the assessment free of charge and told participants that their results would be scored once the final forms were assembled. We surveyed volunteers and selected a representative sample to field test each of the forms.

The number of responses you need for each item will depend on your scoring model and your organization’s policies. If using Classical Test Theory, some organizations will feel comfortable with 80 – 100 responses, but Item Response Theory models may require 200 – 500 responses to yield stable item parameters.

More is always better, but it is not always possible. For instance, if an assessment is for a very small population, you may not have very many field test participants. You will still be able to use the item statistics, but they should be interpreted cautiously in conjunction with their standard errors. In the next post, we will talk about interpreting item statistics in the psychometric review following the field test.
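To see why small samples call for caution, consider the standard error of a classical item difficulty (the proportion of participants answering correctly). The sketch below uses the usual binomial standard error formula, sqrt(p(1 - p) / n); the difficulty of 0.70 and the sample sizes are made-up values for illustration only.

```python
import math


def difficulty_standard_error(p, n):
    """Standard error of a proportion-correct item difficulty p based on n responses."""
    return math.sqrt(p * (1 - p) / n)


# Hypothetical item with an observed difficulty of 0.70 at several sample sizes.
for n in (30, 100, 500):
    se = difficulty_standard_error(0.70, n)
    low, high = 0.70 - 1.96 * se, 0.70 + 1.96 * se  # approximate 95% interval
    print(f"n={n:>3}: SE={se:.3f}, approx. 95% interval = ({low:.2f}, {high:.2f})")
```

The interval narrows as the number of responses grows, which is one way to judge how much weight an item statistic from a small field test can bear.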

Field Test Studies: Taking your items for a test drive

Posted by Austin Fossey

In large-scale assessment, a significant amount of work goes into writing items before a participant ever sees them. Items are drafted, edited, reviewed for accuracy, checked for bias, and usually rewritten several times before they are ready to be deployed. Despite all this work, a true test of an item’s performance will come when it is first delivered to participants.

Even though we work so hard to write high-quality items, some bad items may slip past our review committees. To be safe, most large-scale assessment programs will try out their items with a field test.

A field test delivers items to participants under the same conditions used in live testing, but the items do not count toward the participants’ scores. This allows test developers and psychometricians to harvest statistics that can be used in an item analysis to flag poorly performing items.
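For readers who want a feel for what those statistics look like, here is a minimal sketch of two common classical item statistics for a single embedded field test item: difficulty (proportion correct) and discrimination (the point-biserial correlation between the item score and the operational score). The data and function names are invented for illustration, and a real item analysis report will include more than this.

```python
import math


def pearson(x, y):
    """Pearson correlation; with a 0/1 item score this is the point-biserial."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return float("nan")  # undefined when everyone gets the same score
    return cov / math.sqrt(var_x * var_y)


def item_analysis(item_scores, operational_scores):
    """Return (difficulty, discrimination) for one field test item."""
    difficulty = sum(item_scores) / len(item_scores)
    discrimination = pearson(item_scores, operational_scores)
    return difficulty, discrimination


# Hypothetical data: 1 = correct, 0 = incorrect on the field test item,
# paired with each participant's score on the operational (scored) items.
field_item = [1, 1, 1, 0, 0, 1, 0, 1]
operational = [38, 35, 30, 22, 18, 33, 25, 29]

difficulty, discrimination = item_analysis(field_item, operational)
print(f"difficulty = {difficulty:.3f}, discrimination = {discrimination:.3f}")
```

Because the field test item does not contribute to the operational score, the correlation here is not inflated by the item being part of its own total.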

There are two methods for field testing items. The first method is to embed your new items into an assessment that is already operational. The field test items will not count against the participants’ scores, but the participants will not know which items are scored items and which items are field test items.

The second method is to give participants an assessment that includes only field test items. The participants will not receive a score at the end of the assessment since none of the items have yet been approved to be used for live scoring, though the form may be scored later once the final set of items has been approved for operational use.

In their chapter in Educational Measurement (4th ed.), Schmeiser and Welch explain that embedding the items into an operational assessment is generally preferred. When items are field tested in an operational assessment, participants are more motivated to perform well on the items. The item data are also collected while the operational assessment is being delivered, which can help improve the reliability of the item statistics.

When participants take an assessment that only consists of field test items, they may not be motivated to try as hard as they would in an operational assessment, especially if the assessment will not be scored. However, field testing a whole form’s worth of items will give you better content coverage so that you have more items that can be reviewed in the item analysis. If field testing an entire form, Schmeiser and Welch suggest using twice as many items as you will need for the operational form. Many items may need to be discarded or rewritten as a result of the item analysis, so you want to make sure you will still have enough to build an operational form at the end of the process.

Since the value of field testing items is to collect item statistics, it is also important to make sure that a representative sample of participants responds to the field test items. If the sample of participant responses is too small or not representative, then the item statistics may not be generalizable to the entire population.

Questionmark’s authoring solutions allow test developers to field test items by setting the item’s status to “Experimental.” The item will still be scored, and the statistics will be generated in the Item Analysis Report, but the item will not count toward the participant’s final score.


Setting an item’s status to “Experimental” in Questionmark Live so that it can be field tested.
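The general idea behind an experimental status can be sketched in a few lines: every item is scored so that its statistics can be collected, but only non-experimental items are summed into the final score. The class, field names, and status labels below are hypothetical and are not Questionmark's data model or API.

```python
from dataclasses import dataclass


@dataclass
class ItemResult:
    item_id: str
    status: str  # "Normal" or "Experimental" (labels assumed for illustration)
    points: int  # points earned on this item


def final_score(results):
    """Sum points from operational items only; experimental items are excluded."""
    return sum(r.points for r in results if r.status != "Experimental")


def points_for_item_analysis(results):
    """Every item, experimental or not, still contributes data for item analysis."""
    return {r.item_id: r.points for r in results}


# Hypothetical responses from one participant.
results = [
    ItemResult("Q1", "Normal", 1),
    ItemResult("Q2", "Normal", 0),
    ItemResult("Q3", "Experimental", 1),  # scored, but does not count
    ItemResult("Q4", "Normal", 1),
]

print("Final score:", final_score(results))
print("Item analysis data:", points_for_item_analysis(results))
```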