Item Development – Planning your field test study

Posted by Austin Fossey

Once the items have passed their final editorial review, they are ready to be delivered to participants, though not quite ready to be scored. For large-scale assessments, it is best practice to deliver your new items as unscored field test items so that you can gather item statistics for review before using the items to count toward a participant’s score. We discussed field test studies in an earlier post, but today we will focus more on the operational aspects of this task.

If you are embedding field test items, there is little you need to do to plan for the field test, other than to collect data on your participants to ensure representativeness and to make sure that enough participants respond to the item to yield stable statistics. You can collect data for representativeness by using demographic questions in Questionmark’s authoring tools.

If you are field testing an entire form, you will need to plan the study carefully. Schmeiser and Welch (Educational Measurement, 4th ed.) recommend field testing twice as many items as you will need for your operational form.

To check representativeness, you may want to survey your participants in advance to help you select your participant sample. For example, if your participant population is 60% female and 40% male, but your field test sample is 70% male, then that may impact the validity of your field test results. It will be up to you to decide which factors are relevant (e.g., sex, ethnicity, age, level of education, location, level of experience). You can use Questionmark’s authoring tools and reports to deliver and analyze these survey results.
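
As an illustration, here is a minimal sketch (not Questionmark code; the survey responses and population figures are made up) of comparing a field test sample against known population proportions:

```python
# A minimal sketch of checking how closely a field test sample matches
# known population proportions. All data here are hypothetical.
from collections import Counter

# Known (or estimated) population proportions for a relevant factor
population = {"female": 0.60, "male": 0.40}

# Self-reported survey responses from field test volunteers
sample_responses = ["male", "male", "female", "male", "female",
                    "male", "male", "female", "male", "male"]

counts = Counter(sample_responses)
total = len(sample_responses)

for group, pop_share in population.items():
    sample_share = counts[group] / total
    print(f"{group}: population {pop_share:.0%}, sample {sample_share:.0%}, "
          f"difference {sample_share - pop_share:+.0%}")
```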

You will also need to entice participants to take your field test. Most people will not want to take a test if they do not have to, but you will likely want to conduct the field test expeditiously. You may want to offer an incentive to test, but that incentive should not bias the results.

For example, I worked on a certification assessment where the assessment cost participants several hundred dollars. To incentivize participation in the field test study of multiple new forms, we offered the assessment free of charge and told participants that their results would be scored once the final forms were assembled. We surveyed volunteers and selected a representative sample to field test each of the forms.

The number of responses you need for each item will depend on your scoring model and your organization’s policies. If using Classical Test Theory, some organizations will feel comfortable with 80 – 100 responses, but Item Response Theory models may require 200 – 500 responses to yield stable item parameters.
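
One way to see why these numbers differ: under Classical Test Theory, the standard error of an item difficulty (p-value) estimate is sqrt(p(1 − p)/n), so it shrinks with the square root of the number of responses. A short illustration, using the sample sizes mentioned above:

```python
# Standard error of a classical item difficulty (p-value) estimate:
# SE = sqrt(p * (1 - p) / n). The value p = 0.50 is the worst case.
import math

def p_value_se(p: float, n: int) -> float:
    """Standard error of an observed proportion-correct estimate."""
    return math.sqrt(p * (1 - p) / n)

for n in (80, 100, 200, 500):
    print(f"n = {n:>3}: SE at p = 0.50 is {p_value_se(0.5, n):.3f}")
```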

More is always better, but it is not always possible. For instance, if an assessment is for a very small population, you may not have very many field test participants. You will still be able to use the item statistics, but they should be interpreted cautiously in conjunction with their standard errors. In the next post, we will talk about interpreting item statistics in the psychometric review following the field test.

Item Analysis Report Revisited

Posted by Austin Fossey

If you are a fanatical follower of our Questionmark blog, then you already know that we have written more than a dozen articles relating to item analysis in a Classical Test Theory framework. So you may ask, “Austin, why does Questionmark write so much about item analysis statistics? Don’t you ever get out?”

Item analysis statistics are some of the easiest-to-use indicators of item quality, and they are tools that any test developer should be using in their work. By helping people understand these tools, we can help them get the most out of our technologies. And yes, I do get out. I went out to get some coffee once last April.

So why are we writing about item analysis statistics again? Since publishing many of the original blog articles about item analysis, Questionmark has built a new version of the Item Analysis Report in Questionmark Analytics, adding filtering capabilities beyond those of the original Question Statistics Report in Enterprise Reporter.

In my upcoming posts, I will revisit the concepts of item difficulty, item-total score correlation, and high-low discrimination in the context of the Item Analysis Report in Analytics. I will also provide an overview of item reliability and how it would be used operationally in test development.


Screenshot of the Item Analysis Report (Summary View) in Questionmark Analytics

Using Questionmark’s OData API to Analyze Item Key Distribution

Posted by Austin Fossey

The Questionmark OData API, which offers flexible access to data for the creation of custom reports, can help you ensure the quality of your tests.

For instance, you can use OData to create a frequency table of item keys in a multiple choice assessment. This report shows the number of items that have the first choice as the correct answer, the number of items that have the second choice as the correct answer, et cetera.
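
As a rough sketch of what this might look like in Python: the endpoint URL, entity, and property names below are placeholders rather than the actual Questionmark OData schema, so you would substitute the entities and properties documented for your installation.

```python
# A minimal sketch of building an item key frequency table from an
# OData feed. The URL and field names are placeholders, not the actual
# Questionmark OData schema; consult the OData API documentation.
from collections import Counter

import requests

ODATA_URL = "https://example.com/odata/Questions"  # placeholder endpoint

response = requests.get(ODATA_URL, params={"$format": "json"},
                        auth=("username", "password"))
response.raise_for_status()
questions = response.json()["value"]  # OData JSON collections use "value"

# Count how often each choice position is the key (1st, 2nd, ...)
key_positions = Counter(q["CorrectChoicePosition"] for q in questions)

for position, count in sorted(key_positions.items()):
    print(f"Choice {position}: {count} items")
```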


Why do we care about how often each choice number is the item key? If there is a pattern in how correct choices are assigned, it may affect how participants perform on the test, and this can lead to construct-irrelevant variance; i.e., the scores are being affected by factors other than the participant’s knowledge and abilities.

Let’s say that our assessment has 50 items, and on 30 (60%) of those items the second choice is the item key. Now let’s put ourselves in the shoes of a qualified participant. Halfway through the assessment, we might start thinking, “Gosh, I just picked the second choice four times in a row. Maybe I should go back and check some of those answers.” Because of poor test design, we are second-guessing our answers. Even if we do not change our responses, time is being wasted and test anxiety is rising, which might negatively affect our responses later in the assessment.

The opposite problem may arise too: if an unqualified participant figures out that the second choice is most often the key, he or she may pick the second choice even without knowing the answer, resulting in an inflated score.

When looking at the distribution of keys across a selected-response assessment, we expect to see the keys spread evenly across the choices. For example, if we have a multiple choice assessment with four choices labeled A, B, C, and D in each item, we would like to see 25% of the keys assigned to each of these choices.
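
If you want to go beyond eyeballing the frequency table, a chi-square goodness-of-fit test can flag a key distribution that departs from a uniform split by more than chance alone would explain. A small sketch with made-up counts, matching the 50-item example above:

```python
# Chi-square goodness-of-fit test for key balance across four choices.
# The observed counts are made up for illustration.
from scipy.stats import chisquare

observed = [8, 30, 6, 6]  # items keyed A, B, C, D in a 50-item form
result = chisquare(observed)  # expected counts default to a uniform split

print(f"chi-square = {result.statistic:.2f}, p = {result.pvalue:.4f}")
# A small p-value suggests the keys are not evenly distributed.
```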


You do not have to limit your assessment research to this example. The beauty of OData is that you can access your data whenever you have a new question you would like to investigate. For example, instead of reviewing the frequencies of your keys, you may want to determine the ratio of the length of the key to the length of the other options (a common item writing mistake is to write keys that are noticeably longer than the distractors). You may also want to look for patterns in the keys’ text (e.g., 10 items all have “OData” as the correct choice).
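
For instance, the key-length check might look something like this sketch, where the item data structure is hypothetical:

```python
# A minimal sketch of flagging items whose key is noticeably longer
# than the average distractor. The item structure is hypothetical.
items = [
    {"id": "Q1", "key": "All of the above factors combined",
     "distractors": ["Cost", "Speed", "Accuracy"]},
    {"id": "Q2", "key": "Paris",
     "distractors": ["London", "Berlin", "Madrid"]},
]

THRESHOLD = 1.5  # flag keys more than 1.5x the mean distractor length

for item in items:
    lengths = [len(d) for d in item["distractors"]]
    ratio = len(item["key"]) / (sum(lengths) / len(lengths))
    if ratio > THRESHOLD:
        print(f"{item['id']}: key is {ratio:.1f}x the mean distractor length")
```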

Click here for more information about the OData API. (If you are a Questionmark software support plan customer, you can get step-by-step instructions for using OData to create the item key frequency table in the Premium Videos section of the Questionmark Learning Café.)