Item Development Tips For Defensible Assessments

Posted by Julie Delazyn

Whether you work with low-stakes assessments, small-scale classroom assessments, or large-scale, high-stakes assessments, understanding and applying some basic principles of item development will greatly enhance the quality of your results.

What began as a popular 11-part blog series has morphed into a white paper: Managing Item Development for Large-Scale Assessment, which offers sound advice on how to organize and execute item development steps that will help you create defensible assessments. These steps include: [image: item development steps]

You can download your copy of the complimentary white paper here: Managing Item Development for Large-Scale Assessment

Item analysis: Selecting items for the test form – Part 2

Posted by Austin Fossey

In my last post, I talked about how item discrimination is the primary statistic used for item selection in classical test theory (CTT). In this post, I will share an example from my item analysis webinar.

The assessment below is fake, so there’s no need to write in comments telling me that the questions could be written differently or that the test is too short or that there is not good domain representation or that I should be banished to an island.

Interested in learning more about item analysis? I will be presenting a series of workshops on this topic at the Questionmark Conference 2016: Shaping the Future of Assessment in Miami, April 12-15.

In this example, we have field tested 16 items and collected item statistics from a representative sample of 1,000 participants. In this hypothetical scenario, we have been asked to create an assessment that has 11 items instead of 16. We will begin by looking at the item discrimination statistics.

Since this test has fewer than 25 items, we will look at the item-rest correlation discrimination. The screenshot below shows the first five items from the summary table in Questionmark’s Item Analysis Report (I have omitted some columns to help display the table within the blog).

[Screenshot: first five items from the summary table in Questionmark’s Item Analysis Report]
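For readers who want to see what an item-rest correlation measures, here is a minimal Python sketch. It correlates each item's 0/1 score with the total score on the remaining items. The tiny response matrix and the function names are made up for illustration, and this is not a description of how Questionmark computes the statistic.

```python
from statistics import mean, pstdev

def pearson(x, y):
    """Pearson correlation of two equal-length lists (NaN if either is constant)."""
    mx, my = mean(x), mean(y)
    sx, sy = pstdev(x), pstdev(y)
    if sx == 0 or sy == 0:
        return float("nan")
    cov = mean([(a - mx) * (b - my) for a, b in zip(x, y)])
    return cov / (sx * sy)

def item_rest_correlations(scores):
    """scores: one row per participant, one 0/1 column per item."""
    n_items = len(scores[0])
    result = []
    for i in range(n_items):
        item = [row[i] for row in scores]
        # "Rest" score: the participant's total with this item left out.
        rest = [sum(row) - row[i] for row in scores]
        result.append(pearson(item, rest))
    return result

# Hypothetical data: 4 participants, 4 items.
# The last item tends to be answered correctly by low scorers.
scores = [
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 1],
    [0, 0, 0, 1],
]
correlations = item_rest_correlations(scores)
for number, r in enumerate(correlations, start=1):
    print(f"Item {number}: item-rest correlation = {r:.2f}")
```

In this toy data the last item comes out with a negative value, which is the kind of pattern that would flag an item for review.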

The test’s reliability (as measured by Cronbach’s Alpha) for all 16 items is 0.58. Note that one would typically need a reliability value of at least 0.70 for low-stakes assessments and a value of 0.90 or higher for high-stakes assessments. When reliability is too low, adding extra items can often help improve the reliability, but removing items with poor discrimination can also improve reliability.
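As a rough illustration of the "adding extra items" route, the Spearman-Brown prophecy formula (a standard CTT result that this post does not itself use) estimates the reliability of a lengthened test. The sketch below assumes the added items behave like the existing ones, which is a strong assumption for a real item pool.

```python
import math

def spearman_brown(reliability, length_factor):
    """Predicted reliability when the test is made `length_factor` times as long."""
    return length_factor * reliability / (1 + (length_factor - 1) * reliability)

def length_factor_needed(reliability, target):
    """How many times longer the test must be to reach the target reliability."""
    return target * (1 - reliability) / (reliability * (1 - target))

current_items = 16
current_alpha = 0.58   # value reported in the post
target_alpha = 0.70    # low-stakes threshold mentioned in the post

k = length_factor_needed(current_alpha, target_alpha)
items_needed = math.ceil(current_items * k)
print(f"Length factor: {k:.2f}, total items needed: {items_needed}")
```

Under these assumptions the form would need to grow to 28 comparable items just to reach 0.70, which is one reason removing weak items is worth considering first.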

If we remove the five items with the lowest item-rest correlation discrimination (items 9, 16, 2, 3, and 13 shown above), the remaining 11 items have an alpha value of 0.67. That is still not high enough for even low-stakes testing, but it illustrates how items with poor discrimination can lower the reliability of an assessment. Low reliability also increases the standard error of measurement, so by increasing the reliability of the assessment, we might also increase the accuracy of the scores.
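The effect of dropping a weak item can be sketched with the usual formula for Cronbach's alpha, together with the standard error of measurement computed as the score standard deviation times the square root of one minus the reliability. The response matrix below is invented for illustration and has nothing to do with the 16-item example. The helper names are hypothetical.

```python
from statistics import pvariance

def cronbach_alpha(scores):
    """scores: one row per participant, one column per item."""
    k = len(scores[0])
    item_vars = [pvariance([row[i] for row in scores]) for i in range(k)]
    total_var = pvariance([sum(row) for row in scores])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)

def standard_error_of_measurement(scores):
    """SEM = SD of total scores * sqrt(1 - reliability)."""
    total_sd = pvariance([sum(row) for row in scores]) ** 0.5
    return total_sd * (1 - cronbach_alpha(scores)) ** 0.5

def drop_items(scores, drop):
    """Return the matrix without the item columns whose indices are in `drop`."""
    return [[v for i, v in enumerate(row) if i not in drop] for row in scores]

# Hypothetical data: 6 participants, 4 items; the last item is mostly noise.
scores = [
    [1, 1, 1, 0],
    [1, 1, 1, 1],
    [1, 1, 0, 0],
    [1, 0, 0, 1],
    [0, 0, 0, 1],
    [0, 0, 0, 0],
]
alpha_all = cronbach_alpha(scores)
reduced = drop_items(scores, {3})
alpha_reduced = cronbach_alpha(reduced)
print(f"Alpha with all items: {alpha_all:.2f}")
print(f"Alpha without the weak item: {alpha_reduced:.2f}")
print(f"SEM without the weak item: {standard_error_of_measurement(reduced):.2f}")
```

In this toy data, alpha rises even though the form gets shorter, which mirrors the pattern described above.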

Notice that these five items have poor item-rest correlation statistics, yet four of those items have reasonable item difficulty indices (items 16, 2, 3, and 13). If we had made selection decisions based on item difficulty, we might have chosen to retain these items, though closer inspection would uncover some content issues, as I demonstrated during the item analysis webinar.

For example, consider item 3, which has a difficulty value of 0.418 and an item-rest correlation discrimination value of -0.02. The screenshot below shows the option analysis table from the item detail page of the report.

[Screenshot: option analysis table for item 3 from the item detail page of the report]

The option analysis table shows that, when asked about the easternmost state in the United States, many participants selected the key, “Maine,” but 43.3% of our top-performing participants (defined by the upper 27% of scores) selected “Alaska.” This suggests that some of the top-performing participants might be familiar with Pochnoi Point, a spot in Alaska’s Aleutian Islands that happens to sit on the other side of the 180th meridian. Sure, that is a technicality, but across the entire sample, 27.8% of the participants chose this option. This item clearly needs to be sent back for revision and clarification before we use it for scored delivery. If we had only looked at the item difficulty statistics, we might never have reviewed this item.
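An option analysis like this one boils down to tabulating which option each participant chose, once for the whole sample and once for the top scorers. The sketch below uses ten invented records (including a made-up "Florida" distractor) and a simplified cut that takes the top 27% by rank without special handling for ties. It is only meant to show the idea, not to reproduce the report.

```python
from collections import Counter

def option_shares(choices):
    """Proportion of participants selecting each option."""
    counts = Counter(choices)
    n = len(choices)
    return {option: counts[option] / n for option in counts}

def upper_group(records, fraction=0.27):
    """Top `fraction` of (total_score, choice) records, ranked by total score."""
    ranked = sorted(records, key=lambda record: record[0], reverse=True)
    size = max(1, round(len(ranked) * fraction))
    return ranked[:size]

# Hypothetical (total_score, chosen_option) records for one item.
records = [
    (15, "Alaska"), (14, "Maine"), (14, "Alaska"), (12, "Maine"), (11, "Maine"),
    (10, "Florida"), (9, "Maine"), (8, "Alaska"), (6, "Florida"), (5, "Maine"),
]
overall = option_shares([choice for _, choice in records])
top = option_shares([choice for _, choice in upper_group(records)])

for option in sorted(overall):
    print(f"{option}: overall {overall[option]:.0%}, upper group {top.get(option, 0):.0%}")
```

A distractor that is more popular in the upper group than in the sample as a whole, as "Alaska" is here, is a signal to review the item's wording or key.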

Item analysis: Selecting items for the test form – Part 1

Regular readers of our blog know that we ran an initial series on item analysis way back in the day, and then I did a second item analysis series building on that a couple of years ago, and then I discussed item analysis in our item development series, and then we had an amazing webinar about item analysis, and then I named my goldfish Item Analysis and wrote my senator requesting that our state bird be changed to an item analysis. So today, I would like to talk about . . . item analysis.

But don’t worry, this is actually a new topic for the blog.

Today, I am writing about the use of item statistics for item selection. I was surprised to find, from feedback we got from many of our webinar participants, that a lot of people do not look at their item statistics until after the test has been delivered. This is a great practice (so keep it up), but if you can try out the questions as unscored field test items before making your final test form, you can use the item analysis statistics to build a better instrument.

When building a test form, item statistics can help us in two ways.

  • They can help us identify items that are poorly written, miskeyed, or irrelevant to the construct.
  • They can help us select the items that will yield the most reliable instrument, and thus a more accurate score.

In the first half of the 20th century, it was a common belief that good test instruments should have a mix of easy, medium, and hard items, but this thinking began to change after two studies in 1952 by Fred Lord and by Lee Cronbach and Willard Warrington. These researchers (and others since) demonstrated that items with higher discrimination values create instruments whose total scores discriminate better among participants across all ability levels.

Sometimes easy and hard items are useful for measurement, such as in an adaptive aptitude test where we need to measure all abilities with similar precision. But in criterion-referenced assessments, we are often interested in correctly classifying those participants who should pass and those who should fail. If this is our goal, then the best test form will be one with a range of medium-difficulty items that also have high discrimination values.

Discrimination may be the primary statistic used for selecting items, but item reliability is also occasionally useful, as I explained in an earlier post. Item reliability can be used as a tie breaker when we need to choose between two items with the same discrimination, or it can be used to predict the reliability or score variance for a set of items that the test developer wants to use for a test form.
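To see why item reliability can predict score variance, here is a small sketch. It assumes that "item reliability" means the classical item reliability index, that is, the item's standard deviation multiplied by its item-total correlation. Under that assumption, the indices for a set of items add up to the standard deviation of the total score. The data and function name are invented for illustration.

```python
from statistics import mean, pstdev

def item_reliability_indices(scores):
    """Item SD times item-total correlation, for each item column."""
    totals = [sum(row) for row in scores]
    mean_total, sd_total = mean(totals), pstdev(totals)
    indices = []
    for i in range(len(scores[0])):
        item = [row[i] for row in scores]
        mean_item, sd_item = mean(item), pstdev(item)
        if sd_item == 0:
            # A constant item carries no variance, so its index is zero.
            indices.append(0.0)
            continue
        cov = mean([(a - mean_item) * (b - mean_total) for a, b in zip(item, totals)])
        item_total_r = cov / (sd_item * sd_total)
        indices.append(sd_item * item_total_r)
    return indices

# Hypothetical data: 6 participants, 4 items.
scores = [
    [1, 1, 1, 0],
    [1, 1, 1, 1],
    [1, 1, 0, 0],
    [1, 0, 0, 1],
    [0, 0, 0, 1],
    [0, 0, 0, 0],
]
indices = item_reliability_indices(scores)
total_sd = pstdev([sum(row) for row in scores])
print("Item reliability indices:", [round(x, 3) for x in indices])
print(f"Sum of indices: {sum(indices):.3f}, SD of total scores: {total_sd:.3f}")
```

Note that the correlations must be recomputed against the new total when the item set changes, so this is a way of reasoning about a candidate form rather than a shortcut that avoids re-running the analysis.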

Difficulty is still useful for flagging items, though an item flagged for being too easy or too hard will often have a low discrimination value too. If an easy or hard item has good discrimination, it may be worth reviewing for item flaws or other factors that may have impacted the statistics (e.g., was it given at the end of a timed test that did not give participants enough time to respond carefully?).

In my next post, I will share an example from the webinar of how item selection using item discrimination improves the test form reliability, even though the test is shorter. I will also share an example of a flawed item that exhibits poor item statistics.

 

 

Question Type Report: Use Cases

Posted by Austin Fossey

A client recently asked me if there is a way to count the number of each type of item in their item bank, so I pointed them toward the Question Type Report in Questionmark Analytics. While this type of frequency data can also be easily pulled using our Results API, it can be useful to have a quick overview of the number of items (split out by item type) in the item bank.

The Question Type Report does not need to be run frequently (and Analytics usage stats reflect that observation), but the data can help indicate the robustness of an item bank.

This report is most valuable in situations involving topics for a specific assessment or set of related assessments. While it might be nice to know that we have a total of 15,000 multiple choice (MC) items in the item bank, these counts are trivial unless we have a system-wide practical application, such as planning a full program translation or selling content to a partner.

This report can provide a quick profile of the population of the item bank or a topic when needed, though more detailed item tracking by status, topic, metatags, item type, and exposure is advisable for anyone managing a large-scale item development project. Below are some potential use cases for this simple report.

Test Development and Maintenance:
The Question Type Report’s value is primarily its ability to count the number of each type of item within a topic. If we know we have 80 MC items in a topic for a new assessment, and they all need to be reviewed by a bias committee, then we can plan accordingly.

Form Building:
If we are equating multiple forms using a common-item design, the report can help us determine how many items go on each form and the degree to which the forms can overlap. Even if we only have one form, knowing the number of items can help a test developer check that enough items are available to match the blueprint.

Item Development:
If the report indicates that there are plenty of MC items ready for future publications, but we only have a handful of essay items to cover our existing assessment form, then we might instruct item writers to focus on developing new essay questions for the next publication of the assessment.

Example of a Question Type Report showing the frequency distribution by item type.
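If you already have an export of your items, the same kind of frequency table is easy to sketch yourself. The example below assumes a hypothetical list of item records with "topic" and "type" fields. It does not use any Questionmark API, and the field names and item types are placeholders.

```python
from collections import Counter, defaultdict

def count_types_by_topic(items):
    """Count items of each type within each topic."""
    counts = defaultdict(Counter)
    for item in items:
        counts[item["topic"]][item["type"]] += 1
    return counts

# Hypothetical export of item metadata.
items = [
    {"topic": "Geography", "type": "Multiple Choice"},
    {"topic": "Geography", "type": "Multiple Choice"},
    {"topic": "Geography", "type": "Essay"},
    {"topic": "History", "type": "Multiple Response"},
    {"topic": "History", "type": "Multiple Choice"},
]
counts = count_types_by_topic(items)
for topic in sorted(counts):
    for item_type, n in sorted(counts[topic].items()):
        print(f"{topic}: {item_type} = {n}")
```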

 

When to Give Partial Credit for Multiple-Response Items

Posted by Austin Fossey

Three different customers recently asked me how to decide between scoring a multiple-response (MR) item dichotomously or polytomously; i.e., when should an MR item be scored right/wrong, and when should we give partial credit? I gave some garrulous, rambling answers, so the challenge today is for me to explain this in a single blog post that I can share the next time it comes up.

In their chapter on multiple-choice and matching exercises in Educational Assessment of Students (5th ed.), Anthony Nitko and Susan Brookhart explain that matching items (which we may extend to include MR item formats, drag-and-drop formats, survey-matrix formats, etc.) are often a collection of single-response multiple choice (MC) items. The advantage of the MR format is that it saves space and you can leverage dependencies in the questions (e.g., relationships between responses) that might be redundant if broken into separate MC items.

Given that an MR item is often a set of individually scored MC items, a polytomously scored format almost always makes sense. From an interpretation standpoint, there are a couple of advantages for you as a test developer or instructor. First, you can differentiate between participants who know some of the answers and those who know none of the answers. This can improve the item discrimination. Second, you have more flexibility in how you choose to score and interpret the responses. In the drag-and-drop example below (a special form of an MR item), the participant has all of the dates wrong; however, the instructor may still be interested in knowing that the participant knows the correct order of events for the Stamp Act, the Townshend Act, and the Boston Massacre.

Example of a drag-and-drop item in Questionmark where the participant’s responses are wrong, but the order of responses is partially correct.

Are there exceptions? You know there are. This is why it is important to have a test blueprint document, which can help clarify which item formats to use and how they should be evaluated. Consider the following two variations of a learning objective on a hypothetical CPR test blueprint:

  • The participant can recall the actions that must be taken for an unresponsive victim requiring CPR.
  • The participant can recall all three actions that must be taken for an unresponsive victim requiring CPR.

The second example is likely the one that the test developer would use for the test blueprint. Why? Because someone who knows two of the three actions is not going to cut it. This is a rare all-or-nothing scenario where knowing some of the answers is essentially the same (from a qualifications standpoint) as knowing none of the answers. The language in this learning objective (“recall all three actions”) is an indicator to the test developer that if they use an MR item to assess this learning objective, they should score it dichotomously (no partial credit). The example below shows how one might design an item for this hypothetical learning objective with Questionmark’s authoring tools:

Example of a Questionmark authoring screen for MR item that is scored dichotomously (right/wrong).

To summarize, a test blueprint document is the best way to decide if an MR item (or variant) should be scored dichotomously or polytomously. If you do not have a test blueprint, think critically about what you are trying to measure and the interpretations you want reflected in the item score. Partial-credit scoring is desirable in most use cases, though there are occasional scenarios where an all-or-nothing scoring approach is needed—in which case the item can be scored strictly right/wrong. Finally, do not forget that you can score MR items differently within an assessment. Some MR items can be scored polytomously and others can be scored dichotomously on the same test, though it may be beneficial to notify participants when scoring rules differ for items that use the same format.
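The difference between the two scoring rules can be sketched in a few lines. The option labels below are placeholders rather than real CPR content, and the point values and the optional penalty for wrong selections are illustrative assumptions, not Questionmark's scoring settings.

```python
def score_dichotomous(selected, key, points=1):
    """All-or-nothing: full credit only if the selections match the key exactly."""
    return points if set(selected) == set(key) else 0

def score_partial(selected, key, points_per_correct=1, penalty_per_wrong=0):
    """Partial credit: points for each keyed option selected, minus any penalty."""
    selected, key = set(selected), set(key)
    raw = points_per_correct * len(selected & key) - penalty_per_wrong * len(selected - key)
    return max(0, raw)  # never drop below zero

# Hypothetical MR item with three keyed options and one distractor.
key = {"action_a", "action_b", "action_c"}
response = {"action_a", "action_b"}  # participant knows two of the three

dichotomous_score = score_dichotomous(response, key)
partial_score = score_partial(response, key)
print(f"Dichotomous: {dichotomous_score}, partial credit: {partial_score}")
```

A participant who knows two of the three keyed options earns nothing under the all-or-nothing rule and two points under the partial-credit rule, which is exactly the distinction the blueprint language should settle.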

If you are interested in understanding and applying some basic principles of item development and enhancing the quality of your results, download the free white paper written by Austin: Managing Item Development for Large-Scale Assessment

Item Development – Summary and Conclusions

Posted by Austin Fossey

This post concludes my series on item development in large-scale assessment. I’ve discussed some key processes in developing items, including drafting items, reviewing items, editing items, and conducting an item analysis. The goal of this process is to fine-tune a set of items so that test developers have an item pool from which they can build forms for scored assessment while being confident about the quality, reliability, and validity of the items. While the series covered a variety of topics, there are a couple of key themes that were relevant to almost every step.

First, documentation is critical, and even though it seems like extra work, it does pay off. Documenting your item development process helps keep things organized and helps you reproduce processes should you need to conduct development again. Documentation is also important for organization and accountability. As noted in the posts about content review and bias review, checklists can help ensure that committee members consider a minimal set of criteria for every item, but they also provide you with documentation of each committee member’s ratings should the item ever be challenged. All of this documentation can be thought of as validity evidence—it helps support your claims about the results and refute rebuttals about possible flaws in the assessment’s content.

The other key theme is the importance of recruiting qualified and representative subject matter experts (SMEs). SMEs should be qualified to participate in their assigned task, but diversity is also an important consideration. You may want to select item writers with a variety of experience levels, or content experts who have different backgrounds. Your bias review committee should be made up of experts who can help identify both content and response bias across the demographic areas that are pertinent to your population. Where possible, it is best to keep your SME groups independent so that you do not have the same people responsible for different parts of the development cycle. As always, be sure to document the relevant demographics and qualifications of your SMEs, even if you need to keep their identities anonymous.

This series is an introduction for organizing an item development cycle, but I encourage readers to refer to the resources mentioned in the articles for more information. This series also served as the basis for a session at the 2015 Questionmark Users Conference, which Questionmark customers can watch in the Premium section of the Learning Café.

You can link back to all of the posts in this series by clicking on the links below, and if you have any questions, please comment below!

Item Development – Managing the Process for Large-Scale Assessments

Item Development – Training Item Writers

Item Development – Five Tips for Organizing Your Drafting Process

Item Development – Benefits of editing items before the review process

Item Development – Organizing a content review committee (Part 1)

Item Development – Organizing a content review committee (Part 2)

Item Development – Organizing a bias review committee (Part 1)

Item Development – Organizing a bias review committee (Part 2)

Item Development – Conducting the final editorial review

Item Development – Planning your field test study

Item Development – Psychometric review