Item Development – Psychometric review

Posted by Austin Fossey

The final step in item development is the psychometric review. You have drafted the items, edited them, run them past your review committees, and tried them out with your participants. The psychometric review uses item statistics to flag any items that may need to be removed before you build your assessment forms for production. It is common to look at statistics relating to difficulty, discrimination, and bias.

As with other steps in the item development process, you should assemble an independent, representative, qualified group of subject matter experts (SMEs) to review the items. If you are short on time, you may want to have them review only the items with statistical flags. Their job is to figure out what is wrong with items that return poor statistics.

Difficulty – Items that are too hard or too easy are often not desirable on criterion-referenced assessments because they do not discriminate well. However, they may be desirable for norm-referenced assessments or aptitude tests where you want to accurately measure a wide spectrum of ability.

If using classical test theory, you will flag items based on their p-value (difficulty index). Remember that lower values are harder items, and higher values are easier items. I typically flag anything over 0.90 or 0.95 and anything under 0.25 or 0.20, but others have different preferences.
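For readers who like to see the arithmetic, here is a minimal sketch of p-value flagging in Python. The response matrix and cut-offs are invented for illustration; substitute your own thresholds.

```python
import numpy as np

def flag_difficulty(responses, easy_cut=0.90, hard_cut=0.20):
    """Flag items whose p-value (proportion correct) falls outside the bounds.

    responses: 0/1 item scores with shape (participants, items).
    Returns (item_index, p_value, reason) tuples for flagged items.
    """
    p_values = np.asarray(responses, dtype=float).mean(axis=0)
    flags = []
    for i, p in enumerate(p_values):
        if p > easy_cut:
            flags.append((i, round(p, 3), "too easy"))
        elif p < hard_cut:
            flags.append((i, round(p, 3), "too hard"))
    return flags

# Five participants answering three items (rows = participants, columns = items)
scores = [[1, 1, 0],
          [1, 1, 0],
          [1, 0, 0],
          [1, 1, 0],
          [1, 1, 0]]
print(flag_difficulty(scores))
# [(0, 1.0, 'too easy'), (2, 0.0, 'too hard')]
```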

If an item is flagged for difficulty, there are several things to look for. If an item is too hard, it may be that the content has not yet been taught to your population, or that the content is obscure. This is not necessarily justification for removing the item from the assessment if it aligns well with your blueprint. However, it could also be that the item is confusing, mis-keyed, or has overlapping options, in which case you should consider removing it from the assessment before you go to production.

If an item is too easy, it may be that the population of participants has mastered the content, though it may still be relevant to the blueprint. You will need to decide whether the item should remain. However, there could be other reasons an item is too easy, such as item cluing, poor distractors, identifiable key patterns, or compromised content. Again, in these scenarios you should consider removing the item before using it on a live form.

Discrimination – If an item does not discriminate well, it means that it does not help differentiate between high- and low-performing participants. These items do not add much to the information available in the assessment, and if they have negative discrimination values, they may actually be adding construct-irrelevant variance to your total scores.

If using classical test theory, you will flag your items based on their item-total discrimination (Pearson product-moment correlation) or their item-rest correlation (item-remainder correlation). The latter is most useful for short assessments (25 items or fewer), small sample sizes, or assessments with items weighted differently. I typically flag items with discrimination values below 0.20 or 0.15, but again, other test developers will have their own preferences.
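Here is a similar sketch for the item-rest correlation, again with an illustrative threshold and assuming a participants-by-items score matrix.

```python
import numpy as np

def item_rest_correlations(responses):
    """Item-rest (corrected item-total) correlation for each item.

    Each item score is correlated with the total of the *remaining* items,
    so an item does not inflate its own discrimination estimate.
    responses: score matrix with shape (participants, items).
    """
    X = np.asarray(responses, dtype=float)
    totals = X.sum(axis=1)
    return [float(np.corrcoef(X[:, i], totals - X[:, i])[0, 1])
            for i in range(X.shape[1])]

def flag_discrimination(responses, cut=0.20):
    """Return (item_index, correlation) for items below the chosen cut-off."""
    return [(i, round(r, 3))
            for i, r in enumerate(item_rest_correlations(responses))
            if r < cut]
```

Note that an item with no score variance will return an undefined (NaN) correlation from this sketch, which is one more reason to pair discrimination flags with the difficulty flags above.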

If an item is flagged for discrimination, it may have some of the same issues that cause problems with item difficulty, such as a mis-keyed response or overlapping options. Easy items and difficult items will also tend to have lower discrimination values due to the lack of variance in the item scores. There may be other issues impacting discrimination, such as when high-performing participants overthink an item and end up getting it wrong more often than lower-performing participants.

Statistical Bias – In earlier posts, we talked about using differential item functioning (DIF) to identify statistical bias in items. Recall that statistical bias cannot be detected with the classical difficulty and discrimination statistics described above; it requires dedicated DIF methods, such as those based on item response theory (IRT) models or logistic regression. Logistic regression can identify both uniform and non-uniform DIF. DIF software will typically classify DIF effect sizes as A, B, or C. If possible, review any item flagged for DIF, but if there are too many items or you are short on time, you may want to focus on the items that fall into categories B or C.
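As a rough illustration of the logistic regression approach, here is a bare-bones sketch that assumes dichotomous items and a binary group indicator. It reports likelihood-ratio tests only and does not reproduce the A/B/C effect-size classification rules used by DIF software.

```python
import numpy as np
import statsmodels.api as sm

def logistic_dif(item, total, group):
    """Screen one dichotomous item for DIF with the logistic regression
    procedure of Swaminathan and Rogers. `total` is the matching variable
    (e.g., total or rest score); `group` is a 0/1 group indicator.
    Adding the group term tests uniform DIF; adding the group-by-score
    interaction tests non-uniform DIF.
    """
    item, total, group = (np.asarray(v, dtype=float) for v in (item, total, group))
    base = sm.Logit(item, sm.add_constant(total)).fit(disp=0)
    uniform = sm.Logit(item, sm.add_constant(
        np.column_stack([total, group]))).fit(disp=0)
    nonuni = sm.Logit(item, sm.add_constant(
        np.column_stack([total, group, total * group]))).fit(disp=0)
    return {
        "uniform_LR_chi2": 2 * (uniform.llf - base.llf),       # 1 df
        "nonuniform_LR_chi2": 2 * (nonuni.llf - uniform.llf),  # 1 df
    }
```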

DIF can occur from bias in the content or response process, which are the same issues your bias review committee was looking for. Sometimes DIF statistics help uncover content bias or response process bias that your bias review committee missed; however, you may have an item flagged for DIF, but no one can explain why it is performing differently between demographic groups. If you have a surplus of items, you may still want to discard these flagged items just to be safe, even if you are not sure why they are exhibiting bias.

Remember, not all items flagged in the psychometric review need to be removed. This is why you have your SMEs there. They will help determine whether there is a justification to keep an item on the assessment even though it may have poor item statistics. Nevertheless, expect to cull a lot of your flagged items before building your production forms.


Example of an item flagged for difficulty (p = 0.159) and discrimination (item-total correlation = 0.088). The answer option information table shows that this item was likely mis-keyed.


Interact with your data: Looking forward to Napa

Posted by Steve Lay

It’s almost time for the Questionmark Users Conference, which this year is being held in Napa, California. As usual there’s plenty on the program for delegates interested in integration matters!

At last year’s conference we talked a lot about OData for Analytics (which I have also written about here: What is OData, and why is it important?). OData is a data standard originally created by Microsoft but now firmly embedded in the open standards community through a technical committee at OASIS. OASIS has taken on further development, resulting in the publication of the most recent version, OData 4.

This year we’ve built on our earlier work with the Results OData API to extend our adoption of OData to our delivery database, but there’s a difference. Whereas the Results OData API provides read-only access to your results data, the data exposed from our delivery system supports both read and write actions, allowing third-party integrations to interact with your data during the delivery process.

Why would you want to do that?

Some assessment delivery processes involve actions that take place outside the Questionmark system. The most obvious example is essay grading. Although the rubrics (the rules for scoring) are encoded in the Questionmark database, it takes a human being outside the system to follow those rules and assign marks to the participant. We already have a simple scoring tool built directly into Enterprise Manager, but for more complex scoring scenarios you’ll want to integrate with external marking tools.

The new Delivery OData API provides access to the data you need, allowing you to read a participant’s answers and write back the scores using a simple Unscored -> Saved -> Scored workflow. Once a score reaches the final status, the participant’s result is updated and the new scores will appear in future reports.
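To give a flavor of what such an integration might look like, here is a hypothetical sketch in Python. The service root, entity set, property names, and credentials below are placeholders invented for illustration, not the actual Delivery OData API schema; consult the API documentation for the real resource paths and authentication details.

```python
# Hypothetical sketch of an external scoring integration. "EssayAnswers",
# "Status", "Score", the service root, and the credentials are placeholders.
import requests

BASE = "https://example.questionmark.com/deliveryodata"  # placeholder service root
AUTH = ("api_user", "api_password")                       # placeholder credentials

# 1. Read unscored essay answers (standard OData query options).
answers = requests.get(
    f"{BASE}/EssayAnswers",
    params={"$filter": "Status eq 'Unscored'", "$top": "10"},
    auth=AUTH,
).json()["value"]

# 2. Score each answer and move it through the Unscored -> Saved -> Scored workflow.
for answer in answers:
    url = f"{BASE}/EssayAnswers({answer['Id']})"
    requests.patch(url, json={"Score": 4, "Status": "Saved"}, auth=AUTH)  # save the mark
    requests.patch(url, json={"Status": "Scored"}, auth=AUTH)             # finalize
```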

I’ll be teaming up with Austin Fossey, our product owner for reporting, and Howard Eisenberg, our head of Solution Services, to talk at the conference about Extending Your Platform, during which we’ll be covering these topics. I’m also delighted that colleagues from Rio Salado College will be talking about their own scoring tool, built right on top of the Delivery OData API.

I look forward to meeting you in Napa, but if you can’t make it this year, don’t worry: some of the sessions will be live-streamed. Click here to register so that we can send you your login info and directions. And you can always follow along on social media by following and tweeting @Questionmark.

Standard Setting: A Keystone to Legal Defensibility

Since the last Questionmark Users Conference, I have heard several clients discuss new measures at their companies requiring them to provide evidence of the legal defensibility of their assessments. Legal defensibility and validity are closely intertwined, but they are not synonymous. An assessment can be legally defensible, yet still have flaws that impact its validity. The distinction between the two is often the difference between how you developed the instrument and how well you developed it.

Regardless of whether you are concerned with legal defensibility or validity, careful attention should be paid to the evaluative component of your assessment program. What if someone asks, “What does this score mean?” How do you answer? How do you justify your response? The answers to these questions impact how your stakeholders will interpret and use the results, and this may have consequences for your participants. Many factors go into supporting the legal defensibility and validity of assessment results, but one could argue that the keystone is the standard-setting process.

Standard setting is the process of dividing score scales so that scores can be interpreted and acted upon (AERA, APA, NCME, 2014). The dividing points between sections of the scales are called “cut scores,” and in criterion-referenced assessment, they typically correspond to performance levels that are defined a priori. These cut scores and their corresponding performance levels help test users make the cognitive leap from a participant’s response pattern to what can be a complex inference about the participant’s knowledge, skills, and abilities.
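As a toy illustration of how cut scores partition a score scale, here is a short Python sketch. The cut scores and performance labels are invented for the example, not recommendations.

```python
from bisect import bisect_right

# Invented cut scores and performance levels for a 0-100 score scale.
CUT_SCORES = [50, 70, 85]
LEVELS = ["Below Basic", "Basic", "Proficient", "Advanced"]

def performance_level(score):
    """Map a scaled score to its performance level; a score at or above a
    cut score falls into the higher level."""
    return LEVELS[bisect_right(CUT_SCORES, score)]

print(performance_level(68))  # Basic
print(performance_level(85))  # Advanced
```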

In their chapter in Educational Measurement (4th Ed.), Hambleton and Pitoniak explain that standard-setting studies need to consider many factors, and that they also can have major implications for participants and test users. For this reason, standard-setting studies are often rigorous, well-documented projects.

At this year’s Questionmark Users Conference, I will be delivering a session that introduces the basics of standard setting. We will discuss standard-setting methods for criterion-referenced and norm-referenced assessments, and we will touch on methods used both in large-scale assessments and in classroom settings. This will be a useful session for anyone who is working on documenting the legal defensibility of their assessment program or who is planning their first standard-setting study and wants to learn about the different methods that are available. Participants are encouraged to bring their own questions and stories to share with the group.

Register today for the full conference, but if you cannot make it, make sure to catch the live webcast!

Item Development – Organizing a content review committee (Part 2)

Posted by Austin Fossey

In my last post, I explained the function of a content review committee and the importance of having a systematic review process. Today I’ll provide some suggestions for how you can use the content review process to simultaneously collect content validity evidence without having to do a lot of extra work.

If you want to get some extra mileage out of your content review committee, why not tack on a content validity study? Instead of asking them if an item has been assigned to the correct area of the specifications, ask them to each write down how they would have classified the item’s content. You can then see if topics picked by your content review committee correspond with the topics that your item writers assigned to the items.

There are several ways to conduct content validity studies, and a content validity study might not be sufficient evidence to support the overall validity of the assessment results. A full review of validity concepts is outside the scope of this article, but one way to check whether items match their intended topics is to have your committee members rate how well they think an item matches each topic on the specifications. A score of 1 means they think the item matches, a score of -1 means they think it does not match, and a score of 0 means that they are not sure.

If each committee member provides their own ratings, you can calculate the index of congruence, which was proposed by Richard Rovinelli and Ron Hambleton. You can then create a table of these indices to see whether the committee’s classifications correspond to the content classifications given by your item writers.
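For those who want to compute it themselves, here is a small Python sketch based on the commonly cited mean-difference form of the index; the ratings matrix is invented for illustration, and Rovinelli and Hambleton’s original work gives the full derivation.

```python
import numpy as np

def index_of_congruence(ratings):
    """Index of item-objective congruence for one item, following the commonly
    cited mean-difference form attributed to Rovinelli and Hambleton:

        I_k = N / (2N - 2) * (mean rating on topic k - mean rating over all topics)

    ratings: matrix of +1 / 0 / -1 judgments with shape (judges, topics).
    Returns one index per topic, ranging from -1 to +1.
    """
    R = np.asarray(ratings, dtype=float)
    n_topics = R.shape[1]
    return n_topics / (2 * n_topics - 2) * (R.mean(axis=0) - R.mean())

# Three committee members rating one item against four topics
ratings = [[1, -1,  0, -1],
           [1, -1, -1, -1],
           [1,  0, -1, -1]]
print(np.round(index_of_congruence(ratings), 2))
# [ 0.89 -0.22 -0.22 -0.44]
```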

The chart below compares item writers’ topic assignments for two items with the index of congruence determined by a content review committee’s ratings of those two items on an assessment with ten topics. We see that both groups agreed that Item 1 belonged to Topic 5 and Item 2 belonged to Topic 1. We also see that the content review committee was uncertain about whether Item 1 measured Topic 2, and that some committee members felt that Item 2 measured Topic 7.


Comparison of content review committee’s index of congruence and item writers’ classifications of two items on an assessment with ten topics.


Item Development – Five Tips for Organizing Your Drafting Process

Posted by Austin Fossey

Once you’ve trained your item writers, they are ready to begin drafting items. But how should you manage this step of the item development process?

There is an enormous amount of literature about item design and item writing techniques—which we will not cover in this series—but as Cynthia Shmeiser and Catherine Welch observe in their chapter in Educational Measurement (4th ed.), there is very little guidance about the item writing process. This is surprising, given that item writing is critical to effective test development.

It may be tempting to let your item writers loose in your authoring software with a copy of the test specifications and see what comes back, but if you invest time and effort in organizing your item drafting sessions, you are likely to retain more items and better support the validity of the results.

Here are five considerations for organizing item writing sessions:

  • Assignments – Shmeiser and Welch recommend giving each item writer a specific assignment to set expectations and to ensure that you build an item bank large enough to meet your test specifications. If possible, distribute assignments evenly so that no single author has undue influence over an entire area of your test specifications. Set realistic goals for your authors, keeping in mind that some of their items will likely be dropped later in item reviews.
  • Instructions – In the previous post, we mentioned the benefit of a style guide for keeping item formats consistent. You may also want to give item writers instructions or templates for specific item types, especially if you are working with complex item types. (You should already have defined the types of items that can be used to measure each area of your test specifications in advance.)
  • Monitoring – Monitor item writers’ progress and spot-check their work. This is not a time to engage in full-blown item reviews, but periodic checks can help you to provide feedback and correct misconceptions. You can also check in to make sure that the item writers are abiding by security policies and formatting guidelines. In some item writing workshops, I have also asked item writers to work in pairs to help check each other’s work.
  • Communication – With some item designs, several people may be involved in building the item. One team may be in charge of developing a scoring model, another team may draft content, and a third team may add resources or additional stimuli, like images or animations. These teams need to be organized so that materials are handed off on time, but they also need to be able to provide iterative feedback to each other. For example, if the content team finds a loophole in the scoring model, they need to be able to alert the other teams so that it can be resolved.
  • Be Prepared – Be sure to have a backup plan in case your item writing sessions hit a snag. Know what you are going to do if an item writer does not complete an assignment or if content is compromised.

Many of the details of the item drafting process will depend on your item types, resources, schedule, authoring software, and availability of item writers. Determine what you need to accomplish, and then organize your item writing sessions as much as possible so that you meet your goals.

In my next post, I will discuss the benefits of conducting an initial editorial review of the draft items before they are sent to review committees.

Item Development – Training Item Writers

Posted by Austin Fossey

Once we have defined the purpose of the assessment, completed our domain analysis, and finalized a test blueprint, we might be eager to jump right in to item writing, but there is one important step to take before we begin: training!

Unless you are writing the entire assessment yourself, you will need a group of item writers to develop the content. These item writers are likely experts in their fields, but they may have very little understanding of how to create assessment content. Even if these experts have experience writing items, it may be beneficial to provide refresher trainings, especially if anything has changed in your assessment design.

In their chapter in Educational Measurement (4th ed.), Cynthia Shmeiser and Catherine Welch note that it is important to consider the qualifications and representativeness of your item writers. It is common to ask item writers to fill out a brief survey to collect demographic information. You should keep these responses on file and possibly add a brief document explaining why you consider these item writers to be a qualified and representative sample.

Shmeiser and Welch also underscore the need for security. Item writers should be trained on your content security guidelines, and your organization may even ask them to sign an agreement stating that they will abide by those guidelines. Make sure everyone understands the security guidelines, and have a plan in place in case there are any violations.

Next, begin training your item writers on how to author items, which should include basic concepts about cognitive levels, drafting stems, picking distractors, and using specific item types appropriately. Shmeiser and Welch suggest that the test blueprint be used as the foundation of the training. Item writers should understand the content included in the specifications and the types of items they are expected to create for that content. Be sure to share examples of good and bad items.

If possible, ask your writers to create some practice items, then review their work and provide feedback. If they are using the item authoring software for the first time, be sure to acquaint them with the tools before they are given their item writing assignments.

Your item writers may also need training on your item data, delivery method, or scoring rules. For example, you may ask item writers to cite a reference for each item, or you might ask them to weight certain items differently. Your instructions need to be clear and precise, and you should spot-check your item writers’ work. If possible, write a style guide that includes clear guidelines about item construction, such as fonts to use, acceptable abbreviations, scoring rules, acceptable item types, et cetera.

I know from my own experience (and Shmeiser and Welch agree) that investing more time in training will have a big payoff down the line. Better training leads to substantially better item retention rates when items are reviewed. If your item writers are not trained well, you may end up throwing out many of their items, which may not leave you enough for your assessment design. Considering the cost of item development and the time spent writing and reviewing items, putting in a few more hours of training can equal big savings for your program in the long run.

In my next post, I will discuss how to manage your item writers as they begin the important work of drafting the items.