Item Development – Psychometric review

Posted by Austin Fossey

The final step in item development is the psychometric review. You have drafted the items, edited them, sent them past your review committee, and tried them out with your participants. The psychometric review will use item statistics to flag any items that may need to be removed before you build your assessment forms for production. It is common to look at statistics relating to difficulty, discrimination, and bias.

As with other steps in the item development process, you should assemble an independent, representative, qualified group of subject matter experts (SMEs) to review the items alongside their statistics. If you are short on time, you may want them to review only the items with statistical flags. Their job is to figure out what is wrong with items that return poor statistics.

Difficulty – Items that are too hard or too easy are often not desirable on criterion-referenced assessments because they do not discriminate well. However, they may be desirable for norm-referenced assessments or aptitude tests where you want to accurately measure a wide spectrum of ability.

If using classical test theory, you will flag items based on their p-value (difficulty index). Remember that lower values are harder items, and higher values are easier items. I typically flag anything over 0.90 or 0.95 and anything under 0.25 or 0.20, but others have different preferences.
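As a minimal sketch of this flagging step, the Python below computes a p-value for each item from a matrix of 0/1 scores and labels it against cutoffs. The function names, the sample data, and the default cutoffs of 0.20 and 0.90 are illustrative assumptions, not a standard; swap in whatever thresholds your program prefers.

```python
def p_values(responses):
    """Return the proportion of participants answering each item correctly.

    `responses` is a list of rows, one per participant, each a list of
    0/1 item scores (1 = correct).
    """
    n = len(responses)
    n_items = len(responses[0])
    return [sum(row[i] for row in responses) / n for i in range(n_items)]


def flag_difficulty(p, too_hard=0.20, too_easy=0.90):
    """Label an item by its p-value using illustrative (assumed) cutoffs."""
    if p < too_hard:
        return "too hard"
    if p > too_easy:
        return "too easy"
    return "ok"


# Hypothetical scored data: 10 participants (rows), 3 items (columns).
scores = [
    [1, 0, 1],
    [1, 0, 1],
    [1, 0, 0],
    [1, 0, 1],
    [1, 1, 0],
    [1, 0, 1],
    [1, 0, 0],
    [1, 0, 1],
    [1, 0, 0],
    [1, 0, 1],
]

for item, p in enumerate(p_values(scores), start=1):
    print(f"Item {item}: p = {p:.2f} ({flag_difficulty(p)})")
```

With this made-up data, the first item (p = 1.00) is flagged as too easy, the second (p = 0.10) as too hard, and the third (p = 0.60) passes.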

If an item is flagged by its difficulty, there are several things to look for. If an item is too hard, it may be that the content has not been taught yet to your population, or it is obscure content. This may not be justification for removing the item from the assessment if it aligns well with your blueprint. However, it could also be that the item is confusing, mis-keyed, or has overlapping options, in which case you should consider removing it from the assessment before you go to production.

If an item is too easy, it may be that the population of participants has mastered this content, though it may still be relevant to the blueprint. You will need to make the decision about whether or not that item should remain. However, there could be other reasons an item is too easy, such as item cluing, poor distractors, identifiable key patterns, or compromised content. Again, in these scenarios you should consider removing the item before using it on a live form.

Discrimination – If an item does not discriminate well, it means that it does not help differentiate between high- and low-performing participants. These items do not add much to the information available in the assessment, and if they have negative discrimination values, they may actually be adding construct-irrelevant variance to your total scores.

If using classical test theory, you will flag your items based on their item-total discrimination (Pearson product-moment correlation) or their item-rest correlation (item-remainder correlation). The latter is most useful for short assessments (25 items or fewer), small sample sizes, or assessments with items weighted differently. I typically flag items with discrimination values below 0.20 or 0.15, but again, other test developers will have their own preferences.
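The two statistics can be sketched in a few lines of Python. This is an illustration under simple assumptions (dichotomous 0/1 scoring, equal item weights, no missing responses), and the function names and data are hypothetical. The item-rest correlation is the same Pearson correlation, computed against the total score with the item's own score removed.

```python
from math import sqrt


def pearson(x, y):
    """Pearson product-moment correlation between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return float("nan")  # undefined when one variable has no variance
    return cov / sqrt(var_x * var_y)


def discrimination(responses, item):
    """Return (item-total, item-rest) correlations for one item index."""
    item_scores = [row[item] for row in responses]
    totals = [sum(row) for row in responses]
    # Remove the item's own score so it does not correlate with itself.
    rests = [t - s for t, s in zip(totals, item_scores)]
    return pearson(item_scores, totals), pearson(item_scores, rests)


# Hypothetical scored data: 8 participants (rows), 4 items (columns).
scores = [
    [1, 1, 1, 0],
    [1, 1, 1, 0],
    [1, 1, 0, 1],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 1],
    [0, 0, 0, 1],
    [0, 0, 0, 1],
]

for i in range(len(scores[0])):
    total_r, rest_r = discrimination(scores, i)
    flag = "FLAG" if rest_r < 0.20 else "ok"  # 0.20 is an assumed cutoff
    print(f"Item {i + 1}: item-total = {total_r:.3f}, "
          f"item-rest = {rest_r:.3f} ({flag})")
```

On a form this short, the gap between the two values is large because each item makes up a quarter of the total score, which is why the item-rest version is the safer choice for short assessments.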

If an item is flagged for discrimination, it may have some of the same issues that cause problems with item difficulty, such as a mis-keyed response or overlapping options. Easy items and difficult items will also tend to have lower discrimination values due to the lack of variance in the item scores. There may be other issues impacting discrimination, such as when high-performing participants overthink an item and end up getting it wrong more often than lower-performing participants.
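The point about variance can be made concrete. For a dichotomously scored item, the variance of the item scores is p(1 - p), which peaks at p = 0.50 and shrinks toward zero as the item becomes very easy or very hard. The short Python sketch below (the function name is illustrative) shows how quickly that happens:

```python
def item_variance(p):
    """Variance of a dichotomously scored (0/1) item with p-value p."""
    return p * (1 - p)


for p in (0.50, 0.80, 0.95):
    print(f"p = {p:.2f} -> variance = {item_variance(p):.4f}")
```

An item with p = 0.95 has less than a fifth of the score variance of an item with p = 0.50, so it has far less room to covary with the total score.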

Statistical Bias – In earlier posts, we talked about using differential item functioning (DIF) to identify statistical bias in items. Recall that DIF compares groups of participants who have been matched on ability, so you cannot determine statistical bias just by comparing classical item statistics, such as p-values, between groups. The matching can come from an item response theory (IRT) model or from observed total scores, as in the Mantel-Haenszel procedure and logistic regression. Logistic regression can be used to identify both uniform and non-uniform DIF. DIF software will typically classify DIF effect sizes as A, B, or C. If possible, review any item flagged with DIF, but if there are too many items or you are short on time, you may want to focus on the items that fall into categories B or C.
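To illustrate where A, B, and C labels can come from, here is a rough Python sketch. Note that it uses the Mantel-Haenszel procedure rather than logistic regression, because Mantel-Haenszel needs only counts and no model-fitting library, and it detects uniform DIF only. Participants are stratified by a matching score, a common odds ratio is pooled across strata, and the result is converted to the delta scale (MH D-DIF = -2.35 ln α). The labeling here is simplified to effect size alone; operational A/B/C rules also include significance tests, which this sketch omits. All names and counts are hypothetical.

```python
from collections import defaultdict
from math import log


def mantel_haenszel_delta(records):
    """Return the MH D-DIF statistic for one item.

    `records` is a list of (group, matching_score, item_score) tuples, where
    group is "ref" or "focal", matching_score is the stratifying variable
    (for example total score), and item_score is 0 or 1.
    """
    # Counts per stratum: [ref right, ref wrong, focal right, focal wrong]
    strata = defaultdict(lambda: [0, 0, 0, 0])
    for group, score, correct in records:
        offset = 0 if group == "ref" else 2
        strata[score][offset + (0 if correct else 1)] += 1

    num = den = 0.0
    for a, b, c, d in strata.values():
        n = a + b + c + d
        num += a * d / n
        den += b * c / n
    if num == 0 or den == 0:
        return float("nan")  # odds ratio is undefined for this data
    alpha = num / den  # common odds ratio; > 1 favors the reference group
    return -2.35 * log(alpha)


def effect_size_category(delta):
    """Effect-size-only A/B/C label (significance tests are omitted)."""
    if delta != delta:  # NaN check
        return "undefined"
    size = abs(delta)
    if size < 1.0:
        return "A"
    if size < 1.5:
        return "B"
    return "C"


def expand(group, score, n_right, n_wrong):
    """Build individual records from counts (helper for the example)."""
    return [(group, score, 1)] * n_right + [(group, score, 0)] * n_wrong


# Hypothetical counts for one item in two matching-score strata.
records = (
    expand("ref", 1, 10, 10) + expand("focal", 1, 5, 15)
    + expand("ref", 2, 15, 5) + expand("focal", 2, 10, 10)
)

delta = mantel_haenszel_delta(records)
print(f"MH D-DIF = {delta:.2f}, category {effect_size_category(delta)}")
```

A negative delta means the item is harder for the focal group than for matched members of the reference group. In this made-up example the common odds ratio is 3, so the item lands in category C and would be a priority for SME review.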

DIF can occur from bias in the content or response process, which are the same issues your bias review committee was looking for. Sometimes DIF statistics help uncover content bias or response process bias that your bias review committee missed; however, you may have an item flagged for DIF, but no one can explain why it is performing differently between demographic groups. If you have a surplus of items, you may still want to discard these flagged items just to be safe, even if you are not sure why they are exhibiting bias.

Remember, not all items flagged in the psychometric review need to be removed. This is why you have your SMEs there. They will help determine whether there is a justification to keep an item on the assessment even though it may have poor item statistics. Nevertheless, expect to cull a lot of your flagged items before building your production forms.


Example of an item flagged for difficulty (p = 0.159) and discrimination (item-total correlation = 0.088). Answer option information table shows that this item was likely mis-keyed.
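An answer option table of this kind can be approximated with a short script. The sketch below is a simple heuristic, not the report shown in the example: for each option it computes the proportion of participants who selected it and the mean total score of those participants, then lists any distractor whose selectors outscore the participants who chose the keyed answer. The function names and data are hypothetical.

```python
from collections import defaultdict


def option_table(choices, totals):
    """Summarize each answer option for one item.

    `choices` holds the option each participant selected and `totals` holds
    the same participants' total test scores, in the same order. Returns
    {option: (proportion selecting, mean total score of selectors)}.
    """
    grouped = defaultdict(list)
    for choice, total in zip(choices, totals):
        grouped[choice].append(total)
    n = len(choices)
    return {
        option: (len(scores) / n, sum(scores) / len(scores))
        for option, scores in grouped.items()
    }


def suspect_miskey(table, key):
    """Return distractors whose selectors outscore those who chose the key."""
    if key not in table:
        return sorted(table)  # nobody chose the key; review every option
    key_mean = table[key][1]
    return sorted(
        option
        for option, (_, mean_total) in table.items()
        if option != key and mean_total > key_mean
    )


# Hypothetical data: 10 participants, item keyed as "A".
choices = ["B", "B", "B", "B", "B", "B", "A", "A", "C", "D"]
totals = [18, 17, 16, 15, 14, 12, 8, 7, 7, 6]

table = option_table(choices, totals)
for option in sorted(table):
    proportion, mean_total = table[option]
    print(f"{option}: selected by {proportion:.0%}, mean total = {mean_total:.1f}")
print("Review as possible key:", suspect_miskey(table, "A"))
```

Here most participants, and the strongest ones, chose option B instead of the keyed option A, which is the pattern you would expect from a mis-keyed item. Your SMEs still need to confirm the correct answer; the statistics only point to where to look.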

