Item Development – Psychometric review

Austin FosseyPosted by Austin Fossey

The final step in item development is the psychometric review. You have drafted the items, edited them, sent them past your review committee, and tried them out with your participants. The psychometric review will use item statistics to flag any items that may need to be removed before you build your assessment forms for production. It is common to look at statistics relating to difficulty, discrimination, and bias.

As with other steps in the item development process, you should assemble an independent, representative, qualified group of subject matter experts (SMEs) to look at flagged items. If you are short on time, you may only want to have them review the items with statistical flags. Their job is to figure out what is wrong with items that return poor statistics.

Difficulty – Items that are too hard or too easy are often not desirable on criterion-referenced assessments because they do not discriminate well. However, they may be desirable for norm-referenced assessments or aptitude tests where you want to accurately measure a wide spectrum of ability.

If using classical test theory, you will flag items based on their p-value (difficulty index). Remember that lower values are harder items, and higher values are easier items. I typically flag anything over 0.90 or 0.95 and anything under 0.25 or 0.20, but others have different preferences.

If an item is flagged by its difficulty, there are several things to look for. If an item is too hard, it may be that the content has not been taught yet to your population, or it is obscure content. This may not be justification for removing the item from the assessment if it aligns well with your  blueprint. However, it could also be that the item is confusing, mis-keyed, or has overlapping options, in which case you should consider removing it from the assessment before you go to production.

If an item is too easy, it may be that the population of participants has mastered this content, though it may still be relevant to the blueprint. You will need to make the decision about whether or not that item should remain. However, there could be other reasons an item is too easy, such as item cluing, poor distractors, identifiable key patterns, or compromised content. Again, in these scenarios you should consider removing the item before using it on a live form.

Discrimination – If an item does not discriminate well, it means that it does not help differentiate between high- and low-performing participants. These items do not add much to the information available in the assessment, and if they have negative discrimination values, they may actually be adding construct-irrelevant variance to your total scores.

If using classical test theory, you will flag your items based on their item-total discrimination (Pearson product-moment correlation) or their item-rest correlation (item-remainder correlation). The latter is most useful for short assessments (25 items or fewer), small sample sizes, or assessments with items weighted differently. I typically flag items with discrimination values below 0.20 or 0.15, but again, other test developers will have their own preferences.

If an item is flagged for discrimination, it may have some of the same issues that cause problems with item difficulty, such as a mis-keyed response or overlapping options. Easy items and difficult items will also tend to have lower discrimination values due to the lack of variance in the item scores. There may be other issues impacting discrimination, such as when high-performing participants overthink an item and end up getting it wrong more often than lower-performing participants.

Statistical Bias – In earlier posts, we talked about using differential item functioning (DIF) to identify statistical bias in items. Recall that this can only be done with item response theory models (IRT), so you cannot use classical test theory statistics to determine statistical bias. Logistic regression can be used to identify both uniform and non-uniform DIF. DIF software will typically classify DIF effect sizes as A, B, or C. If possible, review any item flagged with DIF, but if there are too many items or you are short on time, you may want to focus on the items that fall into categories B or C.

DIF can occur from bias in the content or response process, which are the same issues your bias review committee was looking for. Sometimes DIF statistics help uncover content bias or response process bias that your bias review committee missed; however, you may have an item flagged for DIF, but no one can explain why it is performing differently between demographic groups. If you have a surplus of items, you may still want to discard these flagged items just to be safe, even if you are not sure why they are exhibiting bias.

Remember, not all items flagged in the psychometric review need to be removed. This is why you have your SMEs there. They will help determine whether there is a justification to keep an item on the assessment even though it may have poor item statistics. Nevertheless, expect to cull a lot of your flagged items before building your production forms.

psych review

Example of an item flagged for difficulty (p = 0.159) and discrimination (item-total correlation = 0.088). Answer option information table shows that this item was likely mis-keyed.


Item Analysis – Differential Item Functioning (DIF)

Austin Fossey-42Posted by Austin Fossey

We have talked about item difficulty and item discrimination in a classical test theory framework, and we have discussed how these indices can be used to flag items that may potentially affect the validity of the inferences we make about the assessment results. Another area of item performance that is often used for item retention decisions is item bias, commonly referred to as differential item functioning (DIF).

DIF studies are generally implemented to see how the performances of two groups compare on a single item in an assessment (though studies can be done with more than two groups). One group is typically referred to as the reference group, and the other is the focal group. The focal group is the group that theory or previous research suggests may be disadvantaged by the item.

One simple method that some practitioners use is based on the four-fifths rule, which is detailed in the Uniform Guidelines for Employee Selection Procedures. This method involves comparing the correct response rates (p values) for the two groups. If the ratio of the smaller p value to the higher p value is less than 0.80, then the item may be adversely impacting the group with the lower p value. For example, if 50% of males answer an item correctly and 75% of females answer an item correctly, then 0.50/0.75 = 0.66 < 0.80, so we may be concerned that the item is adversely affecting the response patterns of males.

The four-fifths rule is attractive because it is easy to calculate, but it is prone to sampling error and misinterpretation. Continuing with our example, what if the population of males on average actually knows less about the content than the population of females? Then we would expect to see large differences in p values for the two groups because this reflects the actual differences in ability in the population.

In Differential Item Functioning (eds. Holland & Wainer, 1993), Angoff explains that DIF is occurring when an item displays different statistical properties between groups after those groups have been matched on a measure of proficiency. To put it another way, we need to first account for differences in the groups’ abilities, and then see if there are still differences in the item performance.

There are many ways to investigate DIF while accounting for participants’ abilities, and your decision may be influenced by whether or not you are using item response theory (IRT) for your student model, whether you have missing data, and whether or not the DIF is uniform or non-uniform.

Uniform DIF indicates that one group is (on average) always at a disadvantage when responding to the item. If we were to create item characteristic curves for the two groups, they would not intersect. Non-uniform DIF means that one group has an advantage for some proficiency levels, but is at a disadvantage at other proficiency levels. In this scenario, the two item characteristic curves would intersect.

Item Characteristic Curve

Item characteristic curves demonstrating examples of uniform and non-uniform DIF.

In my next post, I will introduce two common methods for detecting uniform and non-uniform DIF: the Mantel-Haenszel method and logistic regression. Unlike the four-fifths rule, these methods account for participants’ abilities (as represented by total scores) before making inferences about each group’s performance on an item.

Applications of confidence intervals in a psychometric context


Posted by Greg Pope

I have always been a fan of confidence intervals. Some people are fans of sports teams, for me, it’s confidence intervals! I find them really useful in assessment reporting contexts, all the way from item and test analysis psychometrics to participant reports.

Many of us get exposure to the practical use of confidence intervals via the media, when survey results are quoted. For example: “Of the 1,000 people surveyed, 55% said they will vote for John Doe. The margin of error for the survey was plus or minus 5% 95 times out of 100.” This is saying that the “observed” percentage of people who  say they will vote for Mr. Doe is 55% and there is a 95% chance that the “true” percentage of people who will vote for John Doe is somewhere between 50-60%.

Sample size is a big factor in the margin of error: generally, the larger the sample the smaller the margin of error as we get closer to representing the population.  (We can’t survey approximately all 307,006,550 people in the US now, can we!) So if the sample was 10,000 instead of 1,000 we would expect that the margin of error would be smaller than plus or minus 5%.

These concepts are relevant in an assessment context as well. You may remember my previous post on Classical Test Theory and reliability in which I explained that an observed test score (the score a participant achieves on an assessment) is composed of a true score and error. In other words, the observed score that a participant achieves is not 100% accurate; there is always error in the measurement. What this means practically is that if a participant achieves 50% on an exam their true score could actually be somewhere between say 44% and 56%.

This notion that observed scores are not absolute has implications for verifying what participants know and can do. For example, a  participant who achieves 50% on a crane certification exam (on which the pass score is 50%) would pass the exam and be able to hop into a crane, moving stuff up and down and around. However, achieving a score right on the borderline means this person may not, in fact, know enough to pass the exam if he or she were to take it again and then be certified on crane operation. His/her supervisor might not feel very confident about letting this person operate that crane!

To deal with the inherent uncertainty around observed scores, some organizations factor this margin of error in when setting the cut score…but this is another fun topic that I touched on in another post. I believe a best practice is to incorporate a confidence interval into the reporting of scores for participants in order to recognize that the score is not an “absolute truth” and is an estimate of what a person knows and can do. A simple example of a participant report I created to demonstrate this shows a diamond that encapsulates the participant score; the vertical height of the diamond represents the confidence interval around the participant’s score.

In some of my previous posts I talked about how sample size affects the robustness of item level statistics like p-values and item-total correlation coefficients and provided graphics showing the confidence interval ranges for the statistics based on sample sizes. I believe confidence intervals are also very useful in this psychometric context of evaluating the performance of items and tests. For example, often when we see a p-value for a question of 0.600 we incorrectly accept this as the “truth” that 60% of participants got the question right. In actual fact, this p-value of 0.600 is an observation and the “true” p-value could actually be between 0.500 and 0.700, a big difference when we are carefully choosing questions to shape our assessment!

With the holiday season fast approaching, perhaps Santa has a confidence interval in his sack for you and your organization to apply to your assessment results reporting and analysis!

Should I include really easy or really hard questions on my assessments?


Posted by Greg Pope

I thought it might be fun to discuss something that many people have asked me about over the years: “Should I include really easy or really hard questions on my assessments?” It is difficult to provide a simple “Yes” or “No” answer because, as with so many things in testing, it depends! However, I can provide some food for thought that may help you when building your assessments.

We can define easy questions as those with high p-values (item difficulty statistics) such as 0.9 to 1.0 (90-100% of participants answer the question correctly). We can define hard questions as those with low p-values such as 0.15 to 0 (15-0% answer the question correctly). These ranges are fairly arbitrary: some organizations in some contexts may consider greater than 0.8 easy and less than 0.25 difficult.

When considering how easy or difficult questions should be, start by asking, “What is the purpose of the assessment program and the assessments being developed?” If the purpose of an assessment is to provide a knowledge check and facilitate learning during a course, then maybe a short formative quiz would be appropriate. In this case, one can be fairly flexible in selecting questions to include on the quiz. Having some easier and harder questions is probably just fine. If the purpose of an assessment is to measure a participant’s ability to process information quickly and accurately under duress, then a speed test would likely be appropriate. In that case, a large number of low-difficulty questions should be included on the assessment.

However, in many common situations having very difficult or very easy questions on an assessment may not make a great deal of sense. For a criterion referenced example, if the purpose of an assessment is to certify participants as knowledgeable and skilful enough to do a certain job competently (e.g., crane operation), the difficulty of questions  would need careful scrutiny. The exam may have a cut score that participants need to achieve in order to be considered good enough (e.g., 60+%). Here are a few reasons why having many very easy or very hard questions on this type of assessment may not make sense:

Very easy items won’t contribute a great deal to the measurement of the construct

A very easy item that almost every participant gets right doesn’t tell us a great deal about what the participant knows and can do. A question like: “Cranes are big. Yes/No” doesn’t tell us a great deal about whether someone has the knowledge or skills to operate a crane. Very easy questions, in this context, are almost like “give-away” questions that contribute virtually nothing to the measurement of the construct. One would get almost the same measurement information (or lack thereof) from asking a question like “What is your shoe size?” because everyone (or mostly everyone) would get it correct.

Tricky to balance blueprint

Assessment construction generally requires following a blueprint that needs to be balanced in terms of question content, difficulty, and other factors. It is often very difficult to balance these blueprints for all factors, and using extreme questions makes this all the more challenging because there are generally more questions available that are of average rather than extreme difficulty.

Potentially not enough questions providing information near the cut score

In a criterion referenced exam with a cut score of 60% one would want the most measurement information in the exam near this cut score. What do I mean by this? Well, questions with p-values around 0.60 will provide the most information regarding whether participants just have the knowledge and skills to pass or just don’t have the knowledge and skills to pass. This topic requires a more detailed look at assessment development techniques that I will elaborate on soon in an upcoming blog post!

Effect of question difficulty on question discrimination

The difficulty of questions affects the discrimination (item-total correlation) statistics of the question. Extremely easy or extremely hard questions have a harder time obtaining those high discrimination statistics that we look for. In the graph below, I show the relationship between question difficulty p-values and item-total correlation discrimination statistics. Notice that the questions (the little diamonds) that have very low and very high p-values also have very low discrimination statistics and those around 0.5 have the highest discrimination statistics.

Psychometrics 101: How do I know if an assessment is reliable? (Part 3)


Posted by Greg Pope

Following up from my posts last week on reliability I thought I would finish up on this theme by explaining the internal consistency reliability measure: Cronbach’s Alpha.

Cronbach’s Alpha produces the same results as the Kuder-Richardson Formula 20 (KR-20) internal consistency reliability for dichotomously scored questions (right/wrong, 1/0), but  Cronbach’s Alpha  also allows for the analysis of polytomously scored questions (partial credit, 0 to 5). This is why Questionmark products (e.g., Test Analysis Report, RMS) use Cronbach’s Alpha rather than KR-20.

People sometimes ask me about KR-21. This is a quick and dirty reliability estimate formula that almost always produces lower values than KR-20. KR-21 assumes that all questions have equal difficulty (p-value) to make hand calculations easier. This assumption of all questions having the same difficulty is usually not very close to reality where questions on an assessment generally have a range of difficulty. This is why few people in the industry use KR-21 over KR-20 or Cronbach’s Alpha.

My colleagues and I generally recommend that Cronbach’s Alpha values of 0.90 or greater are excellent and acceptable for high-stakes tests, while values of 0.7 to 0.90 are considered to be acceptable/good and appropriate for medium-stakes tests. Generally values below 0.5 are considered unacceptable. With this said, in low stakes testing situations it may not be possible to obtain high internal consistency reliability coefficient values. In this context one might be better off evaluating the performance of an assessment on an item-by-item basis rather than focusing on the overall assessment reliability value.