# Item Analysis Report – Item Difficulty Index

Posted by Austin Fossey

In classical test theory, a common item statistic is the item’s difficulty index, or “*p* value.” Given many psychometricians’ notoriously poor spelling, might this be due to thinking that “difficulty” starts with *p*?

Actually, the *p* stands for the proportion of participants who got the item correct. For example, if 100 participants answered the item, and 72 of them answered the item correctly, then the *p* value is 0.72. The *p* value can take on any value between 0.00 and 1.00. Higher values denote easier items (more people answered the item correctly), and lower values denote harder items (fewer people answered the item correctly).
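The calculation above is just a proportion. A minimal sketch in Python (the helper name is my own, not from any particular library):

```python
def p_value(num_correct: int, num_respondents: int) -> float:
    """Classical item difficulty: proportion of participants who answered correctly."""
    if num_respondents <= 0:
        raise ValueError("need at least one respondent")
    return num_correct / num_respondents

print(p_value(72, 100))  # 0.72 -- the example from the text
```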

Typically, test developers use this statistic as one indicator for detecting items that could be removed from delivery. They set thresholds for items that are too easy and too difficult, review them, and often remove them from the assessment.

Why throw out the easy and difficult items? Because they are not doing as much work for you. When calculating the item-total correlation (or “discrimination”) for unweighted items, Crocker and Algina (Introduction to Classical and Modern Test Theory) note that discrimination is maximized when *p* is near 0.50 (about half of the participants get it right).
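One way to see why mid-difficulty items carry the most information: a dichotomous item’s score variance is *p*(1 − *p*), which is largest at *p* = 0.50, and correlational discrimination statistics depend on that variance. A quick illustrative check:

```python
# Item-score variance for a dichotomous (0/1) item is p * (1 - p).
# It peaks at p = 0.50 -- one reason discrimination is maximized
# for items of middling difficulty.
variances = {p / 10: (p / 10) * (1 - p / 10) for p in range(1, 10)}
best = max(variances, key=variances.get)
print(best)  # 0.5
```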

Why is discrimination so low for easy and hard items? An easy item means that just about everyone gets it right, no matter how proficient they are in the domain, and a hard item means that just about everyone gets it wrong; in either case, the item does not discriminate well between high and low performers. (We will talk more about discrimination in subsequent posts.)

Sometimes you may still need to use a very easy or very difficult item on your test form. You may have a blueprint that requires a certain number of items from a given topic, and all of the available items might happen to be very easy or very hard. I also see this scenario in cases with non-compensatory scoring of a topic. For example, a simple driving test might ask, “Is it safe to drink and drive?” The question is very easy and will likely have a high *p* value, but the test developer may include it so that if a participant gets the item wrong, they automatically fail the entire assessment.

You may also want very easy or very hard items if you are using item response theory (IRT) to score an aptitude test, though it should be noted that item difficulty is modeled differently in an IRT framework. IRT yields standard errors of measurement that are conditional on the participant’s ability, so having hard and easy items can help produce better estimates of high- and low-performing participants’ abilities, respectively. This is different from classical test theory, where the standard error of measurement is the same for all observed scores on an assessment.

While simple to calculate, the *p* value requires cautious interpretation. As Crocker and Algina note, the *p* value is a function of the number of participants who know the answer to the item plus the number of participants who were able to correctly guess the answer to the item. In an open response item, that latter group is likely very small (absent any cluing in the assessment form), but in a typical multiple choice item, a number of participants may answer correctly, based on their best educated guess.
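The “knowers plus lucky guessers” decomposition can be sketched with a naive random-guessing model (this simple model is my own illustration of the point, assuming non-knowers guess uniformly among the options):

```python
def expected_p(p_know: float, num_options: int) -> float:
    """Expected p under a naive model: participants who know the answer
    get it right; everyone else guesses uniformly among the options."""
    return p_know + (1 - p_know) / num_options

# 60% of participants know the answer:
print(expected_p(0.60, 4))  # 0.70 -- a 4-option multiple choice inflates p
```

The same item delivered as an open response (effectively an enormous number of “options”) would show a *p* value much closer to the true proportion of knowers.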

Recall also that *p* values are statistics—measures from a sample. Your interpretation of a *p* value should be informed by your knowledge of the sample. For example, if you have delivered an assessment, but only advanced students have been scheduled to take it, then the *p* value will be higher than it might be when delivered to a more representative sample.

Since the *p* value is a statistic, we can calculate the standard error of that statistic to get a sense of how stable the statistic is. The standard error will decrease with larger sample sizes. In the example below, 500 participants responded to this item, and 284 participants answered the item correctly, so the *p* value is 284/500 = 0.568. The standard error of the statistic is ± 0.022. If these 500 participants were to answer this item over and over again (and no additional learning took place), we would expect the *p* value for this item to fall in the range of 0.568 ± 0.022 about 68% of the time.

Item *p* value and standard error of the statistic from Questionmark’s Item Analysis Report

If only half of the high group answers an item correctly and all of the low group answer it wrongly, what is the item discrimination of the item?

Hi Farhad,

Great question! It depends on which formula you use to calculate item discrimination. If you use high-low discrimination (which is not very common anymore), the discrimination would be 0.50 − 0.00 = 0.50: the proportion correct in the upper group minus the proportion correct in the lower group. If, however, you use a correlational discrimination statistic like the Pearson product-moment correlation or the item-rest correlation, the value will depend on the variance in the item scores and total scores, though I imagine it would be positive here as well. These correlational statistics are typically preferred for modern item analysis.

Thanks!

-Austin
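For readers who want to experiment, the high-low discrimination index discussed in this exchange is just a difference of two proportions (the helper name and example numbers are my own):

```python
def high_low_discrimination(p_upper: float, p_lower: float) -> float:
    """Index of discrimination: proportion correct in the upper score
    group minus proportion correct in the lower score group."""
    return p_upper - p_lower

print(high_low_discrimination(0.85, 0.55))  # 0.30
```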

On the topic of discrimination: for a dichotomous item, is there a rule of thumb regarding the minimum value of an item’s discrimination parameter?

Moreover, is there a sample size calculator to determine the minimum sample needed for calibrating items?

Hi Vince,

Yes, the rule of thumb was specified by Ebel (1965) in his book *Measuring Educational Achievement*. Ebel suggested that test developers accept discrimination values that are significantly greater than zero, and this is still the recommendation for most test developers today. Some people will also cut corners and just set a cutoff value, but this introduces the possibility of making a Type I or Type II error in the evaluation of the item’s discrimination. Nevertheless, this is an efficient way of making a first pass at the item review, and it may be sufficient for many practical applications. I have seen test developers use cutoff values of 0.15, 0.20, and 0.30. The value is a matter of preference and the test developer’s risk tolerance in terms of flagging the items.

In terms of sample size, if you are referring to sample size for an item discrimination analysis, there are tons of free calculators available online (just google “sample size calculator correlation”). If you use one of these, make sure you account for the fact that you are doing a one-tailed test, since we want to be able to determine whether an item’s discrimination is significantly greater than zero.

If you are referring to sample size for item calibration in an IRT model, the sample size will depend on the type of analysis the test developer is conducting, as well as the IRT model. de Ayala discusses this in his chapter “Joint Maximum Likelihood Parameter Estimation” in *The Theory and Practice of Item Response Theory*. He provides a good overview of the considerations that impact sample size and concludes that a “few hundred” participants is a good guideline for item calibration, but he cautions that some applications may need more or fewer participants.

-Austin
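The correlation sample-size calculation mentioned in the reply can be sketched with the standard Fisher z approximation (the function name and the alpha/power defaults are my own choices, not a specific calculator’s):

```python
import math
from statistics import NormalDist

def n_for_correlation(r: float, alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate sample size to detect a correlation of r with a
    one-tailed test, via the Fisher z transformation."""
    z_alpha = NormalDist().inv_cdf(1 - alpha)  # one-tailed critical value
    z_beta = NormalDist().inv_cdf(power)
    z_r = math.atanh(r)  # Fisher z of the target correlation
    return math.ceil(((z_alpha + z_beta) / z_r) ** 2 + 3)

print(n_for_correlation(0.20))  # on the order of 150 participants
```

Note how quickly the requirement grows as the target discrimination shrinks: detecting r = 0.15 needs roughly twice as many participants as r = 0.20.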

Dear Austin Fossey

Could you please let me know what software you are using to generate the report on item difficulty shown above in your post?

Thank you very much in advance

Ali

What if we are given the number of wrong answers, along with the upper and lower groups? How could it be done?

Hi Ali,

Of course! I am using Questionmark OnDemand, specifically the Item Analysis Report in Analytics.

Cheers,

Austin

Hi Yanna,

I am not sure I understand the proposed calculation. Perhaps you could provide an example?

In general, the proportion of wrong answers will just be 1.00 − *p*. We usually frame statistics in terms of correct answers because a correct answer represents positive evidence about a construct, whereas an incorrect answer usually does not represent anything in a unidimensional assessment.

Cheers,

Austin

Can I ask why they take the top 27% and the bottom 27% of the group for the test?

Hi Carlo,

Of course you can ask! That’s why we’re here!

The 27% score bands are based on a study by Truman Kelley that was published in 1939 called “Selection of upper and lower groups for the validation of test items.” Kelley had actually recommended the 27% grouping for the calculation of the index of discrimination (what is called “High-Low Discrimination” in Questionmark’s software), but no one really uses the index of discrimination anymore for a variety of reasons that I discussed in another post. Anyway, Kelley demonstrated that splitting the participants up into the top 27% and bottom 27% of scores would provide the most sensitive index of discrimination in some conditions, though other researchers later demonstrated that other score ranges would yield equally sensitive results for large enough sample sizes.

The 27% score bands reported in Questionmark’s item analysis report are an artifact of the days when people still used the index of discrimination (a practice that was common as recently as the 1980s due to lack of access to computers to calculate correlational discrimination indices). The people who originally designed Questionmark’s Item Analysis Report kept this design, but some companies report quartiles (which is essentially the same as using Kelley’s 27% method) or quintiles instead. There is no benefit to using one method over another for the purposes of item review, so the choice is primarily a matter of preference.

Cheers,

Austin
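A minimal sketch of how the 27% grouping feeds the index of discrimination, from raw data (the helper name and the sample data are illustrative, not output from any real report):

```python
def index_of_discrimination(total_scores, item_correct, band=0.27):
    """High-low discrimination: rank participants by total score, take the
    top and bottom `band` fraction, and subtract the lower group's
    proportion correct on the item from the upper group's."""
    ranked = sorted(zip(total_scores, item_correct), key=lambda pair: pair[0])
    k = max(1, int(len(ranked) * band))  # participants per 27% band
    lower = [correct for _, correct in ranked[:k]]
    upper = [correct for _, correct in ranked[-k:]]
    return sum(upper) / k - sum(lower) / k

scores = [55, 60, 62, 70, 75, 80, 85, 88, 90, 95]   # total test scores
correct = [0, 0, 1, 0, 1, 1, 1, 0, 1, 1]            # 1 = answered item right
print(index_of_discrimination(scores, correct))      # 1.0 for this toy data
```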