Develop Better Tests with Item Analysis [New eBook]

Posted by Chloe Mendonca

Item Analysis is probably the most important tool for increasing test effectiveness.  In order to write items that accurately and reliably measure what they’re intended to, you need to examine participant responses to each item. You can use this information to improve test items and identify unfair or biased items.

So what’s the process for conducting an item analysis? What should you be looking for? How do you determine if a question is “good enough”?

Questionmark has just published a new eBook, “Item Analysis Analytics,” which answers these questions. The eBook shares many examples of varying statistics that you may come across in your own analyses.

Download this eBook to learn about these aspects of analytics:

  • the basics of classical test theory and item analysis
  • the process of conducting an item analysis
  • essential things to look for in a typical item analysis report
  • whether a question “makes the grade” in terms of psychometric quality

This eBook is available as a PDF and ePUB suitable for viewing on a variety of mobile devices and eReaders.

I hope you enjoy reading it!

Item Analysis Report – Item Reliability

Posted by Austin Fossey

In this series of posts, we have been discussing the statistics that are reported on the Item Analysis Report, including the difficulty index, correlational discrimination, and high-low discrimination.

The final statistic reported on the Item Analysis Report is the item reliability. Item reliability is simply the product of the standard deviation of item scores and a correlational discrimination index (Item-Total Correlation Discrimination in the Item Analysis Report). So item reliability reflects how much the item is contributing to total score variance. As with assessment reliability, higher values represent better reliability.
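As a rough illustration of that definition, here is a minimal Python sketch. It assumes the population standard deviation and an uncorrected item-total correlation (the item is included in the total); the Item Analysis Report may compute these differently. The response data and the function name `item_reliability` are made up for the example.

```python
import numpy as np

def item_reliability(item_scores, total_scores):
    """Item reliability index: SD of item scores times the item-total correlation."""
    item_scores = np.asarray(item_scores, dtype=float)
    total_scores = np.asarray(total_scores, dtype=float)
    sd = item_scores.std()  # population SD (ddof=0); an assumption, not the report's documented choice
    r = np.corrcoef(item_scores, total_scores)[0, 1]  # uncorrected item-total correlation
    return sd * r

# Hypothetical data: 6 participants (rows) x 4 dichotomous items (columns)
responses = np.array([
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 0],
    [0, 0, 0, 0],
])
totals = responses.sum(axis=1)

reliabilities = [item_reliability(responses[:, j], totals)
                 for j in range(responses.shape[1])]
for j, value in enumerate(reliabilities, start=1):
    print(f"Item {j}: item reliability = {value:.3f}")
```

Under these assumptions, the item reliabilities sum to the standard deviation of the total scores, which is why the index can be read as the item's contribution to total score spread.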

Like the other statistics in the Item Analysis Report, item reliability is used primarily to inform decisions about item retention. Crocker and Algina (Introduction to Classical and Modern Test Theory) describe three ways that test developers might use the item reliability index.

1) Choosing Between Two Items in Form Construction

If two items have similar discrimination values, but one item has a higher standard deviation of item scores, then that item will have higher item reliability and will contribute more to the assessment’s reliability. All else being equal, the test developer might decide to retain the item with higher reliability and save the lower reliability item in the bank as backup.

2) Building a Form with a Required Assessment Reliability Threshold

As Crocker and Algina demonstrate, Cronbach’s Alpha can be calculated as a function of the standard deviations of items’ scores and items’ reliabilities. If the test developer desires a certain minimum for the assessment’s reliability (as measured by Cronbach’s Alpha), they can use these two item statistics to build a form that will yield the desired level of internal consistency.
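A minimal Python sketch of that relationship follows. It assumes the form alpha = k/(k-1) × (1 − sum of item variances / (sum of item reliabilities)²), with population standard deviations and uncorrected item-total correlations. The data and the function name `alpha_from_item_stats` are hypothetical.

```python
import numpy as np

def alpha_from_item_stats(item_sds, item_reliabilities):
    """Cronbach's alpha from item SDs and item reliability indices."""
    item_sds = np.asarray(item_sds, dtype=float)
    item_reliabilities = np.asarray(item_reliabilities, dtype=float)
    k = item_sds.size
    total_variance = item_reliabilities.sum() ** 2  # (sum of item reliabilities)^2
    return (k / (k - 1)) * (1 - (item_sds ** 2).sum() / total_variance)

# Hypothetical data: 6 participants (rows) x 4 dichotomous items (columns)
responses = np.array([
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 0],
    [0, 0, 0, 0],
])
totals = responses.sum(axis=1)
k = responses.shape[1]

sds = responses.std(axis=0)  # population SD of each item
rels = np.array([sds[j] * np.corrcoef(responses[:, j], totals)[0, 1]
                 for j in range(k)])

alpha = alpha_from_item_stats(sds, rels)

# Cross-check against the usual variance-based formula for alpha
alpha_direct = (k / (k - 1)) * (1 - responses.var(axis=0).sum() / totals.var())
print(f"alpha from item statistics: {alpha:.3f}")
print(f"alpha from variances:       {alpha_direct:.3f}")
```

The two values agree because, under these assumptions, the squared sum of the item reliabilities reproduces the total score variance.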

3) Building a Form with a Required Total Score Variance Threshold

Crocker and Algina explain that the total score variance is equivalent to the square of the sum of item reliability indices, so test developers may continue to add items to a form based on their item reliability values until they meet their desired threshold for total score variance.
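Here is a rough Python sketch of that selection idea. It assumes you already have an item reliability value for each banked item and treats those values as fixed, which is a simplification: an item's reliability depends on the total score of the form it sits in, so the projection is only approximate. The item IDs, values, and the function name `select_items_for_variance` are hypothetical.

```python
def select_items_for_variance(item_reliabilities, target_variance):
    """Add items (highest item reliability first) until the projected total
    score variance, (sum of item reliabilities)^2, reaches the target."""
    ranked = sorted(item_reliabilities.items(), key=lambda kv: kv[1], reverse=True)
    selected, running_sum = [], 0.0
    for item_id, reliability in ranked:
        if running_sum ** 2 >= target_variance:
            break  # threshold already met; stop adding items
        selected.append(item_id)
        running_sum += reliability
    return selected, running_sum ** 2

# Hypothetical item bank: item ID -> item reliability index
bank = {"Q1": 0.21, "Q2": 0.15, "Q3": 0.30, "Q4": 0.08, "Q5": 0.25}

form, projected_variance = select_items_for_variance(bank, target_variance=0.5)
print(form, round(projected_variance, 4))
```

If the bank runs out before the target is reached, the function simply returns every item along with the variance it could project.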


Item reliability from Questionmark’s Item Analysis Report (item detail page)

Item Analysis Report – High-Low Discrimination

Posted by Austin Fossey

In our discussion about correlational item discrimination, I mentioned that there are several other ways to quantify discrimination. One of the simplest ways to calculate discrimination is the High-Low Discrimination index, which is included on the item detail views in Questionmark’s Item Analysis Report.

To calculate the High-Low Discrimination value, we simply subtract the percentage of low-scoring participants who got the item correct from the percentage of high-scoring participants who got the item correct. If 30% of our low-scoring participants answered correctly, and 80% of our high-scoring participants answered correctly, then the High-Low Discrimination is 0.80 – 0.30 = 0.50.

But what is the cut point between high and low scorers? In his article, “Selection of Upper and Lower Groups for the Validation of Test Items,” Kelley demonstrated that the High-Low Discrimination index may be more stable when we define the upper and lower groups as participants with the top 27% and bottom 27% of total scores, respectively. This is the same method that is used to define the upper and lower groups in Questionmark’s Item Analysis Report.
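The calculation can be sketched in a few lines of Python. This is only an illustration: the rounding of the 27% group size and the handling of tied total scores at the group boundaries are my assumptions, and Questionmark's report may treat them differently. The data and the function name `high_low_discrimination` are made up.

```python
import numpy as np

def high_low_discrimination(item_correct, total_scores, fraction=0.27):
    """Proportion correct in the top group minus proportion correct in the bottom group."""
    item_correct = np.asarray(item_correct, dtype=float)
    total_scores = np.asarray(total_scores, dtype=float)
    n_group = max(1, int(round(fraction * total_scores.size)))  # group size; rounding is an assumption
    order = np.argsort(total_scores, kind="stable")  # participants sorted from lowest to highest total
    low = item_correct[order[:n_group]]
    high = item_correct[order[-n_group:]]
    return high.mean() - low.mean()

# Hypothetical data: 37 participants with distinct total scores 0..36,
# so the top and bottom 27% groups each contain 10 participants.
totals = np.arange(37)
item = np.zeros(37)
item[:3] = 1     # 3 of the 10 lowest scorers answered correctly (30%)
item[15:27] = 1  # middle group; does not affect the index
item[27:35] = 1  # 8 of the 10 highest scorers answered correctly (80%)

d = high_low_discrimination(item, totals)
print(f"High-Low Discrimination = {d:.2f}")
```

With these made-up responses the index matches the 0.80 – 0.30 example above.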

The interpretation of High-Low Discrimination is similar to the interpretation of correlational indices: positive values indicate good discrimination, values near zero indicate that there is little discrimination, and negative discrimination indicates that the item is easier for low-scoring participants.

In Measuring Educational Achievement, Ebel recommended the following cut points for interpreting High-Low Discrimination (D):

[Table: Ebel’s recommended cut points for interpreting D]

In Introduction to Classical and Modern Test Theory, Crocker and Algina note that there are some drawbacks to the High-Low Discrimination index. First, it is more common to see items with the same p value that differ widely in their High-Low Discrimination values. Second, unlike correlational discrimination indices, High-Low Discrimination can only be calculated for dichotomous items. Finally, High-Low Discrimination does not have a defined sampling distribution, which means that confidence intervals cannot be calculated, and practitioners cannot determine whether there are statistical differences in High-Low Discrimination values.

Nevertheless, High-Low Discrimination is easy to calculate and interpret, so it is still a very useful tool for item analysis, especially in small-scale assessment. The figure below shows an example of the High-Low Discrimination value on the item detail view of the Item Analysis Report.


High-Low Discrimination value on the item detail page of Questionmark’s Item Analysis Report.

Item Analysis Report – Item Difficulty Index

Posted by Austin Fossey

In classical test theory, a common item statistic is the item’s difficulty index, or “p value.” Given many psychometricians’ notoriously poor spelling, might this be due to thinking that “difficulty” starts with p?

Actually, the p stands for the proportion of participants who got the item correct. For example, if 100 participants answered the item, and 72 of them answered the item correctly, then the p value is 0.72. The p value can take on any value between 0.00 and 1.00. Higher values denote easier items (more people answered the item correctly), and lower values denote harder items (fewer people answered the item correctly).

Typically, test developers use this statistic as one indicator for detecting items that could be removed from delivery. They set thresholds for items that are too easy and too difficult, review them, and often remove them from the assessment.

Why throw out the easy and difficult items? Because they are not doing as much work for you. When calculating the item-total correlation (or “discrimination”) for unweighted items, Crocker and Algina (Introduction to Classical and Modern Test Theory) note that discrimination is maximized when p is near 0.50 (about half of the participants get it right).

Why is discrimination so low for easy and hard items? An easy item means that just about everyone gets it right, no matter how proficient they are in the domain; the item does not discriminate well between high and low performers. (We will talk more about discrimination in subsequent posts.)
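One way to see part of the reason, at least for dichotomously scored items: the variance of item scores is p(1 − p), which peaks at p = 0.50 and shrinks toward zero at either extreme, so very easy and very hard items have little spread left to correlate with total scores. The short Python sketch below just tabulates that quantity; the function name is made up for illustration.

```python
def dichotomous_item_variance(p):
    """Variance of a 0/1 item score when a proportion p of participants answer correctly."""
    return p * (1 - p)

# Tabulate the variance across a range of difficulty values
p_values = [0.05, 0.25, 0.50, 0.75, 0.95]
variances = {p: dichotomous_item_variance(p) for p in p_values}
for p, variance in variances.items():
    print(f"p = {p:.2f}  variance = {variance:.4f}")
```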

Sometimes you may still need to use a very easy or very difficult item on your test form. You may have a blueprint that requires a certain number of items from a given topic, and all of the available items might happen to be very easy or very hard. I also see this scenario in cases with non-compensatory scoring of a topic. For example, a simple driving test might ask, “Is it safe to drink and drive?” The question is very easy and will likely have a high p value, but the test developer may include it so that if a participant gets the item wrong, they automatically fail the entire assessment.

You may also want very easy or very hard items if you are using item response theory (IRT) to score an aptitude test, though it should be noted that item difficulty is modeled differently in an IRT framework. IRT yields standard errors of measurement that are conditional on the participant’s ability, so having hard and easy items can help produce better estimates of high- and low-performing participants’ abilities, respectively. This is different from classical test theory, where the standard error of measurement is the same for all observed scores on an assessment.

While simple to calculate, the p value requires cautious interpretation. As Crocker and Algina note, the p value is a function of the number of participants who know the answer to the item plus the number of participants who were able to correctly guess the answer to the item. In an open response item, that latter group is likely very small (absent any cluing in the assessment form), but in a typical multiple choice item, a number of participants may answer correctly, based on their best educated guess.

Recall also that p values are statistics—measures from a sample. Your interpretation of a p value should be informed by your knowledge of the sample. For example, if you have delivered an assessment, but only advanced students have been scheduled to take it, then the p value will be higher than it might be when delivered to a more representative sample.

Since the p value is a statistic, we can calculate the standard error of that statistic to get a sense of how stable the statistic is. The standard error will decrease with larger sample sizes. In the example below, 500 participants responded to this item, and 284 participants answered the item correctly, so the p value is 284/500 = 0.568. The standard error of the statistic is ± 0.022. If these 500 participants were to answer this item over and over again (and no additional learning took place), we would expect the p value for this item to fall in the range of 0.568 ± 0.022 about 68% of the time.


Item p value and standard error of the statistic from Questionmark’s Item Analysis Report
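The arithmetic in this example can be reproduced with a short Python sketch. It assumes the standard error is the usual standard error of a proportion, sqrt(p(1 − p)/n), which yields the 0.022 quoted above; I am not claiming this is exactly how the report computes it. The function name `p_value_with_se` is made up.

```python
import math

def p_value_with_se(n_correct, n_total):
    """Item difficulty (proportion correct) and its standard error."""
    p = n_correct / n_total
    se = math.sqrt(p * (1 - p) / n_total)  # standard error of a proportion (assumed formula)
    return p, se

p, se = p_value_with_se(284, 500)
print(f"p = {p:.3f}, SE = {se:.3f}")  # p = 0.568, SE = 0.022
```

Because n sits in the denominator, the same p value from 2,000 participants instead of 500 would have half the standard error.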

Research Design Validity: Applications in Assessment Management

Posted by Austin Fossey

I would like to wrap up our discussions about validity by talking briefly about the validity of research designs.

We have already discussed criterion, construct, and content validity, which are the stanchions of validity in an assessment. We have also talked about new proponents of argument-based validity and the more abstract concept of face validity.

While all of these concepts relate to the validity of the assessment instrument, we must also consider the validity of the research used in assessment management and the validity of the research that an assessment or survey supports.

In their 1963 book, Experimental and Quasi-Experimental Designs for Research, Donald Campbell and Julian Stanley describe two research design concepts: internal validity and external validity.

Internal validity is the idea that observed differences in a dependent variable (e.g., test score) are directly related to an independent variable (e.g., participant’s true ability). External validity refers to how generalizable our results are. For example, would we expect the same results with other samples of participants, other research conditions, or other operational conditions?

The item analysis report, which provides statistics about the difficulty and discrimination of an item, is an example of research that is used for assessment management. Assessment managers often use these statistics to decide if an unscored field test item is fit to become a scored operational item on an assessment.

When we use the item analysis report to decide if the item is worth keeping, we are conducting research. The internal validity of the research may be threatened if something other than participant ability is affecting the item statistics.

For example, I recall a company that field tested two new test forms, and later found out that one participant had been trying to sabotage the statistics by persuading others to purposefully get a low score on the assessment. Fortunately, this person’s online campaign was ineffective, but it is a good example of an event that could have seriously disrupted the internal validity of the item analysis research.

When considering external validity, the most common threat is a non-representative sample. When field testing items for the first time, some assessment managers will find that volunteer participants are not representative of the general population of participants.

In some of my past experiences, I have had samples of field test volunteers who were either high-ability participants or people planning to teach a test prep workshop. We would not expect the item statistics from this sample to remain stable when the items go live in the general population.

So how can we control these threats? Try using separate groups of participants so you can compare results. Be consistent in how assessments are administered, and when items are not administered to all participants, make sure they are randomly assigned. Document your sample to demonstrate that it is representative of your participant population, and when possible, try to replicate your findings.

Using the Item Analysis Report

Here’s some basic information about the Item Analysis Report, which was recently added to Questionmark Analytics:

What it does: The item analysis report provides an in-depth Classical Test Theory psychometric analysis of item performance. It enables you to drill down into specific item statistics and performance data. The report includes key item statistics such as item difficulty p-value, high-low discrimination, item-total correlation discrimination and item reliability. It also provides assessment statistics relating to the amount of time taken and the scores achieved.

Who should use it: Assessment, learning and education professionals can use this report to determine how well questions perform psychometrically.

How it looks: The report includes an assessment level overview graph and summary table. The overview graph plots a single point for each item in the summary table. Each question is plotted in terms of its item difficulty p-value (X-axis) and by its item-total correlation discrimination (Y-axis):

  • Questions that have high (acceptable) discrimination will appear near the top of the graph
  • Questions with low (unacceptable) discrimination will appear at the bottom of the graph
  • Difficult questions will appear to the left of the graph
  • Easy questions will appear to the right of the graph

The summary table beneath the scatter plot contains a row for every question on the assessment. The table provides information on the question order, question wording and description, and summary information regarding the item difficulty p-values and the item-total correlation discrimination for each question. You can sort on each column to get different views of question performance. For example, you can sort the questions by difficulty to see the hardest questions at the top of the table.

Clicking any row in the summary table opens a detailed item view with question-level information.