Item analysis: Selecting items for the test form – Part 2

Posted by Austin Fossey

In my last post, I talked about how item discrimination is the primary statistic used for item selection in classical test theory (CTT). In this post, I will share an example from my item analysis webinar.

The assessment below is fake, so there’s no need to write in comments telling me that the questions could be written differently or that the test is too short or that there is not good domain representation or that I should be banished to an island.

Interested in learning more about item analysis? I will be presenting a series of workshops on this topic at the Questionmark Conference 2016: Shaping the Future of Assessment in Miami, April 12-15.

In this example, we have field tested 16 items and collected item statistics from a representative sample of 1,000 participants. In this hypothetical scenario, we have been asked to create an assessment that has 11 items instead of 16. We will begin by looking at the item discrimination statistics.

Since this test has fewer than 25 items, we will look at the item-rest correlation discrimination. The screenshot below shows the first five items from the summary table in Questionmark’s Item Analysis Report (I have omitted some columns to help display the table within the blog).

[Screenshot: the first five items from the summary table in Questionmark’s Item Analysis Report]

The test’s reliability (as measured by Cronbach’s Alpha) for all 16 items is 0.58. Note that one would typically want a reliability value of at least 0.70 for low-stakes assessments and 0.90 or higher for high-stakes assessments. When reliability is too low, adding more items can often help, but removing items with poor discrimination can also improve reliability.

If we remove the five items with the lowest item-rest correlation discrimination (items 9, 16, 2, 3, and 13 shown above), the remaining 11 items have an alpha value of 0.67. That is still not high enough for even low-stakes testing, but it illustrates how items with poor discrimination can lower the reliability of an assessment. Low reliability also increases the standard error of measurement, so by increasing the reliability of the assessment, we might also increase the accuracy of the scores.
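If you want to check this kind of effect on your own response data, here is a minimal sketch of how Cronbach’s alpha and item-rest correlations can be computed, and how alpha changes when the weakest items are dropped. The response matrix below is simulated purely for illustration; it is not the webinar data, and the printed values will not match the report.

```python
import numpy as np

def cronbach_alpha(scores):
    """Cronbach's alpha for a participants-by-items matrix of 0/1 item scores."""
    scores = np.asarray(scores, dtype=float)
    k = scores.shape[1]
    item_variances = scores.var(axis=0, ddof=1)
    total_variance = scores.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_variances.sum() / total_variance)

def item_rest_correlation(scores, item):
    """Correlation between one item and the summed score of the remaining items."""
    scores = np.asarray(scores, dtype=float)
    rest_score = np.delete(scores, item, axis=1).sum(axis=1)
    return np.corrcoef(scores[:, item], rest_score)[0, 1]

# Simulated 0/1 responses: 1,000 participants by 16 items (illustration only)
rng = np.random.default_rng(0)
ability = rng.normal(size=(1000, 1))
scores = ((rng.normal(size=(1000, 16)) + ability) > 0.0).astype(int)

print("alpha, all 16 items:", round(cronbach_alpha(scores), 2))

# Drop the five items with the lowest item-rest correlation, then recompute alpha
rest_corr = np.array([item_rest_correlation(scores, i) for i in range(16)])
keep = np.argsort(rest_corr)[5:]
print("alpha, 11 strongest items:", round(cronbach_alpha(scores[:, keep]), 2))
```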

Notice that these five items have poor item-rest correlation statistics, yet four of those items have reasonable item difficulty indices (items 16, 2, 3, and 13). If we had made selection decisions based on item difficulty, we might have chosen to retain these items, though closer inspection would uncover some content issues, as I demonstrated during the item analysis webinar.

For example, consider item 3, which has a difficulty value of 0.418 and an item-rest correlation discrimination value of -0.02. The screenshot below shows the option analysis table from the item detail page of the report.

[Screenshot: option analysis table from the item detail page of the report]

The option analysis table shows that, when asked about the easternmost state in the United States, many participants are selecting the key, “Maine,” but 43.3% of our top-performing participants (defined by the upper 27% of scores) selected “Alaska.” This indicates that some of the top-performing participants might be familiar with Pochnoi Point, an Alaskan island which happens to sit on the other side of the 180th meridian. Sure, that is a technicality, but across the entire sample, 27.8% of the participants chose this option. This item clearly needs to be sent back for revision and clarification before we use it for scored delivery. If we had only looked at the item difficulty statistics, we might never have reviewed this item.
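An option analysis like this can be reproduced from raw responses by tabulating the percentage of all participants, and of the upper 27% of scorers, who chose each option. The sketch below is a rough illustration with invented choices and scores; the option labels other than “Maine” and “Alaska” are hypothetical.

```python
import numpy as np
import pandas as pd

def option_analysis(choices, total_scores, top_fraction=0.27):
    """Percentage choosing each option, overall and within the top-scoring group."""
    df = pd.DataFrame({"choice": choices, "score": total_scores})
    cutoff = df["score"].quantile(1 - top_fraction)
    upper = df[df["score"] >= cutoff]          # approximately the top 27% of scorers
    overall = df["choice"].value_counts(normalize=True) * 100
    top = upper["choice"].value_counts(normalize=True) * 100
    return (pd.DataFrame({"overall %": overall, "upper 27% %": top})
            .fillna(0).round(1))

# Invented data: 1,000 participants' option choices and total test scores
rng = np.random.default_rng(1)
choices = rng.choice(["Maine", "Alaska", "Hawaii", "Florida"],
                     size=1000, p=[0.45, 0.28, 0.17, 0.10])
scores = rng.integers(0, 17, size=1000)
print(option_analysis(choices, scores))
```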

Interested in learning more about item analysis? I will be presenting a series of workshops on this topic at the Questionmark Conference 2016: Shaping the Future of Assessment in Miami, April 12-15. I look forward to seeing you there! Click here to register and learn more about this important learning event.

Item analysis: Selecting items for the test form – Part 1

Posted by Austin Fossey

Regular readers of our blog know that we ran an initial series on item analysis way back in the day, and then I did a second item analysis series building on that a couple of years ago, and then I discussed item analysis in our item development series, and then we had an amazing webinar about item analysis, and then I named my goldfish Item Analysis and wrote my senator requesting that our state bird be changed to an item analysis. So today, I would like to talk about . . . item analysis.

But don’t worry, this is actually a new topic for the blog.

Interested in learning more about item analysis? I will be presenting a series of workshops on this topic at the 2016 Questionmark Conference: Shaping the Future of Assessment in Miami, April 12-15. 

Today, I am writing about the use of item statistics for item selection. I was surprised to learn from feedback from many of our webinar participants that a lot of people do not look at their item statistics until after the test has been delivered. Reviewing statistics after delivery is still a great practice (so keep it up), but if you can try out the questions as unscored field-test items before assembling your final test form, you can use the item analysis statistics to build a better instrument.

When building a test form, item statistics can help us in two ways.

  • They can help us identify items that are poorly written, miskeyed, or irrelevant to the construct.
  • They can help us select the items that will yield the most reliable instrument, and thus a more accurate score.

In the early half of the 20th century, it was a common belief that good test instruments should have a mix of easy, medium, and hard items, but this thinking began to change after two studies in 1952 by Fred Lord and by Lee Cronbach and Willard Warrington. These researchers (and others since) demonstrated that items with higher discrimination values create instruments whose total scores discriminate better among participants across all ability levels.

Sometimes easy and hard items are useful for measurement, such as in an adaptive aptitude test where we need to measure all abilities with similar precision. But in criterion-referenced assessments, we are often interested in correctly classifying those participants who should pass and those who should fail. If this is our goal, then the best test form will be one with a range of medium-difficulty items that also have high discrimination values.

Discrimination may be the primary statistic used for selecting items, but item reliability is also occasionally useful, as I explained in an earlier post. Item reliability can be used as a tie breaker when we need to choose between two items with the same discrimination, or it can be used to predict the reliability or score variance for a set of items that the test developer wants to use for a test form.
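As a rough illustration of that selection logic, the sketch below filters a hypothetical item bank to medium-difficulty items, ranks them by discrimination, and breaks ties with an item reliability index, taken here as the item standard deviation times the item-total correlation (the usual classical definition). The item names, statistics, and cut-offs are all invented for illustration.

```python
import pandas as pd

# Hypothetical field-test statistics, one row per item (values invented)
bank = pd.DataFrame({
    "item": ["Q1", "Q2", "Q3", "Q4", "Q5", "Q6"],
    "difficulty": [0.52, 0.48, 0.91, 0.55, 0.50, 0.12],       # p value
    "discrimination": [0.41, 0.35, 0.10, 0.35, 0.38, 0.05],   # item-total correlation
    "item_sd": [0.50, 0.50, 0.29, 0.48, 0.50, 0.33],
})

# Item reliability index: item SD x item-total correlation (used only as a tie breaker)
bank["rel_index"] = bank["item_sd"] * bank["discrimination"]

# Keep medium-difficulty items, rank by discrimination, break ties on rel_index
form = (bank[bank["difficulty"].between(0.30, 0.80)]
        .sort_values(["discrimination", "rel_index"], ascending=False)
        .head(4))
print(form[["item", "difficulty", "discrimination", "rel_index"]])
```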

Difficulty is still useful for flagging items, though an item flagged for being too easy or too hard will often have a low discrimination value too. If an easy or hard item has good discrimination, it may be worth reviewing it for item flaws or other factors that might have affected the statistics (e.g., was it delivered at the end of a timed test that did not leave participants enough time to respond carefully?).

In my next post, I will share an example from the webinar of how item selection using item discrimination improves the test form reliability, even though the test is shorter. I will also share an example of a flawed item that exhibits poor item statistics.

Interested in learning more about item analysis? I will be presenting a series of workshops on this topic at the 2016 Questionmark Conference: Shaping the Future of Assessment in Miami, April 12-15. I look forward to seeing you there! Click here to register and learn more about this important learning event.

Item Analysis Report – High-Low Discrimination

Posted by Austin Fossey

In our discussion about correlational item discrimination, I mentioned that there are several other ways to quantify discrimination. One of the simplest ways to calculate discrimination is the High-Low Discrimination index, which is included on the item detail views in Questionmark’s Item Analysis Report.

To calculate the High-Low Discrimination value, we simply subtract the percentage of low-scoring participants who got the item correct from the percentage of high-scoring participants who got the item correct. If 30% of our low-scoring participants answered correctly, and 80% of our high-scoring participants answered correctly, then the High-Low Discrimination is 0.80 – 0.30 = 0.50.

But what is the cut point between high and low scorers? In his article, “Selection of Upper and Lower Groups for the Validation of Test Items,” Kelley demonstrated that the High-Low Discrimination index may be more stable when we define the upper and lower groups as participants with the top 27% and bottom 27% of total scores, respectively. This is the same method that is used to define the upper and lower groups in Questionmark’s Item Analysis Report.
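As a quick sketch, the calculation looks like this in code, using the 27% grouping described above. The item scores and total scores below are simulated, so the printed value is only illustrative.

```python
import numpy as np

def high_low_discrimination(item_scores, total_scores, group_fraction=0.27):
    """p(correct) in the upper group minus p(correct) in the lower group."""
    item_scores = np.asarray(item_scores, dtype=float)
    order = np.argsort(total_scores)
    n = max(1, int(round(group_fraction * len(item_scores))))
    lower = item_scores[order[:n]]     # bottom 27% of total scores
    upper = item_scores[order[-n:]]    # top 27% of total scores
    return upper.mean() - lower.mean()

# Simulated data: the item gets easier as the total score rises
rng = np.random.default_rng(2)
totals = rng.integers(0, 41, size=500)
item = (rng.random(500) < 0.30 + 0.50 * totals / 40).astype(int)
print(round(high_low_discrimination(item, totals), 2))
```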

The interpretation of High-Low Discrimination is similar to the interpretation of correlational indices: positive values indicate good discrimination, values near zero indicate that there is little discrimination, and negative discrimination indicates that the item is easier for low-scoring participants.

In Measuring Educational Achievement, Ebel recommended the following cut points for interpreting High-Low Discrimination (D):

[Table: Ebel’s recommended cut points for interpreting High-Low Discrimination (D)]

In Introduction to Classical and Modern Test Theory, Crocker and Algina note some drawbacks of the High-Low Discrimination index. First, it is more common to see items with the same p value having large discrepancies in their High-Low Discrimination values. Second, unlike correlational discrimination indices, High-Low Discrimination can only be calculated for dichotomous items. Finally, High-Low Discrimination does not have a defined sampling distribution, which means that confidence intervals cannot be calculated and practitioners cannot determine whether differences between items’ High-Low Discrimination values are statistically significant.

Nevertheless, High-Low Discrimination is easy to calculate and interpret, so it is still a very useful tool for item analysis, especially in small-scale assessment. The figure below shows an example of the High-Low Discrimination value on the item detail view of the Item Analysis Report.

High-Low Discrimination value on the item detail page of Questionmark’s Item Analysis Report.

Item Analysis Analytics Part 8: Some problematic questions

Posted by Greg Pope

In my last post, I showed a few more examples of item analyses where we drilled down into why some questions had problems. I thought it might be useful to show a few examples of questions with bad, and even downright terrible, psychometric performance to illustrate the ugly side of item analysis.

Below is an example of a question that is fairly terrible in terms of psychometric performance. Here are some reasons why:

  • Going from left to right, first we see that the “Number of Results” is 65, which is not so good: there are too few participants in the sample to be able to make sound judgements about the psychometric performance of the question.
  • Next we see that 25 participants didn’t answer the question (“Number not Answered” = 25), which means there was a problem with people not finishing or finding the questions confusing and giving up.
  • The “P Value Proportion Correct” shows us that this question is hard, with only 20% of participants ‘getting it right.’
  • The “Item Discrimination” indicates very low discrimination, with the difference between the Upper and Lower group in terms of the proportion selecting the correct answer of ‘More than 40’ at only 5%. This means that of the participants with high overall exam scores, 27% selected the correct answer versus 22% of the participants with the lowest overall exam scores. This is a very small difference between the Upper and Lower groups. Participants who know the material should have got the question right more often.
  • The “Item Total Correlation” reflects the Item Discrimination with a negative value of -0.01. A value like this would definitely not meet most organizations’ internal criteria for an acceptable item. Negative item-total correlations are a major red flag! (See the sketch at the end of this example for how these statistics can be computed from raw responses.)
  • Finally we look at the Outcome information to see how the distracters perform. We find that participants are all over the map, selecting distracters in an erratic way. When I look at the question wording I realize how vague and arbitrary this question is: the number of questions that should be in an assessment depends on numerous factors and contexts. It is impossible to say that in any context a certain number of questions is required. It looks like the Upper Group is selecting the ‘21-40’ and ‘More than 40’ response options more than the other two options, which propose smaller numbers of questions. This makes sense from a guessing perspective, because in many assessment contexts more questions are better for reliability than fewer.

The psychometricians, SMEs, and test developers reviewing this question would need to send the SME who wrote this question back to basic authoring training to ensure that they know how to write questions that are clear and concise. This question does not really have a correct answer and needs to be re-written to clarify the context and provide many more details to the participants. I would even be tempted to throw out questions along this content line, because how long an assessment should be has no one “right answer.” How long an assessment should be depends on so many things that there will always be room for ambiguity, so it would be quite challenging to write a question that performs well statistically on this topic.

[Screenshot: item analysis detail for the question discussed above]
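For readers who want to compute these kinds of numbers themselves, here is a minimal sketch of the three core statistics discussed above (proportion correct, upper-minus-lower discrimination, and item-total correlation) calculated from raw response data. The responses are simulated, and the report’s own calculations may differ in detail (for example, in how unanswered questions are handled).

```python
import numpy as np

def item_statistics(correct, total_scores, group_fraction=0.27):
    """P value, upper-minus-lower discrimination, and item-total (point-biserial) correlation."""
    correct = np.asarray(correct, dtype=float)            # 1 = answered correctly, 0 = not
    total_scores = np.asarray(total_scores, dtype=float)
    order = np.argsort(total_scores)
    n = max(1, int(round(group_fraction * len(correct))))
    return {
        "p_value": correct.mean(),
        "upper_minus_lower": correct[order[-n:]].mean() - correct[order[:n]].mean(),
        "item_total_correlation": np.corrcoef(correct, total_scores)[0, 1],
    }

# Simulated responses to a weak question: little relationship with the total score
rng = np.random.default_rng(3)
totals = rng.normal(50, 10, size=65)
correct = (rng.random(65) < 0.20).astype(int)
print(item_statistics(correct, totals))
```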

Below is an example of a question that is downright awful in terms of psychometric performance. Here are some reasons why:

  • Going from left to right, first we see that the “Number of Results” is 268, which is really good. That is a nice healthy sample. Nothing wrong here, let’s move on.
  • Next we see that 56 participants didn’t answer the question (“Number not Answered” = 56), which means there was a problem with people not finishing or finding the questions confusing and giving up. It gets worse, much, much worse.
  • The “P Value Proportion Correct” shows us that this question is really hard, with 16% of participants ‘getting it right.’
  • The “Item Discrimination” indicates a negative discrimination, with the difference between the Upper and Lower group in terms of the proportion selecting the correct answer of ‘44123’ at -23%. This means that of the participants with high overall exam scores, 12% selected the correct answer versus 35% of the participants with the lowest overall exam scores. What the heck is going on? This means that participants with the highest overall assessment scores are selecting the correct answer LESS OFTEN than participants with the lowest overall assessment scores. That is not good at all; let’s dig deeper.
  • The “Item Total Correlation” reflects the Item Discrimination with a large negative value of -0.26. This is a clear indication that there is something incredibly wrong with this question.
  • Finally we look at the Outcome information to see how the distracters perform. This is where the true psychometric horror of this question is manifested. There is neither rhyme nor reason here: participants, regardless of their performance on the overall assessment, are all over the place in terms of selecting response options. You might as well have blindfolded everyone taking this question and had them randomly select their answers. This must have been extremely frustrating for the participants who had to take this question and would have likely led to many participants thinking that the organization administering this question did not know what they were doing.

The psychometricians, SMEs, and test developers reviewing this question would need to provide a pink slip to the SME who wrote it immediately. Clearly the SME failed basic question authoring training. This question makes no sense and was written in such a way as to suggest that the author was under the influence, or otherwise not in a right state of mind, when crafting it. What is this question testing? How can anyone possibly make sense of this and come up with a correct answer? Is there a correct answer? This question is not salvageable and should be stricken from the Perception repository without a second thought. A question like this should never have gotten in front of a participant, let alone 268 participants. The panel reviewing questions should review their processes to ensure that in the future questions like this are weeded out before an assessment goes live for people to take.

[Screenshot: item analysis detail for the second question discussed above]

Item Analysis Analytics Part 7: The psychometric good, bad and ugly

Posted by Greg Pope

A few posts ago I showed an example item analysis report for a question that performed well statistically and a question that did not perform well statistically. The latter turned out to be a mis-keyed item. I thought it might be interesting to drill into a few more item analysis cases of questions that have interesting psychometric performance. I hope this will help all of you out there recognize the patterns of the psychometric good, bad and ugly in terms of question performance.

The question below is an example of a question that is borderline in terms of psychometric performance. Here are some reasons why:

  • Going from left to right, first we see that the “Number of Results” is 116, which is a decent sample of participants to evaluate the psychometric performance of this question.
  • Next we see everyone answered the question (“Number not Answered” = 0), which means there probably wasn’t a problem with people not finishing or finding the question confusing and giving up.
  • The “P Value Proportion Correct” shows us that this question is average to easy, with 65% of participants “getting it right.”
  • The “Item Discrimination” indicates mediocre discrimination at best, with the difference between the upper and lower group in terms of the proportion selecting the correct answer of ‘Leptokurtic’ at 20%. This means that of the participants with high overall exam scores, 75% selected the correct answer versus 55% of the participants with the lowest overall exam scores. I would have liked to see a larger difference between the Upper and Lower groups.
  • The “Item Total Correlation” backs the Item Discrimination up with a lacklustre value of 0.14. A value like this would likely not meet many organizations’ internal criteria in terms of what is considered a “good” item.
  • Finally, we look at the Outcome information to see how the distracters perform. We find that each distracter pulls some participants, with ‘Platykurtic’ pulling the most participants and quite a large number of the Upper group (22%) selecting this distracter. If I were to guess what is happening, I would say that because the correct option and the distracters are so similar, and because this topic is obscure enough that you really need to know your material, participants get confused between the correct answer of ‘Leptokurtic’ and the distracter ‘Platykurtic.’

The psychometricians, SMEs, and test developers reviewing this question would need to talk with instructors to find out more about how this topic was taught and understand where the problem lies: Is it a problem with the question wording or a problem with instruction and retention/recall of material? If it is a question wording problem, revisions can be made and the question re-beta tested. If the problem is in how the material is being taught, then instructional coaching can occur and the question re-beta tested as is to see if improvements in the psychometric performance of the question occur.

[Screenshot: item analysis detail for the borderline question]

The question below is an example of a question that has a classic problem. Here are some reasons why it is problematic:

  • Going from left to right, first we see that the “Number of Results” is 175. That is a fairly healthy sample, nothing wrong there.
  • Next we see everyone answered the question (“Number not Answered” = 0), which means there probably wasn’t a problem with people not finishing or finding the question confusing and giving up.
  • The “P Value Proportion Correct” shows us that this question is easy, with 83% of participants ‘getting it right’. There is nothing immediately wrong with an easy question, so let’s look further.
  • The “Item Discrimination” indicates reasonable discrimination, with the difference between the Upper and Lower group in terms of the proportion selecting the correct answer of ‘Cronbach’s Alpha’ at 38%. This means that of the participants with high overall exam scores, 98% selected the correct answer versus 60% of the participants with the lowest overall exam scores. That is a nice difference between the Upper and Lower groups, with almost 100% of the Upper group choosing the correct answer. Obviously, this question is easy for participants who know their stuff!
  • The “Item Total Correlation” backs the Item Discrimination up with a value of 0.39, which would meet most organizations’ internal criteria in terms of what is considered a “good” item.
  • Finally, we look at the Outcome information to see how the distracters perform. Well, two of the distracters don’t pull any participants! This is a waste of good question real estate: participants have to read through four alternatives when there are only two they would even consider as the correct answer.

The psychometricians, SMEs, and test developers reviewing this question would likely ask the SME who developed it to come up with better distracters that would draw more participants. Clearly, ‘Bob’s Alpha’ is a joke distracter that participants dismiss immediately, as is ‘KR-1,000,000’ (I mean, Kuder-Richardson formula one million). Let’s get serious here!

[Screenshot: item analysis detail for the question with weak distracters]
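A simple check for this kind of problem can be automated: flag any distracter chosen by fewer than some threshold percentage of participants. The sketch below is a rough illustration; the 5% threshold, the response counts, and the ‘Split-half’ option are assumptions for illustration, not values from the report or a Questionmark rule.

```python
import pandas as pd

def flag_weak_distractors(responses, options, key, min_pct=5.0):
    """Return distracters chosen by fewer than min_pct of participants."""
    counts = pd.Series(responses).value_counts()
    pct = counts.reindex(options, fill_value=0) / len(responses) * 100
    return [opt for opt in options if opt != key and pct[opt] < min_pct]

# Invented responses for a question like the one above
options = ["Cronbach's Alpha", "Split-half", "Bob's Alpha", "KR-1,000,000"]
responses = (["Cronbach's Alpha"] * 145 + ["Split-half"] * 28 +
             ["Bob's Alpha"] * 1 + ["KR-1,000,000"] * 1)
print(flag_weak_distractors(responses, options, key="Cronbach's Alpha"))
```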

Item Analysis Analytics Part 4: The Nitty-Gritty of Item Analysis

Posted by Greg Pope

In my previous blog post I highlighted some of the essential things to look for in a typical Item Analysis Report. Now I will dive into the nitty-gritty of item analysis, looking at example questions and explaining how to use the Questionmark Item Analysis Report in an applied context for a State Capitals Exam.

The Questionmark Item Analysis Report first produces an overview of question performance, both in terms of the difficulty of questions and in terms of their discrimination (upper minus lower groups). These overview charts give you a “bird’s eye view” of how the questions composing an assessment perform. In the example below we see that we have a range of questions in terms of difficulty (“Item Difficulty Level Histogram”), with some harder questions (the bars on the left), mostly average-difficulty questions (the bars in the middle), and some easier questions (the bars on the right). In terms of discrimination (“Discrimination Indices Histogram”), we see that many questions have high discrimination, as evidenced by the bars being pushed up to the right (more questions on the assessment have higher discrimination statistics).

[Screenshot: Item Difficulty Level Histogram and Discrimination Indices Histogram]

Overall, if I were building a typical criterion-referenced assessment with a pass score around 50% I would be quite happy with this picture. We have more questions functioning at the pass score point with a range of questions surrounding it and lots of highly discriminating questions. We do have one rogue question on the far left with a very low discrimination index, which we need to look at.
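If your item statistics are available as a simple export, a bird’s-eye view like this is easy to recreate yourself. The sketch below plots the two histograms with matplotlib; the difficulty and discrimination values are made up for illustration.

```python
import matplotlib.pyplot as plt

# Hypothetical per-item statistics exported from an item analysis report
difficulty = [0.21, 0.35, 0.48, 0.52, 0.55, 0.58, 0.60, 0.63, 0.71, 0.83]
discrimination = [-0.05, 0.18, 0.22, 0.31, 0.35, 0.38, 0.40, 0.44, 0.47, 0.52]

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(9, 3))
ax1.hist(difficulty, bins=10, range=(0, 1))
ax1.set_title("Item Difficulty Level Histogram")
ax1.set_xlabel("p value (proportion correct)")
ax2.hist(discrimination, bins=10, range=(-0.2, 0.8))
ax2.set_title("Discrimination Indices Histogram")
ax2.set_xlabel("upper-lower discrimination")
plt.tight_layout()
plt.show()
```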

The next step is to drill down into each question to ensure that each question performs as it should. Let’s look at two questions from this assessment, one question that performs well and one question that does not perform so well.

The question below is an example of a question that performs nicely. Here are some reasons why:

  • Going from left to right, first we see that the “Number of Results” is 175, which is a nice sample of participants to evaluate the psychometric performance of this question.
  • Next we see that everyone answered the question (“Number not Answered” = 0), which means there probably wasn’t a problem with people not finishing or finding the question confusing and giving up.
  • The “P Value Proportion Correct” shows us that this question sits just above the pass score, with 61% of participants ‘getting it right.’ Nothing wrong with that: the question is neither too easy nor too hard.
  • The “Item Discrimination” indicates good discrimination, with the difference between the upper and lower group in terms of the proportion selecting the correct answer of ‘Salem’ at 48%. This means that of the participants with high overall exam scores, 88% selected the correct answer versus only 40% of the participants with the lowest overall exam scores. This is a nice, expected pattern.
  • The “Item Total Correlation” backs the Item Discrimination up with a strong value of 0.40. This means that across all participants who answered the question, the pattern of high scorers getting the question right more often than low scorers holds true.
  • Finally we look at the Outcome information to see how the distracters perform. We find that each distracter pulled some participants, with ‘Portland’ pulling the most participants, especially from the “Lower Group.” This pattern makes sense because those with poor state capital knowledge may make the common mistake of selecting Portland as the capital of Oregon.

The psychometricians, SMEs, and test developers reviewing this question all have smiles on their faces when they see the item analysis for this item.

[Screenshot: item analysis detail for the well-performing question]

Next we look at that rogue question that does not perform so well in terms of discrimination: the one we saw in the Discrimination Indices Histogram. When we look into the question we understand why it was flagged:

  • Going from left to right, first we see that the “Number of Results” is 175, which is again a nice sample size: nothing wrong here.
  • Next we see everyone answered the question, which is good.
  • The first red flag comes from the “P Value Proportion Correct”: this question is quite difficult (only 35% of participants selected the correct answer). This is not in and of itself a bad thing, so we can keep it in mind as we move on.
  • The “Item Discrimination” indicates a major problem: a negative discrimination value. This means that participants with the lowest exam scores selected the correct answer more often than participants with the highest exam scores. This is not the expected pattern we are looking for: Houston, this question has a problem!
  • The “Item Total Correlation” backs up the Item Discrimination with a high negative value.
  • To find out more about what is going on, we delve into the Outcome information area to see how the distracters perform. We find that the keyed-correct answer of Nampa is not showing the expected pattern of upper minus lower proportions. We do, however, find that the distracter “Boise” is showing the expected pattern, with the Upper Group (86%) selecting this response option much more than the Lower Group (15%). Wait a second… I think I know what is wrong with this one: it has been mis-keyed! Someone accidentally assigned a score of 1 to Nampa rather than Boise. (A sketch at the end of this example shows how a probable mis-key like this can be flagged automatically.)

[Screenshot: item analysis detail for the mis-keyed question]

No problem: the administrator pulls the data into the Results Management System (RMS), changes the keyed correct answer to Boise, and presto, we now have defensible statistics that we can work with for this question.

[Screenshot: item analysis detail after re-keying the question to Boise]

The psychometricians, SMEs, and test developers reviewing this question had frowns on their faces at first, but those frowns were turned upside down when they realized it was just a simple mis-keyed question.
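Mis-keys like this one can also be screened for programmatically: for each option, compare the selection rates of the upper and lower score groups, and flag the item if a distracter, rather than the key, shows the largest upper-minus-lower difference. The sketch below is a rough illustration with simulated responses; the option labels other than “Boise” and “Nampa” are hypothetical, not the question’s actual distracters.

```python
import numpy as np
import pandas as pd

def probable_key(choices, total_scores, group_fraction=0.27):
    """Return the option whose upper-minus-lower selection rate is largest.
    If this differs from the scored key, the item may be mis-keyed."""
    df = pd.DataFrame({"choice": choices, "score": total_scores})
    n = max(1, int(round(group_fraction * len(df))))
    ranked = df.sort_values("score")
    lower, upper = ranked.head(n), ranked.tail(n)
    diffs = {opt: (upper["choice"] == opt).mean() - (lower["choice"] == opt).mean()
             for opt in df["choice"].unique()}
    return max(diffs, key=diffs.get)

# Simulated responses: high scorers cluster on "Boise" even though "Nampa" is keyed
rng = np.random.default_rng(4)
scores = rng.normal(60, 12, size=175)
options = ["Boise", "Nampa", "Twin Falls", "Idaho Falls"]  # distracters besides Nampa are hypothetical
choices = np.where(scores > 60,
                   rng.choice(options, size=175, p=[0.80, 0.10, 0.05, 0.05]),
                   rng.choice(options, size=175, p=[0.20, 0.40, 0.20, 0.20]))
print("keyed answer: Nampa | probable key:", probable_key(choices, scores))
```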

In my next blog post I would like to share some observations on the relationship between Outcome Discrimination and Outcome Correlation.

Are you ready for some light relief after pondering all these statistics? Then have some fun with our own State Capitals Quiz.