Item Analysis Analytics: The White Paper

Posted by Greg Pope

I had a great time putting together an eight-part series on Item Analysis Analytics for this blog and was pleased with the interest it received.

When a reader asked whether all the posts could be presented in a single document, I thought: hey, let’s present the content of these articles as a Questionmark White Paper! So here it is for you to download with our compliments.

I hope the paper helps you in your efforts to create test questions that make the grade!

Peer Discussion – Hot Topics in Assessment

Posted by Sarah Elkins

During the recent Questionmark European Users Conference in Manchester, Stefanie Moerbeek, Senior Coordinator for Examination Development at EXIN, and Greg Pope, Questionmark’s Analytics and Psychometrics Manager, facilitated a best practice session that gave delegates the opportunity to participate in a peer discussion on Hot Topics in the Assessment Industry.

Stefanie and Greg join me in this podcast to share the outcomes of this session and provide an overview of topics such as beta testing questions, using randomization within examinations and dealing with intellectual property theft of exams.

Item Analysis Analytics Part 8: Some problematic questions

Posted by Greg Pope

In my last post, I showed a few more examples of item analyses where we drilled down into why some questions had problems. I thought it might be useful to show a few examples of questions with bad and downright terrible psychometric performance, to show the ugly side of item analysis.

Below is an example of a question that is fairly terrible in terms of psychometric performance. Here are some reasons why:

  • Going from left to right, first we see that the “Number of Results” is 65, which is not so good: there are too few participants in the sample to be able to make sound judgements about the psychometric performance of the question.
  • Next we see that 25 participants didn’t answer the question (“Number not Answered” = 25), which means there was a problem with people not finishing or finding the questions confusing and giving up.
  • The “P Value Proportion Correct” shows us that this question is hard with 20% of participants ‘getting it right.’
  • The “Item Discrimination” indicates very low discrimination, with the difference between the Upper and Lower group in terms of the proportion selecting the correct answer of ‘More than 40’ at only 5%. This means that of the participants with high overall exam scores, 27% selected the correct answer versus 22% of the participants with the lowest overall exam scores. This is a very small difference between the Upper and Lower groups. Participants who know the material should have got the question right more often.
  • The “Item Total Correlation” reflects the Item Discrimination with a negative value of -0.01. A value like this would definitely not meet most organizations’ internal criteria in terms of what is considered an acceptable item. Negative item-total correlations are a major red flag!
  • Finally we look at the Outcome information to see how the distracters perform. We find that participants are all over the map, selecting distracters in an erratic way. When I look at the question wording I realize how vague and arbitrary this question is: the number of questions that should be in an assessment depends on numerous factors and contexts. It is impossible to say that in any context a certain number of questions is required. It looks like the Upper Group are selecting the ‘21-40’ and ‘More than 40’ response options more than the other two options, which have smaller numbers of questions. This makes sense from a participant guessing perspective, because in many assessment contexts having more questions rather than fewer is better for reliability.

The psychometricians, SMEs, and test developers reviewing this question would need to send the SME who wrote this question back to basic authoring training to ensure that they know how to write questions that are clear and concise. This question does not really have a correct answer and needs to be re-written to clarify the context and provide many more details to the participants. I would even be tempted to throw out questions along this content line, because how long an assessment should be has no one “right answer.” How long an assessment should be depends on so many things that there will always be room for ambiguity, so it would be quite challenging to write a question that performs well statistically on this topic.

part-8-pic-1

Below is an example of a question that is downright awful in terms of psychometric performance. Here are some reasons why:

  • Going from left to right, first we see that the “Number of Results” is 268, which is really good. That is a nice healthy sample. Nothing wrong here, let’s move on.
  • Next we see that 56 participants didn’t answer the question (“Number not Answered” = 56), which means there was a problem with people not finishing or finding the questions confusing and giving up. It gets worse, much, much worse.
  • The “P Value Proportion Correct” shows us that this question is really hard, with 16% of participants ‘getting it right.’
  • The “Item Discrimination” indicates a negative discrimination, with the difference between the Upper and Lower group in terms of the proportion selecting the correct answer of ‘44123’ at -23%. This means that of the participants with high overall exam scores, 12% selected the correct answer versus 35% of the participants with the lowest overall exam scores. What the heck is going on? This means that participants with the highest overall assessment scores are selecting the correct answer LESS OFTEN than participants with the lowest overall assessment scores. That is not good at all; let’s dig deeper.
  • The “Item Total Correlation” reflects the Item Discrimination with a large negative value of -0.26. This is a clear indication that there is something incredibly wrong with this question.
  • Finally we look at the Outcome information to see how the distracters perform. This is where the true psychometric horror of this question is manifested. There is neither rhyme nor reason here: participants, regardless of their performance on the overall assessment, are all over the place in terms of selecting response options. You might as well have blindfolded everyone taking this question and had them randomly select their answers. This must have been extremely frustrating for the participants who had to take this question and would have likely led to many participants thinking that the organization administering this question did not know what they were doing.

The psychometricians, SMEs, and test developers reviewing this question would need to provide a pink slip to the SME who wrote this question immediately. Clearly the SME failed basic question authoring training. This question makes no sense and was written in such a way to suggest that the author was under the influence, or otherwise not in a right state of mind, when crafting this question. What is this question testing? How can anyone possibly make sense of this and come up with a correct answer? Is there a correct answer? This question is not salvageable and should be stricken from the Perception repository without a second thought. A question like this should have never gotten in front of a participant to take, let alone 268 participants. The panel reviewing questions should review their processes to ensure that in the future questions like this are weeded out before an assessment goes out live for people to take.

part-8-pic-2

Item Analysis Analytics Part 7: The psychometric good, bad and ugly

Posted by Greg Pope

A few posts ago I showed an example item analysis report for a question that performed well statistically and a question that did not perform well statistically. The latter turned out to be a mis-keyed item. I thought it might be interesting to drill into a few more item analysis cases of questions that have interesting psychometric performance. I hope this will help all of you out there recognize the patterns of the psychometric good, bad and ugly in terms of question performance.

The question below is an example of a question that is borderline in terms of psychometric performance. Here are some reasons why:

  • Going from left to right, first we see that the “Number of Results” is 116, which is a decent sample of participants to evaluate the psychometric performance of this question.
  • Next we see everyone answered the question (“Number not Answered” = 0) which means there probably wasn’t a problem with people not finishing or finding the questions confusing and giving up.
  • The “P Value Proportion Correct” shows us that this question is average to easy, with 65% of participants “getting it right.”
  • The “Item Discrimination” indicates mediocre discrimination at best, with the difference between the upper and lower group in terms of the proportion selecting the correct answer of ‘Leptokurtic’ at 20%. This means that of the participants with high overall exam scores, 75% selected the correct answer versus 55% of the participants with the lowest overall exam scores. I would have liked to see a larger difference between the Upper and Lower groups.
  • The “Item Total Correlation” backs the Item Discrimination up with a lacklustre value of 0.14. A value like this would likely not meet many organizations’ internal criteria in terms of what is considered a “good” item.
  • Finally, we look at the Outcome information to see how the distracters perform. We find that each distracter pulls some participants, with ‘Platykurtic’ pulling the most participants and quite a large number of the Upper group (22%) selecting this distracter. If I were to guess what is happening, I would say that because the correct option and the distracters are so similar, and because this topic is so obscure that you really need to know your material, participants get confused between the correct answer of ‘Leptokurtic’ and the distracter ‘Platykurtic’.

The psychometricians, SMEs, and test developers reviewing this question would need to talk with instructors to find out more about how this topic was taught and understand where the problem lies: Is it a problem with the question wording or a problem with instruction and retention/recall of material? If it is a question wording problem, revisions can be made and the question re-beta tested. If the problem is in how the material is being taught, then instructional coaching can occur and the question re-beta tested as is to see if improvements in the psychometric performance of the question occur.

greg-11

The question below is an example of a question that has a classic problem. Here are some reasons why it is problematic:

  • Going from left to right, first we see that the “Number of Results” is 175. That is a fairly healthy sample, nothing wrong there.
  • Next we see everyone answered the question (“Number not Answered” = 0), which means there probably wasn’t a problem with people not finishing or finding the question confusing and giving up.
  • The “P Value Proportion Correct” shows us that this question is easy, with 83% of participants ‘getting it right’. There is nothing immediately wrong with an easy question, so let’s look further.
  • The “Item Discrimination” indicates reasonable discrimination, with the difference between the Upper and Lower group in terms of the proportion selecting the correct answer of ‘Cronbach’s Alpha’ at 38%. This means that of the participants with high overall exam scores, 98% selected the correct answer versus 60% of the participants with the lowest overall exam scores. That is a nice difference between the Upper and Lower groups, with almost 100% of the Upper group choosing the correct answer. Obviously, this question is easy for participants who know their stuff!
  • The “Item Total Correlation” backs the Item Discrimination up with a value of 0.39, which would meet most organizations’ internal criteria in terms of what is considered a “good” item.
  • Finally, we look at the Outcome information to see how the distracters perform. Well, two of the distracters don’t pull any participants! This is a waste of good question real estate: Participants have to read through four alternatives when there are only two they even consider as being the correct answer.

The psychometricians, SMEs, and test developers reviewing this question would likely ask the SME who developed the question to come up with better distracters that would draw more participants. Clearly, ‘Bob’s Alpha’ is a joke distracter that participants dismiss immediately, as is ‘KR-1,000,000’. I mean, Kuder-Richardson formula one million? Let’s get serious here!
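To make the “wasted real estate” check concrete, here is a minimal Python sketch that lists distracters nobody (or almost nobody) selects. The function name, the `min_share` threshold and the option labels are my own illustration and are not part of the Item Analysis Report; the response data in the usage note is made up to mirror the 83% figure above.

```python
from collections import Counter


def unused_distracters(responses, options, key, min_share=0.0):
    """Return the distracters whose share of selections is <= min_share.

    responses: list of the option each participant selected
    options:   list of all response options on the question
    key:       the keyed-correct option (never reported as a distracter)
    min_share: hypothetical cut-off; 0.0 means "selected by nobody"
    """
    n = len(responses)
    if n == 0:
        raise ValueError("no responses to analyse")
    counts = Counter(responses)  # missing options count as 0
    return [opt for opt in options
            if opt != key and counts[opt] / n <= min_share]
```

With 83 of 100 made-up responses on option "A" (the key) and 17 on "B", calling `unused_distracters(responses, ["A", "B", "C", "D"], key="A")` returns the two options that pulled no one, which are the candidates for rewriting.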

part-8-pic-21

Item Analysis Analytics Part 6: Determining Whether a Question Makes the Grade

Posted by Greg Pope

In my previous blog post I talked about outcome discrimination and outcome correlation and their relationship to one another. Now I will provide some criteria that can be used for outcome discrimination and outcome correlation coefficients to judge whether a question is making the grade in terms of psychometric quality.

Outcome discrimination (high-low)

part-6-pic-1

Outcome correlation (Point-biserial correlation)

part-6-pic2

I’ll be back with more juicy psychometrics soon!

Item Analysis Analytics Part 5: Outcome Discrimination and Outcome Correlation

Posted by Greg Pope

In my previous blog post I dived into some details of item analysis, looking at example questions and how to use the Questionmark Perception Item Analysis Report in an applied context. I thought it might be useful in this post to talk about outcome discrimination and outcome correlation, as people sometimes ask me how they differ, when to use one or the other, and so on. The fact of the matter is that you can use either one, and often it comes down to preference, as they both yield quite similar results.

Outcome discrimination is the proportion of the top 27% of participants (ranked by assessment score) who selected a response option, minus the proportion of the bottom 27% of participants (ranked the same way) who selected that same response option. What you would expect is that participants with the highest assessment scores should select the correct response option more often than participants with the lowest assessment scores. Similarly, participants with the highest assessment scores should select the incorrect distracters less often than the participants with the lowest assessment scores.
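As a rough illustration of the upper-minus-lower calculation, here is a minimal Python sketch. The function name, the rounding of the group size and the handling of tied scores are my assumptions for the example; the Item Analysis Report may make different choices.

```python
def outcome_discrimination(selected, totals, fraction=0.27):
    """Upper-group proportion minus lower-group proportion for one option.

    selected: list of 0/1 flags, 1 if the participant chose this option
    totals:   list of overall assessment scores, in the same order
    fraction: share of participants in each group (27% by default)
    """
    if len(selected) != len(totals):
        raise ValueError("selected and totals must have the same length")
    n = len(totals)
    if n == 0:
        raise ValueError("no participants to analyse")
    # Assumption: group size is n * fraction rounded, at least 1.
    k = max(1, round(n * fraction))
    # Rank participants by assessment score, lowest first.
    # Assumption: ties keep their original order (stable sort).
    order = sorted(range(n), key=lambda i: totals[i])
    lower, upper = order[:k], order[-k:]
    p_upper = sum(selected[i] for i in upper) / k
    p_lower = sum(selected[i] for i in lower) / k
    return p_upper - p_lower
```

Run on the keyed-correct option, a healthy question gives a clearly positive value; run on a distracter, you would hope to see a negative one.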

Outcome correlation is a point-biserial correlation that correlates the outcome scores that participants achieve with the assessment scores that they achieve. So rather than comparing only the top and bottom 27% of participants, the outcome correlation looks at all participants using a standard correlation approach.
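For readers who like to see the arithmetic, here is a minimal Python sketch of a point-biserial correlation using the textbook formula (difference in group means, divided by the standard deviation of all scores, times the square root of p times q). The function name is mine, and this is the standard formula rather than a description of how the report computes it internally.

```python
from statistics import mean, pstdev


def point_biserial(item, totals):
    """Point-biserial correlation between a 0/1 outcome and assessment scores.

    item:   list of 0/1 outcome scores, one per participant
    totals: list of assessment scores, in the same order
    Returns NaN when the correlation is undefined (no variation).
    """
    if len(item) != len(totals) or not item:
        raise ValueError("item and totals must be non-empty and equal length")
    p = mean(item)           # proportion scoring 1 on the outcome
    q = 1 - p
    sd = pstdev(totals)      # population SD of all assessment scores
    if sd == 0 or p in (0, 1):
        return float("nan")
    m1 = mean([t for i, t in zip(item, totals) if i == 1])
    m0 = mean([t for i, t in zip(item, totals) if i == 0])
    return (m1 - m0) / sd * (p * q) ** 0.5
```

Because the outcome is scored 0 or 1, this value is identical to an ordinary Pearson correlation between the outcome scores and the assessment scores.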

If you are thinking that outcome discrimination and outcome correlation sound like they might be related to one another, you are right! High outcome discrimination statistics generally will result in high outcome correlations. In other words, outcome discrimination and outcome correlation statistics are highly correlated with one another. How correlated are they? Well, I looked at many real-life questions from Item Analysis Reports that customers have shared with me and found a positive correlation of 0.962, which is really high.

part-5

In my next post I will provide some criteria that can be used for outcome discrimination and outcome correlation coefficients to judge whether a question is making the grade in terms of psychometric quality.

Research Survey for Test Takers: You Can Help

Posted by Greg Pope

I am working with Dr. Bruno Zumbo, professor at the University of British Columbia, on a research study about the beliefs of people who are waiting to take, or have taken, a certification or licensure examination.

In this initial study we want to document people’s attitudes and beliefs regarding taking these exams as well as issues in the area of certification and licensure testing. This research is designed to help certification and licensing organizations improve high-stakes exams by shedding light on test takers’ perspectives.

To complete our research, we need input from anyone who is planning to take or has already taken a certification or licensing exam. If you are a test taker we thank you in advance for answering a 35-question survey that will take 5 to 10 minutes to complete. This is an opportunity to weigh in on important issues in the testing industry. If you are a test taker, please take the survey! If you know certification or licensing exam participants, we’d appreciate it if you could encourage them to take it too.

We will report on the results of our research this fall and appreciate your help!

Item Analysis Analytics Part 4: The Nitty-Gritty of Item Analysis

 

Posted by Greg Pope

In my previous blog post I highlighted some of the essential things to look for in a typical Item Analysis Report. Now I will dive into the nitty-gritty of item analysis, looking at example questions and explaining how to use the Questionmark Item Analysis Report in an applied context for a State Capitals Exam.

The Questionmark Item Analysis Report first produces an overview of question performance both in terms of the difficulty of questions and in terms of the discrimination of questions (upper minus lower groups). These overview charts give you a “bird’s eye view” of how the questions composing an assessment perform. In the example below we see that we have a range of questions in terms of their difficulty (“Item Difficulty Level Histogram”), with some harder questions (the bars on the left), most average-difficulty questions (bars in the middle), and some easier questions (the bars on the right). In terms of discrimination (“Discrimination Indices Histogram”) we see that we have many questions that have high discrimination as evidenced by the bars being pushed up to the right (more questions on the assessment have higher discrimination statistics).

part-4-picture-1

Overall, if I were building a typical criterion-referenced assessment with a pass score around 50% I would be quite happy with this picture. We have more questions functioning at the pass score point with a range of questions surrounding it and lots of highly discriminating questions. We do have one rogue question on the far left with a very low discrimination index, which we need to look at.

The next step is to drill down into each question to ensure that each question performs as it should. Let’s look at two questions from this assessment, one question that performs well and one question that does not perform so well.

The question below is an example of a question that performs nicely. Here are some reasons why:

  • Going from left to right, first we see that the “Number of Results” is 175, which is a nice sample of participants to evaluate the psychometric performance of this question.
  • Next we see that everyone answered the question (“Number not Answered” = 0), which means there probably wasn’t a problem with people not finishing or finding the questions confusing and giving up.
  • The “P Value Proportion Correct” shows us that this question is just above the pass score where 61% of participants ‘got it right.’ Nothing wrong with that: the question is neither too easy nor too hard.
  • The “Item Discrimination” indicates good discrimination, with the difference between the upper and lower group in terms of the proportion selecting the correct answer of ‘Salem’ at 48%. This means that of the participants with high overall exam scores, 88% selected the correct answer versus only 40% of the participants with the lowest overall exam scores. This is a nice, expected pattern.
  • The “Item Total Correlation” backs the Item Discrimination up with a strong value of 0.40. This means that of all participants who answered the questions, the pattern of high scorers getting the question right more than low scorers holds true.
  • Finally we look at the Outcome information to see how the distracters perform. We find that each distracter pulled some participants, with ‘Portland’ pulling the most participants, especially from the “Lower Group.” This pattern makes sense because those with poor state capital knowledge may make the common mistake of selecting Portland as the capital of Oregon.

The psychometricians, SMEs, and test developers reviewing this question all have smiles on their faces when they see the item analysis for this item.
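The first two numbers in the walkthrough above are simple counts, and a minimal Python sketch shows where they come from. The function name is mine, and I am assuming here that unanswered results stay in the denominator (that is, they count as incorrect); the report may treat them differently.

```python
def p_value(responses, key):
    """Return (proportion correct, number not answered) for one question.

    responses: list of selected options, with None for "not answered"
    key:       the keyed-correct option
    Assumption: unanswered results remain in the denominator.
    """
    n = len(responses)
    if n == 0:
        raise ValueError("no results to analyse")
    not_answered = sum(1 for r in responses if r is None)
    correct = sum(1 for r in responses if r == key)
    return correct / n, not_answered
```

For example, if 61 of 100 made-up results select ‘Salem’ and none are blank, the function returns a P value of 0.61 and a “Number not Answered” of 0.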

part-4-picture-2

Next we look at that rogue question that does not perform so well in terms of discrimination, the one we saw in the Discrimination Indices Histogram. When we look into the question we understand why it was flagged:

  • Going from left to right, first we see that the “Number of Results” is 175, which is again a nice sample size: nothing wrong here.
  • Next we see everyone answered the question, which is good.
  • The first red flag comes from the “P Value Proportion Correct” as this question is quite difficult (only 35% of participants selected the correct answer). This is not in and of itself a bad thing, so we can keep this in memory as we move on.
  • The “Item Discrimination” indicates a major problem, a negative discrimination value. This means that participants with the lowest exam scores selected the correct answer more than participants with the highest exam scores. This is not the expected pattern we are looking for: Houston, this question has a problem!
  • The “Item Total Correlation” backs up the Item Discrimination with a high negative value.
  • To find out more about what is going on we delve into the Outcome information area to see how the distracters perform. We find that the keyed-correct answer of Nampa is not showing the expected pattern of upper minus lower proportions. We do, however, find that the distracter “Boise” is showing the expected pattern of the Upper Group (86%) selecting this response option much more than the Lower Group (15%). Wait a second…I think I know what is wrong with this one: it has been mis-keyed! Someone accidentally assigned a score of 1 to Nampa rather than Boise.

part-4-picture-3

No problem: the administrator pulls the data into the Results Management System (RMS), changes the keyed correct answer to Boise, and presto, we now have defensible statistics that we can work with for this question.

part-4-picture-4

The psychometricians, SMEs, and test developers reviewing this question had a frown on their faces at first but those frowns were turned upside down when they realized it is just a simple mis-keyed question.
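The mis-key pattern in this example (negative discrimination on the keyed answer, strong positive discrimination on one distracter) is easy to screen for. Here is a minimal Python sketch of such a check; the function name and the decision rule are my own illustration, not a feature of the report or of the Results Management System.

```python
def suspect_miskey(option_discrimination, key):
    """Return the option that looks like the true key, or None.

    option_discrimination: dict mapping each response option to its
        upper-minus-lower discrimination (as a proportion, e.g. 0.71)
    key: the option currently keyed as correct
    Rule of thumb (an assumption for this sketch): flag the question when
    the key does not discriminate positively and the best-discriminating
    option is a distracter with positive discrimination.
    """
    key_d = option_discrimination[key]
    best = max(option_discrimination, key=option_discrimination.get)
    if best != key and key_d <= 0 < option_discrimination[best]:
        return best
    return None
```

Using Boise’s 86% minus 15% from above and a made-up negative value for Nampa, `suspect_miskey({"Nampa": -0.30, "Boise": 0.71}, key="Nampa")` points at Boise. A flag like this is only a prompt for an SME to check the key, not proof of a mis-key.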

In my next blog post I would like to share some observations on the relationship between Outcome Discrimination and Outcome Correlation.

Are you ready for some light relief after pondering all these statistics? Then have some fun with our own State Capitals Quiz.

Item Analysis Analytics Part 3: What to Look for in an Item Analysis Report

Posted by Greg Pope

In my last blog post I talked about the high level purpose and process of conducting an item analysis. Now I will describe some of the essential things to look for in a typical Item Analysis Report.

greg-post-part-31

You may sometimes see “Alpha if item deleted” statistics in Item Analysis Reports. These statistics provide information about whether the internal consistency reliability (e.g., Cronbach’s Alpha) will increase if the question is deleted from the assessment. An increase in the reliability value indicates that the question is not performing well psychometrically. Many Item Analysis Reports do not display the “Alpha if item deleted” statistic because the item-total correlation coefficient provides basically the same information. Questions with higher item-total correlation coefficient values will contribute to higher internal consistency reliability values, and lower item-total correlation coefficient values will contribute to lower internal consistency reliability values.
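To show what an “Alpha if item deleted” column is doing, here is a minimal Python sketch that computes Cronbach’s Alpha from a participants-by-questions score table and then recomputes it with each question left out in turn. The function names and data layout are my assumptions for the example.

```python
from statistics import pvariance


def cronbach_alpha(scores):
    """Cronbach's Alpha for a list of participant rows of question scores."""
    k = len(scores[0])
    if k < 2:
        raise ValueError("need at least two questions")
    # Variance of each question's scores, and of the total scores.
    item_vars = [pvariance([row[j] for row in scores]) for j in range(k)]
    total_var = pvariance([sum(row) for row in scores])
    if total_var == 0:
        return float("nan")  # undefined when total scores do not vary
    return k / (k - 1) * (1 - sum(item_vars) / total_var)


def alpha_if_item_deleted(scores):
    """Alpha recomputed with each question removed, in question order."""
    k = len(scores[0])
    return [cronbach_alpha([row[:j] + row[j + 1:] for row in scores])
            for j in range(k)]
```

If the value for a question in the second list is higher than the overall Alpha, the assessment would be more internally consistent without that question, which is the red flag described above.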

Other statistics you might see are variations of the point-biserial item-total correlation coefficient such as “Corrected Point-biserial correlation,” “biserial correlation” or “corrected biserial correlation.” The “corrected” in these refers to taking out the question scores from the calculations so that the question being examined is not “contributing to itself” in terms of the statistics.
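The “corrected” idea is simple to sketch in Python: subtract the question’s own score from each participant’s total before correlating. The function names below are mine, and this is a plain illustration of the correction rather than the report’s implementation.

```python
def pearson(x, y):
    """Plain Pearson correlation; returns NaN if either list has no variation."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return float("nan")
    return sxy / (sxx * syy) ** 0.5


def corrected_item_total(item, totals):
    """Correlate a question with the total score excluding that question."""
    rest = [t - i for i, t in zip(item, totals)]  # remove the question's own score
    return pearson(item, rest)
```

On short assessments the difference matters: a question that is unrelated to the rest of the test can still show a healthy-looking uncorrected correlation simply because it is part of the total.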

A great resource for more information on item analysis is Chapter 8 of Dr. Steven J. Osterlind’s book Constructing Test Items: Multiple-Choice, Constructed-Response, Performance and Other Formats (2nd edition).

In my next post I will dive into the nitty-gritty of item analysis. I will look at example questions and how to use the Questionmark Item Analysis Report in an applied context. Stay tuned to the Questionmark Blog…
