Should I include really easy or really hard questions on my assessments?


Posted by Greg Pope

I thought it might be fun to discuss something that many people have asked me about over the years: “Should I include really easy or really hard questions on my assessments?” It is difficult to provide a simple “Yes” or “No” answer because, as with so many things in testing, it depends! However, I can provide some food for thought that may help you when building your assessments.

We can define easy questions as those with high p-values (item difficulty statistics) such as 0.9 to 1.0 (90-100% of participants answer the question correctly). We can define hard questions as those with low p-values such as 0.15 to 0 (15-0% answer the question correctly). These ranges are fairly arbitrary: some organizations in some contexts may consider greater than 0.8 easy and less than 0.25 difficult.
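To make the arithmetic concrete, here is a minimal sketch (in Python, with a made-up response matrix and the arbitrary thresholds from above) of how a p-value is just the proportion of participants who answered a question correctly:

```python
# Illustrative only: scored responses (1 = correct, 0 = incorrect)
# for 5 participants by 3 questions.
scored = [
    [1, 1, 0],
    [1, 0, 0],
    [1, 1, 1],
    [1, 0, 0],
    [0, 1, 0],
]

num_participants = len(scored)

# p-value for each question = proportion of participants answering it correctly.
p_values = [
    sum(row[item] for row in scored) / num_participants
    for item in range(len(scored[0]))
]

for item, p in enumerate(p_values, start=1):
    # The "easy"/"hard" cut-offs mirror the arbitrary ranges discussed above.
    label = "easy" if p >= 0.9 else "hard" if p <= 0.15 else "in between"
    print(f"Question {item}: p = {p:.2f} ({label})")
```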

When considering how easy or difficult questions should be, start by asking, “What is the purpose of the assessment program and the assessments being developed?” If the purpose of an assessment is to provide a knowledge check and facilitate learning during a course, then a short formative quiz would probably be appropriate. In this case, one can be fairly flexible in selecting questions to include on the quiz; having some easier and harder questions is probably just fine. If the purpose of an assessment is to measure a participant’s ability to process information quickly and accurately under time pressure, then a speed test would likely be appropriate. In that case, a large number of relatively easy questions administered within a tight time limit should be included on the assessment.

However, in many common situations having very difficult or very easy questions on an assessment may not make a great deal of sense. For a criterion-referenced example, if the purpose of an assessment is to certify participants as knowledgeable and skilful enough to do a certain job competently (e.g., crane operation), the difficulty of the questions would need careful scrutiny. The exam may have a cut score that participants need to achieve in order to be considered good enough (e.g., 60% or higher). Here are a few reasons why having many very easy or very hard questions on this type of assessment may not make sense:

Very easy items won’t contribute a great deal to the measurement of the construct

A very easy item that almost every participant gets right doesn’t tell us a great deal about what the participant knows and can do. A question like: “Cranes are big. Yes/No” doesn’t tell us a great deal about whether someone has the knowledge or skills to operate a crane. Very easy questions, in this context, are almost like “give-away” questions that contribute virtually nothing to the measurement of the construct. One would get almost the same measurement information (or lack thereof) from asking a question like “What is your shoe size?” because everyone (or mostly everyone) would get it correct.
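One way to see why such give-away questions add so little: for a dichotomously scored question, the variance of the item score is p(1 - p), which peaks when p is 0.5 and collapses toward zero as p approaches 1.0. A tiny sketch (the p-values are illustrative):

```python
# Variance of a dichotomously scored question is p * (1 - p); it is largest
# at p = 0.5 and nearly zero for give-away questions almost everyone gets right.
for p in (0.50, 0.80, 0.95, 0.99):
    print(f"p = {p:.2f} -> item score variance = {p * (1 - p):.4f}")
```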

Tricky to balance the blueprint

Assessment construction generally requires following a blueprint that needs to be balanced in terms of question content, difficulty, and other factors. It is often very difficult to balance these blueprints for all factors, and using extreme questions makes this all the more challenging because there are generally more questions available that are of average rather than extreme difficulty.
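To make the balancing problem concrete, here is a hypothetical sketch of filling a very simple blueprint that asks for a set number of questions per content area within a difficulty band. The item bank, content areas, and bands are all invented for illustration:

```python
from collections import defaultdict

# Hypothetical item bank: (question_id, content_area, p_value).
bank = [
    ("Q1", "rigging", 0.55), ("Q2", "rigging", 0.92), ("Q3", "rigging", 0.61),
    ("Q4", "signals", 0.48), ("Q5", "signals", 0.97), ("Q6", "signals", 0.58),
    ("Q7", "safety",  0.12), ("Q8", "safety",  0.65), ("Q9", "safety",  0.70),
]

def band(p):
    # Difficulty bands roughly matching the arbitrary ranges discussed earlier in this post.
    if p >= 0.9:
        return "easy"
    if p <= 0.15:
        return "hard"
    return "moderate"

# Blueprint: each content area needs two moderate-difficulty questions.
blueprint = {("rigging", "moderate"): 2, ("signals", "moderate"): 2, ("safety", "moderate"): 2}

selected, filled = [], defaultdict(int)
for question_id, area, p in bank:
    cell = (area, band(p))
    if filled[cell] < blueprint.get(cell, 0):
        selected.append(question_id)
        filled[cell] += 1

# The extreme questions (Q2, Q5, Q7) never qualify, which shrinks the usable pool.
print(selected)
```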

Potentially not enough questions providing information near the cut score

In a criterion-referenced exam with a cut score of 60%, one would want the most measurement information in the exam near this cut score. What do I mean by this? Well, questions with p-values around 0.60 will provide the most information about whether participants just have, or just don’t have, the knowledge and skills to pass. This topic requires a more detailed look at assessment development techniques that I will elaborate on in an upcoming blog post!
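To give a flavour of what “most information near the cut score” means, here is a sketch using the item information function from the two-parameter logistic (IRT) model; the parameter values are invented, and the classical p-value framing above maps only loosely onto IRT difficulty, so treat this purely as an illustration that a question whose difficulty sits at the cut point tells you the most about pass/fail decisions:

```python
import math

def p_correct(theta, a, b):
    """2PL probability of a correct response at ability level theta."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def item_information(theta, a, b):
    """2PL item information: a^2 * P * (1 - P); it peaks where theta equals b."""
    p = p_correct(theta, a, b)
    return a * a * p * (1.0 - p)

cut = 0.0  # ability level corresponding to the pass mark (illustrative)
for b in (-2.0, 0.0, 2.0):  # an easy question, one pitched at the cut, and a hard one
    info = item_information(cut, a=1.0, b=b)
    print(f"difficulty b = {b:+.1f} -> information at the cut score = {info:.3f}")
```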

Effect of question difficulty on question discrimination

The difficulty of questions affects the discrimination (item-total correlation) statistics of the question. Extremely easy or extremely hard questions have a harder time obtaining the high discrimination statistics that we look for. The graph below shows the relationship between question difficulty p-values and item-total correlation discrimination statistics. Notice that the questions (the little diamonds) with very low and very high p-values also have very low discrimination statistics, while those with p-values around 0.5 have the highest discrimination statistics.
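If you want to see this pattern for yourself, here is a rough sketch that computes each question’s p-value and its item-total (point-biserial) correlation from a scored response matrix. The data are fabricated with a simple simulation, and with real data you would normally use the corrected total that excludes the question being analyzed:

```python
import random
import statistics  # statistics.correlation requires Python 3.10+

random.seed(1)

# Fabricated data: 200 participants by 20 questions whose "true" easiness
# runs from very hard to very easy.
num_participants, num_items = 200, 20
easiness = [i / (num_items - 1) for i in range(num_items)]   # 0.0 (hard) .. 1.0 (easy)
ability = [random.gauss(0, 1) for _ in range(num_participants)]

scored = [
    [1 if random.random() < min(0.98, max(0.02, e + 0.15 * ability[p])) else 0
     for e in easiness]
    for p in range(num_participants)
]
totals = [sum(row) for row in scored]

for item in range(num_items):
    item_scores = [row[item] for row in scored]
    p_value = statistics.mean(item_scores)
    if len(set(item_scores)) == 1:
        r = float("nan")  # correlation is undefined if everyone scores the same
    else:
        r = statistics.correlation(item_scores, totals)
    print(f"question {item + 1:2d}: p = {p_value:.2f}, item-total r = {r:.2f}")
```

Questions near the middle of the difficulty range tend to show the strongest item-total correlations, while the very easy and very hard ones have little room to separate high and low scorers.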

Item Analysis Analytics Part 8: Some problematic questions


Posted by Greg Pope

In my last post, I showed a few more examples of item analyses where we drilled down into why some questions had problems. I thought it might be useful to look at a few questions with bad and downright terrible psychometric performance, to show the ugly side of item analysis.

Below is an example of a question that is fairly terrible in terms of psychometric performance. Here are some reasons why:

  • Going from left to right, first we see that the “Number of Results” is 65, which is not so good: there are too few participants in the sample to be able to make sound judgements about the psychometric performance of the question.
  • Next we see that 25 participants didn’t answer the question (“Number not Answered” = 25), which suggests a problem with people not finishing or finding the question confusing and giving up.
  • The “P Value Proportion Correct” shows us that this question is hard, with 20% of participants ‘getting it right.’
  • The “Item Discrimination” indicates very low discrimination, with the difference between the Upper and Lower group in terms of the proportion selecting the correct answer of ‘More than 40’ at only 5% (a sketch of how this index can be computed follows this list). This means that of the participants with the highest overall exam scores, 27% selected the correct answer versus 22% of the participants with the lowest overall exam scores. This is a very small difference between the Upper and Lower groups; participants who know the material should have got the question right more often.
  • The “Item Total Correlation” reflects the Item Discrimination with a negative value of -0.01. A value like this would definitely not meet most organizations’ internal criteria in terms of what is considered an acceptable item. Negative item-total correlations are a major red flag!
  • Finally we look at the Outcome information to see how the distracters perform. We find that participants are all over the map, selecting distracters in an erratic way. When I look at the question wording I realize how vague and arbitrary this question is: the number of questions that should be in an assessment depends on numerous factors and contexts. It is impossible to say that in any context a certain number of questions is required. It looks like the Upper Group is selecting the ‘21-40’ and ‘More than 40’ response options more than the other two options, which offer smaller numbers of questions. This makes sense from a guessing perspective, because in many assessment contexts having more questions rather than fewer is better for reliability.
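For reference, here is a minimal sketch of how an Upper and Lower group discrimination index like the one above can be computed. The grouping rule used here (top and bottom 27% by total score) is a common convention but not necessarily the exact rule the reporting tool applies, and the data are toy values:

```python
def discrimination_index(item_scores, total_scores, group_fraction=0.27):
    """Difference in proportion correct between the top and bottom scoring groups.

    item_scores: 1/0 score per participant for one question.
    total_scores: overall assessment score per participant.
    """
    order = sorted(range(len(total_scores)), key=lambda i: total_scores[i])
    n_group = max(1, int(len(order) * group_fraction))
    lower, upper = order[:n_group], order[-n_group:]
    p_upper = sum(item_scores[i] for i in upper) / n_group
    p_lower = sum(item_scores[i] for i in lower) / n_group
    return p_upper - p_lower

# Toy data: participants who do well overall also tend to get this question right.
item = [1, 1, 1, 0, 1, 0, 0, 1, 0, 0]
total = [95, 88, 90, 40, 75, 55, 35, 80, 45, 50]
print(f"Upper-Lower discrimination = {discrimination_index(item, total):+.2f}")
```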

The psychometricians, SMEs, and test developers reviewing this question would need to send the SME who wrote this question back to basic authoring training to ensure that they know how to write questions that are clear and concise. This question does not really have a correct answer and needs to be re-written to clarify the context and provide many more details to the participants. I would even be tempted to throw out questions along this content line, because how long an assessment should be has no one “right answer.” How long an assessment should be depends on so many things that there will always be room for ambiguity, so it would be quite challenging to write a question that performs well statistically on this topic.

[Screenshot: item analysis report for the question discussed above]

Below is an example of a question that is downright awful in terms of psychometric performance. Here are some reasons why:

  • Going from left to right, first we see that the “Number of Results” is 268, which is really good. That is a nice healthy sample. Nothing wrong here, let’s move on.
  • Next we see that 56 participants didn’t answer the question (“Number not Answered” = 56), which suggests a problem with people not finishing or finding the question confusing and giving up. It gets worse, much, much worse.
  • The “P Value Proportion Correct” shows us that this question is really hard, with 16% of participants ‘getting it right.’
  • The “Item Discrimination” indicates negative discrimination, with the difference between the Upper and Lower group in terms of the proportion selecting the correct answer of ‘44123’ at -23%. This means that of the participants with the highest overall exam scores, 12% selected the correct answer versus 35% of the participants with the lowest overall exam scores. What the heck is going on? This means that participants with the highest overall assessment scores are selecting the correct answer LESS OFTEN than participants with the lowest overall assessment scores. That is not good at all; let’s dig deeper.
  • The “Item Total Correlation” reflects the Item Discrimination with a large negative value of -0.26. This is a clear indication that there is something incredibly wrong with this question.
  • Finally we look at the Outcome information to see how the distracters perform. This is where the true psychometric horror of this question is manifested. There is neither rhyme nor reason here: participants, regardless of their performance on the overall assessment, are all over the place in terms of selecting response options. You might as well have blindfolded everyone taking this question and had them randomly select their answers. This must have been extremely frustrating for the participants who had to take this question and would have likely led to many participants thinking that the organization administering this question did not know what they were doing.

The psychometricians, SMEs, and test developers reviewing this question would need to provide a pink slip to the SME who wrote this question immediately. Clearly the SME failed basic question authoring training. This question makes no sense and was written in such a way as to suggest that the author was under the influence, or otherwise not in a right state of mind, when crafting it. What is this question testing? How can anyone possibly make sense of this and come up with a correct answer? Is there a correct answer? This question is not salvageable and should be stricken from the Perception repository without a second thought. A question like this should never have gotten in front of a participant, let alone 268 participants. The panel reviewing questions should review their processes to ensure that in the future questions like this are weeded out before an assessment goes live.
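Some of that weeding out can be automated before questions ever reach participants again. Here is a hypothetical sketch of screening beta-test item statistics against review thresholds; every cut-off below is illustrative, and each organization would set its own criteria:

```python
def review_flags(stats):
    """Return a list of review flags for one question's item statistics."""
    flags = []
    if stats["n"] < 100:
        flags.append("sample may be too small to judge performance")
    # Assumes "n" counts participants who answered; adjust if your report differs.
    if stats["not_answered"] / (stats["n"] + stats["not_answered"]) > 0.10:
        flags.append("high non-response: possibly confusing or not reached")
    if stats["p_value"] < 0.25:
        flags.append("very hard for this population")
    if stats["p_value"] > 0.90:
        flags.append("very easy: contributes little measurement information")
    if stats["item_total_r"] < 0.0:
        flags.append("negative item-total correlation: check the key and the wording")
    return flags

# Statistics roughly like the second question discussed above (values approximate).
question = {"n": 268, "not_answered": 56, "p_value": 0.16, "item_total_r": -0.26}
for flag in review_flags(question):
    print("FLAG:", flag)
```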

[Screenshot: item analysis report for the question discussed above]

Item Analysis Analytics Part 7: The psychometric good, bad and ugly


Posted by Greg Pope

A few posts ago I showed an example item analysis report for a question that performed well statistically and a question that did not perform well statistically. The latter turned out to be a mis-keyed item. I thought it might be interesting to drill into a few more item analysis cases of questions that have interesting psychometric performance. I hope this will help all of you out there recognize the patterns of the psychometric good, bad and ugly in terms of question performance.

The question below is an example of a question that is borderline in terms of psychometric performance. Here are some reasons why:

  • Going from left to right, first we see that the “Number of Results” is 116, which is a decent sample of participants to evaluate the psychometric performance of this question.
  • Next we see everyone answered the question (“Number not Answered” = 0), which means there probably wasn’t a problem with people not finishing or finding the question confusing and giving up.
  • The “P Value Proportion Correct” shows us that this question is average to easy, with 65% of participants “getting it right.”
  • The “Item Discrimination” indicates mediocre discrimination at best, with the difference between the upper and lower group in terms of the proportion selecting the correct answer of ‘Leptokurtic’ at 20%. This means that of the participants with high overall exam scores, 75% selected the correct answer versus 55% of the participants with the lowest overall exam scores. I would have liked to see a larger difference between the Upper and Lower groups.
  • The “Item Total Correlation” backs the Item Discrimination up with a lacklustre value of 0.14. A value like this would likely not meet many organizations’ internal criteria in terms of what is considered a “good” item.
  • Finally, we look at the Outcome information to see how the distracters perform (a small sketch of this kind of upper/lower distracter breakdown follows this list). We find that each distracter pulls some participants, with ‘Platykurtic’ pulling the most and quite a large share of the Upper group (22%) selecting this distracter. If I were to guess what is happening, I would say that because the correct option and the distracters are so similar, and because this topic is obscure enough that you really need to know your material, participants get confused between the correct answer of ‘Leptokurtic’ and the distracter ‘Platykurtic’.
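Here is the kind of upper/lower distracter breakdown mentioned in the last bullet, as a minimal sketch. The raw choices (and two of the option labels) are invented, and simple top and bottom thirds are used for the groups:

```python
from collections import Counter

# Invented raw data: (total_score, option_selected) for each participant.
responses = [
    (92, "Leptokurtic"), (88, "Leptokurtic"), (85, "Platykurtic"), (81, "Leptokurtic"),
    (60, "Leptokurtic"), (58, "Mesokurtic"), (55, "Platykurtic"), (52, "Leptokurtic"),
    (35, "Platykurtic"), (33, "Normal"),     (30, "Leptokurtic"), (28, "Mesokurtic"),
]

responses.sort(key=lambda r: r[0])           # order participants by total score
n_group = len(responses) // 3                # bottom third and top third
lower = Counter(option for _, option in responses[:n_group])
upper = Counter(option for _, option in responses[-n_group:])

for option in ("Leptokurtic", "Platykurtic", "Mesokurtic", "Normal"):
    print(f"{option:12s}  upper: {upper[option]}/{n_group}   lower: {lower[option]}/{n_group}")
```

A distracter that pulls a sizeable share of the Upper group, as ‘Platykurtic’ does in the real report, is worth a close look at both the wording and the instruction.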

The psychometricians, SMEs, and test developers reviewing this question would need to talk with instructors to find out more about how this topic was taught and understand where the problem lies: Is it a problem with the question wording or a problem with instruction and retention/recall of material? If it is a question wording problem, revisions can be made and the question re-beta tested. If the problem is in how the material is being taught, then instructional coaching can occur and the question re-beta tested as is to see if improvements in the psychometric performance of the question occur.

[Screenshot: item analysis report for the question discussed above]

The question below is an example of a question that has a classic problem. Here are some reasons why it is problematic:

  • Going from left to right, first we see that the “Number of Results” is 175. That is a fairly healthy sample, nothing wrong there.
  • Next we see everyone answered the question (“Number not Answered” = 0), which means there probably wasn’t a problem with people not finishing or finding the question confusing and giving up.
  • The “P Value Proportion Correct” shows us that this question is easy, with 83% of participants ‘getting it right’. There is nothing immediately wrong with an easy question, so let’s look further.
  • The “Item Discrimination” indicates reasonable discrimination, with the difference between the Upper and Lower group in terms of the proportion selecting the correct answer of ‘Cronbach’s Alpha’ at 38%. This means that of the participants with high overall exam scores, 98% selected the correct answer versus 60% of the participants with the lowest overall exam scores. That is a nice difference between the Upper and Lower groups, with almost 100% of the Upper group choosing the correct answer. Obviously, this question is easy for participants who know their stuff!
  • The “Item Total Correlation” backs the Item Discrimination up with a value of 0.39. This value backs up the “Item Discrimination” statistics and would meet most organizations’ internal criteria in terms of what is considered a “good” item.
  • Finally, we look at the Outcome information to see how the distracters perform. Well, two of the distracters don’t pull any participants! This is a waste of good question real estate: participants have to read through four alternatives when there are only two they would even consider as possibly correct (a quick way to flag such non-functioning distracters is sketched after this list).
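One simple check for this is to flag any option chosen by fewer than some minimum share of participants; around 5% is a common rule of thumb, though the exact threshold is a judgment call. A hypothetical sketch, with counts roughly matching the report above (the label of the one working distracter is invented):

```python
# Hypothetical option counts: the correct answer, one working distracter,
# and the two joke distracters that pull nobody.
option_counts = {
    "Cronbach's Alpha": 145,        # keyed correct answer
    "Split-half reliability": 30,   # invented label for the one working distracter
    "Bob's Alpha": 0,
    "KR-1,000,000": 0,
}

MIN_SHARE = 0.05  # illustrative rule of thumb for a "functioning" option
total = sum(option_counts.values())

for option, count in option_counts.items():
    share = count / total
    if share < MIN_SHARE:
        print(f"Non-functioning option ({share:.0%} of participants): {option}")
```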

The psychometricians, SMEs, and test developers reviewing this question would likely ask the SME who developed the question to come up with better distracters that would draw more participants. Clearly, ‘Bob’s Alpha’ is a joke distracter that participants dismiss immediately, as is ‘KR-1,000,000’ (I mean, the Kuder-Richardson formula one million). Let’s get serious here!

[Screenshot: item analysis report for the question discussed above]