I thought it might be fun to discuss something that many people have asked me about over the years: “Should I include really easy or really hard questions on my assessments?” It is difficult to provide a simple “Yes” or “No” answer because, as with so many things in testing, it depends! However, I can provide some food for thought that may help you when building your assessments.
We can define easy questions as those with high p-values (item difficulty statistics) such as 0.9 to 1.0 (90-100% of participants answer the question correctly). We can define hard questions as those with low p-values such as 0.15 to 0 (15-0% answer the question correctly). These ranges are fairly arbitrary: some organizations in some contexts may consider greater than 0.8 easy and less than 0.25 difficult.
When considering how easy or difficult questions should be, start by asking, “What is the purpose of the assessment program and the assessments being developed?” If the purpose of an assessment is to provide a knowledge check and facilitate learning during a course, then maybe a short formative quiz would be appropriate. In this case, one can be fairly flexible in selecting questions to include on the quiz. Having some easier and harder questions is probably just fine. If the purpose of an assessment is to measure a participant’s ability to process information quickly and accurately under duress, then a speed test would likely be appropriate. In that case, a large number of low-difficulty questions should be included on the assessment.
However, in many common situations having very difficult or very easy questions on an assessment may not make a great deal of sense. For a criterion referenced example, if the purpose of an assessment is to certify participants as knowledgeable and skilful enough to do a certain job competently (e.g., crane operation), the difficulty of questions would need careful scrutiny. The exam may have a cut score that participants need to achieve in order to be considered good enough (e.g., 60+%). Here are a few reasons why having many very easy or very hard questions on this type of assessment may not make sense:
Very easy items won’t contribute a great deal to the measurement of the construct
A very easy item that almost every participant gets right doesn’t tell us a great deal about what the participant knows and can do. A question like: “Cranes are big. Yes/No” doesn’t tell us a great deal about whether someone has the knowledge or skills to operate a crane. Very easy questions, in this context, are almost like “give-away” questions that contribute virtually nothing to the measurement of the construct. One would get almost the same measurement information (or lack thereof) from asking a question like “What is your shoe size?” because everyone (or mostly everyone) would get it correct.
Tricky to balance blueprint
Assessment construction generally requires following a blueprint that needs to be balanced in terms of question content, difficulty, and other factors. It is often very difficult to balance these blueprints for all factors, and using extreme questions makes this all the more challenging because there are generally more questions available that are of average rather than extreme difficulty.
Potentially not enough questions providing information near the cut score
In a criterion referenced exam with a cut score of 60% one would want the most measurement information in the exam near this cut score. What do I mean by this? Well, questions with p-values around 0.60 will provide the most information regarding whether participants just have the knowledge and skills to pass or just don’t have the knowledge and skills to pass. This topic requires a more detailed look at assessment development techniques that I will elaborate on soon in an upcoming blog post!
Effect of question difficulty on question discrimination
The difficulty of questions affects the discrimination (item-total correlation) statistics of the question. Extremely easy or extremely hard questions have a harder time obtaining those high discrimination statistics that we look for. In the graph below, I show the relationship between question difficulty p-values and item-total correlation discrimination statistics. Notice that the questions (the little diamonds) that have very low and very high p-values also have very low discrimination statistics and those around 0.5 have the highest discrimination statistics.