Should I include really easy or really hard questions on my assessments?


Posted by Greg Pope

I thought it might be fun to discuss something that many people have asked me about over the years: “Should I include really easy or really hard questions on my assessments?” It is difficult to provide a simple “Yes” or “No” answer because, as with so many things in testing, it depends! However, I can provide some food for thought that may help you when building your assessments.

We can define easy questions as those with high p-values (item difficulty statistics) such as 0.9 to 1.0 (90-100% of participants answer the question correctly). We can define hard questions as those with low p-values such as 0.15 to 0 (15-0% answer the question correctly). These ranges are fairly arbitrary: some organizations in some contexts may consider greater than 0.8 easy and less than 0.25 difficult.
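To make the p-value concrete, here is a minimal sketch (not Questionmark's implementation; the data and cut-offs are just the illustrative ranges above) that computes p-values from a 0/1 response matrix and labels items:

```python
# Compute item p-values (proportion correct) from a 0/1 response matrix
# and classify items using the (admittedly arbitrary) ranges above.
def p_values(responses):
    """responses: list of participant rows, each a list of 0/1 item scores."""
    n = len(responses)
    n_items = len(responses[0])
    return [sum(row[i] for row in responses) / n for i in range(n_items)]

def classify(p, easy=0.9, hard=0.15):
    if p >= easy:
        return "easy"
    if p <= hard:
        return "hard"
    return "moderate"

# Example: 4 participants, 3 questions
responses = [
    [1, 1, 0],
    [1, 0, 0],
    [1, 1, 0],
    [1, 1, 1],
]
ps = p_values(responses)            # [1.0, 0.75, 0.25]
labels = [classify(p) for p in ps]  # ['easy', 'moderate', 'moderate']
```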

When considering how easy or difficult questions should be, start by asking, “What is the purpose of the assessment program and the assessments being developed?” If the purpose of an assessment is to provide a knowledge check and facilitate learning during a course, then maybe a short formative quiz would be appropriate. In this case, one can be fairly flexible in selecting questions to include on the quiz. Having some easier and harder questions is probably just fine. If the purpose of an assessment is to measure a participant’s ability to process information quickly and accurately under time pressure, then a speed test would likely be appropriate. In that case, a large number of low-difficulty questions should be included on the assessment.

However, in many common situations having very difficult or very easy questions on an assessment may not make a great deal of sense. For a criterion-referenced example, if the purpose of an assessment is to certify participants as knowledgeable and skilful enough to do a certain job competently (e.g., crane operation), the difficulty of questions would need careful scrutiny. The exam may have a cut score that participants need to achieve in order to be considered good enough (e.g., 60+%). Here are a few reasons why having many very easy or very hard questions on this type of assessment may not make sense:

Very easy items won’t contribute a great deal to the measurement of the construct

A very easy item that almost every participant gets right doesn’t tell us a great deal about what the participant knows and can do. A question like: “Cranes are big. Yes/No” doesn’t tell us a great deal about whether someone has the knowledge or skills to operate a crane. Very easy questions, in this context, are almost like “give-away” questions that contribute virtually nothing to the measurement of the construct. One would get almost the same measurement information (or lack thereof) from asking a question like “What is your shoe size?” because everyone (or mostly everyone) would get it correct.

Tricky to balance blueprint

Assessment construction generally requires following a blueprint that needs to be balanced in terms of question content, difficulty, and other factors. It is often very difficult to balance these blueprints for all factors, and using extreme questions makes this all the more challenging because there are generally more questions available that are of average rather than extreme difficulty.

Potentially not enough questions providing information near the cut score

In a criterion-referenced exam with a cut score of 60%, one would want the most measurement information in the exam near this cut score. What do I mean by this? Well, questions with p-values around 0.60 will provide the most information regarding whether participants just have the knowledge and skills to pass or just don’t have the knowledge and skills to pass. This topic requires a more detailed look at assessment development techniques that I will elaborate on soon in an upcoming blog post!
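One simple way to act on this during test assembly is to rank a question bank by how close each item's p-value sits to the cut score. This is a hedged sketch (the bank data is invented for illustration, and real assembly would also weigh content balance):

```python
# Rank items by distance of their p-value from the cut score, so items
# near 0.60 can be favoured when building a criterion-referenced exam.
def nearest_to_cut(item_p_values, cut=0.60):
    """Return item indices sorted by |p-value - cut|, closest first."""
    return sorted(range(len(item_p_values)),
                  key=lambda i: abs(item_p_values[i] - cut))

bank = [0.95, 0.62, 0.30, 0.58, 0.10]
order = nearest_to_cut(bank)   # items 1 and 3 (p = 0.62, 0.58) come first
```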

Effect of question difficulty on question discrimination

The difficulty of questions affects the discrimination (item-total correlation) statistics of the question. Extremely easy or extremely hard questions rarely achieve the high discrimination statistics that we look for. In the graph below, I show the relationship between question difficulty p-values and item-total correlation discrimination statistics. Notice that the questions (the little diamonds) with very low and very high p-values also have very low discrimination statistics, while those around 0.5 have the highest discrimination statistics.

Psychometrics 101: How do I know if an assessment is reliable? (Part 2)


Posted by Greg Pope

In my last post I offered some general information about assessment reliability. Below are some additional specific things to consider.

  • What factors / test characteristics generally influence internal consistency reliability coefficient values?

A.    Item difficulty: Items that are extremely hard or extremely easy affect discrimination and therefore reliability. If a large number of participants do not have time to finish the test, this also affects item difficulty
B.    Item discrimination: Items with higher discrimination values will contribute more to the measurement efficacy of the assessment (more discriminating questions = higher reliability). Part of this relates to sound question development: well-crafted, unambiguously worded questions are more likely to have acceptable discrimination
C.    Construct being measured: If all questions measure the same construct (e.g., come from the same topic), reliability will be increased
D.    How many participants took the test: With very small numbers of participants, the reliability coefficient will be less stable
E.    Composition of the people who took the test: If the sample of participants taking an assessment is not representative (e.g., no one studied!), the reliability will be negatively impacted
F.    How many questions are administered: Generally, the more questions administered, the higher the reliability (to a point; we can’t have a 10,000-question test!)
G.    Environmental administration factors: Conditions in the testing area such as noise, lighting levels, etc. can cause distraction away from the measurement of what the participants know and can do
H.    Person factors: Test anxiety, fatigue, and other human factors can reduce the accuracy of measurement of what people know and can do
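Factor F, test length, has a classic formula behind it: the Spearman-Brown prophecy formula predicts the reliability of a test lengthened (or shortened) by a factor n with comparable questions. A minimal sketch:

```python
# Spearman-Brown prophecy formula: predicted reliability when a test is
# lengthened by a factor n using questions of comparable quality.
def spearman_brown(reliability, n):
    return (n * reliability) / (1 + (n - 1) * reliability)

# Doubling a test whose scores have reliability 0.70:
predicted = spearman_brown(0.70, 2)   # ~0.82
```

Note the diminishing returns: each doubling buys less additional reliability, which is why a 10,000-question test is not the answer.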


For more on this subject see the Questionmark White Paper, “Defensible Assessments: What You Need to Know”

Psychometrics 101: How do I know if my assessment is reliable? (Part 1)


Posted by Greg Pope

At last week’s Questionmark Users Conference I presented a session on item and test analysis, and part of that session dealt with test score reliability.

“Reliability” is used in everyday language: “My car runs reliably” means it starts every time. In the assessment realm we talk about test score reliability, which refers to how consistently and accurately test scores measure a construct (knowledge/skills in the domain of interest such as “American History Knowledge”).

Assessments are measurement instruments; the questions composing the assessment take measurements of what people know and can do. Just as thermometers take measurements of temperature, assessment questions take measurements of psychological attributes. Like any measurement instrument, there is some imprecision in the estimates, so the test score that a person obtains (observed score) is actually composed of a theoretical “true score” (what they actually really know and can do) plus some error. Reliable test scores have the least amount of error and therefore the smallest difference between the observed score and this theoretical true score. It is hard to go into a great deal of detail here, so for a good primer on the theory check out: Traub, R.E. (1994). Reliability for the Social Sciences: Theory & Applications. Thousand Oaks: Sage.
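The observed = true + error idea can be made concrete with two standard classical test theory quantities: reliability as the proportion of observed-score variance that is true-score variance, and the standard error of measurement (SEM) as the typical size of the error around a score. A sketch, with invented variance numbers for illustration:

```python
import math

# Classical test theory sketch: observed score = true score + error.
# Reliability is the share of observed-score variance due to true scores;
# the SEM turns that into an error band in score points.
def reliability(true_var, error_var):
    return true_var / (true_var + error_var)

def sem(sd_observed, rel):
    return sd_observed * math.sqrt(1 - rel)

rel = reliability(true_var=80.0, error_var=20.0)  # 0.80
band = sem(sd_observed=10.0, rel=rel)             # ~4.47 score points
```

The more reliable the scores, the smaller the SEM, i.e., the tighter the band between what a participant scored and their theoretical true score.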

Generally there are four approaches for measuring reliability:

1.    Internal consistency: How well do items on the test “hang together” to measure the same psychological attribute?

2.    Split-half (split-forms): How well do scores on two forms (splits) of the test (e.g., first 25 items versus last 25) relate to one another?

3.    Test-retest: How similar are scores obtained from multiple administrations of the same test?

4.    Inter-rater reliability: How consistently do two or more raters (e.g., essay markers) obtain similar scores?
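The split-half approach is easy to sketch in code. This example splits items odd/even rather than first half versus last half (the odd/even split is less affected by fatigue or running out of time), correlates the two half-scores, and then steps the result up to full-test length with the Spearman-Brown correction; the score matrix is invented for illustration:

```python
# Split-half reliability sketch: correlate scores on two halves of the
# test (odd vs even items), then apply the Spearman-Brown correction
# because each half is only half the length of the full test.
def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

def split_half_reliability(scores):
    """scores: rows = participants, columns = 0/1 item scores."""
    odd = [sum(row[0::2]) for row in scores]   # items 1, 3, 5, ...
    even = [sum(row[1::2]) for row in scores]  # items 2, 4, 6, ...
    r = pearson(odd, even)
    return 2 * r / (1 + r)                     # Spearman-Brown step-up

scores = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]
rel = split_half_reliability(scores)
```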

Internal consistency reliability is common and is used in our Test Analysis Report and Results Management System, where we use Cronbach’s Alpha.
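For the curious, Cronbach's Alpha can be computed from nothing more than the item score variances and the total score variance. A minimal sketch (not the Questionmark implementation; the score matrix is invented for illustration):

```python
# Cronbach's Alpha from a score matrix (rows = participants,
# columns = item scores), using sample variances throughout.
def variance(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

def cronbach_alpha(scores):
    k = len(scores[0])                                           # items
    item_vars = [variance([row[i] for row in scores]) for i in range(k)]
    total_var = variance([sum(row) for row in scores])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)

scores = [
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
    [1, 1, 1, 1],
]
alpha = cronbach_alpha(scores)   # 0.8 for this toy data
```

Intuitively, alpha rises when items covary (total-score variance grows faster than the sum of item variances), i.e., when the items “hang together.”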

Stay tuned for Part 2 of this post, which will discuss the factors and test characteristics that generally influence internal consistency reliability coefficient values.

Psychometrics 101: Item Total Correlation


Posted by Greg Pope

I’ll be talking about a subject dear to my heart — psychometrics — at the Questionmark Users Conference April 5-8. Here’s a sneak preview of one of my topics: item-total correlation! What is it, and what does it mean?

The item-total correlation is a correlation between the question score (e.g., 0 or 1 for multiple choice) and the overall assessment score (e.g., 67%). It is expected that participants who get a question correct should, in general, have higher overall assessment scores than participants who get the question wrong. Similarly, with essay-type questions scored between, say, 0 and 5, participants who did a really good job on the essay (got a 4 or 5) should have higher overall assessment scores (maybe 85-90%). This relationship is shown in an example graph below.


This relationship in psychometrics is called ‘discrimination’, referring to how well a question differentiates between participants who know the material and those who do not. Participants who have mastered the material taught to them should get high scores on questions and high overall assessment scores; participants who have not should get low scores on questions and lower overall assessment scores. The item-total correlation quantifies this relationship and helps evaluate the performance of questions. We want to have lots of highly discriminating questions on our tests because they are the most fine-tuned measurements of what participants know and can do. When looking at an item-total correlation, negative values are a major red flag: it is unexpected for participants who get low scores on a question to get high scores on the assessment. This could indicate a mis-keyed question, or a question that was highly ambiguous and confusing to participants. Values for an item-total correlation (point-biserial) between 0 and 0.19 may indicate that the question is not discriminating well, values between 0.2 and 0.39 indicate good discrimination, and values of 0.4 and above indicate very good discrimination.
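The computation itself is just a Pearson correlation between a 0/1 item column and the total scores. Here is a minimal, uncorrected sketch (it correlates against the total including the item itself; the rule-of-thumb labels follow the thresholds above, and the example data is invented):

```python
# Point-biserial item-total correlation for one 0/1 item, with the
# rule-of-thumb discrimination labels from the thresholds above.
def point_biserial(item_scores, total_scores):
    n = len(item_scores)
    mi = sum(item_scores) / n
    mt = sum(total_scores) / n
    cov = sum((i - mi) * (t - mt) for i, t in zip(item_scores, total_scores))
    si = sum((i - mi) ** 2 for i in item_scores) ** 0.5
    st = sum((t - mt) ** 2 for t in total_scores) ** 0.5
    return cov / (si * st)

def label(r):
    if r < 0:
        return "red flag: check the key"
    if r < 0.2:
        return "not discriminating well"
    if r < 0.4:
        return "good discrimination"
    return "very good discrimination"

# Participants who got the item right also scored higher overall:
r = point_biserial([1, 1, 0, 0], [90, 80, 55, 40])
verdict = label(r)   # "very good discrimination"
```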