Understanding Assessment Validity and Reliability

julie-smallPosted by Julie Chazyn

Assessments are not all created equal…Those that are both reliable and valid are the superior ones that support learning and measure knowledge most effectively.  But how can  authors make sure they are producing valid, reliable assessments?

I picked up some tips about this in revisiting the Questionmark White Paper, Assessments through the Learning Process.

So, what is a reliable assessment? One that  works consistently. If a survey indicates that employees are satisfied with a course of instruction, it should show the same result if administered three days later. (This type of reliability is called test-retest reliability.) If a course instructor rates employees taking a performance test, their scores should be the same as if any other course instructor scored their performances. (This is called inter-rater reliability.)

And what is a valid  assessment? One that measures what it is supposed to measure. If a test or survey is administered to happy people, the results should show that they’re all happy. Similarly if a group of people who are all knowledgeable are tested, the test results should reveal that they’re all knowledgeable.

If an assessment is valid, it looks like the job, and the content aligns with the tasks of the job in the eyes of job experts. This type of validity is known as Content Validity. In order to insure this validity, the assessment author must first undertake a job task analysis, surveying subject matter experts (SMEs) or people on the job to determine what knowledge and skills are needed to perform job-related tasks. That information makes it possible to produce a valid test.

Good assessments are both reliable and valid. If we gave a vocabulary test twice to a group of nurses, and the scores came back exactly the same way both times, the test would be considered highly reliable. However, this reliability does not mean that the test is valid. To be valid, it would need  to measure nursing competence in addition to being reliable.

Imagine administering a test of nursing skills to a group of skilled and unskilled nurses and the scores for each examinee are different each time. The test is clearly unreliable. If it’s not reliable, it cannot be valid; fluctuating scores for the same test takers cannot be measuring anything in particular. So the test is both unreliable and invalid. The reliable and valid test of nursing skills is one that yields similar scores every time it is given to the same group of test takers and discriminates every time between good and incompetent nurses. It is consistent and it measures what it is supposed to measure.

Assessments that are both reliable and valid hit the bullseye!


For more detail on validity and reliability, check out another of our white papers, Defensible Assessments: What You Need to Know.

Psychometrics 101: How do I know if an assessment is reliable? (Part 2)


Posted by Greg Pope

In my last post I offered some general information about assessment reliability. Below are some additional specific things to consider.

  • What factors / test characteristics generally influence internal consistency reliability coefficient values?

A.    Item difficulty: Items that are extremely hard or extremely easy affect discrimination and therefore reliability. If a large number of participants do not have time to finish the test this affects item difficulty
B.    Item discrimination: Items that have higher discrimination values will contribute more to the measurement efficacy of the assessment (more discriminating questions = higher reliability). Part of this relates to sound question development, if questions are well crafted and non-ambiguously worded they are more likely to have acceptable discrimination
C.    Construct being measured: If all questions are measuring the same construct (e.g., from the same topic) reliability will be increased
D.    How many participants took the test: With very small numbers of participants the reliability coefficient will be less stable
E.    Composition of people that took the test: If the sample of participants taking an assessment is not representative (e.g., no-one studied!) the reliability will be negatively impacted
F.    How many questions are administered: Generally the more questions administered the higher the reliability (to a point, we can’t have a 10,000 question test!)
G.    Environmental administration factors: Conditions in the testing area such as noise, lighting levels, etc. can cause distraction away from the measurement of what the participants know and can do
H.    Person factors: Test anxiety, fatigue, and other human factors can reduce the accuracy of measurement of what people know and can do


For more on this subject see the Questionmark White Paper, “Defensible Assessments: What You Need to Know”

Psychometrics 101: How do I know if my assessment is reliable? (Part 1)


Posted by Greg Pope

At last week’s Questionmark Users Conference I presented a session on item and test analysis, and part of that session dealt with test score reliability.

“Reliability” is used in everyday language: “My car runs reliably” means it starts every time. In the assessment realm we talk about test score reliability, which refers to how consistently and accurately test scores measure a construct (knowledge/skills in the domain of interest such as “American History Knowledge”).

Assessments are measurement instruments; the questions composing the assessment take measurements of what people know and can do. Just as thermometers take measurements of temperature, assessment questions take measurements of psychological attributes. Like any measurement instrument, there is some imprecision in the estimates, so the test score that a person obtains (observed score) is actually composed of a theoretical “true score” (what they actually really know and can do) plus some error. Reliable test scores have the least amount of error and therefore the smallest difference between the observed score and this theoretical true score.It is hard to go into a great deal of detail here, so for a good primer into the theory check out: Traub, R.E. (1994). Reliability for the Social Sciences: Theory & Applications. Thousand Oaks: Sage.

Generally there are four approaches for measuring reliability:

1.    Internal consistency: How well do items on the test “hang together” to measure the same psychological attribute

2.    Split-half (split-forms): How well do scores on two forms (splits) of the test (first 25 items versus last 25) relate to one another

3.    Test-retest: How similar are scores obtained from multiple administrations of the same test

4.    Inter-rater reliability: How consistently do two or more raters (essay markers) obtain similar scores.

Internal consistency reliability is common and is used in our Test Analysis Report and Results Management System ,where we use Cronbach’s Alpha.

Stay tuned for Part 2 of this post, which will discuss the factors and test characteristics that generally influence internal consistency reliability coefficient values.

Psychometrics 101: Sample size and question difficulty (p-values)


Posted by Greg Pope

With just a week to go before the Questionmark Users Conference, here’s a little taste of the presentation I will be doing on  psychometrics. I will also be running a session on Item Analysis and Test Analysis.

So, let’s talk about sample size and question difficulty!

How does the number of participants that take a question relate to the robustness/stability of the question difficulty statistic (p-value)? Basically the smaller the number of participants tested the less robust/stable the statistic. So if 30 participants take a question and the p-value that appears in the Item Analysis Report is 0.600 the range that the theoretical “true” p-value (if all participants in the world took the question) could fall into 95% of the time is between 0.775 and 0.425. This means that if another 30 participants were tested you could get a p-value on the Item Analysis Report anywhere from 0.775 to 0.425 (95% confidence range). The take away is that if high stakes decisions are being made using p-values (e.g., whether to drop a question from a certification exam) the more participants that can be tested the better to get more robust results. Another example is that if you are conducting beta testing and you want to know which questions to include in your test form based on the beta test results the more participants you can beta test the better in terms of the confidence you will have in the stability of the statistics. Below is a graph that illustrates this relationship.sample-size-influences-p-value-chart1

This relationship between sample size and the stability of other statistics applies to other common statistics used in psychometrics. For example the item-total correlation (point biserial correlation coefficient) can vary a great deal when small sample sizes are used to calculate it. In the example below we see that an observed correlation of 0 can actual vary by over 0.8 (plus or minus).sample-sixe-influences-chart1

Psychometrics 101: Item Total Correlation


Posted by Greg Pope

I’ll be talking about a subject dear to my heart — psychometrics — at the Questionmark Users Conference April 5 -8. Here’s a sneak preview on one of my topics: item total correlation! What is it, and what does it mean?

The item total correlation is a correlation between the question score (e.g., 0 or 1 for multiple choice) and the overall assessment score (e.g., 67%). It is expected that if a participant gets a question correct they should, in general, have higher overall assessment scores than participants who get a question wrong. Similarly with essay type question scoring where a question could be scored between 0 and 5 participants who did a really good job on the essay (got a 4 or 5) should have higher overall assessment scores (maybe 85-90%). This relationship is shown in an example graph below.


This relationship in psychometrics is called ‘discrimination’ referring to how well a question differentiates between participants who know the material and those that do not know the material. Participants who know the material taught to them should get high scores on questions and high overall assessment scores. Participants who did not master the material should get low scores on questions and lower overall assessment scores. This is the relationship that an item-total correlation provides to help evaluate the performance of questions. We want to have lots of highly discriminating questions on our tests because they are the most fine-tuned measurements to find out what participants know and can do. When looking at an item-total correlation generally negative values are a major red flag it is unexpected that participants who get low scores on the questions get high scores on the assessment. This could indicate a mis-keyed question or that the question was highly ambiguous and confusing to participants. Values for an item-total correlation (point-biserial) between 0 and 0.19 may indicate that the question is not discriminating well, values between 0.2 and 0.39 indicate good discrimination, and values 0.4 and above indicate very good discrimination.