Understanding Assessment Validity and Reliability

Posted by Julie Chazyn

Assessments are not all created equal. Those that are both reliable and valid support learning and measure knowledge most effectively. But how can authors make sure they are producing valid, reliable assessments?

I picked up some tips about this in revisiting the Questionmark White Paper, Assessments through the Learning Process.

So, what is a reliable assessment? One that works consistently. If a survey indicates that employees are satisfied with a course of instruction, it should show the same result if administered three days later. (This type of reliability is called test-retest reliability.) If a course instructor rates employees taking a performance test, their scores should be the same as if any other course instructor had scored their performances. (This is called inter-rater reliability.)
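As a rough illustration of both ideas, one common way to put a number on consistency is to correlate the two sets of scores. The sketch below is only that: the ratings and scores are made-up example data, the `pearson` helper is written here for illustration, and a correlation is just one of several possible reliability statistics (it will not, for instance, flag a rater who is consistently a few points harsher than another).

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

# Hypothetical satisfaction ratings (1-5) from the same five employees,
# collected three days apart.
first_administration = [4, 5, 3, 4, 2]
second_administration = [4, 5, 3, 5, 2]
test_retest = pearson(first_administration, second_administration)

# Hypothetical performance-test scores for the same five employees,
# as given by two different course instructors.
rater_a = [78, 85, 62, 90, 71]
rater_b = [80, 83, 65, 88, 70]
inter_rater = pearson(rater_a, rater_b)

print(f"test-retest correlation: {test_retest:.2f}")
print(f"inter-rater correlation: {inter_rater:.2f}")
```

Values close to 1 suggest the two administrations (or the two raters) rank people in much the same way; values near 0 suggest they do not.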

And what is a valid assessment? One that measures what it is supposed to measure. If a test or survey is administered to happy people, the results should show that they’re all happy. Similarly, if a group of people who are all knowledgeable are tested, the test results should reveal that they’re all knowledgeable.

If an assessment is valid, it looks like the job, and the content aligns with the tasks of the job in the eyes of job experts. This type of validity is known as Content Validity. In order to ensure this validity, the assessment author must first undertake a job task analysis, surveying subject matter experts (SMEs) or people on the job to determine what knowledge and skills are needed to perform job-related tasks. That information makes it possible to produce a valid test.

Good assessments are both reliable and valid. If we gave a vocabulary test twice to a group of nurses, and the scores came back exactly the same both times, the test would be considered highly reliable. However, this reliability does not mean that the test is valid. To be valid, it would need to measure nursing competence in addition to being reliable.

Imagine administering a test of nursing skills to a group of skilled and unskilled nurses and finding that the scores for each examinee are different each time. The test is clearly unreliable. If it’s not reliable, it cannot be valid; fluctuating scores for the same test takers cannot be measuring anything in particular. So the test is both unreliable and invalid. The reliable and valid test of nursing skills is one that yields similar scores every time it is given to the same group of test takers and discriminates every time between good and incompetent nurses. It is consistent and it measures what it is supposed to measure.

Assessments that are both reliable and valid hit the bullseye!


For more detail on validity and reliability, check out another of our white papers, Defensible Assessments: What You Need to Know.

Psychometrics 101: How do I know if my assessment is reliable? (Part 1)


Posted by Greg Pope

At last week’s Questionmark Users Conference I presented a session on item and test analysis, and part of that session dealt with test score reliability.

“Reliability” is used in everyday language: “My car runs reliably” means it starts every time. In the assessment realm we talk about test score reliability, which refers to how consistently and accurately test scores measure a construct (knowledge/skills in the domain of interest such as “American History Knowledge”).

Assessments are measurement instruments; the questions composing the assessment take measurements of what people know and can do. Just as thermometers take measurements of temperature, assessment questions take measurements of psychological attributes. Like any measurement instrument, there is some imprecision in the estimates, so the test score that a person obtains (observed score) is actually composed of a theoretical “true score” (what they really know and can do) plus some error. Reliable test scores have the least amount of error and therefore the smallest difference between the observed score and this theoretical true score. It is hard to go into a great deal of detail here, so for a good primer on the theory check out: Traub, R.E. (1994). Reliability for the Social Sciences: Theory & Applications. Thousand Oaks: Sage.
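The “observed score = true score + error” idea can be made concrete with a small simulation. The sketch below is illustrative only: it assumes normally distributed true scores and errors, uses made-up means and standard deviations, and relies on the classical test theory convention of expressing reliability as the share of observed-score variance that comes from true scores. The function name `simulated_reliability` is hypothetical.

```python
import random
from statistics import pvariance

def simulated_reliability(error_sd, n_people=2000, seed=0):
    """Simulate observed = true + error and return var(true) / var(observed)."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    # Assumed true scores: mean 70, standard deviation 10.
    true_scores = [rng.gauss(70, 10) for _ in range(n_people)]
    # Each observed score is the true score plus random measurement error.
    observed = [t + rng.gauss(0, error_sd) for t in true_scores]
    return pvariance(true_scores) / pvariance(observed)

low_error = simulated_reliability(error_sd=3)    # a precise instrument
high_error = simulated_reliability(error_sd=10)  # a noisy instrument

print(f"reliability with small error: {low_error:.2f}")
print(f"reliability with large error: {high_error:.2f}")
```

The smaller the error component, the closer the ratio gets to 1, which matches the point that reliable scores sit closest to the theoretical true score. In practice true scores cannot be observed directly, which is why reliability has to be estimated from the observed data.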

Generally there are four approaches for measuring reliability:

1.    Internal consistency: How well do items on the test “hang together” to measure the same psychological attribute?

2.    Split-half (split-forms): How well do scores on two forms (splits) of the test (for example, the first 25 items versus the last 25) relate to one another?

3.    Test-retest: How similar are scores obtained from multiple administrations of the same test?

4.    Inter-rater reliability: How consistently do two or more raters (for example, essay markers) arrive at similar scores?

Internal consistency reliability is common and is used in our Test Analysis Report and Results Management System, where we use Cronbach’s Alpha.
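For readers who want to see what goes into Cronbach’s Alpha, here is a minimal sketch using the standard textbook formula: alpha = k/(k-1) × (1 - sum of item variances / variance of total scores), where k is the number of items. This is not the code behind the reports mentioned above; the `cronbach_alpha` function and the response data are made up for illustration.

```python
from statistics import variance

def cronbach_alpha(responses):
    """Cronbach's alpha for a table of scores.

    responses: one row per test taker, each row a list of item scores.
    """
    k = len(responses[0])  # number of items
    # Variance of each item across test takers.
    item_variances = [variance([row[i] for row in responses]) for i in range(k)]
    # Variance of each test taker's total score.
    total_variance = variance([sum(row) for row in responses])
    return (k / (k - 1)) * (1 - sum(item_variances) / total_variance)

# Hypothetical data: six test takers, four items scored 0 (wrong) or 1 (right).
responses = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
    [1, 1, 1, 1],
]

alpha = cronbach_alpha(responses)
print(f"Cronbach's alpha: {alpha:.2f}")
```

When the items “hang together”, people who do well on one item tend to do well on the others, so the variance of the total scores is large relative to the sum of the item variances and alpha moves toward 1.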

Stay tuned for Part 2 of this post, which will discuss the factors and test characteristics that generally influence internal consistency reliability coefficient values.

Delivering Assessments Securely: What are the stakes?

Posted by Joan Phaup

Indeed, that is the question!

Your decisions about assessment security need to start with the stakes! The “assessment pyramid” in our white paper, Delivering Assessments Safely and Securely, puts various types of assessments in perspective to help you make solid decisions about delivery and security requirements.

The paper describes various types of assessments and explains numerous delivery options so that you can select appropriate methods for deploying each type of assessment safely, securely, and cost-effectively. It is designed to help you avoid over-engineering low-stakes assessments (which brings unnecessary costs and wastes time) or under-engineering high-stakes assessments (which can undermine an assessment’s face validity, not to mention people’s confidence in a testing program).

We recently updated this paper to help organizations address security issues related to current technology. It also includes pointers on protecting intellectual property, discouraging cheating, and using assessment technologies to provide different levels of security. So feel free to download the paper and put these ideas into practice!