Understanding Assessment Validity and Reliability

Posted by Julie Chazyn

Assessments are not all created equal. Those that are both reliable and valid support learning and measure knowledge most effectively. But how can authors make sure they are producing valid, reliable assessments?

I picked up some tips about this in revisiting the Questionmark White Paper, Assessments through the Learning Process.

So, what is a reliable assessment? One that works consistently. If a survey indicates that employees are satisfied with a course of instruction, it should show the same result if administered three days later. (This type of reliability is called test-retest reliability.) If a course instructor rates employees taking a performance test, their scores should be the same as if any other course instructor had scored their performances. (This is called inter-rater reliability.)
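As a rough sketch (not from the white paper), test-retest reliability is commonly estimated as the Pearson correlation between scores from the two administrations. The satisfaction scores below are hypothetical:

```python
# Illustrative sketch: test-retest reliability as the Pearson correlation
# between two administrations of the same assessment to the same group.
from statistics import mean, stdev

def pearson_r(x, y):
    """Pearson correlation between two equal-length score lists."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (len(x) - 1)
    return cov / (stdev(x) * stdev(y))

# Hypothetical 1-5 satisfaction ratings from the same six employees,
# collected three days apart
first = [4, 5, 3, 4, 2, 5]
second = [4, 4, 3, 5, 2, 5]
print(round(pearson_r(first, second), 2))  # 0.85 -- high test-retest reliability
```

A coefficient near 1.0 indicates the instrument produced consistent results across the two administrations; values much lower would suggest the scores are unstable.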

And what is a valid assessment? One that measures what it is supposed to measure. If a test or survey is administered to happy people, the results should show that they’re all happy. Similarly, if a group of people who are all knowledgeable are tested, the test results should reveal that they’re all knowledgeable.

If an assessment is valid, it looks like the job, and the content aligns with the tasks of the job in the eyes of job experts. This type of validity is known as content validity. To ensure this validity, the assessment author must first undertake a job task analysis, surveying subject matter experts (SMEs) or people on the job to determine what knowledge and skills are needed to perform job-related tasks. That information makes it possible to produce a valid test.

Good assessments are both reliable and valid. If we gave a vocabulary test twice to a group of nurses, and the scores came back exactly the same both times, the test would be considered highly reliable. However, this reliability does not mean that the test is valid. To be valid as a measure of nursing competence, it would need to measure nursing competence in addition to being reliable.

Imagine administering a test of nursing skills to a group of skilled and unskilled nurses, and suppose the scores for each examinee are different each time. The test is clearly unreliable. If it’s not reliable, it cannot be valid; fluctuating scores for the same test takers cannot be measuring anything in particular. So the test is both unreliable and invalid. A reliable and valid test of nursing skills is one that yields similar scores every time it is given to the same group of test takers and discriminates every time between competent and incompetent nurses. It is consistent, and it measures what it is supposed to measure.

Assessments that are both reliable and valid hit the bullseye!


For more detail on validity and reliability, check out another of our white papers, Defensible Assessments: What You Need to Know.

Sharon Shrock & Bill Coscarelli Interview on Criterion-Referenced Testing

Posted by Joan Phaup

I enjoyed talking recently with Sharon Shrock and Bill Coscarelli, who spoke at the Questionmark 2009 Users Conference. Their keynote address covered 25 years of progress in criterion-referenced test development and gave everyone at the conference some excellent background on this increasingly important subject.

I had some questions for them about this topic and am happy to share their answers in this podcast.

Licensing Open Standards: What Can We Learn From Open Source?

Posted by Steve Lay

At the recent Questionmark Users Conference I gave an introductory talk on open source software in learning, education and training. When preparing for the talk, it really came home to me how important the work of the Open Source Initiative (OSI) and Creative Commons is. These organizations take a very complex subject, namely the licensing of intellectual property, and distill it into a small set of common licenses that can be widely understood.

I’ve always been an advocate of distributing technical standards under these standard licenses where possible. Standard licenses allow developers who use them to be confident of the legal foundations of their work without a cumbersome process of evaluating each license case by case. So I was delighted to see an excellent blog post by Chuck Allen from the HR-XML consortium discussing this issue and providing detailed analysis of several such licenses, highlighting the different approaches taken by several consortia.

The community reaction to the temporary withdrawal of the draft QTI specification has already been discussed by John Kleeman in this blog, Why QTI Really Matters.  What struck me in that case was that there was uncertainty amongst community members surrounding the license and the impact of the withdrawal on their rights to develop and maintain software based on the draft.

This problem is not unique to e-learning, as Chuck Allen demonstrates with his analysis of the licenses used by the organizations he studied in the related HR field. I’d echo his call for more convergence on the licenses used for technical standards. In fact, I’d go further. The W3C publishes much of the core work on which the other standards rely, for example, HTML used for web pages and XML used by almost all modern standards initiatives. Wouldn’t adopting the same approach be the simplest way to license open standards based on these technologies?

Just as organizations like GNU, BSD, MIT and Apache have given their names to commonly used open source code licenses, I look forward to a time when I can choose the “W3C” open standards license and everyone will know what I mean.

Defining Assessment Terms: Tools for Getting the Right Results

Posted by Julie Chazyn

In creating good, solid surveys, quizzes, tests and exams, it’s essential to understand what type of assessment will give you appropriate and actionable results. We believe the ultimate objective of the assessment directly influences how it will be structured. This requires understanding the subtle distinctions that can mean big differences in the quality and outcomes of your assessments. The language we use in talking about assessments needs to reflect those distinctions.

With that in mind, Questionmark CEO Eric Shepherd recently took some time to update Questionmark’s UK and US glossaries to help people understand different types of assessments.

Some of the terms that have been altered include:

Diagnostic assessment
Personality assessment
Psychological assessment
Summative assessment

We hope you will bookmark the glossary and refer back to it often!

Psychometrics 101: How do I know if an assessment is reliable? (Part 2)


Posted by Greg Pope

In my last post I offered some general information about assessment reliability. Below are some additional specific things to consider.

  • What factors / test characteristics generally influence internal consistency reliability coefficient values?

A. Item difficulty: Items that are extremely hard or extremely easy reduce discrimination and therefore reliability. If a large number of participants do not have time to finish the test, this affects item difficulty.
B. Item discrimination: Items with higher discrimination values contribute more to the measurement efficacy of the assessment (more discriminating questions = higher reliability). Part of this relates to sound question development: questions that are well crafted and unambiguously worded are more likely to have acceptable discrimination.
C. Construct being measured: If all questions measure the same construct (e.g., come from the same topic), reliability will be higher.
D. How many participants took the test: With very small numbers of participants, the reliability coefficient will be less stable.
E. Composition of the people who took the test: If the sample of participants taking an assessment is not representative (e.g., no one studied!), reliability will be negatively affected.
F. How many questions are administered: Generally, the more questions administered, the higher the reliability (up to a point; we can’t have a 10,000-question test!).
G. Environmental administration factors: Conditions in the testing area, such as noise and lighting levels, can distract from measuring what participants know and can do.
H. Person factors: Test anxiety, fatigue, and other human factors can reduce the accuracy of measuring what people know and can do.
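To make the idea of an internal consistency coefficient concrete, here is a small sketch (not from the original post) of Cronbach’s alpha, one of the most common such coefficients, computed from a hypothetical matrix of 0/1 item scores:

```python
# Illustrative sketch: Cronbach's alpha, an internal consistency reliability
# coefficient, computed from per-participant item scores.
from statistics import pvariance

def cronbach_alpha(scores):
    """scores: one row per participant, one 0/1 (or scaled) score per item."""
    k = len(scores[0])                              # number of items
    items = list(zip(*scores))                      # transpose: one tuple per item
    item_var_sum = sum(pvariance(item) for item in items)
    total_var = pvariance([sum(row) for row in scores])
    return (k / (k - 1)) * (1 - item_var_sum / total_var)

# Hypothetical right/wrong scores for five participants on a four-item quiz
data = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 0, 1, 0],
    [0, 0, 1, 0],
    [0, 0, 0, 0],
]
print(round(cronbach_alpha(data), 2))  # 0.8
```

The factors listed above all act through this kind of statistic: more items, better-discriminating items, and a single coherent construct tend to push the coefficient up, while unrepresentative samples and distracting conditions pull it down.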


For more on this subject see the Questionmark White Paper, “Defensible Assessments: What You Need to Know”

Getting Ready for UK Breakfast Briefings


Posted By Sarah Elkins

Preparation for the Questionmark UK Breakfast Briefings in London, Manchester and Edinburgh is well underway now. I wanted to give a quick preview of what will be happening at them this year. We’ve traditionally used the briefings to demo recently released and upcoming features, and this year we’ve got some pretty exciting stuff to talk about!

We’re looking forward to talking about enhanced translation management capabilities, multi-lingual test delivery, and a new browser-based authoring tool for subject matter experts.  We’ll also be showing some new participant experience features and blended delivery options–both online and offline–including enhanced delivery to smart phones, PDAs, laptops and other devices. And we’ll discuss how assessments can integrate with other learning and HR technologies.

For anyone new to online assessment, we’ll be giving a quick overview of the Questionmark Perception assessment management system, as well as putting online assessment into context and demonstrating how assessments can be used in a range of settings. We’ve also heard that a lot of our customers and friends don’t always have the time and resources to develop an effective online assessment strategy, so this year we’ll also be giving an overview of the Questionmark services that can help accelerate time to deployment, automate assessment processes and integrate Questionmark products with your existing systems.

Anyone looking to attend should register on the UK Breakfast Briefing site.