Understanding Assessment Validity: Criterion Validity


Posted by Greg Pope

In my last post I discussed three of the traditionally defined types of validity: criterion-related, content-related, and construct-related. Now I will talk about how your organization could undertake a study to investigate and demonstrate criterion-related validity.

So just to recap, criterion-related validity deals with whether assessment scores obtained for participants are predictive of something related to the goal of the assessment. For example, suppose a training program conducts a four-day sales training course, at the end of which an exam designed to measure trainees’ knowledge and skills in the area of product sales is administered; one may wonder whether the exam results have any relationship with actual sales performance. If the sales course exam scores are found to be related to, and predictive of, “real world” sales performance to a high degree, then we can say that there is a high degree of criterion-related validity between the intermediate variable (sales course exam scores) and the final or ultimate variable (sales performance).

So how does one find out whether high scores on the sales course exam correspond to high sales performance (and whether low scores on the sales course exam correspond to low sales performance)? Well, within an organization there may be some “feeling” about this, for example instructors seeing star students in the course bring in big sales numbers, but how do we get some hard numbers to back this up? You will be glad to hear that you don’t need a supercomputer and a room full of PhDs to figure this out! All you need to get some data on this are some good assessment results and some corresponding sales numbers for people who have gone through the course.

The first step is to gather the sales course exam scores for the participants who took the exam. In Questionmark Perception you can use the Export to ASCII or Export to Excel reports to output the assessment scores for the participants who took the sales course exam in a user-friendly format. Next you will want to match the participants for whom you have exam scores with their sales numbers (e.g., how much each salesperson has sold in the last 3 months). You may want to wait a few months after these participants have taken the exam and have been out in the field selling for a while, or you could look at historical sales data if you have it. Now you put this data together in an Excel spreadsheet (or SPSS or another analysis tool if you are savvy with those tools) to analyze in a way similar to this:

[Spreadsheet: each participant’s sales course exam score paired with his or her sales dollars for the last three months]

Next you may want to produce a scatter plot and compute the correlation and trend line between sales course exam scores and sales dollars for the last three months:

[Scatter plot: sales course exam scores vs. sales dollars for the last three months, with trend line]

We find the correlation is 0.901, which is a very strong positive relationship (people with higher sales course exam scores bring in more sales dollars). This suggests a high degree of criterion-related validity: the sales course exam scores do indeed predict sales performance.

To go one step further, you can take the trend-line equation that Excel produces on the scatter plot and use it to predict how much sales revenue new salespeople taking the sales course exam might bring in: y = 21049x – 3366.2 (y = estimated sales performance in dollars, x = sales course exam score). Suppose a new salesperson (Rick Thomas) obtains a sales course exam score of 73%. Just plug this into the equation: y = 21049(0.73) – 3366.2 = $11,999.57. Voila! Based on his sales course exam score, Rick Thomas can expect to bring in about $12,000 in revenue in the next three months. The more people analyzed (we only have 10 in this example), the greater the confidence one can have in the correlation coefficients and predictive equations obtained. In “real life” I would want as many data points as possible: hundreds of salesperson data points or more.

I will focus on content validity in my next post, so stay tuned!

Podcast: Dr. David Metcalf on New Trends in Assessment


Posted by Joan Phaup

Talking with Dr. David Metcalf, who will be the keynote speaker at the Questionmark 2010 Users Conference in Miami, is an eye-opening experience!

In his work at the University of Central Florida’s Institute for Simulation and Training, David is constantly engaged in the future of learning and assessment. As a research faculty member, he is responsible for inspiring innovation in the field of learning and performance. And as an independent researcher, analyst and consultant, he guides business transformations for learning organizations. A recent conversation with David gave me a taste of what’s in store for us at the conference in his address, Assessments on the Move: Mobility, Mashups and More.

The conference will be about the here-and-now as well as the future, with tech training sessions to bring people up to speed with the latest features and functions of Questionmark Perception. Case studies, best practice presentations and peer discussions will round out the program, along with drop-ins with the Questionmark technicians and focus groups led by our product managers.

Our earlybird registration deadline is looming on December 4th, so it’s a good time to get acquainted with the conference Web site and start making plans for a trip to Miami next March 14 – 17.

In the meantime, take a few minutes to ponder the future of assessment as you listen to this podcast:

Understanding Assessment Validity: An Introduction


Posted by Greg Pope

In previous posts I discussed some of the theory and applications of classical test theory and test score reliability. For my next series of posts, I’d like to explore the exciting realm of validity. I will discuss some of the traditional thinking in the area of validity as well as some new ideas, and I’ll share applied examples of how your organization could undertake validity studies.

According to the “standards bible” of educational and psychological testing, the Standards for Educational and Psychological Testing (AERA/NCME, 1999), validity is defined as “The degree to which evidence and theory support the interpretations of test scores entailed by proposed uses of tests.”

The traditional thinking around validity, familiar to most people, is that there are three main types:

criterion-related validity
content-related validity
construct-related validity

The most recent thinking on validity takes a more unifying approach which I will go into in more detail in upcoming posts.

Now here is something you may have heard before: “In order for an assessment to be valid it must be reliable.” What does this mean? Well, as we learned in previous Questionmark blog posts, test score reliability refers to how consistently an assessment measures the same thing. One of the criteria for making the statement, “Yes this assessment is valid,” is that the assessment must have acceptable test reliability, such as high Cronbach’s Alpha test reliability index values as found in the Questionmark Test Analysis Report and Results Management System (RMS). Other criteria for making the statement, “Yes this assessment is valid,” are to show evidence of criterion-related validity, content-related validity, and construct-related validity.
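For readers curious about what is behind the reported index, Cronbach’s Alpha can be computed directly from an item-by-participant score matrix: k/(k−1) × (1 − sum of item variances / variance of total scores). This is a small illustrative sketch with a made-up four-participant data set, not a substitute for the Test Analysis Report:

```python
import statistics

def cronbachs_alpha(item_scores):
    """Cronbach's Alpha: k/(k-1) * (1 - sum of item variances / variance of totals).

    item_scores: one row per participant, one column per item.
    """
    k = len(item_scores[0])                     # number of items
    totals = [sum(row) for row in item_scores]  # each participant's total score
    item_vars = [statistics.pvariance([row[i] for row in item_scores])
                 for i in range(k)]
    return k / (k - 1) * (1 - sum(item_vars) / statistics.pvariance(totals))

# Four participants answering three dichotomously scored (0/1) items
scores = [[1, 1, 1],
          [1, 1, 0],
          [1, 0, 0],
          [0, 0, 0]]
print(cronbachs_alpha(scores))  # 0.75
```

Values closer to 1 indicate the items are measuring the same thing consistently, which is a prerequisite for the validity claims discussed above.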

In my next posts on this topic I will provide some illustrative examples of how organizations may undertake investigating each of these traditionally defined types of validity for their assessment program.

Beyond Multiple Choice: Leveraging Technology for Better Assessments

Posted by Joan Phaup

Our first Beyond Multiple Choice: Nine Ways to Leverage Technology for Better Assessments Web seminar met with an enthusiastic response: it filled up almost instantly! So we are offering the same seminar again on Wednesday, December 16th, at 3 p.m. Eastern Time.

Assessments play a vital role in measuring people’s knowledge, skills and attitudes. They also help organizations improve performance, manage workforce competencies, and ensure regulatory compliance. How can you create and deliver assessments that produce appropriate, actionable results? What can you do to ensure the quality of questions and the security of assessments all the way from authoring and scheduling to administration, reporting, and analysis? How can you make the best use of online authoring, reporting, analytical, and security tools? These and many other questions will be addressed in this free, hour-long session, which will include opportunities for you to ask questions of your own.

We welcome you to learn more about this Webinar and register online.

Assessment Standards 101: IMS QTI XML

Posted by John Kleeman

This is the second of a series of blog posts on assessment standards. Today I’d like to focus on the IMS QTI (Question and Test Interoperability) Specification.

It’s worth mentioning the difference between Specifications and Standards: Specifications are documents that industry bodies have agreed on (like IMS QTI XML), while Standards have been published and committed to by a formal legal body (like AICC or HTML). A Specification is less formal than a Standard but still can be very useful for interoperability.

Questionmark was one of the originators of QTI. When we migrated our assessment platform from Windows to the Web in the 1990s, our customers had to migrate their questions from one platform to the other. As you will know, it takes a lot of time to write high quality questions, and so it’s important to be able to carry them forward independently of technology. We knew that we’d be improving our software over the years and we wanted to ensure the easy transfer of questions from one version to the next. So we came up with QML (Question Markup Language), an open and platform-independent method of maintaining questions that makes it easy for customers to move forward in the future.

Although QML did solve the problem of moving questions between Questionmark versions, we met many customers who had difficulty bringing content created in another vendor’s proprietary format into Questionmark. We wanted to help them, and we also wanted to embrace openness and allow Questionmark customers to export their questions in a standard format if they ever wanted to leave us. So we worked with other vendors under the umbrella of the IMS Global Learning Consortium to come up with QTI XML, a language that describes questions in a technology-neutral way. I was involved in the work defining IMS QTI, as were several of my colleagues: Paul Roberts did a lot of technical design, Eric Shepherd led the IMS working group that made QTI version 1, and Steve Lay (before joining Questionmark) led the version 2 project.

Here is a fragment of QTI XML and you can see that it is a just-about-human-readable way of describing a question.

<?xml version="1.0" standalone="no"?>
<!DOCTYPE questestinterop SYSTEM "ims_qtiasiv1p2.dtd">
<questestinterop>
<item title="USA" ident="3230731328031646">
<presentation>
<material> <mattext texttype="text/html"><![CDATA[<P>Washington DC is the capital of the USA</P>]]></mattext> </material>
<response_lid ident="1">
<render_choice shuffle="No">
<response_label ident="A">
<material> <mattext texttype="text/html"><![CDATA[True]]></mattext> </material>
</response_label>
<response_label ident="B">
<material> <mattext texttype="text/html"><![CDATA[False]]></mattext> </material>
</response_label>
</render_choice>
</response_lid>
</presentation>
<resprocessing>
<outcomes> <decvar/> </outcomes>
<respcondition title="0 True">
<conditionvar> <varequal respident="1">A</varequal> </conditionvar>
<setvar action="Set">1</setvar> <displayfeedback linkrefid="0 True"/>
</respcondition>
<respcondition title="1 False">
<conditionvar> <varequal respident="1">B</varequal> </conditionvar>
<setvar action="Set">0</setvar> <displayfeedback linkrefid="1 False"/>
</respcondition>
</resprocessing>
<itemfeedback ident="0 True" view="Candidate">
<material> <mattext texttype="text/html"><![CDATA[Correct]]></mattext> </material>
</itemfeedback>
<itemfeedback ident="1 False" view="Candidate">
<material> <mattext texttype="text/html"><![CDATA[Incorrect]]></mattext> </material>
</itemfeedback>
</item>
</questestinterop>

QTI XML has successfully established itself as a way of exchanging questions. For a long time, it was the most downloaded of all the IMS specifications, and many vendors support it. One problem with the language is that it allows description of a very wide variety of possible questions, not just those that are commonly used, and so it’s quite complex. Another problem is that (partly as it is a Specification, not a Standard) there’s ambiguity and disagreement on some of the finer points. In practice, you can exchange questions using QTI XML, especially multiple choice questions, but you often have to clean them up a bit to deal with different assumptions in different tools. At present, QTI version 1.2 is the reigning version, but IMS are working on an improved QTI version 2, and one day this will probably take over from version 1.
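To see how a tool might consume QTI XML in practice, here is an illustrative Python sketch that parses a simplified, well-formed version of the True/False item above with the standard library’s xml.etree.ElementTree. The element names follow QTI 1.2, but this is only a toy reader; a production importer would need to handle far more of the specification:

```python
import xml.etree.ElementTree as ET

# A pared-down QTI 1.2 item (question stem and two choices only)
qti = """<questestinterop>
<item title="USA" ident="3230731328031646">
<presentation>
<material><mattext texttype="text/html"><![CDATA[<P>Washington DC is the capital of the USA</P>]]></mattext></material>
<response_lid ident="1">
<render_choice shuffle="No">
<response_label ident="A"><material><mattext><![CDATA[True]]></mattext></material></response_label>
<response_label ident="B"><material><mattext><![CDATA[False]]></mattext></material></response_label>
</render_choice>
</response_lid>
</presentation>
</item>
</questestinterop>"""

root = ET.fromstring(qti)
item = root.find("item")
stem = item.find("presentation/material/mattext").text
choices = [(label.get("ident"), label.find("material/mattext").text)
           for label in item.iter("response_label")]

print(item.get("title"))  # USA
print(stem)               # <P>Washington DC is the capital of the USA</P>
print(choices)            # [('A', 'True'), ('B', 'False')]
```

Because the format is technology-neutral, any tool that can read XML can extract the stem and choices this way; the hard part in real interchange is agreeing on the semantics of the less common question types.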

Tips for preventing cheating and ensuring assessment security: Part 3

julie-smallPosted by Julie Chazyn

My previous post offered four tips on making your assessments more secure and preventing cheating. Aside from verifying IP addresses and using “Trojan horse” or stealth items to help detect whether a participant has memorized the answer key, there are some physical actions you can take to avoid the problem and reduce the temptation to cheat.

Proper seating arrangements for participants

Seating participants with adequate space between them and giving them limited ability to see another participant’s screen or paper are important strategies for enhancing test security. The proctor should be aware of cheating techniques such as the “flying V” seating arrangement, where the “giver” at the point of the V feeds information to a number of “receivers” behind them. The givers and receivers can communicate in a number of ways, using sign language, dropping notes on the floor, etc. (Dr. Gregory Cizek’s book “Cheating on Tests: How To Do It, Detect It, and Prevent It” will tell you more about this and other aspects of cheating.)

Example of the “flying V” answer copying formation (Cizek, 1999):

Using unique make-up exams

Many organizations offer make-up exams for participants who were sick or had legitimate excuses for not being able to take an assessment at the scheduled date and time. If you use the same exam form for the make-up exam that was administered at the scheduled date and time, you open yourself to the risk of the exam form being compromised. Sometimes make-up exams are not administered in the same strictly proctored environment as the scheduled exam, allowing participants the opportunity to cheat or steal content.

Using more constructed response questions

Constructed response questions, like essay or short answer questions, provide less opportunity for cheating because they require participants to produce unique answers. There is no answer key to steal, and participants who copy other people’s constructed response answers are easily identified via a side-by-side comparison of answers.
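One simple way to automate that side-by-side comparison is to score the textual similarity of each pair of answers and flag the pairs worth a manual look. This illustrative sketch uses Python’s difflib with made-up responses; real screening would need more sophisticated text matching:

```python
import difflib
from itertools import combinations

def answer_similarity(a, b):
    """Return a 0.0-1.0 word-overlap ratio between two free-text answers."""
    return difflib.SequenceMatcher(None, a.lower().split(), b.lower().split()).ratio()

# Hypothetical short-answer responses from three participants
answers = {
    "p1": "Raise prices only after checking what competitors charge",
    "p2": "raise prices only after checking what competitors charge",
    "p3": "Discount bulk orders to win repeat business",
}

# Flag suspiciously similar pairs for a manual side-by-side review
for (id1, a), (id2, b) in combinations(answers.items(), 2):
    if answer_similarity(a, b) > 0.8:
        print(id1, id2, "need review")  # prints: p1 p2 need review
```

A flagged pair is not proof of copying, only a prompt for a human grader to look at the two answers together.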

I hope you enjoyed this three-part series on preventing cheating. You will find more information about various means for deploying many different types of assessments in our white paper, “Delivering Assessments Safely and Securely.”