Understanding Assessment Validity: New Perspectives

Posted by Greg Pope

In my last post I discussed specific aspects of construct validity. I’m capping off this series with a discussion of modern views and thinking on validity.

Dr. Bruno D. Zumbo

Recently my former graduate supervisor, Dr. Bruno D. Zumbo at the University of British Columbia, wrote a fascinating chapter in the new book, The Concept of Validity: Revisions, New Directions and Applications, edited by Dr. Robert W. Lissitz. Bruno’s chapter, “Validity as Contextualized and Pragmatic Explanation, and its Implications for Validation Practice,” provides a great modern perspective on validity.

The chapter has two aims: to provide an overview of what Bruno considers to be the concept of validity, and to discuss the implications for the process of validation.

Something I really liked about the chapter was its focus on why we conduct psychometric analyses digging into how our assessments perform. As Bruno discusses, the real purpose of all the psychometric analysis we do is to support or provide evidence for the claims that we make about the validity of the assessment measures we gather. For example, the reason we would do a Differential Item Functioning (DIF) analysis, in which we check that test questions are not biased against or towards a certain group, is not only to protect test developers against lawsuits but also to weed out invalidity and so help us establish where the inferential limits of assessment results lie.

Bruno drives home the point that examining validity is an ongoing process. One doesn’t just do a validity study or two and call it done: multilevel construct validation continues over time, and validation procedures are tied in to program evaluation and assessment quality processes.

I would highly recommend that people interested in diving more into the theoretical and practical details of validity check out this book, which includes chapters from many highly respected psychometrics and testing industry experts.

I hope that this series on validity has been useful and interesting! Stay tuned for more psychometric tidbits in upcoming posts.

————–

Editor’s Note: Greg will be doing a presentation at the Questionmark Users Conference on Conducting Validity Studies within Your Organization. The conference will take place in Miami March 14 – 17. Learn more at www.questionmark.com/go/conference

Understanding Assessment Validity: Construct Validity

Posted by Greg Pope

In my last post I discussed content validity. In this post I will talk about construct validity. Construct validity refers to whether/how well an assessment, or topics within an assessment, measure the educational/psychological constructs that the assessment was designed to measure. For example, if the construct to be measured is “sales knowledge and skills,” then the assessment designed to measure this construct should show evidence of actually measuring this “sales knowledge and skills” construct.

It will come as no surprise that measuring psychological constructs is a complicated thing to do. Human psychological constructs such as “depression,” “extroversion” or “sales knowledge and skills” are not as straightforward to measure as more tangible physical “constructs” such as temperature, length, or distance. However, luckily there are approaches which allow us to determine how well our assessments accomplish the measurement of these complex psychological constructs.

Construct validity is composed of a few areas with convergent and discriminant validity being the core:

validity 7

In my next post I will drill down more into some of these areas of construct validity.

Understanding Assessment Validity: Content Validity

Posted by Greg Pope

In my last post I discussed criterion validity and showed how an organization can go about doing a simple criterion-related validity study with little more than Excel and a smile. In this post I will talk about content validity, what it is and how one can undertake a content-related validity study.

Content validity deals with whether the assessment content and composition are appropriate, given what is being measured. For example, does the test content reflect the knowledge/skills required to do a job or demonstrate that one grasps the course content sufficiently? In the example I discussed in the last post regarding the sales course exam, one would want to ensure that the questions on the exam cover the course content area of focus appropriately, in appropriate ratios. For example, if 40% of the four-day sales course deals with product demo techniques then we would want about 40% of the questions on the exam to measure knowledge/skills in the area of demo skills.

I like to think of content validity in two slices. The first slice of the content validity pie is addressed when an assessment is first being developed: content validity should be one of the primary considerations in assembling the assessment. Developing a “test blueprint” that outlines the relative weightings of content covered in a course and how that maps onto the number of questions in an assessment is a great way to help ensure content validity from the start. Questions are of course classified when they are being authored as fitting into the specific topics and subtopics. Before an assessment is put into production to be administered to actual participants, an independent group of subject matter experts should review the assessment and compare the questions included on the assessment against a blueprint. An example of a test blueprint is provided below for the sales course exam, which has 20 questions in total.

validity 4
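To make the blueprint idea concrete, here is a minimal sketch of turning content weightings into question counts for a 20-question exam. Only the 40% weighting for product demo techniques comes from the example above; the other topic names and weights are hypothetical placeholders, and the function name is my own.

```python
def allocate_questions(weights_pct, total_questions):
    """Split total_questions across topics in proportion to integer percentage weights.

    Uses the largest-remainder method so the counts always sum to total_questions.
    """
    if sum(weights_pct.values()) != 100:
        raise ValueError("topic weights must sum to 100 percent")
    # divmod gives (whole questions, remainder) for each topic's exact share.
    shares = {t: divmod(p * total_questions, 100) for t, p in weights_pct.items()}
    counts = {t: whole for t, (whole, _) in shares.items()}
    leftover = total_questions - sum(counts.values())
    # Hand any leftover questions to the topics with the largest remainders.
    for topic in sorted(shares, key=lambda t: shares[t][1], reverse=True)[:leftover]:
        counts[topic] += 1
    return counts


# Hypothetical blueprint: only the 40% demo weighting is taken from the post.
weights = {
    "Product demo techniques": 40,
    "Prospecting": 25,
    "Negotiation": 20,
    "Closing the sale": 15,
}

blueprint = allocate_questions(weights, total_questions=20)
for topic, n_questions in blueprint.items():
    print(f"{topic}: {n_questions} questions")
```

The SME review panel can then compare the number of questions actually classified under each topic against these target counts.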

The second slice of content validity is addressed after an assessment has been created. There are a number of methods available in the academic literature outlining how to conduct a content validity study. One way, developed by Lawshe in the mid-1970s, is to get a panel of subject matter experts (SMEs) to rate each question on an assessment in terms of whether the knowledge or skills measured by each question is “essential,” “useful, but not essential,” or “not necessary” to the performance of what is being measured (i.e., the construct). The more SMEs who agree that items are essential, the higher the content validity. Lawshe also developed a funky formula called the “content validity ratio” (CVR) that can be calculated for each question. The average of the CVR across all questions on the assessment can be taken as a measure of the overall content validity of the assessment.

validity 5
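Lawshe's ratio is usually written as CVR = (n_e - N/2) / (N/2), where n_e is the number of SMEs rating a question "essential" and N is the panel size. Here is a small sketch of that calculation. The panel ratings are made up for illustration, and the function and variable names are my own.

```python
def content_validity_ratio(n_essential, n_panelists):
    """Lawshe's CVR: ranges from -1 (nobody says essential) to +1 (everybody does)."""
    half = n_panelists / 2
    return (n_essential - half) / half


# Hypothetical ratings from a 10-person SME panel for three exam questions.
ratings = {
    "Q1": ["essential"] * 9 + ["useful, but not essential"] * 1,
    "Q2": ["essential"] * 5 + ["useful, but not essential"] * 5,
    "Q3": ["essential"] * 2 + ["not necessary"] * 8,
}

cvr_by_question = {
    question: content_validity_ratio(votes.count("essential"), len(votes))
    for question, votes in ratings.items()
}

# The average CVR across questions is one summary of overall content validity.
mean_cvr = sum(cvr_by_question.values()) / len(cvr_by_question)

for question, cvr in cvr_by_question.items():
    print(f"{question}: CVR = {cvr:+.2f}")
print(f"Mean CVR = {mean_cvr:+.2f}")
```

A CVR of 0 means exactly half the panel rated the question essential. Positive values mean more than half did.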

You can use Questionmark Perception to easily conduct a CVR study by taking an image of each question on an assessment (e.g., sales course exam) and creating a survey question for each assessment question to be reviewed by the SME panel, similar to the example below.

validity 6

You can then use the Questionmark Survey Report or other Questionmark reports to review and present the content validity results.

So how does “face validity” relate to content validity? Well, face validity is more about the subjective perception of what the assessment is trying to measure than about conducting validity studies. For example, if our sales people sat down after the four-day sales course to take the sales course exam and all the questions on the exam were asking about things that didn’t seem related to the information they had just learned on the course (e.g., what kind of car they would like to drive or how far they can hit a golf ball), the sales people would not feel that the exam was very “face valid,” as it doesn’t appear to measure what it is supposed to measure. Face validity, therefore, has to do with whether an assessment looks valid or feels valid to the participant. Even so, face validity does matter: if participants or instructors don’t buy in to the assessment being administered, they may not take it seriously, they may complain about and appeal their results more often, and so on.

In my next post I will turn the dial up to 11 and discuss the ins and outs of construct validity.

Understanding Assessment Validity: Criterion Validity

Posted by Greg Pope

In my last post I discussed three of the traditionally defined types of validity: criterion-related, content-related, and construct-related. Now I will talk about how your organization could undertake a study to investigate and demonstrate criterion-related validity.

So just to recap, criterion-related validity deals with whether assessment scores obtained for participants are predictive of something related to the goal of the assessment. For example, if a training program conducts a four-day sales training course, at the end of which an exam is administered designed to measure trainees’ knowledge and skills in the area of product sales, one may wonder whether the exam results have any relationship with actual sales performance. If the sales course exam scores are found to be related to/predict “real world” sales performance to a high degree, then we can say that there is a high degree of criterion-related validity between the intermediate variable (sales course exam scores) and the final or ultimate variable (sales performance).

So how does one find out whether high scores on the sales course exam correspond to high sales performance (and whether low scores on the sales course exam correspond to low sales performance)? Well, within an organization there may be some “feeling” about this, for example instructors seeing star students in the course bring in big sales numbers, but how do we get some hard numbers to back this up? You will be glad to hear that you don’t need a supercomputer and a room full of PhDs to figure this out! All you need to get some data on this are some good assessment results and some corresponding sales numbers for people who have gone through the course.

The first step is to gather the sales course exam scores for the participants who took the exam. In Questionmark Perception you can use the Export to ASCII or Export to Excel reports to output in a nice user-friendly format the assessment scores for the participants who took the sales course exam. Next you will want to match the participants for whom you have exam scores with their sales numbers (e.g., how much has each salesperson sold in the last 3 months). You may want to wait a few months after these participants have taken the exam and have been out in the field selling for a while, or you could look at historical sales data if you have it. Now you put this data together into an Excel spreadsheet (or SPSS or other analysis tool if you are savvy with those tools) to analyze in a way similar to this:

validity 2

Next you may want to produce a scatter plot with a trend line and calculate the correlation between sales course exam scores and sales dollars for the last three months:

validity 5 correct

We find the correlation is 0.901, which is a very high positive relationship (people with higher sales course exam scores bring in more sales dollars). This would suggest a high degree of criterion-related validity in that the sales course exam scores do indeed predict sales performance.
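If you would rather script this than use Excel, here is a rough sketch of the same correlation and trend line calculation. The ten score and sales pairs below are invented for illustration. They are not the data behind the 0.901 figure, so the numbers printed will differ from those in this post.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def fit_line(xs, ys):
    """Least-squares trend line; returns (slope, intercept) for y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Hypothetical data: exam score as a proportion, sales dollars over three months.
exam_scores = [0.45, 0.52, 0.58, 0.61, 0.66, 0.70, 0.74, 0.80, 0.85, 0.92]
sales_dollars = [6200, 8100, 7400, 9900, 10500, 12800, 11900, 13600, 15200, 16100]

r = pearson_r(exam_scores, sales_dollars)
slope, intercept = fit_line(exam_scores, sales_dollars)
print(f"correlation r = {r:.3f}")
print(f"trend line: y = {slope:.1f}x + {intercept:.1f}")
```

In Excel the equivalent built-in functions are CORREL, SLOPE and INTERCEPT.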

To go one step further, you can take the trend line equation that Excel displays on the scatter plot and use it to predict how much sales revenue new sales people taking the sales course exam might bring in: y = 21049x – 3366.2 (y = estimated sales performance in dollars, x = sales course exam score as a proportion). Suppose a new sales person (Rick Thomas) obtains a sales course exam score of 73%. Just plug this into the equation and y = 21049(0.73) – 3366.2 = $11,999.57. Voila! Based on his sales course exam score, Rick Thomas can be expected to bring in about $12,000 in revenue in the next three months. The more people analyzed (we only have 10 in this example), the greater the confidence one can have in the correlation coefficients obtained and the predictive equations garnered. In “real life” I would want as many points of data as possible: hundreds of salesperson data points or more.
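The prediction step is simple enough to wrap in a small function. This sketch uses the slope and intercept from the trend line equation above. The function name and the range check are my own additions.

```python
SLOPE = 21049.0       # dollars per unit of exam score, from the trend line
INTERCEPT = -3366.2   # dollars, from the trend line


def predict_sales(exam_score):
    """Estimated three-month sales in dollars for an exam score given as a proportion (0 to 1)."""
    if not 0.0 <= exam_score <= 1.0:
        raise ValueError("exam_score should be a proportion between 0 and 1")
    return SLOPE * exam_score + INTERCEPT


rick = predict_sales(0.73)
print(f"Predicted sales for a 73% exam score: ${rick:,.2f}")  # about $11,999.57
```

Note that plugging in a score of 0 gives a negative sales figure, which is a reminder that an equation like this is only trustworthy within the range of exam scores actually observed.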

I will focus on content validity in my next post, so stay tuned!

Understanding Assessment Validity: An Introduction

Posted by Greg Pope

In previous posts I discussed some of the theory and applications of classical test theory and test score reliability. For my next series of posts, I’d like to explore the exciting realm of validity. I will discuss some of the traditional thinking in the area of validity as well as some new ideas, and I’ll share applied examples of how your organization could undertake validity studies.

According to the “standards bible” of educational and psychological testing, the Standards for Educational and Psychological Testing (AERA/APA/NCME, 1999), validity is defined as “The degree to which evidence and theory support the interpretations of test scores entailed by proposed uses of tests.”

The traditional thinking around validity, familiar to most people, is that there are three main types:

validity 1

The most recent thinking on validity takes a more unifying approach which I will go into in more detail in upcoming posts.

Now here is something you may have heard before: “In order for an assessment to be valid it must be reliable.” What does this mean? Well, as we learned in previous Questionmark blog posts, test score reliability refers to how consistently an assessment measures the same thing. One of the criteria for making the statement, “Yes, this assessment is valid,” is that the assessment must have acceptable test reliability, such as high Cronbach’s Alpha test reliability index values as found in the Questionmark Test Analysis Report and Results Management System (RMS). The other criteria for making that statement are to show evidence of criterion-related validity, content-related validity, and construct-related validity.
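For readers curious about what sits behind a Cronbach's Alpha value, here is a minimal sketch using the standard formula: alpha = k/(k-1) × (1 − sum of item variances / variance of total scores), where k is the number of questions. The scored responses are hypothetical, and this is not a description of how the Questionmark reports compute the index internally.

```python
from statistics import variance


def cronbach_alpha(scored_responses):
    """Cronbach's alpha for a list of participants, each a list of item scores."""
    k = len(scored_responses[0])  # number of items
    item_variances = [
        variance([row[i] for row in scored_responses]) for i in range(k)
    ]
    total_variance = variance([sum(row) for row in scored_responses])
    return k / (k - 1) * (1 - sum(item_variances) / total_variance)


# Hypothetical data: 6 participants x 4 questions, scored 1 (correct) or 0 (incorrect).
responses = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 0],
    [0, 0, 1, 0],
    [0, 0, 0, 0],
]

alpha = cronbach_alpha(responses)
print(f"Cronbach's alpha = {alpha:.2f}")
```

With real assessments you would want far more participants than this before reading much into the value.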

In my next posts on this topic I will provide some illustrative examples of how organizations may undertake investigating each of these traditionally defined types of validity for their assessment program.

Item Analysis Analytics: The White Paper

Posted by Greg Pope

I had a great time putting together an eight-part series on Item Analysis Analytics for this blog and was pleased with the interest it received.

When a reader asked if it would be possible to present all the posts in a single document I thought hey, let’s present the content of these articles in the form of a Questionmark White Paper! So here it is for you to download with our compliments.

I hope the paper helps you in your efforts to create test questions that make the grade!

Peer Discussion – Hot Topics in Assessment

Posted by Sarah Elkins

During the recent Questionmark European Users Conference in Manchester, Stefanie Moerbeek, Senior Coordinator for Examination Development at EXIN, and Greg Pope, Questionmark’s Analytics and Psychometrics Manager, facilitated a best practice session that gave delegates the opportunity to participate in a peer discussion on Hot Topics in the Assessment Industry.

Stefanie and Greg join me in this podcast to share the outcomes of this session and provide an overview of topics such as beta testing questions, using randomization within examinations and dealing with intellectual property theft of exams.

Results Management System Quiz: Test your knowledge!

Posted by Greg Pope

Organizations involved in medium and high-stakes testing must employ sound test development, administration and scoring processes to help ensure fair, reliable and valid assessments.

Knowledge Check

But despite everyone’s best efforts, there are times when it’s necessary to review and potentially modify test results to provide information and certificates that fairly reflect what was being measured. That’s where the Questionmark RMS, or Results Management System, comes in: It enables organizations to analyze, edit and publish assessment results in an informed and defensible way.

I have created a quiz on RMS to test your knowledge. Take assessment one and see how well you do. All the answers for the questions are available on the Questionmark web site, so if you study hard you can get a perfect score and impress your friends and colleagues. Good luck!

Item Analysis Analytics Part 8: Some problematic questions

Posted by Greg Pope

In my last post, I showed a few more examples of item analyses where we drilled down into why some questions had problems. I thought it might be useful to show a few examples of questions that have bad and downright terrible psychometric performance to show the ugly side of item analysis.

Below is an example of a question that is fairly terrible in terms of psychometric performance. Here are some reasons why:

  • Going from left to right, first we see that the “Number of Results” is 65, which is not so good: there are too few participants in the sample to be able to make sound judgements about the psychometric performance of the question.
  • Next we see that 25 participants didn’t answer the question (“Number not Answered” = 25), which means there was a problem with people not finishing or finding the question confusing and giving up.
  • The “P Value Proportion Correct” shows us that this question is hard with 20% of participants ‘getting it right.’
  • The “Item Discrimination” indicates very low discrimination, with the difference between the Upper and Lower group in terms of the proportion selecting the correct answer of ‘More than 40’ at only 5%. This means that of the participants with high overall exam scores, 27% selected the correct answer versus 22% of the participants with the lowest overall exam scores. This is a very small difference between the Upper and Lower groups. Participants who know the material should have got the question right more often.
  • The “Item Total Correlation” reflects the Item Discrimination with a negative value of -0.01. A value like this would definitely not meet most organizations’ internal criteria in terms of what is considered an acceptable item. Negative item-total correlations are a major red flag!
  • Finally we look at the Outcome information to see how the distracters perform. We find that participants are all over the map, selecting distracters in an erratic way. When I look at the question wording I realize how vague and arbitrary this question is: the number of questions that should be in an assessment depends on numerous factors and contexts. It is impossible to say that in any context a certain number of questions is required. It looks like the Upper Group are selecting the ‘21-40’ and ‘More than 40’ response options more than the other two options, which have smaller numbers of questions. This makes sense from a participant guessing perspective, because in many assessment contexts having more questions rather than fewer is better for reliability.
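The three headline statistics in the list above can be sketched in a few lines of code. This is an illustration under stated assumptions, not the calculation the Questionmark Item Analysis Report performs: I am assuming the Upper and Lower groups are the top and bottom 27% of participants by total score (a common convention), and that the item-total correlation is a plain Pearson correlation between the 0/1 item score and the total score. The data and function names are hypothetical.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def item_statistics(item_correct, total_scores, group_fraction=0.27):
    """P value, upper-minus-lower discrimination, and item-total correlation for one question.

    item_correct: 1 if the participant answered this question correctly, else 0.
    total_scores: each participant's overall assessment score, in the same order.
    group_fraction: assumed share of participants in each of the Upper and Lower groups.
    """
    n = len(item_correct)
    p_value = sum(item_correct) / n
    # Rank participants by total score, then take the bottom and top slices.
    order = sorted(range(n), key=lambda i: total_scores[i])
    group_size = max(1, round(n * group_fraction))
    lower, upper = order[:group_size], order[-group_size:]
    p_lower = sum(item_correct[i] for i in lower) / group_size
    p_upper = sum(item_correct[i] for i in upper) / group_size
    return {
        "p_value": p_value,
        "discrimination": p_upper - p_lower,
        "item_total_correlation": pearson_r(item_correct, total_scores),
    }


# Hypothetical data: 10 participants, totals 1..10, only the higher scorers get the item right.
totals = list(range(1, 11))
good_item = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
bad_item = [1 - score for score in good_item]  # same item with the pattern reversed

good_stats = item_statistics(good_item, totals)
bad_stats = item_statistics(bad_item, totals)
print(good_stats)
print(bad_stats)
```

In these terms, the discrimination for the question discussed above is 0.27 - 0.22 = 0.05. A question where the Lower group outperforms the Upper group produces a negative value, as `bad_item` does here.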

The psychometricians, SMEs, and test developers reviewing this question would need to send the SME who wrote this question back to basic authoring training to ensure that they know how to write questions that are clear and concise. This question does not really have a correct answer and needs to be re-written to clarify the context and provide many more details to the participants. I would even be tempted to throw out questions along this content line, because how long an assessment should be has no one “right answer.” How long an assessment should be depends on so many things that there will always be room for ambiguity, so it would be quite challenging to write a question that performs well statistically on this topic.

part-8-pic-1

Below is an example of a question that is downright awful in terms of psychometric performance. Here are some reasons why:

  • Going from left to right, first we see that the “Number of Results” is 268, which is really good. That is a nice healthy sample. Nothing wrong here, let’s move on.
  • Next we see that 56 participants didn’t answer the question (“Number not Answered” = 56), which means there was a problem with people not finishing or finding the question confusing and giving up. It gets worse, much, much worse.
  • The “P Value Proportion Correct” shows us that this question is really hard, with 16% of participants ‘getting it right.’
  • The “Item Discrimination” indicates a negative discrimination, with the difference between the Upper and Lower group in terms of the proportion selecting the correct answer of ‘44123’ at -23%. This means that of the participants with high overall exam scores, 12% selected the correct answer versus 35% of the participants with the lowest overall exam scores. What the heck is going on? This means that participants with the highest overall assessment scores are selecting the correct answer LESS OFTEN than participants with the lowest overall assessment scores. That is not good at all; let’s dig deeper.
  • The “Item Total Correlation” reflects the Item Discrimination with a large negative value of -0.26. This is a clear indication that there is something incredibly wrong with this question.
  • Finally we look at the Outcome information to see how the distracters perform. This is where the true psychometric horror of this question is manifested. There is neither rhyme nor reason here: participants, regardless of their performance on the overall assessment, are all over the place in terms of selecting response options. You might as well have blindfolded everyone taking this question and had them randomly select their answers. This must have been extremely frustrating for the participants who had to take this question and would have likely led to many participants thinking that the organization administering this question did not know what they were doing.

The psychometricians, SMEs, and test developers reviewing this question would need to provide a pink slip to the SME who wrote this question immediately. Clearly the SME failed basic question authoring training. This question makes no sense and was written in such a way as to suggest that the author was under the influence, or otherwise not in a right state of mind, when crafting this question. What is this question testing? How can anyone possibly make sense of this and come up with a correct answer? Is there a correct answer? This question is not salvageable and should be stricken from the Perception repository without a second thought. A question like this should never have gotten in front of a participant to take, let alone 268 participants. The panel reviewing questions should review their processes to ensure that in the future questions like this are weeded out before an assessment goes out live for people to take.

part-8-pic-2
