Six tips to increase content validity in competence tests and exams

Posted by John Kleeman

Content validity is one of the most important criteria on which to judge a test, exam or quiz. This blog post explains what content validity is, why it matters and how to increase it when using competence tests and exams within regulatory compliance and other work settings.

What is content validity?

An assessment has content validity if the content of the assessment matches what is being measured, i.e., it reflects the knowledge/skills required to do a job or demonstrates that the participant has grasped the course content sufficiently.

Content validity is often measured by having a group of subject matter experts (SMEs) verify that the test measures what it is supposed to measure.

Why does content validity matter?

If an assessment doesn’t have content validity, then it isn’t actually testing what it sets out to test, or it misses important aspects of the job skills.

Would you want to fly in a plane where the pilot knows how to take off but not land? Obviously not! Assessments for airline pilots take into account all job functions, including landing in emergency scenarios.

Similarly, if you are testing your employees to ensure competence for regulatory compliance purposes, or before you let them sell your products, you need to ensure the tests have content validity – that is to say they cover the job skills required.

In addition to these common-sense reasons, if you use an assessment without content validity to make decisions about people, you could face a lawsuit. See this blog post, which describes a US lawsuit where a court ruled that because a policing test didn’t match the job skills, it couldn’t fairly be used for promotion purposes.

How can you increase content validity?

Here are some tips to get you started. For a deeper dive, Questionmark has several white papers that will help, and I also recommend Shrock & Coscarelli’s excellent book “Criterion-Referenced Test Development”.

  1. Conduct a job task analysis (JTA). A JTA is a survey which asks experts in the job role what tasks are important and how often they are done. A JTA gives you the information to define assessment topics in terms of what the job needs. Questionmark has a JTA question type which makes it easy to deliver and report on JTAs.
  2. Define the topics in the test before authoring. Use an item bank to store questions, and define the topics carefully before you start writing the questions. See “Know what your questions are about before you deliver the test” for more of the reasoning behind this.
  3. Poll subject matter experts to check the content validity of an existing test. If you need to check the content validity of an existing assessment, get a panel of SMEs (experts) to rate each question as “essential,” “useful, but not essential,” or “not necessary” to the performance of what is being measured. The more SMEs who agree that items are essential, the higher the content validity. See Understanding Assessment Validity: Content Validity for a way to do this within Questionmark software.
  4. Use item analysis reporting. Item analysis reports flag questions that don’t correlate well with the rest of the assessment (see the sketch after this list). Questionmark has an easy-to-understand item analysis report that will flag potential questions for review. One reason a question might get flagged is that participants who do well on other questions don’t do well on this question – this could indicate the question lacks content validity.
  5. Involve Subject Matter Experts (SMEs). It might sound obvious, but the more you involve SMEs in your assessment development, the more content validity you are likely to get. Use an assessment management system which is easy for busy SMEs to use, and involve SMEs in writing and reviewing questions.
  6. Review and update tests frequently. Skills required for jobs change quickly with changing technology and changing regulations. Many workplace tests that were valid two years ago are not valid today. Use an item bank with a search facility to manage your questions, and review and update or retire questions that are no longer relevant.
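As a rough illustration of the statistic behind tip 4, here is a minimal Python sketch of corrected item-total correlation flagging. This is not Questionmark’s implementation; the response matrix and the 0.2 threshold are invented for the example.

```python
# Minimal sketch: flag items whose corrected item-total correlation is low.
# The 0/1 response matrix and the threshold below are illustrative only.
import numpy as np

def flag_items(responses, threshold=0.2):
    """responses: participants x items matrix of 0/1 item scores."""
    responses = np.asarray(responses, dtype=float)
    flagged = []
    for i in range(responses.shape[1]):
        item = responses[:, i]
        rest = responses.sum(axis=1) - item   # total score excluding this item
        r = np.corrcoef(item, rest)[0, 1]     # corrected item-total correlation
        if np.isnan(r) or r < threshold:
            flagged.append((i, round(float(r), 2)))
    return flagged

# Six participants, four items; the last item runs against the others.
scores = [[1, 1, 1, 0],
          [1, 1, 1, 1],
          [1, 1, 0, 0],
          [0, 1, 0, 1],
          [0, 0, 0, 1],
          [0, 0, 0, 1]]
print(flag_items(scores))  # expect only the last item to be flagged
```

A question flagged this way is not automatically invalid, but it is a good candidate for SME review.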

I hope this blog post reminds you why content validity matters and gives helpful tips to improve the content validity of your tests. If you are using a Learning Management System to create and deliver assessments, you may struggle to obtain and demonstrate content validity. If you want to see how Questionmark software can help manage your assessments, request a personalized demo today.

 

Item Development – Organizing a content review committee (Part 2)

Posted by Austin Fossey

In my last post, I explained the function of a content review committee and the importance of having a systematic review process. Today I’ll provide some suggestions for how you can use the content review process to simultaneously collect content validity evidence without having to do a lot of extra work.

If you want to get some extra mileage out of your content review committee, why not tack on a content validity study? Instead of asking them if an item has been assigned to the correct area of the specifications, ask them to each write down how they would have classified the item’s content. You can then see if topics picked by your content review committee correspond with the topics that your item writers assigned to the items.

There are several ways to conduct content validity studies, and a content validity study might not be sufficient evidence to support the overall validity of the assessment results. A full review of validity concepts is outside the scope of this article, but one way to check whether items match their intended topics is to have your committee members rate how well they think an item matches each topic on the specifications. A score of 1 means they think the item matches, a score of -1 means they think it does not match, and a score of 0 means that they are not sure.

If each committee member provides their own ratings, you can calculate the index of congruence, which was proposed by Richard Rovinelli and Ron Hambleton. You can then create a table of these indices to see whether the committee’s classifications correspond to the content classifications given by your item writers.
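As a rough illustration, here is a minimal Python sketch of the Rovinelli-Hambleton index as it is commonly stated: for each topic, the index is (N / (2N - 2)) multiplied by the difference between the item’s mean rating on that topic and its mean rating over all topics, where N is the number of topics and ratings are +1, 0 or -1. The ratings below are invented, and this is only one way to organize the calculation.

```python
# Sketch of the Rovinelli-Hambleton index of item-objective congruence for one item.
# Judges rate the item against each topic: +1 (matches), 0 (unsure), -1 (does not match).
import numpy as np

def index_of_congruence(ratings):
    """ratings: judges x topics matrix of -1/0/+1 ratings for a single item."""
    ratings = np.asarray(ratings, dtype=float)
    n_topics = ratings.shape[1]
    topic_means = ratings.mean(axis=0)       # mean rating per topic
    overall_mean = ratings.mean()            # mean rating across all topics
    return (n_topics / (2 * n_topics - 2)) * (topic_means - overall_mean)

# Three judges rate one item against four topics (values are invented).
item_ratings = [[ 1, -1, -1,  0],
                [ 1, -1,  0, -1],
                [ 1, -1, -1, -1]]
print(index_of_congruence(item_ratings).round(2))  # high value on topic 1, low elsewhere
```

Values near +1 suggest the committee agrees the item measures that topic; values near -1 suggest it does not. Repeating this for each item produces the kind of table compared against the item writers’ classifications below.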

The chart below compares item writers’ topic assignments for two items with the index of congruence determined by a content committee’s ratings of the two items on an assessment with ten topics. We see that both groups agreed that Item 1 belonged to Topic 5 and Item 2 belonged to Topic 1. We also see that the content review committee was uncertain about whether Item 1 measured Topic 2, and that some of the committee members felt that Item 2 measured Topic 7.


Comparison of content review committee’s index of congruence and item writers’ classifications of two items on an assessment with ten topics.

 

Item Development – Organizing a content review committee (Part 1)

Posted by Austin Fossey

Once your items have passed through an initial round of edits, it is time for a content review committee to examine them. Remember that you should document the qualifications of your committee members and, if possible, recruit different people from those who wrote the items or conducted other reviews.

In their chapter in Educational Measurement (4th ed.), Cynthia Schmeiser and Catherine Welch explain that the primary function of the content review committee is to verify the accuracy of the items with regard to the defined domain, including the content and cognitive classification of items. The committee might answer questions like:

  • Given the information in the stem, is the item key the correct answer in all situations?
  • Is enough information provided in the item for candidates to choose an answer?
  • Given the information in the stem, are the distractors incorrect in all situations?
  • Would a participant with specialized knowledge interpret the item and the options differently from the general population of participants?
  • Is the item tagged to the correct area of the specifications (e.g., topic, subdomain)?
  • Does the item function at the intended cognitive level?

Other content review goals may be added depending on your specific testing purpose. For example, in their chapter in Educational Measurement (4th ed.), Brian Clauser, Melissa Margolis, and Susan Case observe that for certification and licensure exams, a content review committee might determine whether items are relevant to new practitioners—the intended audience for such assessments.

Schmeiser and Welch also recommend that the review process be systematic, meaning that the committee should apply a consistent level of scrutiny and the same decision criteria to each item they review. But how can you, as the test developer, keep things systematic?

One way is to use a checklist of the acceptance criteria for each item. By using a checklist, you can ensure that the committee reviews and signs off on each aspect of the item’s content. The checklist can also provide a standardized format for documenting problems that need to be addressed by the item writers. These checklists can be used to report the results of the content review, and they can be kept as supporting documentation for the Test Development and Revision requirements specified by the Standards for Educational and Psychological Testing.
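As a sketch of what such a standardized record might look like, here is a minimal example in Python. The item ID, check wording and fields are invented for illustration, not a prescribed format.

```python
# Minimal sketch of a per-item content review checklist record, showing one way
# to standardize sign-off and document problems. All field values are invented.
item_review = {
    "item_id": "ITEM-0042",
    "reviewer": "SME panel A",
    "checks": {
        "key is correct in all situations": True,
        "stem provides enough information": True,
        "distractors are incorrect in all situations": True,
        "tagged to correct topic/subdomain": False,
        "functions at intended cognitive level": True,
    },
    "comments": "Retag to the correct subdomain before the next review round.",
    "sign_off": False,  # only sign off once every check passes
}

# List the acceptance criteria that still need attention from the item writers.
print("Needs revision:", [c for c, ok in item_review["checks"].items() if not ok])
```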

In my next post, I’ll suggest some ways for you, as a test developer, to leverage your content review committee to gather content validity evidence for your assessment.

For best practice guidance and practical advice for the five key stages of test and exam development, check out our white paper: 5 Steps to Better Tests.

Research Design Validity: Applications in Assessment Management

Posted by Austin Fossey

I would like to wrap up our discussions about validity by talking briefly about the validity of research designs.

We have already discussed criterion, construct, and content validity, which are the stanchions of validity in an assessment. We have also talked about new proponents of argument-based validity and the more abstract concept of face validity.

While all of these concepts relate to the validity of the assessment instrument, we must also consider the validity of the research used in assessment management and the validity of the research that an assessment or survey supports.

In their 1963 book, Experimental and Quasi-Experimental Designs for Research, Donald Campbell and Julian Stanley describe two research design concepts: internal validity and external validity.

Internal validity is the idea that observed differences in a dependent variable (e.g., test score) are directly related to an independent variable (e.g., a participant’s true ability). External validity refers to how generalizable our results are. For example, would we expect the same results with other samples of participants, other research conditions, or other operational conditions?

The item analysis report, which provides statistics about the difficulty and discrimination of an item, is an example of research that is used for assessment management. Assessment managers often use these statistics to decide if an unscored field test item is fit to become a scored operational item on an assessment.

When we use the item analysis report to decide if the item is worth keeping, we are conducting research. The internal validity of the research may be threatened if something other than participant ability is affecting the item statistics.

For example, I recall a company that field tested two new test forms, and later found out that one participant had been trying to sabotage the statistics by persuading others to purposefully get a low score on the assessment. Fortunately, this person’s online campaign was ineffective, but it is a good example of an event that could have seriously disrupted the internal validity of the item analysis research.

When considering external validity, the most common threat is a non-representative sample. When field testing items for the first time, some assessment managers will find that volunteer participants are not representative of the general population of participants.

In some of my past experiences, I have had samples of field test volunteers who were either high-ability participants or people planning to teach a test prep workshop. We would not expect the item statistics from this sample to remain stable when the items go live in the general population.

So how can we control these threats? Try using separate groups of participants so you can compare results. Be consistent in how assessments are administered, and when items are not administered to all participants, make sure they are randomly assigned. Document your sample to demonstrate that it is representative of your participant population, and when possible, try to replicate your findings.
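As a small illustration of the random-assignment point, here is a minimal Python sketch of assigning unscored field-test items to participants at random. The participant IDs, item IDs and counts are invented; this is not a Questionmark feature, just the general idea.

```python
# Minimal sketch: randomly assign unscored field-test items to participants so
# that who happened to see which item does not bias the item statistics.
# All names and counts below are invented for illustration.
import random

def assign_field_test_items(participant_ids, field_test_items, items_per_person=2, seed=42):
    """Return a dict mapping each participant to a random subset of field-test items."""
    rng = random.Random(seed)  # fixed seed so the assignment can be reproduced and documented
    return {pid: rng.sample(field_test_items, items_per_person) for pid in participant_ids}

participants = ["p01", "p02", "p03", "p04"]
field_items = ["FT-101", "FT-102", "FT-103", "FT-104", "FT-105"]
for pid, items in assign_field_test_items(participants, field_items).items():
    print(pid, items)
```

Keeping the seed and the assignment record is also part of documenting your sample.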

Argument-Based Validity: Defending an Assessment’s Inferences and Consequences

Posted by Austin Fossey

We began discussing assessment validity in this blog a while back, and we have previously covered the core concepts of criterion, construct, and content validity.

Though those are staples, I think any discussion of validity would be lacking if we didn’t give a nod to Lyle F. Bachman’s article, Building and Supporting a Case for Test Use (2005).

This article discusses validity practices and the adaptation of Stephen Toulmin’s Model of Argumentation to assessments. Bachman explains how this model provides a system for linking assessment (or survey) scores, assessment inferences, and assessment consequences.

Bachman summarizes other authors’ ongoing discussions of argument-based validity, which in my opinion comes down to one core idea: assessment results need to be convincing. A test developer may need to be able to defend an assessment by providing a convincing argument for why the consequences of the test results are valid.

You may have been in a situation where you thought, “Wow, I just can’t believe that person passed that test!” Of course you would be too polite to say anything, but the doubt would still be there deep down in your heart. It would be nice if a friendly test developer would step in and explain to you, point by point, the evidence and reasoning for why it was okay to believe the results.

Bachman describes a simple process for how one might structure these validity arguments using Toulmin’s structure. From my experience, people seem to like the Toulmin approach because it’s easy to understand and easy to communicate to stakeholders. Toulmin’s structure includes the following elements:

    • Data
    • A warrant with backing evidence
    • A rebuttal with rebuttal evidence
    • A claim

[Diagram: Toulmin’s model of argumentation]

With this model, you make a claim based on the data from the participant’s performance. You support that claim with a warrant, which has its own backing research and data (e.g., a validity study, a standard setting study). You then also have to refute any alternative explanations that might be used as a rebuttal (e.g., a bias review).

Bachman extends this line of thinking by suggesting that test developers should be able to build this argument structure both for the inferences drawn from the assessment and for the uses of the assessment. After all, there are plenty of valid assessments that get used in invalid ways. He defines four types of warrants we should consider when using the results to make a decision, paraphrased as follows:

  • Is the interpretation of the score relevant to the decision being made?
  • Is the interpretation of the score useful for the decision being made?
  • Are the intended consequences of the assessment beneficial for the stakeholders?
  • Does the assessment provide sufficient information for making the decision?

Even if you don’t follow through with a whole set of documents built around this process, these are good questions to ask about your assessment. Consider alternative arguments for why participants may be passing or failing, and be sure you can convincingly refute them in the event of a challenge.

Think critically about whether or not your assessment is measuring what it claims to measure, and then think about what backing evidence or resources could help you make that interpretation.

Understanding Assessment Validity: Content Validity


Posted by Greg Pope

In my last post I discussed criterion validity and showed how an organization can go about doing a simple criterion-related validity study with little more than Excel and a smile. In this post I will talk about content validity, what it is and how one can undertake a content-related validity study.

Content validity deals with whether the assessment content and composition are appropriate, given what is being measured. For example, does the test content reflect the knowledge/skills required to do a job or demonstrate that one grasps the course content sufficiently? In the example I discussed in the last post regarding the sales course exam, we would want to ensure that the questions on the exam cover the course content areas in appropriate proportions. For example, if 40% of the four-day sales course deals with product demo techniques, then we would want about 40% of the questions on the exam to measure knowledge/skills in that area.

I like to think of content validity in two slices. The first slice of the content validity pie is addressed when an assessment is first being developed: content validity should be one of the primary considerations in assembling the assessment. Developing a “test blueprint” that outlines the relative weightings of content covered in a course, and how those map onto the number of questions in an assessment, is a great way to help ensure content validity from the start. Questions are, of course, classified into specific topics and subtopics as they are authored. Before an assessment is put into production to be administered to actual participants, an independent group of subject matter experts should review the assessment and compare the questions included on it against the blueprint. An example of a test blueprint is provided below for the sales course exam, which has 20 questions in total.

[Image: example test blueprint for the 20-question sales course exam]
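To make the arithmetic concrete, here is a minimal Python sketch of how blueprint weightings map onto question counts for a 20-question exam. The topic names and percentages are invented examples, not the actual blueprint.

```python
# Back-of-the-envelope sketch: turn blueprint weightings into question counts.
# Topic names and weightings below are invented examples.
blueprint = {
    "Product demo techniques": 0.40,
    "Prospecting": 0.25,
    "Handling objections": 0.20,
    "Closing": 0.15,
}
total_questions = 20

for topic, weight in blueprint.items():
    print(f"{topic}: {round(weight * total_questions)} questions")
# Prints 8, 5, 4 and 3 questions respectively, which sums to 20.
```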

The second slice of content validity is addressed after an assessment has been created. There are a number of methods in the academic literature for conducting a content validity study. One way, developed by Lawshe in the mid-1970s, is to get a panel of subject matter experts to rate each question on an assessment in terms of whether the knowledge or skills measured by that question are “essential,” “useful, but not essential,” or “not necessary” to the performance of what is being measured (i.e., the construct). The more SMEs who agree that items are essential, the higher the content validity. Lawshe also developed a funky formula called the “content validity ratio” (CVR) that can be calculated for each question. The average of the CVR across all questions on the assessment can be taken as a measure of the overall content validity of the assessment.

[Image: Lawshe’s content validity ratio (CVR) formula]
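For reference, the CVR is usually stated as CVR = (nₑ − N/2) / (N/2), where nₑ is the number of panelists rating a question “essential” and N is the total number of panelists. Here is a minimal Python sketch of the calculation; the panel ratings are invented for illustration.

```python
# Minimal sketch of Lawshe's content validity ratio (CVR):
# CVR = (n_e - N/2) / (N/2), where n_e is the number of panelists who rate the
# question "essential" and N is the total number of panelists.
# The panel ratings below are invented for illustration.

def content_validity_ratio(ratings):
    """ratings: list of SME ratings for a single question."""
    n = len(ratings)
    n_essential = sum(1 for r in ratings if r == "essential")
    return (n_essential - n / 2) / (n / 2)

panel_ratings = {
    "Q1": ["essential"] * 8 + ["useful, but not essential"] * 2,
    "Q2": ["essential"] * 5 + ["not necessary"] * 5,
}
cvrs = {q: content_validity_ratio(r) for q, r in panel_ratings.items()}
print(cvrs)                                         # e.g. {'Q1': 0.6, 'Q2': 0.0}
print("Mean CVR:", sum(cvrs.values()) / len(cvrs))  # rough overall content validity
```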

You can use Questionmark Perception to easily conduct a CVR study by taking an image of each question on an assessment (e.g., sales course exam) and creating a survey question for each assessment question to be reviewed by the SME panel, similar to the example below.

[Image: example survey question presented to the SME panel]

You can then use the Questionmark Survey Report or other Questionmark reports to review and present the content validity results.

So how does “face validity” relate to content validity? Well, face validity is more about the subjective perception of what the assessment is trying to measure than about conducting validity studies. For example, if our sales people sat down after the four-day sales course to take the sales course exam and all the questions asked about things that didn’t seem related to what they had just learned on the course (e.g., what kind of car they would like to drive or how far they can hit a golf ball), the sales people would not feel that the exam was very “face valid,” as it doesn’t appear to measure what it is supposed to measure. Face validity, therefore, has to do with whether an assessment looks valid or feels valid to the participant. That said, face validity is still somewhat important: if participants or instructors don’t buy in to the assessment being administered, they may not take it seriously, they may complain about and appeal their results more often, and so on.

In my next post I will turn the dial up to 11 and discuss the ins and outs of construct validity.