Standard Setting: Compromise and Normative Methods

Austin Fossey

Posted by Austin Fossey

We have discussed the Angoff and Bookmark methods of standard setting, which are two commonly used methods, but there are many more. I would again refer the interested reader to Hambleton and Pitoniak’s chapter in Educational Measurement (4th ed.) for descriptions of other criterion-referenced methods.

Though criterion-referenced assessment is the typical standard-setting scenario, cut scores may also be determined for normative assessments. In these cases, the cut score is often not set to make an inference about the participant, but instead set to help make an operational decision.

A common example of a normative standard is when the pass rate is set based on information that is unrelated to participants’ performance. A company may decide to hire the ten highest-scoring candidates, not because the other candidates are not qualified, but because there are only ten open positions. Of course if the candidate pool is weak overall, even the ten highest performers may still turn out to be lousy employees.

We may also set normative standards based on risk tolerance. You may recall from our post about criterion validity that test developers may use a secondary measure that they expect to correlate with performance on the assessment. An employer may wish to set a cut score to minimize type I errors (false positives) because of the risk involved. For example, ability to fly a plane safely may correlate strongly with aviation test scores, but because of the risk involved if we let an unqualified person fly a plane, we may want to set the cut score high even though we will exclude some qualified pilots.

aviation 1

Normative Standard Setting with Secondary Criterion Measure

The opposite scenario may occur as well. If Type I errors have little risk, an employer may set the cut score low to make sure that all qualified candidates are identified. Unqualified candidates who happen to pass may be identified for additional training through subsequent assessments or workplace observation.

If we decided to use a normative approach to standard setting, we need to be sure that there is justification, and the cut score should not be used to classify individuals. A normative standard by its nature implies that not everyone will pass the assessment, regardless of their individual abilities, which is why it would be inappropriate for most cases in education or certification assessment.

Hambleton and Pitoniak also describe one final class of standard-setting methods called compromise methods. Compromise methods combine the judgment of the standard setters with information about the political realities of different pass rates. One example is the Hofstee Method, where stand setters define the highest acceptable cut score (1), the lowest acceptable cut score (2), highest acceptable fail rate (3), and the lowest acceptable fail rate (4). These are plotted against a curve of participants’ score data, and the intersection is used as a cut score.

 aviation 2Hofstee Method ExampleAdapted from Educational Measurement (Ed. Brennan, 2006)

Understanding Assessment Validity: Content Validity


Posted by Greg Pope

In my last post I discussed criterion validity and showed how an organization can go about doing a simple criterion-related validity study with little more than Excel and a smile. In this post I will talk about content validity, what it is and how one can undertake a content-related validity study.

Content validity deals with whether the assessment content and composition are appropriate, given what is being measured. For example, does the test content reflect the knowledge/skills required to do a job or demonstrate that one grasps the course content sufficiently? In the example I discussed in the last post regarding the sales course exam, one would want to ensure that the questions on the exam cover the course content area of focus appropriately, in appropriate ratios. For example, if 40% of the four-day sales course deals with product demo techniques then we would want about 40% of the questions on the exam to measure knowledge/skills in the area of demo skills.

I like to think of content validity in two slices. The first slice of the content validity pie is addressed when an assessment is first being developed: content validity should be one of the primary considerations in assembling the assessment. Developing a “test blueprint” that outlines the relative weightings of content covered in a course and how that maps onto the number of questions in an assessment is a great way to help ensure content validity from the start. Questions are of course classified when they are being authored as fitting into the specific topics and subtopics. Before an assessment is put into production to be administered to actual participants, an independent group of subject matter experts should review the assessment and compare the questions included on the assessment against a blueprint. An example of a test blueprint is provided below for the sales course exam, which has 20 questions in total.

validity 4

The second slice of content validity is addressed after an assessment has been created. There are a number of methods available in the academic literature outlining how to conduct a content validity study. One way, developed by Lawshe in the mid 1970s, is to get a panel of subject matter experts to rate each question on an assessment in terms of whether the knowledge or skills measured by each question is “essential,” “useful, but not essential,” or “not necessary” to the performance of what is being measured (i.e., the construct). The more SMEs who agree that items are essential, the higher the content validity. Lawshe also developed a funky formula called the “content validity ratio” (CVR) that can be calculated for each question. The average of the CVR across all questions on the assessment can be taken as a measure of the overall content validity of the assessment.

validity 5

You can use Questionmark Perception to easily conduct a CVR study by taking an image of each question on an assessment (e.g., sales course exam) and creating a survey question for each assessment question to be reviewed by the SME panel, similar to the example below.

validity 6You can then use the Questionmark Survey Report or other Questionmark reports to review and present the content validity results.

So how does “face validity” relate to content validity? Well, face validity is more about the subjective perception of what the assessment is trying to measure than about conducting validity studies. For example, if our sales people sat down after the four-day sales course to take the sales course exam and all the questions on the exam were asking about things that didn’t seem related to the information they just learned on the course (e.g., what kind of car they would like to drive or how far they can hit a golf ball), the sales people would not feel that the exam was very “face valid” as it doesn’t appear to measure what it is supposed to measure. Face validity, therefore, has to do with whether an assessment looks valid or feels valid to the participant. However, face validity is somewhat important:  if participants or instructors don’t buy in to the assessment being administered, they may not take it seriously,  they may complain about and appeal their results more often, and so on.

In my next post I will turn the dial up to 11 and discuss the ins and outs of construct validity.

Understanding Assessment Validity: An Introduction


Posted by Greg Pope

In previous posts I discussed some of the theory and applications of classical test theory and test score reliability. For my next series of posts, I’d like to explore the exciting realm of validity. I will discuss some of the traditional thinking in the area of validity as well as some new ideas, and I’ll share applied examples of how your organization could undertake validity studies.

According to the “standards bible” of educational and psychological testing, the Standards for Educational and Psychological Testing (AERA/NCME, 1999), validity is defined as “The degree to which evidence and theory support the interpretations of test scores entailed by proposed uses of tests.”

The traditional thinking around validity, familiar to most people, is that there are three main types:

validity 1

The most recent thinking on validity takes a more unifying approach which I will go into in more detail in upcoming posts.

Now here is something you may have heard before: “In order for an assessment to be valid it must be reliable.” What does this mean? Well, as we learned in previous Questionmark blog posts, test score reliability refers to how consistently an assessment measures the same thing. One of the criteria to make the statement, “Yes this assessment is valid,” is that the assessment must have acceptable test reliability, such as high Cronbach’s Alpha test reliability index values as found in the Questionmark Test Analysis Report and Results Management System (RMS). Other criteria for making the statement, “Yes this assessment is valid,” is to show evidence for criterion related validity, content related validity, and construct related validity.

In my next posts on this topic I will provide some illustrative examples of how organizations may undertake investigating each of these traditionally defined types of validity for their assessment program.