Research Design Validity: Applications in Assessment Management

Posted by Austin Fossey

I would like to wrap up our discussions about validity by talking briefly about the validity of research designs.

We have already discussed criterion, construct, and content validity, which are the stanchions of validity in an assessment. We have also talked about the newer, argument-based approach to validity and the more abstract concept of face validity.

While all of these concepts relate to the validity of the assessment instrument, we must also consider the validity of the research used in assessment management and the validity of the research that an assessment or survey supports.

In their 1963 book, Experimental and Quasi-Experimental Designs for Research, Donald Campbell and Julian Stanley describe two research design concepts: internal validity and external validity.

Internal validity is the idea that observed differences in a dependent variable (e.g., test scores) are directly related to an independent variable (e.g., a participant’s true ability). External validity refers to how generalizable our results are. For example, would we expect the same results with other samples of participants, other research conditions, or other operational conditions?

The item analysis report, which provides statistics about the difficulty and discrimination of an item, is an example of research that is used for assessment management. Assessment managers often use these statistics to decide whether an unscored field test item is fit to become a scored operational item on an assessment.
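
To make those two statistics concrete, here is a minimal sketch, assuming a classical test theory approach, of how item difficulty (proportion correct) and discrimination (corrected point-biserial correlation) are often computed. The item_statistics function and the tiny 0/1 response matrix are illustrative assumptions, not output from any particular item analysis report.

```python
import numpy as np

def item_statistics(responses):
    """Classical item statistics for a 0/1 scored response matrix.

    responses: 2-D array (participants x items), 1 = correct, 0 = incorrect.
    Returns (difficulty, discrimination): proportion correct per item, and
    the corrected point-biserial correlation between each item and the
    rest of the test.
    """
    responses = np.asarray(responses, dtype=float)
    total = responses.sum(axis=1)
    difficulty = responses.mean(axis=0)  # p-value per item
    discrimination = np.empty(responses.shape[1])
    for j in range(responses.shape[1]):
        rest_score = total - responses[:, j]  # exclude the item being evaluated
        discrimination[j] = np.corrcoef(responses[:, j], rest_score)[0, 1]
    return difficulty, discrimination

# Hypothetical data: 5 participants answering 3 field-test items
scores = [[1, 0, 1],
          [1, 1, 1],
          [0, 0, 1],
          [1, 0, 0],
          [1, 1, 1]]
p_values, point_biserials = item_statistics(scores)
```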

When we use the item analysis report to decide if the item is worth keeping, we are conducting research. The internal validity of the research may be threatened if something other than participant ability is affecting the item statistics.

For example, I recall a company that field tested two new test forms, and later found out that one participant had been trying to sabotage the statistics by persuading others to purposefully get a low score on the assessment. Fortunately, this person’s online campaign was ineffective, but it is a good example of an event that could have seriously disrupted the internal validity of the item analysis research.

When considering external validity, the most common threat is a non-representative sample. When field testing items for the first time, some assessment managers will find that volunteer participants are not representative of the general population of participants.

In some of my past experiences, I have had samples of field test volunteers who were either high-ability participants or people planning to teach a test prep workshop. We would not expect the item statistics from this sample to remain stable when the items go live in the general population.

So how can we control these threats? Try using separate groups of participants so you can compare results. Be consistent in how assessments are administered, and when items are not administered to all participants, make sure they are randomly assigned. Document your sample to demonstrate that it is representative of your participant population, and when possible, try to replicate your findings.
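
As one concrete way to handle the random-assignment point above, here is a small sketch that spreads unscored field-test items randomly across participants. The function name, parameters, and IDs are hypothetical and do not reflect any particular delivery platform's API.

```python
import random

def assign_field_test_items(participant_ids, field_test_items,
                            items_per_participant=2, seed=None):
    """Randomly assign a subset of unscored field-test items to each participant.

    Random assignment helps keep any one item from being seen only by an
    unrepresentative slice of the sample.
    """
    rng = random.Random(seed)
    return {
        pid: rng.sample(field_test_items, items_per_participant)
        for pid in participant_ids
    }

# Hypothetical usage: each participant sees 2 of the 4 field-test items
assignments = assign_field_test_items(
    participant_ids=["P001", "P002", "P003"],
    field_test_items=["FT-A", "FT-B", "FT-C", "FT-D"],
    items_per_participant=2,
    seed=42,
)
```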

Argument-Based Validity: Defending an Assessment’s Inferences and Consequences

Posted by Austin Fossey

We began discussing assessment validity in this blog a while back, and we have previously covered the core concepts of criterion, construct, and content validity.

Though those are staples, I think any discussion of validity would be lacking if we didn’t give a nod to Lyle F. Bachman’s article, Building and Supporting a Case for Test Use (2005).

This article discusses validity practices and the adaptation of Stephen Toulmin’s Model of Argumentation to assessments. Bachman explains how this model provides a system for linking assessment (or survey) scores, assessment inferences, and assessment consequences.

Bachman summarizes other authors’ ongoing discussions of argument-based validity, which in my opinion boil down to one core idea: assessment results need to be convincing. A test developer may need to be able to defend an assessment by providing a convincing argument for why the consequences of the test results are valid.

You may have been in a situation where you thought, “Wow, I just can’t believe that person passed that test!” Of course you would be too polite to say anything, but the doubt would still be there deep down in your heart. It would be nice if a friendly test developer would step in and explain to you, point by point, the evidence and reasoning for why it was okay to believe the results.

Bachman describes a simple process for how one might structure these validity arguments using Toulmin’s structure. From my experience, people seem to like the Toulmin approach because it’s easy to understand and easy to communicate to stakeholders. Toulmin’s structure includes the following elements:

    • Data
    • A warrant with backing evidence
    • A rebuttal with rebuttal evidence
    • A claim

[Figure: Toulmin’s model of argumentation]

With this model, you make a claim based on the data from the participant’s performance. You support that claim with a warrant, which has its own backing research and data (e.g., a validity study, a standard setting study). You then also have to refute any alternative explanations that might be used as a rebuttal (e.g., a bias review).
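
If it helps to keep these elements organized for each claim, the following is a minimal sketch of one way to record them. The ValidityArgument class and its example values are illustrative assumptions, not a schema defined by Bachman or Toulmin.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class ValidityArgument:
    """One Toulmin-style validity argument (illustrative record keeping)."""
    data: str                       # participant performance the claim rests on
    claim: str                      # inference drawn from that data
    warrant: str                    # why the data supports the claim
    backing: List[str] = field(default_factory=list)            # e.g., validity or standard setting studies
    rebuttal: str = ""              # alternative explanation to refute
    rebuttal_evidence: List[str] = field(default_factory=list)  # e.g., a bias review

argument = ValidityArgument(
    data="Participant's performance on the assessment",
    claim="Participant has mastered the assessed content",
    warrant="Scores reflect the measured construct",
    backing=["Validity study", "Standard setting study"],
    rebuttal="Observed differences reflect item bias rather than ability",
    rebuttal_evidence=["Bias review"],
)
```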

Bachman extends this line of thinking by suggesting that test developers should be able to build this argument structure both for the inferences drawn from the assessment and for the uses of the assessment. After all, there are plenty of valid assessments that get used in invalid ways. He defines four types of warrants we should consider when using the results to make a decision, which are paraphrased as follows:

    • Is the interpretation of the score relevant to the decision being made?
    • Is the interpretation of the score useful for the decision being made?
    • Are the intended consequences of the assessment beneficial for the stakeholders?
    • Does the assessment provide sufficient information for making the decision?

Even if you don’t follow through with a whole set of documents built around this process, these are good questions to ask about your assessment. Consider alternative arguments for why participants may be passing or failing, and be sure you can convincingly refute them in the event of a challenge.

Think critically about whether or not your assessment is measuring what it claims to measure, and then think about what backing evidence or resources could help you make that interpretation.