Trustworthy Assessment Results – A Question of Transparency

Posted by Austin Fossey

Do you trust the results of your test? As with many questions in psychometrics, the answer is that it depends. Like trust between two people, trust in assessment results has to be earned by the testing body.

Many of us want to implicitly trust the testing body, be it a certification organization, a department of education, or our HR department. When I fill a car with gas, I don’t want to have to siphon the gas out to make sure the amount of gas matches the volume on the pump—I just assume it’s accurate. We put the same faith in our testing bodies.

Just as gas pumps are certified and periodically calibrated, many high-stakes assessment programs are also reviewed. In the U.S., state testing programs are reviewed by the U.S. Department of Education, peer review groups, and technical advisory boards. Certification and licensure programs are sometimes reviewed by third-party accreditation programs, though these accreditations usually only look to see that certain requirements are met without evaluating how well they were executed.

In her op-ed, Can We Trust Assessment Results?, Eva Baker argues that the trustworthiness of assessment results is dependent on the transparency of the testing program. I agree with her. Participants should be able to easily get information on the purpose of the assessment, the content that is covered, and how the assessment was developed. Baker also adds that appropriate validity studies should be conducted and shared. I was especially pleased to see Baker propose that “good transparency occurs when test content can be clearly summarized without giving away the specific questions.”

For test results to be trustworthy, transparency also needs to extend beyond the development of the assessment to include its maintenance. Participants and other stakeholders should have confidence that the testing body is monitoring its assessments, and that a plan is in place should their results become compromised.

In their article, Cheating: Its Implications for ABFM Examinees, Kenneth Royal and James Puffer discuss cases where widespread cheating affects the statistics of the assessment, which in turn mislead test developers by making items appear easier. The effect can be an assessment that yields invalid results. Though specific security measures should be kept confidential, testing bodies should have a public-facing security plan that explains their policies for addressing improprieties. This plan should address policies for participants as well as how the testing body will handle test design decisions that have been impacted by compromised results.
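
To see why compromised results mislead test developers, consider classical item difficulty (the p-value), which is simply the proportion of participants who answer an item correctly. The sketch below (Python, with made-up response data) shows how a group with pre-knowledge of an item inflates the p-value, making the item appear easier than it actually is for honest participants.

```python
# Hypothetical illustration: classical item difficulty (p-value) is the
# proportion of participants answering an item correctly. If a block of
# participants has pre-knowledge of the item, the p-value is inflated and
# the item looks easier than it really is.
honest_responses   = [1, 0, 1, 0, 0, 1, 0, 1, 0, 0]   # 1 = correct
cheating_responses = [1, 1, 1, 1, 1]                   # pre-knowledge group

def p_value(responses):
    """Proportion correct for one item across a group of participants."""
    return sum(responses) / len(responses)

print(p_value(honest_responses))                        # 0.40
print(p_value(honest_responses + cheating_responses))   # 0.60 -- item appears easier
```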

Even under ideal circumstances, mistakes can happen. Readers may recall that, in 2006, thousands of students received incorrect scores on the SAT, arguably one of the best-developed and most carefully scrutinized assessments in U.S. education. The College Board (the testing body that runs the SAT) handled the situation as well as it could, publicly sharing the impact of the issue, the reasons it happened, and its policies for how it would handle the incorrect results. Others may feel differently, but I trust SAT scores more now that I have observed how the College Board communicated about and rectified the mistake.

Most testing programs are well-run, professional operations backed by qualified teams of test developers, but there are occasional junk testing programs, such as predatory certificate programs, that yield useless, untrustworthy results. It can be difficult to tell the difference, but like Eva Baker, I believe that organizational transparency is the right way for a testing body to earn the trust of its stakeholders.

Assessment Report Design: Reporting Multiple Chunks of Information

Posted by Austin Fossey

We have discussed aspects of report design in previous posts, but I was recently asked whether an assessment report should report just one thing or multiple pieces of information. My response is that it depends on the intended use of the assessment results, but in general, I find that a reporting tool is more useful for a stakeholder if it can report multiple things at once.

This is not to say that more data are always better. A report that is cluttered or that has too much information will be difficult to interpret, and users may not be able to fish out the data they need from the display. Many researchers recommend keeping simple, clean layouts for reports while efficiently displaying relevant information to the user (e.g., Goodman & Hambleton, 2004; Wainer, 1984).

But what information is relevant? Again, it will depend on the user and the use case for the assessment, but consider the types of data we have for an assessment. We have information about the participants, information about the administration, information about the content, and information about performance (e.g., scores). These data dimensions can each provide different paths of inquiry for someone making inferences about the assessment results.

There are times when we may only care about one facet of this datascape, but these data provide context for each other, and understanding that context provides a richer interpretation.

Hattie (2009) recommended that a report should have a major theme; that theme should be emphasized with five to nine “chunks” of information. He also recommended that the user have control of the report to be able to explore the data as desired.

Consider the Questionmark Analytics Score List Report: Assessment Results View. The major theme for the report is to communicate the scores of multiple participants. The report arguably contains five primary chunks of information: aggregate scores for groups of participants, aggregate score bands for groups of participants, scores for individual participants, score bands for individual participants, and information about the administration of the assessment to individual participants.

Through design elements and onscreen tools that give the user the ability to explore the data, this report with five chunks of information can provide context for each participant’s score. The user can sort participants to find the high- and low-performing participants, compare a participant to the entire sample of participants, or compare the participant to their group’s performance. The user can also compare the performance of groups of participants to see if certain groups are performing better than others.
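
As a rough illustration of the kind of aggregation behind such a report (this is not Questionmark’s implementation; the data, groups, and score bands below are invented), the sketch computes individual and group-level scores and bands from a small set of hypothetical results:

```python
from statistics import mean
from collections import defaultdict

# Hypothetical participant records: (name, group, score). The report's five
# "chunks" roughly correspond to individual scores, individual score bands,
# group mean scores, group score bands, and administration details.
results = [
    ("Ana",   "Sales",   88),
    ("Ben",   "Sales",   71),
    ("Chloe", "Support", 64),
    ("Dev",   "Support", 92),
]

def score_band(score):
    """Assumed score bands, for illustration only."""
    return "Pass" if score >= 70 else "Fail"

by_group = defaultdict(list)
for name, group, score in results:
    print(f"{name:6} {group:8} score={score:3} band={score_band(score)}")
    by_group[group].append(score)

for group, scores in by_group.items():
    avg = mean(scores)
    print(f"{group:8} mean={avg:.1f} band={score_band(avg)}")
```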


Assessment Results View in the Questionmark Analytics Score List Report

Online reporting also makes it easy to let users navigate between related reports, thus expanding the power of the reporting system. In the Score List Report, the user can quickly jump from Assessment Results to Topic Results or Item Results to make comparisons at different levels of the content. Similar functionality exists in the Questionmark Analytics Item Analysis Report, which allows the user to navigate directly from a Summary View comparing item statistics for different items to an Item Detail view that provides a more granular look at item performance through interpretive text and an option analysis table.

Analyzing multiple groups with the JTA Demographic Report

Posted by Austin Fossey

In my previous post, I talked about how the Job Task Analysis (JTA) Summary Report can be used by subject matter experts (SMEs) to inform their decisions about what content to include in an assessment.

In many JTA studies, we might survey multiple populations of stakeholders who may have different opinions about what content should be on the assessment. The populations we select will be guided by theory or previous research. For example, for a certification assessment, we might survey the practitioners who will be candidates for certification, their managers, and their clients—because our subject matter experts theorize that each of these populations will have different yet relevant opinions about what a competent candidate must know and be able to do in order to be certified.

Instead of requiring you to create a separate JTA survey instrument for each population in the study, Questionmark Analytics allows you to analyze the responses from different groups of survey participants using the JTA Demographic Report.

This report provides demographic comparisons of aggregated JTA responses for each of the populations in the study. Users can simply add a demographic question to their survey so that this information can be used by the JTA Demographic Report. In our earlier example, we might ask survey participants to identify themselves as a practitioner, manager, or client, and then this data would be used to compare results in the report.

As with the JTA Summary Report, there are no requirements for how SMEs must use these data. The interpretations will either be framed by the test developer using theory or prior research, or left entirely to the SMEs’ expert judgment.

SMEs might wish to investigate topics where populations differed in their ratings, or they may wish to select only those topics where there was universal agreement. They may wish to prioritize or weight certain populations’ opinions, especially if a population is less knowledgeable about the content than others.

The JTA Demographic Report provides a frequency distribution table for each task on the survey, organized by dimension. A chart gives a visual indicator to show differences in response distributions between groups.
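
As a simple illustration of what such a tabulation involves (the groups, ratings, and responses below are hypothetical), this sketch computes the response distribution for one task by demographic group:

```python
from collections import Counter

# Hypothetical JTA survey responses: each entry is (demographic group,
# importance rating) for a single task. The report tabulates how often each
# rating was chosen within each group so SMEs can compare distributions.
task_responses = [
    ("nurse",  "High"), ("nurse",  "High"), ("nurse",  "Medium"),
    ("doctor", "High"), ("doctor", "Low"),  ("doctor", "Medium"),
]

groups = sorted({group for group, _ in task_responses})
ratings = ["Low", "Medium", "High"]

for group in groups:
    counts = Counter(rating for g, rating in task_responses if g == group)
    total = sum(counts.values())
    distribution = {r: counts.get(r, 0) / total for r in ratings}
    print(group, {r: f"{p:.0%}" for r, p in distribution.items()})
```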


Response distribution table and chart comparing JTA responses from nurses and doctors using the Questionmark JTA Demographic Report.

Discussing data mining at NCME

Posted by Austin Fossey

We will wrap up our discussion of themes at the National Council on Measurement in Education (NCME) annual meeting with an overview of the inescapable discussion about working with complex, and often messy, data sets.

It was clear from many of the presentations and poster sessions that technology is driving the direction of assessment, for better or for worse (or as Damian Betebenner put it, “technology eats statistics”). Advances in technology have allowed researchers to examine new statistical models for scoring participants, identify aberrant responses, score performance tasks, identify sources of construct-irrelevant variance, diversify item formats, and improve reporting methods.

As the symbiotic knot between technology and assessment grows tighter, many researchers and test developers are in the unexpected position of having too much data. This is especially true in complex assessment environments that yield log files with staggering amounts of information about a participant’s actions within an assessment.

Log files can track many types of data in an assessment, such as responses, click streams, and system states. All of these data are time stamped, and if they capture the right data, they can illuminate some of the cognitive processes that are manifesting themselves through the participant’s interaction with the assessment. Raw assessment data like Questionmark’s Results API OData Feeds can also be coupled with institutional data, thus exponentially growing the types of research questions we can pursue within a single organization.
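
As a minimal sketch of what that coupling might look like (the field names and records below are invented and do not reflect the actual OData schema), the example joins assessment results with an institutional roster so that scores can be examined alongside organizational attributes:

```python
# Hypothetical sketch: join time-stamped assessment results with an
# institutional roster to ask richer questions (e.g., scores by department).
# Field names and values are illustrative only.
assessment_results = [
    {"participant_id": "p1", "score": 82, "submitted": "2014-04-10T09:15:00Z"},
    {"participant_id": "p2", "score": 67, "submitted": "2014-04-10T09:47:00Z"},
]
institutional_data = {
    "p1": {"department": "Nursing",  "years_experience": 4},
    "p2": {"department": "Pharmacy", "years_experience": 1},
}

# Merge each result with the matching institutional record (if any).
joined = [
    {**result, **institutional_data.get(result["participant_id"], {})}
    for result in assessment_results
]
for row in joined:
    print(row["participant_id"], row["department"], row["score"])
```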

NCME attendees learned about hardware and software that captures both response variables and behavioral variables from participants as they complete an online learning task.

Several presenters discussed issues and strategies for addressing less-structured data, with many papers tackling log file data gathered as participants interact with an online assessment or other online task. Ryan Baker (International Educational Data Mining Society) gave a talk about combining the data mining of log files with field observations to identify hard-to-capture domains, like student engagement.

Baker focused on the positive aspects of having oceans of data, choosing to remain optimistic about what we can do rather than dwell on the difficulties of iterative model building in these types of research projects. He shared examples of intelligent tutoring systems designed to teach students while also gathering data about the student’s level of engagement with the lesson. These examples were peppered with entertaining videos of the researchers in classrooms playing with their phones so that individual students would not realize that they were being subtly observed by the researcher via sidelong glances.

Evidence-centered design (ECD) emerged as a consistent theme: there was a lot of conversation about how researchers are designing assessments so that they yield fruitful data for intended inferences. Nearly every presentation about assessment development referenced ECD. Valerie Shute (Florida State University) observed that five years ago, only a fraction of participants would have known about ECD, but today it is widely used by practitioners.

Discussing validity at NCME

Posted by Austin Fossey

To continue last week’s discussion about big themes at the recent NCME annual meeting, I wanted to give an update on conversations about validity.

Validity is a core concept in good assessment development, which we have frequently discussed on this blog. Even though this is such a fundamental concept, our industry is still passionately debating what constitutes validity and how the term should be used.

NCME hosted a wonderful coordinated session during which some of the big names in validity theory presented their thoughts on how the term validity should be used today.

Michael Kane (Educational Testing Service) began the discussion with his well-established views around argument-based validity. In this framework, the test developer must make a validity claim about an inference (or interpretation as Kane puts it) and then support that claim with evidence. Kane argues that validity is not a property of the test or scores, but it is instead a property of the inferences we make about those scores.

If you have read some of my previous posts on validity or attended my presentations about psychometric principles, you already know that I am a strong proponent of Kane’s view that validity refers to the interpretations and use cases—not to the instrument.

But not everyone agrees. Keith Markus (City University of New York) suggested that nitpicking about whether the test or the inference is the object of validity causes us to miss the point. The test and the inference work only as a combination, so validity (as a term and as a research goal) should be applied to these as a pair.

Pamela Moss (University of Michigan) argued that we need to shift the focus of validity study away from intended inferences and use cases to the actual use cases. Moss believes that the actual use cases of assessment results can be quite varied and nuanced, and that these real-world impacts are what we are really interested in. She proposed that we work to validate what she called “conceptual uses.” For example, if we want to use education assessments to improve learning, then we need to research why students earn low scores.

Greg Cizek (University of North Carolina) disagreed with Kane’s approach, saying that the evidence we gather to support an inference says nothing about the use cases, and vice versa. Cizek argued that we make two inferential leaps: one from the score to the inference, and one from the inference to the use case. So we should gather evidence that supports both inferential leaps.

Though I see Cizek’s point, I feel that it would not drastically change how I would approach a validity study in practice. After all, you cannot have a use case without making an inference, so I would likely just tackle the inferences and their associated use cases jointly.

Steve Sireci (University of Massachusetts) felt similarly. Sireci is one of my favorite presenters on the topic of validity, plus he gets extra points for matching his red tie and shirt to the color theme on his slides. Sireci posed this question: can we have an inference without having a use case? If so, then we have a “useless” test, and while there may be useless tests out there, we usually only care about the tests that get used. As a result, Sireci suggested that we must validate the inference, but that this validation must also demonstrate that the inference is appropriate for the intended uses.

Giving meaning to assessment scores

Posted by Austin Fossey

As discussed in previous posts, validity refers to the appropriateness of the inferences drawn from assessment results and of the uses to which those results are put. Assessment results are often in the form of assessment scores, and the inferences we can validly make may depend heavily on how we format, label, report, and distribute those scores.

At the core of most assessment results are raw scores. Raw scores are simply the number of points earned by participants based on their responses to items in an assessment. Raw scores are convenient because they are easy to calculate and easy to communicate to participants and stakeholders. However, their interpretation may be constrained.

In their chapter in Educational Measurement (4th ed.), Cohen and Wollack explain that “raw scores have little clear meaning beyond the particular set of questions and the specific test administration.” This is often fine when our inferences are intended to be limited to a specific assessment administration, but what about further inferences?

Petersen, Kolen, and Hoover stated in their chapter in Educational Measurement (3rd ed.) that “the main purpose of scaling is to aid users in interpreting test results.” So when other inferences need to be made about the participants’ results, it is common to transform participants’ scored responses into a more meaningful measure.

When raw scores do not support the desired inference, we may need to create a scale score. In his chapter in Educational Measurement (4th ed.), Kolen explains that “scaling is the process of associating numbers or other ordered indicators with the performance of examinees.” Scaling examples include percentage scores to be used for topic comparisons within an assessment, equating scores so that scores from multiple forms can be used interchangeably, or scaling IRT theta values so that all reported scores are positive values. SAT scores are examples of the latter two cases. There are many scaling procedures, and a full discussion is not possible here. (If you’d like to know more about this, I’d suggest reading Kolen’s chapter, referenced above.)
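
As a toy example of one common scaling approach (a simple linear transformation with invented numbers; real scaling and equating procedures are considerably more involved), the sketch below maps raw scores onto an SAT-style 200–800 reporting range:

```python
# Minimal sketch of a linear raw-to-scale transformation (illustrative
# numbers only). Real scaling and equating procedures are more involved;
# see Kolen's chapter for details.
RAW_MAX = 60                       # hypothetical number of raw score points
SCALE_MIN, SCALE_MAX = 200, 800    # SAT-style reporting range (illustrative)

def to_scale_score(raw_score):
    """Map a raw score onto the reporting scale with a linear transformation."""
    proportion = raw_score / RAW_MAX
    return round(SCALE_MIN + proportion * (SCALE_MAX - SCALE_MIN))

print(to_scale_score(45))   # 45/60 raw points -> 650 on the reporting scale
```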

Cohen and Wollack also describe two types of derived scores: developmental scores and within-group scores. These derived scores are designed to support specific types of inferences. Developmental scores show a student’s progress in relation to defined developmental milestones, such as grade equivalency scores used in education assessments. Within-group scores demonstrate a participant’s normative performance relative to a sample of participants. Within-group scores include standardized z scores, percentiles, and stanines.
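
To make these within-group scores concrete, here is a small sketch (with an invented norm group) that computes a z score, a percentile rank, and a stanine for a single observed score:

```python
from statistics import mean, pstdev
from bisect import bisect_right

# Hypothetical norm-group scores used to derive within-group scores.
norm_scores = [48, 52, 55, 60, 61, 63, 67, 70, 74, 80]

def z_score(score):
    """Standardized score relative to the norm group."""
    return (score - mean(norm_scores)) / pstdev(norm_scores)

def percentile_rank(score):
    """Percentage of the norm group scoring at or below this score."""
    return 100 * sum(s <= score for s in norm_scores) / len(norm_scores)

def stanine(score):
    """Stanine (1-9) from the percentile rank, using standard cut points."""
    cuts = [4, 11, 23, 40, 60, 77, 89, 96]   # cumulative-percentage boundaries
    return bisect_right(cuts, percentile_rank(score)) + 1

print(z_score(70), percentile_rank(70), stanine(70))
```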


Examples of within-group scores plotted against a normal distribution of participants’ observed scores.

Sometimes numerical scores cannot support the inference we want, and we give meaning to the assessment scores with a different ordered indicator. A common example is the use of performance level descriptors (PLDs, also known as achievement level descriptors or score band definitions). PLDs describe the average performance, abilities, or knowledge of participants who earn scores within a defined range. PLDs are often very detailed, though shortened versions may be used for reporting. In addition to the PLDs, performance levels (e.g., Pass/Fail, Does Not Meet/Meets/Exceeds) provide labels that tell users how to interpret the scores. In some assessment designs, performance levels and PLDs are reported without any scores. For example, an assessment may continue until a certain error threshold is met to determine which performance level should be assigned to the participant’s performance. If the participant performs very well consistently from the start, the assessment might end early and simply assign a “Pass” performance level rather than making the participant answer more items.
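
As a minimal sketch of how scores might be mapped to performance levels (the cut scores and labels below are hypothetical), a simple banding function could look like this:

```python
# Minimal sketch: mapping scale scores to performance levels via score bands.
# The cut scores and labels are hypothetical, chosen for illustration only.
performance_bands = [
    (0,   "Does Not Meet"),
    (400, "Meets"),
    (600, "Exceeds"),
]

def performance_level(scale_score):
    """Return the label of the highest band whose cut score is met."""
    label = performance_bands[0][1]
    for cut, band_label in performance_bands:
        if scale_score >= cut:
            label = band_label
    return label

print(performance_level(575))   # -> "Meets"
```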
