Assessment Report Design: Reporting Multiple Chunks of Information

Posted by Austin Fossey

We have discussed aspects of report design in previous posts, but I was recently asked whether an assessment report should report just one thing or multiple pieces of information. My response is that it depends on the intended use of the assessment results, but in general, I find that a reporting tool is more useful for a stakeholder if it can report multiple things at once.

This is not to say that more data are always better. A report that is cluttered or that has too much information will be difficult to interpret, and users may not be able to fish out the data they need from the display. Many researchers recommend keeping simple, clean layouts for reports while efficiently displaying relevant information to the user (e.g., Goodman & Hambleton, 2004; Wainer, 1984).

But what information is relevant? Again, it will depend on the user and the use case for the assessment, but consider the types of data we have for an assessment. We have information about the participants, information about the administration, information about the content, and information about performance (e.g., scores). These data dimensions can each provide different paths of inquiry for someone making inferences about the assessment results.

There are times when we may only care about one facet of this datascape, but these data provide context for each other, and understanding that context provides a richer interpretation.

Hattie (2009) recommended that a report should have a major theme, and that the theme should be emphasized with between five and nine “chunks” of information. He also recommended that the user have control of the report to be able to explore the data as desired.

Consider the Questionmark Analytics Score List Report: Assessment Results View. The major theme for the report is to communicate the scores of multiple participants. The report arguably contains five primary chunks of information: aggregate scores for groups of participants, aggregate score bands for groups of participants, scores for individual participants, score bands for individual participants, and information about the administration of the assessment to individual participants.

Through design elements and onscreen tools that give the user the ability to explore the data, this report with five chunks of information can provide context for each participant’s score. The user can sort participants to find the high- and low-performing participants, compare a participant to the entire sample of participants, or compare the participant to their group’s performance. The user can also compare the performance of groups of participants to see if certain groups are performing better than others.

Assessment Results View in the Questionmark Analytics Score List Report
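As a rough illustration of how these chunks give context to one another, the sketch below builds a small score list in Python. It is not how Questionmark Analytics is implemented; the participant records, the group names, and the 70% pass mark are all made-up assumptions.

```python
from statistics import mean

# Hypothetical records: (participant, group, score as a percentage)
results = [
    ("P01", "East", 82.0),
    ("P02", "East", 64.0),
    ("P03", "West", 91.0),
    ("P04", "West", 47.0),
]

PASS_MARK = 70.0  # assumed cut score for a two-level score band


def score_band(score, pass_mark=PASS_MARK):
    """Label a score with an assumed Pass/Fail band."""
    return "Pass" if score >= pass_mark else "Fail"


def group_means(records):
    """Average score for each group of participants."""
    by_group = {}
    for _, group, score in records:
        by_group.setdefault(group, []).append(score)
    return {group: mean(scores) for group, scores in by_group.items()}


def score_list(records):
    """Rows sorted from highest to lowest score, each with context columns."""
    overall = mean(score for _, _, score in records)
    groups = group_means(records)
    rows = []
    for name, group, score in sorted(records, key=lambda r: r[2], reverse=True):
        rows.append({
            "participant": name,
            "group": group,
            "score": score,
            "band": score_band(score),
            "vs_group": score - groups[group],  # distance from own group's mean
            "vs_all": score - overall,          # distance from the whole sample's mean
        })
    return rows


rows = score_list(results)
```

Each row carries the individual score, its band, and its distance from both the group mean and the overall mean, which is the kind of side-by-side context the report provides on screen.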

Online reporting also makes it easy to let users navigate between related reports, thus expanding the power of the reporting system. In the Score List Report, the user can quickly jump from Assessment Results to Topic Results or Item Results to make comparisons at different levels of the content. Similar functionality exists in the Questionmark Analytics Item Analysis Report, which allows the user to navigate directly from a Summary View comparing item statistics for different items to an Item Detail view that provides a more granular look at item performance through interpretive text and an option analysis table.

Analyzing multiple groups with the JTA Demographic Report

Posted by Austin Fossey

In my previous post, I talked about how the Job Task Analysis (JTA) Summary Report can be used by subject matter experts (SMEs) to inform their decisions about what content to include in an assessment.

In many JTA studies, we might survey multiple populations of stakeholders who may have different opinions about what content should be on the assessment. The populations we select will be guided by theory or previous research. For example, for a certification assessment, we might survey the practitioners who will be candidates for certification, their managers, and their clients—because our subject matter experts theorize that each of these populations will have different yet relevant opinions about what a competent candidate must know and be able to do in order to be certified.

Instead of requiring you to create multiple JTA survey instruments for each population in the study, Questionmark Analytics allows you to analyze the responses from different groups of survey participants using the JTA Demographic Report.

This report provides demographic comparisons of aggregated JTA responses for each of the populations in the study. Users can simply add a demographic question to their survey so that this information can be used by the JTA Demographic Report. In our earlier example, we might ask survey participants to identify themselves as a practitioner, manager, or client, and then this data would be used to compare results in the report.

As with the JTA Summary Report, there are no requirements for how SMEs must use these data. The interpretations will either be framed out by the test developer using theory or prior research, or the interpretations will be left completely to the SMEs’ expert judgment.

SMEs might wish to investigate topics where populations differed in their ratings, or they may wish to select only those topics where there was universal agreement. They may wish to prioritize or weight certain populations’ opinions, especially if a population is less knowledgeable about the content than others.

The JTA Demographic Report provides a frequency distribution table for each task on the survey, organized by dimension. A chart gives a visual indicator to show differences in response distributions between groups.


Response distribution table and chart comparing JTA responses from nurses and doctors using the Questionmark JTA Demographic Report.
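For readers who want to see the underlying idea, here is a minimal Python sketch of a per-group frequency distribution for one task and dimension. The response records, task name, and rating labels are hypothetical, and this is not the report's actual implementation.

```python
from collections import Counter, defaultdict

# Hypothetical survey responses: (group, task, dimension, rating)
responses = [
    ("Nurse", "Administer medication", "Frequency", "Daily"),
    ("Nurse", "Administer medication", "Frequency", "Daily"),
    ("Nurse", "Administer medication", "Frequency", "Weekly"),
    ("Doctor", "Administer medication", "Frequency", "Weekly"),
    ("Doctor", "Administer medication", "Frequency", "Never"),
]


def frequency_tables(rows):
    """Count ratings per (task, dimension), split by demographic group."""
    tables = defaultdict(lambda: defaultdict(Counter))
    for group, task, dimension, rating in rows:
        tables[(task, dimension)][group][rating] += 1
    return tables


def as_percentages(counter):
    """Convert raw counts to percentages of that group's responses."""
    total = sum(counter.values())
    return {rating: 100.0 * n / total for rating, n in counter.items()}


tables = frequency_tables(responses)
nurse = tables[("Administer medication", "Frequency")]["Nurse"]
doctor = tables[("Administer medication", "Frequency")]["Doctor"]
```

Converting counts to percentages within each group makes the distributions comparable even when the groups differ in size.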

Discussing data mining at NCME

Posted by Austin Fossey

We will wrap up our discussion of themes at the National Council on Measurement in Education (NCME) annual meeting with an overview of the inescapable discussion about working with complex (and often messy) data sets.

It was clear from many of the presentations and poster sessions that technology is driving the direction of assessment, for better or for worse (or as Damian Betebenner put it, “technology eats statistics”). Advances in technology have allowed researchers to examine new statistical models for scoring participants, identify aberrant responses, score performance tasks, identify sources of construct-irrelevant variance, diversify item formats, and improve reporting methods.

As the symbiotic knot between technology and assessment grows tighter, many researchers and test developers are in the unexpected position of having too much data. This is especially true in complex assessment environments that yield log files with staggering amounts of information about a participant’s actions within an assessment.

Log files can track many types of data in an assessment, such as responses, click streams, and system states. All of these data are time stamped, and if they capture the right data, they can illuminate some of the cognitive processes that are manifesting themselves through the participant’s interaction with the assessment. Raw assessment data like Questionmark’s Results API OData Feeds can also be coupled with institutional data, thus exponentially growing the types of research questions we can pursue within a single organization.

NCME attendees learned about hardware and software that captures both response variables and behavioral variables from participants as they complete an online learning task.

Several presenters discussed issues and strategies for addressing less-structured data, with many papers tackling log file data gathered as participants interact with an online assessment or other online task. Ryan Baker (International Educational Data Mining Society) gave a talk about combining the data mining of log files with field observations to identify hard-to-capture domains, like student engagement.

Baker focused on the positive aspects of having oceans of data, choosing to remain optimistic about what we can do rather than dwell on the difficulties of iterative model building in these types of research projects. He shared examples of intelligent tutoring systems designed to teach students while also gathering data about the student’s level of engagement with the lesson. These examples were peppered with entertaining videos of the researchers in classrooms playing with their phones so that individual students would not realize that they were being subtly observed by the researcher via sidelong glances.

Evidence-centered design (ECD) emerged as a consistent theme: there was a lot of conversation about how researchers are designing assessments so that they yield fruitful data for intended inferences. Nearly every presentation about assessment development referenced ECD. Valerie Shute (Florida State University) observed that five years ago, only a fraction of participants would have known about ECD, but today it is widely used by practitioners.

Discussing validity at NCME

Posted by Austin Fossey

To continue last week’s discussion about big themes at the recent NCME annual meeting, I wanted to give an update on conversations about validity.

Validity is a core concept in good assessment development, which we have frequently discussed on this blog. Even though this is such a fundamental concept, our industry is still passionately debating what constitutes validity and how the term should be used.

NCME hosted a wonderful coordinated session during which some of the big names in validity theory presented their thoughts on how the term validity should be used today.

Michael Kane (Educational Testing Service) began the discussion with his well-established views around argument-based validity. In this framework, the test developer must make a validity claim about an inference (or interpretation as Kane puts it) and then support that claim with evidence. Kane argues that validity is not a property of the test or scores, but it is instead a property of the inferences we make about those scores.

If you have read some of my previous posts on validity or attended my presentations about psychometric principles, you already know that I am a strong proponent of Kane’s view that validity refers to the interpretations and use cases—not to the instrument.

But not everyone agrees. Keith Markus (City University of New York) suggested that nitpicking about whether the test or the inference is the object of validity causes us to miss the point. The test and the inference work only as a combination, so validity (as a term and as a research goal) should be applied to these as a pair.

Pamela Moss (University of Michigan) argued that we need to shift the focus of validity study away from intended inferences and use cases to the actual use cases. Moss believes that the actual use cases of assessment results can be quite varied and nuanced, but we are really more interested in these real-world impacts. She proposed that we work to validate what she called “conceptual uses.” For example, if we want to use education assessments to improve learning, then we need to research why students earn low scores.

Greg Cizek (University of North Carolina) disagreed with Kane’s approach, saying that the evidence we gather to support an inference says nothing about the use cases, and vice versa. Cizek argued that we make two inferential leaps: one from the score to the inference, and one from the inference to the use case. So we should gather evidence that supports both inferential leaps.

Though I see Cizek’s point, I feel that it would not drastically change how I would approach a validity study in practice. After all, you cannot have a use case without making an inference, so I would likely just tackle the inferences and their associated use cases jointly.

Steve Sireci (University of Massachusetts) felt similarly. Sireci is one of my favorite presenters on the topic of validity, plus he gets extra points for matching his red tie and shirt to the color theme on his slides. Sireci posed this question: can we have an inference without having a use case? If so, then we have a “useless” test, and while there may be useless tests out there, we usually only care about the tests that get used. As a result, Sireci suggested that we must validate the inference, but that this validation must also demonstrate that the inference is appropriate for the intended uses.

Giving meaning to assessment scores

Posted by Austin Fossey

As discussed in previous posts, validity refers to the proper inferences for and uses of assessment results. Assessment results are often in the form of assessment scores, and the valid inferences may depend heavily on how we format, label, report, and distribute those scores.

At the core of most assessment results are raw scores. Raw scores are simply the number of points earned by participants based on their responses to items in an assessment. Raw scores are convenient because they are easy to calculate and easy to communicate to participants and stakeholders. However, their interpretation may be constrained.

In their chapter in Educational Measurement (4th ed.), Cohen and Wollack explain that “raw scores have little clear meaning beyond the particular set of questions and the specific test administration.” This is often fine when our inferences are intended to be limited to a specific assessment administration, but what about further inferences?

Petersen, Kolen, and Hoover stated in their chapter in Educational Measurement (3rd ed.) that “the main purpose of scaling is to aid users in interpreting test results.” So when other inferences need to be made about the participants’ results, it is common to transform participants’ scored responses into a more meaningful measure.

When raw scores do not support the desired inference, then we may need to create a scale score. In his chapter in Educational Measurement (4th ed.), Kolen explains that “scaling is the process of associating numbers or other ordered indicators with the performance of examinees.” Scaling examples include percentage scores to be used for topic comparisons within an assessment, equating scores so that scores from multiple forms can be used interchangeably, or scaling IRT theta values so that all reported scores are positive values. SAT scores are examples of the latter two cases. There are many scaling procedures, and a full discussion is not possible here. (If you’d like to know more about this, I’d suggest reading Kolen’s chapter, referenced above.)
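Two of these transformations are simple enough to sketch in Python. The example below is only an illustration: the linear scale constants (a center of 500, a spread of 100, and reporting limits of 200 and 800) are assumptions chosen for the example and are not the scaling rules of any particular testing program.

```python
def percentage_score(raw, max_points):
    """Raw points expressed as a percentage of the points available."""
    return 100.0 * raw / max_points


def scale_theta(theta, center=500.0, spread=100.0, lo=200, hi=800):
    """Linearly rescale an IRT theta estimate so reported scores are positive.

    The constants are illustrative assumptions. The result is rounded to a
    whole number and clamped to the assumed reporting range [lo, hi].
    """
    scaled = round(center + spread * theta)
    return max(lo, min(hi, scaled))


topic_pct = percentage_score(18, 24)   # 18 of 24 points on one topic
reported = scale_theta(-1.25)          # a below-average theta estimate
```

A negative theta such as -1.25 becomes a positive reported score under this transformation, and percentage scores put topics with different point totals on a common footing.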

Cohen and Wollack also describe two types of derived scores: developmental scores and within-group scores. These derived scores are designed to support specific types of inferences. Developmental scores show a student’s progress in relation to defined developmental milestones, such as grade equivalency scores used in education assessments. Within-group scores demonstrate a participant’s normative performance relative to a sample of participants. Within-group scores include standardized z scores, percentiles, and stanines.


Examples of within-group scores plotted against a normal distribution of participants’ observed scores.
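The within-group scores mentioned above can be computed from a sample of observed scores. The Python sketch below is a minimal illustration under a few assumptions: z scores use the population standard deviation of the sample, percentile ranks count half of any tied scores, and stanines are assigned from the conventional half-standard-deviation cut points, which presumes roughly normal scores. The sample data are made up.

```python
from bisect import bisect_right
from statistics import mean, pstdev

# Conventional stanine boundaries on the z scale (half-SD-wide bands)
STANINE_CUTS = [-1.75, -1.25, -0.75, -0.25, 0.25, 0.75, 1.25, 1.75]


def z_scores(scores):
    """Standardize scores against the sample mean and standard deviation.

    Assumes the scores are not all identical (non-zero standard deviation).
    """
    mu = mean(scores)
    sigma = pstdev(scores)
    return [(s - mu) / sigma for s in scores]


def percentile_rank(score, scores):
    """Percentage of the sample scoring below `score`, counting half of ties."""
    below = sum(s < score for s in scores)
    ties = sum(s == score for s in scores)
    return 100.0 * (below + 0.5 * ties) / len(scores)


def stanine(z):
    """Map a z score to a stanine from 1 to 9."""
    return bisect_right(STANINE_CUTS, z) + 1


sample = [2, 4, 4, 4, 5, 5, 7, 9]   # hypothetical observed scores
zs = z_scores(sample)
stanines = [stanine(z) for z in zs]
```

All three indicators describe the same thing, a participant's standing relative to the sample, but at different levels of granularity.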

Sometimes numerical scores cannot support the inference we want, and we give meaning to the assessment scores with a different ordered indicator. A common example is the use of performance level descriptors (PLDs, also known as achievement level descriptors or score band definitions). PLDs describe the average performance, abilities, or knowledge of participants who earn scores within a defined range. PLDs are often very detailed, though shortened versions may be used for reporting. In addition to the PLDs, performance levels (e.g., Pass/Fail, Does Not Meet/Meets/Exceeds) provide labels that tell users how to interpret the scores. In some assessment designs, performance levels and PLDs are reported without any scores. For example, an assessment may continue until a certain error threshold is met to determine which performance level should be assigned to the participant’s performance. If the participant performs very well consistently from the start, the assessment might end early and simply assign a “Pass” performance level rather than making the participant answer more items.

An easier approach to job task analysis: Q&A

Posted by Julie Delazyn

Part of the assessment development process is understanding what needs to be tested. When you are testing what someone needs to know in order for them to do their job well, subject matter experts can help you harvest evidence for your test items by observing people at work. That traditionally manual process can take a lot of time and money.

Questionmark’s new job task analysis (JTA) capabilities enable SMEs to harvest information straight from the person doing the job. These tools also offer an easier way to see the frequency, importance, difficulty and applicability of a task in order to know if it’s something that needs to be included in an assessment.

Now that JTA question authoring, assessment creation and reporting are available to users of Questionmark OnDemand and Questionmark Perception 5.7, I wanted to understand what makes this special and important. Questionmark Product Manager Jim Farrell, who has been working on the JTA question since its conception, was kind enough to speak to me about its value, why it was created, and how it can now benefit our customers.

Here is a snippet of our conversation:

So … first things first … what exactly IS job task analysis and how would our customers benefit from using it?

Job task analysis, JTA, is a survey that you send out that has a list of tasks, which are broken down into dimensions. Those dimensions are typically difficulty, importance, frequency, and applicability. You want to find out things like this from someone who fills out the surveys: Do they find the job difficult? Do they deem it important? And how frequently do they do it? When you correlate all this data you’ll quickly see the items that are more important to test on and collect information on.

We have a JTA question type in Questionmark Live where you can either build your task list and your dimensions or you can import your tasks through a simple import process, so if you have a spreadsheet with all of your tasks you can easily import it. You would then add those to a survey and send them out to collect information. We also have two JTA reports: one allows you to break down results by the actual dimension (for example, just look at the difficulty for all the tasks), and the other gives you a summary view of all of your tasks and all the dimensions at one time, like a snapshot.

That sounds very interesting and easy to use! I’m interested in how this question type actually came to be.

We initially developed the job task analysis survey for the US Navy. Prior to this, trainers would have to travel with paper and clipboards to submarines, battleships and aircraft carriers and watch sailors and others in the Navy do their jobs. We developed the JTA survey to help them be more efficient and to collect this data more easily and a lot more quickly than they did before.

What do you think is most valuable and exciting about JTA?

To me, the value comes in the ease of creating the questions and sending them out. And I am probably most excited for our customers. Most customers probably harvest information with paper and clipboard, walking around and watching people do their jobs. That’s a very expensive and time-consuming task, so by being able to send this survey out directly to subject matter experts you’re getting more authentic data because you are getting it right from the SMEs rather than from someone observing the behavior.


It was fascinating for me to understand how JTA was created and how it works … Do you find this kind of question type interesting? How do you see yourself using it? Please share your thoughts below!
