Assessment Report Design: Reporting Multiple Chunks of Information

Posted by Austin Fossey

We have discussed aspects of report design in previous posts, but I was recently asked whether an assessment report should report just one thing or multiple pieces of information. My response is that it depends on the intended use of the assessment results, but in general, I find that a reporting tool is more useful for a stakeholder if it can report multiple things at once.

This is not to say that more data are always better. A report that is cluttered or that has too much information will be difficult to interpret, and users may not be able to fish out the data they need from the display. Many researchers recommend keeping simple, clean layouts for reports while efficiently displaying relevant information to the user (e.g., Goodman & Hambleton, 2004; Wainer, 1984).

But what information is relevant? Again, it will depend on the user and the use case for the assessment, but consider the types of data we have for an assessment. We have information about the participants, information about the administration, information about the content, and information about performance (e.g., scores). These data dimensions can each provide different paths of inquiry for someone making inferences about the assessment results.

There are times when we may only care about one facet of this datascape, but these data provide context for each other, and understanding that context provides a richer interpretation.

Hattie (2009) recommended that a report should have a major theme, and that the theme should be emphasized with five to nine “chunks” of information. He also recommended that the user have control of the report so that they can explore the data as desired.

Consider the Questionmark Analytics Score List Report: Assessment Results View. The major theme for the report is to communicate the scores of multiple participants. The report arguably contains five primary chunks of information: aggregate scores for groups of participants, aggregate score bands for groups of participants, scores for individual participants, score bands for individual participants, and information about the administration of the assessment to individual participants.

Through design elements and onscreen tools that give the user the ability to explore the data, this report with five chunks of information can provide context for each participant’s score. The user can sort participants to find the high- and low-performing participants, compare a participant to the entire sample of participants, or compare the participant to their group’s performance. The user can also compare the performance of groups of participants to see if certain groups are performing better than others.


Assessment Results View in the Questionmark Analytics Score List Report

Online reporting also makes it easy to let users navigate between related reports, thus expanding the power of the reporting system. In the Score List Report, the user can quickly jump from Assessment Results to Topic Results or Item Results to make comparisons at different levels of the content. Similar functionality exists in the Questionmark Analytics Item Analysis Report, which allows the user to navigate directly from a Summary View comparing item statistics for different items to an Item Detail view that provides a more granular look at item performance through interpretive text and an option analysis table.

Item Analysis Analytics Part 4: The Nitty-Gritty of Item Analysis

 


Posted by Greg Pope

In my previous blog post I highlighted some of the essential things to look for in a typical Item Analysis Report. Now I will dive into the nitty-gritty of item analysis, looking at example questions and explaining how to use the Questionmark Item Analysis Report in an applied context for a State Capitals Exam.

The Questionmark Item Analysis Report first produces an overview of question performance, both in terms of the difficulty of questions and in terms of the discrimination of questions (upper minus lower groups). These overview charts give you a “bird’s eye view” of how the questions composing an assessment perform. In the example below we see that we have a range of questions in terms of their difficulty (“Item Difficulty Level Histogram”), with some harder questions (the bars on the left), mostly average-difficulty questions (the bars in the middle), and some easier questions (the bars on the right). In terms of discrimination (“Discrimination Indices Histogram”) we see that many questions have high discrimination, as evidenced by the bars being pushed up to the right (more questions on the assessment have higher discrimination statistics).

Item Difficulty Level Histogram and Discrimination Indices Histogram for the State Capitals Exam

Overall, if I were building a typical criterion-referenced assessment with a pass score around 50% I would be quite happy with this picture. We have more questions functioning at the pass score point with a range of questions surrounding it and lots of highly discriminating questions. We do have one rogue question on the far left with a very low discrimination index, which we need to look at.
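For readers who want to see roughly how these overview statistics come about, here is a minimal sketch in Python. It is not Questionmark’s implementation: the response matrix is simulated, and the upper-minus-lower discrimination index uses the common convention of comparing the top and bottom 27% of participants by total score, which may differ from the grouping used in the report.

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical 0/1 scored response matrix: rows = participants, columns = questions
rng = np.random.default_rng(0)
scores = (rng.random((175, 30)) > 0.4).astype(int)

totals = scores.sum(axis=1)          # each participant's total score
p_values = scores.mean(axis=0)       # item difficulty: proportion answering correctly

# Upper-minus-lower discrimination: top 27% vs. bottom 27% of participants by total score
cut = int(round(0.27 * len(totals)))
order = np.argsort(totals)
lower, upper = order[:cut], order[-cut:]
discrimination = scores[upper].mean(axis=0) - scores[lower].mean(axis=0)

# Overview histograms analogous to the report's two charts
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.hist(p_values, bins=10)
ax1.set_title("Item Difficulty Level Histogram")
ax2.hist(discrimination, bins=10)
ax2.set_title("Discrimination Indices Histogram")
plt.show()
```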

The next step is to drill down into each question to ensure that each question performs as it should. Let’s look at two questions from this assessment, one question that performs well and one question that does not perform so well.

The question below is an example of a question that performs nicely. Here are some reasons why:

  • Going from left to right, first we see that the “Number of Results” is 175, which is a nice sample of participants to evaluate the psychometric performance of this question.
  • Next we see that everyone answered the question (“Number not Answered” = 0), which means there probably wasn’t a problem with people not finishing or finding the question confusing and giving up.
  • The “P Value Proportion Correct” shows us that this question sits just above the pass score: 61% of participants ‘got it right.’ Nothing wrong with that: the question is neither too easy nor too hard.
  • The “Item Discrimination” indicates good discrimination: the difference between the upper and lower groups in the proportion selecting the correct answer of ‘Salem’ is 48%. This means that 88% of the participants with the highest overall exam scores selected the correct answer versus only 40% of the participants with the lowest overall exam scores. This is a nice, expected pattern.
  • The “Item Total Correlation” backs the Item Discrimination up with a strong value of 0.40. This means that across all participants who answered the question, the pattern of high scorers getting the question right more often than low scorers holds true. (A rough sketch of how these statistics are calculated follows this list.)
  • Finally we look at the Outcome information to see how the distracters perform. We find that each distracter pulled some participants, with ‘Portland’ pulling the most participants, especially from the “Lower Group.” This pattern makes sense because those with poor state capital knowledge may make the common mistake of selecting Portland as the capital of Oregon.
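The numbers above come straight from the report, but if it helps to see the arithmetic, here is a minimal sketch of how these classical statistics can be computed for a single multiple-choice question. This is illustrative only, not Questionmark’s code: the function name, the 27% upper/lower split, and the option labels passed in are my own assumptions.

```python
import numpy as np

def item_stats(choices, key, totals, options):
    """Classical statistics for one multiple-choice question (illustrative only).

    choices: array of the option label each participant selected
    key:     the keyed-correct option label
    totals:  each participant's total assessment score
    options: all option labels for the question
    """
    correct = (choices == key).astype(float)
    p_value = correct.mean()                       # "P Value Proportion Correct"

    # "Item Discrimination": proportion correct in the upper group minus the
    # proportion correct in the lower group (top/bottom 27% by total score here)
    cut = int(round(0.27 * len(totals)))
    order = np.argsort(totals)
    lower, upper = order[:cut], order[-cut:]
    discrimination = correct[upper].mean() - correct[lower].mean()

    # "Item Total Correlation": point-biserial, i.e. the Pearson correlation
    # between the 0/1 item score and the total score
    item_total_r = np.corrcoef(correct, totals)[0, 1]

    # Outcome/option analysis: proportion choosing each option, overall and by group
    outcomes = {opt: {"all": (choices == opt).mean(),
                      "upper": (choices[upper] == opt).mean(),
                      "lower": (choices[lower] == opt).mean()}
                for opt in options}
    return p_value, discrimination, item_total_r, outcomes
```

Called with the participants’ selected options, the key, and their total scores, it returns the p-value, the upper-minus-lower discrimination, the item-total correlation, and an outcome table like the one shown in the report.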

The psychometricians, SMEs, and test developers reviewing this question all have smiles on their faces when they see the item analysis for this item.

Item analysis detail for the well-performing question

Next we look at that rogue question that does not perform so well in terms of discrimination: the one we saw in the Discrimination Indices Histogram. When we look into the question we understand why it was flagged:

  • Going from left to right, first we see that the “Number of Results” is 175, which is again a nice sample size: nothing wrong here.
  • Next we see everyone answered the question, which is good.
  • The first red flag comes from the “P Value Proportion Correct”: this question is quite difficult (only 35% of participants selected the correct answer). This is not in and of itself a bad thing, so we keep it in mind as we move on.
  • The “Item Discrimination” indicates a major problem, a negative discrimination value. This means that participants with the lowest exam scores selected the correct answer more than participants with the highest exam scores. This is not the expected pattern we are looking for: Houston, this question has a problem!
  • The “Item Total Correlation” backs up the Item Discrimination with a high negative value.
  • To find out more about what is going on, we delve into the Outcome information area to see how the distracters perform. We find that the keyed-correct answer of Nampa is not showing the expected pattern of upper minus lower proportions. We do, however, find that the distracter “Boise” is showing the expected pattern of the Upper Group (86%) selecting this response option much more than the Lower Group (15%). Wait a second…I think I know what is wrong with this one: it has been mis-keyed! Someone accidentally assigned a score of 1 to Nampa rather than Boise.

Item analysis detail for the mis-keyed question

No problem: the administrator pulls the data into the Results Management System (RMS), changes the keyed correct answer to Boise, and presto, we now have defensible statistics that we can work with for this question.
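For illustration only, here is a hedged sketch of how the mis-key pattern described above could be flagged and the item re-scored outside the report. The function names and the 0.3 threshold are my own assumptions, not part of the RMS; a flagged item would still go to SMEs for review before the key is changed.

```python
import numpy as np

def flag_possible_miskey(choices, key, totals, options, threshold=0.3):
    """Flag an option that behaves like the true key when the keyed answer does not.

    The 0.3 threshold is arbitrary; in practice a flagged item is reviewed by
    SMEs before anything is re-keyed.
    """
    cut = int(round(0.27 * len(totals)))
    order = np.argsort(totals)
    lower, upper = order[:cut], order[-cut:]
    # Upper-minus-lower proportion selecting each option, including the current key
    uml = {opt: (choices[upper] == opt).mean() - (choices[lower] == opt).mean()
           for opt in options}
    best = max(uml, key=uml.get)
    if uml[key] < 0 and best != key and uml[best] > threshold:
        return best            # e.g. "Boise" when the key was entered as "Nampa"
    return None

def rescore(choices, corrected_key):
    """Re-score the item once the key has been corrected."""
    return (choices == corrected_key).astype(int)
```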

Item analysis after the keyed-correct answer is changed to Boise

The psychometricians, SMEs, and test developers reviewing this question had frowns on their faces at first, but those frowns were turned upside down when they realized it was just a simple mis-keyed question.

In my next blog post I would like to share some observations on the relationship between Outcome Discrimination and Outcome Correlation.

Are you ready for some light relief after pondering all these statistics? Then have some fun with our own State Capitals Quiz.

Psychometrics 101: Sample size and question difficulty (p-values)


Posted by Greg Pope

With just a week to go before the Questionmark Users Conference, here’s a little taste of the presentation I will be doing on psychometrics. I will also be running a session on Item Analysis and Test Analysis.

So, let’s talk about sample size and question difficulty!

How does the number of participants who take a question relate to the robustness/stability of the question difficulty statistic (p-value)? Basically, the smaller the number of participants tested, the less robust/stable the statistic. So if 30 participants take a question and the p-value that appears in the Item Analysis Report is 0.600, the range that the theoretical “true” p-value (if all participants in the world took the question) could fall into 95% of the time is between 0.425 and 0.775. This means that if another 30 participants were tested you could get a p-value on the Item Analysis Report anywhere from 0.425 to 0.775 (95% confidence range). The takeaway is that if high-stakes decisions are being made using p-values (e.g., whether to drop a question from a certification exam), the more participants that can be tested the better, because the results will be more robust. Similarly, if you are conducting beta testing and want to decide which questions to include in your test form based on the beta test results, the more participants you can beta test, the more confidence you will have in the stability of the statistics. Below is a graph that illustrates this relationship.

How sample size influences the 95% confidence range around a p-value
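The 0.425-0.775 range quoted above is what you get from a standard normal-approximation (Wald) confidence interval for a proportion; a quick sketch:

```python
import math

def p_value_ci(p, n, z=1.96):
    """Approximate 95% confidence interval for an item p-value (normal approximation)."""
    half_width = z * math.sqrt(p * (1 - p) / n)
    return p - half_width, p + half_width

print(p_value_ci(0.600, 30))    # roughly (0.425, 0.775), the range quoted above
print(p_value_ci(0.600, 1000))  # far tighter: roughly plus or minus 0.03
```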

This relationship between sample size and statistical stability applies to other common statistics used in psychometrics. For example, the item-total correlation (point-biserial correlation coefficient) can vary a great deal when small sample sizes are used to calculate it. In the example below we see that an observed correlation of 0 can actually vary by over 0.8 (plus or minus).

How sample size influences the confidence range around an item-total correlation
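The chart’s exact method isn’t spelled out here, but one standard way to put a confidence range around an observed correlation is the Fisher z-transformation; a minimal sketch, with sample sizes chosen only for illustration:

```python
import math

def correlation_ci(r, n, z=1.96):
    """Approximate 95% confidence interval for a correlation via the Fisher z-transformation."""
    fisher_z = math.atanh(r)
    se = 1.0 / math.sqrt(n - 3)
    return math.tanh(fisher_z - z * se), math.tanh(fisher_z + z * se)

for n in (10, 30, 100, 1000):
    low, high = correlation_ci(0.0, n)
    print(f"n={n:5d}: observed r = 0.0, plausible range ({low:+.2f}, {high:+.2f})")
```

As with p-values, the interval shrinks quickly as the sample grows, which is why beta testing with more participants gives you much more trustworthy item statistics.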