Know what your questions are about before you deliver the test

Posted by Austin Fossey

A few months ago, I had an interesting conversation with an assessment manager at an educational institution—not a Questionmark customer, mind you. Finding nothing else in common, we eventually began discussing assessment design.

At this institution (which will remain anonymous), he admitted that they are often pressed for time in their assessment development cycle. There is not enough time to do all of the item development work they need to do before their students take the assessment. To get around this, their item writers draft all of the items, conduct an editorial review, and then deliver the items. The items are assigned topics after administration, and students’ total scores and topic scores are calculated from there. He asked me if Questionmark software allows test developers to assign topics and calculate topic scores after assessing the students, and I answered truthfully that it does not.

But why not? Is there a reason test developers should not do what is being practiced at this institution? Yes, there are in fact two reasons. Get ready for some psychometric finger-wagging.

Consider what this institution is doing. The items are drafted and subjected to an editorial review, but no one ever classifies the items within a topic until after the test has been administered. Recall what people typically do during a content review prior to administration:

  • Remove items that are not relevant to the domain.
  • Ensure that the blueprint is covered.
  • Check that items are assigned to the correct topic.

If topics are not assigned until after the participants have already tested, we risk the validity of the results and the legal defensibility of the test. If we have delivered items that are not relevant to the domain, we have wasted participants’ time and will need to adjust their total score. Okay, we can manage that by telling the participants ahead of time that some of the test items might not count. But if we have not asked the correct number of questions for a given area of the blueprint, the entire assessment score will be worthless—a threat to validity known as construct underrepresentation or construct deficiency in The Standards for Educational and Psychological Testing.

For example, if we were supposed to deliver 20 items from Topic A, but find out after the fact that only 12 items have been classified as belonging to Topic A, then there is little we can do about it besides rebuilding the test form and making everyone take the test again.
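As a simple illustration of that pre-delivery check (a hypothetical sketch, not a description of any particular product's workflow), comparing the blueprint's required counts against the items' assigned topics takes only a few lines of code:

```python
# Hypothetical blueprint coverage check, run BEFORE the form is delivered.
# The blueprint and the item-to-topic assignments below are invented examples.

blueprint = {"Topic A": 20, "Topic B": 15, "Topic C": 10}   # required items per topic

item_topics = {                                             # assigned during content review
    "item_001": "Topic A",
    "item_002": "Topic B",
    "item_003": "Topic A",
    # ... the rest of the draft form
}

def blueprint_shortfalls(blueprint, item_topics):
    """Return topics whose item counts fall short of the blueprint."""
    counts = {topic: 0 for topic in blueprint}
    for topic in item_topics.values():
        if topic in counts:
            counts[topic] += 1
    return {topic: (required, counts[topic])
            for topic, required in blueprint.items()
            if counts[topic] < required}

for topic, (required, actual) in blueprint_shortfalls(blueprint, item_topics).items():
    print(f"{topic}: blueprint requires {required} items, the draft form has {actual}")
```

Catching a shortfall at this stage means fixing the draft form, rather than rebuilding the test and retesting everyone afterwards.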

The Standards provide helpful guidance in these matters. For this particular case, the Standards point out that:

“The test developer is responsible for documenting that the items selected for the test meet the requirements of the test specifications. In particular, the set of items selected for a new test form . . . must meet both content and psychometric specifications.” (p. 82)

Publications describing best practices for test development also specify that the content must be determined before delivering an operational form. For example, in their chapter in Educational Measurement (4th Edition), Cynthia Schmeiser and Catherine Welch note the importance of conducting a content review of items before field testing, as well as a final content review of a draft test form before it becomes operational.

In Introduction to Classical and Modern Test Theory, Linda Crocker and James Algina also made an interesting observation about classroom assessments, noting that students expect to be graded on all of the items they have been asked to answer. Even if notified in advance that some items might not be counted (as one might do in field testing), students might not consider it fair that their score is based on a yet-to-be-determined subset of items that may not fully represent the content that is supposed to be covered.

This is why Questionmark’s software is designed the way it is. When creating an item, item writers must assign it to a topic, and items can be classified or labeled along other dimensions (e.g., cognitive process) using metatags. Even if an assessment program cannot muster any further content review, at least the item writer has classified items by content area. The person building the test form then has the information they need to make sure that the right questions get asked.

We have a responsibility as test developers to treat our participants fairly and ethically. If we are asking them to spend their time taking a test, then we owe them the most useful measurement that we can provide. Participants trust that we know what we are doing. If we postpone critical, basic development tasks like content identification until after participants have already given us their time, we are taking advantage of that trust.

Item Analysis Analytics Part 4: The Nitty-Gritty of Item Analysis

 


Posted by Greg Pope

In my previous blog post I highlighted some of the essential things to look for in a typical Item Analysis Report. Now I will dive into the nitty-gritty of item analysis, looking at example questions and explaining how to use the Questionmark Item Analysis Report in an applied context for a State Capitals Exam.

The Questionmark Item Analysis Report first produces an overview of question performance both in terms of the difficulty of questions and in terms of the discrimination of questions (upper minus lower groups). These overview charts give you a “bird’s eye view” of how the questions composing an assessment perform. In the example below we see that we have a range of questions in terms of their difficulty (“Item Difficulty Level Histogram”), with some harder questions (the bars on the left), mostly average-difficulty questions (the bars in the middle), and some easier questions (the bars on the right). In terms of discrimination (“Discrimination Indices Histogram”) we see that many questions have high discrimination, as evidenced by the bars being pushed up to the right (more questions on the assessment have higher discrimination statistics).

[Figure: Item Difficulty Level Histogram and Discrimination Indices Histogram]

Overall, if I were building a typical criterion-referenced assessment with a pass score around 50% I would be quite happy with this picture. We have more questions functioning at the pass score point with a range of questions surrounding it and lots of highly discriminating questions. We do have one rogue question on the far left with a very low discrimination index, which we need to look at.
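For readers who want to see what sits behind an overview like this, here is a minimal sketch of how item difficulty (p-values) and upper-minus-lower discrimination indices could be computed from a scored response matrix. The simulated data and the 27% upper/lower group split are assumptions made for illustration, not necessarily the exact method the report uses.

```python
import numpy as np

# Sketch: difficulty and discrimination overview from a 0/1 scored matrix
# (participants x items). Data are simulated purely to make the example runnable.
rng = np.random.default_rng(0)
true_difficulty = rng.uniform(0.3, 0.9, size=40)                 # 40 items of varying difficulty
scores = (rng.random((175, 40)) < true_difficulty).astype(int)   # 175 participants

total_scores = scores.sum(axis=1)
p_values = scores.mean(axis=0)                                   # proportion correct per item

n_group = int(round(0.27 * scores.shape[0]))                     # common 27% group convention
order = np.argsort(total_scores)
lower, upper = scores[order[:n_group]], scores[order[-n_group:]]
discrimination = upper.mean(axis=0) - lower.mean(axis=0)         # upper minus lower proportions

print("Difficulty histogram:", np.histogram(p_values, bins=10, range=(0, 1))[0])
print("Discrimination histogram:", np.histogram(discrimination, bins=10, range=(-1, 1))[0])
```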

The next step is to drill down into each question to ensure that each question performs as it should. Let’s look at two questions from this assessment, one question that performs well and one question that does not perform so well.

The question below is an example of a question that performs nicely. Here are some reasons why:

  • Going from left to right, first we see that the “Number of Results” is 175, which is a nice sample of participants to evaluate the psychometric performance of this question.
  • Next we see that everyone answered the question (“Number not Answered” = 0), which means there probably wasn’t a problem with people not finishing or finding the question confusing and giving up.
  • The “P Value Proportion Correct” shows us that this question is just above the pass score where 61% of participants ‘got it right.’ Nothing wrong with that: the question is neither too easy nor too hard.
  • The “Item Discrimination” indicates good discrimination: the difference between the upper and lower groups in the proportion selecting the correct answer of ‘Salem’ is 48%. This means that 88% of the participants with the highest overall exam scores selected the correct answer, versus only 40% of the participants with the lowest overall exam scores. This is a nice, expected pattern.
  • The “Item Total Correlation” backs up the Item Discrimination with a strong value of 0.40. This means that across all participants who answered the question, the pattern of high scorers getting it right more often than low scorers holds true.
  • Finally we look at the Outcome information to see how the distracters perform. We find that each distracter pulled some participants, with ‘Portland’ pulling the most participants, especially from the “Lower Group.” This pattern makes sense because those with poor state capital knowledge may make the common mistake of selecting Portland as the capital of Oregon. (A rough sketch of how these statistics can be computed follows this list.)
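Here is that rough sketch: one way the per-question statistics named above (p-value, upper-minus-lower discrimination, item-total correlation, and the outcome/distractor breakdown) can be computed. The variable names and the 27% group split are assumptions for illustration, not Questionmark's exact algorithm.

```python
import numpy as np

def item_statistics(scores, item_index, choices, group_frac=0.27):
    """scores: participants x items 0/1 NumPy array; choices: NumPy array of the
    option each participant chose for the item of interest (hypothetical inputs)."""
    item = scores[:, item_index]
    total = scores.sum(axis=1)
    rest_total = total - item                          # total score with this item excluded

    p_value = item.mean()                              # "P Value Proportion Correct"
    item_total_r = np.corrcoef(item, rest_total)[0, 1] # item-total (point-biserial) correlation

    n = int(round(group_frac * len(total)))            # upper/lower group size
    order = np.argsort(total)
    lower_idx, upper_idx = order[:n], order[-n:]
    discrimination = item[upper_idx].mean() - item[lower_idx].mean()

    # Outcome / distractor analysis: proportion of each group selecting each option.
    outcome_table = {
        opt: {"upper": float(np.mean(choices[upper_idx] == opt)),
              "lower": float(np.mean(choices[lower_idx] == opt))}
        for opt in np.unique(choices)
    }
    return p_value, discrimination, item_total_r, outcome_table
```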

The psychometricians, SMEs, and test developers reviewing this question all have smiles on their faces when they see the item analysis for this item.

[Figure: item analysis detail for the well-performing question]

Next we look at that rogue question that does not perform so well in terms of discrimination: the one we saw in the Discrimination Indices Histogram. When we look into the question, we understand why it was flagged:

  • Going from left to right, first we see that the “Number of Results” is 175, which is again a nice sample size: nothing wrong here.
  • Next we see everyone answered the question, which is good.
  • The first red flag comes from the “P Value Proportion Correct”: this question is quite difficult (only 35% of participants selected the correct answer). This is not in and of itself a bad thing, so we keep it in mind as we move on.
  • The “Item Discrimination” indicates a major problem, a negative discrimination value. This means that participants with the lowest exam scores selected the correct answer more than participants with the highest exam scores. This is not the expected pattern we are looking for: Houston, this question has a problem!
  • The “Item Total Correlation” backs up the Item Discrimination with a high negative value.
  • To find out more about what is going on we delve into the Outcome information area to see how the distracters perform. We find that the keyed-correct answer of Nampa is not showing the expected pattern of upper minus lower proportions. We do, however, find that the distracter “Boise” is showing the expected pattern of the Upper Group (86%) selecting this response option much more than the Lower Group (15%). Wait a second…I think I know what is wrong with this one: it has been mis-keyed! Someone accidentally assigned a score of 1 to Nampa rather than Boise. (A quick check like the sketch below can flag likely mis-keys automatically.)
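As promised in that last bullet, here is a small heuristic that can flag a likely mis-key: if the discrimination is negative and some other option shows the strong upper-minus-lower pattern we expect of the key, that option is the prime suspect. This builds on the hypothetical outcome table sketched earlier, and the proportions other than Boise's 86% and 15% are invented purely for illustration.

```python
def suspect_miskey(keyed_option, discrimination, outcome_table, threshold=0.30):
    """Return the option that behaves like the real key, or None if nothing stands out."""
    if discrimination >= 0:
        return None                                    # only worry about negative discrimination
    candidates = {
        opt: stats["upper"] - stats["lower"]
        for opt, stats in outcome_table.items()
        if opt != keyed_option and stats["upper"] - stats["lower"] >= threshold
    }
    return max(candidates, key=candidates.get) if candidates else None

# Illustrative numbers for this question (only Boise's 86%/15% come from the report above):
outcome_table = {"Nampa":    {"upper": 0.05, "lower": 0.55},
                 "Boise":    {"upper": 0.86, "lower": 0.15},
                 "Caldwell": {"upper": 0.05, "lower": 0.15},
                 "Meridian": {"upper": 0.04, "lower": 0.15}}
print(suspect_miskey("Nampa", discrimination=-0.5, outcome_table=outcome_table))  # -> Boise
```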

[Figure: item analysis detail for the mis-keyed question]

No problem: the administrator pulls the data into the Results Management System (RMS), changes the keyed correct answer to Boise, and presto, we now have defensible statistics that we can work with for this question.
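In code terms, that re-scoring step amounts to replacing the item's 0/1 column and recomputing total scores. A minimal sketch, reusing the hypothetical scores and choices arrays from the earlier examples:

```python
def rescore_item(scores, choices, item_index, corrected_key):
    """Re-key one item and return the updated 0/1 matrix and total scores.
    scores and choices are the hypothetical NumPy arrays used above."""
    scores = scores.copy()                             # leave the original data untouched
    scores[:, item_index] = (choices == corrected_key).astype(int)
    return scores, scores.sum(axis=1)

# e.g. scores, totals = rescore_item(scores, choices, item_index=12, corrected_key="Boise")
```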

[Figure: item analysis detail after the key is corrected to Boise]

The psychometricians, SMEs, and test developers reviewing this question had frowns on their faces at first, but those frowns were turned upside down when they realized it was just a simple mis-keyed question.

In my next blog post I would like to share some observations on the relationship between Outcome Discrimination and Outcome Correlation.

Are you ready for some light relief after pondering all these statistics? Then have some fun with our own State Capitals Quiz.