Know what your questions are about before you deliver the test

Posted by Austin Fossey

A few months ago, I had an interesting conversation with an assessment manager at an educational institution—not a Questionmark customer, mind you. Finding nothing else in common, we eventually began discussing assessment design.

At this institution (which will remain anonymous), he admitted that they are often pressed for time in their assessment development cycle. There is not enough time to do all of the item development work they need to do before their students take the assessment. To get around this, their item writers draft all of the items, conduct an editorial review, and then deliver the items. The items are assigned topics after administration, and students’ total scores and topic scores are calculated from there. He asked me if Questionmark software allows test developers to assign topics and calculate topic scores after assessing the students, and I answered truthfully that it does not.

But why not? Is there a reason test developers should not do what is being practiced at this institution? Yes, there are in fact two reasons. Get ready for some psychometric finger-wagging.

Consider what this institution is doing. The items are drafted and subjected to an editorial review, but no one ever classifies the items within a topic until after the test has been administered. Recall what people typically do during a content review prior to administration:

  • Remove items that are not relevant to the domain.
  • Ensure that the blueprint is covered.
  • Check that items are assigned to the correct topic.

If topics are not assigned until after the participants have already tested, we risk the validity of the results and the legal defensibility of the test. If we have delivered items that are not relevant to the domain, we have wasted participants’ time and will need to adjust their total score. Okay, we can manage that by telling the participants ahead of time that some of the test items might not count. But if we have not asked the correct number of questions for a given area of the blueprint, the entire assessment score will be worthless—a threat to validity known as construct underrepresentation or construct deficiency in The Standards for Educational and Psychological Testing.

For example, if we were supposed to deliver 20 items from Topic A, but find out after the fact that only 12 items have been classified as belonging to Topic A, then there is little we can do about it besides rebuilding the test form and making everyone take the test again.
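When topics are assigned before delivery, this kind of coverage check is simple to automate. Here is a minimal sketch, using a hypothetical blueprint and item list (not Questionmark data structures), that flags a draft form whose topic counts fall short of the blueprint:

```python
# Minimal sketch: check blueprint coverage before a form is delivered.
# The blueprint and item list below are hypothetical examples.
from collections import Counter

blueprint = {"Topic A": 20, "Topic B": 15, "Topic C": 10}  # required item counts

# Each draft item carries the topic assigned during content review.
form_items = (
    [{"id": i, "topic": "Topic A"} for i in range(12)]
    + [{"id": 100 + i, "topic": "Topic B"} for i in range(15)]
    + [{"id": 200 + i, "topic": "Topic C"} for i in range(10)]
)

counts = Counter(item["topic"] for item in form_items)
shortfalls = {
    topic: required - counts.get(topic, 0)
    for topic, required in blueprint.items()
    if counts.get(topic, 0) < required
}

if shortfalls:
    # e.g. {'Topic A': 8} -- the form underrepresents Topic A and should not ship.
    print("Blueprint not covered:", shortfalls)
else:
    print("Form meets the blueprint.")
```

Running a check like this before administration catches the 20-versus-12 problem while it is still cheap to fix; after administration, the only remedies are the unattractive ones described above.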

The Standards provide helpful guidance in these matters. For this particular case, the Standards point out that:

“The test developer is responsible for documenting that the items selected for the test meet the requirements of the test specifications. In particular, the set of items selected for a new test form . . . must meet both content and psychometric specifications.” (p. 82)

Publications describing best practices for test development also specify that the content must be determined before delivering an operational form. For example, in their chapter in Educational Measurement (4th Edition), Cynthia Schmeiser and Catherine Welch note the importance of conducting a content review of items before field testing, as well as a final content review of a draft test form before it becomes operational.

In Introduction to Classical and Modern Test Theory, Linda Crocker and James Algina also made an interesting observation about classroom assessments, noting that students expect to be graded on all of the items they have been asked to answer. Even if notified in advance that some items might not be counted (as one might do in field testing), students might not consider it fair that their score is based on a yet-to-be-determined subset of items that may not fully represent the content that is supposed to be covered.

This is why Questionmark’s software is designed the way it is. When creating an item, item writers must assign an item to a topic, and items can be classified or labeled along other dimensions (e.g., cognitive process) using metatags. Even if an assessment program cannot muster any further content review, at least the item writer has classified items by content area. The person building the test form then has the information they need to make sure that the right questions get asked.

We have a responsibility as test developers to treat our participants fairly and ethically. If we are asking them to spend their time taking a test, then we owe them the most useful measurement that we can provide. Participants trust that we know what we are doing. If we postpone critical, basic development tasks like content identification until after participants have already given us their time, we are taking advantage of that trust.

Psychometrics 101: Item Total Correlation


Posted by Greg Pope

I’ll be talking about a subject dear to my heart — psychometrics — at the Questionmark Users Conference April 5-8. Here’s a sneak preview of one of my topics: item-total correlation! What is it, and what does it mean?

The item-total correlation is a correlation between the question score (e.g., 0 or 1 for multiple choice) and the overall assessment score (e.g., 67%). The expectation is that participants who get a question correct should, in general, have higher overall assessment scores than participants who get the question wrong. The same applies to essay-type questions: if a question is scored between 0 and 5, participants who did a really good job on the essay (got a 4 or 5) should have higher overall assessment scores (maybe 85-90%). This relationship is shown in an example graph below.

[Example graph: question scores plotted against overall assessment scores]
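As a concrete illustration, here is a minimal sketch of how an item-total correlation can be computed, using a small made-up response matrix (not real assessment data). For a question scored 0/1, the Pearson correlation between the question scores and the total scores is the point-biserial correlation:

```python
# Minimal sketch of item-total (point-biserial) correlations.
# Rows are participants, columns are questions scored 0 or 1 (made-up data).
import numpy as np

responses = np.array([
    [1, 1, 1, 0, 1],
    [1, 1, 0, 1, 1],
    [1, 0, 1, 0, 0],
    [0, 1, 0, 0, 1],
    [0, 0, 0, 1, 0],
])

total_scores = responses.sum(axis=1)  # overall assessment score per participant

for item in range(responses.shape[1]):
    item_scores = responses[:, item]
    # Pearson correlation between the 0/1 question score and the total score.
    r = np.corrcoef(item_scores, total_scores)[0, 1]
    print(f"Question {item + 1}: item-total correlation = {r:.2f}")
```

This simple version correlates each question with the total score including that question; some analyses use a "corrected" version that excludes the question from the total before correlating.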

In psychometrics this relationship is called "discrimination," referring to how well a question differentiates between participants who know the material and those who do not. Participants who have mastered the material should get high scores on questions and high overall assessment scores; participants who have not should get low question scores and lower overall assessment scores. This is the relationship that an item-total correlation captures, and it helps us evaluate the performance of questions. We want lots of highly discriminating questions on our tests because they are the most finely tuned measurements of what participants know and can do.

When looking at an item-total correlation, negative values are generally a major red flag: it is unexpected for participants who get low scores on a question to get high scores on the assessment. This could indicate a mis-keyed question, or a question that was highly ambiguous and confusing to participants. Values for an item-total correlation (point-biserial) between 0 and 0.19 may indicate that the question is not discriminating well, values between 0.2 and 0.39 indicate good discrimination, and values of 0.4 and above indicate very good discrimination.
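As a quick illustration of those rule-of-thumb bands, here is a small sketch that labels a correlation value using the cut points described above (they are heuristics from this post, not fixed standards):

```python
# Minimal sketch applying the rule-of-thumb discrimination bands described above.
def describe_discrimination(r: float) -> str:
    if r < 0:
        return "red flag: possible mis-key or ambiguous question"
    if r < 0.2:
        return "not discriminating well"
    if r < 0.4:
        return "good discrimination"
    return "very good discrimination"

for r in (-0.15, 0.10, 0.28, 0.52):
    print(f"{r:+.2f}: {describe_discrimination(r)}")
```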