When to weight items differently in CTT

Posted by Austin Fossey

In my last post, I explained the statistical futility and interpretive quagmires that result from using negative item scores in Classical Test Theory (CTT) frameworks. In this post, I wanted to address another question I get from a lot of customers: when can we make one item worth more points?

This question has come up in a couple of cases. One customer wanted to make “hard” items on the assessment worth more points (with difficulty being determined by subject-matter experts). Another customer wanted to make certain item types worth more points across the whole assessment. In both cases, I suggested they weight all of the items equally.

Interested in learning more about classical test theory and applying item analysis concepts? Join psychometrician Austin Fossey for a free 75-minute online workshop, Item Analysis: Concepts and Practice, on Tuesday, June 23, 2015. Space is limited.

Before I reveal the rationale behind the recommendation, permit me a moment of finger-wagging. Both questions arose because the test developers felt that some items were better indicators of the construct, so those items seemed like more important pieces of evidence than others. If we frame the conversation as a question of relative importance, the answer becomes clear: the test blueprint should document the importance of each area of domain content, as well as how the assessment is structured to reflect those evaluations. If the blueprint cannot answer these questions, it may need to be revised. Okay, wagging finger back in its holster.

In general, weights should be applied at the subscore level corresponding to the content or process areas on the blueprint. The simplest way to achieve this structure is to let item counts carry the weights. For example, if Topic A is supposed to be 60% of the assessment score and Topic B is supposed to be 40%, it might be best to ask 60 questions about Topic A and 40 questions about Topic B, all scored dichotomously [0,1].

There are times when this is not possible. Some item formats are scored differently or are too complex to deliver in bulk. For example, if Topic B is best assessed with long-format essay items, it might be necessary to have 60 selected-response items in Topic A and four essays in Topic B, each essay worth ten points and scored with a rubric.

[Figure: Example of a simple blueprint where items are worth more points due to their topic's relative importance (weight)]

The critical point is that the content areas (e.g., Topics) are driving the weighting, and all items within the content area are weighted the same. Thus, an item is not worth more because it is hard or because it is a certain format; it is worth more because it is in a topic that has fewer items, and all items within the topic are weighted more because of the topic’s relative importance on the test blueprint.
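As a minimal sketch of the arithmetic (with hypothetical topic names and counts, not any Questionmark feature), notice how the topic weights fall out of the item counts and the per-topic point values, while every item within a topic is worth the same:

```python
# Hypothetical mixed-format blueprint: weights come from the topic,
# and all items within a topic carry identical point values.
topics = {
    "Topic A": {"n_items": 60, "points_per_item": 1},   # selected response
    "Topic B": {"n_items": 4,  "points_per_item": 10},  # rubric-scored essays
}

# Points contributed by each topic, and the resulting topic weights.
topic_points = {t: v["n_items"] * v["points_per_item"] for t, v in topics.items()}
total_points = sum(topic_points.values())
topic_weights = {t: pts / total_points for t, pts in topic_points.items()}
# Topic A contributes 60 of 100 points (60%); Topic B contributes 40 (40%),
# even though an individual essay is worth more than an individual
# selected-response item.
```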

One final word of caution. If you do choose to weight certain dichotomous items differently, regardless of your rationale, remember that differential weighting can bias the item-total correlation discrimination statistic. In these cases, it is best to use the item-rest correlation discrimination statistic instead, which is provided in Questionmark’s Item Analysis Report.

