When to Give Partial Credit for Multiple-Response Items

Posted by Austin Fossey

Three different customers recently asked me how to decide whether to score a multiple-response (MR) item dichotomously or polytomously; that is, when should an MR item be scored right/wrong, and when should we give partial credit? I gave some rambling answers, so the challenge today is to explain this in a single blog post that I can share the next time it comes up.

In their chapter on multiple-choice and matching exercises in Educational Assessment of Students (5th ed.), Anthony Nitko and Susan Brookhart explain that matching items (which we may extend to include MR item formats, drag-and-drop formats, survey-matrix formats, etc.) are often a collection of single-response multiple-choice (MC) items. The advantage of the MR format is that it saves space and lets you leverage dependencies among the responses (e.g., relationships between response options) that might be redundant if broken into separate MC items.

Given that an MR item is often a set of individually scored MC items, a polytomously scored format almost always makes sense. From an interpretation standpoint, there are a couple of advantages for you as a test developer or instructor. First, you can differentiate between participants who know some of the answers and those who know none of the answers, which can improve item discrimination. Second, you have more flexibility in how you choose to score and interpret the responses. In the drag-and-drop example below (a special form of an MR item), the participant has all of the dates wrong; however, the instructor may still be interested in knowing that the participant knows the correct order of events for the Stamp Act, the Townshend Act, and the Boston Massacre.
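As a minimal Python sketch of the two scoring rules for a standard MR item (the choice labels, answer key, and function name here are hypothetical, not taken from any Questionmark item):

```python
def score_mr(selected, key, partial_credit=True):
    """Score a multiple-response item.

    selected: set of choices the participant marked
    key: set of correct choices
    With partial_credit=True, each correct selection earns one point
    and incorrect selections earn nothing (polytomous scoring).
    With partial_credit=False, the item is all-or-nothing (dichotomous).
    """
    if partial_credit:
        # One point per correct choice selected.
        return len(selected & key)
    # Full credit only for exactly the right set of choices.
    return 1 if selected == key else 0

# Hypothetical three-key item: the participant knows two of the three answers.
key = {"A", "C", "D"}
response = {"A", "C"}
print(score_mr(response, key))                        # partial credit: 2
print(score_mr(response, key, partial_credit=False))  # right/wrong: 0
```

Under partial-credit scoring, this participant is distinguished from one who selected nothing correctly; under dichotomous scoring, both earn zero.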


Example of a drag-and-drop item in Questionmark where the participant’s responses are wrong, but the order of responses is partially correct.

Are there exceptions? You know there are. This is why it is important to have a test blueprint document, which can help clarify which item formats to use and how they should be evaluated. Consider the following two variations of a learning objective on a hypothetical CPR test blueprint:

  • The participant can recall the actions that must be taken for an unresponsive victim requiring CPR.
  • The participant can recall all three actions that must be taken for an unresponsive victim requiring CPR.

The second example is likely the one that the test developer would use for the test blueprint. Why? Because knowing only two of the three actions is not going to cut it. This is a rare all-or-nothing scenario where knowing some of the answers is essentially the same (from a qualifications standpoint) as knowing none of the answers. The language in this learning objective (“recall all three actions”) is an indicator to the test developer that if they use an MR item to assess this learning objective, they should score it dichotomously (no partial credit). The example below shows how one might design an item for this hypothetical learning objective with Questionmark’s authoring tools:


Example of a Questionmark authoring screen for MR item that is scored dichotomously (right/wrong).

To summarize, a test blueprint document is the best way to decide if an MR item (or variant) should be scored dichotomously or polytomously. If you do not have a test blueprint, think critically about what you are trying to measure and the interpretations you want reflected in the item score. Partial-credit scoring is desirable in most use cases, though there are occasional scenarios where an all-or-nothing scoring approach is needed—in which case the item can be scored strictly right/wrong. Finally, do not forget that you can score MR items differently within an assessment. Some MR items can be scored polytomously and others can be scored dichotomously on the same test, though it may be beneficial to notify participants when scoring rules differ for items that use the same format.

If you are interested in understanding and applying some basic principles of item development and enhancing the quality of your results, download the free white paper written by Austin: Managing Item Development for Large-Scale Assessment

Simpson’s Paradox and the Steelyard Graph

Posted by Austin Fossey

If you work with assessment statistics or just about any branch of social science, you may be familiar with Simpson’s paradox—the idea that data trends between subgroups change or disappear when the subgroups are aggregated. There are hundreds of examples of Simpson’s paradox (and I encourage you to search some on the internet for kicks), but here is a simple example for the sake of illustration.

Simpson’s Paradox Example

Let us say that I am looking to get trained as a certified window washer so that I can wash windows on Boston’s skyscrapers. Two schools in my area offer training, and both had 300 students graduate last year. Graduates from School A had an average certification test score of 70.7%, and graduates from School B had an average score of 69.0%. Ignoring for the moment whether these differences are significant, as a student I will likely choose School A due to its higher average test scores.

But here is where the paradox happens. Consider now that I have a crippling fear of heights, which may be a hindrance for my window-washing aspirations. It turns out that School A and School B also track test scores for their graduates based on whether or not they have a fear of heights. The table below reports the average scores for these phobic subgroups.

Notice anything? The average scores for students both with and without a fear of heights are higher in School B than for the same subgroups in School A. The paradox is that School A has the higher overall average test score, yet School B can boast better average scores both for students with a fear of heights and for students without one. School B’s overall average is lower simply because it had more students with a fear of heights. If we want to test the significance of these differences, we can do so with ANOVA.
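The arithmetic is easy to reproduce. The subgroup sizes and means below are hypothetical values chosen to be consistent with the overall averages stated earlier (70.7% for School A, 69.0% for School B); any split with this structure exhibits the paradox:

```python
# Hypothetical subgroup data as (n_students, mean_score_%); the weighted
# totals match the overall averages quoted above, but the split is invented.
school_a = {"fear": (100, 60.0), "no_fear": (200, 76.05)}
school_b = {"fear": (200, 62.0), "no_fear": (100, 83.0)}

def overall_mean(school):
    # Weighted average of the subgroup means.
    total_points = sum(n * mean for n, mean in school.values())
    return total_points / sum(n for n, _ in school.values())

# School B wins within every subgroup...
for group in ("fear", "no_fear"):
    assert school_b[group][1] > school_a[group][1]

# ...yet School A wins on the aggregate: Simpson's paradox.
print(round(overall_mean(school_a), 1))  # 70.7
print(round(overall_mean(school_b), 1))  # 69.0
```

School B's larger, lower-scoring "fear" subgroup drags its overall mean down, even though both of its subgroups outperform their School A counterparts.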

Gaviria and González-Barbera’s Steelyard Graph

Simpson’s paradox occurs in many different fields, but it is sometimes difficult to explain to stakeholders. Tables (like the one above) are often used to illustrate the subgroup differences, but in the Fall 2014 issue of Educational Measurement, José-Luis Gaviria and Coral González-Barbera from the Universidad Complutense de Madrid won the publication’s data visualization contest with their Steelyard Graph, which illustrates Simpson’s paradox with a graph resembling a steelyard balance. The publication’s visual editor, ETS’s Katherine Furgol Castellano, wrote the discussion piece for the Steelyard Graph, praising Gaviria and González-Barbera for the simplicity of the approach and the novel yet astute strategy of representing averages with balanced levers.

The figure below illustrates the same data from the table above using Gaviria and González-Barbera’s Steelyard Graph approach. The size of the squares corresponds to the number of students, the location on the lever indicates the average subgroup score, and the triangular fulcrum represents the school’s overall average score. Notice how clear it is that the subgroups in School B have higher average scores than their counterparts in School A. The example below has only two subgroups, but the same approach can be used for more subgroups.
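For readers who want to experiment with the idea, here is a rough matplotlib sketch of a steelyard-style figure using the same hypothetical subgroup values as before (square size proportional to subgroup count, position at the subgroup mean, fulcrum triangle at the overall mean). It approximates the concept only; it is not Gaviria and González-Barbera's actual figure code:

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen
import matplotlib.pyplot as plt

# Hypothetical (n_students, mean_%) subgroup data consistent with the example.
schools = {
    "School A": {"fear of heights": (100, 60.0), "no fear": (200, 76.05)},
    "School B": {"fear of heights": (200, 62.0), "no fear": (100, 83.0)},
}

fig, axes = plt.subplots(len(schools), 1, sharex=True, figsize=(8, 4))
for ax, (name, groups) in zip(axes, schools.items()):
    # Fulcrum = overall (weighted) average score.
    fulcrum = (sum(n * m for n, m in groups.values())
               / sum(n for n, _ in groups.values()))
    ax.axhline(0, color="black")  # the lever
    for label, (n, mean) in groups.items():
        # Marker area scales with subgroup size.
        ax.scatter(mean, 0, s=n, marker="s", zorder=3, label=f"{label} (n={n})")
    ax.scatter(fulcrum, 0, marker="^", s=120, color="red", zorder=3,
               label=f"overall mean = {fulcrum:.1f}")
    ax.set_title(name)
    ax.set_yticks([])
    ax.legend(loc="upper left", fontsize=8)
axes[-1].set_xlabel("Average test score (%)")
fig.tight_layout()
fig.savefig("steelyard.png")
```

Stacking the two levers on a shared x-axis makes it immediately visible that both of School B's squares sit to the right of School A's, while its fulcrum sits to the left.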


Example of Gaviria and González-Barbera’s Steelyard Graph to visualize Simpson’s paradox for subgroups’ average test scores.

Making a Decision when Faced with Simpson’s Paradox

When one encounters Simpson’s paradox, decision-making can be difficult, especially if there are no theories to explain why the relational pattern is different at a subgroup level. This is why exploratory analysis often must be driven by and interpreted through a lens of theory. One could come up with arbitrary subgroups that reverse the aggregate relationships, even though there is no theoretical grounding for doing so. On the other hand, relevant subgroups may remain unidentified by researchers, though the aggregate relationship may still be sufficient for decision-making.

For example, as a window-washing student seeing the phobic subgroups’ performances, I might decide that School B is the superior school for teaching the trade, regardless of which subgroup a student belongs to. This decision is based on a theory that a fear of heights may impact performance on the certification assessment, in which case School B does a better job at preparing both subgroups for their assessments. If that theory is not tenable, it may be that School A is really the better choice, but as an acrophobic would-be window washer, I will likely choose School B after seeing this graph . . . as long as the classroom is located on the ground floor.

When to weight items differently in CTT

Posted by Austin Fossey

In my last post, I explained the statistical futility and interpretive quagmires that result from using negative item scores in Classical Test Theory (CTT) frameworks. In this post, I wanted to address another question I get from a lot of customers: when can we make one item worth more points?

This question has come up in a couple of cases. One customer wanted to make “hard” items on the assessment worth more points (with difficulty being determined by subject-matter experts). Another customer wanted to make certain item types worth more points across the whole assessment. In both cases, I suggested they weight all of the items equally.

Interested in learning more about classical test theory and applying item analysis concepts? Join Psychometrician Austin Fossey for a free 75 minute online workshop — Item Analysis: Concepts and Practice — Tuesday, June 23, 2015 *space is limited

Before I reveal the rationale behind the recommendation, please permit me a moment of finger-wagging. The impetus behind these questions was that these test developers felt some items were somehow better indicators of the construct, and thus certain items seemed like more important pieces of evidence than others. If we frame the conversation as a question of relative importance, we recognize that the test blueprint document should contain all of the information about the importance of domain content, as well as how the assessment should be structured to reflect those evaluations. If the blueprint cannot answer these questions, then it may need to be modified. Okay, wagging finger back in its holster.

In general, weights should be applied at a subscore level that corresponds to the content or process areas on the blueprint. A straightforward way to achieve this structure is to make the number of items in each area proportional to its weight. For example, if Topic A is supposed to be 60% of the assessment score and Topic B is supposed to be 40%, it might be best to ask 60 questions about Topic A and 40 questions about Topic B, all scored dichotomously [0,1].

There are times when this is not possible. Certain item formats may be scored differently or be too complex to deliver in bulk. For example, if Topic B is best assessed with long-format essay items, it might be necessary to have 60 selected response items in Topic A and four essays in Topic B—each worth ten points and scored on a rubric.

Example of a simple blueprint where items are worth more points due to their topic’s relative importance (weight)

The critical point is that the content areas (e.g., Topics) are driving the weighting, and all items within the content area are weighted the same. Thus, an item is not worth more because it is hard or because it is a certain format; it is worth more because it is in a topic that has fewer items, and all items within the topic are weighted more because of the topic’s relative importance on the test blueprint.
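To make the blueprint arithmetic concrete, here is a small sketch of the essay-item structure described above (the dictionary layout and variable names are mine, not a Questionmark format):

```python
# Blueprint: Topic A carries 60% of the score, Topic B carries 40%.
# Topic A: 60 selected-response items scored [0,1].
# Topic B: 4 essay items, each scored 0-10 on a rubric.
blueprint = {
    "Topic A": {"n_items": 60, "max_points_per_item": 1},
    "Topic B": {"n_items": 4, "max_points_per_item": 10},
}

total = sum(t["n_items"] * t["max_points_per_item"] for t in blueprint.values())
for name, t in blueprint.items():
    topic_max = t["n_items"] * t["max_points_per_item"]
    print(f"{name}: {topic_max} points ({topic_max / total:.0%} of the assessment)")
```

Every item within a topic carries the same weight; the essays are worth ten points each only so that Topic B's four items can still contribute the 40% the blueprint demands.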

One final word of caution. If you do choose to weight certain dichotomous items differently, regardless of your rationale, remember that it may bias the item-total correlation discrimination. In these cases, it is best to use the item-rest correlation discrimination statistic, which is provided in Questionmark’s Item Analysis Report.
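To see why a heavily weighted item inflates the item-total correlation, here is a hedged numpy sketch (the response matrix is simulated, and `item_total`/`item_rest` are my own helper names, not Questionmark functions):

```python
import numpy as np

rng = np.random.default_rng(0)
n_people, n_items = 500, 20
ability = rng.normal(size=n_people)

# Simulate dichotomous responses: higher ability -> higher chance of success.
p = 1 / (1 + np.exp(-(ability[:, None] - rng.normal(size=n_items))))
responses = (rng.random((n_people, n_items)) < p).astype(float)

# Weight item 0 by a factor of 10 when forming the total score.
weights = np.ones(n_items)
weights[0] = 10
totals = responses @ weights

def item_total(i):
    # Correlation of item i with the weighted total (includes item i itself).
    return np.corrcoef(responses[:, i], totals)[0, 1]

def item_rest(i):
    # Correlation of item i with the total excluding item i's own points.
    return np.corrcoef(responses[:, i], totals - weights[i] * responses[:, i])[0, 1]

# The weighted item's item-total correlation is inflated by its own large
# contribution to the total; the item-rest correlation removes that bias.
print(item_total(0) > item_rest(0))  # True
```

The heavier the weight, the more the item correlates with a total score that it largely constitutes, which is exactly why the item-rest statistic is preferable here.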


An argument against using negative item scores in CTT

Posted by Austin Fossey

Last year, a client asked for my opinion about whether or not to use negative scores on test items. For example, if a participant answers an item correctly, they would get one point, but if they answer the item incorrectly, they would lose one point. This means the item would be scored dichotomously [-1,1] instead of in the more traditional way [0,1].

I believe that negative item scores are really useful if the goal is to confuse and mislead participants. They are not appropriate for most classical test theory (CTT) assessment designs, because they do not add measurement value, and they are difficult to interpret.


Measurement value of negative item scores

Changing the item scoring format from [0,1] to [-1,1] does not change anything about your ability to measure participants—after all, the dichotomous scores are just symbols. You are simply using a different total score scale.

Consider a 60-item assessment made up of dichotomously scored items. If the items are scored [0,1], the total score scale ranges from 0 to 60 points. If scored [-1,1], the score range doubles, now ranging from -60 to 60 points.

From a statistical standpoint, nothing has changed. The item-total discrimination statistics will be the same under both designs, as will the assessment’s reliability. The standard error of measurement will double, but that is to be expected because the score range has doubled. Thus there is no change in the precision of scores or misclassification rates. How you score the items does not matter as long as they are scored dichotomously on the same scale.
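A quick simulation illustrates this invariance. This is a hedged sketch with randomly generated dichotomous responses, not the WinGen data shown in the figure:

```python
import numpy as np

rng = np.random.default_rng(42)
n_people, n_items = 1000, 60
ability = rng.normal(size=n_people)
p = 1 / (1 + np.exp(-(ability[:, None] - rng.normal(size=n_items))))
x = (rng.random((n_people, n_items)) < p).astype(float)  # scored [0,1]

y = 2 * x - 1  # the same responses rescored [-1,1]

# Item-total correlations are identical under both scorings...
r_x = np.corrcoef(x[:, 0], x.sum(axis=1))[0, 1]
r_y = np.corrcoef(y[:, 0], y.sum(axis=1))[0, 1]
assert np.isclose(r_x, r_y)

# ...and the total-score standard deviation (and hence the SEM, since
# reliability is unchanged) simply doubles with the doubled score range.
assert np.isclose(y.sum(axis=1).std(), 2 * x.sum(axis=1).std())
```

Because [-1,1] is just a linear transformation of [0,1] (y = 2x - 1), every correlation-based statistic is untouched and every scale-based statistic stretches with the score range.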

The figure below illustrates the score distributions for 1,000 normally distributed assessment scores that were simulated using WinGen. This sample’s item responses have been scored with three different models: [-1,1], [0,1], and [0,2]. While this shifts and stretches the distribution of scores on to different scales, there is no change in reliability or the standard error of measurement (as a percentage of the score range).

Distribution and assessment statistics for 1,000 simulated test scores with items dichotomously scored three ways: [-1,1], [0,1], and [0,2]

Interpretation issues of negative item scores

If the item scores do not make a difference statistically, and they are just symbols, then why not use negative scores? Remember that an item is a mechanism for collecting and quantifying evidence to support the student model, so how we score our items (and the assessment as a whole) plays a big role in how people interpret the participant’s performance.

Consider an item scored [0,1]. In a CTT model, a score of 1 represents accumulated evidence about the presence or magnitude of a construct, whereas a score of 0 suggests that no evidence was found in the response to this item.

Now suppose we took the same item and scored it [-1,1]. A score of 1 still suggests accumulated evidence, but now we are also changing the total score based on wrong answers. The interpretation is that we have collected evidence about the absence of the construct. To put it another way, the test designer is claiming to have positive evidence that the participant does not know something.

This is not an easy claim to make. In psychometrics, we can attempt to measure the presence of a hypothetical construct, but it is difficult to make a claim that a construct is not there. We can only make inferences about what we observe, and I argue that it is very difficult to build an evidentiary model for someone not knowing something.

Furthermore, negative scores negate evidence we have collected in other items. If a participant gets one item right and earns a point but then loses that point on the next item, we have essentially canceled out the information about the participant from a total score perspective. By using negative scores in a CTT model, we also introduce the possibility that someone can get a negative score on the whole test, but what would a negative score mean? This lack of interpretability is one major reason people do not use negative scores.

Consider a participant who answers 40 items correctly on the 60-item assessment I mentioned earlier. When scored [0,1], the raw score (40 points) corresponds to the number of correct responses provided by the participant. This scale is useful for calculating percentage scores (40/60 = 67% correct), setting cut scores, and supporting the interpretation of the participant’s performance.

When the same items are scored [-1,1], the participant’s score is more difficult to interpret. The participant answered 40 questions correctly but receives a score of only 20 (40 points earned minus 20 points lost). The maximum score on the assessment is 60 points, yet the raw score of 20 corresponds to a correct response rate of 67%, not 33%, because 20 points sits 67% of the way along the range from -60 to 60 points.
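The two scales are related by a simple linear map, which is easy to verify (the function names are mine, for illustration only):

```python
def raw_score(n_correct, n_items, negative=False):
    """Total raw score for n_correct right answers on n_items dichotomous items."""
    if negative:
        # [-1,1] scoring: +1 per correct answer, -1 per incorrect answer.
        return n_correct - (n_items - n_correct)
    # [0,1] scoring: one point per correct answer.
    return n_correct

def percent_correct(raw, n_items, negative=False):
    """Recover the percent-correct rate from a raw score."""
    if negative:
        # Position of the raw score within the -n_items .. +n_items range.
        return (raw + n_items) / (2 * n_items)
    return raw / n_items

n_items, n_correct = 60, 40
traditional = raw_score(n_correct, n_items)              # 40
negative = raw_score(n_correct, n_items, negative=True)  # 40 - 20 = 20
print(percent_correct(traditional, n_items))             # 0.666...
print(percent_correct(negative, n_items, negative=True)) # 0.666...
```

Both scales encode exactly the same performance, but only the [0,1] raw score of 40 can be read directly as "40 correct answers."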

There are times when items need to be scored differently from other items on the assessment. Polytomous items clearly need different scoring models (though similar interpretive arguments could be leveled against people who try to score items in fractions of points), and there are times when an item may need to be weighted differently from other items. (We’ll discuss that in my next post.) Some assessments, such as the former SAT, have used formula scoring with negative points to correct for guessing, but this should only be done if you can demonstrate improved model fit and have a theory and evidence to justify doing so. In general, when using CTT, negative item scores only serve to muddy the water.


Is There Value in Reporting Subscores?

Posted by Austin Fossey

The decision to report subscores (reported as Topic Scores in Questionmark’s software) can be a difficult one, and test developers often need to respond to demands from stakeholders who want to squeeze as much information out of an instrument as they can. High-stakes test development is lengthy and costly, and the instruments themselves consume and collect a lot of data that can be valuable for instruction or business decisions. It makes sense that stakeholders want to get as much mileage as they can out of the instrument.

It can be anticlimactic when all of the development work results in just one score or a simple pass/fail decision. But that is, after all, what many instruments are designed to do. Many assessment models assume unidimensionality, so a single score or classification representing the participant’s ability is absolutely appropriate. Nevertheless, organizations often find themselves in the position of trying to wring out more information. What are my participants’ strengths and weaknesses? How effective were my instructors? There are many ways in which people will try to repurpose an assessment.

The question of whether or not to report subscores certainly falls under this category. Test blueprints often organize the instrument around content areas (e.g., Topics), and these lend themselves well to calculating subscores for each of the content areas. From a test user perspective, these scores are easy to interpret, and they are considered valuable because they show content areas where participants perform well or poorly, and because it is believed that this information can help inform instruction.

But how useful are these subscores? In their article, A Simple Equation to Predict a Subscore’s Value, Richard Feinberg and Howard Wainer explain that there are two criteria that must be met to justify reporting a subscore:

  • The subscore must be reliable.
  • The subscore must contain information that is sufficiently different from the information that is contained by the assessment’s total score.

If a subscore (or any score) is not reliable, there is no value in reporting it. The subscore will lack precision, and any decisions made on an unreliable score might not be valid. There is also little value if the subscore does not provide any new information. If the subscores are effectively redundant to the total score, then there is no need to report them. The flip side of the problem is that if subscores do not correlate with the total score, then the assessment may not be unidimensional, and then it may not make sense to report the total score. These are the problems that test developers wrestle with when they lie awake at night.
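Both criteria can be checked directly from response data. The sketch below computes a subscore's Cronbach's alpha and its correlation with the total score on simulated data; it illustrates the two criteria only and is not Feinberg and Wainer's VAR equation:

```python
import numpy as np

def cronbach_alpha(items):
    """Cronbach's alpha for an (n_people, n_items) score matrix."""
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1).sum()
    total_var = items.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - item_vars / total_var)

rng = np.random.default_rng(1)
n_people = 400
ability = rng.normal(size=n_people)
# Simulated dichotomous responses; the first 10 items form the "topic".
p = 1 / (1 + np.exp(-(ability[:, None] - rng.normal(size=30))))
responses = (rng.random((n_people, 30)) < p).astype(float)

subscore = responses[:, :10].sum(axis=1)
total = responses.sum(axis=1)

print(f"subscore reliability (alpha): {cronbach_alpha(responses[:, :10]):.2f}")
print(f"correlation with total score: {np.corrcoef(subscore, total)[0, 1]:.2f}")
# A reliable subscore that correlates near 1.0 with the total score is
# redundant; an unreliable subscore is not worth reporting at all.
```

A reportable subscore needs to thread the needle: high enough reliability to be trusted, but a correlation with the total score low enough that it adds distinct information.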

Excerpt from Questionmark’s Test Analysis Report showing low reliability of three topic scores.

As you might have guessed from the title of their article, Feinberg and Wainer propose a simple, empirically based equation for determining whether or not a subscore should be reported. The equation yields a value that Sandip Sinharay and Shelby Haberman called the Value Added Ratio (VAR). If a subscore has a VAR value greater than one, they suggest that this justifies reporting it; subscores with VAR values less than one should not be reported. I encourage interested readers to check out Feinberg and Wainer’s article (which is less than two pages, so you can handle it) for the formula and step-by-step instructions for its application.


Item Development – Summary and Conclusions

Posted by Austin Fossey

This post concludes my series on item development in large-scale assessment. I’ve discussed some key processes in developing items, including drafting items, reviewing items, editing items, and conducting an item analysis. The goal of this process is to fine-tune a set of items so that test developers have an item pool from which they can build forms for scored assessment while being confident about the quality, reliability, and validity of the items. While the series covered a variety of topics, there are a couple of key themes that were relevant to almost every step.

First, documentation is critical, and even though it seems like extra work, it does pay off. Documenting your item development process helps keep things organized and helps you reproduce processes should you need to conduct development again. Documentation is also important for organization and accountability. As noted in the posts about content review and bias review, checklists can help ensure that committee members consider a minimal set of criteria for every item, but they also provide you with documentation of each committee member’s ratings should the item ever be challenged. All of this documentation can be thought of as validity evidence—it helps support your claims about the results and refute rebuttals about possible flaws in the assessment’s content.

The other key theme is the importance of recruiting qualified and representative subject matter experts (SMEs). SMEs should be qualified to participate in their assigned task, but diversity is also an important consideration. You may want to select item writers with a variety of experience levels, or content experts who have different backgrounds. Your bias review committee should be made up of experts who can help identify both content and response bias across the demographic areas that are pertinent to your population. Where possible, it is best to keep your SME groups independent so that you do not have the same people responsible for different parts of the development cycle. As always, be sure to document the relevant demographics and qualifications of your SMEs, even if you need to keep their identities anonymous.

This series is an introduction to organizing an item development cycle, but I encourage readers to refer to the resources mentioned in the articles for more information. This series also served as the basis for a session at the 2015 Questionmark Users Conference, which Questionmark customers can watch in the Premium section of the Learning Café.

You can link back to all of the posts in this series by clicking on the links below, and if you have any questions, please comment below!

Item Development – Managing the Process for Large-Scale Assessments

Item Development – Training Item Writers

Item Development – Five Tips for Organizing Your Drafting Process

Item Development – Benefits of editing items before the review process

Item Development – Organizing a content review committee (Part 1)

Item Development – Organizing a content review committee (Part 2)

Item Development – Organizing a bias review committee (Part 1)

Item Development – Organizing a bias review committee (Part 2)

Item Development – Conducting the final editorial review

Item Development – Planning your field test study

Item Development – Psychometric review