# Item Development Tips For Defensible Assessments

Whether you work with low-stakes assessments, small-scale classroom assessments or large-scale, high-stakes assessment, understanding and applying some basic principles of item development will greatly enhance the quality of your results.

What began as a popular 11-part blog series has morphed into a white paper: Managing Item Development for Large-Scale Assessment, which offers sound advice on how-to organize and execute item development steps that will help you create defensible assessments. These steps include:   You can download your copy of the complimentary white paper here: Managing Item Development for Large-Scale Assessment

# G Theory and Reliability for Assessments with Randomly Selected Items

Posted by Austin Fossey

One of our webinar attendees recently emailed me to ask if there is a way to calculate reliability when items are randomly selected for delivery in a classical test theory (CTT) model.

As with so many things, the answer comes from Lee Cronbach—but it’s not Cronbach’s Alpha. In 1963, Cronbach, along with Goldine Gleser and Nageswari Rajaratnam, published a paper on generalizability theory, which is often called G theory for brevity or to sound cooler. G theory is a very powerful set of tools, but today I am focusing on one aspect of it: the generalizability coefficient, which describes the degree to which observed scores might generalize to a broader set of measurement conditions. This is helpful when the conditions of measurement will change for different participants, as is the case when we use different items, different raters, different administration dates, etc.

In G theory, measurement conditions are called facets. A facet might include items, test forms, administration occasions, or human raters. Facets can be random (i.e., they are a sample of a much larger population of potential facets), or they might be fixed, such as a condition that is controlled by the researcher. The hypothetical set of conditions across all possible facets is called, quite grandly, the universe of generalization. A participant’s average measurement across the universe of generalization is called their universe score, which is similar to a true score in CTT, except that we no longer need to assume that all measurements in the universe of generalizability are parallel.

In CTT, the concept of reliability is defined as the ratio of true score variance to observed score variance. Observed scores are just true scores plus measurement error, so as measurement error decreases, reliability increases toward 1.00.

The generalizability coefficient is defined as the ratio of universe score variance to expected score variance, which is similar to the concept of reliability in CTT. The generalizability coefficient is made of variance components, which differ depending on the design of the study, and which can be derived from an analysis of variance (ANOVA) summary table. We will not get into the math here, but I recommend Linda Crocker and James Algina’s Introduction to Classical and Modern Test Theory for a great introduction and easy-to-follow examples of how to calculate generalizability coefficients under multiple conditions. For now, let’s return to our randomly selected items.

In his chapter in Educational Measurement, 4th Edition, Edward Haertel illustrated the overlaps between G theory and CTT reliability measures. When all participants see the same items, the generalizability coefficient is made up of the variance components for the participants and for the residual scores, and it yields the exact same value as Cronbach’s Alpha. If the researcher wants to use the generalizability coefficient to generalize to an assessment with more or fewer items, then the result is the same as the Spearman-Brown formula.

But when our participants are each given a random set of items, they are no longer receiving parallel assessments. The generalizability coefficient has to be modified to include a variance component for the items, and the observed score variance is now a function of three things:

• Error variance.
• Variance in the item mean scores.
• Variance in the participants’ universe scores.

Note that error variance is not the same as measurement error in CTT. In the case of a randomly generated assessment, the error variance includes measurement error and an extra component that reflects the lack of perfect correlation between the items’ measurements.

For those of you randomly selecting items, this makes a difference! Cronbach’s Alpha may yield low or even meaningless results when items are randomly selected (e.g., negative values). In an example dataset, 1,000 participants answered the same 200 items. For this assessment, Cronbach’s Alpha is equivalent to the generalizability coefficient: 0.97. But if each of those participants had answered 50 randomly selected items from the same set, Cronbach’s Alpha is no longer appropriate. If we tried to use Cronbach’s Alpha, we would have seen a depressing number: 0.50. However, the generalizability coefficient is 0.65–still too low, but better than the alpha value.

Finally, it is important to report your results accurately. According to the Standards for Educational and Psychological Testing, you can report generalizability coefficients as reliability evidence if it is appropriate for the design of the assessment, but it is important not to use these terms interchangeably. Generalizability is a distinct concept from reliability, so make sure to label it as a generalizability coefficient, not a reliability coefficient. Also, the Standards require us to document the sources of variance that are included (and excluded) from the calculation of the generalizability coefficient. Readers are encouraged to refer to the Standards’ chapter on reliability and precision for more information.

# Is There Value in Reporting Changes in Subscores?

Posted by Austin Fossey

I had the privilege of meeting with an organization that is reporting subscores to show how their employees are improving across multiple areas of their domain, as determined by an assessment given before and after training. They have developed some slick reports to show these scores, including the participant’s first score, second score (after training is complete), and the change in those scores.

At first glance, these reports are pretty snazzy and seem to suggest huge improvements resulting from the training, but looks can be deceiving. I immediately noticed one participant had made a subscore gain of 25%, which sounds impressive—like he or she is suddenly 25% better at the tasks in that domain—but here is the fine print: that subscore was measured with only four items. To put it another way, that 25% improvement means that the participant answered one more item correctly. Other subscores were similarly underrepresented—most with four or fewer items in their topic.

In a previous post, I reported on an article by Richard Feinberg and Howard Wainer about how to determine if a subscore is worth reporting. My two loyal readers (you know who you are) may recall that a reported subscore has to be reliable, and it must contain information that is sufficiently different from the information contained in the assessment’s total score (AKA “orthogonality”).

In an article titled Comments on “A Note on Subscores” by Samuel A. Livingston, Sandip Sinharay and Shelby Haberman defended against a critique that their previous work (which informed Feinberg and Wainer’s proposed Value Added Ratio (VAR) metric) indicated that subscores should never be reported when examining changes across administrations. Sinharay and Haberman explained that in these cases, one should examine the suitability of the change scores, not the subscores themselves. One may then find that the change scores are suitable for reporting.

A change score is the difference in scores from one administration to the next. If a participant gets a subscore of 12 on their first assessment and a subscore of 30 on their next assessment, their change score for that topic is 18. This can then be thought of as the subscore of interest, and one can then evaluate whether or not this change score is suitable for reporting.

Change scores are also used to determine if a change in scores is statistically significant for a group of participants. If we want to know whether a group of participants is performing statistically better on an assessment after completing training (at a total score or subscore level), we do not compare average scores on the two tests. Instead, we look to see if the group’s change scores across the two tests are significantly greater than zero. This is typically analyzed with a dependent samples t-test.

The reliability, orthogonality, and significance of changes in subscores are statistical concerns, but scores must be interpretable and actionable to make a claim about the validity of the assessment. This raises the concern of domain representation. Even if the statistics are fine, a subscore cannot be meaningful if the items do not sufficiently represent the domain they are supposed to measure. Making an inference about a participant’s ability in a topic based on only four items is preposterous—you do not need to know anything about statistics to come to that conclusion.

To address the concern of domain representation, high-stakes assessment programs that report subscores will typically set a minimum for the number of items that are needed to sufficiently represent a topic before a subscore is reported. For example, one program I worked for required (perhaps somewhat arbitrarily) a minimum of eight items in a topic before generating a subscore. If this domain representation criterion is met, one can presumably use methods like the VAR to then determine if the subscores meet the statistical criteria for reporting.

# 4 Ways to Identify a Content Breach

Posted by Austin Fossey

In my last post, I discussed five ways you can limit the use of breached content so that a person with unauthorized access to your test content will have limited opportunities to put that information to use; however, those measures only control the problem of a content breach. Our next goal is to identify when a content breach has occurred so that we can remedy the problem through changes to the assessment or disciplinary actions against the parties involved in the breach.

Interested in learning about item analysis or how-to take your test planning to the next level? I will be presenting a series of workshops on at the Questionmark Conference 2016: Shaping the Future of Assessment in Miami, April 12-15.

Channel for Reporting

In most cases, you (the assessment program staff) will not be the first to find out that content has been stolen. You are far more likely to learn about the problem through a tip from another participant or stakeholder. One of the best things your organization can do to identify a content breach is to have a clear process for letting people report these concerns, as well as a detailed policy for what to do if a breach is found.

For example, you may want to have a disciplinary policy to address the investigation process, potential consequences, and an appeals process for participants who allegedly gained unauthorized access to the content (even if they did not pass the assessment). You may want to have legal resources lined up to help address non-participant parties who may be sharing your assessment content illegally (e.g., so-called “brain dump” sites). Finally, you should have an internal plan in place for what you will do if content is breached. Do you have backup items that can be inserted in the form? Can you release an updated form ahead of your republishing schedule? Will your response be different depending on the extent of the breach?

Web Patrol Monitoring

Several companies offer a web patrol service that will search the internet for pages where your assessment content has been posted without permission. Some of these companies will even purchase unauthorized practice exams that claim to have your assessment content and look for item breaches within them. Some of Questionmark’s partners provide web patrol services.

Statistical Models

There are several publicly available statistical models that can be used to identify abnormalities in participants’ response patterns or matches between a response pattern and a known content breach, such as the key patterns posted on a brain-dump site. Several companies, including some of Questionmark’s partners, have developed their own statistical methods for identifying cases where a participant may have used breached content.

In their chapter in Educational Measurement (4th ed.), Allan Cohen and James Wollack explain that all of these models tend to explore whether the amount of similarity between two sets of responses can be explained by chance alone. For example, one could look for two participants who had similar responses, possibly suggesting collusion or indicating that one participant copied the other. One could also look for similarity between a participant’s responses and the keys given in a leaked assessment form. Models also exist for identifying patterns within groups, as might be the case when a teacher chooses to provide answers to an entire class.

These models are a sophisticated way to look for breaches in content, but they are not foolproof. None of them prove that a participant was cheating, though they can provide weighty statistical evidence. Cohen and Wollack warn that several of the most popular models have been shown to suffer from liberal or conservative Type I error rates, though new models continue to improve in this area.

Item Drift

When considering content breaches, you might also be interested in cases where an item appears to become easier (or harder) for everyone over time. Consider a situation where your participant population has global access to information that changes how they respond to an item. This could be for some unsavory reasons (e.g., a lot of people stole your content), or it could be something benign, like a newsworthy event that caused your population to learn more about content related to your assessment. In these cases, you might expect certain items to become easier for everyone in the population.

To detect whether an item is becoming easier over time, we do not use the p value from Classical Test Theory. Instead, we use Item Response Theory (IRT) and a Differential Item Functioning to detect item drift, which is changes in an item’s IRT parameters over time. This is done with Thissen, Steinberg, and Wainer’s likelihood ratio test that they detailed in Test Validity. Creators of IRT assessments use item parameter drift analyses to see if an item has become easier over time. This information helps test developers make decisions about cycling out items from production or planning new calibration studies.

Interested in learning about item analysis or how-to take your test planning to the next level? I will be presenting a series of workshops at the Questionmark Conference 2016: Shaping the Future of Assessment in Miami, April 12-15. I look forward to seeing you there! Click here to register and learn more about this important learning event.

# 5 Ways to Limit the Use of Breached Assessment Content

In an earlier post, Questionmark’s Julie Delazyn listed 11 tips to help prevent cheating. The third item on that list related to minimizing item exposure; i.e., limiting how and when people can see an item so that content will not be leaked and used for dishonest purposes.

During a co-presentation with Manny Straehle of Assessment, Education, and Research Experts at a Certification Network Group quarterly meeting, I presented a set of considerations that can affect the severity of item exposure. My message was that although item exposure may not be a problem for some assessment programs, assessment managers should consider the design, purpose, candidate population, and level of investment for their assessment when evaluating their content security requirements.

If item exposure is a concern for your assessment program, there are two ways to mitigate the effects of leaked content: limiting opportunities to use the content, and identifying the breach so that it can be corrected. In this post, I will focus on ways to limit content-using opportunities:

Multiple Forms

Using different assessment forms lowers the number of participants who will see an item in delivery. Having multiple forms also lowers the probability that someone with access to a breached item will actually get to put that information to use. Many organizations achieve this by using multiple, equated forms which are systematically assigned to participants to limit joint cheating or to limit item exposure across multiple retakes. Some organizations also achieve this through the use of randomly generated forms like those in Linear-on-the-Fly Testing (LOFT) or empirically generated forms like those in Computer Adaptive Testing (CAT).

Frequent Republishing

Assessment forms are often cycled in and out of production on a set schedule. Decreasing the amount of time a form is in production will limit the impact of item exposure, but it also requires more content and staff resources to keep rotating forms.

Large Item Banks

Having a lot of items can help you make lots of assessment forms, but this is also important for limiting item exposure in LOFT or CAT. Item banks can also be rotated. For example, some assessment programs will use an item bank for particular testing windows or geographic regions and then switch them at the next administration.

Exposure Limits

If your item bank can support it, you may also want to put an exposure limit on items or assessment forms. For example, you might set up a rule where an assessment form remains in production until it has been delivered 5,000 times. After that, you may permanently retire that form or shelve it for a predetermined period and use it again later. An extreme example would be an assessment program that only delivers an item during a single testing window before retiring it. The limit will depend on your risk tolerance, the number of items you have available, and the number of participants taking the assessment. Exposure limits are especially important in CAT where some items will get delivered much more frequently than others due to the item selection algorithm.

Short Testing Windows

When participants are only allowed to take a test during a short time period, there are fewer opportunities for people to talk about or share content before the testing window closes. Short testing windows may be less convenient for your participant population, but you can take advantage of the extra downtime to spend time detecting item breaches, developing new content, and performing assessment maintenance.

In my next post, I will provide an overview of methods for identifying instances of an item breach.

# Item analysis: Selecting items for the test form – Part 2

In my last post, I talked about how item discrimination is the primary statistic used for item selection in classical test theory (CTT). In this post, I will share an example from my item analysis webinar.

The assessment below is fake, so there’s no need to write in comments telling me that the questions could be written differently or that the test is too short or that there is not good domain representation or that I should be banished to an island.

Interested in learning more about item analysis? I will be presenting a series of workshops on this topic at the Questionmark Conference 2016: Shaping the Future of Assessment in Miami, April 12-15.

In this example, we have field tested 16 items and collected item statistics from a representative sample of 1,000 participants. In this hypothetical scenario, we have been asked to create an assessment that has 11 items instead of 16. We will begin by looking at the item discrimination statistics.

Since this test has fewer than 25 items, we will look at the item-rest correlation discrimination. The screenshot below shows the first five items from the summary table in Questionmark’s Item Analysis Report (I have omitted some columns to help display the table within the blog).

The test’s reliability (as measured by Cronbach’s Alpha) for all 16 items is 0.58. Note that one would typically need at least a reliability value of 0.70 for low-stakes assessments and a value of 0.90 or higher for high-stakes assessments. When reliability is too low, adding extra items can often help improve the reliability, but removing items with poor discrimination can also improve reliability.

If we remove the five items with the lowest item-rest correlation discrimination (items 9, 16, 2, 3, and 13 shown above), the remaining 11 items have an alpha value of 0.67. That is still not high enough for even low-stakes testing, but it illustrates how items with poor discrimination can lower the reliability of an assessment. Low reliability also increases the standard error of measurement, so by increasing the reliability of the assessment, we might also increase the accuracy of the scores.

Notice that these five items have poor item-rest correlation statistics, yet four of those items have reasonable item difficulty indices (items 16, 2, 3, and 13). If we had made selection decisions based on item difficulty, we might have chosen to retain these items, though closer inspection would uncover some content issues, as I demonstrated during the item analysis webinar.

For example, consider item 3, which has a difficulty value of 0.418 and an item-rest correlation discrimination value of -0.02. The screenshot below shows the option analysis table from the item detail page of the report.

The option analysis table shows that, when asked about the easternmost state in the Unites States, many participants are selecting the key, “Maine,” but 43.3% of our top-performing participants (defined by the upper 27% of scores) selected “Alaska.” This indicates that some of the top-performing participants might be familiar with Pochnoi Point—an Alaskan island which happens to sit on the other side of the 180th meridian. Sure, that is a technicality, but across the entire sample, 27.8% of the participants chose this option. This item clearly needs to be sent back for revision and clarification before we use it for scored delivery. If we had only looked at the item difficulty statistics, we might never had reviewed this item.

Interested in learning more about item analysis? I will be presenting a series of workshops on this topic at the Questionmark Conference 2016: Shaping the Future of Assessment in Miami, April 12-15. I look forward to seeing you there! Click here to register and learn more about this important learning event.