Is There Value in Reporting Changes in Subscores?

Posted by Austin Fossey

I had the privilege of meeting with an organization that is reporting subscores to show how their employees are improving across multiple areas of their domain, as determined by an assessment given before and after training. They have developed some slick reports to show these scores, including the participant’s first score, second score (after training is complete), and the change in those scores.

At first glance, these reports are pretty snazzy and seem to suggest huge improvements resulting from the training, but looks can be deceiving. I immediately noticed one participant had made a subscore gain of 25%, which sounds impressive—like he or she is suddenly 25% better at the tasks in that domain—but here is the fine print: that subscore was measured with only four items. To put it another way, that 25% improvement means that the participant answered one more item correctly. Other subscores were similarly underrepresented—most with four or fewer items in their topic.

In a previous post, I reported on an article by Richard Feinberg and Howard Wainer about how to determine if a subscore is worth reporting. My two loyal readers (you know who you are) may recall that a reported subscore has to be reliable, and it must contain information that is sufficiently different from the information contained in the assessment’s total score (AKA “orthogonality”).

In an article titled Comments on “A Note on Subscores” by Samuel A. Livingston, Sandip Sinharay and Shelby Haberman defended against a critique that their previous work (which informed Feinberg and Wainer’s proposed Value Added Ratio (VAR) metric) indicated that subscores should never be reported when examining changes across administrations. Sinharay and Haberman explained that in these cases, one should examine the suitability of the change scores, not the subscores themselves. One may then find that the change scores are suitable for reporting.

A change score is the difference in scores from one administration to the next. If a participant gets a subscore of 12 on their first assessment and a subscore of 30 on their next assessment, their change score for that topic is 18. This can then be thought of as the subscore of interest, and one can then evaluate whether or not this change score is suitable for reporting.

Change scores are also used to determine if a change in scores is statistically significant for a group of participants. If we want to know whether a group of participants is performing statistically better on an assessment after completing training (at a total score or subscore level), we do not compare average scores on the two tests. Instead, we look to see if the group’s change scores across the two tests are significantly greater than zero. This is typically analyzed with a dependent samples t-test.
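As a minimal sketch with hypothetical subscores (standard library only), the change scores and the dependent samples t statistic can be computed like this:

```python
import math
from statistics import mean, stdev

# Hypothetical subscores for the same five participants before and after training
pre = [12, 15, 14, 10, 13]
post = [18, 17, 19, 14, 16]

# Each participant's change score: second score minus first score
changes = [b - a for a, b in zip(pre, post)]

# Dependent samples t statistic: mean change divided by its standard error.
# Compare against a t distribution with n - 1 degrees of freedom to test
# whether the group's change scores are significantly greater than zero.
n = len(changes)
t_stat = mean(changes) / (stdev(changes) / math.sqrt(n))
print(changes, round(t_stat, 2))
```

The same result can be obtained with a ready-made paired-samples test such as `scipy.stats.ttest_rel`, which also returns the p-value.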

The reliability, orthogonality, and significance of changes in subscores are statistical concerns, but scores must be interpretable and actionable to make a claim about the validity of the assessment. This raises the concern of domain representation. Even if the statistics are fine, a subscore cannot be meaningful if the items do not sufficiently represent the domain they are supposed to measure. Making an inference about a participant’s ability in a topic based on only four items is preposterous—you do not need to know anything about statistics to come to that conclusion.

To address the concern of domain representation, high-stakes assessment programs that report subscores will typically set a minimum for the number of items that are needed to sufficiently represent a topic before a subscore is reported. For example, one program I worked for required (perhaps somewhat arbitrarily) a minimum of eight items in a topic before generating a subscore. If this domain representation criterion is met, one can presumably use methods like the VAR to then determine if the subscores meet the statistical criteria for reporting.
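As a sketch, such a screening rule might look like the following. The eight-item minimum comes from the example above; the VAR threshold assumes the rule of thumb that a Value Added Ratio above 1 favors reporting:

```python
# Illustrative thresholds; each program sets its own criteria.
MIN_ITEMS = 8  # minimum items for domain representation (example above)
MIN_VAR = 1.0  # assumed rule of thumb: VAR > 1 suggests the subscore adds value

def report_subscore(n_items: int, value_added_ratio: float) -> bool:
    """Return True only if a subscore meets both the domain representation
    criterion (enough items) and the statistical criterion (VAR)."""
    return n_items >= MIN_ITEMS and value_added_ratio > MIN_VAR

print(report_subscore(4, 1.2))   # only four items in the topic -> not reported
print(report_subscore(10, 1.2))  # meets both criteria -> reported
```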

5 Ways to Limit the Use of Breached Assessment Content

Posted by Austin Fossey

In an earlier post, Questionmark’s Julie Delazyn listed 11 tips to help prevent cheating. The third item on that list related to minimizing item exposure; i.e., limiting how and when people can see an item so that content will not be leaked and used for dishonest purposes.

During a co-presentation with Manny Straehle of Assessment, Education, and Research Experts at a Certification Network Group quarterly meeting, I presented a set of considerations that can affect the severity of item exposure. My message was that although item exposure may not be a problem for some assessment programs, assessment managers should consider the design, purpose, candidate population, and level of investment for their assessment when evaluating their content security requirements.


If item exposure is a concern for your assessment program, there are two ways to mitigate the effects of leaked content: limiting opportunities to use the content, and identifying the breach so that it can be corrected. In this post, I will focus on ways to limit opportunities to use breached content:

Multiple Forms

Using different assessment forms lowers the number of participants who will see an item in delivery. Having multiple forms also lowers the probability that someone with access to a breached item will actually get to put that information to use. Many organizations achieve this by using multiple, equated forms which are systematically assigned to participants to limit joint cheating or to limit item exposure across multiple retakes. Some organizations also achieve this through the use of randomly generated forms like those in Linear-on-the-Fly Testing (LOFT) or empirically generated forms like those in Computer Adaptive Testing (CAT).

Frequent Republishing

Assessment forms are often cycled in and out of production on a set schedule. Decreasing the amount of time a form is in production will limit the impact of item exposure, but it also requires more content and staff resources to keep rotating forms.

Large Item Banks

A large item bank makes it possible to build many assessment forms, and it is also important for limiting item exposure in LOFT or CAT. Item banks can also be rotated. For example, some assessment programs will use an item bank for particular testing windows or geographic regions and then switch banks at the next administration.

Exposure Limits

If your item bank can support it, you may also want to put an exposure limit on items or assessment forms. For example, you might set up a rule where an assessment form remains in production until it has been delivered 5,000 times. After that, you may permanently retire that form or shelve it for a predetermined period and use it again later. An extreme example would be an assessment program that only delivers an item during a single testing window before retiring it. The limit will depend on your risk tolerance, the number of items you have available, and the number of participants taking the assessment. Exposure limits are especially important in CAT where some items will get delivered much more frequently than others due to the item selection algorithm.
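A rough sketch of such a rule (the class, names, and 5,000-delivery limit are illustrative, not a Questionmark feature):

```python
# Illustrative exposure rule: retire a form once it reaches its delivery limit.
EXPOSURE_LIMIT = 5000

class AssessmentForm:
    def __init__(self, name: str):
        self.name = name
        self.deliveries = 0
        self.retired = False

    def record_delivery(self) -> None:
        self.deliveries += 1
        if self.deliveries >= EXPOSURE_LIMIT:
            # Permanently retire the form, or shelve it for later reuse
            self.retired = True

form = AssessmentForm("Form A")
for _ in range(EXPOSURE_LIMIT):
    form.record_delivery()
print(form.name, form.retired)
```

The same counter could be kept per item rather than per form, which matters in CAT, where the selection algorithm delivers some items far more often than others.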

Short Testing Windows

When participants are only allowed to take a test during a short time period, there are fewer opportunities for people to talk about or share content before the testing window closes. Short testing windows may be less convenient for your participant population, but you can use the downtime between windows to detect item breaches, develop new content, and perform assessment maintenance.

In my next post, I will provide an overview of methods for identifying instances of an item breach.

Item analysis: Selecting items for the test form – Part 2

Posted by Austin Fossey

In my last post, I talked about how item discrimination is the primary statistic used for item selection in classical test theory (CTT). In this post, I will share an example from my item analysis webinar.

The assessment below is fake, so there’s no need to write in comments telling me that the questions could be written differently or that the test is too short or that there is not good domain representation or that I should be banished to an island.

Interested in learning more about item analysis? I will be presenting a series of workshops on this topic at the Questionmark Conference 2016: Shaping the Future of Assessment in Miami, April 12-15.

In this example, we have field tested 16 items and collected item statistics from a representative sample of 1,000 participants. In this hypothetical scenario, we have been asked to create an assessment that has 11 items instead of 16. We will begin by looking at the item discrimination statistics.

Since this test has fewer than 25 items, we will look at the item-rest correlation discrimination. The screenshot below shows the first five items from the summary table in Questionmark’s Item Analysis Report (I have omitted some columns to help display the table within the blog).

[Screenshot: first five items from the summary table in the Item Analysis Report]

The test’s reliability (as measured by Cronbach’s Alpha) for all 16 items is 0.58. Note that one would typically need at least a reliability value of 0.70 for low-stakes assessments and a value of 0.90 or higher for high-stakes assessments. When reliability is too low, adding extra items can often help improve the reliability, but removing items with poor discrimination can also improve reliability.

If we remove the five items with the lowest item-rest correlation discrimination (items 9, 16, 2, 3, and 13 shown above), the remaining 11 items have an alpha value of 0.67. That is still not high enough for even low-stakes testing, but it illustrates how items with poor discrimination can lower the reliability of an assessment. Low reliability also increases the standard error of measurement, so by increasing the reliability of the assessment, we might also increase the accuracy of the scores.
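Cronbach's alpha itself can be computed directly from the scored response matrix; here is a minimal sketch with made-up data:

```python
from statistics import pvariance

def cronbach_alpha(responses):
    """Cronbach's alpha for a matrix of scored responses (rows = participants)."""
    k = len(responses[0])  # number of items
    item_variances = [pvariance([row[i] for row in responses]) for i in range(k)]
    total_variance = pvariance([sum(row) for row in responses])
    return (k / (k - 1)) * (1 - sum(item_variances) / total_variance)

# Hypothetical 0/1 scored responses: six participants (rows) by four items (columns)
responses = [
    [1, 1, 1, 0],
    [1, 0, 1, 1],
    [0, 1, 0, 0],
    [1, 1, 1, 1],
    [0, 0, 1, 0],
    [1, 1, 0, 1],
]
print(round(cronbach_alpha(responses), 2))
```

Dropping or adding items changes both the item variances and the total score variance, which is how removing poorly discriminating items can raise alpha.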

Notice that these five items have poor item-rest correlation statistics, yet four of those items have reasonable item difficulty indices (items 16, 2, 3, and 13). If we had made selection decisions based on item difficulty, we might have chosen to retain these items, though closer inspection would uncover some content issues, as I demonstrated during the item analysis webinar.

For example, consider item 3, which has a difficulty value of 0.418 and an item-rest correlation discrimination value of -0.02. The screenshot below shows the option analysis table from the item detail page of the report.

[Screenshot: option analysis table for item 3]

The option analysis table shows that, when asked about the easternmost state in the United States, many participants selected the key, “Maine,” but 43.3% of our top-performing participants (defined by the upper 27% of scores) selected “Alaska.” This indicates that some of the top-performing participants might be familiar with Pochnoi Point, a spot in Alaska's Aleutian Islands that happens to sit on the other side of the 180th meridian. Sure, that is a technicality, but across the entire sample, 27.8% of the participants chose this option. This item clearly needs to be sent back for revision and clarification before we use it for scored delivery. If we had only looked at the item difficulty statistics, we might never have reviewed this item.

I look forward to seeing you at the Questionmark Conference 2016 in Miami, April 12-15! Click here to register and learn more about this important learning event.

Item analysis: Selecting items for the test form – Part 1

Regular readers of our blog know that we ran an initial series on item analysis way back in the day, and then I did a second item analysis series building on that a couple of years ago, and then I discussed item analysis in our item development series, and then we had an amazing webinar about item analysis, and then I named my goldfish Item Analysis and wrote my senator requesting that our state bird be changed to an item analysis. So today, I would like to talk about . . . item analysis.

But don’t worry, this is actually a new topic for the blog.

Interested in learning more about item analysis? I will be presenting a series of workshops on this topic at the 2016 Questionmark Conference: Shaping the Future of Assessment in Miami, April 12-15. 

Today, I am writing about the use of item statistics for item selection. I was surprised to learn from our webinar participants' feedback that many people do not look at their item statistics until after the test has been delivered. Reviewing statistics after delivery is a great practice (so keep it up), but if you can try out the questions as unscored field test items before making your final test form, you can use the item analysis statistics to build a better instrument.

When building a test form, item statistics can help us in two ways.

  • They can help us identify items that are poorly written, miskeyed, or irrelevant to the construct.
  • They can help us select the items that will yield the most reliable instrument, and thus a more accurate score.

In the early half of the 20th century, it was a common belief that good test instruments should have a mix of easy, medium, and hard items, but this thinking began to change after two studies in 1952 by Fred Lord and by Lee Cronbach and Willard Warrington. These researchers (and others since) demonstrated that items with higher discrimination values create instruments whose total scores discriminate better among participants across all ability levels.

Sometimes easy and hard items are useful for measurement, such as in an adaptive aptitude test where we need to measure all abilities with similar precision. But in criterion-referenced assessments, we are often interested in correctly classifying those participants who should pass and those who should fail. If this is our goal, then the best test form will be one with a range of medium-difficulty items that also have high discrimination values.

Discrimination may be the primary statistic used for selecting items, but item reliability is also occasionally useful, as I explained in an earlier post. Item reliability can be used as a tie breaker when we need to choose between two items with the same discrimination, or it can be used to predict the reliability or score variance for a set of items that the test developer wants to use for a test form.

Difficulty is still useful for flagging items, though an item flagged for being too easy or too hard will often have a low discrimination value too. If an easy or hard item has good discrimination, it may be worth reviewing for item flaws or other factors that may have impacted the statistics (e.g., was it given at the end of a timed test that did not leave participants enough time to respond carefully?).
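As a sketch, a flagging rule combining the two statistics might look like this (the cutoff values are illustrative; every program sets its own):

```python
# Illustrative flagging thresholds; actual cutoffs vary by program.
def flag_item(difficulty: float, discrimination: float) -> list:
    """Return review flags for an item based on its classical statistics."""
    flags = []
    if difficulty > 0.90:
        flags.append("very easy")
    elif difficulty < 0.20:
        flags.append("very hard")
    if discrimination < 0.15:
        flags.append("low discrimination")
    return flags

print(flag_item(0.95, 0.05))  # easy item that also discriminates poorly
print(flag_item(0.95, 0.40))  # easy but discriminating: review for flaws
```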

In my next post, I will share an example from the webinar of how item selection using item discrimination improves the test form reliability, even though the test is shorter. I will also share an example of a flawed item that exhibits poor item statistics.

I look forward to seeing you at the 2016 Questionmark Conference in Miami, April 12-15! Click here to register and learn more about this important learning event.


Writing JTA Task Statements

Posted by Austin Fossey

One of the first steps in an evidence-centered design (ECD) approach to assessment development is a domain analysis. If you work in credentialing, licensure, or workplace assessment, you might accomplish this step with a job task analysis (JTA) study.

A JTA study gathers examples of tasks that potentially relate to a specific job. These tasks are typically harvested from existing literature or observations, reviewed by subject matter experts (SMEs), and rated by practitioners or other stakeholder groups across relevant dimensions (e.g., applicability to the job, frequency of the task). The JTA results are often used later to determine the content areas, cognitive processes, and weights that will be on the test blueprint.

Questionmark has tools for authoring and delivering JTA items, as well as some limited analysis tools for basic response frequency distributions. But if we are conducting a JTA study, we need to start at the beginning: how do we write task statements?

One of my favorite sources on the subject is Mark Raymond and Sandra Neustel’s chapter, “Determining the Content of Credentialing Examinations,” in The Handbook of Test Development. The chapter provides information on how to organize a JTA study, how to write tasks, how to analyze the results, and how to use the results to build a test blueprint. It is well written and easy to understand, with enough detail to be useful without being too dense. If you are conducting a JTA study, I highly recommend checking out this chapter.

Raymond and Neustel explain that a task statement can refer to a physical or cognitive activity related to the job/practice. The format of a task statement should always follow a subject/verb/object format, though it might be expanded to include qualifiers for how the task should be executed, the resources needed to do the task, or the context of its application. They also underscore that most task statements should have only one action and one object. There are some exceptions to this rule, but if there are multiple actions and objects, they typically should be split into different tasks. As a hint, they suggest critiquing any task statement that has the words “and” or “or” in it.
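Raymond and Neustel's "and"/"or" hint is easy to automate as a first-pass screen; here is a small sketch (the rule and wording are illustrative):

```python
import re

def critique_task_statement(statement: str) -> list:
    """Flag task statements that may bundle multiple actions or objects."""
    warnings = []
    if re.search(r"\b(and|or)\b", statement, flags=re.IGNORECASE):
        warnings.append('contains "and"/"or": statement may bundle multiple tasks')
    return warnings

print(critique_task_statement("Measure skid marks for calculation of approximate vehicle speed"))
print(critique_task_statement("Measure and photograph skid marks"))
```

As the chapter notes, there are legitimate exceptions, so a flag like this is a prompt for SME review, not an automatic rejection.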

Here is an example of a task statement from the Michigan Commission on Law Enforcement Standards’ Statewide Job Analysis of the Patrol Officer Position: Task 320: “[The patrol officer can] measure skid marks for calculation of approximate vehicle speed.”

I like this example because it is pretty specific, certainly better than just saying “determine vehicle’s speed.” It also provides a qualifier for how good the measurement needs to be (“approximate”). The statement might be improved by adding more context (e.g., “using a tape measure”), but that may already be understood by their participant population.

Raymond and Neustel also caution researchers to avoid words that might have multiple meanings or vague meanings. For example, the verb “instruct” could mean many different things—the practitioner might be giving some on-the-fly guidance to an individual or teaching a multi-week lecture. Raymond and Neustel underscore the difficult balance of writing task statements at a level of granularity and specificity that is appropriate for accomplishing defined goals in the workplace, but at a high enough level that we do not overwhelm the JTA participants with minutiae. The authors also advise that we avoid writing task statements that describe best practice or that might otherwise yield a biased positive response.

Early in my career, I observed a JTA SME meeting for an entry-level credential in the construction industry. In an attempt to condense the task list, the psychometrician on the project combined a bunch of seemingly related tasks into a single statement—something along the lines of “practitioners have an understanding of the causes of global warming.” This is not a task statement; it is a knowledge statement, and it would be better suited for a blueprint. It is also not very specific. But most important, it yielded a biased response from the JTA survey sample. This vague statement had the words “global warming” in it, which many would agree is a pretty serious issue, so respondents ranked it as of very high importance. The impact was that this task statement heavily influenced the topic weighting of the blueprint, but when it came time to develop the content, there was not much that could be written. Item writers were stuck having to write dozens of items for a vague yet somehow very important topic. They ended up churning out loads of questions about one of the few topics that were relevant to the practice: refrigerants. The end result was a general knowledge assessment with tons of questions about refrigerants. This experience taught me how a lack of specificity and the phrasing of task statements can undermine the entire content validity argument for an assessment’s results.

If you are new to JTA studies, it is worth mentioning that a JTA can sometimes turn into a significant undertaking. I attended one of Mark Raymond’s seminars earlier this year, and he observed anecdotally that he has had JTA studies take anywhere from three months to over a year. There are many psychometricians who specialize in JTA studies, and it may be helpful to work with them for some aspects of the project, especially when conducting a JTA for the first time. However, even if we use a psychometric consultant to conduct or analyze the JTA, learning about the process can make us better-informed consumers and allow us to handle some of work internally, potentially saving time and money.


Example of task input screen for a JTA item in Questionmark Authoring.

For more information on JTA and other reporting tools that are available with Questionmark, check out this Reporting & Analytics page.

Question Type Report: Use Cases

Posted by Austin Fossey

A client recently asked me if there is a way to count the number of each type of item in their item bank, so I pointed them toward the Question Type Report in Questionmark Analytics. While this type of frequency data can also be easily pulled using our Results API, it can be useful to have a quick overview of the number of items (split out by item type) in the item bank.
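If you do pull the raw item records yourself (for example, via the Results API), the same counts are a one-liner with `collections.Counter`; the record fields here are hypothetical:

```python
from collections import Counter

# Hypothetical item records as they might be pulled from an item bank export
items = [
    {"id": 1, "type": "MC", "topic": "Topic A"},
    {"id": 2, "type": "MC", "topic": "Topic A"},
    {"id": 3, "type": "Essay", "topic": "Topic A"},
    {"id": 4, "type": "MC", "topic": "Topic B"},
]

# Frequency of each item type, overall and within a single topic
overall = Counter(item["type"] for item in items)
topic_a = Counter(item["type"] for item in items if item["topic"] == "Topic A")
print(overall, topic_a)
```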

The Question Type Report does not need to be run frequently (and Analytics usage stats reflect that observation), but the data can help indicate the robustness of an item bank.

This report is most valuable when run on the topics for a specific assessment or set of related assessments. While it might be nice to know that we have a total of 15,000 multiple choice (MC) items in the item bank, such counts are trivial unless we have a system-wide practical application, such as planning a full program translation or selling content to a partner.

This report can provide a quick profile of the population of the item bank or a topic when needed, though more detailed item tracking by status, topic, metatags, item type, and exposure is advisable for anyone managing a large-scale item development project. Below are some potential use cases for this simple report.

Test Development and Maintenance:
The Question Type Report’s value is primarily its ability to count the number of each type of item within a topic. If we know we have 80 MC items in a topic for a new assessment, and they all need to be reviewed by a bias committee, then we can plan accordingly.

Form Building:
If we are equating multiple forms using a common-item design, the report can help us determine how many items go on each form and the degree to which the forms can overlap. Even if we only have one form, knowing the number of items can help a test developer check that enough items are available to match the blueprint.

Item Development:
If the report indicates that there are plenty of MC items ready for future publications, but we only have a handful of essay items to cover our existing assessment form, then we might instruct item writers to focus on developing new essay questions for the next publication of the assessment.


Example of a Question Type Report showing the frequency distribution by item type.