5 Ways to Limit the Use of Breached Assessment Content

Posted by Austin Fossey

In an earlier post, Questionmark’s Julie Delazyn listed 11 tips to help prevent cheating. The third item on that list related to minimizing item exposure; i.e., limiting how and when people can see an item so that content will not be leaked and used for dishonest purposes.

During a co-presentation with Manny Straehle of Assessment, Education, and Research Experts at a Certification Network Group quarterly meeting, I presented a set of considerations that can affect the severity of item exposure. My message was that although item exposure may not be a problem for some assessment programs, assessment managers should consider the design, purpose, candidate population, and level of investment for their assessment when evaluating their content security requirements.


If item exposure is a concern for your assessment program, there are two ways to mitigate the effects of leaked content: limiting opportunities to use the content, and identifying the breach so that it can be corrected. In this post, I will focus on ways to limit those opportunities:

Multiple Forms

Using different assessment forms lowers the number of participants who will see an item in delivery. Having multiple forms also lowers the probability that someone with access to a breached item will actually get to put that information to use. Many organizations achieve this by using multiple, equated forms which are systematically assigned to participants to limit joint cheating or to limit item exposure across multiple retakes. Some organizations also achieve this through the use of randomly generated forms like those in Linear-on-the-Fly Testing (LOFT) or empirically generated forms like those in Computer Adaptive Testing (CAT).
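
For illustration, here is a minimal sketch (not Questionmark functionality) of how a program might randomly assign one of several equated forms to a participant while avoiding forms that participant has already seen on earlier attempts. The form names and data structures are hypothetical.

```python
import random

# Hypothetical pool of equated, interchangeable forms.
EQUATED_FORMS = ["FORM_A", "FORM_B", "FORM_C", "FORM_D"]

def assign_form(participant_id: str, previous_forms: set) -> str:
    """Pick an equated form the participant has not seen before.

    Falls back to the full pool if the participant has already seen
    every form (e.g., on a later retake).
    """
    unseen = [f for f in EQUATED_FORMS if f not in previous_forms]
    pool = unseen or EQUATED_FORMS
    # Seeding on the participant ID and attempt number keeps the
    # assignment reproducible for auditing.
    rng = random.Random(f"{participant_id}:{len(previous_forms)}")
    return rng.choice(pool)

# Example: a retake for a participant who already saw FORM_B.
print(assign_form("candidate-123", {"FORM_B"}))
```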

Frequent Republishing

Assessment forms are often cycled in and out of production on a set schedule. Decreasing the amount of time a form is in production will limit the impact of item exposure, but it also requires more content and staff resources to keep rotating forms.

Large Item Banks

A large item bank helps you build many assessment forms, and it is also important for limiting item exposure in LOFT or CAT. Item banks can also be rotated. For example, some assessment programs will use one item bank for particular testing windows or geographic regions and then switch to another at the next administration.

Exposure Limits

If your item bank can support it, you may also want to put an exposure limit on items or assessment forms. For example, you might set up a rule where an assessment form remains in production until it has been delivered 5,000 times. After that, you may permanently retire that form or shelve it for a predetermined period and use it again later. An extreme example would be an assessment program that only delivers an item during a single testing window before retiring it. The limit will depend on your risk tolerance, the number of items you have available, and the number of participants taking the assessment. Exposure limits are especially important in CAT where some items will get delivered much more frequently than others due to the item selection algorithm.
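
As a rough sketch of the bookkeeping involved, the example below flags forms that have reached a hypothetical 5,000-delivery exposure limit so they can be retired or shelved; the counts and threshold are illustrative only.

```python
from dataclasses import dataclass

EXPOSURE_LIMIT = 5000  # hypothetical threshold based on risk tolerance

@dataclass
class FormUsage:
    form_id: str
    deliveries: int

def forms_to_retire(usage, limit=EXPOSURE_LIMIT):
    """Return IDs of forms that have reached the exposure limit."""
    return [u.form_id for u in usage if u.deliveries >= limit]

usage = [FormUsage("FORM_A", 5210), FormUsage("FORM_B", 3480)]
print(forms_to_retire(usage))  # ['FORM_A'] -> retire or shelve this form
```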

Short Testing Windows

When participants are only allowed to take a test during a short time period, there are fewer opportunities for people to talk about or share content before the testing window closes. Short testing windows may be less convenient for your participant population, but you can use the downtime between windows to detect item breaches, develop new content, and perform assessment maintenance.

In my next post, I will provide an overview of methods for identifying instances of an item breach.

Know what your questions are about before you deliver the test

Posted by Austin Fossey

A few months ago, I had an interesting conversation with an assessment manager at an educational institution—not a Questionmark customer, mind you. Finding nothing else in common, we eventually began discussing assessment design.

At this institution (which will remain anonymous), he admitted that they are often pressed for time in their assessment development cycle. There is not enough time to do all of the item development work they need to do before their students take the assessment. To get around this, their item writers draft all of the items, conduct an editorial review, and then deliver the items. The items are assigned topics after administration, and students’ total scores and topic scores are calculated from there. He asked me if Questionmark software allows test developers to assign topics and calculate topic scores after assessing the students, and I answered truthfully that it does not.

But why not? Is there a reason test developers should not do what is being practiced at this institution? Yes, there are in fact two reasons. Get ready for some psychometric finger-wagging.

Consider what this institution is doing. The items are drafted and subjected to an editorial review, but no one ever classifies the items within a topic until after the test has been administered. Recall what people typically do during a content review prior to administration:

  • Remove items that are not relevant to the domain.
  • Ensure that the blueprint is covered.
  • Check that items are assigned to the correct topic.

If topics are not assigned until after the participants have already tested, we risk the validity of the results and the legal defensibility of the test. If we have delivered items that are not relevant to the domain, we have wasted participants’ time and will need to adjust their total score. Okay, we can manage that by telling the participants ahead of time that some of the test items might not count. But if we have not asked the correct number of questions for a given area of the blueprint, the entire assessment score will be worthless—a threat to validity known as construct underrepresentation or construct deficiency in The Standards for Educational and Psychological Testing.

For example, if we were supposed to deliver 20 items from Topic A, but find out after the fact that only 12 items have been classified as belonging to Topic A, then there is little we can do about it besides rebuilding the test form and making everyone take the test again.
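
This is exactly the kind of problem a simple pre-administration blueprint check can catch. The sketch below compares a draft form's topic counts against the required blueprint counts; the topics, counts, and item IDs are hypothetical.

```python
from collections import Counter

# Hypothetical blueprint: required item counts per topic.
blueprint = {"Topic A": 20, "Topic B": 15, "Topic C": 10}

# Hypothetical draft form: (item_id, assigned_topic) pairs.
draft_form = [("Q1", "Topic A"), ("Q2", "Topic A"), ("Q3", "Topic B")]

delivered = Counter(topic for _, topic in draft_form)
shortfalls = {topic: needed - delivered.get(topic, 0)
              for topic, needed in blueprint.items()
              if delivered.get(topic, 0) < needed}

if shortfalls:
    # For the toy data above: {'Topic A': 18, 'Topic B': 14, 'Topic C': 10}
    print("Form does not meet the blueprint:", shortfalls)
```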

The Standards provide helpful guidance in these matters. For this particular case, the Standards point out that:

“The test developer is responsible for documenting that the items selected for the test meet the requirements of the test specifications. In particular, the set of items selected for a new test form . . . must meet both content and psychometric specifications.” (p. 82)

Publications describing best practices for test development also specify that the content must be determined before delivering an operational form. For example, in their chapter in Educational Measurement (4th Edition), Cynthia Schmeiser and Catherine Welch note the importance of conducting a content review of items before field testing, as well as a final content review of a draft test form before it becomes operational.

In Introduction to Classical and Modern Test Theory, Linda Crocker and James Algina also made an interesting observation about classroom assessments, noting that students expect to be graded on all of the items they have been asked to answer. Even if notified in advance that some items might not be counted (as one might do in field testing), students might not consider it fair that their score is based on a yet-to-be-determined subset of items that may not fully represent the content that is supposed to be covered.

This is why Questionmark’s software is designed the way it is. When creating an item, item writers must assign an item to a topic, and items can be classified or labeled along other dimensions (e.g., cognitive process) using metatags. Even if an assessment program cannot muster any further content review, at least the item writer has classified items by content area. The person building the test form then has the information they need to make sure that the right questions get asked.
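
Conceptually, this amounts to making the topic a required attribute of every item record at creation time rather than an afterthought. Here is a minimal sketch of that idea (my own illustration, not Questionmark's actual data model):

```python
from dataclasses import dataclass, field

@dataclass
class Item:
    item_id: str
    stem: str
    topic: str                                      # required at creation time
    metatags: dict = field(default_factory=dict)    # e.g., cognitive process

    def __post_init__(self):
        if not self.topic:
            raise ValueError(f"Item {self.item_id} must be assigned a topic")

# Classification happens up front, not after the assessment is delivered.
item = Item("Q42",
            "Which of the following best describes construct underrepresentation?",
            topic="Validity threats",
            metatags={"cognitive_process": "recall"})
```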

We have a responsibility as test developers to treat our participants fairly and ethically. If we are asking them to spend their time taking a test, then we owe them the most useful measurement that we can provide. Participants trust that we know what we are doing. If we postpone critical, basic development tasks like content identification until after participants have already given us their time, we are taking advantage of that trust.

Test Security: Not Necessarily a Question of Proctoring Mode

Posted by Austin Fossey

I recently spent time looking for research studies that analyzed the security levels of online and in-person proctoring. Unfortunately, no one seems to have compared these two approaches with a well-designed study. (If someone has done a rigorous study contrasting these two modes of delivery, please let me know! I certainly may have overlooked it in my research.)

I did learn a lot from the sparse literature that was available, and my main takeaway is this: security is related less to proctoring mode than it is to how much effort the test developer puts into administration planning and test design. Investing in solid administration policies, high-quality monitoring technology, and well-trained proctors is what really matters most for both in-person and online proctoring.

With some effort, testing programs with online proctors can likely achieve levels of security and service comparable to the services offered by many test centers. This came into focus for me after attending several recent seminars about online and in-person proctoring through the Association of Test Publishers (ATP) and Performance Testing Council (PTC).

The Standards for Educational and Psychological Testing provide a full list of considerations for organizations running any type of exam, but here are a few key points gleaned from the Standards and from PTC’s webinar (.wmv) to help you plan for online proctoring:

Control of the Environment

Unless a collaborator is onsite to set up and maintain the test environment, all security controls will need to be managed remotely. Here are suggestions for what you would need to do if you were a test program administrator under those circumstances:

  • Work with your online proctors to define the rules for acceptable test environments.
  • Ensure that test environment requirements are realistic for participants while still meeting your standards for security and comparability between administrations.
  • If security needs demand it, have monitoring equipment sent in advance (e.g., multiple cameras for improved monitoring, scanners to authenticate identification).
  • Clearly communicate policies to participants and get confirmation that they understand and can abide by your policies.
  • Plan policies for scenarios that might arise in an environment that is not managed by the test program administrator or proctor. For example, are you legally allowed to video someone who passes by in the background if they have not given their permission to be recorded? If not, have a policy in place stating that the participant is responsible for finding an isolated place to test. Do you or the proctoring company manage the location where the test is being delivered? If not, have a policy for who takes responsibility and absorbs the cost of an unexpected interruption like a fire alarm or power outage.

You should be prepared to document the comparability of administrations. This might include describing potential variations in the remote environment and how they may or may not impact the assessment results and security.

It is also advisable to audit some administrations to make sure that the testing environments comply with your testing program’s security policy. The online proctors’ incident reports should also be recorded in an administration report, just as they would with an in-person proctor.

Test Materials

You also need to make sure that everything needed to administer the test is provided, either physically or virtually.

  • Each participant must have the equipment and resources needed to take the test. If it is not reasonable to expect the participant to handle these tasks, you need to plan for someone else to do so, just as you would at a test center. For example, it might not be reasonable to expect some participant populations to know how to check whether the computer used for testing meets minimum software requirements.
  • If certain hardware (e.g., secured computers, cameras, scanners, microphones) or test materials (e.g., authorized references, scratch paper) are needed for the assessment design, you need to make sure these are available onsite for the participant and are collected afterwards.

Accommodations

Accommodations may take the form of physical or virtual test materials, but accommodations can also include additional services or some changes in the format of the assessment.

  • Some accommodations (e.g., extra time, large print) can be controlled by the assessment instrument or an online proctor, just as they would in a test center.
  • Other accommodations require special equipment or personnel onsite. Some personnel (e.g., scribes) may be able to provide their services remotely, but accommodations like tactile printouts of figures for the blind must be present onsite.

Extra effort is clearly needed when setting up an online-proctored test. Activities that might have been handled by a testing center (control of the environment, management of test materials, providing accommodations) now need to be remotely coordinated by the test program staff and proctors; however, the payoffs may be worth the extra effort. If comparable administration practices can be achieved, online-proctored assessments may be cheaper than test centers, offer increased access to participants, and lower the risks of collaborative cheating.

For more on online proctoring, check out this informational page and video.

Writing JTA Task Statements

Posted by Austin Fossey

One of the first steps in an evidence-centered design (ECD) approach to assessment development is a domain analysis. If you work in credentialing, licensure, or workplace assessment, you might accomplish this step with a job task analysis (JTA) study.

A JTA study gathers examples of tasks that potentially relate to a specific job. These tasks are typically harvested from existing literature or observations, reviewed by subject matter experts (SMEs), and rated by practitioners or other stakeholder groups across relevant dimensions (e.g., applicability to the job, frequency of the task). The JTA results are often used later to determine the content areas, cognitive processes, and weights that will be on the test blueprint.
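
As a concrete illustration of that last step, one common approach (though by no means the only one) is to combine the rating dimensions into a criticality index for each task and then aggregate the indices by content area to suggest blueprint weights. The sketch below uses hypothetical tasks, ratings, and a simple frequency-times-importance index:

```python
# Hypothetical mean ratings from JTA survey respondents (1-5 scales).
tasks = [
    # (task, content_area, mean_frequency, mean_importance)
    ("Measure skid marks",    "Traffic investigation", 3.2, 4.5),
    ("Interview witnesses",   "Traffic investigation", 4.1, 4.8),
    ("File incident reports", "Administration",        4.7, 3.9),
]

# Simple criticality index: frequency x importance (other schemes exist).
area_totals = {}
for _, area, freq, imp in tasks:
    area_totals[area] = area_totals.get(area, 0.0) + freq * imp

grand_total = sum(area_totals.values())
weights = {area: round(total / grand_total, 2) for area, total in area_totals.items()}
print(weights)  # {'Traffic investigation': 0.65, 'Administration': 0.35}
```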

 Questionmark has tools for authoring and delivering JTA items, as well as some limited analysis tools for basic response frequency distributions. But if we are conducting a JTA study, we need to start at the beginning: how do we write task statements?

One of my favorite sources on the subject is Mark Raymond and Sandra Neustel's chapter, "Determining the Content of Credentialing Examinations," in The Handbook of Test Development. The chapter provides information on how to organize a JTA study, how to write tasks, how to analyze the results, and how to use the results to build a test blueprint. The chapter is well written and easy to understand, providing enough detail to be useful without being too dense. If you are conducting a JTA study, I highly recommend checking out this chapter.

Raymond and Neustel explain that a task statement can refer to a physical or cognitive activity related to the job/practice. The format of a task statement should always follow a subject/verb/object format, though it might be expanded to include qualifiers for how the task should be executed, the resources needed to do the task, or the context of its application. They also underscore that most task statements should have only one action and one object. There are some exceptions to this rule, but if there are multiple actions and objects, they typically should be split into different tasks. As a hint, they suggest critiquing any task statement that has the words “and” or “or” in it.

Here is an example of a task statement from the Michigan Commission on Law Enforcement Standards’ Statewide Job Analysis of the Patrol Officer Position: Task 320: “[The patrol officer can] measure skid marks for calculation of approximate vehicle speed.”

I like this example because it is pretty specific, certainly better than just saying "determine vehicle's speed." It also provides a qualifier for how good the measurement needs to be ("approximate"). The statement might be improved by adding more context (e.g., "using a tape measure"), but that detail might already be understood by their participant population.
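
Raymond and Neustel's hint about "and" and "or" is easy to automate as a first-pass screen before SME review. Here is a minimal sketch; the second statement is hypothetical, and a flagged statement is only a candidate for splitting, not an automatic rejection.

```python
import re

def flag_compound_tasks(statements):
    """Return task statements that contain 'and' or 'or' as whole words."""
    pattern = re.compile(r"\b(and|or)\b", re.IGNORECASE)
    return [s for s in statements if pattern.search(s)]

statements = [
    "Measure skid marks for calculation of approximate vehicle speed",
    "Collect and label physical evidence at the scene",   # flagged for review
]
print(flag_compound_tasks(statements))
```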

Raymond and Neustel also caution researchers to avoid words that might have multiple meanings or vague meanings. For example, the verb “instruct” could mean many different things—the practitioner might be giving some on-the-fly guidance to an individual or teaching a multi-week lecture. Raymond and Neustel underscore the difficult balance of writing task statements at a level of granularity and specificity that is appropriate for accomplishing defined goals in the workplace, but at a high enough level that we do not overwhelm the JTA participants with minutiae. The authors also advise that we avoid writing task statements that describe best practice or that might otherwise yield a biased positive response.

Early in my career, I observed a JTA SME meeting for an entry-level credential in the construction industry. In an attempt to condense the task list, the psychometrician on the project combined a bunch of seemingly related tasks into a single statement—something along the lines of “practitioners have an understanding of the causes of global warming.” This is not a task statement; it is a knowledge statement, and it would be better suited for a blueprint. It is also not very specific. But most important, it yielded a biased response from the JTA survey sample. This vague statement had the words “global warming” in it, which many would agree is a pretty serious issue, so respondents ranked it as of very high importance. The impact was that this task statement heavily influenced the topic weighting of the blueprint, but when it came time to develop the content, there was not much that could be written. Item writers were stuck having to write dozens of items for a vague yet somehow very important topic. They ended up churning out loads of questions about one of the few topics that were relevant to the practice: refrigerants. The end result was a general knowledge assessment with tons of questions about refrigerants. This experience taught me how a lack of specificity and the phrasing of task statements can undermine the entire content validity argument for an assessment’s results.

If you are new to JTA studies, it is worth mentioning that a JTA can sometimes turn into a significant undertaking. I attended one of Mark Raymond's seminars earlier this year, and he observed anecdotally that he has had JTA studies take anywhere from three months to over a year. There are many psychometricians who specialize in JTA studies, and it may be helpful to work with them for some aspects of the project, especially when conducting a JTA for the first time. However, even if we use a psychometric consultant to conduct or analyze the JTA, learning about the process can make us better-informed consumers and allow us to handle some of the work internally, potentially saving time and money.


Example of task input screen for a JTA item in Questionmark Authoring.

For more information on JTA and other reporting tools available with Questionmark, check out this Reporting & Analytics page.

Question Type Report: Use Cases

Posted by Austin Fossey

A client recently asked me if there is a way to count the number of each type of item in their item bank, so I pointed them toward the Question Type Report in Questionmark Analytics. While this type of frequency data can also be easily pulled using our Results API, it can be useful to have a quick overview of the number of items (split out by item type) in the item bank.

The Question Type Report does not need to be run frequently (and Analytics usage stats reflect that observation), but the data can help indicate the robustness of an item bank.
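
If you ever need to reproduce this kind of frequency breakdown outside the report (for example, from item metadata you have exported yourself), a few lines of code will do it. The item records below are hypothetical and do not reflect the format of any Questionmark API response:

```python
from collections import Counter

# Hypothetical exported item metadata: (item_id, topic, item_type).
items = [
    ("Q1", "Safety", "Multiple Choice"),
    ("Q2", "Safety", "Multiple Choice"),
    ("Q3", "Safety", "Essay"),
    ("Q4", "Tools",  "Multiple Response"),
]

# Counts by item type across the whole bank...
print(Counter(item_type for _, _, item_type in items))

# ...and by item type within a single topic.
print(Counter(item_type for _, topic, item_type in items if topic == "Safety"))
```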

This report is most valuable when it is run against the topics used by a specific assessment or set of related assessments. While it might be nice to know that we have a total of 15,000 multiple choice (MC) items in the item bank, these counts are trivial unless we have a system-wide practical application, such as planning a full program translation or selling content to a partner.

This report can provide a quick profile of the population of the item bank or a topic when needed, though more detailed item tracking by status, topic, metatags, item type, and exposure is advisable for anyone managing a large-scale item development project. Below are some potential use cases for this simple report.

Test Development and Maintenance:
The Question Type Report's value is primarily its ability to count the number of each type of item within a topic. If we know we have 80 MC items in a topic for a new assessment, and they all need to be reviewed by a bias committee, then we can plan accordingly.

Form Building:
If we are equating multiple forms using a common-item design, the report can help us determine how many items go on each form and the degree to which the forms can overlap. Even if we only have one form, knowing the number of items can help a test developer check that enough items are available to match the blueprint.

Item Development:
If the report indicates that there are plenty of MC items ready for future publications, but we only have a handful of essay items to cover our existing assessment form, then we might instruct item writers to focus on developing new essay questions for the next publication of the assessment.


Example of a Question Type Report showing the frequency distribution by item type.

 

Simpson’s Paradox and the Steelyard Graph

Posted by Austin Fossey

If you work with assessment statistics or just about any branch of social science, you may be familiar with Simpson’s paradox—the idea that data trends between subgroups change or disappear when the subgroups are aggregated. There are hundreds of examples of Simpson’s paradox (and I encourage you to search some on the internet for kicks), but here is a simple example for the sake of illustration.

Simpson’s Paradox Example

Let us say that I am looking to get trained as a certified window washer so that I can wash windows on Boston’s skyscrapers. Two schools in my area offer training, and both had 300 students graduate last year. Graduates from School A had an average certification test score of 70.7%, and graduates from School B had an average score of 69.0%. Ignoring for the moment whether these differences are significant, as a student I will likely choose School A due to its higher average test scores.

But here is where the paradox happens. Consider now that I have a crippling fear of heights, which may be a hindrance for my window-washing aspirations. It turns out that School A and School B also track test scores for their graduates based on whether or not they have a fear of heights. The table below reports the average scores for these phobic subgroups.

Table: average certification test scores by school for graduates with and without a fear of heights.

Notice anything? The average score for people with and without a fear of heights in School B is higher than the same groups in School A. The paradox is that School A has a higher average test score overall, yet School B can boast better average test scores for students with a fear of heights and students without a fear of heights. School B’s overall average is lower because they simply had more students with a fear of heights. If we want to test the significance of these differences, we can do so with ANOVA.
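
Since the subgroup figures live in the table image, here is a small worked example with illustrative subgroup counts and scores that I have chosen to be consistent with the overall averages quoted above (70.7% and 69.0% across 300 graduates each); the exact subgroup values are assumptions, not the original data.

```python
# Illustrative data: {subgroup: (number of graduates, mean score %)}.
schools = {
    "School A": {"fear of heights": (100, 62.0), "no fear": (200, 75.05)},
    "School B": {"fear of heights": (200, 64.0), "no fear": (100, 79.0)},
}

for school, groups in schools.items():
    n_total = sum(n for n, _ in groups.values())
    overall = sum(n * mean for n, mean in groups.values()) / n_total
    subgroup_means = {group: mean for group, (_, mean) in groups.items()}
    print(school, subgroup_means, f"overall = {overall:.1f}")

# School B is higher within *both* subgroups (64.0 > 62.0 and 79.0 > 75.05),
# yet School A has the higher overall average (70.7 vs. 69.0) because School B
# simply has more graduates in the lower-scoring fear-of-heights subgroup.
```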

Gaviria and González-Barbera’s Steelyard Graph

Simpson's paradox occurs in many different fields, but it is sometimes difficult to explain to stakeholders. Tables (like the one above) are often used to illustrate the subgroup differences, but in the Fall 2014 issue of Educational Measurement, José-Luis Gaviria and Coral González-Barbera from the Universidad Complutense de Madrid won the publication's data visualization contest with their Steelyard Graph, which illustrates Simpson's paradox with a graph resembling a steelyard balance. The publication's visual editor, ETS's Katherine Furgol Castellano, wrote the discussion piece for the Steelyard Graph, praising Gaviria and González-Barbera for the simplicity of the approach and the novel yet astute strategy of representing averages with balanced levers.

The figure below illustrates the same data from the table above using Gaviria and González-Barbera’s Steelyard Graph approach. The size of the squares corresponds to the number of students, the location on the lever indicates the average subgroup score, and the triangular fulcrum represents the school’s overall average score. Notice how clear it is that the subgroups in School B have higher average scores than their counterparts in School A. The example below has only two subgroups, but the same approach can be used for more subgroups.


Example of Gaviria and González-Barbera’s Steelyard Graph to visualize Simpson’s paradox for subgroups’ average test scores.
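
If you want to experiment with the idea, here is a rough matplotlib sketch of the lever-and-fulcrum layout using the same illustrative numbers as in the worked example above. This is my approximation of the approach, not the authors' code, and the data are assumed.

```python
import matplotlib.pyplot as plt

# Same illustrative data as the worked example: (n graduates, mean score %).
schools = {
    "School A": {"fear of heights": (100, 62.0), "no fear": (200, 75.05)},
    "School B": {"fear of heights": (200, 64.0), "no fear": (100, 79.0)},
}

fig, axes = plt.subplots(len(schools), 1, sharex=True, figsize=(8, 4))
for ax, (school, groups) in zip(axes, schools.items()):
    ns = [n for n, _ in groups.values()]
    means = [mean for _, mean in groups.values()]
    overall = sum(n * m for n, m in zip(ns, means)) / sum(ns)
    ax.hlines(0, min(means) - 2, max(means) + 2, colors="black")  # the lever
    ax.scatter(means, [0] * len(means), s=ns, marker="s",
               c="steelblue")               # squares sized by subgroup n
    ax.scatter([overall], [-0.05], marker="^", s=150,
               c="darkred")                 # fulcrum = overall average
    ax.set_yticks([])
    ax.set_ylim(-0.2, 0.2)
    ax.set_title(school, loc="left")
axes[-1].set_xlabel("Average certification test score (%)")
plt.tight_layout()
plt.show()
```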

Making a Decision when Faced with Simpson’s Paradox

When one encounters Simpson’s paradox, decision-making can be difficult, especially if there are no theories to explain why the relational pattern is different at a subgroup level. This is why exploratory analysis often must be driven by and interpreted through a lens of theory. One could come up with arbitrary subgroups that reverse the aggregate relationships, even though there is no theoretical grounding for doing so. On the other hand, relevant subgroups may remain unidentified by researchers, though the aggregate relationship may still be sufficient for decision-making.

For example, as a window-washing student seeing the phobic subgroups’ performances, I might decide that School B is the superior school for teaching the trade, regardless of which subgroup a student belongs to. This decision is based on a theory that a fear of heights may impact performance on the certification assessment, in which case School B does a better job at preparing both subgroups for their assessments. If that theory is not tenable, it may be that School A is really the better choice, but as an acrophobic would-be window washer, I will likely choose School B after seeing this graph . . . as long as the classroom is located on the ground floor.