Checklists for Test Development

Posted by Austin Fossey

There are many fantastic books about test development, and there are many standards systems for test development, such as The Standards for Educational and Psychological Testing. There are also principled frameworks for test development and design, such as evidence-centered design (ECD). But it seems that the supply of qualified test developers cannot keep up with the increased demand for high-quality assessment data, leaving many organizations to piece together assessment programs, learning as they go.

As one might expect, this scenario leads to new tools targeted at these rookie test developers—simplified guidance documents, trainings, and resources attempting to idiot-proof test development. As a case in point, Questionmark seeks to distill information from a variety of sources into helpful, easy-to-follow white papers and blog posts. At an even simpler level, there appears to be increased demand for checklists that new test developers can use to guide test development or evaluate assessments.

For example, my colleague, Bart Hendrickx, shared a Dutch article from the Research Center for Examination and Certification (RCEC) at the University of Twente describing their Beoordelingssysteem. He explained that this system provides a rubric for evaluating educational assessments in areas like representativeness, reliability, and standard setting. The Buros Center for Testing addresses similar needs for users of mental assessments. In the Assessment Literacy section of their website, Buros has documents with titles like “Questions to Ask When Evaluating a Test”—essentially an evaluation checklist (though Buros also provides their own professional ratings of published assessments). There are even assessment software packages that seek to operationalize a test development checklist by creating a rigid workflow that guides the test developer through the different steps of the design process.
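To make the idea concrete, here is a minimal sketch of what an evaluation checklist might look like as a simple data structure. The evaluation areas are borrowed from the rubric described above, but the specific questions, the 0 to 2 rating scale, and the scoring rule are illustrative assumptions rather than the actual RCEC or Buros instruments.

```python
from dataclasses import dataclass

@dataclass
class ChecklistItem:
    area: str        # e.g., "Reliability"
    question: str    # the question an evaluator answers about the assessment
    rating: int = 0  # hypothetical 0 to 2 scale: 0 = absent, 1 = partial, 2 = adequate

checklist = [
    ChecklistItem("Representativeness", "Does the blueprint cover the intended domain?"),
    ChecklistItem("Reliability", "Is a reliability estimate reported and adequate for the stakes?"),
    ChecklistItem("Standard setting", "Is the cut score supported by a documented method?"),
]

def summarize(items):
    """Average the ratings within each area so weak spots stand out."""
    by_area = {}
    for item in items:
        by_area.setdefault(item.area, []).append(item.rating)
    return {area: sum(r) / (2 * len(r)) for area, r in by_area.items()}  # proportion of maximum

print(summarize(checklist))  # all zeros until an evaluator fills in ratings
```

Even a toy structure like this makes the trade-off visible: the checklist is easy to apply, but any criterion not on the list simply never gets asked.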

The benefit of these resources is that they can help guide new test developers through basic steps and considerations as they build their instruments. It is certainly a step up from a company compiling a bunch of multiple choice questions on the fly and setting a cut score of 70% without any backing theory or test purpose. On the other hand, test development is supposed to be an iterative process, and without the flexibility to explore the nuances and complexities of the instrument, the results and the inferences may fall short of their targets. An overly simple, standardized checklist for developing or evaluating assessments may not consider an organization’s specific measurement needs, and the program may be left with considerable blind spots in its validity evidence.

Overall, I am glad to see that more organizations want to improve the quality of their measurements, and it is encouraging to see more training resources to help new test developers tackle the learning curve. Checklists can be very helpful tools in a lot of applications, and test developers frequently create their own checklists to standardize practices within their organizations, such as item reviews.

What do our readers think? Are checklists the way to go? Do you use a checklist from another organization in your test development?

Writing JTA Task Statements

Posted by Austin Fossey

One of the first steps in an evidence-centered design (ECD) approach to assessment development is a domain analysis. If you work in credentialing, licensure, or workplace assessment, you might accomplish this step with a job task analysis (JTA) study.

A JTA study gathers examples of tasks that potentially relate to a specific job. These tasks are typically harvested from existing literature or observations, reviewed by subject matter experts (SMEs), and rated by practitioners or other stakeholder groups across relevant dimensions (e.g., applicability to the job, frequency of the task). The JTA results are often used later to determine the content areas, cognitive processes, and weights that will be on the test blueprint.
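As a rough illustration of that last step, here is a minimal sketch of one common way ratings get rolled up into blueprint weights: summing importance-times-frequency within each content area and normalizing. The rating scales, the aggregation rule, and the task data below are illustrative assumptions; real JTA studies use a variety of weighting schemes.

```python
from collections import defaultdict

# Each tuple: (content area, mean importance rating, mean frequency rating) for one task.
# The areas, scales, and values are made up for illustration.
task_ratings = [
    ("Traffic investigation", 4.2, 3.1),
    ("Traffic investigation", 3.8, 2.5),
    ("Report writing",        4.6, 4.8),
]

def blueprint_weights(ratings):
    """Sum importance x frequency within each area, then normalize to proportions."""
    area_totals = defaultdict(float)
    for area, importance, frequency in ratings:
        area_totals[area] += importance * frequency
    grand_total = sum(area_totals.values())
    return {area: total / grand_total for area, total in area_totals.items()}

print(blueprint_weights(task_ratings))
# roughly {'Traffic investigation': 0.50, 'Report writing': 0.50} for these made-up ratings
```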

 Questionmark has tools for authoring and delivering JTA items, as well as some limited analysis tools for basic response frequency distributions. But if we are conducting a JTA study, we need to start at the beginning: how do we write task statements?

One of my favorite sources on the subject is Mark Raymond and Sandra Neustel’s chapter, “Determining the Content of Credentialing Examinations,” in The Handbook of Test Development. The chapter provides information on how to organize a JTA study, how to write tasks, how to analyze the results, and how to use the results to build a test blueprint. The chapter is well-written, and easy to understand. It provides enough detail to make it useful without being too dense. If you are conducting a JTA study, I highly recommend checking out this chapter.

Raymond and Neustel explain that a task statement can refer to a physical or cognitive activity related to the job/practice. The format of a task statement should always follow a subject/verb/object format, though it might be expanded to include qualifiers for how the task should be executed, the resources needed to do the task, or the context of its application. They also underscore that most task statements should have only one action and one object. There are some exceptions to this rule, but if there are multiple actions and objects, they typically should be split into different tasks. As a hint, they suggest critiquing any task statement that has the words “and” or “or” in it.
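Their “and”/“or” hint is easy to turn into a quick screening script. The sketch below is just a heuristic pass over draft statements under that assumption; it flags candidates for SME review rather than deciding anything on its own.

```python
import re

def flag_compound_statements(statements):
    """Return draft task statements containing 'and' or 'or' as whole words,
    since they may bundle more than one action or object."""
    pattern = re.compile(r"\b(and|or)\b", re.IGNORECASE)
    return [s for s in statements if pattern.search(s)]

drafts = [
    "Measure skid marks for calculation of approximate vehicle speed.",
    "Interview witnesses and photograph the scene.",  # likely two separate tasks
]

print(flag_compound_statements(drafts))
# ['Interview witnesses and photograph the scene.']
```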

Here is an example of a task statement from the Michigan Commission on Law Enforcement Standards’ Statewide Job Analysis of the Patrol Officer Position: Task 320: “[The patrol officer can] measure skid marks for calculation of approximate vehicle speed.”

I like this example because it is pretty specific, certainly better than just saying “determine vehicle’s speed.” It also provides a qualifier for how precise the measurement needs to be (“approximate”). The statement might be improved by adding more context (e.g., “using a tape measure”), but that detail might already be understood by their participant population.

Raymond and Neustel also caution researchers to avoid words that might have multiple meanings or vague meanings. For example, the verb “instruct” could mean many different things—the practitioner might be giving some on-the-fly guidance to an individual or teaching a multi-week lecture. Raymond and Neustel underscore the difficult balance of writing task statements at a level of granularity and specificity that is appropriate for accomplishing defined goals in the workplace, but at a high enough level that we do not overwhelm the JTA participants with minutiae. The authors also advise that we avoid writing task statements that describe best practice or that might otherwise yield a biased positive response.

Early in my career, I observed a JTA SME meeting for an entry-level credential in the construction industry. In an attempt to condense the task list, the psychometrician on the project combined a bunch of seemingly related tasks into a single statement, something along the lines of “practitioners have an understanding of the causes of global warming.” This is not a task statement; it is a knowledge statement, and it would be better suited for a blueprint. It is also not very specific. Most important, it yielded a biased response from the JTA survey sample: because the vague statement mentioned “global warming,” which many would agree is a serious issue, respondents rated it as very important. As a result, this task statement heavily influenced the topic weighting of the blueprint, but when it came time to develop the content, there was not much that could be written. Item writers were stuck having to write dozens of items for a vague yet somehow very important topic. They ended up churning out loads of questions about one of the few topics that were relevant to the practice: refrigerants. The end result was a general knowledge assessment with tons of questions about refrigerants. This experience taught me how a lack of specificity and the phrasing of task statements can undermine the entire content validity argument for an assessment’s results.

If you are new to JTA studies, it is worth mentioning that a JTA can sometimes turn into a significant undertaking. I attended one of Mark Raymond’s seminars earlier this year, and he observed anecdotally that his JTA studies have taken anywhere from three months to over a year. There are many psychometricians who specialize in JTA studies, and it may be helpful to work with them for some aspects of the project, especially when conducting a JTA for the first time. However, even if we use a psychometric consultant to conduct or analyze the JTA, learning about the process can make us better-informed consumers and allow us to handle some of the work internally, potentially saving time and money.

Example of task input screen for a JTA item in Questionmark Authoring.

For more information on JTA and the other reporting tools that are available with Questionmark, check out this Reporting & Analytics page.

Item Development – Managing the Process for Large-Scale Assessments

Posted by Austin Fossey

Whether you work with low-stakes assessments, small-scale classroom assessments, or large-scale, high-stakes assessments, understanding and applying some basic principles of item development will greatly enhance the quality of your results.

This is the first in a series of posts setting out item development steps that will help you create defensible assessments. Although I’ll be addressing the requirements of large-scale, high-stakes testing, the fundamental considerations apply to any assessment.

You can find previous posts here about item development including how to write items, review items, increase complexity, and avoid bias. This series will review some of what’s come before, but it will also explore new territory. For instance, I’ll discuss how to organize and execute different steps in item development with subject matter experts. I’ll also explain how to collect information that will support the validity of the results and the legal defensibility of the assessment.

In this series, I’ll take a look at:

[Figure: item development steps]

These are common steps (adapted from Crocker and Algina’s Introduction to Classical and Modern Test Theory) taken to create the content for an assessment. Each step requires careful planning, implementation, and documentation, especially for high-stakes assessments.

This looks like a lot of steps, but item development is just one slice of assessment development. Before item development can even begin, there’s plenty of work to do!

In their article, “Design and Discovery in Educational Assessment: Evidence-Centered Design, Psychometrics, and Educational Data Mining,” Mislevy, Behrens, DiCerbo, and Levy provide an overview of Evidence-Centered Design (ECD). In ECD, test developers must define the purpose of the assessment, conduct a domain analysis, model the domain, and define the conceptual assessment framework before beginning assessment assembly, which includes item development.

Once we’ve completed these preparations, we are ready to begin item development. In the next post, I will discuss considerations for training our item writers and item reviewers.

Discussing data mining at NCME

Posted by Austin Fossey

We will wrap up our discussion of themes at the National Council on Measurement in Education (NCME) annual meeting with an overview of the inescapable discussion about working with complex (and often messy) data sets.

It was clear from many of the presentations and poster sessions that technology is driving the direction of assessment, for better or for worse (or as Damian Betebenner put it, “technology eats statistics”). Advances in technology have allowed researchers to examine new statistical models for scoring participants, identify aberrant responses, score performance tasks, identify sources of construct-irrelevant variance, diversify item formats, and improve reporting methods.

As the symbiotic knot between technology and assessment grows tighter, many researchers and test developers are in the unexpected position of having too much data. This is especially true in complex assessment environments that yield log files with staggering amounts of information about a participant’s actions within an assessment.

Log files can track many types of data in an assessment, such as responses, click streams, and system states. All of these data are time stamped, and if they capture the right data, they can illuminate some of the cognitive processes that are manifesting themselves through the participant’s interaction with the assessment. Raw assessment data like Questionmark’s Results API OData Feeds can also be coupled with institutional data, thus exponentially growing the types of research questions we can pursue within a single organization.
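As a small illustration of why time-stamped events are so useful, here is a minimal sketch that pairs “presented” and “answered” events to estimate time spent on each item. The event schema and field names are assumptions made for this example, not the actual format of Questionmark’s Results API OData feeds or any particular log file.

```python
from datetime import datetime

# Hypothetical time-stamped events; real log files would have many more event types.
events = [
    {"participant": "p1", "item": "q1", "event": "presented", "time": "2015-06-01T10:00:00"},
    {"participant": "p1", "item": "q1", "event": "answered",  "time": "2015-06-01T10:01:30"},
]

def time_on_item(log):
    """Pair each item's 'presented' and 'answered' events to estimate seconds spent."""
    opened, durations = {}, {}
    for e in sorted(log, key=lambda x: x["time"]):  # ISO timestamps sort lexically
        key = (e["participant"], e["item"])
        stamp = datetime.fromisoformat(e["time"])
        if e["event"] == "presented":
            opened[key] = stamp
        elif e["event"] == "answered" and key in opened:
            durations[key] = (stamp - opened[key]).total_seconds()
    return durations

print(time_on_item(events))  # {('p1', 'q1'): 90.0}
```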

NCME attendees learned about hardware and software that captures both response variables and behavioral variables from participants as they complete an online learning task.

Several presenters discussed issues and strategies for addressing less-structured data, with many papers tackling log file data gathered as participants interact with an online assessment or other online task. Ryan Baker (International Educational Data Mining Society) gave a talk about combining the data mining of log files with field observations to identify hard-to-capture domains, like student engagement.

Baker focused on the positive aspects of having oceans of data, choosing to remain optimistic about what we can do rather than dwell on the difficulties of iterative model building in these types of research projects. He shared examples of intelligent tutoring systems designed to teach students while also gathering data about the student’s level of engagement with the lesson. These examples were peppered with entertaining videos of the researchers in classrooms playing with their phones so that individual students would not realize that they were being subtly observed by the researcher via sidelong glances.

Evidence-centered design (ECD) emerged as a consistent theme: there was a lot of conversation about how researchers are designing assessments so that they yield fruitful data for the intended inferences. Nearly every presentation about assessment development referenced ECD. Valerie Shute (Florida State University) observed that five years ago, only a fraction of participants would have known about ECD, but today it is widely used by practitioners.

Discussing the revised Standards for Educational and Psychological Testing

Posted by Austin Fossey

I just returned from the National Council on Measurement in Education (NCME) annual meeting in Philadelphia, which is held in conjunction with the American Educational Research Association (AERA) annual meeting.

There were many big themes around research and advances in assessment, but there were also a lot of interesting discussions about changes in practice. There seemed to be a great deal of excitement and perhaps some trepidation about the upcoming release of the next version of the Standards for Educational and Psychological Testing, which is the authority on requirements for good assessment design and implementation, and which has not been updated since 1999.

There were two big discussion sessions about the Standards during the conference. The first was a two-hour overview hosted by Wayne Camara (ACT) and Suzanne Lane (University of Pittsburgh). Presenters from several organizations summarized the changes to the various chapters in the Standards. In the second discussion, Joan Herman (UCLA/CRESST) hosted a panel that talked about how these changes might impact the practices that we use to develop and deliver assessments.

During the panel discussion, the chapter about Fairness came up several times. This appears to be an area where the Standards are taking a more detailed approach, especially with regard to the use of testing accommodations. From the discussion, it sounds like the next version will have better guidance about best practices for various accommodations and for documenting that those accommodations properly minimize construct-irrelevant variance without giving participants any unfair advantages over the general population.

During the discussion, Scott Marion (Center for Assessment) observed that the new Standards do not address Fairness in the context of some delivery mechanisms (as opposed to the delivery conditions) in assessment. For example, he noted that computer-adaptive tests (CATs) use item selection algorithms that are based on the general population, but there is no requirement to research whether the adaptation works comparably in subpopulations, such as students with cognitive disabilities who might be eligible for other accommodations like extra time.
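For readers unfamiliar with that selection step, the sketch below shows the usual idea in its simplest form: pick the unadministered item with the most Fisher information at the participant’s current ability estimate, here under a 2PL model with made-up item parameters. It is only a schematic of the general approach, not any specific operational CAT; Marion’s point is that the comparability of this step across subpopulations still has to be studied.

```python
import math

# Toy item bank: a = discrimination, b = difficulty (2PL parameters, made up).
item_bank = [
    {"id": "q1", "a": 1.2, "b": -0.5},
    {"id": "q2", "a": 0.8, "b": 0.3},
    {"id": "q3", "a": 1.5, "b": 1.0},
]

def information(item, theta):
    """Fisher information of a 2PL item at ability theta: a^2 * p * (1 - p)."""
    p = 1.0 / (1.0 + math.exp(-item["a"] * (theta - item["b"])))
    return item["a"] ** 2 * p * (1.0 - p)

def select_next_item(theta, administered):
    """Pick the unadministered item with maximum information at the current theta."""
    candidates = [i for i in item_bank if i["id"] not in administered]
    return max(candidates, key=lambda i: information(i, theta))

print(select_next_item(theta=0.0, administered={"q1"})["id"])  # 'q3' for this toy bank
```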

The panelists also mentioned that some of the standards have been written so that the language mirrors the principles of evidence-centered design (ECD), though the Standards do not specifically mention ECD outright. This seems like a logical step for the Standards, as nearly every presentation I attended about assessment development referenced ECD. Valerie Shute (Florida State University) observed that five years ago, only a fraction of participants would have known about ECD, but today it is widely used. Though ECD was around several years before the 1999 Standards, it did not have the following that it does today.

In general, it sounds like most of the standards we know and love will remain intact, and the revisions are primarily serving to provide more clarity or to accommodate the changing practices in assessment development. Nearly all of the presenters work on large-scale, high-stakes assessments that have been developed under the 1999 Standards, and many of them mentioned that they are already committing themselves to review their programs and documentation against the new Standards when they are published later this year.

Teaching to the test and testing to what we teach

Posted by Austin Fossey

We have all heard assertions that widespread assessment creates a propensity for instructors to “teach to the test.” This often conjures images of students memorizing facts without context in order to eke out passing scores on a multiple choice assessment.

But as Jay Phelan and Julia Phelan argue in their essay, “Teaching to the (Right) Test,” teaching to the test is usually problematic only when we have a faulty test. When our curriculum, instruction, and assessment are aligned, teaching to the test can be beneficial because we are testing what we taught. We can flip this around and assert that we should be testing to what we teach.

There is little doubt that poorly-designed assessments have made their way into some slices of our educational and professional spheres. Bad assessment designs can stem from shoddy domain modeling, improper item types, or poor reporting.

Nevertheless, valid, reliable, and actionable assessments can improve learning and performance. When we teach to a well-designed assessment, we should be teaching what we would have taught anyway, but now we have a meaningful measurement instrument that can help students and instructors improve.

I admit that there are constructs like creativity and teamwork that are more difficult to define, and appropriate assessment of these learning goals can be challenging. We may instinctively cringe at the thought of assessing an area like creativity—I would hate to see a percentage score assigned to my creativity.

But if creativity is a learning goal, we should be collecting evidence that helps us support the argument that our students are learning to be creative. A multiple choice test may be the wrong tool for that job, but we can use frameworks like evidence-centered design (ECD) to decide what information we want to collect (and the best methods for collecting it) to demonstrate our students’ creativity.

Assessments have evolved a lot over the past 25 years, and with better technology and design, test developers can improve the validity of the assessments and their utility in instruction. This includes new item types, simulation environments, improved data collection, a variety of measurement models, and better reporting of results. In some programs, the assessment is actually embedded in the everyday work or games that the participant would be interacting with anyway—a strategy that Valerie Shute calls stealth assessment.

With a growing number of tools available to us, test developers should always be striving to improve how we test what we teach so that we can proudly teach to the test.