High-stakes assessment: It’s not just about test takers

Lance bio picPosted by

In my last post I spent some time defining how I think about high-stakes assessment. I also talked about how these assessments affect the people who take them, including how much they matter to a person’s ability to get or do a job.

Now I want to talk a little bit about how these assessments affect the rest of us.

The rest of us

Guess what? The rest of us are affected by the outcomes of these assessments. Did you see that coming?

But seriously, the credentials or scores that result from these assessments affect large swathes of the public. Ultimately that’s the point of high-stakes assessment. The resulting certifications and licenses exist to protect the public. These assessments are acting as barriers preventing incompetent people from practicing professions where competency really matters.

It really matters

What are some examples of “really matters”? Well, when hiring, it really matters to employers that the network techs they hire know how to configure a network securely, not that the techs just say they do. It matters to the people crossing a bridge that the engineers who designed it knew their physics. It really matters to every one of us that our doctor, dentist, nurse, or surgeon knows what they are doing when they treat us. It really matters to society at large when we measure (well) the children and adults who take large-scale assessments like college entrance exams.

At the end of the day, high-stakes exams are high-stakes because in a very real way, almost all of us have a stake in their outcome.

Separating the wheat from the chaff

There are a couple of ways that high-stakes assessments do what they do. Some assessments are simply designed to measure “minimal competence,” with test takers ending either above the line (often known as “passing”) or below the line: the dreaded “fail.”

Other assessments are designed to place test takers on a continuum of ability. This type of assessment assigns scores to test takers, and the range of scores often appears odd to laypeople. For example, the SAT uses a 200–800 scale.

Want to learn more? Hang on till next time!

When to Give Partial Credit for Multiple-Response Items

Posted by Austin Fossey

Three different customers recently asked me how to decide between scoring a multiple-response (MR) item dichotomously or polytomously; i.e., when should an MR item be scored right/wrong, and when should we give partial credit? I gave some garrulous, rambling answers, so the challenge today is for me to explain this in a single blog post that I can share the next time it comes up.

In their chapter on multiple-choice and matching exercises in Educational Assessment of Students (5th ed.), Anthony Nitko and Susan Brookhart explain that matching items (which we may extend to include MR item formats, drag-and-drop formats, survey-matrix formats, etc.) are often a collection of single-response multiple choice (MC) items. The advantage of the MR format is that it saves space and you can leverage dependencies in the questions (e.g., relationships between responses) that might be redundant if broken into separate MC items.

Given that an MR item is often a set of individually scored MC items, a polytomously scored format almost always makes sense. From an interpretation standpoint, there are a couple of advantages for you as a test developer or instructor. First, you can differentiate between participants who know some of the answers and those who know none of the answers. This can improve the item discrimination. Second, you have more flexibility in how you choose to score and interpret the responses. In the drag-and-drop example below (a special form of an MR item), the participant has all of the dates wrong; however, the instructor may still be interested in knowing that the participant knows the correct order of events for the Stamp Act, the Townshend Act, and the Boston Massacre.

Example of a drag-and-drop item in Questionmark where the participant’s responses are wrong, but the order of responses is partially correct.
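To make the difference between these scoring choices concrete, here is a minimal Python sketch. It is not how Questionmark scores items; the function names, the answer key, and the sample response are all assumptions made for illustration. It scores one hypothetical matching response three ways: all-or-nothing, one point per correct match, and one point per pair of events placed in the correct relative order.

```python
from itertools import combinations

# Hypothetical answer key for a date-matching item (illustrative only).
KEY = {"Stamp Act": 1765, "Townshend Act": 1767, "Boston Massacre": 1770}


def score_dichotomous(response, key):
    """Return 1 only if every event is matched to its keyed date, else 0."""
    return 1 if all(response.get(event) == date for event, date in key.items()) else 0


def score_polytomous(response, key):
    """Return one point for each event matched to its keyed date."""
    return sum(1 for event, date in key.items() if response.get(event) == date)


def score_order(response, key):
    """Return one point for each pair of events placed in the correct relative order."""
    points = 0
    for first, second in combinations(key, 2):
        if first not in response or second not in response:
            continue  # an unanswered event earns no order credit
        if response[first] == response[second]:
            continue  # two events on the same date show no ordering
        if (response[first] < response[second]) == (key[first] < key[second]):
            points += 1
    return points


# Every date is wrong, but the sequence of events is right.
response = {"Stamp Act": 1767, "Townshend Act": 1770, "Boston Massacre": 1773}
print(score_dichotomous(response, KEY))  # 0
print(score_polytomous(response, KEY))   # 0
print(score_order(response, KEY))        # 3
```

Only the order-based rule separates this participant from one who answered at random, which is the kind of interpretive flexibility a partial-credit design allows.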

Are there exceptions? You know there are. This is why it is important to have a test blueprint document, which can help clarify which item formats to use and how they should be evaluated. Consider the following two variations of a learning objective on a hypothetical CPR test blueprint:

  • The participant can recall the actions that must be taken for an unresponsive victim requiring CPR.
  • The participant can recall all three actions that must be taken for an unresponsive victim requiring CPR.

The second example is likely the one that the test developer would use for the test blueprint. Why? Because someone who knows two of the three actions is not going to cut it. This is a rare all-or-nothing scenario where knowing some of the answers is essentially the same (from a qualifications standpoint) as knowing none of the answers. The language in this learning objective (“recall all three actions”) is an indicator to the test developer that if they use an MR item to assess this learning objective, they should score it dichotomously (no partial credit). The example below shows how one might design an item for this hypothetical learning objective with Questionmark’s authoring tools:

Example of a Questionmark authoring screen for an MR item that is scored dichotomously (right/wrong).

To summarize, a test blueprint document is the best way to decide if an MR item (or variant) should be scored dichotomously or polytomously. If you do not have a test blueprint, think critically about what you are trying to measure and the interpretations you want reflected in the item score. Partial-credit scoring is desirable in most use cases, though there are occasional scenarios where an all-or-nothing scoring approach is needed—in which case the item can be scored strictly right/wrong. Finally, do not forget that you can score MR items differently within an assessment. Some MR items can be scored polytomously and others can be scored dichotomously on the same test, though it may be beneficial to notify participants when scoring rules differ for items that use the same format.
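As a rough illustration of mixing scoring rules on one test, the Python sketch below attaches a scoring mode to each MR item, as a blueprint might specify, and scores accordingly. The `MRItem` class, the option labels, and the partial-credit rule (correct selections minus incorrect selections, floored at zero) are assumptions chosen for the example. They are not a Questionmark feature or a recommended standard.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MRItem:
    """A hypothetical multiple-response item with a blueprint-assigned scoring mode."""
    item_id: str
    keyed: frozenset  # the options that should be selected
    scoring: str      # "dichotomous" or "polytomous"


def score_item(item, selected):
    """Score one response according to the item's own scoring mode."""
    selected = frozenset(selected)
    if item.scoring == "dichotomous":
        # All-or-nothing: the selection must match the key exactly.
        return 1 if selected == item.keyed else 0
    if item.scoring == "polytomous":
        # Assumed partial-credit rule: correct minus incorrect, never below zero.
        correct = len(selected & item.keyed)
        incorrect = len(selected - item.keyed)
        return max(0, correct - incorrect)
    raise ValueError(f"unknown scoring mode: {item.scoring}")


# Placeholder option labels; the real CPR actions are not listed here.
cpr = MRItem("cpr_actions", frozenset({"action_1", "action_2", "action_3"}), "dichotomous")
history = MRItem("colonial_events", frozenset({"a", "b", "c"}), "polytomous")

print(score_item(cpr, {"action_1", "action_2"}))  # 0: two of three is not enough
print(score_item(history, {"a", "b"}))            # 2: partial credit
```

Keeping the scoring mode on the item, and not in the scoring code, makes it easier to tell participants which items are all-or-nothing.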

If you are interested in understanding and applying some basic principles of item development and enhancing the quality of your results, download the free white paper written by Austin: Managing Item Development for Large-Scale Assessment

Item Development – Benefits of editing items before the review process

Posted by Austin Fossey

Some test developers recommend a single round of item editing (or editorial review), usually right before items are field tested. When schedules and resources allow for it, I recommend that test developers conduct two rounds of editing—one right after the items are written and one after content and bias reviews are completed. This post addresses the first round of editing, to take place after items are drafted.

Why have two rounds of editing? In both rounds, we will be looking for grammar or spelling errors, but the first round serves as a filter to keep items with serious flaws from making it to content review or bias review.

In their chapter in Educational Measurement (4th ed.), Cynthia Schmeiser and Catherine Welch explain that an early round of item editing “serves to detect and correct deficiencies in the technical qualities of the items and item pools early in the development process.” They recommend that test developers use this round of item editing to do a cursory review of whether the items meet the Standards for Educational and Psychological Testing.

Items that have obvious item writing flaws should be culled in the first round of item editing and either sent back to the item writers or removed. This may include item writing errors like cluing or having options that do not match the stem grammatically. Ideally, these errors will be caught and corrected in the drafting process, but a few items may have slipped through the cracks.

In the initial round of editing, we will also be looking for proper formatting of the items. Did the item writers use the correct item types for the specified content? Did they follow the formatting rules in our style guide? Is all supporting content (e.g., pictures, references) present in the item? Did the item writers record all of the metadata for the item, like its content area, cognitive level, or reference? Again, if an item does not match the required format, it should be sent back to the item writers or removed.
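Some of these formatting questions can be screened automatically before a human editor looks at the item. The Python sketch below is one hedged way to do that. The dictionary layout, the field names, and the flag wording are assumptions for illustration and do not describe any particular authoring tool.

```python
# Metadata fields assumed to be required by a hypothetical style guide.
REQUIRED_METADATA = ("content_area", "cognitive_level", "reference")


def first_round_flags(item, allowed_types):
    """Return a list of formatting problems found in one drafted item."""
    flags = []
    if item.get("item_type") not in allowed_types:
        flags.append("item type not allowed")
    metadata = item.get("metadata", {})
    for field in REQUIRED_METADATA:
        if not metadata.get(field):
            flags.append(f"missing metadata: {field}")
    attached = set(item.get("attached_files", []))
    for name in item.get("referenced_files", []):
        if name not in attached:
            flags.append(f"missing supporting content: {name}")
    return flags


draft = {
    "item_type": "essay",
    "metadata": {"content_area": "CPR", "cognitive_level": "recall"},
    "referenced_files": ["chart.png"],
    "attached_files": [],
}
for flag in first_round_flags(draft, {"multiple_choice", "multiple_response"}):
    print(flag)
# item type not allowed
# missing metadata: reference
# missing supporting content: chart.png
```

An item that returns any flags would go back to the item writer. Checks like cluing or stem-option grammar still need a human editor.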

It is helpful to look for these issues before going to content review or bias review because these types of errors may distract your review committees from their tasks; the committees may be wasting time reviewing items that should not be delivered anyway due to formatting flaws. You do not want to get all the way through content and bias reviews only to find that a large number of your items have to be returned to the drafting process. We will discuss review committee processes in the following posts.

For best practice guidance and practical advice for the five key stages of test and exam development, check out our white paper: 5 Steps to Better Tests.

Early-bird savings on conference registration end today: Sign up now!

Posted by Joan Phaup

Just a reminder that you can save $200 if you register today for the Questionmark 2014 Users Conference.

We look forward to seeing you March 4 – 7 in San Antonio, Texas, for three intensive days of learning and networking.

Check out the conference program as it continues to take shape, and sign up today!

This conference truly is the best place to learn about our technologies, improve your assessments and discuss best practices with Questionmark staff, industry experts and your colleagues. But don’t take my word for it. Let these attendees at the 2013 conference tell you what they think:



Teaching to the test and testing to what we teach

Posted by Austin Fossey

We have all heard assertions that widespread assessment creates a propensity for instructors to “teach to the test.” This often conjures images of students memorizing facts without context in order to eke out passing scores on a multiple choice assessment.

But as Jay Phelan and Julia Phelan argue in their essay, Teaching to the (Right) Test, teaching to the test is usually problematic when we have a faulty test. When our curriculum, instruction, and assessment are aligned, teaching to the test can be beneficial because we are testing what we taught. We can flip this around and assert that we should be testing to what we teach.

There is little doubt that poorly designed assessments have made their way into some slices of our educational and professional spheres. Bad assessment designs can stem from shoddy domain modeling, improper item types, or poor reporting.

Nevertheless, valid, reliable, and actionable assessments can improve learning and performance. When we teach to a well-designed assessment, we should be teaching what we would have taught anyway, but now we have a meaningful measurement instrument that can help students and instructors improve.

I admit that there are constructs like creativity and teamwork that are more difficult to define, and appropriate assessment for these learning goals can be difficult. We may instinctively cringe at the thought of assessing an area like creativity—I would hate to see a percentage score assigned to my creativity.

But if creativity is a learning goal, we should be collecting evidence that helps us support the argument that our students are learning to be creative. A multiple choice test may be the wrong tool for that job, but we can use frameworks like evidence-centered design (ECD) to decide what information we want to collect (and the best methods for collecting it) to demonstrate our students’ creativity.

Assessments have evolved a lot over the past 25 years, and with better technology and design, test developers can improve the validity of the assessments and their utility in instruction. This includes new item types, simulation environments, improved data collection, a variety of measurement models, and better reporting of results. In some programs, the assessment is actually embedded in the everyday work or games that the participant would be interacting with anyway—a strategy that Valerie Shute calls stealth assessment.

With a growing number of tools available to us, test developers should always be striving to improve how we test what we teach so that we can proudly teach to the test.

To Your Health! Good practice for competency testing in laboratories

Posted by John Kleeman

In the world of health care, from pathology labs to medical practitioners to pharmaceutical manufacturers, a mistake can mean much more than a regulatory fine or losing money – people’s lives and health are at stake. Hospitals, laboratories and other medical organizations have large numbers of people and need effective systems to make them work well together.

I’ve been learning about how assessments are used in the health care sector. Here is the first of a series of blog articles on the theme of “learning from health care”.

In this article, I’d like to share some of what I’ve learned about how pathology and other health care laboratories approach competency assessment. Laboratory personnel have to work tirelessly and in an error-free way to give good quality, reliable pathology results. And mistakes cost – as the US College of American Pathologists (CAP) state in their trademarked motto “Every number is a life”. I think there is a lot we can all learn from how they do competency testing.

A good place to start is with the World Health Organization (WHO). Their training on personnel management reminds us that “personnel are the most important laboratory resource” and they promote competency assessment based on a job description and task-specific training, in the sequence shown in their diagram:

Job Description -> Task-specific Training -> Competency Assessment -> Competency Recognition

WHO advise that competency assessments should be conducted regularly (usually once or twice a year) and they recommend observational assessments for many areas of competence:  “Observation is the most time-consuming way to assess employee competence, but this method is advised when assessing the areas that may have a higher impact on patient care.” Their key steps for conducting observational assessments are:

  • The assessor agrees a time for the assessment with the employee in advance
  • The assessment is done on routine work tasks
  • To avoid subjectivity and bias, the assessment should be recorded on a fixed checklist, with everyone assessed the same way
  • The results of the assessment are recorded, kept confidential but shared with the employee
  • If remediation is needed, an action plan involving retraining is defined and agreed with the employee

WHO’s guidance is international. Here is some additional guidance from the US, taken from a 2012 presentation by CAP’s inspections team lead on competency assessment for pathology labs. This advice seems to make sense in a wider context:

  • If it’s not documented, it didn’t happen!
  • You need to do competency assessment on every person on every important system they work with
  • If employees who are not in your department or organization contribute significantly to the work product, you need to assess their competence too. Otherwise the quality of your work product is impacted
  • Competency assessment often contains quizzes/tests, observational assessments, review of records, demonstration of taking corrective action and troubleshooting
  • If people fail competency assessment, you need to re-train, re-assess and document that
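The “every person on every important system” rule and the re-train, re-assess, document loop lend themselves to a simple gap report. The Python sketch below shows one way this could look. The data layout, the names, and the status wording are assumptions for illustration and do not come from CAP or WHO.

```python
def competency_gaps(assignments, records):
    """List each (person, system) pair that is unassessed or whose latest result is a fail.

    assignments: dict mapping a person to the set of systems they work with.
    records: documented results as (person, system, passed) tuples, oldest first.
    """
    latest = {}
    for person, system, passed in records:
        latest[(person, system)] = passed  # later records overwrite earlier ones
    gaps = []
    for person in sorted(assignments):
        for system in sorted(assignments[person]):
            result = latest.get((person, system))
            if result is None:
                gaps.append((person, system, "not assessed"))
            elif not result:
                gaps.append((person, system, "re-train and re-assess"))
    return gaps


# Hypothetical staff and systems.
assignments = {"tech_a": {"analyzer", "centrifuge"}, "tech_b": {"analyzer"}}
records = [
    ("tech_a", "analyzer", False),  # failed first attempt
    ("tech_a", "analyzer", True),   # passed after re-training
    ("tech_b", "analyzer", False),
]
for gap in competency_gaps(assignments, records):
    print(gap)
# ('tech_a', 'centrifuge', 'not assessed')
# ('tech_b', 'analyzer', 're-train and re-assess')
```

Because only documented results count, anything missing from the records shows up as a gap, in keeping with “if it’s not documented, it didn’t happen.”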

If your organization relies on employees working accurately, I hope this provides value and interest to you. I will share more of what I’m learning in future articles.