10 Reasons Why Frequent Testing Makes Sense

Posted by John Kleeman

It matters to society, organizations and individuals that test results are trustworthy. Tests and exams are used to make important decisions about people, and each failure of test security reduces that trustworthiness.

There are several risks to test security, but two important ones are identity fraud and getting help from others. With identity fraud, someone asks a friend to take the test for them or pays a professional cheater to take the test and pretend to be them. With getting help from others, a test taker subverts the process by getting a friend or expert to feed them the right answers during the test. Either way, the individual test result becomes meaningless, which detracts from the value and trustworthiness of the whole assessment process.

There are many mitigations for these risks – checking identity carefully, having well-trained proctors, using forensics or other reports, and using technical solutions like secure browsers – and these are very helpful. But testing more frequently can also reduce the risk. Let me explain.

Suppose you just need to pass a single exam to achieve an important career step – a certification, qualification or other important job requirement. The incentive to cheat on that one test is large. But if there is a series of smaller tests over a period of time, it is much more hassle for a test taker to commit identity fraud or to get help from others each time. He or she would have to pay the proxy test taker several times, and make sure the same person is available each time in case photos are captured. Getting expert help repeatedly means reaching out more often and evading whatever security is in place on every occasion.

There are other benefits too; here is a list of ten reasons why more frequent testing makes sense:

  1. More reliable. More frequent testing contributes to more reliable measurement. A single large test is vulnerable to measurement error if a test taker is sick or has an off day, whereas an off day has much less impact on a series of frequent tests (there is a short illustrative sketch after this list).
  2. More up to date. With technology and society changing rapidly, more frequent tests can make tests more current. For instance, some IT certification providers create “delta” tests measuring understanding of their latest releases and encourage people to take quarterly tests to ensure they remain up to date.
  3. Less test anxiety. Test anxiety can be a big challenge for some test takers (see Ten tips on reducing test anxiety for online test-takers), and more frequent tests mean less is at stake in each one, which may help test takers feel less anxious.
  4. More feedback. More frequent tests give feedback to test takers on how well they are performing and allow them to identify training or continuing education to improve.
  5. More data for the testing organization. In today’s world of business intelligence and analytics, there is potential to find correlations and other valuable insights in the data from people’s performance across a series of tests over time.
  6. Encourages test takers to target retention of learning. We all know of people who cram for an exam and then forget it afterwards. More frequent tests encourage people to plan to learn for the longer term.
  7. Encourages spaced out learning. There is strong evidence that learning at spaced out intervals makes it more likely knowledge and skills will be retained. Periodic tests encourage revision at regular intervals and so make it more likely that learning will be remembered.
  8. Testing effect. There is also evidence that tests themselves provide retrieval practice and aid retention, and more frequent tests give more of this practice.
  9. More practical. With online assessment software and online proctoring, it’s now very practical to test frequently; it is no longer necessary to bring test takers to a central testing center for one-off large tests.
  10. Harder to cheat. Finally, as described above, more frequent testing makes it harder to use identity fraud or to get help from others, which reduces cheating.
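
Reason 1 can be made a little more concrete with the Spearman-Brown relationship from classical test theory. The sketch below is purely illustrative (the single-test reliability and the number of test occasions are assumed values, not figures from any real testing program), but it shows how a composite built from several comparable short tests can be more dependable than any single sitting.

    # Rough illustration of reason 1 (hypothetical numbers, not from any real program).
    # By the Spearman-Brown relationship, combining k comparable test occasions yields
    # a composite that is more reliable than any single occasion, assuming the
    # occasions measure the same construct and their errors are independent.

    def composite_reliability(single_test_reliability: float, k: int) -> float:
        """Reliability of a composite score built from k comparable test occasions."""
        r = single_test_reliability
        return (k * r) / (1 + (k - 1) * r)

    if __name__ == "__main__":
        r_single = 0.70  # assumed reliability of one short test
        for k in (1, 2, 4, 8):
            print(f"{k} occasion(s): composite reliability = {composite_reliability(r_single, k):.2f}")
        # The value climbs from 0.70 toward roughly 0.95, which is why a series of
        # smaller tests can give a more dependable picture than one large sitting.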

I think we’re seeing a slow paradigm shift from larger testing events that happen at a single point in time to smaller, online testing events happening periodically. What do you think?

High-stakes assessment: It’s not just about test takers

Posted by Lance

In my last post I spent some time defining how I think about the idea of high-stakes assessment. I also talked about how these assessments affect the people who take them, including how important the results can be to their ability to get or do a job.

Now I want to talk a little bit about how these assessments affect the rest of us.

The rest of us

Guess what? The rest of us are affected by the outcomes of these assessments. Did you see that coming?

But seriously, the credentials or scores that result from these assessments affect large swathes of the public. Ultimately that’s the point of high-stakes assessment. The resulting certifications and licenses exist to protect the public. These assessments are acting as barriers preventing incompetent people from practicing professions where competency really matters.

It really matters

What are some examples of “really matters”? Well, when hiring, it really matters to employers that the network techs they hire know how to configure a network securely, not that the techs just say they do. It matters to the people crossing a bridge that the engineers who designed it knew their physics. It really matters to every one of us that our doctor, dentist, nurse, or surgeon knows what they are doing when treating us. And it really matters to society at large that we measure well the children and adults who take large-scale assessments like college entrance exams.

At the end of the day, high-stakes exams are high-stakes because in a very real way, almost all of us have a stake in their outcome.

Separating the wheat from the chaff

There are a couple of ways that high-stakes assessments do what they do. Some assessments are simply designed to measure “minimal competence,” with test takers either ending above the line—often known as “passing”—or below the line. The dreaded “fail.”

Other assessments are designed to place test takers on a continuum of ability. This type of assessment assigns scores to test takers, and the range of scores often appears odd to laypeople. For example, the SAT uses a 200 – 800 scale.
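
Scales like this typically come from transforming an underlying ability estimate onto a reporting range that is deliberately different from a percentage. Purely as an illustration (this is not how the SAT or any particular exam is actually scaled, and the range and rounding rule below are assumptions), here is a minimal sketch of a linear transformation onto a 200-800-style reporting scale:

    # Hypothetical sketch: map an ability estimate (assumed here to lie roughly
    # between -3 and +3, as in many latent-trait models) onto a 200-800 reporting
    # scale. Real programs use their own, more elaborate scaling and equating.

    def scaled_score(theta: float, lo: float = -3.0, hi: float = 3.0) -> int:
        """Linearly rescale an ability estimate to 200-800, rounded to the nearest 10."""
        theta = max(lo, min(hi, theta))        # clamp extreme estimates
        proportion = (theta - lo) / (hi - lo)  # 0.0 .. 1.0
        raw = 200 + proportion * 600           # 200 .. 800
        return int(round(raw / 10) * 10)       # report in steps of 10

    if __name__ == "__main__":
        for theta in (-2.5, 0.0, 1.3):
            print(theta, "->", scaled_score(theta))  # 0.0 maps to 500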

Want to learn more? Hang on till next time!

When to Give Partial Credit for Multiple-Response Items

Posted by Austin Fossey

Three different customers recently asked me how to decide between scoring a multiple-response (MR) item dichotomously or polytomously; i.e., when should an MR item be scored right/wrong, and when should we give partial credit? I gave some garrulous, rambling answers, so the challenge today is for me to explain this in a single blog post that I can share the next time it comes up.

In their chapter on multiple-choice and matching exercises in Educational Assessment of Students (5th ed.), Anthony Nitko and Susan Brookhart explain that matching items (which we may extend to include MR item formats, drag-and-drop formats, survey-matrix formats, etc.) are often a collection of single-response multiple choice (MC) items. The advantage of the MR format is that it saves space and you can leverage dependencies in the questions (e.g., relationships between responses) that might be redundant if broken into separate MC items.

Given that an MR item is often a set of individually scored MC items, a polytomously scored format almost always makes sense. From an interpretation standpoint, there are a couple of advantages for you as a test developer or instructor. First, you can differentiate between participants who know some of the answers and those who know none of the answers. This can improve the item discrimination. Second, you have more flexibility in how you choose to score and interpret the responses. In the drag-and-drop example below (a special form of an MR item), the participant has all of the dates wrong; however, the instructor may still be interested in knowing that the participant knows the correct order of events for the Stamp Act, the Townshend Act, and the Boston Massacre.

[Image: Example of a drag-and-drop item in Questionmark where the participant’s responses are wrong, but the order of responses is partially correct.]
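
To make the distinction concrete, here is a minimal scoring sketch. It is not Questionmark's scoring logic (the keys, penalty rule, and floor at zero are assumptions chosen only for illustration), but it shows how the same set of selections can be scored dichotomously (all-or-nothing) or polytomously (partial credit):

    # Hypothetical MR scoring sketch (not Questionmark's implementation).
    # correct: the keyed options; selected: the options the participant chose.

    def dichotomous_score(correct: set, selected: set, points: int = 1) -> int:
        """All-or-nothing: full credit only if the selections match the key exactly."""
        return points if selected == correct else 0

    def polytomous_score(correct: set, selected: set) -> int:
        """Partial credit: one point per keyed option chosen, minus one per
        incorrect option chosen, floored at zero so selecting everything
        is not rewarded."""
        hits = len(selected & correct)
        false_alarms = len(selected - correct)
        return max(0, hits - false_alarms)

    if __name__ == "__main__":
        key = {"A", "C", "D"}
        response = {"A", "C", "E"}               # two keyed options right, one wrong
        print(dichotomous_score(key, response))  # 0
        print(polytomous_score(key, response))   # 1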

Are there exceptions? You know there are. This is why it is important to have a test blueprint document, which can help clarify which item formats to use and how they should be evaluated. Consider the following two variations of a learning objective on a hypothetical CPR test blueprint:

  • The participant can recall the actions that must be taken for an unresponsive victim requiring CPR.
  • The participant can recall all three actions that must be taken for an unresponsive victim requiring CPR.

The second example is likely the one that the test developer would use for the test blueprint. Why? Because someone who knows two of the three actions is not going to cut it. This is a rare all-or-nothing scenario where knowing some of the answers is essentially the same (from a qualifications standpoint) as knowing none of the answers. The language in this learning objective (“recall all three actions”) is an indicator to the test developer that if they use an MR item to assess this learning objective, they should score it dichotomously (no partial credit). The example below shows how one might design an item for this hypothetical learning objective with Questionmark’s authoring tools:

[Image: Example of a Questionmark authoring screen for an MR item that is scored dichotomously (right/wrong).]

To summarize, a test blueprint document is the best way to decide if an MR item (or variant) should be scored dichotomously or polytomously. If you do not have a test blueprint, think critically about what you are trying to measure and the interpretations you want reflected in the item score. Partial-credit scoring is desirable in most use cases, though there are occasional scenarios where an all-or-nothing scoring approach is needed—in which case the item can be scored strictly right/wrong. Finally, do not forget that you can score MR items differently within an assessment. Some MR items can be scored polytomously and others can be scored dichotomously on the same test, though it may be beneficial to notify participants when scoring rules differ for items that use the same format.

If you are interested in understanding and applying some basic principles of item development and enhancing the quality of your results, download the free white paper written by Austin: Managing Item Development for Large-Scale Assessment

Item Development – Benefits of editing items before the review process

Posted by Austin Fossey

Some test developers recommend a single round of item editing (or editorial review), usually right before items are field tested. When schedules and resources allow for it, I recommend that test developers conduct two rounds of editing—one right after the items are written and one after content and bias reviews are completed. This post addresses the first round of editing, to take place after items are drafted.

Why have two rounds of editing? In both rounds, we will be looking for grammar or spelling errors, but the first round serves as a filter to keep items with serious flaws from making it to content review or bias review.

In their chapter in Educational Measurement (4th ed.), Cynthia Schmeiser and Catherine Welch explain that an early round of item editing “serves to detect and correct deficiencies in the technical qualities of the items and item pools early in the development process.” They recommend that test developers use this round of item editing to do a cursory review of whether the items meet the Standards for Educational and Psychological Testing.

Items that have obvious item writing flaws should be culled in the first round of item editing and either sent back to the item writers or removed. This may include item writing errors like cluing or having options that do not match the stem grammatically. Ideally, these errors will be caught and corrected in the drafting process, but a few items may have slipped through the cracks.

In the initial round of editing, we will also be looking for proper formatting of the items. Did the item writers use the correct item types for the specified content? Did they follow the formatting rules in our style guide? Is all supporting content (e.g., pictures, references) present in the item? Did the item writers record all of the metadata for the item, like its content area, cognitive level, or reference? Again, if an item does not match the required format, it should be sent back to the item writers or removed.
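
Many of these formatting and metadata checks can be automated as a first pass before human review if the drafted items can be exported as structured data. The sketch below is hypothetical (the field names and rules are assumptions, not a Questionmark feature), but it illustrates the kind of screening described above:

    # Hypothetical first-pass screen for drafted items (field names are illustrative).
    # Items flagged here would go back to the item writers before content/bias review.

    REQUIRED_METADATA = ("content_area", "cognitive_level", "reference")

    def screening_issues(item: dict) -> list:
        """Return a list of editorial issues found in a drafted item record."""
        issues = []
        for field in REQUIRED_METADATA:
            if not item.get(field):
                issues.append("missing metadata: " + field)
        if not item.get("stem"):
            issues.append("missing stem")
        options = item.get("options", [])
        if len(options) < 3:
            issues.append("fewer than 3 options")
        if len(set(options)) != len(options):
            issues.append("duplicate options")
        return issues

    if __name__ == "__main__":
        draft = {"stem": "Which act was passed in 1765?",
                 "options": ["Stamp Act", "Townshend Act"],
                 "content_area": "US History"}
        print(screening_issues(draft))
        # ['missing metadata: cognitive_level', 'missing metadata: reference',
        #  'fewer than 3 options']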

It is helpful to look for these issues before going to content review or bias review because these types of errors may distract your review committees from their tasks; the committees may be wasting time reviewing items that should not be delivered anyway due to formatting flaws. You do not want to get all the way through content and bias reviews only to find that a large number of your items have to be returned to the drafting process. We will discuss review committee processes in the following posts.

For best practice guidance and practical advice for the five key stages of test and exam development, check out our white paper: 5 Steps to Better Tests.

Early-bird savings on conference registration end today: Sign up now!

Posted by Joan Phaup

Just a reminder that you can save $200 if you register today for the Questionmark 2014 Users Conference.

We look forward to seeing you March 4 – 7 in San Antonio, Texas, for three intensive days of learning and networking.

Check out the conference program as it continues to take shape, and sign up today!

This conference truly is the best place to learn about our technologies, improve your assessments and discuss best practices with Questionmark staff, industry experts and your colleagues. But don’t take my word for it. Let these attendees at the 2013 conference tell you what they think:

Teaching to the test and testing to what we teach

Posted by Austin Fossey

We have all heard assertions that widespread assessment creates a propensity for instructors to “teach to the test.” This often conjures images of students memorizing facts without context in order to eke out passing scores on a multiple choice assessment.

But as Jay Phelan and Julia Phelan argue in their essay, Teaching to the (Right) Test, teaching to the test is usually problematic when we have a faulty test. When our curriculum, instruction, and assessment are aligned, teaching to the test can be beneficial because we are testing what we taught. We can flip this around and assert that we should be testing to what we teach.

There is little doubt that poorly designed assessments have made their way into some slices of our educational and professional spheres. Bad assessment designs can stem from shoddy domain modeling, improper item types, or poor reporting.

Nevertheless, valid, reliable, and actionable assessments can improve learning and performance. When we teach to a well-designed assessment, we should be teaching what we would have taught anyway, but now we have a meaningful measurement instrument that can help students and instructors improve.

I admit that there are constructs, like creativity and teamwork, that are harder to define, and assessing these learning goals appropriately can be difficult. We may instinctively cringe at the thought of assessing an area like creativity—I would hate to see a percentage score assigned to my creativity.

But if creativity is a learning goal, we should be collecting evidence that helps us support the argument that our students are learning to be creative. A multiple choice test may be the wrong tool for that job, but we can use frameworks like evidence-centered design (ECD) to decide what information we want to collect (and the best methods for collecting it) to demonstrate our students’ creativity.

Assessments have evolved a lot over the past 25 years, and with better technology and design, test developers can improve the validity of the assessments and their utility in instruction. This includes new item types, simulation environments, improved data collection, a variety of measurement models, and better reporting of results. In some programs, the assessment is actually embedded in the everyday work or games that the participant would be interacting with anyway—a strategy that Valerie Shute calls stealth assessment.

With a growing number of tools available to us, test developers should always be striving to improve how we test what we teach so that we can proudly teach to the test.