An argument against using negative item scores in CTT

Posted by Austin Fossey

Last year, a client asked for my opinion about whether or not to use negative scores on test items. For example, if a participant answers an item correctly, they would get one point, but if they answer the item incorrectly, they would lose one point. This means the item would be scored dichotomously [-1,1] instead of in the more traditional way [0,1].

I believe that negative item scores are really useful if the goal is to confuse and mislead participants. They are not appropriate for most classical test theory (CTT) assessment designs, because they do not add measurement value, and they are difficult to interpret.

Interested in learning more about classical test theory and applying item analysis concepts? Join Psychometrician Austin Fossey for a free 75-minute online workshop, Item Analysis: Concepts and Practice, on Tuesday, June 23, 2015. Space is limited.

Measurement value of negative item scores

Changing the item scoring format from [0,1] to [-1,1] does not change anything about your ability to measure participants—after all, the dichotomous scores are just symbols. You are simply using a different total score scale.

Consider a 60-item assessment made up of dichotomously scored items. If the items are scored [0,1], the total score scale ranges from 0 to 60 points. If scored [-1,1], the score range doubles, now ranging from -60 to 60 points.

From a statistical standpoint, nothing has changed. The item-total discrimination statistics will be the same under both designs, as will the assessment’s reliability. The standard error of measurement will double, but that is to be expected because the score range has doubled. Thus there is no change in the precision of scores or misclassification rates. How you score the items does not matter as long as they are scored dichotomously on the same scale.
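If you want to verify this invariance yourself, here is a minimal Python sketch (my own quick simulation, not the WinGen data shown in the figure below) that scores the same simulated responses as [0,1] and as [-1,1] and compares coefficient alpha, the standard error of measurement, and the corrected item-total correlations.

```python
# Minimal sketch: rescaling dichotomous item scores from [0,1] to [-1,1]
# leaves item-total correlations and coefficient alpha unchanged and
# doubles the standard error of measurement (SEM).
import numpy as np

rng = np.random.default_rng(7)
n_participants, n_items = 1000, 60

# Simulate simple dichotomous responses: higher-ability participants are
# more likely to answer each item correctly.
ability = rng.normal(0.0, 1.0, n_participants)
difficulty = rng.normal(0.0, 1.0, n_items)
p_correct = 1.0 / (1.0 + np.exp(-(ability[:, None] - difficulty[None, :])))
scored_01 = (rng.random((n_participants, n_items)) < p_correct).astype(float)
scored_neg = 2.0 * scored_01 - 1.0   # same responses rescored as [-1,1]

def ctt_stats(items):
    """Return (coefficient alpha, SEM, mean corrected item-total correlation)."""
    k = items.shape[1]
    total = items.sum(axis=1)
    alpha = (k / (k - 1)) * (1.0 - items.var(axis=0, ddof=1).sum() / total.var(ddof=1))
    sem = total.std(ddof=1) * np.sqrt(1.0 - alpha)
    r_it = [np.corrcoef(items[:, j], total - items[:, j])[0, 1] for j in range(k)]
    return alpha, sem, float(np.mean(r_it))

print(ctt_stats(scored_01))    # alpha, SEM, mean item-total correlation
print(ctt_stats(scored_neg))   # same alpha and correlation; SEM is exactly doubled
```

Because the [-1,1] scores are just a linear transformation of the [0,1] scores (y = 2x - 1), alpha and the correlations are guaranteed to match, while the standard deviation of the total scores, and therefore the SEM, simply doubles.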

The figure below illustrates the score distributions for 1,000 normally distributed assessment scores that were simulated using WinGen. This sample’s item responses have been scored with three different models: [-1,1], [0,1], and [0,2]. While this shifts and stretches the distribution of scores onto different scales, there is no change in reliability or in the standard error of measurement (as a percentage of the score range).

Distribution and assessment statistics for 1,000 simulated test scores with items dichotomously scored three ways: [-1,1], [0,1], and [0,2]

Interpretation issues of negative item scores

If the item scores do not make a difference statistically, and they are just symbols, then why not use negative scores? Remember that an item is a mechanism for collecting and quantifying evidence to support the student model, so how we score our items (and the assessment as a whole) plays a big role in how people interpret the participant’s performance.

Consider an item scored [0,1]. In a CTT model, a score of 1 represents accumulated evidence about the presence or magnitude of a construct, whereas a score of 0 suggests that no evidence was found in the response to this item.

Now suppose we took the same item and scored it [-1,1]. A score of 1 still suggests accumulated evidence, but now we are also changing the total score based on wrong answers. The interpretation is that we have collected evidence about the absence of the construct. To put it another way, the test designer is claiming to have positive evidence that the participant does not know something.

This is not an easy claim to make. In psychometrics, we can attempt to measure the presence of a hypothetical construct, but it is difficult to make a claim that a construct is not there. We can only make inferences about what we observe, and I argue that it is very difficult to build an evidentiary model for someone not knowing something.

Furthermore, negative scores negate evidence we have collected in other items. If a participant gets one item right and earns a point but then loses that point on the next item, we have essentially canceled out the information about the participant from a total score perspective. By using negative scores in a CTT model, we also introduce the possibility that someone can get a negative score on the whole test, but what would a negative score mean? This lack of interpretability is one major reason people do not use negative scores.

Consider a participant who answers 40 items correctly on the 60-item assessment I mentioned earlier. When scored [0,1], the raw score (40 points) corresponds to the number of correct responses provided by the participant. This scale is useful for calculating percentage scores (40/60 = 67% correct), setting cut scores, and supporting the interpretation of the participant’s performance.

When the same items are scored [-1,1], the participant’s score is more difficult to interpret. The participant answered 40 questions correctly but receives a score of only 20. They know the maximum score on the assessment is 60 points, yet their raw score of 20 corresponds to a correct response rate of 67%, not 33%, because 20 points sits 67% of the way along the range from -60 to 60 points.
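To make the conversion concrete, here is a tiny Python sketch (the helper function is hypothetical, purely for illustration) that maps a raw score earned under [-1,1] scoring back to the percent-correct figure a score user actually wants.

```python
# Hypothetical helper: convert a raw score earned under [-1,1] item scoring
# back to a percent-correct interpretation.
def percent_correct_from_neg_scale(raw_score, n_items=60):
    # With c correct answers out of n items, the raw score is c - (n - c) = 2c - n,
    # so the number of correct answers is (raw_score + n_items) / 2.
    n_correct = (raw_score + n_items) / 2     # (20 + 60) / 2 = 40 correct
    return 100 * n_correct / n_items          # 40 / 60 = 66.7% correct

print(percent_correct_from_neg_scale(20))     # about 66.7, not 33.3
```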

There are times when items need to be scored differently from other items on the assessment. Polytomous items clearly need different scoring models (though similar interpretive arguments could be leveled against people who try to score items in fractions of points), and there are times when an item may need to be weighted differently from other items. (We’ll discuss that in my next post.) Some assessments, such as the SAT in its former formula-scoring format, have used negative points to correct for guessing, but this should only be done if you can demonstrate improved model fit and you have a theory and evidence to justify doing so. In general, when using CTT, negative item scores only serve to muddy the water.

Interested in learning more about classical test theory and item statistics? Psychometrician Austin Fossey will be delivering a free 75-minute online workshop, Item Analysis: Concepts and Practice, on Tuesday, June 23, 2015, 11:00 AM – 12:15 PM EDT. Spots are limited.

Caveon Q&A: Enhanced security of high-stakes tests

Posted by Julie Delazyn

Questionmark and Caveon Test Security, an industry leader in protecting high-stakes test programs, have recently joined forces to provide clients of both organizations with additional resources for their test administration toolboxes.

Questionmark’s comprehensive platform offers many features that help ensure security and validity throughout the assessment process. This emphasis on security, along with Caveon’s services, which include analyzing data to identify validity risks as well as monitoring the internet for any leak that could affect intellectual property, adds a strong layer of protection for customers using Questionmark for high-stakes assessment management and delivery.

I sat down with Steve Addicott, Vice President of Caveon, to ask him a few questions about the new partnership, what Caveon does and what security means to him. Here is an excerpt from our conversation:

Who is Caveon? Tell me about your company.

At Caveon Test Security, we fundamentally believe in quality testing and trustworthy test results. That’s why Caveon offers test security and test item development services dedicated to helping prevent test fraud and better protect our clients’ items, tests, and reputations.

What does security mean to you, and why is it important?

High-stakes test programs make important education and career decisions about test takers based on test results. We also spend a tremendous amount of time creating, administering, scoring, and reporting results. With increased security pressures from pirates and cheats, we are here to make sure that those results are trustworthy, reflecting the true knowledge and skills of test takers.

Why a partnership with Questionmark and why now?

With a growing number of Questionmark clients engaging in high-stakes testing, Caveon’s experience in protecting the validity of test results is a natural extension of Questionmark’s security features. For Caveon, we welcome the chance to engage with a vendor like Questionmark to help protect exam results.

And how does this synergy help Questionmark customers who deliver high-stakes tests and exams?

As the stakes in testing continue to rise, so do the challenges involved in protecting your program. Both organizations are dedicated to providing clients with the most secure methods for protecting exam administrations, test development investments, exam result validity and, ultimately, their programs’ reputations.

For more information on Questionmark’s dedication to security, check out this video and download the white paper: Delivering Assessments Safely and Securely.

US Justice Department demands accessible educational technology

Posted by John Kleeman

The US Justice Department made an important intervention last week that could tip the balance in making educational technology more accessible for learners with disabilities.

They are intervening on the side of the learner in a court case between a blind learner and Miami University. The case is about learners with disabilities not getting the same access to digital content as other learners. For example, according to the complaint, the university required all learners to use applications with inaccessible Flash content as well as an LMS that was not usable with screen readers.

To quote the US Justice Department’s motion to intervene:

“Miami University’s failure to make its digital- and web-based technologies accessible to individuals with disabilities, or to otherwise take appropriate steps to ensure effective communication with such individuals, places them at a great disadvantage, depriving them of equal access to Miami University’s educational content and services.”

Questionmark has long taken accessibility seriously. When we re-architected our assessment delivery engine for our version 5 release, we made accessibility a priority (see Assessment Accessibility in Questionmark Perception Version 5). Our platform includes several standard templates with text sizing and contrast controls that administrators can make available to participants; these can be helpful for certain visual impairments.

Here are some other aspects of the delivery platform that we have optimized for accessibility:

  • The administrator can override an established assessment time limit for certain participants
  • Participants can use a pointing device other than a mouse or navigate the assessment using keystrokes such as the “tab” key
  • Screen readers can be used to clearly dictate assessment questions, choices and other content

Please note that preparing assessments for participants with disabilities takes more than an optimized delivery platform: assessment authors and administrators need to plan for accessibility as well. For example, items that rely heavily on graphics or images must use suitable description tags, videos should be appropriately captioned, and so on. Vendors and testing organizations alike must make a constant effort to ensure that material stays accessible as technology changes.

Provided you are following best practices for developing accessible content, the Questionmark delivery platform can complete the loop and help you give all of your participants, including those with disabilities, a reliable and fair test-taking experience.

Accessible software is good for everyone, not just those who temporarily or permanently need accommodations for their disabilities. Many of the technologies required to make software accessible also enhance delivery on mobile devices and improve blended delivery in general.

With the US Department of Justice now engaging in lawsuits against institutions that do not take accessibility seriously, accessibility support will become more important to everyone.

 

Does online learning and assessment help sustainability?

Posted by John Kleeman

Encouraged by public interest and increasing statutory controls, most large organizations care about and report on environmental sustainability and greenhouse gas emissions. I’ve been wondering how much online assessments and the wider use of e-learning help sustainability. Does taking assessments and learning online contribute to the planet’s well-being?

Does using computers instead of paper save trees?

It’s easy to see that by taking exams on computer, we save a lot of paper. Trees vary in size, but it seems the average tree might make about 50,000 pages of paper. If a typical paper test uses 10 pages of paper, then an organization that delivers 100,000 tests per year is using 20 trees a year. Or suppose a 100-page piece of learning material is distributed to 10,000 learners: the 20 trees cut down for that learning would be saved if it were delivered online.
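For readers who want to rerun this back-of-the-envelope arithmetic with their own volumes, here is a small Python sketch using the same very rough figure of about 50,000 pages per tree.

```python
# Rough paper-saving arithmetic; PAGES_PER_TREE is the approximate figure
# quoted above, not a precise constant.
PAGES_PER_TREE = 50_000

def trees_used(pages_per_copy, copies):
    return pages_per_copy * copies / PAGES_PER_TREE

print(trees_used(10, 100_000))   # a 10-page test delivered 100,000 times: 20.0 trees
print(trees_used(100, 10_000))   # 100-page learning material for 10,000 learners: 20.0 trees
```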

These are useful benefits, but they need to be set against the environmental costs of the computers and electricity used. The environmental benefit is probably modest.

What about the benefits of reduced business travel?

A much stronger environmental case might be made around reduced travel. Taking a test on paper and/or in a test center likely means travelling, so it is no surprise that we are seeing increased use of online proctoring. For example, SAP are starting to use it for their certification exams. Online proctoring means that a candidate doesn’t have to travel to a test center but can take an exam from their home or office. This saves time and money. It also eliminates the environmental costs of travel. Learning online rather than going to a classroom does the same.

Training and assessment are only a small reason for business travel, but the overall environmental impact of business travel is huge. One large services company has reported that 67 percent of its carbon footprint in 2014 was related to business travel; another puts the figure at over 30 percent. Many large companies have internal targets for reducing greenhouse gas emissions from business travel.

In the academic world, the Open University in the UK performed a study a few years back on the carbon benefits of their model of distance learning compared with more conventional university education. The study suggested that carbon emissions were 85 percent lower with distance education compared with a more conventional university approach. However, the benefit of electronic delivery rather than paper delivery in distance learning was more modest at 12 percent, partly because students often print the e-learning materials. This suggests that there is a very substantial benefit in distance learning and a smaller benefit in it being electronic rather than paper-based.

The strongest benefit of online assessment is that it gives accurate information about people’s knowledge, skills and abilities to help organizations make good decisions. But it does seem that there may well be a useful environmental benefit too.

7 actionable steps for making your assessments more trustable

Posted by John Kleeman

Questionmark has recently published a white paper on trustable assessment, and we blog about this topic frequently. See Reliability and validity are the keys to trust and The key to reliability and validity is authoring for some recent blog posts about the white paper.

But what can you do today if you want to make your assessments more trustable? Obviously you can read the white paper! Beyond that, here are seven actionable steps that, if you are not taking them already, you could put in place today, or at least reasonably quickly, to improve your assessments.

1. Organize questions in an item bank with topic structure

If you are using Questionmark software, you are likely doing this already. Putting questions in an item bank structured by hierarchical topics gives you an easy management view of all questions and assessments under development. It lets you reuse the same question in multiple assessments, easily add and retire questions, and easily search them, for example to find the ones that need updating when laws change or a product is retired.
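To illustrate the idea, here is a toy Python sketch of an item bank keyed by hierarchical topic paths. It is my own simplified model, not Questionmark’s actual data structures, but it shows why “find everything that needs updating when this law changes” becomes a one-line search.

```python
# Toy item bank organized by hierarchical topic paths (illustration only).
from dataclasses import dataclass, field

@dataclass
class Item:
    item_id: str
    stem: str
    topic_path: tuple   # e.g. ("Compliance", "Data protection", "GDPR")
    retired: bool = False

@dataclass
class ItemBank:
    items: list = field(default_factory=list)

    def add(self, item):
        self.items.append(item)

    def retire(self, item_id):
        for item in self.items:
            if item.item_id == item_id:
                item.retired = True

    def search(self, topic_prefix):
        """Return active items whose topic path starts with the given prefix."""
        return [i for i in self.items
                if not i.retired and i.topic_path[:len(topic_prefix)] == tuple(topic_prefix)]

bank = ItemBank()
bank.add(Item("Q1", "Which regulation governs...?", ("Compliance", "Data protection", "GDPR")))
bank.add(Item("Q2", "A customer requests...", ("Compliance", "Data protection", "GDPR")))
bank.add(Item("Q3", "Which product feature...?", ("Products", "Widget 2000")))

# Find every question under the data-protection branch when the law changes:
print([i.item_id for i in bank.search(("Compliance", "Data protection"))])   # ['Q1', 'Q2']
```

Because assessments would reference item IDs rather than copies of the question text, the same structure also supports reusing an item in several assessments and retiring it everywhere at once.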

2. Use questions that apply knowledge in the job context

It is better to ask questions that check whether people can apply knowledge in the job context than just to find out whether they have specific knowledge. See my earlier post Test above knowledge: Use scenario questions for some tips on this. If you currently test only knowledge and not how to apply that knowledge, make today the day that you start to change!

3. Have your subject matter experts directly involved in authoring

Especially in an area where there is rapid change, you need subject matter experts directly involved in authoring and reviewing questions. Whether you use Questionmark Live or another system, start involving them.

4. Set a pass score fairly

Setting a pass score fairly is critical to being able to trust an assessment’s results. See Is a compliance test better with a higher pass score? and Standard Setting: A Keystone to Legal Defensibility for some starting points on setting good pass scores. And if you don’t think you’re following good practice, start to change.

5. Use topic scoring and feedback

As Austin Fossey explained in his ground-breaking post Is There Value in Reporting Subscores?, you do need to check whether it is sensible to report topic scores. But in most cases, topic scores and topic feedback can be very useful and actionable – they direct people to where there are problems or where improvement is needed.
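As a simple illustration of what topic scoring involves (my own sketch, not the algorithm behind Questionmark’s reports), the snippet below groups item scores by topic and flags any topic that falls below a threshold, which is exactly the kind of actionable signal topic feedback provides.

```python
# Group dichotomous item scores by topic and flag weak topics (illustration only).
from collections import defaultdict

def topic_scores(item_scores, item_topics):
    """item_scores: {item_id: 0 or 1}; item_topics: {item_id: topic name}."""
    earned, possible = defaultdict(int), defaultdict(int)
    for item_id, score in item_scores.items():
        topic = item_topics[item_id]
        earned[topic] += score
        possible[topic] += 1
    return {t: earned[t] / possible[t] for t in possible}

scores = {"Q1": 1, "Q2": 0, "Q3": 0, "Q4": 1, "Q5": 1}
topics = {"Q1": "Safety", "Q2": "Safety", "Q3": "Safety", "Q4": "Reporting", "Q5": "Reporting"}

for topic, pct in topic_scores(scores, topics).items():
    flag = "needs attention" if pct < 0.6 else "ok"
    print(f"{topic}: {pct:.0%} ({flag})")   # Safety: 33% (needs attention); Reporting: 100% (ok)
```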

6. Define a participant code of conduct

If people cheat, it makes assessment results much less trustable. As I explained in my post What is the best way to reduce cheating?, setting up a participant code of conduct (or honesty code) is an easy and effective way of reducing cheating. What can you do today to encourage your test takers to believe your program is fair and be on your side in reducing cheating?

7. Run item analysis and weed out poor items

This is something that all Questionmark users could do today. Run an item analysis report (it takes just a minute or two from our interfaces) and look at the questions that are flagged as needing review (usually amber or red). Review them to check their appropriateness, and either improve them or retire them from your pool.

Questionmark item analysis report
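If you are curious what sits behind a report like this, here is a minimal Python sketch (not Questionmark’s implementation, and the flagging thresholds are purely illustrative) of the two statistics item analysis typically reviews: item difficulty (the p-value) and item-total discrimination (a corrected point-biserial correlation).

```python
# Flag items with extreme difficulty or weak discrimination (illustrative thresholds).
import numpy as np

def item_analysis_flags(responses, p_range=(0.2, 0.95), min_r=0.15):
    """responses: participants x items matrix of 0/1 scores."""
    flags = []
    total = responses.sum(axis=1)
    for j in range(responses.shape[1]):
        p = responses[:, j].mean()                                        # difficulty
        r = np.corrcoef(responses[:, j], total - responses[:, j])[0, 1]   # discrimination
        if not (p_range[0] <= p <= p_range[1]) or r < min_r:
            flags.append((j, round(p, 2), round(float(r), 2)))
    return flags   # items worth reviewing, then improving or retiring

# Tiny example: rows are participants, columns are items.
responses = np.array([
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [0, 1, 1, 0],
    [1, 1, 0, 1],
    [0, 0, 1, 0],
])
print(item_analysis_flags(responses))   # item 2 is flagged for negative discrimination
```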

 

Many of you will probably be doing all the above and more, but I hope that for some of you this post could be a spur to action to make your assessments more trustable. Why not start today?