Performance testing versus knowledge testing

Posted by Joan Phaup

Art Stark is an instructor at the United States Coast Guard National Search and Rescue School – and a longtime Questionmark user.

He will team up with James Parry, Test Development/E-Testing Manager at the Coast Guard’s Performance Systems Branch, to share a case study at the Questionmark Users Conference in Baltimore March 3 – 6.

I’m looking forward to hearing about the Coast Guard’s progress in moving from knowledge-based tests to performance-based tests. Here’s how Art explains the basic ideas behind this.

Tell me about your experience with performance-based training at the Coast Guard.

Art Stark

All Coast Guard training is performance-based. At the National Search and Rescue School we’ve recently completed a course re-write and shifted more from knowledge-based assessments to performance-based assessments. Before coming to the National SAR School, I was an instructor and boat operator trainer on Coast Guard small boats. Everything we did was 100% performance-based. The boat was the classroom and we had standards and objectives we had to meet.

How does performance testing differ from knowledge testing?

To me, knowledge-based testing is testing to the lowest common denominator. All through elementary and high school we are tested at the knowledge level and only infrequently at the performance level. Think of a test you crammed for: as soon as the test was over, you promptly forgot the information. Most of the time, that test was measuring knowledge alone.

Performance testing means actually observing and evaluating the performance while it is occurring. Knowledge testing is relatively easy to develop; performance testing is much harder and much more expensive to create. With reductions to budgets, it is becoming harder and harder to develop the type of facilities we need for performance testing, so we need to find new, less expensive ways to test performance.

It takes a much more concerted effort to develop knowledge application test items than to develop simple knowledge test items. When a test is geared to knowledge only, it does not give the evaluator a good assessment of the student’s real ability. An example of this would be applying for a job as a customer service representative. Often the interview includes questions that actually test the application of knowledge, such as “You are approached by an irate customer, what actions do you take…?”

How will you address this during your session?

We’ll look at using written assessments to test performance objectives, which requires creating test items that apply knowledge instead of just recalling it. Taking from Bloom’s Taxonomy, I look at the third level, application. I’ll be showing how to bridge the gap from knowledge-based testing to performance-based testing.

What would you and Jim like your audience to take away from your presentation?

A heightened awareness of using written tests to evaluate performance.

You’ve attended many of these conferences. What makes you return each year?

The ability to connect with other professionals and increase my knowledge and awareness of advances in training. Meeting and being with good friends in the industry.

Check out the conference program and register soon.

Evaluating the Test — Test Design & Delivery Part 10

Posted by Doug Peterson

In this, the 10th and final installment of the Test Design and Delivery series, we take a look at evaluating the test. Statistical analysis improves as the number of test takers goes up, but data from even a few attempts can provide useful information. In most cases, we recommend performing analysis on data from at least 100 participants; data from 250 or more is considered more trustworthy.

Analysis falls into two categories: item statistics and analysis (the performance of an individual item), and test analysis (the performance of the test as a whole). Questionmark provides both of these analyses in our Reports and Analytics suites.

Item statistics provide information on things like how many times an item has been presented and how many times each choice has been selected. This information can point out a number of problems:

  • An item that has been presented many times may need to be retired. There is no hard-and-fast number for how many presentations is too many, but items on a high-stakes test should be changed fairly frequently.
  • If the majority of test-takers are getting the question wrong but they are all selecting the same choice, the wrong choice may be flagged as the correct answer, or the training might be teaching the topic incorrectly.
  • If no choice is being selected a majority of the time, it may indicate that the test-takers are guessing, which could in turn indicate a problem with the training. It could also indicate that no choice is completely correct.
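These choice-frequency checks can be automated. The sketch below is illustrative only (the function name and thresholds are assumptions, not a Questionmark API): it flags an item whose most-selected choice is not the keyed answer, and an item where no single choice dominates.

```python
# Minimal distractor-analysis sketch: flag suspicious items from the
# raw choice selections. Names and thresholds are illustrative only.
from collections import Counter

def flag_item(choices, key, guess_threshold=0.4):
    """choices: list of selected options (e.g. 'A'..'D'); key: keyed answer."""
    counts = Counter(choices)
    top, top_n = counts.most_common(1)[0]
    share = top_n / len(choices)
    if top != key and share > 0.5:
        # Most test-takers agree on a wrong option: possible miskeyed item
        # or the training is teaching the topic incorrectly.
        return "majority chose a wrong option: check key and training"
    if share < guess_threshold:
        # Selections spread evenly: possible guessing, or no choice is
        # completely correct.
        return "no dominant choice: possible guessing or flawed options"
    return "ok"

print(flag_item(list("BBBBABCB"), key="A"))
print(flag_item(list("AAAABBCA"), key="A"))
```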

Item analysis typically provides two key pieces of information: the Difficulty Index and the Point-Biserial Correlation.

  • Difficulty Index: P value = percentage of test-takers who answered correctly
      • Too high = too easy
      • Too low = too hard, confusing or misleading, or a problem with the content or instruction
  • Point-Biserial Correlation: how well the item discriminates between those who did well on the exam and those who did not
      • Positive value = those who got the item correct also did well on the exam, and those who got the item wrong did poorly on the exam
      • Negative value = those who did well on the test got the item wrong, and those who did poorly on the test got the item right
      • +0.10 or above is typically required to keep an item
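Both statistics can be computed directly from response data. Here is a hedged sketch in plain Python; the function names and data layout are assumptions for illustration, not any product’s API.

```python
# Item analysis sketch for one dichotomous (right/wrong) item.
# `responses` holds each test-taker's 0/1 score on the item;
# `totals` holds each test-taker's total score on the whole test.
from statistics import mean, pstdev

def difficulty_index(responses):
    """P value: proportion of test-takers who answered correctly."""
    return mean(responses)

def point_biserial(responses, totals):
    """Correlation between the 0/1 item score and the total test score."""
    p = mean(responses)
    sd = pstdev(totals)
    if sd == 0 or p in (0, 1):
        return 0.0  # no variance, so the item cannot discriminate
    mean_correct = mean(t for r, t in zip(responses, totals) if r == 1)
    # Standard form: (M_correct - M_all) / sd * sqrt(p / q)
    return (mean_correct - mean(totals)) / sd * (p / (1 - p)) ** 0.5

responses = [1, 1, 0, 1, 0, 1, 1, 0]
totals    = [9, 8, 4, 7, 5, 9, 6, 3]
print(round(difficulty_index(responses), 2))
print(round(point_biserial(responses, totals), 2))
```

A strongly positive point-biserial here would support keeping the item; a value below about +0.10 (or a negative one) would flag it for review, as described above.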

Test analysis typically comes down to determining a Reliability Coefficient. In other words, does the test measure knowledge consistently – does it produce similar results under consistent conditions? (Please note that this has nothing to do with validity. Reliability does not address whether or not the assessment tests what it is supposed to be testing. Reliability only indicates that the assessment will return the same results consistently, given the same conditions.)

  • Reliability Coefficient: range of 0 – 1.00
      • The acceptable value depends on the consequences of testing error
      • If failing means having to repeat some training, a lower value might be acceptable
      • If failing means the health and safety of coworkers might be in jeopardy, a high value is required


There are a number of different types of consistency:

  • Test – Retest: repeatability of test scores with the passage of time
  • Alternate / Parallel Form: consistency of scores across two or more forms taken by the same test taker
  • Inter-Rater: consistency of test scores when rated by different raters
  • Internal Consistency: extent to which items on a test measure the same thing
      • Most common measures: Kuder-Richardson 20 (KR-20) or Coefficient Alpha
      • Items must be scored as single-answer (right/wrong)
      • May be low if the test measures several different, unrelated objectives
      • A low value can also indicate many very easy or very hard items, poorly written items that do not discriminate well, or items that do not test the proper content
  • Mastery Classification Consistency
      • Used with criterion-referenced tests
      • Not affected by items measuring unrelated objectives
      • Three common measures: the Phi coefficient, the Agreement coefficient, and Kappa
Doug will share these and other best practices for test design and delivery at the Questionmark Users Conference in Baltimore March 3 – 6. The program includes an optional pre-conference workshop on Criterion-Referenced Test Development led by Sharon Shrock and Bill Coscarelli. Click here for conference and workshop registration.

Assessments that mitigate risk bring other business benefits, too

Posted by Julie Delazyn

Quizzes, tests and surveys are important tools for helping financial services organizations with risk mitigation – a key component of their success. Such assessments provide a cost-effective way to confirm and document mandatory regulatory training. They help ensure that employees understand their roles and what is required of them to meet regulatory and business needs. What’s more, compliance-related assessments also do a lot to improve performance – making for happier customers and employees.

Our white paper, The Role of Assessments in Mitigating Risk for Financial Services Organizations, offers best practices for implementing a legally defensible assessment program. It explains why many large companies regard online assessments as crucial to their compliance programs. It also describes how compliance-related assessments can bring peace of mind and tremendous business value.

Here are just a few business advantages of effective assessments:

They help meet explicit or implicit legal requirements — The most compelling reason for assessments is that they are legally required! As the US FDIC says: “Once personnel have been trained on a particular subject, a compliance officer should periodically assess employees on their knowledge and comprehension of the subject matter”. Some regulators require you to ensure your employees are “competent.” Assessments aligned with a formal competency model are a strong way to demonstrate competence against such principles and reduce regulatory risk.

They demonstrate commitment to complying with laws — Requiring all employees, including managers, to take a test on the laws and regulations is an important signal of top management’s commitment to following the laws. In some jurisdictions, should an employee break the rules, having a strong compliance policy can help demonstrate that the employee acted alone.

They conclusively and cost-effectively document that training has taken place — Many regulators and laws require you to document that people have been through training. Passing a test is clear evidence of attending and understanding training – and probably the least expensive and most conclusive way of proving this.

They give early warning of problems  — Assessments are one of the few ways of contacting and getting input from your entire workforce. If they are prepared well and analyzed effectively, the results can tell you of potential problems in time to act and resolve them before they cause pain: they let you see into the future.

They harness training required by compliance to give business advantage — Successful companies in financial services often see compliance training as an opportunity for business advantage. Financial services are a people business; if you are training your people for compliance purposes, you can also take the opportunity of training for business value and customer service.

They reduce cost and time spent in unnecessary training — If an employee already knows something well, then training him or her in it is a waste of resources and motivation. Diagnostic tests offer a way of identifying what employees know, allowing employees to “test out” of training they do not need to take.

They ensure that partners and brokers understand your products — Many financial services organizations work with brokers, advisers or other third-party partners. In some jurisdictions, your company is liable if these products or services are mis-sold, and in all jurisdictions you will want these partners to be capable and successful with your products.

They reduce forgetting among your employees — People quickly forget material after learning it. Giving assessments after a training session gives participants a chance to practice retrieval of the learning and significantly increases long-term retention.

The figure below shows results from a cognitive psychology peer-reviewed paper: One group of people studied material and the other group spent the same amount of time studying and being tested. After 5 minutes, there was little difference between how much they knew, but a week later, the group that took tests recalled significantly more (56% vs. 42%).


 (Data from Experiment 1 in Roediger, H. L., III, & Karpicke, J. D. (2006b). Test-enhanced learning: Taking memory tests improves long-term retention. Psychological Science, 17, 249-255.  )

Click here to read the paper, which you can download free after login or sign-up.

Writing high-complexity items for performance-based tests

Posted by Joan Phaup

Performance-based tests call for complex items that require participants to show that they know how to apply what they have learned.

James Parry from the U.S. Coast Guard will lead a discussion about the challenges involved in writing these types of questions during the Questionmark 2013 Users Conference in Baltimore March 3 – 6: Advanced Question Writing: Creating High-Complexity Test Items.

I asked James for some background on the subject, which is a major interest of his work as Test Development/E-Testing Manager for Coast Guard Training Center Yorktown, Virginia.

Why are higher-complexity test items important?


James Parry

The push in today’s society is towards performance-based testing to ensure a candidate can actually ‘do’ the job and not just recite knowledge about the job.  True performance-based testing involves each individual performing each task while being observed. With our challenging global economy, having both the time and real equipment for each student to perform each task is, most times, cost prohibitive.  Well-written questions that address performance rather than just knowledge can help bridge that gap.

Why is it more difficult to write questions that go beyond checking someone’s knowledge?

Performance-based test items are expensive and time consuming to develop compared to knowledge-based items.  A solution is a hybrid approach which can use a traditional objective-based written test item to check the foundation knowledge at the highest cognitive level possible as a precursor to the actual performance test.  Being able to develop test items at a higher cognitive level is difficult because the designer has to think outside of the traditional multiple choice test item and more towards the performance.

What’s your own experience writing higher-complexity test items?

The Coast Guard requires performance-based testing at all of its entry-level schools to ensure the young men and women going into the field can actually perform the jobs they are trained to do. As the test development manager, I work with the test authors at the schools to help build foundational assessments at various levels of complexity to test the students’ comprehension as close to the actual performance as possible before they actually perform the tasks.

What will be happening during your discussion session?

I will be discussing how both Bloom’s Taxonomy and Gagné’s nine instructional events relate to both complexity and cognitive levels of testing. I’ll share an example of how a simple objective can be tested at just about any level of complexity and couple the example with Bloom’s Taxonomy. Then I’ll ask the participants to develop a test item at all of the complexity and taxonomy levels.

Who do you think will benefit most from this discussion?

Participants who have a basic understanding of test item development will benefit the most, but novice test developers and training managers will benefit as well.

How can people prepare for this session?

Those who attend Sharon Shrock and Bill Coscarelli’s pre-conference workshop on Criterion-Reference Test Development will have a good foundation to build on – but just reading up on the subject of performance-based testing would be helpful, too. I’d like participants to think of a learning objective in their own area of expertise and how they would like to design items that test more than just rote knowledge.

What do you hope your participants will take away?

Participants should be able to walk away with a better understanding of the challenges associated with solid development of test items that test more than memorization of knowledge.

You’ve attended several Users Conferences. What prompts you to return?

I think this is my 7th or 8th Users Conference. Every time I attend, I walk away with a better understanding of testing and how Questionmark technologies are being used in creative ways to enhance the testing experience. The networking opportunities and access to testing experts from around the world are invaluable to me.

Click here to see the complete conference program, and register soon.

Only two out of ten learning techniques have high utility

Posted by John Kleeman

I’m indebted to Kerry Eades at the Oklahoma CareerTech Testing Center for alerting me to a just-published research paper by a team of authors led by John Dunlosky of Kent State University’s Department of Psychology. The paper evaluates 10 learning techniques and the evidence that they are genuinely useful in learning. Techniques are characterized as “high utility” if their benefit is robust and generalizes widely, or “moderate utility” or “low utility” if they are less effective, less general or have insufficient evidence.

They identified two techniques as high utility – practice testing (self-testing or taking practice quizzes) and distributed practice (spacing out practice over time) – and three as having moderate utility: elaborative interrogation and self-explanation (simplistically, both ways of asking why something is so) and interleaved practice (mixing practice up with other things). Five techniques were considered low utility, including summarization, rereading and highlighting.

You can see this schematically in the diagram below.

[Diagram: learning techniques grouped by high, moderate and low utility, as described above]

For those of you attending the Questionmark Users Conference in Baltimore in March, I’ll be sharing more of my understanding of these areas at my session, Assessment Feedback – What can we learn from Psychology Research. If you’re not able to attend the conference, the 55-page paper is well worth reading – it is Improving Students’ Learning With Effective Learning Techniques: Promising Directions From Cognitive and Educational Psychology and is published in the journal Psychological Science in the Public Interest. An online version is here.

The paper looks at over 120 research articles on practice testing/quizzing and finds practice testing has broad applicability:

“effects have been demonstrated across an impressive range of practice-test formats, kinds of material, learner ages, outcome measures, and retention intervals”

The paper also reports evidence that practice testing (and also distributed practice or spacing out learning) works not just in the laboratory but also in representative real-life educational contexts. It also suggests feedback improves the effect:

“Practice testing with feedback also consistently outperforms practice testing alone”

The paper ends by suggesting that there are many factors which contribute to students and others failing to learn, and that improved learning techniques will not on their own improve learning – motivation, for instance, is also important. But the authors suggest that encouraging use of the higher utility techniques (such as practice testing and distributed practice) and discouraging students from using lower utility techniques such as rereading or highlighting would produce meaningful gains in learning.

Early-birds: Conference registration savings end today

Posted by Joan Phaup

The Questionmark 2013 Users Conference will start about six weeks from now, so early-bird sign-ups will end this Friday, January 18th.

If you want to save $100 on your conference registration, now is the time to act!

We are looking forward to seeing customers in Baltimore for three intensive days of learning, professional development and networking. Whether you are just starting out with Questionmark or have many years of experience, this conference truly is the best place to learn about our technologies, improve your assessments and gather knowledge from Questionmark staff, industry experts and fellow learning and assessment professionals.

Some highlights:

  • Charles Jennings’ keynote, Meeting the Challenge of Measuring Informal and Workplace Learning
  • Bring-your-own-laptop sessions on authoring questions and assessments
  • Nine different instructional presentations about the use of Questionmark features and functions
  • Focus groups about future solutions for assessment authoring, delivery and analytics
  • Opportunities to meet one-on-one with Questionmark technicians
  • Case studies and peer discussions
  • Advice from industry experts on everything from instructional design to item analysis
  • Questionmark’s 25th Anniversary Party and other great social events
  • Two optional pre-conference workshops: Boot Camp for Beginners and Criterion-Referenced Test Development

Don’t delay! Sign up today!