Reliability and validity are the keys to trust

[Figure: dartboard with darts scattered all over the board. Caption: "Not reliable"]

Posted by John Kleeman

How can you trust assessment results? The two keys are reliability and validity.

Reliability explained

An assessment is reliable if it measures the same thing consistently and reproducibly. If you were to deliver an assessment with high reliability to the same participant on two occasions, you would be very likely to reach the same conclusions about the participant’s knowledge or skills. A test with poor reliability might produce very different scores across the two instances.

An unreliable assessment does not measure anything consistently and cannot be used as a trustworthy measure of competency. It is useful to picture a dartboard: in the diagram above, the darts have landed all over the board, so they are not reliably in any one place.

For an assessment to be reliable, there needs to be a predictable authoring process, effective beta testing of items, trustworthy delivery on all the devices used to administer the assessment, good-quality post-assessment reporting and effective analytics.
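To make the idea of consistency concrete, here is a minimal sketch in Python (illustrative only, with invented scores, and not tied to any Questionmark feature) that estimates test-retest reliability as the correlation between two administrations of the same assessment:

```python
# Minimal sketch: test-retest reliability estimated as the correlation between
# two administrations of the same assessment (invented scores, for illustration).
from statistics import mean, pstdev

def pearson(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / len(x)
    return cov / (pstdev(x) * pstdev(y))

# Percentage scores for the same ten participants on two occasions.
first_attempt  = [72, 65, 88, 91, 55, 79, 83, 60, 70, 95]
second_attempt = [75, 62, 85, 93, 58, 80, 81, 63, 68, 94]

reliability = pearson(first_attempt, second_attempt)
print(f"Test-retest reliability estimate: {reliability:.2f}")
```

A value near 1.0 suggests the assessment ranks participants consistently; a value near zero is the statistical equivalent of the scattered dartboard.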

Validity explained

[Figure: dartboard with darts clustered in one spot but off target. Caption: "Reliable but not valid"]

Being reliable is not good enough on its own. The darts in the figure above are all in the same place, but not in the right place. A test can be reliable yet not measure what it is meant to measure. For example, you could have a reliable assessment that tested for skill in word processing, but it would not be valid if used to test machine operators, as writing is not one of the key tasks in their jobs.

An assessment is valid if it measures what it is supposed to measure. So if you are measuring competence in a job role, a valid assessment must align with the knowledge, skills and abilities required to perform the tasks expected of a job role. In order to show that an assessment is valid, there must be some formal analysis of the tasks in a job role and the assessment must be structured to match those tasks. A common method of performing such analysis is a job task analysis, which surveys subject matter experts or people in the job role to identify the importance of different tasks.
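As a rough illustration of how a job task analysis can feed into an assessment, the sketch below (with invented tasks and ratings, not a prescribed method) averages SME importance ratings and converts them into suggested content weights:

```python
# Illustrative sketch: average SME importance ratings from a job task analysis
# and convert them into suggested content weights (task names and ratings invented).
sme_ratings = {
    # task: importance ratings from five SMEs on a 1-5 scale
    "Start up equipment":      [5, 4, 5, 5, 4],
    "Operate controls":        [5, 5, 5, 4, 5],
    "Follow safety procedure": [5, 5, 5, 5, 5],
    "Complete shift log":      [3, 2, 3, 3, 2],
}

mean_importance = {task: sum(r) / len(r) for task, r in sme_ratings.items()}
total = sum(mean_importance.values())

print("Suggested share of assessment content per task:")
for task, score in mean_importance.items():
    print(f"  {task:<24} avg importance {score:.1f} -> {score / total:.0%} of items")
```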

Assessments must be reliable AND valid

Trustable assessments must be reliable AND valid.

[Figure: dartboard with darts clustered together on target. Caption: "Reliable and valid"]

The darts in the figure above are all in the same place, and this time it is the right place.

When you are constructing an assessment for competence, you are looking for it to consistently measure the competence required for the job.

Comparison with blood tests

It is helpful to consider what happens if you go to the doctor with an illness. The doctor goes through a process of discovery, analysis, diagnosis and prescription. As part of the discovery process, sometimes the doctor will order a blood test to identify if a particular condition is present, which can diagnose the illness or rule out a diagnosis.

It takes time and resources to do a blood test, but it can be an invaluable piece of information. A great deal of effort goes into making sure that blood tests are both reliable (consistent) and valid (measuring what they are supposed to measure). For example, just like exam results, blood samples are labelled carefully to ensure that patient identification is retained.

A blood test that was not reliable would be dangerous—a doctor might think that a disease is not present when it is. Furthermore, a reliable blood test used for the wrong purpose is not useful—for example, there is no point in having a test for blood glucose level if the doctor is trying to see if a heart attack is imminent.

The blood test results are a single piece of information that helps the doctor make the diagnosis in conjunction with other data from the doctor’s discovery process.

In exactly the same way, a test of competence is an important piece of information to determine if someone is competent in their job role.

Using the blood test metaphor, it is easy to understand the personnel and organizational risks that can result from making decisions based on untrustworthy results. If an organization assesses someone’s knowledge, skill or competence for health and safety or regulatory compliance purposes, it needs to ensure that the assessment is designed correctly and runs consistently, which means the assessment must be reliable and valid.

For assessments to be reliable and valid, it is necessary that you follow structured processes at each step from planning through authoring to delivery and reporting. These processes are explained in our new white paper “Assessment Results You can Trust” and I’ll be sharing some of the content in future articles in this blog.

For fuller information, you can download the white paper.

Item Development – Planning your field test study

Posted by Austin Fossey

Once the items have passed their final editorial review, they are ready to be delivered to participants, but they are not quite ready to be delivered as scored items. For large-scale assessments, it is best practice to deliver your new items as unscored field test items so that you can gather item statistics for review before using the items to count toward a participant’s score. We discussed field test studies in an earlier post, but today we will focus more on the operational aspects of this task.

If you are embedding field test items, there is little you need to do to plan for the field test, other than to collect data on your participants to ensure representativeness and to make sure that enough participants respond to the item to yield stable statistics. You can collect data for representativeness by using demographic questions in Questionmark’s authoring tools.

If you are field testing an entire form, you will need to plan the field test carefully. For a full-form field test, Schmeiser and Welch (Educational Measurement, 4th ed.) recommend testing twice as many items as you will need for your operational form.

To check representativeness, you may want to survey your participants in advance to help you select your participant sample. For example, if your participant population is 60% female and 40% male, but your field test sample is 70% male, then that may impact the validity of your field test results. It will be up to you to decide which factors are relevant (e.g., sex, ethnicity, age, level of education, location, level of experience). You can use Questionmark’s authoring tools and reports to deliver and analyze these survey results.
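One simple way to check representativeness, sketched below with invented numbers, is a chi-square goodness-of-fit test comparing your field test sample’s demographic breakdown against known population proportions:

```python
# Sketch: does the field test sample's sex breakdown match the population?
# Uses a chi-square goodness-of-fit test with invented numbers.
from scipy.stats import chisquare

population_props = {"female": 0.60, "male": 0.40}   # known population mix
sample_counts    = {"female": 45,   "male": 105}    # field test volunteers (70% male)

n = sum(sample_counts.values())
observed = [sample_counts[group] for group in population_props]
expected = [population_props[group] * n for group in population_props]

stat, p_value = chisquare(f_obs=observed, f_exp=expected)
print(f"chi-square = {stat:.1f}, p = {p_value:.4f}")
if p_value < 0.05:
    print("Sample looks unrepresentative; consider recruiting or weighting.")
```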

You will also need to entice participants to take your field test. Most people will not want to take a test if they do not have to, but you will likely want to conduct the field test expeditiously. You may want to offer an incentive to test, but that incentive should not bias the results.

For example, I worked on a certification assessment where the assessment cost participants several hundred dollars. To incentivize participation in the field test study of multiple new forms, we offered the assessment free of charge and told participants that their results would be scored once the final forms were assembled. We surveyed volunteers and selected a representative sample to field test each of the forms.

The number of responses you need for each item will depend on your scoring model and your organization’s policies. If using Classical Test Theory, some organizations will feel comfortable with 80 – 100 responses, but Item Response Theory models may require 200 – 500 responses to yield stable item parameters.
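To see why those response counts matter, here is a small illustrative calculation (invented data) of a classical item p-value and its standard error; with only 80 responses, the uncertainty around the difficulty estimate is still quite wide:

```python
# Sketch: classical item difficulty (p-value) and its standard error,
# to show how much uncertainty remains with a modest field test sample.
import math

def item_p_and_se(scores):
    """scores: list of 0/1 responses to one item."""
    n = len(scores)
    p = sum(scores) / n
    se = math.sqrt(p * (1 - p) / n)   # standard error of a proportion
    return p, se

# 80 field-test responses (invented): 52 correct, 28 incorrect.
scores = [1] * 52 + [0] * 28
p, se = item_p_and_se(scores)
print(f"p-value = {p:.2f}, SE = {se:.3f}, "
      f"approx 95% CI = [{p - 1.96 * se:.2f}, {p + 1.96 * se:.2f}]")
```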

More is always better, but it is not always possible. For instance, if an assessment is for a very small population, you may not have very many field test participants. You will still be able to use the item statistics, but they should be interpreted cautiously in conjunction with their standard errors. In the next post, we will talk about interpreting item statistics in the psychometric review following the field test.

Get trustable results: Require a topic score as a prerequisite to pass a test

Posted by John Kleeman

If you are taking an assessment to prove your competence as a machine operator, and you get all the questions right except the health and safety ones, should you pass the assessment? Probably not. Some topics can be more important than others, and assessment results should reflect that fact.

In most assessments, it is enough to define a pass or cut score: all that is required to pass the assessment is for the participant to achieve the passing score or higher. The logic is that success on one item can make up for failure on another, so skills in one area are substitutable for skills in another. However, there are other assessments where some skills or knowledge are critical, and here you might want to require a passing score, or even a 100% score, in the key or “golden” topics as well as a pass score for the test as a whole.

This is easy to set up in Questionmark when you author your assessments. When you create the assessment outcome that defines passing the test, you define some topic prerequisites.

Here is an illustrative example, showing 4 topics. As well as achieving the pass score on the test, the participant must achieve 60% in three topics: “Closing at end of day”, “Operations” and “Starting up”, and 100% in one topic: “Safety”.

[Screenshot: topic prerequisites set up in assessment authoring]
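The underlying pass rule can be sketched in a few lines of Python. This is just an illustration of the logic, not Questionmark’s implementation, and the overall cut score of 70% is assumed purely for the example:

```python
# Illustration of the pass rule described above: the participant must reach the
# overall cut score AND every topic prerequisite. Not Questionmark code; the
# overall cut score of 70% is assumed for the example.
OVERALL_CUT = 70
TOPIC_PREREQUISITES = {
    "Closing at end of day": 60,
    "Operations": 60,
    "Starting up": 60,
    "Safety": 100,   # the "golden" topic
}

def passes(overall_score, topic_scores):
    if overall_score < OVERALL_CUT:
        return False
    return all(topic_scores.get(topic, 0) >= required
               for topic, required in TOPIC_PREREQUISITES.items())

result = passes(82, {"Closing at end of day": 75, "Operations": 90,
                     "Starting up": 70, "Safety": 80})
print("Pass" if result else "Fail")   # Fail: Safety is below the required 100%
```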

If you need to ensure that participants don’t pass a test unless they have achieved scores in certain topics, topic prerequisites are the way to achieve this.

Trustworthy Assessment Results – A Question of Transparency

Posted by Austin Fossey

Do you trust the results of your test? As with many questions in psychometrics, the answer is that it depends. Like the trust between two people, the trustworthiness of assessment results has to be earned by the testing body.

Many of us want to implicitly trust the testing body, be it a certification organization, a department of education, or our HR department. When I fill a car with gas, I don’t want to have to siphon the gas out to make sure the amount of gas matches the volume on the pump—I just assume it’s accurate. We put the same faith in our testing bodies.

Just as gas pumps are certified and periodically calibrated, many high-stakes assessment programs are also reviewed. In the U.S., state testing programs are reviewed by the U.S. Department of Education, peer review groups, and technical advisory boards. Certification and licensure programs are sometimes reviewed by third-party accreditation programs, though these accreditations usually only look to see that certain requirements are met without evaluating how well they were executed.

In her op-ed, “Can We Trust Assessment Results?”, Eva Baker argues that the trustworthiness of assessment results is dependent on the transparency of the testing program. I agree with her. Participants should be able to easily get information on the purpose of the assessment, the content that is covered, and how the assessment was developed. Baker also adds that appropriate validity studies should be conducted and shared. I was especially pleased to see Baker propose that “good transparency occurs when test content can be clearly summarized without giving away the specific questions.”

For test results to be trustworthy, transparency also needs to extend beyond the development of the assessment to include its maintenance. Participants and other stakeholders should have confidence that the testing body is monitoring its assessments, and that a plan is in place should their results become compromised.

In their article, “Cheating: Its Implications for ABFM Examinees,” Kenneth Royal and James Puffer discuss cases where widespread cheating affects the statistics of the assessment, which in turn mislead test developers by making items appear easier. The effect can be an assessment that yields invalid results. Though specific security measures should be kept confidential, testing bodies should have a public-facing security plan that explains their policies for addressing improprieties. This plan should address policies for the participants as well as for how the testing body will handle test design decisions that have been impacted by compromised results.

Even under ideal circumstances, mistakes can happen. Readers may recall that, in 2006, thousands of students received incorrect scores on the SAT, arguably one of the best-developed and most carefully scrutinized assessments in U.S. education. The College Board (the testing body that runs the SAT) handled the situation as well as it could, publicly sharing the impact of the issue, the reasons it happened, and its policies for handling the incorrect results. Others may feel differently, but I trust SAT scores more now that I have observed how the College Board communicated and rectified the mistake.

Most testing programs are well-run, professional operations backed by qualified teams of test developers, but there are occasional junk testing programs, such as predatory certificate programs, that yield useless, untrustworthy results. It can be difficult to tell the difference, but like Eva Baker, I believe that organizational transparency is the right way for a testing body to earn the trust of its stakeholders.

Applications of confidence intervals in a psychometric context

Posted by Greg Pope

I have always been a fan of confidence intervals. Some people are fans of sports teams; for me, it’s confidence intervals! I find them really useful in assessment reporting contexts, all the way from item and test analysis psychometrics to participant reports.

Many of us get exposure to the practical use of confidence intervals via the media, when survey results are quoted. For example: “Of the 1,000 people surveyed, 55% said they will vote for John Doe. The margin of error for the survey was plus or minus 5%, 95 times out of 100.” This is saying that the “observed” percentage of people who say they will vote for Mr. Doe is 55%, and there is a 95% chance that the “true” percentage of people who will vote for John Doe is somewhere between 50% and 60%.

Sample size is a big factor in the margin of error: generally, the larger the sample, the smaller the margin of error, as we get closer to representing the population. (We can’t survey all 307,006,550 or so people in the US, can we!) So if the sample were 10,000 instead of 1,000, we would expect the margin of error to be smaller than plus or minus 5%.
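If you want to see the relationship between sample size and margin of error directly, here is a short sketch using the standard formula for the margin of error of a proportion (the 55% figure is taken from the example above):

```python
# Sketch: 95% margin of error for a survey proportion at different sample sizes,
# using the 55% figure from the example above.
import math

def margin_of_error(p, n, z=1.96):
    return z * math.sqrt(p * (1 - p) / n)

for n in (100, 1_000, 10_000):
    print(f"n = {n:>6}: 55% plus or minus {margin_of_error(0.55, n):.1%}")
# A tenfold increase in sample size shrinks the margin by a factor of about
# three (the square root of ten).
```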

These concepts are relevant in an assessment context as well. You may remember my previous post on Classical Test Theory and reliability, in which I explained that an observed test score (the score a participant achieves on an assessment) is composed of a true score and error. In other words, the observed score that a participant achieves is not 100% accurate; there is always error in the measurement. What this means practically is that if a participant achieves 50% on an exam, their true score could actually be somewhere between, say, 44% and 56%.

This notion that observed scores are not absolute has implications for verifying what participants know and can do. For example, a participant who achieves 50% on a crane certification exam (on which the pass score is 50%) would pass the exam, be certified, and be able to hop into a crane, moving stuff up and down and around. However, a score right on the borderline means this person might not pass the exam if he or she were to take it again, and may not really know enough to operate a crane. His or her supervisor might not feel very confident about letting this person operate that crane!

To deal with the inherent uncertainty around observed scores, some organizations factor this margin of error in when setting the cut score…but this is another fun topic that I touched on in another post. I believe a best practice is to incorporate a confidence interval into the reporting of scores for participants in order to recognize that the score is not an “absolute truth” and is an estimate of what a person knows and can do. A simple example of a participant report I created to demonstrate this shows a diamond that encapsulates the participant score; the vertical height of the diamond represents the confidence interval around the participant’s score.
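For readers who want to try this, here is a small sketch of how such a confidence band could be computed from the classical standard error of measurement, SEM = SD x sqrt(1 - reliability); the standard deviation and reliability values are assumed purely for illustration:

```python
# Sketch: a confidence band around an observed score using the classical
# standard error of measurement, SEM = SD * sqrt(1 - reliability).
# The SD of 10 points and reliability of 0.90 are assumed for illustration.
import math

def score_interval(observed, sd, reliability, z=1.96):
    sem = sd * math.sqrt(1 - reliability)
    return observed - z * sem, observed + z * sem, sem

low, high, sem = score_interval(observed=50, sd=10, reliability=0.90)
print(f"SEM = {sem:.1f} points")
print(f"Observed score of 50% -> 95% band of roughly {low:.0f}% to {high:.0f}%")
# With these assumed values the band works out to about 44% to 56%, the kind
# of range described above.
```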

In some of my previous posts I talked about how sample size affects the robustness of item level statistics like p-values and item-total correlation coefficients and provided graphics showing the confidence interval ranges for the statistics based on sample sizes. I believe confidence intervals are also very useful in this psychometric context of evaluating the performance of items and tests. For example, often when we see a p-value for a question of 0.600 we incorrectly accept this as the “truth” that 60% of participants got the question right. In actual fact, this p-value of 0.600 is an observation and the “true” p-value could actually be between 0.500 and 0.700, a big difference when we are carefully choosing questions to shape our assessment!

With the holiday season fast approaching, perhaps Santa has a confidence interval in his sack for you and your organization to apply to your assessment results reporting and analysis!

Results Management System Quiz: Test your knowledge!

Posted by Greg Pope

Organizations involved in medium and high-stakes testing must employ sound test development, administration and scoring processes to help ensure fair, reliable and valid assessments.

Knowledge Check

But despite everyone’s best efforts, there are times when it’s necessary to review and potentially modify test results to provide information and certificates that fairly reflect what was being measured. That’s where the Questionmark RMS, or Results Management System, comes in: it enables organizations to analyze, edit and publish assessment results in an informed and defensible way.

I have created a quiz on RMS to test your knowledge. Take the quiz and see how well you do. All the answers to the questions are available on the Questionmark web site, so if you study hard you can get a perfect score and impress your friends and colleagues. Good luck!
