Free eBook: Using Assessments for Compliance

Posted by Chloe Mendonca

Every organisation needs to assess its workforce, whether to check competence, knowledge of company procedures and the law, awareness of health and safety guidelines, or product knowledge. Assessments are the most reliable and cost-effective way of doing so.

Without regular testing, how do you know what your employees know? And in the case of an audit or an emergency, is it good enough to have had the participant sign off saying that they’ve attended training and understand the content?

With increasing regulatory requirements, compliance is becoming more and more of a priority for many organisations. However, due to the challenges of setting up an effective assessment program, many organisations aren’t doing enough to demonstrate compliance.

Questionmark has just published a new eBook, Using Assessments for Compliance*, providing tips and recommendations for each stage of assessment development.

The eBook covers:

  • The rationale for assessments in compliance
  • The business benefits
  • Specific applications of useful assessments within a compliance program
  • Best practice recommendations covering the entire assessment lifecycle
    • Planning
    • Deployment
    • Authoring
    • Delivery
    • Analytics

Click here to get your copy of the free eBook.*

*Available in a variety of formats (PDF, ePub, MOBI) for various eReaders.

G Theory and Reliability for Assessments with Randomly Selected Items

Posted by Austin Fossey

One of our webinar attendees recently emailed me to ask if there is a way to calculate reliability when items are randomly selected for delivery in a classical test theory (CTT) model.

As with so many things, the answer comes from Lee Cronbach—but it’s not Cronbach’s Alpha. In 1963, Cronbach, along with Goldine Gleser and Nageswari Rajaratnam, published a paper on generalizability theory, which is often called G theory for brevity or to sound cooler. G theory is a very powerful set of tools, but today I am focusing on one aspect of it: the generalizability coefficient, which describes the degree to which observed scores might generalize to a broader set of measurement conditions. This is helpful when the conditions of measurement will change for different participants, as is the case when we use different items, different raters, different administration dates, etc.

In G theory, measurement conditions are called facets. A facet might include items, test forms, administration occasions, or human raters. Facets can be random (i.e., they are a sample of a much larger population of potential facets), or they might be fixed, such as a condition that is controlled by the researcher. The hypothetical set of conditions across all possible facets is called, quite grandly, the universe of generalization. A participant’s average measurement across the universe of generalization is called their universe score, which is similar to a true score in CTT, except that we no longer need to assume that all measurements in the universe of generalization are parallel.

In CTT, the concept of reliability is defined as the ratio of true score variance to observed score variance. Observed scores are just true scores plus measurement error, so as measurement error decreases, reliability increases toward 1.00.
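
As a quick illustration (using the standard CTT decomposition of an observed score into a true score plus error), that definition can be written as:

```latex
% CTT reliability: the ratio of true score variance to observed score variance,
% where X = T + E (observed score = true score + error)
\rho_{XX'} = \frac{\sigma^2_T}{\sigma^2_X} = \frac{\sigma^2_T}{\sigma^2_T + \sigma^2_E}
```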

The generalizability coefficient is defined as the ratio of universe score variance to expected observed score variance, which parallels the concept of reliability in CTT. The generalizability coefficient is made up of variance components, which differ depending on the design of the study and which can be derived from an analysis of variance (ANOVA) summary table. We will not get into the math here, but I recommend Linda Crocker and James Algina’s Introduction to Classical and Modern Test Theory for a great introduction and easy-to-follow examples of how to calculate generalizability coefficients under multiple conditions. For now, let’s return to our randomly selected items.
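
As a sketch of what this looks like in the simplest case (a persons-by-items crossed design in which every participant answers every item; the notation follows common G theory conventions rather than any one source), the generalizability coefficient is:

```latex
% Generalizability coefficient for a p x i crossed design:
% sigma^2_p     = universe score (participant) variance
% sigma^2_{pi,e} = residual variance (person-by-item interaction confounded with error)
% n_i           = number of items on the assessment
E\rho^2 = \frac{\sigma^2_p}{\sigma^2_p + \sigma^2_{pi,e} / n_i}
```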

In his chapter in Educational Measurement, 4th Edition, Edward Haertel illustrated the overlaps between G theory and CTT reliability measures. When all participants see the same items, the generalizability coefficient is made up of the variance components for the participants and for the residual scores, and it yields the exact same value as Cronbach’s Alpha. If the researcher wants to use the generalizability coefficient to generalize to an assessment with more or fewer items, then the result is the same as the Spearman-Brown formula.
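
For reference, the Spearman-Brown formula mentioned above predicts the reliability of an assessment whose length changes by a factor of k:

```latex
% Spearman-Brown prophecy formula: rho is the reliability of the current
% assessment and k is the factor by which the number of items is multiplied
\rho_{k} = \frac{k\rho}{1 + (k - 1)\rho}
```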

But when our participants are each given a random set of items, they are no longer receiving parallel assessments. The generalizability coefficient has to be modified to include a variance component for the items, and the observed score variance is now a function of three things:

  • Error variance.
  • Variance in the item mean scores.
  • Variance in the participants’ universe scores.

Note that error variance is not the same as measurement error in CTT. In the case of a randomly generated assessment, the error variance includes measurement error and an extra component that reflects the lack of perfect correlation between the items’ measurements.
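
Continuing the sketch from earlier (same notation, treating each participant’s items as a random sample of n'_i items from the pool, so the item variance component joins the error term), the modified coefficient takes a form like:

```latex
% Generalizability coefficient when items are randomly selected per participant:
% sigma^2_i      = item variance (differences in item mean scores)
% sigma^2_{pi,e} = residual variance (interaction confounded with error)
% n'_i           = number of items each participant answers
E\rho^2_{random} = \frac{\sigma^2_p}{\sigma^2_p + (\sigma^2_i + \sigma^2_{pi,e}) / n'_i}
```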

For those of you randomly selecting items, this makes a difference! Cronbach’s Alpha may yield low or even meaningless results (e.g., negative values) when items are randomly selected. In an example dataset, 1,000 participants answered the same 200 items. For this assessment, Cronbach’s Alpha is equivalent to the generalizability coefficient: 0.97. But if each of those participants had instead answered 50 randomly selected items from the same set, Cronbach’s Alpha would no longer be appropriate. If we tried to use it anyway, we would see a depressing number: 0.50. The generalizability coefficient, however, is 0.65: still too low, but better than the alpha value.
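
To make the comparison concrete, here is a minimal Python sketch of how the variance components and both coefficients could be estimated from a complete person-by-item score matrix. It is not the exact analysis behind the numbers above (those come from the author's dataset), and the function name and arguments are illustrative.

```python
import numpy as np

def g_coefficients(responses, n_items_delivered=None):
    """Estimate p x i variance components from a complete scored-response
    matrix (rows = participants, columns = items), then return two values:
    the coefficient when everyone answers the same items (equal to
    Cronbach's Alpha for the full item set) and the modified coefficient
    for randomly delivering n_items_delivered items per participant."""
    X = np.asarray(responses, dtype=float)
    n_p, n_i = X.shape
    n_prime = n_items_delivered or n_i  # items each participant answers

    grand_mean = X.mean()
    person_means = X.mean(axis=1)
    item_means = X.mean(axis=0)

    # Sums of squares for a two-way crossed design, one observation per cell
    ss_p = n_i * np.sum((person_means - grand_mean) ** 2)
    ss_i = n_p * np.sum((item_means - grand_mean) ** 2)
    ss_res = np.sum((X - grand_mean) ** 2) - ss_p - ss_i

    ms_p = ss_p / (n_p - 1)
    ms_i = ss_i / (n_i - 1)
    ms_res = ss_res / ((n_p - 1) * (n_i - 1))

    # Estimated variance components
    var_res = ms_res                   # person-by-item interaction + error
    var_p = (ms_p - ms_res) / n_i      # universe score variance
    var_i = (ms_i - ms_res) / n_p      # item variance (item mean differences)

    # Same items for everyone (equals Cronbach's Alpha when n_prime == n_i;
    # otherwise it matches the Spearman-Brown adjustment to n_prime items)
    same_items = var_p / (var_p + var_res / n_prime)

    # Randomly selected items: item variance joins the error term
    random_items = var_p / (var_p + (var_i + var_res) / n_prime)

    return same_items, random_items
```

For example, g_coefficients(scores) gives the coefficient for a fixed form using the whole pool, while g_coefficients(scores, n_items_delivered=50) estimates the coefficient for delivering 50 randomly selected items from that pool.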

Finally, it is important to report your results accurately. According to the Standards for Educational and Psychological Testing, you can report generalizability coefficients as reliability evidence if it is appropriate for the design of the assessment, but it is important not to use these terms interchangeably. Generalizability is a distinct concept from reliability, so make sure to label it as a generalizability coefficient, not a reliability coefficient. Also, the Standards require us to document the sources of variance that are included (and excluded) from the calculation of the generalizability coefficient. Readers are encouraged to refer to the Standards’ chapter on reliability and precision for more information.

Is There Value in Reporting Changes in Subscores?

Posted by Austin Fossey

I had the privilege of meeting with an organization that is reporting subscores to show how their employees are improving across multiple areas of their domain, as determined by an assessment given before and after training. They have developed some slick reports to show these scores, including the participant’s first score, second score (after training is complete), and the change in those scores.

At first glance, these reports are pretty snazzy and seem to suggest huge improvements resulting from the training, but looks can be deceiving. I immediately noticed one participant had made a subscore gain of 25%, which sounds impressive—like he or she is suddenly 25% better at the tasks in that domain—but here is the fine print: that subscore was measured with only four items. To put it another way, that 25% improvement means that the participant answered one more item correctly. Other subscores were similarly underrepresented—most with four or fewer items in their topic.

In a previous post, I reported on an article by Richard Feinberg and Howard Wainer about how to determine if a subscore is worth reporting. My two loyal readers (you know who you are) may recall that a reported subscore has to be reliable, and it must contain information that is sufficiently different from the information contained in the assessment’s total score (AKA “orthogonality”).

In an article titled Comments on “A Note on Subscores” by Samuel A. Livingston, Sandip Sinharay and Shelby Haberman responded to a critique suggesting that their previous work (which informed Feinberg and Wainer’s proposed Value Added Ratio (VAR) metric) implied that subscores should never be reported when examining changes across administrations. Sinharay and Haberman explained that in these cases, one should examine the suitability of the change scores, not the subscores themselves. One may then find that the change scores are suitable for reporting.

A change score is the difference in scores from one administration to the next. If a participant gets a subscore of 12 on their first assessment and a subscore of 30 on their next assessment, their change score for that topic is 18. This can then be thought of as the subscore of interest, and one can then evaluate whether or not this change score is suitable for reporting.

Change scores are also used to determine if a change in scores is statistically significant for a group of participants. If we want to know whether a group of participants is performing statistically better on an assessment after completing training (at a total score or subscore level), we do not compare average scores on the two tests. Instead, we look to see if the group’s change scores across the two tests are significantly greater than zero. This is typically analyzed with a dependent samples t-test.
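
As a minimal sketch of that analysis (the scores below are made up purely for illustration), a dependent samples t-test on the change scores might look like this in Python:

```python
import numpy as np
from scipy import stats

# Hypothetical subscores for the same participants before and after training,
# listed in the same participant order
pre_scores = np.array([12, 15, 9, 20, 14, 11, 17, 13])
post_scores = np.array([18, 16, 14, 25, 15, 13, 22, 12])

# Change score: the difference for each participant across administrations
change_scores = post_scores - pre_scores

# Dependent (paired) samples t-test; halve the two-sided p-value for the
# one-sided question "are change scores significantly greater than zero?"
t_stat, p_two_sided = stats.ttest_rel(post_scores, pre_scores)
p_one_sided = p_two_sided / 2 if t_stat > 0 else 1 - p_two_sided / 2

print(f"mean change = {change_scores.mean():.2f}, "
      f"t = {t_stat:.2f}, one-sided p = {p_one_sided:.3f}")
```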

The reliability, orthogonality, and significance of changes in subscores are statistical concerns, but scores must be interpretable and actionable to make a claim about the validity of the assessment. This raises the concern of domain representation. Even if the statistics are fine, a subscore cannot be meaningful if the items do not sufficiently represent the domain they are supposed to measure. Making an inference about a participant’s ability in a topic based on only four items is preposterous—you do not need to know anything about statistics to come to that conclusion.

To address the concern of domain representation, high-stakes assessment programs that report subscores will typically set a minimum for the number of items that are needed to sufficiently represent a topic before a subscore is reported. For example, one program I worked for required (perhaps somewhat arbitrarily) a minimum of eight items in a topic before generating a subscore. If this domain representation criterion is met, one can presumably use methods like the VAR to then determine if the subscores meet the statistical criteria for reporting.
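
As an illustration only (the eight-item minimum above was one program’s choice, not a universal standard, and the function below is hypothetical), a reporting pipeline could enforce such a domain representation check before generating subscores:

```python
MIN_ITEMS_PER_TOPIC = 8  # illustrative threshold; each program sets its own

def reportable_subscores(subscores, items_per_topic, min_items=MIN_ITEMS_PER_TOPIC):
    """Suppress subscores for topics without enough items to represent their domain."""
    return {topic: score
            for topic, score in subscores.items()
            if items_per_topic.get(topic, 0) >= min_items}

# The four-item topic is suppressed; the ten-item topic is reported.
print(reportable_subscores({"Safety": 0.75, "Procedures": 0.90},
                           {"Safety": 4, "Procedures": 10}))
```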

Scary Findings: Proctors often involved in test-center cheating

Posted by John Kleeman

Over Halloween, I’ve been reviewing how often it seems that test center administrators or proctors have been shown to help candidates cheat at exams. It’s scary how often this appears to happen.

Just a couple of weeks ago, a BBC television investigation reported widespread cheating at UK test centers where construction workers and builders were certified on health and safety. The BBC’s undercover footage showed a test center director reading exam answers from a big screen, instructing candidates:

“Follow me on screen, guys. I’m going to shout the correct answer, you just click. We’re going to make a couple of mistakes – what I don’t want is everyone making the same mistake.”

The sad thing is that construction is a dangerous occupation: the BBC reports that 221 workers in the UK construction sector died on the job in the past five years. It’s very worrying that corrupt test centers that facilitate cheating on health and safety tests are likely contributing to this toll.

Another scary example comes from a recent US court case in which a decorated former police officer in San Francisco was sentenced to two years in jail for taking bribes from taxi drivers to give them passing grades regardless of how they actually performed on the test. These are just a couple of examples I happened to see this weekend. See my earlier blog entry Online or test center proctoring: Which is more secure? for several other examples of test center fraud.

So what is the answer? Part of the solution, as I argued in What is the best way to reduce cheating?, is to remove people’s rationalization to cheat. Most people think of themselves as good, honest people, and if you communicate the aims of the test and take other measures to make people see the test as fair, fewer of them are likely to cheat.

Another approach is to do what Cambodia has been doing and throw a lot of resources into preventing cheating. According to this article, the government’s anti-corruption unit has been focusing on university exams, enlisting 2,000 volunteers to help monitor last summer’s exams and prevent collusion between proctors and students.

Of course, the vast majority of tests at test centers are entirely legitimate, and reputable test center providers do all they can to prevent face-to-face proctors from colluding with candidates. But there do seem to be two persistent problems:

  1. Some proctors are keen to help their local candidates.
  2. The financial stakes involved in passing a test mean that when candidate and proctor meet face-to-face, there is an ever-present risk of corruption.

I strongly suspect online proctoring is part of the solution here. The main argument for online proctoring is that candidates do not need to travel to a test center (see Online or test center proctoring: Which is best?). But there is an important side benefit: candidate and proctor never meet, and all their communications can be recorded. Without a face-to-face meeting and without a local connection, the likelihood of collusion drops, so this kind of cheating is much less probable. Now, that’s a non-scary solution that has some promise.