Free eBook: Using Assessments for Compliance

Posted by Chloe Mendonca

Every organisation needs to assess its workforce — whether to check competence, knowledge of company procedures, understanding of the law and health and safety guidelines, or product knowledge — and assessments are the most reliable and cost-effective way of doing so.

Without regular testing, how do you know what your employees know? And in the case of an audit or an emergency, is it good enough to have had the participant sign off saying that they’ve attended training and understand the content?

With increasing regulatory requirements, compliance is becoming more and more of a priority for many organisations. However, due to the challenges of setting up an effective assessment program, many organisations aren’t doing enough to demonstrate compliance.

Questionmark has just published a new eBook, Using Assessments for Compliance,* providing tips and recommendations for the various stages of assessment development.

The eBook covers:

  • The rationale for assessments in compliance
  • The business benefits
  • Specific applications of useful assessments within a compliance program
  • Best practice recommendations covering the entire assessment lifecycle
    • Planning
    • Deployment
    • Authoring
    • Delivery
    • Analytics

Click here to get your copy of the free eBook.*

*Available in a variety of formats (PDF, ePub, MOBI) for various eReaders.

G Theory and Reliability for Assessments with Randomly Selected Items

Posted by Austin Fossey

One of our webinar attendees recently emailed me to ask if there is a way to calculate reliability when items are randomly selected for delivery in a classical test theory (CTT) model.

As with so many things, the answer comes from Lee Cronbach—but it’s not Cronbach’s Alpha. In 1963, Cronbach, along with Goldine Gleser and Nageswari Rajaratnam, published a paper on generalizability theory, which is often called G theory for brevity or to sound cooler. G theory is a very powerful set of tools, but today I am focusing on one aspect of it: the generalizability coefficient, which describes the degree to which observed scores might generalize to a broader set of measurement conditions. This is helpful when the conditions of measurement will change for different participants, as is the case when we use different items, different raters, different administration dates, etc.

In G theory, measurement conditions are called facets. A facet might include items, test forms, administration occasions, or human raters. Facets can be random (i.e., they are a sample of a much larger population of potential facets), or they might be fixed, such as a condition that is controlled by the researcher. The hypothetical set of conditions across all possible facets is called, quite grandly, the universe of generalization. A participant's average measurement across the universe of generalization is called their universe score, which is similar to a true score in CTT, except that we no longer need to assume that all measurements in the universe of generalization are parallel.

In CTT, the concept of reliability is defined as the ratio of true score variance to observed score variance. Observed scores are just true scores plus measurement error, so as measurement error decreases, reliability increases toward 1.00.
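
In symbols (standard CTT notation, included here for reference rather than quoted from a particular source):

$$
\rho_{XX'} \;=\; \frac{\sigma^2_T}{\sigma^2_X} \;=\; \frac{\sigma^2_T}{\sigma^2_T + \sigma^2_E},
$$

where $\sigma^2_T$ is true-score variance, $\sigma^2_E$ is error variance, and $\sigma^2_X = \sigma^2_T + \sigma^2_E$ is observed-score variance.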

The generalizability coefficient is defined as the ratio of universe score variance to expected score variance, which is similar to the concept of reliability in CTT. The generalizability coefficient is made of variance components, which differ depending on the design of the study, and which can be derived from an analysis of variance (ANOVA) summary table. We will not get into the math here, but I recommend Linda Crocker and James Algina’s Introduction to Classical and Modern Test Theory for a great introduction and easy-to-follow examples of how to calculate generalizability coefficients under multiple conditions. For now, let’s return to our randomly selected items.

In his chapter in Educational Measurement, 4th Edition, Edward Haertel illustrated the overlaps between G theory and CTT reliability measures. When all participants see the same items, the generalizability coefficient is made up of the variance components for the participants and for the residual scores, and it yields the exact same value as Cronbach’s Alpha. If the researcher wants to use the generalizability coefficient to generalize to an assessment with more or fewer items, then the result is the same as the Spearman-Brown formula.
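
For reference, the Spearman-Brown prophecy formula (stated here in generic notation) projects the reliability of an assessment whose length is changed by a factor of $k$:

$$
\rho_k \;=\; \frac{k\,\rho}{1 + (k-1)\,\rho},
$$

where $\rho$ is the reliability of the original-length assessment ($k > 1$ for a longer form, $k < 1$ for a shorter one).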

But when our participants are each given a random set of items, they are no longer receiving parallel assessments. The generalizability coefficient has to be modified to include a variance component for the items, and the observed score variance is now a function of three things:

  • Error variance.
  • Variance in the item mean scores.
  • Variance in the participants’ universe scores.

Note that error variance is not the same as measurement error in CTT. In the case of a randomly generated assessment, the error variance includes measurement error and an extra component that reflects the lack of perfect correlation between the items’ measurements.
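
To sketch the contrast in symbols (the exact form depends on the study design, so treat this as illustrative rather than a definitive derivation):

$$
\text{Same items for everyone:}\qquad E\rho^2 \;=\; \frac{\sigma^2_p}{\sigma^2_p + \sigma^2_{pi,e}/n_i}
$$

$$
\text{Randomly sampled items:}\qquad E\rho^2 \;\approx\; \frac{\sigma^2_p}{\sigma^2_p + \left(\sigma^2_i + \sigma^2_{pi,e}\right)/n_i}
$$

where $\sigma^2_p$ is the participants' universe-score variance, $\sigma^2_i$ is the item variance, $\sigma^2_{pi,e}$ is the residual (which absorbs measurement error), and $n_i$ is the number of items each participant answers. The first expression yields the same value as Cronbach's Alpha; the second includes the item variance component described above.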

For those of you randomly selecting items, this makes a difference! Cronbach's Alpha may yield low or even meaningless results (e.g., negative values) when items are randomly selected. In an example dataset, 1,000 participants answered the same 200 items. For this assessment, Cronbach's Alpha is equivalent to the generalizability coefficient: 0.97. But if each of those participants had answered 50 randomly selected items from the same set, Cronbach's Alpha would no longer be appropriate. If we tried to use it anyway, we would see a depressing number: 0.50. The generalizability coefficient, however, is 0.96. Thus we can randomly deliver a quarter of the full item set to these participants with very little loss of generalizability.
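
For readers who want to experiment, here is a minimal Python sketch of the calculation under a crossed persons-by-items design. The simulated data and helper functions are illustrative only (not the dataset or tooling behind the numbers above), so the results will differ from the 0.97/0.50/0.96 example.

```python
# Illustrative sketch only: simulated data, not the dataset described in the post.
# Estimates Cronbach's Alpha and persons-x-items generalizability coefficients
# from a complete score matrix using ANOVA-style variance components.
import numpy as np

rng = np.random.default_rng(0)

# Simulate a persons x items matrix of dichotomous scores (complete, crossed design)
n_p, n_i = 1000, 200
ability = rng.normal(0, 1, n_p)[:, None]
difficulty = rng.normal(0, 1, n_i)[None, :]
scores = (rng.random((n_p, n_i)) < 1 / (1 + np.exp(-(ability - difficulty)))).astype(float)

def cronbach_alpha(x):
    """Coefficient alpha for a persons x items score matrix."""
    k = x.shape[1]
    item_vars = x.var(axis=0, ddof=1)
    total_var = x.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - item_vars.sum() / total_var)

def variance_components(x):
    """Estimate person, item, and residual variance components (p x i design, no replication)."""
    num_p, num_i = x.shape
    grand = x.mean()
    person_means = x.mean(axis=1)
    item_means = x.mean(axis=0)
    ms_p = num_i * ((person_means - grand) ** 2).sum() / (num_p - 1)
    ms_i = num_p * ((item_means - grand) ** 2).sum() / (num_i - 1)
    ss_res = ((x - person_means[:, None] - item_means[None, :] + grand) ** 2).sum()
    ms_res = ss_res / ((num_p - 1) * (num_i - 1))
    var_res = ms_res
    var_p = (ms_p - ms_res) / num_i
    var_i = (ms_i - ms_res) / num_p
    return var_p, var_i, var_res

var_p, var_i, var_res = variance_components(scores)
n_items_delivered = 50  # e.g., each participant answers a random 50-item subset

# Same items for everyone: equals coefficient alpha for the full 200-item form
g_same = var_p / (var_p + var_res / n_i)
# Randomly sampled items: item variance enters the error term
g_random = var_p / (var_p + (var_i + var_res) / n_items_delivered)

print(f"alpha = {cronbach_alpha(scores):.3f}, G(same items) = {g_same:.3f}, "
      f"G({n_items_delivered} random items) = {g_random:.3f}")
```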

Finally, it is important to report your results accurately. According to the Standards for Educational and Psychological Testing, you can report generalizability coefficients as reliability evidence if it is appropriate for the design of the assessment, but it is important not to use these terms interchangeably. Generalizability is a distinct concept from reliability, so make sure to label it as a generalizability coefficient, not a reliability coefficient. Also, the Standards require us to document the sources of variance that are included (and excluded) from the calculation of the generalizability coefficient. Readers are encouraged to refer to the Standards’ chapter on reliability and precision for more information.

Is There Value in Reporting Changes in Subscores?

Posted by Austin Fossey

I had the privilege of meeting with an organization that is reporting subscores to show how their employees are improving across multiple areas of their domain, as determined by an assessment given before and after training. They have developed some slick reports to show these scores, including the participant’s first score, second score (after training is complete), and the change in those scores.

At first glance, these reports are pretty snazzy and seem to suggest huge improvements resulting from the training, but looks can be deceiving. I immediately noticed one participant had made a subscore gain of 25%, which sounds impressive—like he or she is suddenly 25% better at the tasks in that domain—but here is the fine print: that subscore was measured with only four items. To put it another way, that 25% improvement means that the participant answered one more item correctly. Other subscores were similarly underrepresented—most with four or fewer items in their topic.

In a previous post, I reported on an article by Richard Feinberg and Howard Wainer about how to determine if a subscore is worth reporting. My two loyal readers (you know who you are) may recall that a reported subscore has to be reliable, and it must contain information that is sufficiently different from the information contained in the assessment’s total score (AKA “orthogonality”).

In an article titled Comments on “A Note on Subscores” by Samuel A. Livingston, Sandip Sinharay and Shelby Haberman defended against a critique that their previous work (which informed Feinberg and Wainer’s proposed Value Added Ratio (VAR) metric) indicated that subscores should never be reported when examining changes across administrations. Sinharay and Haberman explained that in these cases, one should examine the suitability of the change scores, not the subscores themselves. One may then find that the change scores are suitable for reporting.

A change score is the difference in scores from one administration to the next. If a participant gets a subscore of 12 on their first assessment and a subscore of 30 on their next assessment, their change score for that topic is 18. This can then be thought of as the subscore of interest, and one can then evaluate whether or not this change score is suitable for reporting.

Change scores are also used to determine if a change in scores is statistically significant for a group of participants. If we want to know whether a group of participants is performing statistically better on an assessment after completing training (at a total score or subscore level), we do not compare average scores on the two tests. Instead, we look to see if the group’s change scores across the two tests are significantly greater than zero. This is typically analyzed with a dependent samples t-test.
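
As a minimal illustration of that last step (with made-up pre- and post-training subscores, not data from any real program), a dependent samples t-test can be run in Python with SciPy:

```python
# Illustrative sketch with made-up numbers: test whether a group's change scores
# (post-training minus pre-training subscores) are significantly greater than zero.
import numpy as np
from scipy import stats

pre  = np.array([12, 15, 9, 20, 14, 11, 17, 13])   # hypothetical pre-training subscores
post = np.array([18, 16, 14, 22, 15, 13, 21, 12])  # hypothetical post-training subscores

change = post - pre

# ttest_rel on (post, pre) is equivalent to a one-sample t-test on the change scores
t_stat, p_two_sided = stats.ttest_rel(post, pre)
p_one_sided = p_two_sided / 2 if t_stat > 0 else 1 - p_two_sided / 2  # directional: change > 0

print(f"mean change = {change.mean():.2f}, t = {t_stat:.2f}, one-sided p = {p_one_sided:.4f}")
```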

The reliability, orthogonality, and significance of changes in subscores are statistical concerns, but scores must be interpretable and actionable to make a claim about the validity of the assessment. This raises the concern of domain representation. Even if the statistics are fine, a subscore cannot be meaningful if the items do not sufficiently represent the domain they are supposed to measure. Making an inference about a participant’s ability in a topic based on only four items is preposterous—you do not need to know anything about statistics to come to that conclusion.

To address the concern of domain representation, high-stakes assessment programs that report subscores will typically set a minimum for the number of items that are needed to sufficiently represent a topic before a subscore is reported. For example, one program I worked for required (perhaps somewhat arbitrarily) a minimum of eight items in a topic before generating a subscore. If this domain representation criterion is met, one can presumably use methods like the VAR to then determine if the subscores meet the statistical criteria for reporting.

Conference Agenda: Next gen authoring, security and SSO

Posted by Julie Delazyn

We have so many exciting sessions planned for our most important learning event of the year, the 2016 Questionmark Conference, which will take place in Miami, April 12-15.

What's on the agenda? Here's a sneak peek of what you can expect: sessions on next-generation authoring, security and single sign-on (SSO).

We'll also have two item analysis workshops: an introduction to item statistics, and an advanced, hands-on session that will conduct an item analysis review with a demo assessment.

Check out the agenda for details about many of the sessions we will be offering in Miami.

We look forward to seeing you there!

Scary Findings: Proctors often involved in test-center cheating

Posted by John Kleeman

Over Halloween, I've been reviewing cases where test center administrators or proctors have been shown to help candidates cheat on exams. It's scary how often this appears to happen.

Just a couple of weeks ago, a BBC television investigation reported widespread cheating at UK test centers where construction workers and builders were certified on health and safety. The BBC's undercover footage showed a test center director reading exam answers from a big screen and instructing candidates:

“Follow me on screen, guys. I’m going to shout the correct answer, you just click. We’re going to make a couple of mistakes – what I don’t want is everyone making the same mistake.”

The sad thing is that construction is a dangerous occupation. The BBC reports that 221 workers died on the job in the UK construction sector over the past five years. It's very worrying that corrupt test centers that facilitate cheating on health and safety tests are likely contributing to this.

Another scary example is from a recent US court case in which a decorated former police officer in San Francisco was sentenced to two years in jail for taking bribes from taxi drivers to give them passing grades, regardless of how they actually performed on the test. These are a couple of examples I happen to have seen this weekend. See my earlier blog entry Online or test center proctoring: Which is more secure? for several other examples of test center fraud.

So what is the answer? Part of the solution, as I argued in What is the best way to reduce cheating?, is to remove people's rationalization to cheat. Most people think of themselves as good, honest people, and if you communicate the aims of the test and take other measures to make people think the test is fair, then fewer of them are likely to cheat.

Another approach is to do what Cambodia has been doing and throw a lot of resources into preventing cheating. According to this article, the government’s anti-corruption unit has been focusing on university exams, enlisting 2,000 volunteers to help monitor last summer’s exams and prevent collusion between proctors and students.

Of course, the vast majority of tests at test centers are entirely legitimate, and reputable test center providers do all they can to prevent face-to-face proctors from colluding with candidates. But there do seem to be two persistent problems:

  1. Some proctors are keen to help their local candidates.
  2. The financial stakes involved in passing a test mean that when candidate and proctor meet face-to-face, there is an ever-present risk of corruption.

I strongly suspect online proctoring is part of the solution here. The main argument for online proctoring is that candidates do not need to travel to a test center (see Online or test center proctoring: Which is best?). But there is an important side benefit to this: candidate and proctor never meet, and all their communications can be recorded. Without a face-to-face meeting and without a local connection, the likelihood of collusion drops, and this kind of cheating becomes much less probable. Now, that's a non-scary solution that has some promise.

4 Ways to Identify a Content Breach

Posted by Austin Fossey

In my last post, I discussed five ways you can limit the use of breached content so that a person with unauthorized access to your test content will have limited opportunities to put that information to use; however, those measures only control the problem of a content breach. Our next goal is to identify when a content breach has occurred so that we can remedy the problem through changes to the assessment or disciplinary actions against the parties involved in the breach.

Channel for Reporting

In most cases, you (the assessment program staff) will not be the first to find out that content has been stolen. You are far more likely to learn about the problem through a tip from another participant or stakeholder. One of the best things your organization can do to identify a content breach is to have a clear process for letting people report these concerns, as well as a detailed policy for what to do if a breach is found.

For example, you may want to have a disciplinary policy to address the investigation process, potential consequences, and an appeals process for participants who allegedly gained unauthorized access to the content (even if they did not pass the assessment). You may want to have legal resources lined up to help address non-participant parties who may be sharing your assessment content illegally (e.g., so-called “brain dump” sites). Finally, you should have an internal plan in place for what you will do if content is breached. Do you have backup items that can be inserted in the form? Can you release an updated form ahead of your republishing schedule? Will your response be different depending on the extent of the breach?

Web Patrol Monitoring

Several companies offer a web patrol service that will search the internet for pages where your assessment content has been posted without permission. Some of these companies will even purchase unauthorized practice exams that claim to have your assessment content and look for item breaches within them. Some of Questionmark’s partners provide web patrol services.

Statistical Models

There are several publicly available statistical models that can be used to identify abnormalities in participants’ response patterns or matches between a response pattern and a known content breach, such as the key patterns posted on a brain-dump site. Several companies, including some of Questionmark’s partners, have developed their own statistical methods for identifying cases where a participant may have used breached content.

In their chapter in Educational Measurement (4th ed.), Allan Cohen and James Wollack explain that all of these models tend to explore whether the amount of similarity between two sets of responses can be explained by chance alone. For example, one could look for two participants who had similar responses, possibly suggesting collusion or indicating that one participant copied the other. One could also look for similarity between a participant’s responses and the keys given in a leaked assessment form. Models also exist for identifying patterns within groups, as might be the case when a teacher chooses to provide answers to an entire class.

These models are a sophisticated way to look for breaches in content, but they are not foolproof. None of them prove that a participant was cheating, though they can provide weighty statistical evidence. Cohen and Wollack warn that several of the most popular models have been shown to suffer from liberal or conservative Type I error rates, though new models continue to improve in this area.
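
To make the "explained by chance alone" logic concrete, here is a deliberately oversimplified sketch, not one of the published indices Cohen and Wollack describe: it counts how many of one participant's responses match a leaked answer key and asks how surprising that count would be under a crude guessing baseline. Real detection models condition on ability, item properties, and much more.

```python
# Deliberately simplified illustration of the "could this similarity arise by chance?" idea.
# Not a published detection index; real models condition on ability, item difficulty, etc.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

n_items = 60
n_options = 4                                      # assume four-option multiple choice
leaked_key = rng.integers(0, n_options, n_items)   # options posted on a brain-dump site
responses = rng.integers(0, n_options, n_items)    # one participant's selected options

matches = int((responses == leaked_key).sum())

# Crude baseline: if the participant ignores the leaked key and picks options uniformly
# at random, each response matches the key with probability 1 / n_options.
p0 = 1 / n_options
p_value = stats.binom.sf(matches - 1, n_items, p0)  # P(at least this many matches by chance)

print(f"{matches}/{n_items} responses match the leaked key; "
      f"P(>= {matches} matches under the baseline) = {p_value:.4f}")
```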

Item Drift

When considering content breaches, you might also be interested in cases where an item appears to become easier (or harder) for everyone over time. Consider a situation where your participant population has global access to information that changes how they respond to an item. This could be for some unsavory reasons (e.g., a lot of people stole your content), or it could be something benign, like a newsworthy event that caused your population to learn more about content related to your assessment. In these cases, you might expect certain items to become easier for everyone in the population.

To detect whether an item is becoming easier over time, we do not use the p value from classical test theory. Instead, we use item response theory (IRT) and a differential item functioning (DIF) analysis to detect item drift, which is a change in an item's IRT parameters over time. This is typically done with the likelihood ratio test that Thissen, Steinberg, and Wainer detailed in Test Validity. Creators of IRT assessments use item parameter drift analyses to see whether an item has become easier (or harder) over time, and this information helps test developers make decisions about cycling items out of production or planning new calibration studies.
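
As a rough sketch of the general machinery (a generic likelihood ratio test, not a quotation of Thissen, Steinberg, and Wainer's exact formulation): fit a compact model that constrains the item's parameters to be equal across time periods and an augmented model that allows them to differ, then compare

$$
G^2 \;=\; 2\left[\ln L_{\text{augmented}} - \ln L_{\text{compact}}\right]
$$

to a chi-square distribution with degrees of freedom equal to the number of item parameters freed in the augmented model. A significant result suggests that the item's parameters have drifted.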

Interested in learning about item analysis or how to take your test planning to the next level? I will be presenting a series of workshops at the Questionmark Conference 2016: Shaping the Future of Assessment in Miami, April 12-15. I look forward to seeing you there! Click here to register and learn more about this important learning event.

