The Importance of Safety in the Utilities Industry: A Q&A with PG&E


Posted by Julie Delazyn

Wendy Lau is a Psychometrician at Pacific Gas and Electric Company (PG&E). She will be leading a discussion at Questionmark Conference 2016 in Miami: "Safety and the Utilities Industry: Why Assessments Matter."


Wendy Lau, Psychometrician, PG&E

Wendy’s session will describe a day in the life of a psychometrician in the utilities industry. It will explore the role assessments play at PG&E, and how Questionmark has helped the company focus on safety and train its employees.

I recently asked her about her session:

Tell me about PG&E and its use of assessments:

PG&E is a utilities company that provides natural gas and electricity to most of the northern two-thirds of California. Over the years, we have evolved into a more data-driven company, and Questionmark has been a part of that for the past 7 years. Having assessments readily available and secured within a platform that we can trust is very important to PG&E. We are also glad to have found a testing tool that offers such a wide variety of question types.

Why is safety important in the utilities industry?

Depending on the activity our employees perform, most of the work has serious safety implications — whether it is a lineman climbing a pole to perform live-line work or a utility worker digging near a major gas pipeline. Our technical training must be designed with safety in mind and, more importantly, it must ensure that after going through training, employees are competent to perform their tasks safely and proficiently. To ensure workforce capability, we rely heavily on testing to prove that our workforce is in fact safe and proficient, and that our employees and the community we serve are safe and receive reliable service.

What role does Questionmark play in ensuring that safety?

Questionmark helps us focus on safety-related questions by supporting special assessment strategies, such as distinguishing critical from coachable assessment items and identifying cut scores for each accordingly. Questionmark also provides a secure platform, so we can ensure our test items are never compromised and that our employees are assessed under fair circumstances.

To find out more about the role Questionmark plays in ensuring safety, you'll just have to attend my session at Questionmark Conference 2016 in Miami!

What are you looking forward to at the conference?

I am very much looking forward to 'talking shop' with other psychometricians and sharing best practices with colleagues in the utilities industry and beyond!

Thank you, Wendy, for taking time out of your busy schedule to discuss your session with us!

If you have not already done so, you still have a chance to attend this important learning event. Click here to register.

 

Checklists for Test Development

Posted by Austin Fossey

There are many fantastic books about test development, and there are many standards systems for test development, such as The Standards for Educational and Psychological Testing. There are also principled frameworks for test development and design, such as evidence-centered design (ECD). But it seems that the supply of qualified test developers cannot keep up with the increased demand for high-quality assessment data, leaving many organizations to piece together assessment programs, learning as they go.

As one might expect, this scenario leads to new tools targeted at these rookie test developers—simplified guidance documents, trainings, and resources attempting to idiot-proof test development. As a case in point, Questionmark seeks to distill information from a variety of sources into helpful, easy-to-follow white papers and blog posts. At an even simpler level, there appears to be increased demand for checklists that new test developers can use to guide test development or evaluate assessments.

For example, my colleague Bart Hendrickx shared a Dutch article from the Research Center for Examination and Certification (RCEC) at the University of Twente describing their Beoordelingssysteem (evaluation system). He explained that this system provides a rubric for evaluating educational assessments in areas like representativeness, reliability, and standard setting. The Buros Center for Testing addresses similar needs for users of mental assessments. In the Assessment Literacy section of their website, Buros has documents with titles like “Questions to Ask When Evaluating a Test”—essentially an evaluation checklist (though Buros also provides their own professional ratings of published assessments). There are even assessment software packages that seek to operationalize a test development checklist by creating a rigid workflow that guides the test developer through different steps of the design process.

The benefit of these resources is that they can help guide new test developers through basic steps and considerations as they build their instruments. It is certainly a step up from a company compiling a bunch of multiple choice questions on the fly and setting a cut score of 70% without any backing theory or test purpose. On the other hand, test development is supposed to be an iterative process, and without the flexibility to explore the nuances and complexities of the instrument, the results and the inferences may fall short of their targets. An overly simple, standardized checklist for developing or evaluating assessments may not consider an organization’s specific measurement needs, and the program may be left with considerable blind spots in its validity evidence.

Overall, I am glad to see that more organizations want to improve the quality of their measurements, and it is encouraging to see more training resources that help new test developers tackle the learning curve. Checklists can be a very helpful tool for a lot of applications, and test developers frequently create their own checklists to standardize practices within their organization, such as item reviews.

What do our readers think? Are checklists the way to go? Do you use a checklist from another organization in your test development?


G Theory and Reliability for Assessments with Randomly Selected Items

Posted by Austin Fossey

One of our webinar attendees recently emailed me to ask if there is a way to calculate reliability when items are randomly selected for delivery in a classical test theory (CTT) model.

As with so many things, the answer comes from Lee Cronbach—but it’s not Cronbach’s Alpha. In 1963, Cronbach, along with Goldine Gleser and Nageswari Rajaratnam, published a paper on generalizability theory, which is often called G theory for brevity or to sound cooler. G theory is a very powerful set of tools, but today I am focusing on one aspect of it: the generalizability coefficient, which describes the degree to which observed scores might generalize to a broader set of measurement conditions. This is helpful when the conditions of measurement will change for different participants, as is the case when we use different items, different raters, different administration dates, etc.

In G theory, measurement conditions are called facets. A facet might include items, test forms, administration occasions, or human raters. Facets can be random (i.e., a sample from a much larger population of potential conditions), or they might be fixed, such as a condition that is controlled by the researcher. The hypothetical set of conditions across all possible facets is called, quite grandly, the universe of generalization. A participant's average measurement across the universe of generalization is called their universe score, which is similar to a true score in CTT, except that we no longer need to assume that all measurements in the universe of generalization are parallel.

In CTT, the concept of reliability is defined as the ratio of true score variance to observed score variance. Observed scores are just true scores plus measurement error, so as measurement error decreases, reliability increases toward 1.00.
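In symbols (a standard CTT identity, with observed score $X = T + E$ and error uncorrelated with true score):

$$\rho_{XX'} = \frac{\sigma_T^2}{\sigma_X^2} = \frac{\sigma_T^2}{\sigma_T^2 + \sigma_E^2}$$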

The generalizability coefficient is defined as the ratio of universe score variance to expected score variance, which is similar to the concept of reliability in CTT. The generalizability coefficient is made of variance components, which differ depending on the design of the study, and which can be derived from an analysis of variance (ANOVA) summary table. We will not get into the math here, but I recommend Linda Crocker and James Algina’s Introduction to Classical and Modern Test Theory for a great introduction and easy-to-follow examples of how to calculate generalizability coefficients under multiple conditions. For now, let’s return to our randomly selected items.

In his chapter in Educational Measurement, 4th Edition, Edward Haertel illustrated the overlaps between G theory and CTT reliability measures. When all participants see the same items, the generalizability coefficient is made up of the variance components for the participants and for the residual scores, and it yields the exact same value as Cronbach’s Alpha. If the researcher wants to use the generalizability coefficient to generalize to an assessment with more or fewer items, then the result is the same as the Spearman-Brown formula.

But when our participants are each given a random set of items, they are no longer receiving parallel assessments. The generalizability coefficient has to be modified to include a variance component for the items, and the observed score variance is now a function of three things:

  • Error variance.
  • Variance in the item mean scores.
  • Variance in the participants’ universe scores.

Note that error variance is not the same as measurement error in CTT. In the case of a randomly generated assessment, the error variance includes measurement error and an extra component that reflects the lack of perfect correlation between the items’ measurements.
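For readers who do want to see the computation, here is a minimal sketch in Python of the variance-component estimates for a fully crossed persons-by-items design, following the standard two-way ANOVA approach described in Crocker and Algina. The function name and data layout are illustrative assumptions, and the sketch assumes you have a complete score matrix from which to estimate the components.

```python
# Minimal sketch: variance components and generalizability coefficients
# for a fully crossed persons x items design (two-way ANOVA approach).
import numpy as np

def g_coefficients(scores, n_items_delivered=None):
    """scores: 2D array (participants x items), e.g., 0/1 item scores."""
    n_p, n_i = scores.shape
    n_prime = n_items_delivered or n_i   # items each participant answers
    grand = scores.mean()
    person_means = scores.mean(axis=1)
    item_means = scores.mean(axis=0)

    # Mean squares from the two-way ANOVA without replication
    ms_p = n_i * ((person_means - grand) ** 2).sum() / (n_p - 1)
    ms_i = n_p * ((item_means - grand) ** 2).sum() / (n_i - 1)
    resid = scores - person_means[:, None] - item_means[None, :] + grand
    ms_res = (resid ** 2).sum() / ((n_p - 1) * (n_i - 1))

    # Estimated variance components
    var_res = ms_res                     # residual (includes error)
    var_p = (ms_p - ms_res) / n_i        # participants' universe scores
    var_i = (ms_i - ms_res) / n_p        # item mean scores

    # Same items for everyone: equals Cronbach's Alpha at n_prime items
    e_rho2 = var_p / (var_p + var_res / n_prime)
    # Randomly selected items: item variance joins the error term
    phi = var_p / (var_p + (var_i + var_res) / n_prime)
    return e_rho2, phi
```

Called with the full matrix, `e_rho2` reproduces Cronbach's Alpha (and the Spearman-Brown adjustment when `n_items_delivered` differs from the number of items in the matrix), while `phi` includes the item variance component and is the coefficient to consider when each participant answers a random draw of items, as in the example that follows.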

For those of you randomly selecting items, this makes a difference! Cronbach's Alpha may yield low or even meaningless results (e.g., negative values) when items are randomly selected. In an example dataset, 1,000 participants answered the same 200 items. For this assessment, Cronbach's Alpha is equivalent to the generalizability coefficient: 0.97. But if each of those participants had answered 50 randomly selected items from the same set, Cronbach's Alpha would no longer be appropriate. If we tried to use it anyway, we would see a depressing number: 0.50. The generalizability coefficient, however, is 0.65. That is still too low, but better than the alpha value.

Finally, it is important to report your results accurately. According to the Standards for Educational and Psychological Testing, you can report generalizability coefficients as reliability evidence if it is appropriate for the design of the assessment, but it is important not to use these terms interchangeably. Generalizability is a distinct concept from reliability, so make sure to label it as a generalizability coefficient, not a reliability coefficient. Also, the Standards require us to document the sources of variance that are included (and excluded) from the calculation of the generalizability coefficient. Readers are encouraged to refer to the Standards’ chapter on reliability and precision for more information.

Is There Value in Reporting Changes in Subscores?

Posted by Austin Fossey

I had the privilege of meeting with an organization that is reporting subscores to show how their employees are improving across multiple areas of their domain, as determined by an assessment given before and after training. They have developed some slick reports to show these scores, including the participant’s first score, second score (after training is complete), and the change in those scores.

At first glance, these reports are pretty snazzy and seem to suggest huge improvements resulting from the training, but looks can be deceiving. I immediately noticed one participant had made a subscore gain of 25%, which sounds impressive—like he or she is suddenly 25% better at the tasks in that domain—but here is the fine print: that subscore was measured with only four items. To put it another way, that 25% improvement means that the participant answered one more item correctly. Other subscores were similarly underrepresented—most with four or fewer items in their topic.

In a previous post, I reported on an article by Richard Feinberg and Howard Wainer about how to determine if a subscore is worth reporting. My two loyal readers (you know who you are) may recall that a reported subscore has to be reliable, and it must contain information that is sufficiently different from the information contained in the assessment’s total score (AKA “orthogonality”).

In "Comments on A Note on Subscores," a response to an article by Samuel A. Livingston, Sandip Sinharay and Shelby Haberman defended against a critique that their previous work (which informed Feinberg and Wainer's proposed Value Added Ratio (VAR) metric) implied that subscores should never be reported when examining changes across administrations. Sinharay and Haberman explained that in these cases, one should examine the suitability of the change scores, not the subscores themselves. One may then find that the change scores are suitable for reporting.

A change score is the difference in scores from one administration to the next. If a participant gets a subscore of 12 on their first assessment and a subscore of 30 on their next assessment, their change score for that topic is 18. This can then be thought of as the subscore of interest, and one can then evaluate whether or not this change score is suitable for reporting.

Change scores are also used to determine if a change in scores is statistically significant for a group of participants. If we want to know whether a group of participants is performing statistically better on an assessment after completing training (at a total score or subscore level), we do not compare average scores on the two tests. Instead, we look to see if the group’s change scores across the two tests are significantly greater than zero. This is typically analyzed with a dependent samples t-test.
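As a quick illustration, here is a minimal sketch of that comparison using scipy's paired t-test. The scores below are invented for demonstration, and the one-sided alternative matches the "greater than zero" hypothesis.

```python
# Minimal sketch of a dependent (paired) samples t-test on change scores.
from scipy import stats

pre  = [55, 60, 48, 71, 66, 59, 63, 52]   # subscores before training
post = [61, 64, 50, 75, 70, 58, 69, 57]   # subscores after training

change = [b - a for a, b in zip(pre, post)]
# Test whether the group's mean change score is significantly greater than zero
t_stat, p_value = stats.ttest_rel(post, pre, alternative="greater")
print(f"mean change = {sum(change) / len(change):.1f}, "
      f"t = {t_stat:.2f}, p = {p_value:.4f}")
```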

The reliability, orthogonality, and significance of changes in subscores are statistical concerns, but scores must be interpretable and actionable to make a claim about the validity of the assessment. This raises the concern of domain representation. Even if the statistics are fine, a subscore cannot be meaningful if the items do not sufficiently represent the domain they are supposed to measure. Making an inference about a participant’s ability in a topic based on only four items is preposterous—you do not need to know anything about statistics to come to that conclusion.

To address the concern of domain representation, high-stakes assessment programs that report subscores will typically set a minimum for the number of items that are needed to sufficiently represent a topic before a subscore is reported. For example, one program I worked for required (perhaps somewhat arbitrarily) a minimum of eight items in a topic before generating a subscore. If this domain representation criterion is met, one can presumably use methods like the VAR to then determine if the subscores meet the statistical criteria for reporting.

4 Ways to Identify a Content Breach

Posted by Austin Fossey

In my last post, I discussed five ways you can limit the use of breached content so that a person with unauthorized access to your test content will have limited opportunities to put that information to use; however, those measures only control the problem of a content breach. Our next goal is to identify when a content breach has occurred so that we can remedy the problem through changes to the assessment or disciplinary actions against the parties involved in the breach.



Channel for Reporting

In most cases, you (the assessment program staff) will not be the first to find out that content has been stolen. You are far more likely to learn about the problem through a tip from another participant or stakeholder. One of the best things your organization can do to identify a content breach is to have a clear process for letting people report these concerns, as well as a detailed policy for what to do if a breach is found.

For example, you may want to have a disciplinary policy to address the investigation process, potential consequences, and an appeals process for participants who allegedly gained unauthorized access to the content (even if they did not pass the assessment). You may want to have legal resources lined up to help address non-participant parties who may be sharing your assessment content illegally (e.g., so-called “brain dump” sites). Finally, you should have an internal plan in place for what you will do if content is breached. Do you have backup items that can be inserted in the form? Can you release an updated form ahead of your republishing schedule? Will your response be different depending on the extent of the breach?

Web Patrol Monitoring

Several companies offer a web patrol service that will search the internet for pages where your assessment content has been posted without permission. Some of these companies will even purchase unauthorized practice exams that claim to have your assessment content and look for item breaches within them. Some of Questionmark’s partners provide web patrol services.

Statistical Models

There are several publicly available statistical models that can be used to identify abnormalities in participants’ response patterns or matches between a response pattern and a known content breach, such as the key patterns posted on a brain-dump site. Several companies, including some of Questionmark’s partners, have developed their own statistical methods for identifying cases where a participant may have used breached content.

In their chapter in Educational Measurement (4th ed.), Allan Cohen and James Wollack explain that all of these models tend to explore whether the amount of similarity between two sets of responses can be explained by chance alone. For example, one could look for two participants who had similar responses, possibly suggesting collusion or indicating that one participant copied the other. One could also look for similarity between a participant’s responses and the keys given in a leaked assessment form. Models also exist for identifying patterns within groups, as might be the case when a teacher chooses to provide answers to an entire class.
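To make the "chance alone" idea concrete, here is a deliberately simplified sketch in Python. It is not one of the published indices Cohen and Wollack review (those condition on ability and item properties); it just asks how surprising the observed number of matching responses would be if two participants chose among the options independently and uniformly at random.

```python
# Crude chance-similarity check between two response strings.
from scipy.stats import binom

def similarity_pvalue(resp_a, resp_b, n_options=4):
    """resp_a, resp_b: selected option indices on a shared form."""
    n = len(resp_a)
    matches = sum(a == b for a, b in zip(resp_a, resp_b))
    p_match = 1.0 / n_options            # crude independence assumption
    # P(at least `matches` agreements) under Binomial(n, p_match)
    return binom.sf(matches - 1, n, p_match)
```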

These models are a sophisticated way to look for breaches in content, but they are not foolproof. None of them prove that a participant was cheating, though they can provide weighty statistical evidence. Cohen and Wollack warn that several of the most popular models have been shown to suffer from liberal or conservative Type I error rates, though new models continue to improve in this area.

Item Drift

When considering content breaches, you might also be interested in cases where an item appears to become easier (or harder) for everyone over time. Consider a situation where your participant population has global access to information that changes how they respond to an item. This could be for some unsavory reasons (e.g., a lot of people stole your content), or it could be something benign, like a newsworthy event that caused your population to learn more about content related to your assessment. In these cases, you might expect certain items to become easier for everyone in the population.

To detect whether an item is becoming easier over time, we do not use the p value from classical test theory. Instead, we use item response theory (IRT) and a differential item functioning (DIF) analysis to detect item drift: changes in an item's IRT parameters over time. This is typically done with the likelihood ratio test that Thissen, Steinberg, and Wainer detailed in Test Validity. Creators of IRT assessments use item parameter drift analyses to decide when to cycle items out of production or to plan new calibration studies.
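The full procedure requires a joint IRT calibration, but the flavor of the likelihood ratio comparison can be sketched with a rough stand-in: approximate ability with each participant's rest score, fit Rasch-like logistic models with and without a "window" term for the studied item, and compare log-likelihoods. The function name and the rest-score approximation are illustrative assumptions, not the Thissen, Steinberg, and Wainer procedure itself.

```python
# Rough stand-in for a likelihood ratio drift test on one item.
import numpy as np
import statsmodels.api as sm
from scipy.stats import chi2

def drift_lr_test(resp_window1, resp_window2, item):
    """resp_window1/2: 2D 0/1 arrays (participants x items) from two
    administration windows; item: column index of the studied item."""
    rest, window, y = [], [], []
    for w, resp in enumerate((resp_window1, resp_window2)):
        rest.extend(resp.sum(axis=1) - resp[:, item])   # ability proxy
        window.extend([w] * resp.shape[0])
        y.extend(resp[:, item])
    rest = np.array(rest, float)
    window = np.array(window, float)
    y = np.array(y, float)

    x_null = sm.add_constant(rest)                             # one difficulty
    x_alt = np.column_stack([np.ones_like(rest), rest, window])  # + shift term
    ll_null = sm.Logit(y, x_null).fit(disp=0).llf
    ll_alt = sm.Logit(y, x_alt).fit(disp=0).llf
    lr = 2 * (ll_alt - ll_null)
    return lr, chi2.sf(lr, df=1)   # small p suggests the item drifted
```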

Interested in learning about item analysis or how to take your test planning to the next level? I will be presenting a series of workshops at the Questionmark Conference 2016: Shaping the Future of Assessment in Miami, April 12-15. I look forward to seeing you there! Click here to register and learn more about this important learning event.

 

5 Ways to Limit the Use of Breached Assessment Content

Posted by Austin Fossey

In an earlier post, Questionmark’s Julie Delazyn listed 11 tips to help prevent cheating. The third item on that list related to minimizing item exposure; i.e., limiting how and when people can see an item so that content will not be leaked and used for dishonest purposes.

During a co-presentation with Manny Straehle of Assessment, Education, and Research Experts at a Certification Network Group quarterly meeting, I presented a set of considerations that can affect the severity of item exposure. My message was that although item exposure may not be a problem for some assessment programs, assessment managers should consider the design, purpose, candidate population, and level of investment for their assessment when evaluating their content security requirements.


If item exposure is a concern for your assessment program, there are two ways to mitigate the effects of leaked content: limiting opportunities to use the content, and identifying the breach so that it can be corrected. In this post, I will focus on ways to limit those opportunities:

Multiple Forms

Using different assessment forms lowers the number of participants who will see any given item in delivery. Having multiple forms also lowers the probability that someone with access to a breached item will actually get to put that information to use. Many organizations achieve this by using multiple, equated forms that are systematically assigned to participants to limit joint cheating or to limit item exposure across multiple retakes. Some organizations also achieve this through randomly generated forms, as in linear-on-the-fly testing (LOFT), or adaptively assembled forms, as in computerized adaptive testing (CAT).

Frequent Republishing

Assessment forms are often cycled in and out of production on a set schedule. Decreasing the amount of time a form is in production will limit the impact of item exposure, but it also requires more content and staff resources to keep rotating forms.

Large Item Banks

Having a large item bank helps you build many assessment forms, and it is also important for limiting item exposure in LOFT or CAT. Item banks can also be rotated. For example, some assessment programs will use one item bank for a particular testing window or geographic region and then switch banks at the next administration.

Exposure Limits

If your item bank can support it, you may also want to put an exposure limit on items or assessment forms. For example, you might set up a rule where an assessment form remains in production until it has been delivered 5,000 times. After that, you may permanently retire that form or shelve it for a predetermined period and use it again later. An extreme example would be an assessment program that only delivers an item during a single testing window before retiring it. The limit will depend on your risk tolerance, the number of items you have available, and the number of participants taking the assessment. Exposure limits are especially important in CAT where some items will get delivered much more frequently than others due to the item selection algorithm.
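As a sketch, the rule above could be operationalized with a simple counter. The 5,000-delivery cap mirrors the example in this post; the class, names, and shelving logic are hypothetical.

```python
# Hypothetical sketch of an exposure-limit rule for an assessment form.
from dataclasses import dataclass

@dataclass
class FormExposure:
    form_id: str
    cap: int = 5000        # deliveries allowed before the form is pulled
    delivered: int = 0
    shelved: bool = False

    def record_delivery(self) -> None:
        if self.shelved:
            raise RuntimeError(f"form {self.form_id} is out of production")
        self.delivered += 1
        if self.delivered >= self.cap:
            self.shelved = True   # retire, or shelve for a later window
```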

Short Testing Windows

When participants are only allowed to test during a short period, there are fewer opportunities for people to discuss or share content before the testing window closes. Short testing windows may be less convenient for your participant population, but you can use the downtime between windows to detect item breaches, develop new content, and perform assessment maintenance.

In my next post, I will provide an overview of methods for identifying instances of an item breach.
