G Theory and Reliability for Assessments with Randomly Selected Items

Posted by Austin Fossey

One of our webinar attendees recently emailed me to ask if there is a way to calculate reliability when items are randomly selected for delivery in a classical test theory (CTT) model.

As with so many things, the answer comes from Lee Cronbach—but it’s not Cronbach’s Alpha. In 1963, Cronbach, along with Goldine Gleser and Nageswari Rajaratnam, published a paper on generalizability theory, which is often called G theory for brevity or to sound cooler. G theory is a very powerful set of tools, but today I am focusing on one aspect of it: the generalizability coefficient, which describes the degree to which observed scores might generalize to a broader set of measurement conditions. This is helpful when the conditions of measurement will change for different participants, as is the case when we use different items, different raters, different administration dates, etc.

In G theory, measurement conditions are called facets. A facet might include items, test forms, administration occasions, or human raters. Facets can be random (i.e., the observed conditions are treated as a sample from a much larger population of possible conditions), or they might be fixed, such as a condition that is controlled by the researcher. The hypothetical set of conditions across all possible facets is called, quite grandly, the universe of generalization. A participant’s average measurement across the universe of generalization is called their universe score, which is similar to a true score in CTT, except that we no longer need to assume that all measurements in the universe of generalization are parallel.

In CTT, the concept of reliability is defined as the ratio of true score variance to observed score variance. Observed scores are just true scores plus measurement error, so as measurement error decreases, reliability increases toward 1.00.
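
Written out in the usual notation (this is just a restatement of the definitions above, though symbols vary across textbooks), that is:

```latex
\rho_{XX'} \;=\; \frac{\sigma^2_T}{\sigma^2_X} \;=\; \frac{\sigma^2_T}{\sigma^2_T + \sigma^2_E}
```

where σ²_T is true score variance, σ²_E is error variance, and σ²_X = σ²_T + σ²_E is observed score variance; as σ²_E shrinks, the ratio approaches 1.00.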

The generalizability coefficient is defined as the ratio of universe score variance to expected observed score variance, which parallels the definition of reliability in CTT. The generalizability coefficient is built from variance components, which differ depending on the design of the study and which can be derived from an analysis of variance (ANOVA) summary table. We will not get into the math here, but I recommend Linda Crocker and James Algina’s Introduction to Classical and Modern Test Theory for a great introduction and easy-to-follow examples of how to calculate generalizability coefficients under multiple conditions. For now, let’s return to our randomly selected items.
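
To make the variance-components idea a little more concrete, here is a minimal Python sketch for the fully crossed persons × items design (every participant answers every item, no missing data). The mean-square formulas are the standard two-way ANOVA results you would find in a G theory text; the function and variable names are my own, not from any particular package.

```python
import numpy as np

def g_coefficient_crossed(scores, n_items_generalized=None):
    """Variance components and generalizability coefficient for a fully
    crossed persons x items (p x i) design: every participant answers
    every item, with no missing data.

    scores: 2-D array, rows = participants, columns = items.
    n_items_generalized: length of the hypothetical assessment we want to
        generalize to; defaults to the number of items actually delivered.
    """
    n_p, n_i = scores.shape
    grand = scores.mean()
    person_means = scores.mean(axis=1)
    item_means = scores.mean(axis=0)

    # Two-way ANOVA sums of squares
    ss_p = n_i * ((person_means - grand) ** 2).sum()
    ss_i = n_p * ((item_means - grand) ** 2).sum()
    ss_res = ((scores - grand) ** 2).sum() - ss_p - ss_i

    # Mean squares
    ms_p = ss_p / (n_p - 1)
    ms_i = ss_i / (n_i - 1)
    ms_res = ss_res / ((n_p - 1) * (n_i - 1))

    # Variance component estimates from the expected mean square equations
    # (negative estimates are conventionally truncated at zero)
    var_res = ms_res                           # person-by-item interaction + error
    var_p = max((ms_p - ms_res) / n_i, 0.0)    # universe score variance
    var_i = max((ms_i - ms_res) / n_p, 0.0)    # item variance

    n_prime = n_items_generalized or n_i
    g = var_p / (var_p + var_res / n_prime)    # coefficient for relative decisions
    return {"var_p": var_p, "var_i": var_i, "var_res": var_res, "g": g}
```

The optional second argument lets you ask how the coefficient would change if the assessment were lengthened or shortened, which connects to the equivalences discussed next.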

In his chapter in Educational Measurement, 4th Edition, Edward Haertel illustrated the overlaps between G theory and CTT reliability measures. When all participants see the same items, the generalizability coefficient is made up of the variance components for the participants and for the residual scores, and it yields the exact same value as Cronbach’s Alpha. If the researcher wants to use the generalizability coefficient to generalize to an assessment with more or fewer items, then the result is the same as the Spearman-Brown formula.
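
As a quick sanity check of those overlaps, coefficient alpha computed directly should match the value returned by the g_coefficient_crossed() sketch above, and changing the hypothetical test length should match the Spearman-Brown projection. This snippet assumes that earlier function is in scope, and the simulated data is purely illustrative.

```python
import numpy as np

def cronbach_alpha(scores):
    """Coefficient alpha for a complete persons x items score matrix."""
    n_i = scores.shape[1]
    item_vars = scores.var(axis=0, ddof=1).sum()
    total_var = scores.sum(axis=1).var(ddof=1)
    return (n_i / (n_i - 1)) * (1 - item_vars / total_var)

def spearman_brown(rho, factor):
    """Projected reliability when the test length is multiplied by `factor`."""
    return (factor * rho) / (1 + (factor - 1) * rho)

# Simulated data, for illustration only
rng = np.random.default_rng(0)
ability = rng.normal(size=(500, 1))
scores = (ability + rng.normal(size=(500, 40)) > 0).astype(float)

alpha = cronbach_alpha(scores)
g_same_length = g_coefficient_crossed(scores)["g"]        # equals alpha
g_double_length = g_coefficient_crossed(scores, 80)["g"]  # equals Spearman-Brown
print(alpha, g_same_length, g_double_length, spearman_brown(alpha, 2))
```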

But when our participants are each given a random set of items, they are no longer receiving parallel assessments. The generalizability coefficient has to be modified to include a variance component for the items, and the observed score variance is now a function of three things:

  • Error variance.
  • Variance in the item mean scores.
  • Variance in the participants’ universe scores.

Note that error variance is not the same as measurement error in CTT. In the case of a randomly generated assessment, the error variance includes measurement error and an extra component that reflects the lack of perfect correlation between the items’ measurements.
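
When each participant answers their own random sample of items, the design is effectively items nested within persons. Assuming every participant answers the same number of items, a minimal sketch of that version of the coefficient might look like the following; the formulas are the standard results for a balanced nested design, and the names are again my own.

```python
import numpy as np

def g_coefficient_nested(scores):
    """Generalizability coefficient for a balanced items-nested-within-persons
    (i:p) design, e.g., every participant answers the same number of randomly
    selected items, but not the same items.

    scores: 2-D array, rows = participants, columns = that participant's own
        item scores (column j is NOT the same item across rows).
    """
    n_p, n_i = scores.shape
    grand = scores.mean()
    person_means = scores.mean(axis=1)

    # Between-persons and within-persons (items within persons + error) mean squares
    ms_p = n_i * ((person_means - grand) ** 2).sum() / (n_p - 1)
    ms_within = ((scores - person_means[:, None]) ** 2).sum() / (n_p * (n_i - 1))

    # Item variance is confounded with the residual, so the error term is larger
    var_within = ms_within                        # items + interaction + error
    var_p = max((ms_p - ms_within) / n_i, 0.0)    # universe score variance

    return var_p / (var_p + var_within / n_i)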

For those of you randomly selecting items, this makes a difference! Cronbach’s Alpha may yield low or even meaningless results (e.g., negative values) when items are randomly selected. In an example dataset, 1,000 participants answered the same 200 items. For this assessment, Cronbach’s Alpha is equivalent to the generalizability coefficient: 0.97. But if each of those participants had instead answered 50 items randomly selected from the same pool, Cronbach’s Alpha would no longer be appropriate. If we calculated it anyway, we would see a depressing number: 0.50. The generalizability coefficient, however, is 0.65, which is still too low, but better than the misleading alpha value.
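
We cannot reproduce the exact numbers from that example here, but the sketches above can be strung together on simulated data to see the same pattern. This assumes the cronbach_alpha(), g_coefficient_crossed(), and g_coefficient_nested() functions from the earlier sketches are in scope; the simulated values will not match the 0.97/0.50/0.65 figures from the real dataset.

```python
import numpy as np

rng = np.random.default_rng(1)
n_p, n_bank, n_taken = 1000, 200, 50

theta = rng.normal(size=(n_p, 1))     # participant abilities
b = rng.normal(size=(1, n_bank))      # item difficulties
full = (theta - b + rng.normal(size=(n_p, n_bank)) > 0).astype(float)

# Everyone answers all 200 items: alpha and the G coefficient agree
print(cronbach_alpha(full), g_coefficient_crossed(full)["g"])

# Each participant answers their own 50 randomly selected items:
# alpha no longer applies, so use the nested-design coefficient instead
subset = np.stack([full[p, rng.choice(n_bank, n_taken, replace=False)]
                   for p in range(n_p)])
print(g_coefficient_nested(subset))
```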

Finally, it is important to report your results accurately. According to the Standards for Educational and Psychological Testing, you can report generalizability coefficients as reliability evidence if it is appropriate for the design of the assessment, but it is important not to use these terms interchangeably. Generalizability is a distinct concept from reliability, so make sure to label it as a generalizability coefficient, not a reliability coefficient. Also, the Standards require us to document the sources of variance that are included (and excluded) from the calculation of the generalizability coefficient. Readers are encouraged to refer to the Standards’ chapter on reliability and precision for more information.

Item Development – Organizing a bias review committee (Part 2)

Posted by Austin Fossey

The Standards for Educational and Psychological Testing describe two facets of an assessment that can result in bias: the content of the assessment and the response process. These are the areas on which your bias review committee should focus. You can read Part 1 of this post here.

Content bias is often what people think of when they think about examples of assessment bias. This may pertain to item content (e.g., students in hot climates may have trouble responding to an algebra scenario about shoveling snow), but it may also include language issues, such as the tone of the content, differences in terminology, or the reading level of the content. Your review committee should also consider content that might be offensive or trigger an emotional response from participants. For example, if an item’s scenario described interactions in a workplace, your committee might check to make sure that men and women are equally represented in management roles.

Bias may also occur in the response processes. Subgroups may have differences in responses that are not relevant to the construct, or a subgroup may be unduly disadvantaged by the response format. For example, an item that asks participants to explain how they solved an algebra problem may be biased against participants for whom English is a second language, even though they might be employing the same cognitive processes as other participants to solve the algebra. Response process bias can also occur if some participants provide unexpected responses to an item that are correct but may not be accounted for in the scoring.

How do we begin to identify content or response processes that may introduce bias? Your sensitivity guidelines will depend upon your participant population, applicable social norms, and the priorities of your assessment program. When drafting your sensitivity guidelines, you should spend a good amount of time researching potential sources of bias that could manifest in your assessment, and you may need to periodically update your own guidelines based on feedback from your reviewers or participants.

In his chapter in Educational Measurement (4th ed.), Gregory Camilli recommends the chapter on fairness in the ETS Standards for Quality and Fairness and An Approach for Identifying and Minimizing Bias in Standardized Tests (Office for Minority Education) as sources of criteria that could be used to inform your own sensitivity guidelines. If you would like to see an example of one program’s sensitivity guidelines that are used to inform bias review committees for K12 assessment in the United States, check out the Fairness Guidelines Adopted by PARCC (PARCC), though be warned that the document contains examples of inflammatory content.

In the next post, I will discuss considerations for the final round of item edits that will occur before the items are field tested.

Check out our white paper: 5 Steps to Better Tests for best practice guidance and practical advice for the five key stages of test and exam development.

Austin Fossey will discuss test development at the 2015 Users Conference in Napa Valley, March 10-13. Register before Dec. 17 and save $200.

Item Development – Benefits of editing items before the review process

Posted by Austin Fossey

Some test developers recommend a single round of item editing (or editorial review), usually right before items are field tested. When schedules and resources allow for it, I recommend that test developers conduct two rounds of editing—one right after the items are written and one after content and bias reviews are completed. This post addresses the first round of editing, to take place after items are drafted.

Why have two rounds of editing? In both rounds, we will be looking for grammar or spelling errors, but the first round serves as a filter to keep items with serious flaws from making it to content review or bias review.

In their chapter in Educational Measurement (4th ed.), Cynthia Schmeiser and Catherine Welch explain that an early round of item editing “serves to detect and correct deficiencies in the technical qualities of the items and item pools early in the development process.” They recommend that test developers use this round of item editing to do a cursory review of whether the items meet the Standards for Educational and Psychological Testing.

Items that have obvious item writing flaws should be culled in the first round of item editing and either sent back to the item writers or removed. This may include item writing errors like cluing or having options that do not match the stem grammatically. Ideally, these errors will be caught and corrected in the drafting process, but a few items may have slipped through the cracks.

In the initial round of editing, we will also be looking for proper formatting of the items. Did the item writers use the correct item types for the specified content? Did they follow the formatting rules in our style guide? Is all supporting content (e.g., pictures, references) present in the item? Did the item writers record all of the metadata for the item, like its content area, cognitive level, or reference? Again, if an item does not match the required format, it should be sent back to the item writers or removed.

It is helpful to look for these issues before going to content review or bias review because these types of errors may distract your review committees from their tasks; the committees may be wasting time reviewing items that should not be delivered anyway due to formatting flaws. You do not want to get all the way through content and bias reviews only to find that a large number of your items have to be returned to the drafting process. We will discuss review committee processes in the following posts.

For best practice guidance and practical advice for the five key stages of test and exam development, check out our white paper: 5 Steps to Better Tests.

Item Development – Five Tips for Organizing Your Drafting Process

Posted by Austin Fossey

Once you’ve trained your item writers, they are ready to begin drafting items. But how should you manage this step of the item development process?

There is an enormous amount of literature about item design and item writing techniques—which we will not cover in this series—but as Cynthia Schmeiser and Catherine Welch observe in their chapter in Educational Measurement (4th ed.), there is very little guidance about the item writing process. This is surprising, given that item writing is critical to effective test development.

It may be tempting to let your item writers loose in your authoring software with a copy of the test specifications and see what comes back, but if you invest time and effort in organizing your item drafting sessions, you are likely to retain more items and better support the validity of the results.

Here are five considerations for organizing item writing sessions:

  • Assignments – Schmeiser and Welch recommend giving each item writer a specific assignment to set expectations and to ensure that you build an item bank large enough to meet your test specifications. If possible, distribute assignments evenly so that no single author has undue influence over an entire area of your test specifications. Set realistic goals for your authors, keeping in mind that some of their items will likely be dropped later in item reviews.
  • Instructions – In the previous post, we mentioned the benefit of a style guide for keeping item formats consistent. You may also want to give item writers instructions or templates for specific item types, especially if you are working with complex item types. (You should already have defined which item types can be used to measure each area of your test specifications.)
  • Monitoring – Monitor item writers’ progress and spot-check their work. This is not a time to engage in full-blown item reviews, but periodic checks can help you to provide feedback and correct misconceptions. You can also check in to make sure that the item writers are abiding by security policies and formatting guidelines. In some item writing workshops, I have also asked item writers to work in pairs to help check each other’s work.
  • Communication – With some item designs, several people may be involved in building the item. One team may be in charge of developing a scoring model, another team may draft content, and a third team may add resources or additional stimuli, like images or animations. These teams need to be organized so that materials are
    handed off on time, but they also need to be able to provide iterative feedback to each other. For example, if the content team finds a loophole in the scoring model, they need to be able to alert the other teams so that it can be resolved.
  • Be Prepared – Be sure to have a backup plan in case your item writing sessions hit a snag. Know what you are going to do if an item writer does not complete an assignment or if content is compromised.

Many of the details of the item drafting process will depend on your item types, resources, schedule, authoring software, and availability of item writers. Determine what you need to accomplish, and then organize your item writing sessions as much as possible so that you meet your goals.

In my next post, I will discuss the benefits of conducting an initial editorial review of the draft items before they are sent to review committees.

Item Development – Training Item Writers

Posted by Austin Fossey

Once we have defined the purpose of the assessment, completed our domain analysis, and finalized a test blueprint, we might be eager to jump right in to item writing, but there is one important step to take before we begin: training!

Unless you are writing the entire assessment yourself, you will need a group of item writers to develop the content. These item writers are likely experts in their fields, but they may have very little understanding of how to create assessment content. Even if these experts have experience writing items, it may be beneficial to provide refresher trainings, especially if anything has changed in your assessment design.

In their chapter in Educational Measurement (4th ed.), Cynthia Schmeiser and Catherine Welch note that it is important to consider the qualifications and representativeness of your item writers. It is common to ask item writers to fill out a brief survey to collect demographic information. You should keep these responses on file and possibly add a brief document explaining why you consider these item writers to be a qualified and representative sample.

Schmeiser and Welch also underscore the need for security. Item writers should be trained on your content security guidelines, and your organization may even ask them to sign an agreement stating that they will abide by those guidelines. Make sure everyone understands the security guidelines, and have a plan in place in case there are any violations.

Next, begin training your item writers on how to author items, which should include basic concepts about cognitive levels, drafting stems, picking distractors, and using specific item types appropriately. Schmeiser and Welch suggest that the test blueprint be used as the foundation of the training. Item writers should understand the content included in the specifications and the types of items they are expected to create for that content. Be sure to share examples of good and bad items.

If possible, ask your writers to create some practice items, then review their work and provide feedback. If they are using the item authoring software for the first time, be sure to acquaint them with the tools before they are given their item writing assignments.

Your item writers may also need training on your item data, delivery method, or scoring rules. For example, you may ask item writers to cite a reference for each item, or you might ask them to weight certain items differently. Your instructions need to be clear and precise, and you should spot check your item writers’ work. If possible, write a style guide that includes clear guidelines about item construction, such as fonts to use, acceptable abbreviations, scoring rules, acceptable item types, et cetera.

I know from my own experience (and Schmeiser and Welch agree) that investing more time in training will have a big payoff down the line. Better training leads to substantially better item retention rates when items are reviewed. If your item writers are not trained well, you may end up throwing out many of their items, which may not leave you enough for your assessment design. Considering the cost of item development and the time spent writing and reviewing items, putting in a few more hours of training can equal big savings for your program in the long run.

In my next post, I will discuss how to manage your item writers as they begin the important work of drafting the items.

Item Development – Managing the Process for Large-Scale Assessments

Posted by Austin Fossey

Whether you work with low-stakes assessments, small-scale classroom assessments, or large-scale, high-stakes assessments, understanding and applying some basic principles of item development will greatly enhance the quality of your results.

This is the first in a series of posts setting out item development steps that will help you create defensible assessments. Although I’ll be addressing the requirements of large-scale, high-stakes testing, the fundamental considerations apply to any assessment.

You can find previous posts here about item development including how to write items, review items, increase complexity, and avoid bias. This series will review some of what’s come before, but it will also explore new territory. For instance, I’ll discuss how to organize and execute different steps in item development with subject matter experts. I’ll also explain how to collect information that will support the validity of the results and the legal defensibility of the assessment.

In this series, I’ll take a look at:

[Figure: item development steps]

These are common steps (adapted from Crocker and Algina’s Introduction to Classical and Modern Test Theory) taken to create the content for an assessment. Each step requires careful planning, implementation, and documentation, especially for high-stakes assessments.

This looks like a lot of steps, but item development is just one slice of assessment development. Before item development can even begin, there’s plenty of work to do!

In their article, Design and Discovery in Educational Assessment: Evidence-Centered Design, Psychometrics, and Educational Data Mining, Mislevy, Behrens, DiCerbo, and Levy provide an overview of Evidence-Centered Design (ECD). In ECD, test developers must define the purpose of the assessment, conduct a domain analysis, model the domain, and define the conceptual assessment framework before beginning assessment assembly, which includes item development.

Once we’ve completed these preparations, we are ready to begin item development. In the next post, I will discuss considerations for training our item writers and item reviewers.