# G Theory and Reliability for Assessments with Randomly Selected Items

Posted by Austin Fossey

One of our webinar attendees recently emailed me to ask if there is a way to calculate reliability when items are randomly selected for delivery in a classical test theory (CTT) model.

As with so many things, the answer comes from Lee Cronbach—but it’s not Cronbach’s Alpha. In 1963, Cronbach, along with Goldine Gleser and Nageswari Rajaratnam, published a paper on generalizability theory, which is often called G theory for brevity or to sound cooler. G theory is a very powerful set of tools, but today I am focusing on one aspect of it: the generalizability coefficient, which describes the degree to which observed scores might generalize to a broader set of measurement conditions. This is helpful when the conditions of measurement will change for different participants, as is the case when we use different items, different raters, different administration dates, etc.

In G theory, measurement conditions are called facets. A facet might include items, test forms, administration occasions, or human raters. Facets can be random (i.e., they are a sample of a much larger population of potential facets), or they might be fixed, such as a condition that is controlled by the researcher. The hypothetical set of conditions across all possible facets is called, quite grandly, the universe of generalization. A participant’s average measurement across the universe of generalization is called their universe score, which is similar to a true score in CTT, except that we no longer need to assume that all measurements in the universe of generalizability are parallel.

In CTT, the concept of reliability is defined as the ratio of true score variance to observed score variance. Observed scores are just true scores plus measurement error, so as measurement error decreases, reliability increases toward 1.00.

The generalizability coefficient is defined as the ratio of universe score variance to expected score variance, which is similar to the concept of reliability in CTT. The generalizability coefficient is made of variance components, which differ depending on the design of the study, and which can be derived from an analysis of variance (ANOVA) summary table. We will not get into the math here, but I recommend Linda Crocker and James Algina’s Introduction to Classical and Modern Test Theory for a great introduction and easy-to-follow examples of how to calculate generalizability coefficients under multiple conditions. For now, let’s return to our randomly selected items.

In his chapter in Educational Measurement, 4th Edition, Edward Haertel illustrated the overlaps between G theory and CTT reliability measures. When all participants see the same items, the generalizability coefficient is made up of the variance components for the participants and for the residual scores, and it yields the exact same value as Cronbach’s Alpha. If the researcher wants to use the generalizability coefficient to generalize to an assessment with more or fewer items, then the result is the same as the Spearman-Brown formula.

But when our participants are each given a random set of items, they are no longer receiving parallel assessments. The generalizability coefficient has to be modified to include a variance component for the items, and the observed score variance is now a function of three things:

• Error variance.
• Variance in the item mean scores.
• Variance in the participants’ universe scores.

Note that error variance is not the same as measurement error in CTT. In the case of a randomly generated assessment, the error variance includes measurement error and an extra component that reflects the lack of perfect correlation between the items’ measurements.

For those of you randomly selecting items, this makes a difference! Cronbach’s Alpha may yield low or even meaningless results when items are randomly selected (e.g., negative values). In an example dataset, 1,000 participants answered the same 200 items. For this assessment, Cronbach’s Alpha is equivalent to the generalizability coefficient: 0.97. But if each of those participants had answered 50 randomly selected items from the same set, Cronbach’s Alpha is no longer appropriate. If we tried to use Cronbach’s Alpha, we would have seen a depressing number: 0.50. However, the generalizability coefficient is 0.65–still too low, but better than the alpha value.

Finally, it is important to report your results accurately. According to the Standards for Educational and Psychological Testing, you can report generalizability coefficients as reliability evidence if it is appropriate for the design of the assessment, but it is important not to use these terms interchangeably. Generalizability is a distinct concept from reliability, so make sure to label it as a generalizability coefficient, not a reliability coefficient. Also, the Standards require us to document the sources of variance that are included (and excluded) from the calculation of the generalizability coefficient. Readers are encouraged to refer to the Standards’ chapter on reliability and precision for more information.

# Item Development – Planning your field test study

Once the items have passed their final editorial review, they are ready to be delivered to participants, but they are not quite ready to be delivered as scored items. For large-scale assessments, it is best practice to deliver your new items as unscored field test items so that you can gather item statistics for review before using the items to count toward a participant’s score. We discussed field test studies in an earlier post, but today we will focus more on the operational aspects of this task.

If you are embedding field test items, there is little you need to do to plan for the field test, other than to collect data on your participants to ensure representativeness and to make sure that enough participants respond to the item to yield stable statistics. You can collect data for representativeness by using demographic questions in Questionmark’s authoring tools.

If field testing an entire form, you will need to plan your field test carefully. When an entire form is going to be field tested, Schmeiser and Welch ( Educational Measurement, 4th ed.) recommend testing twice as many items as you will need for your operational form.

To check representativeness, you may want to survey your participants in advance to help you select your participant sample. For example, if your participant population is 60% female and 40% male, but your field test sample is 70% male, then that may impact the validity of your field test results. It will be up to you to decide which factors are relevant (e.g., sex, ethnicity, age, level of education, location, level of experience). You can use Questionmark’s authoring tools and reports to deliver and analyze these survey results.

You will also need to entice participants to take your field test. Most people will not want to take a test if they do not have to, but you will likely want to conduct the field test expeditiously. You may want to offer an incentive to test, but that incentive should not bias the results.

For example, I worked on a certification assessment where the assessment cost participants several hundred dollars. To incentivize participation in the field test study of multiple new forms, we offered the assessment free of charge and told participants that their results would be scored once the final forms were assembled. We surveyed volunteers and selected a representative sample to field test each of the forms.

The number of responses you need for each item will depend on your scoring model and your organization’s policies. If using Classical Test Theory, some organizations will feel comfortable with 80 – 100 responses, but Item Response Theory models may require 200 – 500 responses to yield stable item parameters.

More is always better, but it is not always possible. For instance, if an assessment is for a very small population, you may not have very many field test participants. You will still be able to use the item statistics, but they should be interpreted cautiously in conjunction with their standard errors. In the next post, we will talk about interpreting item statistics in the psychometric review following the field test.

# Item Development – Conducting the final editorial review

Once you have completed your content review and bias review, it is best to conduct a final editorial review.

You may have already conducted an editorial review prior to the content and bias reviews to cull items with obvious item-writing flaws or inappropriate item types—so by the time you reach this second editorial review, your items should only need minor edits.

This is the time to put the final polish on all of your items. If your content review committee and bias review committee were authorized to make changes to the items, go back and make sure they followed your style guide and that they used accurate grammar and spelling. Make sure they did not make any drastic changes that violate your test specifications, such as adding a fourth option to a multiple choice item that should only have three options.

If you have resources to do so, have professional editors review the items’ content. Ask the editors to identify issues with language, but review their suggestions rather than letting them make direct edits to the items. The editors may suggest changes that violate your style guide, they may not be familiar with language that is appropriate for your industry, or they may wish to make a change that would drastically impact the item content. You should carefully review their changes to make sure they are each appropriate.

As with other steps in the item development process, documentation and organization is key. Using item writing software like that provided by Questionmark can help you track revisions to items, document changes, and track your items to make sure each one is reviewed.

Do not approve items with a rubber stamp. If an item needs major content revisions, send it back to the item writers and begin the process again. Faulty items can undermine the validity of your assessment and can result in time-consuming challenges from participants. If you have planned ahead, you should have enough extra items to allow for some attrition while retaining enough items to meet your test specifications.

Finally, be sure that you have the appropriate stakeholders sign off on each item. Once the item passes this final editorial review, it should be locked down and considered ready to deliver to participants. Ideally, no changes should be made to items once they are in delivery, as this may impact how participants respond to the item and perform on the assessment. (Some organizations require senior executives to review and approve any requested changes to items that are already in delivery.)

When you are satisfied that the items are perfect, they are ready to be field tested. In the next post, I will talk about item try-outs, selecting a field test sample, assembling field test forms, and delivering the field test.

Check out our white paper: 5 Steps to Better Tests for best practice guidance and practical advice for the five key stages of test and exam development.

Austin Fossey will discuss test development at the 2015 Users Conference in Napa Valley, March 10-13. Register before Jan. 29 and save \$100.

# Acronyms, Abbreviations and APIs

Posted by Steve Lay

As Questionmark’s integrations product owner, it is all too easy to speak in acronyms and abbreviations. Of course, with the advent of modern day ‘text-speak,’ acronyms are part of everyday speech. But that doesn’t mean everyone knows what they mean. David Cameron, the British prime minister, was caught out by the everyday ‘LOL’ when it was revealed during a recent public inquiry that he’d used it thinking it meant ‘lots of love’.

In the technical arena things are not so simple. Even spelling out an acronym like SOAP (which stands for Simple Object Access Protocol) doesn’t necessarily make the meaning any clearer. In this post, I’m going to do my best to explain the meanings of some of the key acronyms and abbreviations you are likely to hear talked about in relation to Questionmark’s Open Assessment Platform.

API

At a recent presentation (on Extending the Platform), while I was talking about ways of integrating with Questionmark technologies, I asked the audience how many people knew what ‘API’ stood for. The response prompted me to write this blog article!

The term, API, is used so often that it is easy to forget that it is not widely known outside of the computing world.

API stands for Application Programming Interface. In this case the ‘application’ refers to some external software that provides functionality beyond that which is available in the core platform. For example, it could be a custom registration application that collects information in a special way that makes it possible to automatically create a user and schedule them to a specified assessment.

The API is the information that the programmer needs to write this registration application. ‘Interface’ refers to the join between the external software and the platform it is extending. (Our own APIs are documented on the Questionmark website and can be reached directly from developer.questionmark.com.)

APIs and Standards

APIs often refer to technical standards. Using standards helps the designer of an API focus on the things that are unique to the platform concerned without having to go into too much incidental detail. Using a common standard also helps programmers develop applications more quickly. Pre-written code that implements the underlying standard will often be available for programmers to use.

To use a physical analogy, some companies will ask you to send them a self-addressed stamped envelope when requesting information from them. The company doesn’t need to explain what an envelope is, what a stamp is and what they mean by an address! These terms act a bit like technical standards for the physical world. The company can simply ask for one because they know you understand this request. They can focus their attention on describing their services, the types of requests they can respond to and the information they will send you in return.

QMWISe

QMWISe stands for Questionmark Web Integration Services Environment. This API allows programmers to exchange information with Questionmark OnDemand software-as-a-service or Questionmark Perception on-premise software. QMWISe is based on an existing standard called SOAP. (see above)

SOAP defines a common structure used for sending and receiving messages; it even defines the concept of a virtual ‘envelope’. Referring to the SOAP standard allows us to focus on the contents of the messages being exchanged such as creating participants, creating schedules, fetching results and so on.

REST

REST stands for REpresentational State Transfer and must qualify as one of the more obscure acronyms! In practice, REST represents something of a back-to-basics approach to APIs when contrasted with those based on SOAP. It is not, in itself, a standard but merely a set of stylistic guidelines for API designers defined by an academic paper written by Roy Fielding, a co-author of the HTTP standard (see below).

As a result, APIs are sometimes described as ‘RESTful’, meaning they adhere to the basic principles defined by REST. These days, publicly exposed APIs are more likely to be RESTful than SOAP-based. Central to the idea of a RESTful API is that the things your API deals with are identified by a URL (Uniform Resource Locator), the web’s equivalent of an address. In our case, that would mean that each participant, schedule, result, etc. would be identified by its own URL.

HTTP

RESTful APIs draw heavily on HTTP. HTTP stands for HyperText Transfer Protocol. It was invented by Tim Berners-Lee and forms one of the key inventions that underpin the web as we know it. Although conceived as a way of publishing HyperText documents (i.e., web pages), the underlying protocol is really just a way of sending messages. It defines the virtual envelope into which these messages are placed. HTTP is familiar as the prefix to most URLs.

OData

Finally this brings me to OData. OData just stands for Open Data. This standard makes it much easier to publish RESTful APIs. I recently OData in the post, What is Odata, and why is it important?

Although arguably simpler than SOAP, OData provides an even more powerful platform for defining APIs. For some applications, OData itself is enough, and tools can be integrated with no additional programming at all. The PowerPivot plugin for Microsoft Excel is a good example. Using Excel you can extract and analyse data using the Questionmark Results API (itself built on OData) without any Questionmark-specific programming at all.

For more about OData, check out this presentation on Slideshare.

# Item Development – Managing the Process for Large-Scale Assessments

Posted by Austin Fossey

Whether you work with low-stakes assessments, small-scale classroom assessments or large-scale, high-stakes assessment, understanding and applying some basic principles of item development will greatly enhance the quality of your results.

This is the first in a series of posts setting out item development steps that will help you create defensible assessments. Although I’ll be addressing the requirements of large-scale, high-stakes testing, the fundamental considerations apply to any assessment.

You can find previous posts here about item development including how to write items, review items, increase complexity, and avoid bias. This series will review some of what’s come before, but it will also explore new territory. For instance, I’ll discuss how to organize and execute different steps in item development with subject matter experts. I’ll also explain how to collect information that will support the validity of the results and the legal defensibility of the assessment.

In this series, I’ll take a look at:

These are common steps (adapted from Crocker and Algina’s Introduction to Classical and Modern Test Theory) taken to create the content for an assessment. Each step requires careful planning, implementation, and documentation, especially for high-stakes assessments.

This looks like a lot of steps, but item development is just one slice of assessment development. Before item development can even begin, there’s plenty of work to do!

In their article, Design and Discovery in Educational Assessment: Evidence-Centered Design, Psychometrics, and Educational Data Mining, Mislevy, Behrens, Dicerbo, and Levy provide an overview of Evidence-Centered Design (ECD). In ECD, test developers must define the purpose of the assessment, conduct a domain analysis, model the domain, and define the conceptual assessment framework before beginning assessment assembly, which includes item development.

Once we’ve completed these preparations, we are ready to begin item development. In the next post, I will discuss considerations for training our item writers and item reviewers.

# Trustworthy Assessment Results – A Question of Transparency

Posted by Austin Fossey

Do you trust the results of your test? Like many questions in psychometrics, the answer is that it depends. Like the trust between two people, trustworthy assessment results have to be earned by the testing body.

Many of us want to implicitly trust the testing body, be it a certification organization, a department of education, or our HR department. When I fill a car with gas, I don’t want to have to siphon the gas out to make sure the amount of gas matches the volume on the pump—I just assume it’s accurate. We put the same faith in our testing bodies.

Just as gas pumps are certified and periodically calibrated, many high-stakes assessment programs are also reviewed. In the U.S., state testing programs are reviewed by the U.S. Department of Education, peer review groups, and technical advisory boards. Certification and licensure programs are sometimes reviewed by third-party accreditation programs, though these accreditations usually only look to see that certain requirements are met without evaluating how well they were executed.

In her op-ed, Can We Trust Assessment Results?, Eva Baker argues that the trustworthiness of assessment results is dependent on the transparency of the testing program. I agree with her. Participants should be able to easily get information on the purpose of the assessment, the content that is covered, and how the assessment was developed. Baker also adds that appropriate validity studies should be conducted and shared. I was especially pleased to see Baker propose that “good transparency occurs when test content can be clearly summarized without giving away the specific questions.”

For test results to be trustworthy, transparency also needs to extend beyond the development of the assessment to include its maintenance. Participants and other stakeholders should have confidence that the testing body is monitoring its assessments, and that a plan is in place should their results become compromised.

In their article, Cheating: Its Implications for ABFM Examinees, Kenneth Royal and James Puffer discuss cases where widespread cheating affects the statistics of the assessment, which in turn mislead test developers by making items appear easier. The effect can be an assessment that yields invalid results. Though specific security measures should be kept confidential, testing bodies should have a public-facing security plan that explains their policies for addressing improprieties. This plan should address policies for the participants as
well as for how the testing body will handle test design decisions that have been impacted by compromised results.

Even under ideal circumstances, mistakes can happen. Readers may recall that, in 2006, thousands of students received incorrect scores on the SAT, arguably one of the best-developed and carefully scrutinized assessments in U.S. education. The College Board (the testing body that runs the SAT) handled the situation as well as they could, publicly sharing the impact of the issue, the reasons it happened, and their policies for how they would handle the incorrect results. Others will feel differently, but I trust SAT scores more now that I have observed how the College Board communicated and rectified the mistake.

Most testing programs are well-run, professional operations backed by qualified teams of test developers, but there are the occasional junk testing programs such as predatory certificate programs, that yield useless, untrustworthy results. It can be difficult to tell the difference, but like Eva Baker, I believe that organizational transparency is the right way for a testing body to earn the trust of its stakeholders.

Next Page »