Item Development – Organizing a bias review committee (Part 2)

Austin Fossey-42Posted by Austin Fossey

The Standards for Educational and Psychological Testing describe two facets of an assessment that can result in bias: the content of the assessment and the response process. These are the areas on which your bias review committee should focus. You can read Part 1 of this post, here.

Content bias is often what people think of when they think about examples of assessment bias. This may pertain to item content (e.g., students in hot climates may have trouble responding to an algebra scenario about shoveling snow), but it may also include language issues, such as the tone of the content, differences in terminology, or the reading level of the content. Your review committee should also consider content that might be offensive or trigger an emotional response from participants. For example, if an item’s scenario described interactions in a workplace, your committee might check to make sure that men and women are equally represented in management roles.

Bias may also occur in the response processes. Subgroups may have differences in responses that are not relevant to the construct, or a subgroup may be unduly disadvantaged by the response format. For example, an item that asks participants to explain how they solved an algebra problem may be biased against participants for whom English is a second language, even though they might be employing the same cognitive processes as other participants to solve the algebra. Response process bias can also occur if some participants provide unexpected responses to an item that are correct but may not be accounted for in the scoring.

How do we begin to identify content or response processes that may introduce bias? Your sensitivity guidelines will depend upon your participant population, applicable social norms, and the priorities of your assessment program. When drafting your sensitivity guidelines, you should spend a good amount of time researching potential sources of bias that could manifest in your assessment, and you may need to periodically update your own guidelines based on feedback from your reviewers or participants.

In his chapter in Educational Measurement (4th ed.), Gregory Camilli recommends the chapter on fairness in the ETS Standards for Quality and Fairness and An Approach for Identifying and Minimizing Bias in Standardized Tests (Office for Minority Education) as sources of criteria that could be used to inform your own sensitivity guidelines. If you would like to see an example of one program’s sensitivity guidelines that are used to inform bias review committees for K12 assessment in the United States, check out the Fairness Guidelines Adopted by PARCC (PARCC), though be warned that the document contains examples of inflammatory content.

In the next post, I will discuss considerations for the final round of item edits that will occur before the items are field tested.

Check out our white paper: 5 Steps to Better Tests for best practice guidance and practical advice for the five key stages of test and exam development.

Austin Fossey will discuss test development at the 2015 Users Conference in Napa Valley, March 10-13. Register before Dec. 17 and save $200.

Heading home from San Antonio

Joan Phaup 2013 (3)Posted by Joan Phaup

Bryan Chapman (2)

Bryan Chapman

As we head back home from this week’s Questionmark Users Conference in San Antonio, it’s good to reflect on the connections people made with one another during discussions, focus groups, social events and a wide variety of presentations covering best practices, case studies and the features and functions of Questionmark technologies. Many thanks to all our presenters!

Bryan Chapman’s keynote on Transforming Open Data into Meaning and Action offered an expansive approach to a key theme of this year’s conference. Bryan described the tremendous power of OData while dispelling much of the mystery around it. He explained that OData can be exchanged in simple ways, such as using a URL or inserting a command line to create, read, update, and/or delete data items.

thurs night scottIt was interesting to see how focusing on key indicators that have the biggest impact can produce easy-to-understand visual representations of what is happening within an organization. Among the many dashboards Bryan shared was on that showed the amount of safety training in relation to incidence of on-the-job injuries.thurs night trio

No conference is complete without social events that nurture new friendships and cement long-established bonds. Yesterday ended with a visit to the Rio Cibolo Ranch outside the city, where we enjoyed a Texas-style meal, western music and all manner of ranch activities. Many of us got acquainted with some Texas Longhorn Cattle, and the bravest folks of all took some lassoing lessons (snagging a  mechanical calf, not a longhorn!).

Today’s breakouts and general session complete three intensive days of learning. Here’s wishing everyone a good journey home and continued connections in the year ahead.

Psychometrics 101: How do I know if my assessment is reliable? (Part 1)


Posted by Greg Pope

At last week’s Questionmark Users Conference I presented a session on item and test analysis, and part of that session dealt with test score reliability.

“Reliability” is used in everyday language: “My car runs reliably” means it starts every time. In the assessment realm we talk about test score reliability, which refers to how consistently and accurately test scores measure a construct (knowledge/skills in the domain of interest such as “American History Knowledge”).

Assessments are measurement instruments; the questions composing the assessment take measurements of what people know and can do. Just as thermometers take measurements of temperature, assessment questions take measurements of psychological attributes. Like any measurement instrument, there is some imprecision in the estimates, so the test score that a person obtains (observed score) is actually composed of a theoretical “true score” (what they actually really know and can do) plus some error. Reliable test scores have the least amount of error and therefore the smallest difference between the observed score and this theoretical true score.It is hard to go into a great deal of detail here, so for a good primer into the theory check out: Traub, R.E. (1994). Reliability for the Social Sciences: Theory & Applications. Thousand Oaks: Sage.

Generally there are four approaches for measuring reliability:

1.    Internal consistency: How well do items on the test “hang together” to measure the same psychological attribute

2.    Split-half (split-forms): How well do scores on two forms (splits) of the test (first 25 items versus last 25) relate to one another

3.    Test-retest: How similar are scores obtained from multiple administrations of the same test

4.    Inter-rater reliability: How consistently do two or more raters (essay markers) obtain similar scores.

Internal consistency reliability is common and is used in our Test Analysis Report and Results Management System ,where we use Cronbach’s Alpha.

Stay tuned for Part 2 of this post, which will discuss the factors and test characteristics that generally influence internal consistency reliability coefficient values.