Posted by Greg Pope
At last week’s Questionmark Users Conference I presented a session on item and test analysis, and part of that session dealt with test score reliability.
“Reliability” is used in everyday language: “My car runs reliably” means it starts every time. In the assessment realm we talk about test score reliability, which refers to how consistently and accurately test scores measure a construct (knowledge/skills in the domain of interest such as “American History Knowledge”).
Assessments are measurement instruments; the questions composing the assessment take measurements of what people know and can do. Just as thermometers take measurements of temperature, assessment questions take measurements of psychological attributes. Like any measurement instrument, there is some imprecision in the estimates, so the test score a person obtains (the observed score) is actually composed of a theoretical "true score" (what they actually know and can do) plus some error. Reliable test scores have the least amount of error and therefore the smallest difference between the observed score and this theoretical true score. It is hard to go into a great deal of detail here, so for a good primer on the theory check out: Traub, R.E. (1994). Reliability for the Social Sciences: Theory & Applications. Thousand Oaks: Sage.
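The observed score = true score + error idea can be illustrated with a tiny simulation (a hypothetical sketch, not anything from the conference session; the means, standard deviations, and sample size are made up for illustration). In classical test theory, reliability works out to the proportion of observed-score variance that comes from true scores rather than error:

```python
import random
import statistics

random.seed(42)

# Hypothetical examinees: each has a "true score" (what they really
# know) and an observed score = true score + random measurement error.
true_scores = [random.gauss(70, 10) for _ in range(1000)]
errors = [random.gauss(0, 5) for _ in range(1000)]
observed = [t + e for t, e in zip(true_scores, errors)]

# Reliability = share of observed-score variance due to true scores.
# Smaller error variance -> observed scores closer to true scores.
reliability = statistics.variance(true_scores) / statistics.variance(observed)
print(round(reliability, 2))  # roughly 100 / (100 + 25) = 0.8
```

Shrinking the error standard deviation toward zero pushes this ratio toward 1.0, which is the sense in which "reliable test scores have the least amount of error."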
Generally there are four approaches for measuring reliability:
1. Internal consistency: How well do the items on the test "hang together" to measure the same psychological attribute?
2. Split-half (split-forms): How well do scores on two forms (splits) of the test (e.g., the first 25 items versus the last 25) relate to one another?
3. Test-retest: How similar are scores obtained from multiple administrations of the same test?
4. Inter-rater reliability: How consistently do two or more raters (e.g., essay markers) assign similar scores?
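The first two approaches can be computed directly from an item-score matrix. Here is a minimal sketch (the response data, the odd/even-style front/back split, and the use of Cronbach's alpha for internal consistency are my illustrative assumptions, not part of the original post):

```python
import math
import statistics

# Toy response matrix: rows are examinees, columns are item scores
# (1 = correct, 0 = incorrect). Purely illustrative data.
scores = [
    [1, 1, 1, 0, 1, 1],
    [1, 0, 1, 1, 1, 0],
    [0, 0, 1, 0, 0, 1],
    [1, 1, 1, 1, 1, 1],
    [0, 1, 0, 0, 1, 0],
    [1, 1, 0, 1, 1, 1],
    [0, 0, 0, 0, 1, 0],
    [1, 1, 1, 1, 0, 1],
]

def cronbach_alpha(matrix):
    """Internal consistency: how well the items 'hang together'."""
    k = len(matrix[0])
    item_vars = [statistics.variance(col) for col in zip(*matrix)]
    total_var = statistics.variance([sum(row) for row in matrix])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)

def split_half(matrix):
    """Split-half: correlate first-half and second-half scores,
    then step the correlation up to full test length with the
    Spearman-Brown formula."""
    mid = len(matrix[0]) // 2
    x = [sum(row[:mid]) for row in matrix]   # first-half scores
    y = [sum(row[mid:]) for row in matrix]   # second-half scores
    mx, my = statistics.mean(x), statistics.mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    r = cov / math.sqrt(sum((a - mx) ** 2 for a in x)
                        * sum((b - my) ** 2 for b in y))
    return 2 * r / (1 + r)  # Spearman-Brown correction

print(round(cronbach_alpha(scores), 2))
print(round(split_half(scores), 2))
```

Test-retest and inter-rater reliability follow the same correlation pattern, but across testing occasions or raters rather than across item subsets.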
Stay tuned for Part 2 of this post, which will discuss the factors and test characteristics that generally influence internal consistency reliability coefficient values.