Psychometrics 101: How do I know if an assessment is reliable? (Part 2)


Posted by Greg Pope

In my last post I offered some general information about assessment reliability. Below are some additional specific things to consider.

  • What factors / test characteristics generally influence internal consistency reliability coefficient values?

A.    Item difficulty: Items that are extremely hard or extremely easy tend to discriminate poorly and therefore lower reliability. If a large number of participants do not have time to finish the test, this also affects item difficulty.
B.    Item discrimination: Items that have higher discrimination values will contribute more to the measurement efficacy of the assessment (more discriminating questions = higher reliability). Part of this relates to sound question development: if questions are well crafted and unambiguously worded, they are more likely to have acceptable discrimination.
C.    Construct being measured: If all questions are measuring the same construct (e.g., from the same topic) reliability will be increased
D.    How many participants took the test: With very small numbers of participants the reliability coefficient will be less stable
E.    Composition of people that took the test: If the sample of participants taking an assessment is not representative (e.g., no one studied!) the reliability will be negatively impacted
F.    How many questions are administered: Generally, the more questions administered the higher the reliability (to a point: we can’t have a 10,000-question test!)
G.    Environmental administration factors: Conditions in the testing area such as noise, lighting levels, etc. can cause distraction away from the measurement of what the participants know and can do
H.    Person factors: Test anxiety, fatigue, and other human factors can reduce the accuracy of measurement of what people know and can do
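
Point F can be illustrated with the Spearman-Brown prophecy formula, a standard result that is not named in the post itself. The sketch below assumes any added questions are of the same quality as the existing ones and measure the same construct. The function name and the starting reliability of 0.70 are only illustrative.

```python
def spearman_brown(reliability: float, length_factor: float) -> float:
    """Predict reliability when test length is multiplied by length_factor.

    Assumes the added (or removed) questions are comparable to the
    existing ones; real questions rarely meet this assumption exactly.
    """
    return (length_factor * reliability) / (1 + (length_factor - 1) * reliability)


# Hypothetical starting point: a test with reliability 0.70.
for factor in (0.5, 1, 2, 4, 10):
    predicted = spearman_brown(0.70, factor)
    print(f"{factor:>4}x length -> predicted reliability {predicted:.3f}")
```

Under these assumptions, doubling the test raises the predicted reliability from 0.70 to about 0.82, while a tenfold increase only reaches about 0.96. That is the "to a point" in F: each extra question buys less than the one before.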


For more on this subject see the Questionmark White Paper, “Defensible Assessments: What You Need to Know”

Psychometrics 101: How do I know if my assessment is reliable? (Part 1)


Posted by Greg Pope

At last week’s Questionmark Users Conference I presented a session on item and test analysis, and part of that session dealt with test score reliability.

“Reliability” is used in everyday language: “My car runs reliably” means it starts every time. In the assessment realm we talk about test score reliability, which refers to how consistently and accurately test scores measure a construct (knowledge/skills in the domain of interest such as “American History Knowledge”).

Assessments are measurement instruments; the questions composing the assessment take measurements of what people know and can do. Just as thermometers take measurements of temperature, assessment questions take measurements of psychological attributes. Like any measurement instrument, there is some imprecision in the estimates, so the test score that a person obtains (observed score) is actually composed of a theoretical “true score” (what they really know and can do) plus some error. Reliable test scores have the least amount of error and therefore the smallest difference between the observed score and this theoretical true score. It is hard to go into a great deal of detail here, so for a good primer on the theory check out: Traub, R.E. (1994). Reliability for the Social Sciences: Theory & Applications. Thousand Oaks: Sage.
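
In symbols, the observed score / true score idea is usually written as follows. This is a sketch in standard classical test theory notation, which the post does not use itself: X is the observed score, T the true score, E the error, and the error is assumed to be uncorrelated with the true score.

```latex
X = T + E, \qquad
\sigma_X^2 = \sigma_T^2 + \sigma_E^2, \qquad
\text{reliability} = \frac{\sigma_T^2}{\sigma_X^2} = 1 - \frac{\sigma_E^2}{\sigma_X^2}
```

Read this way, a reliability of 1 would mean no error variance at all, and a reliability of 0 would mean the scores are all error.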

Generally there are four approaches for measuring reliability:

1.    Internal consistency: How well do items on the test “hang together” to measure the same psychological attribute?

2.    Split-half (split-forms): How well do scores on two forms (splits) of the test (e.g., first 25 items versus last 25) relate to one another?

3.    Test-retest: How similar are scores obtained from multiple administrations of the same test?

4.    Inter-rater reliability: How consistently do two or more raters (e.g., essay markers) assign similar scores?

Internal consistency reliability is common and is used in our Test Analysis Report and Results Management System, where we use Cronbach’s Alpha.
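
For readers who want to see the arithmetic, here is a minimal sketch of Cronbach’s Alpha using the textbook formula. It is not the code behind the Questionmark reports. The function name and the small response matrix are made up for illustration, and the sketch assumes every participant answered every question and that total scores are not all identical.

```python
from statistics import variance


def cronbach_alpha(scores):
    """Cronbach's Alpha for a list of participant rows of item scores.

    alpha = k / (k - 1) * (1 - sum of item variances / variance of totals)
    """
    k = len(scores[0])                      # number of questions
    item_columns = list(zip(*scores))       # transpose: one tuple per question
    item_variance_sum = sum(variance(col) for col in item_columns)
    total_variance = variance([sum(row) for row in scores])
    return (k / (k - 1)) * (1 - item_variance_sum / total_variance)


# Hypothetical data: 5 participants x 4 questions, scored 0 or 1.
responses = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]
print(f"Cronbach's Alpha: {cronbach_alpha(responses):.2f}")
```

With only five participants this number would be very unstable in practice; the tiny data set is there only to keep the calculation easy to follow by hand.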

Stay tuned for Part 2 of this post, which will discuss the factors and test characteristics that generally influence internal consistency reliability coefficient values.

Psychometrics 101: Item Total Correlation


Posted by Greg Pope

I’ll be talking about a subject dear to my heart, psychometrics, at the Questionmark Users Conference April 5-8. Here’s a sneak preview of one of my topics: item total correlation! What is it, and what does it mean?

The item total correlation is a correlation between the question score (e.g., 0 or 1 for multiple choice) and the overall assessment score (e.g., 67%). It is expected that participants who get a question correct should, in general, have higher overall assessment scores than participants who get it wrong. The same holds for essay-type questions: where a question is scored between 0 and 5, participants who did a really good job on the essay (got a 4 or 5) should have higher overall assessment scores (maybe 85-90%). This relationship is shown in an example graph below.

[Figure: example graph of question scores against overall assessment scores]

This relationship in psychometrics is called ‘discrimination’, referring to how well a question differentiates between participants who know the material and those who do not. Participants who know the material taught to them should get high scores on questions and high overall assessment scores. Participants who did not master the material should get low scores on questions and lower overall assessment scores. This is the relationship that an item-total correlation provides to help evaluate the performance of questions. We want to have lots of highly discriminating questions on our tests because they are the most fine-tuned measurements to find out what participants know and can do.

When looking at an item-total correlation, negative values are generally a major red flag: it is unexpected that participants who get low scores on a question get high scores on the assessment. This could indicate a mis-keyed question or that the question was highly ambiguous and confusing to participants. Values for an item-total correlation (point-biserial) between 0 and 0.19 may indicate that the question is not discriminating well, values between 0.2 and 0.39 indicate good discrimination, and values of 0.4 and above indicate very good discrimination.
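
A minimal sketch of the calculation, assuming each question score is correlated with the total score that includes that question, and that every question has at least some variation in scores. The function names, the labels, and the response data are made up for illustration. Values between 0.19 and 0.2 are treated as "not discriminating well" here, which is a choice made for this sketch.

```python
from math import sqrt


def pearson(x, y):
    """Plain Pearson correlation; equals the point-biserial when x is 0/1."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return covariance / (spread_x * spread_y)


def item_total_correlations(scores):
    """One correlation per question: question score vs. total score."""
    totals = [sum(row) for row in scores]
    return [pearson(list(column), totals) for column in zip(*scores)]


def classify(r):
    """Label a correlation using the rough bands given in the post."""
    if r < 0:
        return "red flag: check the key and the wording"
    if r < 0.2:
        return "not discriminating well"
    if r < 0.4:
        return "good discrimination"
    return "very good discrimination"


# Hypothetical data: 5 participants x 4 questions, scored 0 or 1.
responses = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]
for number, r in enumerate(item_total_correlations(responses), start=1):
    print(f"Question {number}: r = {r:.2f} ({classify(r)})")
```

A mis-keyed question would show up here as a negative value, because the participants with the highest totals would be the ones marked wrong on it.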