Six tips to increase reliability in competence tests and exams

Posted by John Kleeman

Reliability (how consistent an assessment is in measuring something) is a vital criterion on which to judge a test, exam or quiz. This blog post explains what reliability is and why it matters, and gives a few tips on how to increase it when using competence tests and exams within regulatory compliance and other work settings.

What is reliability?

An assessment is reliable if it measures the same thing consistently and reproducibly.

If you were to deliver an assessment with high reliability to the same participant on two occasions, you would be very likely to reach the same conclusions about the participant’s knowledge or skills. A test with poor reliability might result in very different scores across the two instances.

It’s useful to think of a kitchen scale. If the scale is reliable, then when you weigh a bag of flour on it today and weigh the same bag again tomorrow, it will show the same weight. But if the scale is not working properly and is not reliable, it could give you a different weight each time.

Why does reliability matter?

Just like a kitchen scale that doesn’t work, an unreliable assessment does not measure anything consistently and cannot be used as a trustworthy measure of competency.

As well as reliability, it’s also important that an assessment is valid, i.e. measures what it is supposed to. Continuing the kitchen scale metaphor, a scale might consistently show the wrong weight; in such a case, the scale is reliable but not valid. To learn more about validity, see my earlier post Six tips to increase content validity in competence tests and exams.

How can you increase the reliability of your assessments?

Here are six practical tips to help increase the reliability of your assessment:

  1. Use enough questions to assess competence. Although you need a sensible balance to avoid tests being too long, reliability increases with test length. In their excellent book, Criterion-Referenced Test Development, Shrock and Coscarelli suggest a rule of thumb of 4-6 questions per objective, with more for critical objectives. You can also get guidance from an earlier post on this blog, How many questions do I need on my assessment?
  2. Have a consistent environment for participants. For test results to be consistent, it’s important that the test environment is consistent – try to ensure that all participants have the same amount of time to take the test and a similar environment in which to take it. For example, if some participants take the test in a hurry in a noisy public place while others take it at leisure in their office, this could impact reliability.
  3. Ensure participants are familiar with the assessment user interface. If a participant is new to the user interface or the question types, they may not show their true competence due to the unfamiliarity. It’s common to provide practice tests so that participants can become familiar with the assessment user interface. This can also reduce test anxiety, which itself influences reliability.
  4. If using human raters, train them well. If you are using human raters, for example in grading essays or in observational assessments that check practical skills, make sure to define your scoring rules very clearly and as objectively as possible. Train your observers/raters, review their performance, give practice sessions and provide exemplars.
  5. Measure reliability. There are a number of ways of doing this, but the most common is to calculate what is called “Cronbach’s Alpha”, which measures internal consistency reliability (the higher it is, the better); a small illustrative calculation appears just after this list. It’s particularly useful if all questions on the assessment measure the same construct. You can easily calculate this for Questionmark assessments using our Test Analysis Report.
  6. Conduct regular item analysis to weed out ambiguous or poorly performing questions. Item analysis is an automated way of flagging weak questions for review and improvement. Questions that are developed through sound procedures, well crafted and unambiguously worded, are more likely to discriminate well and so contribute to a reliable test. Running regular item analysis is the best way to identify poorly performing questions. If you want to learn more about item analysis, I recently gave a webinar on “Item Analysis for Beginners”, and you can access the recording here.
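To make tip 5 a little more concrete, here is a minimal sketch of the Cronbach’s Alpha calculation in Python. The score matrix and function name are invented for illustration only; in practice the Test Analysis Report does this calculation for you.

```python
import numpy as np

def cronbachs_alpha(scores):
    """Cronbach's Alpha for a participants-by-items matrix of item scores."""
    n_items = scores.shape[1]
    item_variances = scores.var(axis=0, ddof=1)      # variance of each item's scores
    total_variance = scores.sum(axis=1).var(ddof=1)  # variance of participants' total scores
    return (n_items / (n_items - 1)) * (1 - item_variances.sum() / total_variance)

# Invented example: 6 participants answering 4 dichotomously scored items
scores = np.array([
    [1, 1, 1, 0],
    [1, 0, 1, 1],
    [0, 0, 1, 0],
    [1, 1, 1, 1],
    [0, 0, 0, 0],
    [1, 1, 0, 1],
])
print(cronbachs_alpha(scores))
```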


I hope this blog post reminds you why reliability matters and gives some ideas on how to improve reliability. There is lots more information on how to improve reliability and write better assessments on the Questionmark website – check out our resources at www.questionmark.com/learningresources.

G Theory and Reliability for Assessments with Randomly Selected Items

Posted by Austin Fossey

One of our webinar attendees recently emailed me to ask if there is a way to calculate reliability when items are randomly selected for delivery in a classical test theory (CTT) model.

As with so many things, the answer comes from Lee Cronbach—but it’s not Cronbach’s Alpha. In 1963, Cronbach, along with Goldine Gleser and Nageswari Rajaratnam, published a paper on generalizability theory, which is often called G theory for brevity or to sound cooler. G theory is a very powerful set of tools, but today I am focusing on one aspect of it: the generalizability coefficient, which describes the degree to which observed scores might generalize to a broader set of measurement conditions. This is helpful when the conditions of measurement will change for different participants, as is the case when we use different items, different raters, different administration dates, etc.

In G theory, measurement conditions are called facets. A facet might include items, test forms, administration occasions, or human raters. Facets can be random (i.e., a sample of a much larger population of potential facets), or they might be fixed, such as a condition that is controlled by the researcher. The hypothetical set of conditions across all possible facets is called, quite grandly, the universe of generalization. A participant’s average measurement across the universe of generalization is called their universe score, which is similar to a true score in CTT, except that we no longer need to assume that all measurements in the universe of generalization are parallel.

In CTT, the concept of reliability is defined as the ratio of true score variance to observed score variance. Observed scores are just true scores plus measurement error, so as measurement error decreases, reliability increases toward 1.00.
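Written out in standard CTT notation (the symbols here are mine, not quoted from any particular source), with true score variance, error variance, and observed score variance:

```latex
\rho_{XX'} = \frac{\sigma^2_T}{\sigma^2_X} = \frac{\sigma^2_T}{\sigma^2_T + \sigma^2_E}
```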

The generalizability coefficient is defined as the ratio of universe score variance to expected score variance, which is similar to the concept of reliability in CTT. The generalizability coefficient is made of variance components, which differ depending on the design of the study, and which can be derived from an analysis of variance (ANOVA) summary table. We will not get into the math here, but I recommend Linda Crocker and James Algina’s Introduction to Classical and Modern Test Theory for a great introduction and easy-to-follow examples of how to calculate generalizability coefficients under multiple conditions. For now, let’s return to our randomly selected items.

In his chapter in Educational Measurement, 4th Edition, Edward Haertel illustrated the overlaps between G theory and CTT reliability measures. When all participants see the same items, the generalizability coefficient is made up of the variance components for the participants and for the residual scores, and it yields the exact same value as Cronbach’s Alpha. If the researcher wants to use the generalizability coefficient to generalize to an assessment with more or fewer items, then the result is the same as the Spearman-Brown formula.

But when our participants are each given a random set of items, they are no longer receiving parallel assessments. The generalizability coefficient has to be modified to include a variance component for the items, and the observed score variance is now a function of three things:

  • Error variance.
  • Variance in the item mean scores.
  • Variance in the participants’ universe scores.

Note that error variance is not the same as measurement error in CTT. In the case of a randomly generated assessment, the error variance includes measurement error and an extra component that reflects the lack of perfect correlation between the items’ measurements.
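One common way to write the resulting coefficient is sketched below, using my own notation rather than anything quoted from Haertel’s chapter: here σ²_p is the universe score variance for participants, σ²_i is the variance of the item mean scores, σ²_res is the residual (error) variance, and n_i is the number of items each participant answers.

```latex
E\rho^2 = \frac{\sigma^2_p}{\sigma^2_p + \dfrac{\sigma^2_i + \sigma^2_{res}}{n_i}}
```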

For those of you randomly selecting items, this makes a difference! Cronbach’s Alpha may yield low or even meaningless results (e.g., negative values) when items are randomly selected. In an example dataset, 1,000 participants answered the same 200 items. For this assessment, Cronbach’s Alpha is equivalent to the generalizability coefficient: 0.97. But if each of those participants had instead answered 50 items randomly selected from the same set, Cronbach’s Alpha would no longer be appropriate. If we tried to use it anyway, we would see a depressing number: 0.50. The generalizability coefficient, however, is 0.65 – still too low, but better than the alpha value.

Finally, it is important to report your results accurately. According to the Standards for Educational and Psychological Testing, you can report generalizability coefficients as reliability evidence if it is appropriate for the design of the assessment, but it is important not to use these terms interchangeably. Generalizability is a distinct concept from reliability, so make sure to label it as a generalizability coefficient, not a reliability coefficient. Also, the Standards require us to document the sources of variance that are included (and excluded) from the calculation of the generalizability coefficient. Readers are encouraged to refer to the Standards’ chapter on reliability and precision for more information.

Item Analysis Report – Item Reliability

Posted by Austin Fossey

In this series of posts, we have been discussing the statistics that are reported on the Item Analysis Report, including the difficulty index, correlational discrimination, and high-low discrimination.

The final statistic reported on the Item Analysis Report is the item reliability. Item reliability is simply the product of the standard deviation of item scores and a correlational discrimination index (Item-Total Correlation Discrimination in the Item Analysis Report). So item reliability reflects how much the item is contributing to total score variance. As with assessment reliability, higher values represent better reliability.
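To make that definition concrete, here is a minimal Python sketch of the item reliability index for a single item. The item scores and total scores are invented for illustration; the Item Analysis Report computes this statistic for you.

```python
import numpy as np

def item_reliability(item_scores, total_scores):
    """Item reliability index: item score standard deviation times the item-total correlation."""
    item_sd = item_scores.std(ddof=1)
    item_total_corr = np.corrcoef(item_scores, total_scores)[0, 1]
    return item_sd * item_total_corr

# Invented example: scores on one item and the corresponding assessment total scores
item = np.array([1, 0, 1, 1, 0, 1])
totals = np.array([9, 4, 7, 8, 3, 6])
print(item_reliability(item, totals))
```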

Like the other statistics in the Item Analysis Report, item reliability is used primarily to inform decisions about item retention. Crocker and Algina (Introduction to Classical and Modern Test Theory) describe three ways that test developers might use the item reliability index.

1) Choosing Between Two Items in Form Construction

If two items have similar discrimination values, but one item has a higher standard deviation of item scores, then that item will have higher item reliability and will contribute more to the assessment’s reliability. All else being equal, the test developer might decide to retain the item with higher reliability and save the lower reliability item in the bank as backup.

2) Building a Form with a Required Assessment Reliability Threshold

As Crocker and Algina demonstrate, Cronbach’s Alpha can be calculated as a function of the standard deviations of items’ scores and items’ reliabilities. If the test developer desires a certain minimum for the assessment’s reliability (as measured by Cronbach’s Alpha), they can use these two item statistics to build a form that will yield the desired level of internal consistency.
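As a sketch of that relationship (the symbols here are mine, not Crocker and Algina’s exact notation), let σ_i be the standard deviation of item i’s scores and r_iX its item-total correlation, so that σ_i r_iX is the item reliability index; alpha for an n-item form can then be written as:

```latex
\alpha = \frac{n}{n-1}\left(1 - \frac{\sum_i \sigma_i^2}{\left(\sum_i \sigma_i r_{iX}\right)^2}\right)
```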

3) Building a Form with a Required Total Score Variance Threshold

Crocker and Algina explain that the total score variance is equivalent to the square of the sum of item reliability indices, so test developers may continue to add items to a form based on their item reliability values until they meet their desired threshold for total score variance.
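Written out, again with my own symbols as a sketch of the identity Crocker and Algina describe, the relationship is:

```latex
\sigma^2_X = \left(\sum_i \sigma_i r_{iX}\right)^2
```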


Item reliability from Questionmark’s Item Analysis Report (item detail page)

When and where should I use randomly delivered assessments?


Posted by Greg Pope

I am often asked my psychometric opinion regarding when and where random administration of assessments is most appropriate.

To refresh memories, this is a feature in Questionmark Perception Authoring Manager that allows you to select questions at random from one or more topics when creating an assessment. Rather than administering the same 10 questions to all participants, you can give each participant a different set of questions that are pulled at random from the bank of questions in the repository.
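Conceptually, the selection works something like the following Python sketch. This is only an illustration with invented question IDs, not how Questionmark Perception implements the feature.

```python
import random

# Hypothetical question bank for one topic (the question IDs are invented)
topic_bank = ["Q%03d" % n for n in range(1, 101)]  # 100 questions in the topic

def build_random_form(bank, n_questions=10):
    """Draw a random set of questions for a participant."""
    return random.sample(bank, n_questions)

print(build_random_form(topic_bank))  # one participant's form
print(build_random_form(topic_bank))  # another participant's (almost certainly different) form
```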

So when is it appropriate to use random administration? I think that depends on the answer to this question: What are the assessment’s stakes and purpose? If the stakes are low and the assessment scores are used to help reinforce information learned, or to give participants a rough sense of how they are doing in an area, I would say that using random administration is defensible. However, if the stakes are medium/high and the assessment scores are used for advancing or certifying participants, I usually caution against random administration. Here are a few reasons why:

  • Expert review of the assessment form(s) cannot be conducted in advance (each participant gets a unique form)
      • Generally, SMEs, psychometricians, and other experts will thoroughly review a test form before it is put into live production. This is to ensure that the form meets difficulty, content and other criteria before being administered to participants in a medium/high-stakes context. With randomly administered assessments, this advance review is not possible, as every participant receives a different set of questions.
  • Issues with the calculation of question statistics using Classical Test Theory (CTT)
      • Smaller numbers of participants will answer each individual question. (Rather than all 200 participants answering all 50 questions in a fixed-form test, randomly administered tests generated from a bank of 100 questions may have only a few participants answering each question.)
      • As we saw in a previous blog post, sample size affects the robustness of item statistics. With fewer participants taking each question, it becomes difficult to have confidence in the stability of the statistics generated.
  • Equivalency of assessment scores is difficult to achieve and prove
      • An important assumption of CTT is equivalence of forms, or parallel forms. In assessment contexts where more than one form of an exam is administered to participants, a great deal of time is spent ensuring that the forms of the assessment are parallel in every way possible (e.g., difficulty of questions, blueprint coverage, question types) so that the scores participants obtain are equivalent.
      • With random administration it is not possible to control and verify in advance of an assessment session that the forms are parallel, because the questions are pulled at random. This leads to the following problem with the equivalence of participant scores:
      • If one participant got 2/10 on a randomly administered assessment and another participant got 8/10 on the same assessment, it would be difficult to know whether the participant who got 2/10 scored low because they (by chance) received harder questions than the participant who got 8/10, or whether the low-scoring participant actually did not know the material.
      • Using meta tags one can mitigate this issue to some degree (e.g., by randomly administering questions within topics by difficulty ranges and other meta tag data), but this would not completely guarantee randomly equivalent forms.
  • Issues with the calculation of test reliability statistics using CTT
      • Statistics such as Cronbach’s Alpha have trouble with random administration, which produces a lot of missing data for questions (not all participants answer all questions), and psychometric statistics rarely handle missing data well.

There are other alternatives to random administration depending on what the needs are. For example, if random administration is being looked at to curb cheating, options such as shuffling answer choices and randomizing presentation order could serve this need, making it very difficult for participants to copy answers off of one another.

It is important for an organization to look at their context to determine what is best for them. Questionmark provides many options for our customers when it comes to assessment solutions and invites them to work with us in adopting workable solutions.

How many questions do I need on my assessment?


Posted by Greg Pope

I was recently asked a common question regarding creating assessments: How many questions are needed on an assessment in order to obtain valid and reliable participant scores? The answer to this question depends on the context/purpose of the assessment and how scores are used. For example, if an organization is administering a low-stakes quiz designed to facilitate learning during study with on-the-spot question-level feedback and no summary scores, then one question would be enough (although probably more would be better to achieve the intended purpose). If no summary scores are calculated (e.g., an overall assessment score), or if these overall scores are not used for anything, then very small numbers of questions are fine. However, if an organization is administering an end-of-course exam that a participant has to pass in order to complete a course, the volume of questions on that exam is important. (A few questions aren’t going to cut it!) The issue in terms of psychometrics is whether very few questions would provide enough measurement information to allow someone to draw conclusions from the score obtained (e.g., does this participant know enough to be considered proficient?).

Ever wonder why you have to take so many questions on a certification or licensing exam? One rarely gets to answer only 2-3 questions on a driving test, and certainly not on a chartered accountant licensing exam. Often one might take close to 100 questions on such exams. One of the reasons for this is that more individual measurements of what a participant knows and can do are needed to ensure that the reliability of the scores obtained is high (and therefore that the error is low). Individual measurements are questions, and if we asked a participant only one question on an accounting licensing exam we would be unlikely to get a reliable estimate of the participant’s accounting knowledge and skills. Reliability is required for an assessment score to be considered valid, and generally the more questions on an assessment (up to a practical limit), the higher the reliability.

Generally, an organization would have a target reliability value in mind that would help determine how many questions are needed, at a minimum, to achieve the measurement accuracy required in a given context. For example, in a high-stakes testing program where people are being certified or licensed based on their assessment scores, a reliability of 0.9 or higher (the closer to 1 the better) would likely be required. Once a minimum reliability target is established, one can estimate how many items might be required to achieve it. An organization could administer a pilot beta test of an assessment and run the Test Analysis Report to obtain the Cronbach’s Alpha test reliability coefficient. One could then use the Spearman-Brown prophecy formula (described further in “Psychometric Theory” by Nunnally & Bernstein, 1994) to estimate how much the internal consistency reliability will increase if the number of questions on the assessment increases:

Predicted reliability = (k × r11) / (1 + (k − 1) × r11)

Where:

  • k = the factor by which the length of the assessment is increased (e.g., k=3 means the assessment is 3x longer)
  • r11 = the existing internal consistency reliability of the assessment

For example, if the Cronbach’s Alpha reliability coefficient of a 20-item exam is 0.70 and 40 items are added to the assessment (increasing the length of the test by 3x), the estimated reliability of the new 60-item exam will be approximately 0.88: (3 × 0.70) / (1 + 2 × 0.70) = 2.1 / 2.4 ≈ 0.88.
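Here is a minimal Python sketch of that calculation (the function name is mine):

```python
def spearman_brown(r11, k):
    """Predicted reliability when the test length is changed by a factor of k."""
    return (k * r11) / (1 + (k - 1) * r11)

# 20-item exam with reliability 0.70, lengthened to 60 items (k = 3)
print(spearman_brown(0.70, 3))  # ~0.875, i.e., roughly the 0.88 quoted above
```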


If you would like to learn more about validity and reliability, see our white paper: Defensible Assessments: What you need to know.

I hope this helps to shed light on this burning psychometric issue!

Understanding Assessment Validity: An Introduction


Posted by Greg Pope

In previous posts I discussed some of the theory and applications of classical test theory and test score reliability. For my next series of posts, I’d like to explore the exciting realm of validity. I will discuss some of the traditional thinking in the area of validity as well as some new ideas, and I’ll share applied examples of how your organization could undertake validity studies.

According to the “standards bible” of educational and psychological testing, the Standards for Educational and Psychological Testing (AERA, APA, & NCME, 1999), validity is defined as “The degree to which evidence and theory support the interpretations of test scores entailed by proposed uses of tests.”

The traditional thinking around validity, familiar to most people, is that there are three main types:

  • Criterion-related validity
  • Content-related validity
  • Construct-related validity

The most recent thinking on validity takes a more unifying approach which I will go into in more detail in upcoming posts.

Now here is something you may have heard before: “In order for an assessment to be valid it must be reliable.” What does this mean? Well, as we learned in previous Questionmark blog posts, test score reliability refers to how consistently an assessment measures the same thing. One of the criteria for making the statement, “Yes, this assessment is valid,” is that the assessment must have acceptable test reliability, such as high Cronbach’s Alpha test reliability index values as found in the Questionmark Test Analysis Report and Results Management System (RMS). Other criteria for making that statement are evidence of criterion-related validity, content-related validity, and construct-related validity.

In my next posts on this topic I will provide some illustrative examples of how organizations can investigate each of these traditionally defined types of validity for their assessment programs.