Know what your questions are about before you deliver the test

Posted by Austin Fossey

A few months ago, I had an interesting conversation with an assessment manager at an educational institution—not a Questionmark customer, mind you. Finding nothing else in common, we eventually began discussing assessment design.

At this institution (which will remain anonymous), he admitted that they are often pressed for time in their assessment development cycle. There is not enough time to do all of the item development work they need to do before their students take the assessment. To get around this, their item writers draft all of the items, conduct an editorial review, and then deliver the items. The items are assigned topics after administration, and students’ total scores and topic scores are calculated from there. He asked me if Questionmark software allows test developers to assign topics and calculate topic scores after assessing the students, and I answered truthfully that it does not.

But why not? Is there a reason test developers should not do what is being practiced at this institution? Yes, there are in fact two reasons. Get ready for some psychometric finger-wagging.

Consider what this institution is doing. The items are drafted and subjected to an editorial review, but no one ever classifies the items within a topic until after the test has been administered. Recall what people typically do during a content review prior to administration:

  • Remove items that are not relevant to the domain.
  • Ensure that the blueprint is covered.
  • Check that items are assigned to the correct topic.

If topics are not assigned until after the participants have already tested, we risk the validity of the results and the legal defensibility of the test. If we have delivered items that are not relevant to the domain, we have wasted participants’ time and will need to adjust their total score. Okay, we can manage that by telling the participants ahead of time that some of the test items might not count. But if we have not asked the correct number of questions for a given area of the blueprint, the entire assessment score will be worthless—a threat to validity known as construct underrepresentation or construct deficiency in The Standards for Educational and Psychological Testing.

For example, if we were supposed to deliver 20 items from Topic A, but find out after the fact that only 12 items have been classified as belonging to Topic A, then there is little we can do about it besides rebuilding the test form and making everyone take the test again.
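This is exactly the kind of problem that pre-delivery classification lets you catch mechanically. As a trivial illustration (using a made-up blueprint and item bank, not a feature of any particular product), a blueprint coverage check might look something like this:

```python
from collections import Counter

# Hypothetical blueprint: how many items each topic requires on the form.
blueprint = {"Topic A": 20, "Topic B": 15, "Topic C": 10}

# Hypothetical draft form: each item already tagged with its topic by the item writer.
form_topics = ["Topic A"] * 12 + ["Topic B"] * 15 + ["Topic C"] * 10

counts = Counter(form_topics)
for topic, required in blueprint.items():
    if counts[topic] < required:
        print(f"{topic}: only {counts[topic]} of {required} required items")
# Topic A: only 12 of 20 required items -- caught before anyone sits the test
```

The check is only possible because the topics were assigned before the form was delivered.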

The Standards provide helpful guidance in these matters. For this particular case, the Standards point out that:

“The test developer is responsible for documenting that the items selected for the test meet the requirements of the test specifications. In particular, the set of items selected for a new test form . . . must meet both content and psychometric specifications.” (p. 82)

Publications describing best practices for test development also specify that the content must be determined before delivering an operational form. For example, in their chapter in Educational Measurement (4th Edition), Cynthia Schmeiser and Catherine Welch note the importance of conducting a content review of items before field testing, as well as a final content review of a draft test form before it becomes operational.

In Introduction to Classical and Modern Test Theory, Linda Crocker and James Algina make an interesting observation about classroom assessments, noting that students expect to be graded on all of the items they have been asked to answer. Even if they are notified in advance that some items might not be counted (as one might do in field testing), students might not consider it fair that their score is based on a yet-to-be-determined subset of items that may not fully represent the content that is supposed to be covered.

This is why Questionmark’s software is designed the way it is. When creating an item, item writers must assign an item to a topic, and items can be classified or labeled along other dimensions (e.g., cognitive process) using metatags. Even if an assessment program cannot muster any further content review, at least the item writer has classified items by content area. The person building the test form then has the information they need to make sure that the right questions get asked.

We have a responsibility as test developers to treat our participants fairly and ethically. If we are asking them to spend their time taking a test, then we owe them the most useful measurement that we can provide. Participants trust that we know what we are doing. If we postpone critical, basic development tasks like content identification until after participants have already given us their time, we are taking advantage of that trust.

Agree or disagree? 10 tips for better surveys — Part 2

Posted by John Kleeman

In my first post in this series, I explained that survey respondents go through a four-step process when they answer each question: comprehend the question, retrieve/recall the information that it requires, make a judgement on the answer and then select the response. There is a risk of error at each step. I also explained the concept of “satisficing”, where participants often give a satisfactory answer rather than an optimal one – another potential source of error.

Today, I’m offering some tips for effective online attitude survey design, based on research evidence. Following these tips should help you reduce error in your attitude surveys.

Tip #1 – Avoid Agree/Disagree questions

Although Agree/Disagree questions are among the most common question types used in surveys, you should try to avoid asking participants whether they agree with a statement.

There is an effect called acquiescence bias, whereby some participants are more likely to agree than to disagree. The research suggests that some participants are easily influenced and tend to agree with statements readily. This seems to apply particularly to participants who are more junior or less well educated, who may assume that whatever is put to them is probably true. For example, Krosnick and Presser report that across 10 studies, 52 percent of people agreed with an assertion, while only 42 percent disagreed with its opposite. If you are interested in finding out more about this effect, see this 2010 paper by Saris, Revilla, Krosnick and Schaeffer.

Satisficing – where participants just try to give a good enough answer rather than their best answer – also increases the number of “agree” answers.

For example, do not ask a question like this:

My overall health is excellent. Do you:

  • Strongly Agree
  • Agree
  • Neither Agree nor Disagree
  • Disagree
  • Strongly Disagree

Instead re-word it to be construct specific:

How would you rate your health overall?

  • Excellent
  • Very good
  • Good
  • Fair
  • Bad
  • Very bad

 

Tip #2 – Avoid Yes/No and True/False questions

For the same reason, you should avoid Yes/No questions and True/False questions in surveys. People are more likely to answer Yes than No due to acquiescence bias.

Tip #3 – Each question should address one attitude only

Avoid double-barrelled questions that ask about more than one thing. It’s very easy to ask a question like this:

  • How satisfied are you with your pay and work conditions?

However, someone might be satisfied with their pay but dissatisfied with their work conditions, or vice versa. So make it two separate questions.

Tip #4 – Minimize the difficulty of answering each question

If a question is harder to answer, it is more likely that participants will satisfice – give a good enough answer rather than the best answer. To quote Stanford Professor Jon Krosnick, “Questionnaire designers should work hard to minimize task difficulty”. For example:

  • Use as few words as possible in question and responses.
  • Use words that all your audience will know.
  • Where possible, ask questions about the recent past, not the distant past, as the recent past is easier to recall.
  • Decompose complex judgement tasks into simpler ones, with a single dimension to each one.
  • Where possible make judgements absolute rather than relative.
  • Avoid negatives. Just like in tests and exams, using negatives in your questions adds cognitive load and makes the question less likely to get an effective answer.

The less cognitive load involved in questions, the more likely you are to get accurate answers.

Tip #5 – Randomize the responses if order is not important

The order of responses can significantly influence which ones get chosen.

There is a primacy effect in surveys where participants more often choose the first response than a later one. Or if they are satisficing, they can choose the first response that seems good enough rather than the best one.

There can also be a recency effect whereby participants read through a list of choices and choose the last one they have read.

To avoid these effects, randomize your choices whenever they do not have a clear progression or some other reason for being in a particular order. This is easy to do in Questionmark software, and it balances out the effect of response order across your results.
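If you ever need to do this outside of survey software that shuffles choices for you, a minimal Python sketch of per-respondent randomization could look like the following; the options here are invented for illustration and have no natural order:

```python
import random

# Hypothetical survey item with no natural ordering among the options.
options = ["Email", "Phone", "Live chat", "In person"]

def randomized_options(options):
    """Return a per-respondent random ordering of the answer choices."""
    shuffled = list(options)   # copy so the canonical order is kept for reporting
    random.shuffle(shuffled)
    return shuffled

print(randomized_options(options))  # e.g. ['Phone', 'In person', 'Email', 'Live chat']
```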

Here is a link to the next segment of this series: Agree or disagree? 10 tips for better surveys — part 3

Item Analysis Report – Item Difficulty Index

Posted by Austin Fossey

In classical test theory, a common item statistic is the item’s difficulty index, or “p value.” Given many psychometricians’ notoriously poor spelling, might this be due to thinking that “difficulty” starts with p?

Actually, the p stands for the proportion of participants who got the item correct. For example, if 100 participants answered the item, and 72 of them answered the item correctly, then the p value is 0.72. The p value can take on any value between 0.00 and 1.00. Higher values denote easier items (more people answered the item correctly), and lower values denote harder items (fewer people answered the item correctly).
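As a quick illustration of the arithmetic (a toy sketch, not how any particular reporting tool implements it), the hypothetical scores below reproduce the 72-out-of-100 example:

```python
# Toy example: p value as the proportion of participants who answered correctly.
# 1 = correct, 0 = incorrect, one entry per participant (hypothetical data).
scores = [1] * 72 + [0] * 28

p_value = sum(scores) / len(scores)
print(p_value)  # 0.72 -- higher p values indicate easier items
```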

Typically, test developers use this statistic as one indicator for detecting items that could be removed from delivery. They set thresholds for items that are too easy and too difficult, review them, and often remove them from the assessment.

Why throw out the easy and difficult items? Because they are not doing as much work for you. When calculating the item-total correlation (or “discrimination”) for unweighted items, Crocker and Algina (Introduction to Classical and Modern Test Theory) note that discrimination is maximized when p is near 0.50 (about half of the participants get it right).

Why is discrimination so low for easy and hard items? An easy item means that just about everyone gets it right, no matter how proficient they are in the domain, so the item does not discriminate well between high and low performers. The same is true in reverse for a very hard item: nearly everyone gets it wrong, regardless of proficiency. (We will talk more about discrimination in subsequent posts.)
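To see the connection, here is a rough Python sketch of an item-total (point-biserial) correlation for an unweighted item. The response data are invented, and this is only one common way to compute a discrimination index, not necessarily the exact formula behind any given report. Because a 0/1 item's standard deviation is sqrt(p(1 - p)), which peaks at p = 0.50, extreme p values limit how large this correlation can be.

```python
import statistics

def point_biserial(item_scores, total_scores):
    """Pearson correlation between a 0/1 item score and participants' total scores
    (one common way to compute an item-total discrimination index)."""
    n = len(item_scores)
    mean_x = sum(item_scores) / n
    mean_y = sum(total_scores) / n
    cov = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(item_scores, total_scores)) / (n - 1)
    return cov / (statistics.stdev(item_scores) * statistics.stdev(total_scores))

# Hypothetical data: 0/1 scores on one item and the same participants' total scores.
item_scores = [1, 1, 0, 1, 0, 0, 1, 1, 0, 1]
total_scores = [48, 45, 30, 41, 28, 25, 44, 39, 31, 46]
print(round(point_biserial(item_scores, total_scores), 2))  # high positive for this made-up data
```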

Sometimes you may still need to use a very easy or very difficult item on your test form. You may have a blueprint that requires a certain number of items from a given topic, and all of the available items might happen to be very easy or very hard. I also see this scenario in cases with non-compensatory scoring of a topic. For example, a simple driving test might ask, “Is it safe to drink and drive?” The question is very easy and will likely have a high p value, but the test developer may include it so that if a participant gets the item wrong, they automatically fail the entire assessment.

You may also want very easy or very hard items if you are using item response theory (IRT) to score an aptitude test, though it should be noted that item difficulty is modeled differently in an IRT framework. IRT yields standard errors of measurement that are conditional on the participant's ability, so having hard and easy items can help produce better estimates of high- and low-performing participants' abilities, respectively. This is different from classical test theory, where the standard error of measurement is the same for all observed scores on an assessment.
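As a hedged sketch of that point, the code below computes the conditional standard error of measurement from the test information function under a simple 2PL logistic model; the model choice and the item parameters are my own invented example, not output from any particular scoring engine.

```python
import math

def p_correct(theta, a, b):
    """Probability of a correct response under a 2PL logistic model."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def conditional_sem(theta, items):
    """SEM at a given ability: 1 / sqrt(test information), where the 2PL
    item information is a**2 * p * (1 - p)."""
    info = sum(a ** 2 * p_correct(theta, a, b) * (1 - p_correct(theta, a, b))
               for a, b in items)
    return 1.0 / math.sqrt(info)

# Hypothetical item bank: (discrimination a, difficulty b), spanning easy to hard.
items = [(1.2, -2.0), (1.0, -1.0), (1.5, 0.0), (1.1, 1.0), (0.9, 2.0)]
for theta in (-2, -1, 0, 1, 2):
    print(f"theta = {theta:+}: SEM = {conditional_sem(theta, items):.2f}")
```

With only middling-difficulty items in the bank, the SEM at the extremes of ability would be noticeably larger, which is the practical reason for keeping some easy and hard items in an IRT-scored test.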

While simple to calculate, the p value requires cautious interpretation. As Crocker and Algina note, the p value is a function of the number of participants who know the answer to the item plus the number of participants who were able to guess the answer correctly. In an open-response item, that latter group is likely very small (absent any cluing in the assessment form), but in a typical multiple-choice item, a number of participants may answer correctly based on their best educated guess.
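For a rough sense of how much guessing can inflate p, here is a back-of-the-envelope model that assumes participants who do not know the answer guess blindly among the options. That is an oversimplification, and not a formula quoted from Crocker and Algina, but it illustrates the point:

```python
def observed_p(p_know, n_options):
    """Expected p value if non-knowers guess blindly among n_options choices."""
    return p_know + (1 - p_know) / n_options

print(observed_p(0.60, 4))  # 0.70 -- a 4-option item looks easier than the 60% who know it
print(observed_p(0.60, 2))  # 0.80 -- inflation grows as the number of options shrinks
```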

Recall also that p values are statistics—measures from a sample. Your interpretation of a p value should be informed by your knowledge of the sample. For example, if you have delivered an assessment, but only advanced students have been scheduled to take it, then the p value will be higher than it might be when delivered to a more representative sample.

Since the p value is a statistic, we can calculate the standard error of that statistic to get a sense of how stable the statistic is. The standard error will decrease with larger sample sizes. In the example below, 500 participants responded to this item, and 284 participants answered the item correctly, so the p value is 284/500 = 0.568. The standard error of the statistic is ± 0.022. If these 500 participants were to answer this item over and over again (and no additional learning took place), we would expect the p value for this item to fall in the range of 0.568 ± 0.022 about 68% of the time.
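The numbers in that example are consistent with the usual standard error formula for a proportion, sqrt(p(1 - p) / n); I am assuming that formula here for illustration rather than describing the report's internal computation.

```python
import math

n_correct, n_total = 284, 500
p = n_correct / n_total                    # 0.568
se = math.sqrt(p * (1 - p) / n_total)      # standard error of a proportion
print(f"p = {p:.3f}, SE = {se:.3f}")       # p = 0.568, SE = 0.022
```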

Item p value and standard error of the statistic from Questionmark’s Item Analysis Report

Measuring Learning Results: Eight Recommendations for Assessment Designers

Posted by Joan Phaup

Is it possible to build the perfect assessment design? Not likely, given the intricacies of the learning process! But a white paper available on the Questionmark Web site helps test authors respond effectively to the inevitable tradeoffs in order to create better assessments.

Measuring Learning Results, by Dr. Will Thalheimer of Work-Learning Research, considers findings from fundamental learning research and how they relate to assessment. The paper explores how to create assessments that measure how well learning interventions are preparing learners to retrieve information in future situations—which, as Will states, is the ultimate goal of training and education.

The eight bits of wisdom that conclude the paper give plenty of food for thought for test designers. You can download the paper to find out how Will arrived at them.

1. Figure out what learning outcomes you really care about. Measure them. Prioritize the importance of the learning outcomes you are targeting. Use more of your assessment time on high-priority information.

2. Figure out what retrieval situations you are preparing your learners for. Create assessment items that mirror or simulate those retrieval situations.

3. Consider using delayed assessments a week or month (or more) after the original learning ends—in addition to end-of-learning assessments.

4. Consider using delayed assessments instead of end-of-learning assessments, but be aware that there are significant tradeoffs in using this approach.

5. Utilize authentic questions, decisions, or demonstrations of skill that require learners to retrieve information from memory in a way that is similar to how they’ll have to retrieve it in the retrieval situations for which you are preparing them. Simulation-like questions that provide realistic decisions set in real-world contexts are ideal.

6. Cover a significant portion of the most important learning points you want your learners to understand or be able to utilize. This will require you to create a list of the objectives that will be targeted by the instruction.

7. Avoid factors that will bias your assessments. Or, if you can’t avoid them, make sure you understand them, mitigate them as much as possible, and report their influence. Beware of the biasing effects of end-of-learning assessments, pretests, assessments given in the learning context, and assessment items that are focused on low-level information.

8. Follow all the general rules about how to create assessment items. For example, write clearly, use only plausible alternatives (for multiple-choice questions), pilot-test your assessment items to improve them, and utilize psychometric techniques where applicable.