Item Development – Summary and Conclusions

Posted by Austin Fossey

This post concludes my series on item development in large-scale assessment. I’ve discussed some key processes in developing items, including drafting items, reviewing items, editing items, and conducting an item analysis. The goal of this process is to fine-tune a set of items so that test developers have an item pool from which they can build forms for scored assessment while being confident about the quality, reliability, and validity of the items. While the series covered a variety of topics, there are a couple of key themes that were relevant to almost every step.

First, documentation is critical, and even though it seems like extra work, it does pay off. Documenting your item development process keeps things organized and helps you reproduce processes should you need to conduct development again. Documentation is also important for accountability. As noted in the posts about content review and bias review, checklists can help ensure that committee members consider a minimal set of criteria for every item, but they also provide you with documentation of each committee member’s ratings should the item ever be challenged. All of this documentation can be thought of as validity evidence: it helps support your claims about the results and refute rebuttals about possible flaws in the assessment’s content.

The other key theme is the importance of recruiting qualified and representative subject matter experts (SMEs). SMEs should be qualified to participate in their assigned task, but diversity is also an important consideration. You may want to select item writers with a variety of experience levels, or content experts who have different backgrounds. Your bias review committee should be made up of experts who can help identify both content and response bias across the demographic areas that are pertinent to your population. Where possible, it is best to keep your SME groups independent so that you do not have the same people responsible for different parts of the development cycle. As always, be sure to document the relevant demographics and qualifications of your SMEs, even if you need to keep their identities anonymous.

This series is an introduction to organizing an item development cycle, but I encourage readers to refer to the resources mentioned in the articles for more information. This series also served as the basis for a session at the 2015 Questionmark Users Conference, which Questionmark customers can watch in the Premium section of the Learning Café.

You can link back to all of the posts in this series by clicking on the links below, and if you have any questions, please comment below!

Item Development – Managing the Process for Large-Scale Assessments

Item Development – Training Item Writers

Item Development – Five Tips for Organizing Your Drafting Process

Item Development – Benefits of editing items before the review process

Item Development – Organizing a content review committee (Part 1)

Item Development – Organizing a content review committee (Part 2)

Item Development – Organizing a bias review committee (Part 1)

Item Development – Organizing a bias review committee (Part 2)

Item Development – Conducting the final editorial review

Item Development – Planning your field test study

Item Development – Psychometric review

Item Development – Managing the Process for Large-Scale Assessments

Posted by Austin Fossey

Whether you work with low-stakes assessments, small-scale classroom assessments or large-scale, high-stakes assessment, understanding and applying some basic principles of item development will greatly enhance the quality of your results.

This is the first in a series of posts setting out item development steps that will help you create defensible assessments. Although I’ll be addressing the requirements of large-scale, high-stakes testing, the fundamental considerations apply to any assessment.

You can find previous posts here about item development including how to write items, review items, increase complexity, and avoid bias. This series will review some of what’s come before, but it will also explore new territory. For instance, I’ll discuss how to organize and execute different steps in item development with subject matter experts. I’ll also explain how to collect information that will support the validity of the results and the legal defensibility of the assessment.

In this series, I’ll take a look at:

[Image: item development steps]

These are common steps (adapted from Crocker and Algina’s Introduction to Classical and Modern Test Theory) taken to create the content for an assessment. Each step requires careful planning, implementation, and documentation, especially for high-stakes assessments.

This looks like a lot of steps, but item development is just one slice of assessment development. Before item development can even begin, there’s plenty of work to do!

In their article, Design and Discovery in Educational Assessment: Evidence-Centered Design, Psychometrics, and Educational Data Mining, Mislevy, Behrens, Dicerbo, and Levy provide an overview of Evidence-Centered Design (ECD). In ECD, test developers must define the purpose of the assessment, conduct a domain analysis, model the domain, and define the conceptual assessment framework before beginning assessment assembly, which includes item development.

Once we’ve completed these preparations, we are ready to begin item development. In the next post, I will discuss considerations for training our item writers and item reviewers.

Item Analysis Report Revisited

Posted by Austin Fossey

If you are a fanatical follower of our Questionmark blog, then you already know that we have written more than a dozen articles relating to item analysis in a Classical Test Theory framework. So you may ask, “Austin, why does Questionmark write so much about item analysis statistics? Don’t you ever get out?”

Item analysis statistics are some of the easiest-to-use indicators of item quality, and these are tools that any test developer should be using in their work. By helping people understand these tools, we can help them get the most out of our technologies. And yes, I do get out. I went out to get some coffee once last April.

So why are we writing about item analysis statistics again? Since publishing many of the original blog articles about item analysis, Questionmark has built a new version of the Item Analysis Report in Questionmark Analytics, adding filtering capabilities beyond those of the original Question Statistics Report in Enterprise Reporter.

In my upcoming posts, I will revisit the concepts of item difficulty, item-total score correlation, and high-low discrimination in the context of the Item Analysis Report in Analytics. I will also provide an overview of item reliability and how it would be used operationally in test development.

Screenshot of the Item Analysis Report (Summary View) in Questionmark Analytics

Writing Good Surveys, Part 6: Tips for the form of the survey

Posted By Doug Peterson

In this final installment of this series, we’ll take a look at some tips for the form of the survey itself.

The first suggestion is to avoid labeling sections of questions. Studies have shown that when it is obvious that a series of questions belong to a group, respondents tend to answer all the questions in the group the same way they answer the first question in the group. The same is true with visual formatting, like putting a box around a group of questions or extra space between groups. It’s best to just present all of the questions in a simple, sequentially numbered list.

As much as possible, keep questions at about the same length, and present the same number of questions (roughly, it doesn’t have to be exact) for each topic. Longer questions or more questions on a topic tend to require more reflection by the respondent, and tend to receive higher ratings. I suspect this might have something to do with the respondent feeling like the question or group of questions is more important (or at least more work) because it is longer, possibly making them hesitant to give something “important” a negative rating.

It is important to collect demographic information as part of a survey. However, a suspicion that he or she can be identified can definitely skew a respondent’s answers. Put the demographic information at the end of the survey to encourage honest responses to the preceding questions. Make as much of the demographic information optional as possible, and if the answers are collected and stored anonymously, assure the respondent of this. If you don’t absolutely need a piece of demographic information, don’t ask for it. The more anonymous the respondent feels, the more honest he or she will be.

Group questions with the same response scale together and present them in a matrix format. This reduces the cognitive load on the respondent; the response possibilities do not have to be figured out on each individual question, and the easier it is for respondents to fill out the survey, the more honest and accurate they will be. If you do not use the matrix format, consider listing the response scale choices vertically instead of horizontally. A vertical orientation clearly separates the choices and reduces the chance of accidentally selecting the wrong choice. And regardless of orientation, be sure to place more space between questions than between a question and its response scale.

I hope you’ve enjoyed this series on writing good surveys. I also hope you’ll join us in San Antonio in March 2014 for our annual Users Conference – I’ll be presenting a session on writing assessment and survey items, and I’m looking forward to hearing ideas and feedback from those in attendance!

Problem Questions and Summary – Item Writing Guide, Part 5

Posted By Doug Peterson

Let’s look at two more item writing problems. These last two are a little controversial.

[Image: example question 5-a]

The stimulus for this question tells a wonderful story. The problem is, the first three sentences contain no information that relates to the question. A long stimulus full of extraneous, unneeded information can easily distract or confuse the test-taker. This item needs a re-write of the stimulus to get directly to the question at hand – and nothing else. Let’s change it to “Which Questionmark video explains how to use assessments in solving business problems?”

But here’s the controversy: This question is just fine if you’re trying to ascertain the test-taker’s ability to recognize pertinent information and ignore extraneous information! Therefore I won’t advise that you *never* use a question like this, only that you make sure you use it in the right situation.

And now, on to our last question in this series of posts.

[Image: example question 5-b]

At first glance, there doesn’t appear to be a problem with this question – no repetition of a keyword, distracters are the same length, no grammar inconsistencies, short and to the point… But note the word “not” in the stimulus.

The other questions we’ve looked at in part 3 and part 4 of the series ask the test-taker to find the *correct* answer, but this question suddenly has them looking for an *incorrect* answer. This requires the test-taker to reverse their approach to the question, which can be very confusing.

That being said, there are some who advocate putting a certain number of negative questions on a *survey* to help ensure that the person filling it out is paying attention and not just flying through the questions. I’m not sure I agree with this approach. I feel that if they’re not interested and not paying attention to what they’re doing, negative questions aren’t going to change that, but they could lead to some very bad data being collected.

When it comes to quizzes, tests and exams, especially high-stakes exams, I strongly advise against using negative questions. If you absolutely must use a negative question, emphasize the negative by using all capital letters, bold it, and maybe even underline it.

So let’s pull it all together. It’s important to be fair to both the test-taker and the testing organization.

  • The test-taker should only be tested for the knowledge, skills or abilities in question, and nothing else.
  • The testing organization needs to be assured that the assessment accurately and reliably measures the test-taker’s knowledge, skills or abilities.

To do this, your assessment needs to be made up of well-written questions. To write good assessment questions:

  • Be careful with your wording so that you don’t create overly long or confusing questions.
  • Be concise. Sentences should be as short as possible while still posing the question clearly.
  • Keep it simple. Avoid compound sentences and use short, commonly used words whenever possible. Technical terminology is acceptable if it is part of what
    the test measures.
  • Make sure each question has a specific focus, and that you’re not actually testing multiple pieces of knowledge in a single question.
  • Always use positive phrasing to avoid confusion. If you have no choice but to use negative phrasing, make sure that the negative word – for example,
    “not” – is emphasized with capital letters, bold font, and/or underlining.
  • When creating distracters:
      • keep them all the same relative length,
      • keep them as short as possible,
      • avoid using keywords from the stimulus,
      • watch out for grammatical cues, and
      • make sure that all distracters are reasonable answers within the context of the question.

As always, feel free to leave your comments, or contact me directly at

Evaluating the Test — Test Design & Delivery Part 10

Posted By Doug Peterson

In this, the 10th and final installment of the Test Design and Delivery series, we take a look at evaluating the test. Statistical analysis improves as the number of test takers goes up, but data from even a few attempts can provide useful information. In most cases, we recommend performing analysis on data from at least 100 participants; data from 250 or more is considered more trustworthy.

Analysis falls into two categories: item statistics and analysis (the performance of an individual item), and test analysis (the performance of the test as a whole). Questionmark provides both of these analyses in our Reports and Analytics suites.

Item statistics provide information on things like how many times an item has been presented and how many times each choice has been selected. This information can point out a number of problems:

  • An item that has been presented a lot may need to be retired. There is no hard and fast number as far as how many presentations is too many, but items on a high-stakes test should be changed fairly frequently.
  • If the majority of test-takers are getting the question wrong but they are all selecting the same choice, the wrong choice may be flagged as the correct answer, or the training might be teaching the topic incorrectly.
  • If no choice is being selected a majority of the time, it may indicate that the test-takers are guessing, which could in turn indicate a problem with the training. It could also indicate that no choice is completely correct.
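
The checks above can be sketched in a few lines of Python. This is only an illustration, not how Questionmark's reports work: the function name `choice_distribution`, the flag wording, and the 50% "majority" cut-off are all assumptions made for the example.

```python
from collections import Counter


def choice_distribution(responses, keyed_choice):
    """Summarise how often each choice was selected for a single item.

    responses    -- list of the choices selected, one entry per presentation
    keyed_choice -- the choice currently flagged as correct
    Both the function and its flag messages are hypothetical.
    """
    if not responses:
        raise ValueError("no responses recorded for this item")

    counts = Counter(responses)
    presented = len(responses)
    shares = {choice: n / presented for choice, n in counts.items()}

    # The most frequently selected choice and how often it was picked.
    top_choice, top_count = counts.most_common(1)[0]

    flags = []
    if top_count / presented > 0.5:
        # Most test-takers agree on one choice, but it is not the keyed one.
        if top_choice != keyed_choice:
            flags.append("majority picked a non-keyed choice")
    else:
        # Nobody agrees: possible guessing, or no completely correct choice.
        flags.append("no choice has a majority")

    return {"presented": presented, "shares": shares, "flags": flags}


# Made-up data: 7 of 10 test-takers chose "B", but "A" is keyed as correct.
report = choice_distribution(["B"] * 7 + ["A"] * 2 + ["C"], keyed_choice="A")
print(report)
```

A flag here is a prompt to look at the item, the answer key, and the training, not a verdict on any of them.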

Item analysis typically provides two key pieces of information: the Difficulty Index and the Point-Biserial Correlation.

  • Difficulty index: P value = % who answered correctly
      • Too high = too easy
      • Too low = too hard, confusing or misleading, problem with content or instruction
  • Point-Biserial Correlation: how well the item discriminated between those who did well on the exam and those who did not
      • Positive value = those who got the item correct also did well on the exam, and those who got the item wrong also did poorly on the exam
      • Negative value = those who did well on the test got the item wrong, those who did poorly on the test got the item right
      • +0.10 or above is typically required to keep an item
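
For readers who want to see the arithmetic, here is a minimal sketch of both statistics in Python, using made-up data. The function names are hypothetical, and the sketch assumes that each item is scored 0 or 1 and that the total score includes the item itself (some tools exclude it).

```python
from statistics import mean, pstdev


def difficulty_index(item_scores):
    """P value: the proportion of test-takers who answered correctly (0/1 scores)."""
    return mean(item_scores)


def point_biserial(item_scores, total_scores):
    """Correlation between a 0/1 item score and the total test score."""
    p = mean(item_scores)
    if p in (0, 1):
        raise ValueError("everyone got the same item score; correlation is undefined")
    sd_total = pstdev(total_scores)
    if sd_total == 0:
        raise ValueError("all total scores are identical; correlation is undefined")

    # Mean total score of those who got the item right versus wrong.
    right = [t for i, t in zip(item_scores, total_scores) if i == 1]
    wrong = [t for i, t in zip(item_scores, total_scores) if i == 0]

    return (mean(right) - mean(wrong)) / sd_total * (p * (1 - p)) ** 0.5


# Made-up data for five test-takers.
item = [1, 1, 1, 0, 0]      # item scores
totals = [9, 8, 7, 5, 3]    # total test scores

p_value = difficulty_index(item)
r_pb = point_biserial(item, totals)
keep_item = r_pb >= 0.10    # the rule of thumb quoted above

print(p_value, round(r_pb, 3), keep_item)
```

With only five test-takers the numbers are purely illustrative; real decisions need far more data.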

Test analysis typically comes down to determining a Reliability Coefficient. In other words, does the test measure knowledge consistently – does it produce similar results under consistent conditions? (Please note that this has nothing to do with validity. Reliability does not address whether or not the assessment tests what it is supposed to be testing. Reliability only indicates that the assessment will return the same results consistently, given the same conditions.)

  • Reliability Coefficient: range of 0 – 1.00
  • Acceptable value depends on consequences of testing error
      • If failing means having to take some training again, a lower value might be acceptable
      • If failing means the health and safety of coworkers might be in jeopardy, a high value is required

There are a number of different types of consistency:

  • Test – Retest: repeatability of test scores with the passage of time
  • Alternate / Parallel Form: consistency of score across two or more forms by same test taker
  • Inter-Rater: consistency of test score when rated by different raters
  • Internal Consistency: extent to which items on a test measure the same thing
      • Most common: Kuder Richardson-20 (KR-20) or Coefficient Alpha
      • Items must be single answer (right/wrong)
      • May be low if test measures several different, unrelated objectives
      • Low value can also indicate many very easy or hard items, poorly written items that do not discriminate well, or items that do not test the proper content
  • Mastery Classification Consistency
      • Criterion-referenced tests
      • Not affected by items measuring unrelated items
      • 3 common measures:
          • Phi coefficient
          • Agreement coefficient
          • Kappa
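
As an illustration of internal consistency, KR-20 can be computed from a table of right/wrong item scores with the standard formula KR-20 = k/(k-1) × (1 − Σpq / variance of total scores), where k is the number of items, p is the proportion answering an item correctly, and q = 1 − p. The sketch below is a minimal example with made-up data; the function name `kr20` is hypothetical, and it uses the population variance of the total scores, which is one common convention.

```python
from statistics import pvariance


def kr20(score_matrix):
    """Kuder-Richardson 20 for a list of rows of 0/1 item scores.

    score_matrix -- one row per test-taker, one column per item
    """
    n_takers = len(score_matrix)
    n_items = len(score_matrix[0])
    if n_items < 2:
        raise ValueError("KR-20 needs at least two items")

    # Variance of the total scores across test-takers.
    totals = [sum(row) for row in score_matrix]
    var_total = pvariance(totals)
    if var_total == 0:
        raise ValueError("all total scores are identical; KR-20 is undefined")

    # Sum of p * q over items, where p is the proportion answering correctly.
    sum_pq = 0.0
    for j in range(n_items):
        p = sum(row[j] for row in score_matrix) / n_takers
        sum_pq += p * (1 - p)

    return n_items / (n_items - 1) * (1 - sum_pq / var_total)


# Made-up data: five test-takers, four right/wrong items.
scores = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]
reliability = kr20(scores)
print(round(reliability, 2))
```

Whether a given value is acceptable still depends on the consequences of testing error, as described above.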

Doug will share these and other best practices for test design and delivery at the Questionmark Users Conference in Baltimore March 3-6. The program includes an optional pre-conference workshop on Criterion-Referenced Test Development led by Sharon Shrock and Bill Coscarelli. Click here for conference and workshop registration.