Item analysis: Selecting items for the test form – Part 2

Posted by Austin Fossey

In my last post, I talked about how item discrimination is the primary statistic used for item selection in classical test theory (CTT). In this post, I will share an example from my item analysis webinar.

The assessment below is fake, so there’s no need to write in comments telling me that the questions could be written differently or that the test is too short or that there is not good domain representation or that I should be banished to an island.

Interested in learning more about item analysis? I will be presenting a series of workshops on this topic at the Questionmark Conference 2016: Shaping the Future of Assessment in Miami, April 12-15.

In this example, we have field tested 16 items and collected item statistics from a representative sample of 1,000 participants. In this hypothetical scenario, we have been asked to create an assessment that has 11 items instead of 16. We will begin by looking at the item discrimination statistics.

Since this test has fewer than 25 items, we will look at the item-rest correlation discrimination. The screenshot below shows the first five items from the summary table in Questionmark’s Item Analysis Report (I have omitted some columns to help display the table within the blog).
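For readers who want to see what the item-rest correlation measures, here is a minimal sketch in Python. It assumes dichotomously scored (0/1) items and uses a small made-up response matrix; the function names are my own for illustration and are not part of any Questionmark report or API. Each item is correlated with the total score on the remaining items, so the item does not inflate its own discrimination value.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return float("nan")  # undefined when either variable is constant
    return cov / sqrt(var_x * var_y)

def item_rest_correlations(scores):
    """scores: one row per participant, one 0/1 item score per column."""
    n_items = len(scores[0])
    results = []
    for i in range(n_items):
        item = [row[i] for row in scores]
        # "Rest" score: the participant's total with this item left out.
        rest = [sum(row) - row[i] for row in scores]
        results.append(pearson(item, rest))
    return results

# Hypothetical responses: 6 participants x 4 items (not data from the webinar)
responses = [
    [1, 1, 1, 0],
    [1, 1, 0, 1],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 0, 1],
    [0, 0, 0, 0],
]

for number, r in enumerate(item_rest_correlations(responses), start=1):
    print(f"Item {number}: item-rest correlation = {r:.3f}")
```

In this toy data set, the fourth item comes out with a negative item-rest correlation, which is the kind of value that would flag an item for review.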

[Screenshot: first five items from the summary table in the Item Analysis Report]

The test’s reliability (as measured by Cronbach’s Alpha) for all 16 items is 0.58. Note that one would typically need a reliability value of at least 0.70 for low-stakes assessments and 0.90 or higher for high-stakes assessments. When reliability is too low, adding items can often help, but removing items with poor discrimination can also improve reliability.
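As a rough illustration of how Cronbach's Alpha is calculated, the sketch below applies the standard formula (the number of items, the sum of the item variances, and the variance of the total scores) to a small made-up response matrix. The data and function names are hypothetical and are not taken from the Item Analysis Report.

```python
def variance(values):
    """Sample variance (n - 1 denominator)."""
    n = len(values)
    mean = sum(values) / n
    return sum((v - mean) ** 2 for v in values) / (n - 1)

def cronbach_alpha(scores):
    """scores: one row per participant, one item score per column."""
    k = len(scores[0])
    item_variances = [variance([row[i] for row in scores]) for i in range(k)]
    total_variance = variance([sum(row) for row in scores])
    return (k / (k - 1)) * (1 - sum(item_variances) / total_variance)

# Hypothetical responses: 6 participants x 4 items
responses = [
    [1, 1, 1, 0],
    [1, 1, 0, 1],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 0, 1],
    [0, 0, 0, 0],
]

alpha_all = cronbach_alpha(responses)
# Drop the last item, which discriminates poorly in this toy data set.
alpha_trimmed = cronbach_alpha([row[:3] for row in responses])
print(f"Alpha with all 4 items: {alpha_all:.2f}")
print(f"Alpha after dropping item 4: {alpha_trimmed:.2f}")
```

In this toy example the shorter form has the higher alpha, because the dropped item was adding variance to the item scores without adding consistent variance to the total score.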

If we remove the five items with the lowest item-rest correlation discrimination (items 9, 16, 2, 3, and 13 shown above), the remaining 11 items have an alpha value of 0.67. That is still not high enough for even low-stakes testing, but it illustrates how items with poor discrimination can lower the reliability of an assessment. Low reliability also increases the standard error of measurement, so by increasing the reliability of the assessment, we might also increase the accuracy of the scores.
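To make the link between reliability and the standard error of measurement concrete, here is a small sketch using the classical formula: SEM equals the score standard deviation times the square root of one minus the reliability. The score standard deviation for this example is not given in the post, so the sketch assumes a value of 1.0 and reports the SEM in standard deviation units. Keep in mind that the raw-score standard deviation will differ between a 16-item and an 11-item form.

```python
from math import sqrt

def sem(score_sd, reliability):
    """Standard error of measurement under classical test theory."""
    return score_sd * sqrt(1 - reliability)

# Alpha values from the example above; SD of 1.0 is an assumption
# so that the result is expressed in SD units.
for label, alpha in [("16 items", 0.58), ("11 items", 0.67)]:
    print(f"{label}: SEM = {sem(1.0, alpha):.3f} SD units")
```

Under that assumption, the SEM falls from roughly 0.65 to roughly 0.57 standard deviation units when the five weakest items are removed.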

Notice that these five items have poor item-rest correlation statistics, yet four of those items have reasonable item difficulty indices (items 16, 2, 3, and 13). If we had made selection decisions based on item difficulty, we might have chosen to retain these items, though closer inspection would uncover some content issues, as I demonstrated during the item analysis webinar.

For example, consider item 3, which has a difficulty value of 0.418 and an item-rest correlation discrimination value of -0.02. The screenshot below shows the option analysis table from the item detail page of the report.


The option analysis table shows that, when asked about the easternmost state in the United States, many participants are selecting the key, “Maine,” but 43.3% of our top-performing participants (defined by the upper 27% of scores) selected “Alaska.” This indicates that some of the top-performing participants might be familiar with Pochnoi Point, a spot on Alaska’s Semisopochnoi Island that happens to sit on the other side of the 180th meridian. Sure, that is a technicality, but across the entire sample, 27.8% of the participants chose this option. This item clearly needs to be sent back for revision and clarification before we use it for scored delivery. If we had only looked at the item difficulty statistics, we might never have reviewed this item.
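The logic behind an option analysis table can be sketched in a few lines of Python. This is only an illustration with made-up records; the function name, the data layout, and the way ties at the cut point are handled (here, by sort order) are my own assumptions and do not describe how the report is implemented.

```python
from collections import Counter

def option_analysis(records, upper_fraction=0.27):
    """records: list of (total_score, selected_option) pairs for one item."""
    ranked = sorted(records, key=lambda rec: rec[0], reverse=True)
    n_upper = max(1, round(len(ranked) * upper_fraction))
    upper = ranked[:n_upper]  # top-scoring group; ties broken by sort order
    overall_counts = Counter(option for _, option in records)
    upper_counts = Counter(option for _, option in upper)
    return {
        option: {
            "overall": overall_counts[option] / len(records),
            "upper": upper_counts[option] / n_upper,
        }
        for option in sorted(overall_counts)
    }

# Hypothetical records: (total test score, option chosen on this item)
records = [
    (10, "Alaska"), (9, "Maine"), (8, "Alaska"), (7, "Maine"), (6, "Maine"),
    (5, "Hawaii"), (4, "Maine"), (3, "Alaska"), (2, "Maine"), (1, "Florida"),
]

for option, shares in option_analysis(records).items():
    print(f"{option}: overall {shares['overall']:.1%}, upper group {shares['upper']:.1%}")
```

A distractor that is more popular in the upper group than in the sample as a whole, as "Alaska" is in this toy data, is a signal that the item deserves a content review.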


Item analysis: Selecting items for the test form – Part 1

Regular readers of our blog know that we ran an initial series on item analysis way back in the day, and then I did a second item analysis series building on that a couple of years ago, and then I discussed item analysis in our item development series, and then we had an amazing webinar about item analysis, and then I named my goldfish Item Analysis and wrote my senator requesting that our state bird be changed to an item analysis. So today, I would like to talk about . . . item analysis.

But don’t worry, this is actually a new topic for the blog.


Today, I am writing about the use of item statistics for item selection. I was surprised to find, from feedback we got from many of our webinar participants, that a lot of people do not look at their item statistics until after the test has been delivered. This is a great practice (so keep it up), but if you can try out the questions as unscored field test items before making your final test form, you can use the item analysis statistics to build a better instrument.

When building a test form, item statistics can help us in two ways.

  • They can help us identify items that are poorly written, miskeyed, or irrelevant to the construct.
  • They can help us select the items that will yield the most reliable instrument, and thus a more accurate score.

In the early half of the 20th century, it was common belief that good test instruments should have a mix of easy, medium, and hard items, but this thinking began to change after two studies in 1952 by Fred Lord and by Lee Cronbach and Willard Warrington. These researchers (and others since) demonstrated that items with higher discrimination values create instruments whose total scores discriminate better among participants across all ability levels.

Sometimes easy and hard items are useful for measurement, such as in an adaptive aptitude test where we need to measure all abilities with similar precision. But in criterion-referenced assessments, we are often interested in correctly classifying those participants who should pass and those who should fail. If this is our goal, then the best test form will be one with a range of medium-difficulty items that also have high discrimination values.

Discrimination may be the primary statistic used for selecting items, but item reliability is also occasionally useful, as I explained in an earlier post. Item reliability can be used as a tie breaker when we need to choose between two items with the same discrimination, or it can be used to predict the reliability or score variance for a set of items that the test developer wants to use for a test form.
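Here is a minimal sketch of why item reliability can predict score variance. It assumes the common classical test theory definition of the item reliability index (the item's standard deviation multiplied by its item-total correlation); under that definition, the indices for a set of items sum to the standard deviation of the total score on those items. The data and function names are hypothetical.

```python
from math import sqrt
from statistics import pstdev

def pearson(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

def item_reliability_indices(scores):
    """Item SD times item-total correlation, for each item column."""
    totals = [sum(row) for row in scores]
    indices = []
    for i in range(len(scores[0])):
        item = [row[i] for row in scores]
        indices.append(pstdev(item) * pearson(item, totals))
    return indices

# Hypothetical responses: 6 participants x 4 items
responses = [
    [1, 1, 1, 0],
    [1, 1, 0, 1],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 0, 1],
    [0, 0, 0, 0],
]

indices = item_reliability_indices(responses)
total_sd = pstdev([sum(row) for row in responses])
print("Item reliability indices:", [round(v, 3) for v in indices])
print(f"Sum of indices: {sum(indices):.3f}; total score SD: {total_sd:.3f}")
```

Because the sum matches the total score standard deviation, a test developer can compare the indices of candidate items to see which ones contribute most to the spread of total scores.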

Difficulty is still useful for flagging items, though an item flagged for being too easy or too hard will often have a low discrimination value too. If an easy or hard item has good discrimination, it may be worth reviewing for item flaws or other factors that may have impacted the statistics (e.g., was it given at the end of a timed test that did not give participants enough time to respond carefully?).

In my next post, I will share an example from the webinar of how item selection using item discrimination improves the test form reliability, even though the test is shorter. I will also share an example of a flawed item that exhibits poor item statistics.




Know what your questions are about before you deliver the test

Posted by Austin Fossey

A few months ago, I had an interesting conversation with an assessment manager at an educational institution—not a Questionmark customer, mind you. Finding nothing else in common, we eventually began discussing assessment design.

At this institution (which will remain anonymous), he admitted that they are often pressed for time in their assessment development cycle. There is not enough time to do all of the item development work they need to do before their students take the assessment. To get around this, their item writers draft all of the items, conduct an editorial review, and then deliver the items. The items are assigned topics after administration, and students’ total scores and topic scores are calculated from there. He asked me if Questionmark software allows test developers to assign topics and calculate topic scores after assessing the students, and I answered truthfully that it does not.

But why not? Is there a reason test developers should not do what is being practiced at this institution? Yes, there are in fact two reasons. Get ready for some psychometric finger-wagging.

Consider what this institution is doing. The items are drafted and subjected to an editorial review, but no one ever classifies the items within a topic until after the test has been administered. Recall what people typically do during a content review prior to administration:

  • Remove items that are not relevant to the domain.
  • Ensure that the blueprint is covered.
  • Check that items are assigned to the correct topic.

If topics are not assigned until after the participants have already tested, we risk the validity of the results and the legal defensibility of the test. If we have delivered items that are not relevant to the domain, we have wasted participants’ time and will need to adjust their total score. Okay, we can manage that by telling the participants ahead of time that some of the test items might not count. But if we have not asked the correct number of questions for a given area of the blueprint, the entire assessment score will be worthless—a threat to validity known as construct underrepresentation or construct deficiency in The Standards for Educational and Psychological Testing.

For example, if we were supposed to deliver 20 items from Topic A, but find out after the fact that only 12 items have been classified as belonging to Topic A, then there is little we can do about it besides rebuilding the test form and making everyone take the test again.
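A check like this is easy to run before delivery if items already carry topic assignments. The sketch below is a hypothetical illustration in Python: the blueprint, the item identifiers, and the function name are made up, and it does not represent any Questionmark feature.

```python
from collections import Counter

def blueprint_shortfalls(blueprint, item_topics):
    """Return {topic: number of items still needed} for under-covered topics.

    blueprint: {topic: required item count}
    item_topics: {item_id: assigned topic}
    """
    counts = Counter(item_topics.values())
    return {
        topic: required - counts[topic]
        for topic, required in blueprint.items()
        if counts[topic] < required
    }

# Hypothetical blueprint and draft form
blueprint = {"Topic A": 20, "Topic B": 10}
draft_form = {f"A{n}": "Topic A" for n in range(12)}
draft_form.update({f"B{n}": "Topic B" for n in range(10)})

for topic, missing in blueprint_shortfalls(blueprint, draft_form).items():
    print(f"{topic} is short by {missing} items")
```

Run against a draft form, this kind of check surfaces the 20-versus-12 problem while there is still time to write or reassign items.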

The Standards provide helpful guidance in these matters. For this particular case, the Standards point out that:

“The test developer is responsible for documenting that the items selected for the test meet the requirements of the test specifications. In particular, the set of items selected for a new test form . . . must meet both content and psychometric specifications.” (p. 82)

Publications describing best practices for test development also specify that the content must be determined before delivering an operational form. For example, in their chapter in Educational Measurement (4th Edition), Cynthia Schmeiser and Catherine Welch note the importance of conducting a content review of items before field testing, as well as a final content review of a draft test form before it becomes operational.

In Introduction to Classical and Modern Test Theory, Linda Crocker and James Algina also made an interesting observation about classroom assessments, noting that students expect to be graded on all of the items they have been asked to answer. Even if notified in advance that some items might not be counted (as one might do in field testing), students might not consider it fair that their score is based on a yet-to-be-determined subset of items that may not fully represent the content that is supposed to be covered.

This is why Questionmark’s software is designed the way it is. When creating an item, item writers must assign an item to a topic, and items can be classified or labeled along other dimensions (e.g., cognitive process) using metatags. Even if an assessment program cannot muster any further content review, at least the item writer has classified items by content area. The person building the test form then has the information they need to make sure that the right questions get asked.

We have a responsibility as test developers to treat our participants fairly and ethically. If we are asking them to spend their time taking a test, then we owe them the most useful measurement that we can provide. Participants trust that we know what we are doing. If we postpone critical, basic development tasks like content identification until after participants have already given us their time, we are taking advantage of that trust.

Writing JTA Task Statements

Posted by Austin Fossey

One of the first steps in an evidence-centered design (ECD) approach to assessment development is a domain analysis. If you work in credentialing, licensure, or workplace assessment, you might accomplish this step with a job task analysis (JTA) study.

A JTA study gathers examples of tasks that potentially relate to a specific job. These tasks are typically harvested from existing literature or observations, reviewed by subject matter experts (SMEs), and rated by practitioners or other stakeholder groups across relevant dimensions (e.g., applicability to the job, frequency of the task). The JTA results are often used later to determine the content areas, cognitive processes, and weights that will be on the test blueprint.

Questionmark has tools for authoring and delivering JTA items, as well as some limited analysis tools for basic response frequency distributions. But if we are conducting a JTA study, we need to start at the beginning: how do we write task statements?

One of my favorite sources on the subject is Mark Raymond and Sandra Neustel’s chapter, “Determining the Content of Credentialing Examinations,” in The Handbook of Test Development. The chapter provides information on how to organize a JTA study, how to write tasks, how to analyze the results, and how to use the results to build a test blueprint. The chapter is well written and easy to understand. It provides enough detail to make it useful without being too dense. If you are conducting a JTA study, I highly recommend checking out this chapter.

Raymond and Neustel explain that a task statement can refer to a physical or cognitive activity related to the job/practice. The format of a task statement should always follow a subject/verb/object format, though it might be expanded to include qualifiers for how the task should be executed, the resources needed to do the task, or the context of its application. They also underscore that most task statements should have only one action and one object. There are some exceptions to this rule, but if there are multiple actions and objects, they typically should be split into different tasks. As a hint, they suggest critiquing any task statement that has the words “and” or “or” in it.
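The "and"/"or" hint lends itself to a quick automated screen of a draft task list. The sketch below is my own illustration in Python, not something from Raymond and Neustel's chapter; it only flags statements for a human to critique, since some statements with a conjunction are legitimate exceptions.

```python
import re

# Match "and" / "or" only as whole words, so "for" or "order" is not flagged.
CONJUNCTION = re.compile(r"\b(and|or)\b", re.IGNORECASE)

def needs_review(task_statement):
    """True if the statement contains a standalone 'and' or 'or'."""
    return bool(CONJUNCTION.search(task_statement))

# Hypothetical draft task statements
drafts = [
    "Measure skid marks for calculation of approximate vehicle speed",
    "Inspect and repair brake lines",
    "Order replacement parts",
]

for statement in drafts:
    flag = "REVIEW" if needs_review(statement) else "ok"
    print(f"[{flag}] {statement}")
```

The whole-word match matters here: a plain substring search would flag almost every statement, because "or" hides inside words like "for".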

Here is an example of a task statement from the Michigan Commission on Law Enforcement Standards’ Statewide Job Analysis of the Patrol Officer Position: Task 320: “[The patrol officer can] measure skid marks for calculation of approximate vehicle speed.”

I like this example because it is pretty specific, certainly better than just saying “determine vehicle’s speed.” It also provides a qualifier for how good their measurement needs to be (“approximate”). The statement might be improved by adding more context (e.g., “using a tape measure”), but that might be understood by their participant population.

Raymond and Neustel also caution researchers to avoid words that might have multiple meanings or vague meanings. For example, the verb “instruct” could mean many different things—the practitioner might be giving some on-the-fly guidance to an individual or teaching a multi-week lecture. Raymond and Neustel underscore the difficult balance of writing task statements at a level of granularity and specificity that is appropriate for accomplishing defined goals in the workplace, but at a high enough level that we do not overwhelm the JTA participants with minutiae. The authors also advise that we avoid writing task statements that describe best practice or that might otherwise yield a biased positive response.

Early in my career, I observed a JTA SME meeting for an entry-level credential in the construction industry. In an attempt to condense the task list, the psychometrician on the project combined a bunch of seemingly related tasks into a single statement—something along the lines of “practitioners have an understanding of the causes of global warming.” This is not a task statement; it is a knowledge statement, and it would be better suited for a blueprint. It is also not very specific. But most important, it yielded a biased response from the JTA survey sample. This vague statement had the words “global warming” in it, which many would agree is a pretty serious issue, so respondents ranked it as of very high importance. The impact was that this task statement heavily influenced the topic weighting of the blueprint, but when it came time to develop the content, there was not much that could be written. Item writers were stuck having to write dozens of items for a vague yet somehow very important topic. They ended up churning out loads of questions about one of the few topics that were relevant to the practice: refrigerants. The end result was a general knowledge assessment with tons of questions about refrigerants. This experience taught me how a lack of specificity and the phrasing of task statements can undermine the entire content validity argument for an assessment’s results.

If you are new to JTA studies, it is worth mentioning that a JTA can sometimes turn into a significant undertaking. I attended one of Mark Raymond’s seminars earlier this year, and he observed anecdotally that he has had JTA studies take anywhere from three months to over a year. There are many psychometricians who specialize in JTA studies, and it may be helpful to work with them for some aspects of the project, especially when conducting a JTA for the first time. However, even if we use a psychometric consultant to conduct or analyze the JTA, learning about the process can make us better-informed consumers and allow us to handle some of the work internally, potentially saving time and money.


Example of task input screen for a JTA item in Questionmark Authoring.

For more information on JTA and other reporting tools that are available with Questionmark, check out this Reporting & Analytics page.

Question Type Report: Use Cases

Posted by Austin Fossey

A client recently asked me if there is a way to count the number of each type of item in their item bank, so I pointed them toward the Question Type Report in Questionmark Analytics. While this type of frequency data can also be easily pulled using our Results API, it can be useful to have a quick overview of the number of items (split out by item type) in the item bank.

The Question Type Report does not need to be run frequently (and Analytics usage stats reflect that observation), but the data can help indicate the robustness of an item bank.

This report is most valuable in situations involving topics for a specific assessment or set of related assessments. While it might be nice to know that we have a total of 15,000 multiple choice (MC) items in the item bank, these counts are trivial unless we have a system-wide practical application—for example planning a full program translation or selling content to a partner.

This report can provide a quick profile of the population of the item bank or a topic when needed, though more detailed item tracking by status, topic, metatags, item type, and exposure is advisable for anyone managing a large-scale item development project. Below are some potential use cases for this simple report.

Test Development and Maintenance:
The Question Type Report’s value is primarily its ability to count the number of each type of item within a topic. If we know we have 80 MC items in a topic for a new assessment, and they all need to be reviewed by a bias committee, then we can plan accordingly.
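If the item bank has been exported to a simple list, the same kind of count can be sketched in a few lines of Python. The records and the function name below are hypothetical; this does not use or describe the Questionmark Results API or the report itself.

```python
from collections import Counter, defaultdict

def count_types_by_topic(items):
    """items: iterable of (topic, item_type) pairs exported from an item bank."""
    counts = defaultdict(Counter)
    for topic, item_type in items:
        counts[topic][item_type] += 1
    return counts

# Hypothetical item bank export
item_bank = [
    ("Geography", "Multiple Choice"),
    ("Geography", "Multiple Choice"),
    ("Geography", "Essay"),
    ("History", "Multiple Response"),
    ("History", "Multiple Choice"),
]

for topic, type_counts in sorted(count_types_by_topic(item_bank).items()):
    for item_type, n in sorted(type_counts.items()):
        print(f"{topic}: {item_type} = {n}")
```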

Form Building:
If we are equating multiple forms using a common-item design, the report can help us determine how many items go on each form and the degree to which the forms can overlap. Even if we only have one form, knowing the number of items can help a test developer check that enough items are available to match the blueprint.

Item Development:
If the report indicates that there are plenty of MC items ready for future publications, but we only have a handful of essay items to cover our existing assessment form, then we might instruct item writers to focus on developing new essay questions for the next publication of the assessment.


Example of a Question Type Report showing the frequency distribution by item type.


When to Give Partial Credit for Multiple-Response Items

Posted by Austin Fossey

Three different customers recently asked me how to decide between scoring a multiple-response (MR) item dichotomously or polytomously; i.e., when should an MR item be scored right/wrong, and when should we give partial credit? I gave some garrulous, rambling answers, so the challenge today is for me to explain this in a single blog post that I can share the next time it comes up.

In their chapter on multiple-choice and matching exercises in Educational Assessment of Students (5th ed.), Anthony Nitko and Susan Brookhart explain that matching items (which we may extend to include MR item formats, drag-and-drop formats, survey-matrix formats, etc.) are often a collection of single-response multiple choice (MC) items. The advantage of the MR format is that it saves space and you can leverage dependencies in the questions (e.g., relationships between responses) that might be redundant if broken into separate MC items.

Given that an MR item is often a set of individually scored MC items, a polytomously scored format almost always makes sense. From an interpretation standpoint, there are a couple of advantages for you as a test developer or instructor. First, you can differentiate between participants who know some of the answers and those who know none of the answers. This can improve the item discrimination. Second, you have more flexibility in how you choose to score and interpret the responses. In the drag-and-drop example below (a special form of an MR item), the participant has all of the dates wrong; however, the instructor may still be interested in knowing that the participant knows the correct order of events for the Stamp Act, the Townshend Act, and the Boston Massacre.


Example of a drag-and-drop item in Questionmark where the participant’s responses are wrong, but the order of responses is partially correct.

Are there exceptions? You know there are. This is why it is important to have a test blueprint document, which can help clarify which item formats to use and how they should be evaluated. Consider the following two variations of a learning objective on a hypothetical CPR test blueprint:

  • The participant can recall the actions that must be taken for an unresponsive victim requiring CPR.
  • The participant can recall all three actions that must be taken for an unresponsive victim requiring CPR.

The second example is likely the one that the test developer would use for the test blueprint. Why? Because someone who knows two of the three actions is not going to cut it. This is a rare all-or-nothing scenario where knowing some of the answers is essentially the same (from a qualifications standpoint) as knowing none of the answers. The language in this learning objective (“recall all three actions”) is an indicator to the test developer that if they use an MR item to assess this learning objective, they should score it dichotomously (no partial credit). The example below shows how one might design an item for this hypothetical learning objective with Questionmark’s authoring tools:


Example of a Questionmark authoring screen for an MR item that is scored dichotomously (right/wrong).

To summarize, a test blueprint document is the best way to decide if an MR item (or variant) should be scored dichotomously or polytomously. If you do not have a test blueprint, think critically about what you are trying to measure and the interpretations you want reflected in the item score. Partial-credit scoring is desirable in most use cases, though there are occasional scenarios where an all-or-nothing scoring approach is needed—in which case the item can be scored strictly right/wrong. Finally, do not forget that you can score MR items differently within an assessment. Some MR items can be scored polytomously and others can be scored dichotomously on the same test, though it may be beneficial to notify participants when scoring rules differ for items that use the same format.
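The difference between the two scoring approaches can be sketched in a few lines of Python. The partial-credit rule here (one point for each option the participant classifies correctly, treating every option as its own true/false question) is just one of several possible schemes and is my assumption for illustration; the option labels and function names are hypothetical and do not reflect how Questionmark scores items.

```python
def score_dichotomous(selected, key):
    """All-or-nothing: 1 only if the selections exactly match the key."""
    return 1 if set(selected) == set(key) else 0

def score_polytomous(selected, key, options):
    """Partial credit: one point per option classified correctly
    (selected when keyed, left unselected when not keyed)."""
    selected, key = set(selected), set(key)
    return sum(1 for option in options if (option in selected) == (option in key))

# Hypothetical MR item with five options, three of which are keyed
options = ["A", "B", "C", "D", "E"]
key = ["A", "B", "C"]
response = ["A", "B"]  # participant found two of the three keyed options

print("Dichotomous score:", score_dichotomous(response, key))
print("Polytomous score:", score_polytomous(response, key, options),
      "out of", len(options))
```

The same response earns nothing under right/wrong scoring but most of the available points under partial credit, which is why the blueprint needs to say which interpretation is intended.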

If you are interested in understanding and applying some basic principles of item development and enhancing the quality of your results, download the free white paper written by Austin: Managing Item Development for Large-Scale Assessment
