Item analysis: Selecting items for the test form – Part 2

Posted by Austin Fossey

In my last post, I talked about how item discrimination is the primary statistic used for item selection in classical test theory (CTT). In this post, I will share an example from my item analysis webinar.

The assessment below is fake, so there’s no need to write in comments telling me that the questions could be written differently or that the test is too short or that there is not good domain representation or that I should be banished to an island.

Interested in learning more about item analysis? I will be presenting a series of workshops on this topic at the Questionmark Conference 2016: Shaping the Future of Assessment in Miami, April 12-15.

In this example, we have field tested 16 items and collected item statistics from a representative sample of 1,000 participants. In this hypothetical scenario, we have been asked to create an assessment that has 11 items instead of 16. We will begin by looking at the item discrimination statistics.

Since this test has fewer than 25 items, we will look at the item-rest correlation discrimination rather than the item-total correlation: with so few items, each item’s own score makes up a large enough share of the total score to inflate the item-total correlation, so we exclude the item from its own total. The screenshot below shows the first five items from the summary table in Questionmark’s Item Analysis Report (I have omitted some columns to help display the table within the blog).

[Screenshot: summary table from Questionmark’s Item Analysis Report showing the first five items]
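For readers who want to see how this statistic is computed, here is a minimal sketch in Python. The 0/1 response matrix is randomly generated purely for illustration (the report’s real statistics come from the 1,000-participant field-test sample), and the function name is mine, not Questionmark’s.

```python
import numpy as np

# Illustrative 0/1 scored response matrix: rows are participants, columns are items.
rng = np.random.default_rng(0)
responses = (rng.random((1000, 16)) > 0.5).astype(int)

def item_rest_correlation(responses: np.ndarray, item: int) -> float:
    """Correlate scores on one item with the total score on all *other* items."""
    item_scores = responses[:, item]
    rest_scores = responses.sum(axis=1) - item_scores
    return float(np.corrcoef(item_scores, rest_scores)[0, 1])

print(round(item_rest_correlation(responses, item=2), 2))  # item 3 (zero-based index 2)
```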

The test’s reliability (as measured by Cronbach’s Alpha) for all 16 items is 0.58. Note that one would typically need at least a reliability value of 0.70 for low-stakes assessments and a value of 0.90 or higher for high-stakes assessments. When reliability is too low, adding extra items can often help improve the reliability, but removing items with poor discrimination can also improve reliability.
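If you want to sanity-check a reliability figure like the 0.58 above, Cronbach’s Alpha is simple to compute from the same kind of scored response matrix. A minimal sketch (the function would be applied to a real participants-by-items matrix, not the toy data from the previous snippet):

```python
import numpy as np

def cronbach_alpha(responses) -> float:
    """Cronbach's alpha: k/(k-1) * (1 - sum of item variances / variance of total scores)."""
    responses = np.asarray(responses, dtype=float)
    k = responses.shape[1]
    item_variances = responses.var(axis=0, ddof=1)       # per-item score variance
    total_variance = responses.sum(axis=1).var(ddof=1)   # variance of total scores
    return (k / (k - 1)) * (1 - item_variances.sum() / total_variance)
```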

If we remove the five items with the lowest item-rest correlation discrimination (items 9, 16, 2, 3, and 13 shown above), the remaining 11 items have an alpha value of 0.67. That is still not high enough for even low-stakes testing, but it illustrates how items with poor discrimination can lower the reliability of an assessment. Low reliability also increases the standard error of measurement, so by increasing the reliability of the assessment, we might also increase the accuracy of the scores.
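The relationship between reliability and the standard error of measurement (SEM) follows from the classical formula SEM = SD × √(1 − reliability). The raw-score standard deviation below is a made-up value, used only to show the direction of the effect; in practice the standard deviation also changes when items are removed, so treat the numbers as directional rather than exact.

```python
import math

def sem(score_sd: float, reliability: float) -> float:
    """Classical standard error of measurement."""
    return score_sd * math.sqrt(1 - reliability)

# With a hypothetical raw-score standard deviation of 2.5 points:
print(round(sem(2.5, 0.58), 2))  # 1.62 with all 16 items
print(round(sem(2.5, 0.67), 2))  # 1.44 after dropping the five weak items
```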

Notice that these five items have poor item-rest correlation statistics, yet four of those items have reasonable item difficulty indices (items 16, 2, 3, and 13). If we had made selection decisions based on item difficulty, we might have chosen to retain these items, though closer inspection would uncover some content issues, as I demonstrated during the item analysis webinar.

For example, consider item 3, which has a difficulty value of 0.418 and an item-rest correlation discrimination value of -0.02. The screenshot below shows the option analysis table from the item detail page of the report.

[Screenshot: option analysis table for item 3 from the item detail page of the report]

The option analysis table shows that, when asked about the easternmost state in the United States, many participants are selecting the key, “Maine,” but 43.3% of our top-performing participants (defined by the upper 27% of scores) selected “Alaska.” This indicates that some of the top-performing participants might be familiar with Pochnoi Point, an Alaskan island that happens to sit on the other side of the 180th meridian. Sure, that is a technicality, but across the entire sample, 27.8% of the participants chose this option. This item clearly needs to be sent back for revision and clarification before we use it for scored delivery. If we had only looked at the item difficulty statistics, we might never have reviewed this item.
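Reproducing that kind of option breakdown outside the report is mostly a matter of splitting off the top 27% of scorers and tabulating their answer choices. Here is a rough sketch with invented data; only the upper-27% convention comes from the report, and the column names are mine.

```python
import numpy as np
import pandas as pd

# Invented data: each row is one participant's total score and option choice on item 3.
rng = np.random.default_rng(1)
data = pd.DataFrame({
    "total_score": rng.integers(0, 17, size=1000),
    "item_3_choice": rng.choice(["Maine", "Alaska", "Hawaii", "Florida"], size=1000),
})

# Upper group = the top 27% of total scores.
cutoff = data["total_score"].quantile(0.73)
upper = data[data["total_score"] >= cutoff]

# Percentage of the upper group, and of the whole sample, choosing each option.
print((upper["item_3_choice"].value_counts(normalize=True) * 100).round(1))
print((data["item_3_choice"].value_counts(normalize=True) * 100).round(1))
```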

Interested in learning more about item analysis? I will be presenting a series of workshops on this topic at the Questionmark Conference 2016: Shaping the Future of Assessment in Miami, April 12-15. I look forward to seeing you there! Click here to register and learn more about this important learning event.

Item analysis: Selecting items for the test form – Part 1

Posted by Austin Fossey

Regular readers of our blog know that we ran an initial series on item analysis way back in the day, and then I did a second item analysis series building on that a couple of years ago, and then I discussed item analysis in our item development series, and then we had an amazing webinar about item analysis, and then I named my goldfish Item Analysis and wrote my senator requesting that our state bird be changed to an item analysis. So today, I would like to talk about . . . item analysis.

But don’t worry, this is actually a new topic for the blog.

Interested in learning more about item analysis? I will be presenting a series of workshops on this topic at the 2016 Questionmark Conference: Shaping the Future of Assessment in Miami, April 12-15. 

Today, I am writing about the use of item statistics for item selection. I was surprised to learn from the feedback of many of our webinar participants that a lot of people look at their item statistics only after the test has been delivered. Reviewing statistics after delivery is a great practice (so keep it up), but if you can try out the questions as unscored field test items before building your final test form, you can use the item analysis statistics to assemble a better instrument.

When building a test form, item statistics can help us in two ways.

  • They can help us identify items that are poorly written, miskeyed, or irrelevant to the construct.
  • They can help us select the items that will yield the most reliable instrument, and thus a more accurate score.

In the early half of the 20th century, it was commonly believed that a good test instrument should have a mix of easy, medium, and hard items, but this thinking began to change after two studies published in 1952 by Fred Lord and by Lee Cronbach and Willard Warrington. These researchers (and others since) demonstrated that items with higher discrimination values create instruments whose total scores discriminate better among participants across all ability levels.

Sometimes easy and hard items are useful for measurement, such as in an adaptive aptitude test where we need to measure all abilities with similar precision. But in criterion-referenced assessments, we are often interested in correctly classifying those participants who should pass and those who should fail. If this is our goal, then the best test form will be one with a range of medium-difficulty items that also have high discrimination values.

Discrimination may be the primary statistic used for selecting items, but item reliability is also occasionally useful, as I explained in an earlier post. Item reliability can be used as a tie breaker when we need to choose between two items with the same discrimination, or it can be used to predict the reliability or score variance for a set of items that the test developer wants to use for a test form.
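As a sketch of how that selection logic might look in code, the snippet below ranks items by item-rest correlation and breaks ties with an item reliability index, taken here as the item standard deviation times the item-total correlation (one common CTT definition; see the earlier post for the statistic your report actually uses). The function and variable names are illustrative.

```python
import numpy as np

def select_items(responses, n_keep):
    """Rank items by item-rest correlation; break ties on an item reliability index."""
    responses = np.asarray(responses, dtype=float)
    totals = responses.sum(axis=1)
    ranked = []
    for i in range(responses.shape[1]):
        item = responses[:, i]
        discrimination = np.corrcoef(item, totals - item)[0, 1]
        reliability_index = item.std(ddof=1) * np.corrcoef(item, totals)[0, 1]
        ranked.append((discrimination, reliability_index, i))
    ranked.sort(reverse=True)  # highest discrimination first, reliability index as tie-breaker
    return sorted(i for _, _, i in ranked[:n_keep])
```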

Difficulty is still useful for flagging items, though an item flagged for being too easy or too hard will often have a low discrimination value too. If an easy or hard item has good discrimination, it may be worth reviewing it for item flaws or other factors that may have affected the statistics (e.g., was it positioned at the end of a timed test where participants did not have enough time to respond carefully?).

In my next post, I will share an example from the webinar of how item selection using item discrimination improves the test form reliability, even though the test is shorter. I will also share an example of a flawed item that exhibits poor item statistics.

Interested in learning more about item analysis? I will be presenting a series of workshops on this topic at the 2016 Questionmark Conference: Shaping the Future of Assessment in Miami, April 12-15. I look forward to seeing you there! Click here to register and learn more about this important learning event.

 

 

Assembling the Test Form — Test Design and Delivery Part 7

Posted by Doug Peterson

In the previous post in this series, we looked at putting together assessment instructions for both the participant and the instructor/administrator. Now it’s time to start selecting the actual questions.

Back in Part 2 we discussed determining how many items needed to be written for each content area covered by the assessment. We looked at writing 3 times as many items as were actually needed, knowing that some would not make it through the review process. Doing this also enables you to create multiple forms of the test, where each form covers the same concepts with equivalent – but different – questions. We also discussed the amount of time a participant needs to answer each question type, as shown in this table:

As you’re putting your assessment together, you have to account for the time required to take it: multiply the number of questions of each type by the corresponding values in the table above.

You also need to allow time for:

  • Reading the instructions
  • Reviewing sample items
  • Completing practice items
  • Completing demographic info
  • Taking breaks
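Putting the two pieces together, a back-of-the-envelope time budget might look like the sketch below. Every number in it is a placeholder; substitute the per-item minutes from your own planning table and your own item counts and overhead estimates.

```python
# Placeholder per-item times (minutes) and counts -- replace with your own values.
minutes_per_item = {"multiple_choice": 1.0, "fill_in_the_blank": 1.0, "matching": 2.0, "short_answer": 5.0}
item_counts = {"multiple_choice": 30, "fill_in_the_blank": 10, "matching": 2, "short_answer": 2}
overhead_minutes = {"instructions": 5, "sample_items": 3, "practice_items": 5, "demographics": 3, "breaks": 5}

testing_time = sum(item_counts[t] * minutes_per_item[t] for t in item_counts)
total_time = testing_time + sum(overhead_minutes.values())
print(f"Estimated seat time: {total_time:.0f} minutes")  # 75 minutes with these placeholders
```

Working backwards from a fixed time limit, as described next, simply means adjusting the counts and question types until the total fits.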

If you already know the time allowed for your assessment, you may have to work backwards or make some compromises. For example, if you know that you only have one hour for the assessment and you have a large amount of content to cover, you may want to focus on multiple choice and fill-in-the-blank questions and stay away from matching and short-answer items to maximize the number of questions you can include in the allotted time.

To select the actual items for the assessment, you may want to consider using a Test Assembly Form, which might look something like this:

The content area is in the first column. The second column shows how many questions are needed for that content area (as calculated back in Part 2). Each item should have a short identifier associated with it, and this is provided in the “Item #” column. The “Keyword” column is just that – one or two words to remind you what the question addresses. The last column lists the item number of an alternate item in case a problem is found with the first selection during assessment review.
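To make that structure concrete, here is a small, entirely hypothetical version of the same form expressed as plain data, along with a quick check that each content area has the number of items the blueprint calls for. The content areas, item numbers, and keywords are invented.

```python
# Hypothetical test assembly form: content area, items needed, selected items, alternates.
assembly_form = [
    {"content_area": "Geography", "needed": 3,
     "items": [("G-104", "easternmost state"), ("G-117", "state capitals"), ("G-121", "time zones")],
     "alternates": ["G-130"]},
    {"content_area": "History", "needed": 2,
     "items": [("H-203", "constitutional convention"), ("H-219", "louisiana purchase")],
     "alternates": ["H-224"]},
]

# Check that each content area has as many selected items as the blueprint requires.
for row in assembly_form:
    status = "OK" if len(row["items"]) == row["needed"] else "SHORT"
    print(f'{row["content_area"]}: {len(row["items"])}/{row["needed"]} items selected ({status})')
```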

As you select items, watch out for two things:

1. Enemy items. This is when one item gives away the answer to another item. Make sure that the stimulus or answer to one item does not answer or give a clue to the answer of another item.

2. Overlap. This is when two questions basically test the same thing. You want to cover all of the content in a given content area, so each question for that content area should cover something unique. If you find that you have several questions assessing the same thing, you may need to write some new questions or you may need to re-calculate how many questions you actually need.
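Neither check needs anything elaborate. If your item metadata records enemy pairs and a short keyword for each item, a few lines can flag problems before the assessment review; the IDs, keywords, and the idea of storing enemy pairs as tuples below are all hypothetical.

```python
# Hypothetical metadata for the selected items: ID -> keyword, plus known enemy pairs.
selected = {
    "G-104": "easternmost state",
    "G-117": "state capitals",
    "G-121": "easternmost state",   # duplicate keyword -> possible overlap
}
enemy_pairs = [("G-104", "G-130")]

# 1. Enemy items: flag any tagged pair that appears together on the form.
for a, b in enemy_pairs:
    if a in selected and b in selected:
        print(f"Enemy items on the same form: {a} and {b}")

# 2. Overlap: flag items within the form that share a keyword.
seen = {}
for item_id, keyword in selected.items():
    if keyword in seen:
        print(f"Possible overlap: {seen[keyword]} and {item_id} both cover '{keyword}'")
    seen[keyword] = item_id
```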

Once you have your assessment put together, you need to calculate the cutscore. This topic could easily be another (very lengthy) blog series, and there are many books available on calculating cutscores. I recently read the book, Cutscores: A Manual for Setting Standards of Performance on Educational and Occupational Tests, by Zieky, Perie and Livingston. I found it to be a very good book, considering that the subject matter isn’t exactly “thrill a minute”. The authors discuss 18 different methods for setting cutscores, including which methods to use in various situations and how to carry out a cutscore study. They look at setting cutscores for criterion-referenced assessments (where performance is judged against a set standard) as well as norm-referenced assessments (where the performance of one participant is judged against the performance of the other participants). They also look at pass/fail situations as well as more complex judgments such as dividing participants into basic, proficient and advanced categories.