Develop Better Tests with Item Analysis [New eBook]

Posted by Chloe Mendonca

Item analysis is probably the most important tool for increasing test effectiveness. To write items that accurately and reliably measure what they're intended to measure, you need to examine participant responses to each item. You can use this information to improve test items and to identify unfair or biased items.
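As a concrete illustration, here is a minimal sketch (in Python, with made-up response data, and not drawn from the eBook itself) of the two statistics most item analyses start from: item difficulty, the proportion of participants answering correctly, and item discrimination, the correlation between performance on the item and performance on the rest of the test.

```python
# Minimal sketch of two classical item statistics: difficulty (p-value)
# and discrimination (corrected point-biserial correlation).
# Illustrative only; the response data below are invented.
import numpy as np

# Rows = participants, columns = items; 1 = correct, 0 = incorrect.
responses = np.array([
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [1, 1, 1, 1],
    [0, 0, 0, 1],
    [1, 1, 0, 0],
])

total_scores = responses.sum(axis=1)

for item in range(responses.shape[1]):
    item_scores = responses[:, item]
    difficulty = item_scores.mean()  # proportion correct (p-value)
    # Correlate the item with the rest of the test (total minus this item),
    # so the item does not inflate its own discrimination.
    rest_scores = total_scores - item_scores
    discrimination = np.corrcoef(item_scores, rest_scores)[0, 1]
    print(f"Item {item + 1}: difficulty={difficulty:.2f}, "
          f"discrimination={discrimination:.2f}")
```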

So what’s the process for conducting an item analysis? What should you be looking for? How do you determine if a question is “good enough”?

Questionmark has just published a new eBook, “Item Analysis Analytics,” which answers these questions. The eBook shares many examples of the varying statistics that you may come across in your own analyses.

Download this eBook to learn about these aspects of analytics:

  • the basics of classical test theory and item analysis
  • the process of conducting an item analysis
  • essential things to look for in a typical item analysis report
  • whether a question “makes the grade” in terms of psychometric quality

This eBook is available as a PDF and ePUB suitable for viewing on a variety of mobile devices and eReaders.

I hope you enjoy reading it!

Item Development – Organizing a bias review committee (Part 2)

Posted by Austin Fossey

The Standards for Educational and Psychological Testing describe two facets of an assessment that can result in bias: the content of the assessment and the response process. These are the areas on which your bias review committee should focus. You can read Part 1 of this post here.

Content bias is often what people think of when they think about examples of assessment bias. This may pertain to item content (e.g., students in hot climates may have trouble responding to an algebra scenario about shoveling snow), but it may also include language issues, such as the tone of the content, differences in terminology, or the reading level of the content. Your review committee should also consider content that might be offensive or trigger an emotional response from participants. For example, if an item’s scenario described interactions in a workplace, your committee might check to make sure that men and women are equally represented in management roles.

Bias may also occur in the response processes. Subgroups may have differences in responses that are not relevant to the construct, or a subgroup may be unduly disadvantaged by the response format. For example, an item that asks participants to explain how they solved an algebra problem may be biased against participants for whom English is a second language, even though they might be employing the same cognitive processes as other participants to solve the algebra. Response process bias can also occur if some participants provide unexpected responses to an item that are correct but may not be accounted for in the scoring.

How do we begin to identify content or response processes that may introduce bias? Your sensitivity guidelines will depend upon your participant population, applicable social norms, and the priorities of your assessment program. When drafting your sensitivity guidelines, you should spend a good amount of time researching potential sources of bias that could manifest in your assessment, and you may need to periodically update your own guidelines based on feedback from your reviewers or participants.

In his chapter in Educational Measurement (4th ed.), Gregory Camilli recommends the chapter on fairness in the ETS Standards for Quality and Fairness and An Approach for Identifying and Minimizing Bias in Standardized Tests (Office for Minority Education) as sources of criteria that could be used to inform your own sensitivity guidelines. If you would like to see an example of one program’s sensitivity guidelines that are used to inform bias review committees for K12 assessment in the United States, check out the Fairness Guidelines Adopted by PARCC (PARCC), though be warned that the document contains examples of inflammatory content.

In the next post, I will discuss considerations for the final round of item edits that will occur before the items are field tested.

Check out our white paper: 5 Steps to Better Tests for best practice guidance and practical advice for the five key stages of test and exam development.

Austin Fossey will discuss test development at the 2015 Users Conference in Napa Valley, March 10-13. Register before Dec. 17 and save $200.

Evidence that assessments improve learning outcomes

Posted by John Kleeman

I’ve written about this research before, but it’s a very compelling example, and I think it’s useful as evidence that giving low-stakes quizzes during a course correlates strongly with improved learning outcomes.

The study was conducted by two economics lecturers, Dr Simon Angus and Judith Watson, and is titled “Does regular online testing enhance student learning in the numerical sciences? Robust evidence from a large data set.” It was published in the British Journal of Educational Technology, Vol. 40, No. 2, pp. 255-272, in 2009.

Angus and Watson introduced a series of four online, formative quizzes into a business mathematics course and wanted to determine whether students who took the quizzes learned more and did better on the final exam than those who didn’t. The interesting thing about the study is that they used a statistical technique that allowed them to estimate the effects of several different factors at once, isolating the effect of taking the quizzes from the students’ previous mathematical experience, their gender and their general level of effort, and to determine which factor had the greatest impact on the final exam score.

You can see a summary of their findings in the graph below, which shows the estimated coefficients for four of the main factors, all of which had a statistical significance of p < 0.01.

Factors associated with final exam score graph

You can see from this graph that the biggest factor associated with final exam success was how well students had done in the midterm exam, i.e. how well they were doing in the course generally. But students who took the four online quizzes learned from them and did significantly better. The impact of taking or not taking the quizzes was broadly the same size as the impact of their prior maths education: in other words, substantial and statistically significant.
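For readers who want to try this kind of analysis on their own data, the sketch below shows how a multiple regression can estimate the effect of quiz participation on final exam scores while controlling for other factors, which is the general approach behind coefficient estimates like those in the graph. The data and variable names are invented for illustration and are not the authors’ dataset, so the fitted coefficients will not match the published study.

```python
# Illustrative multiple-regression sketch: regress final exam score on quiz
# participation while controlling for other factors. Data are simulated.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "took_quizzes": rng.integers(0, 2, n),    # 1 = completed the online quizzes
    "midterm_score": rng.normal(60, 12, n),   # performance in the course so far
    "prior_maths": rng.integers(0, 2, n),     # 1 = stronger maths background
    "effort": rng.normal(0, 1, n),            # proxy for general effort
})
# Simulated outcome: quizzes and midterm both contribute to the final exam.
df["final_exam"] = (0.5 * df["midterm_score"] + 6 * df["took_quizzes"]
                    + 5 * df["prior_maths"] + 3 * df["effort"]
                    + rng.normal(0, 8, n))

model = smf.ols(
    "final_exam ~ took_quizzes + midterm_score + prior_maths + effort",
    data=df,
).fit()
print(model.summary())  # the took_quizzes coefficient isolates the quiz effect
```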

We know intuitively that formative quizzes help learning, but it’s nice to see statistical evidence that – to quote the authors – “exposure to a regular (low mark) online quiz instrument has a significant and positive effect on student learning as measured by an end of semester examination”.

Another good resource on the benefits of assessments to check out is the white paper, The Learning Benefits of Questions. In it, Dr. Will Thalheimer of Work-Learning Research reveals research that shows that questions can produce significant learning and performance benefits, potentially improving learning by 150% or more. The white paper is complimentary after registration.

John Kleeman will discuss benefits and good practice in assessments at the 2015 Users Conference in Napa Valley, March 10-13. Register before Dec. 17 and save $200.

How can a randomized test be fair to all?

Posted by Joan Phaup

James Parry, test development manager at the U.S. Coast Guard Training Center in Yorktown, Virginia, will answer this question during a case study presentation at the Questionmark Users Conference in San Antonio, March 4 – 7. He’ll be co-presenting with LT Carlos Schwarzbauer, IT Lead at the USCG Force Readiness Command’s Advanced Distributed Learning Branch.

James and I spoke the other day about why tests created from randomly drawn items can be useful in some cases—but also about their potential pitfalls and some techniques for avoiding them.

When are randomly designed tests an appropriate choice?

There are several reasons to use randomized tests. Randomization is appropriate when you think there’s a possibility of participants sharing the contents of their test with others who have not yet taken it. Another reason would be a computer-lab-style testing environment, where many people are tested on the same subject at the same time with no dividers between the computers. Even if participants look at the screens next to them, chances are they won’t see the same items.

How are you using randomly designed tests?

We use randomly generated tests at all three levels of testing: low-, medium- and high-stakes. The low- and medium-stakes tests are used primarily at the schoolhouse level for knowledge- and performance-based quizzes and tests. We are also generating randomized tests for on-site testing using tablet computers or locally installed workstations.

Our most critical use is for our high-stakes enlisted advancement tests, which are administered both on paper and by computer. Participants are permitted to retake this test every 21 days if they do not achieve a passing score. Before we were able to randomize the test, there were only three parallel paper versions. Candidates knew this, so some would “test sample” without studying to get an idea of every possible question: they would take the first version, then the second, and so forth until they passed. With randomization, the word has gotten out that this is no longer possible.

What are the pitfalls of drawing items randomly from an item bank?

The biggest pitfall is the potential for producing tests that have different levels of difficulty or that don’t present a balance of questions across all the subjects you want to cover. A completely random test can be unfair. Suppose you produce a 50-item randomized test from an entire item bank of 500 items. Participant “A” might get an easy test, “B” might get a difficult test, and “C” might get a test with 40 items on one topic and only 10 on everything else.
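To get a feel for how much purely random selection can vary, here is a small simulation (with invented difficulty values, not Coast Guard data) that draws many 50-item forms from a 500-item bank and compares the average difficulty of the easiest and hardest forms drawn.

```python
# Quick simulation of the pitfall described above: drawing 50 items at random
# from a 500-item bank produces forms whose average difficulty varies from
# participant to participant. Difficulty values are invented for illustration.
import numpy as np

rng = np.random.default_rng(42)
bank_difficulty = rng.uniform(0.3, 0.9, 500)  # proportion-correct for each item

form_means = [rng.choice(bank_difficulty, size=50, replace=False).mean()
              for _ in range(1000)]
print(f"Easiest form drawn: average p-value {max(form_means):.2f}")
print(f"Hardest form drawn: average p-value {min(form_means):.2f}")
```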

How do you equalize the difficulty levels of your questions?

This is a multi-step process. The item authors have to develop enough items in each topic to provide at least 3 to 5 items for each enabling objective. They have to think outside the box to produce items at several cognitive levels so that there is a range of difficulty levels to draw from. This is the hardest part for them, because most are not trained test writers.

Once the items are developed, edited, and approved in workflow, we set up an Angoff rating session to assign a cut score for the entire bank of test items. Based upon the Angoff score, each item is assigned a difficulty level of easy, moderate or hard and given a matching metatag within Questionmark. We use a spreadsheet to calculate the number and percentage of available items at each level of difficulty in each topic. Based upon the results, the spreadsheet tells us how many items to select from the database at each difficulty level and from each topic. The test is then designed to match these numbers so that each time it is administered it will be parallel, with the same level of difficulty and the same cut score.
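As a rough illustration of the allocation step described here, the sketch below allocates slots on a 50-item form in proportion to the number of approved items at each difficulty level within each topic, so that every administered form follows the same blueprint. The topic names and counts are hypothetical; this is not the actual Coast Guard spreadsheet.

```python
# Sketch of a blueprint allocation: given how many approved items exist at
# each difficulty level within each topic, allocate slots proportionally so
# each randomly assembled form is built to the same blueprint.
# Topic names and counts are hypothetical.
from collections import Counter

# (topic, difficulty) -> number of approved items in the bank
bank = Counter({
    ("navigation", "easy"): 40, ("navigation", "moderate"): 30, ("navigation", "hard"): 10,
    ("seamanship", "easy"): 25, ("seamanship", "moderate"): 35, ("seamanship", "hard"): 20,
})

test_length = 50
total_items = sum(bank.values())

# Allocate slots in proportion to the bank, rounding down; a real blueprint
# would then hand-adjust the remainders to hit the exact test length.
blueprint = {key: (count * test_length) // total_items for key, count in bank.items()}
print(blueprint)
print("Slots allocated:", sum(blueprint.values()), "of", test_length)
```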

Is there anything audience members should do to prepare for this session?

Come with an open mind and a willingness to think outside of the box.

How will your session help audience members ensure their randomized tests are fair?

I will give them the tools to use, starting with a quick review of using the Angoff method to set a cut score, and then discuss the inner workings of the spreadsheet I developed to ensure that each test is fair and equal.

***

See more details about the conference program here and register soon.