Item Analysis for Beginners – When Are Very Easy or Very Difficult Questions Useful?

Posted by John Kleeman

I’m running a session at the Questionmark user conference next month on Item Analysis for Beginners and thought I’d share the answer to an interesting question in this blog.

[Item analysis report fragment showing a question with a difficulty of 0.998 and a discrimination of 0.034]

When you run an Item Analysis report, one of the useful statistics you get for a question is its “p-value” or “item difficulty”. This is a number from 0 to 1, with higher values indicating easier questions. An easy question might have a p-value of 0.9 to 1.0, meaning 90% to 100% of participants answer the question correctly. A difficult question might have a p-value of 0.0 to 0.25, meaning 25% or fewer of participants answer the question correctly. For example, the report fragment to the right shows a question with a p-value of 0.998, which means it is very easy and almost everyone gets it right.
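
If you want to compute item difficulty yourself outside the report, the calculation is just the mean of the scored (0/1) responses for each question. Here is a minimal Python sketch with a made-up response matrix (the data and variable names are illustrative only):

```python
import numpy as np

# Made-up scored responses: one row per participant, one column per question,
# 1 = answered correctly, 0 = answered incorrectly
scores = np.array([
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [1, 1, 1, 1],
    [1, 0, 0, 1],
])

# The p-value (item difficulty) is the proportion of participants
# who answered each question correctly
p_values = scores.mean(axis=0)
print(p_values)  # [1.   0.5  0.25 1.  ]
```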

Whether such questions are appropriate depends on the purpose of the assessment. Most participants will get difficult questions wrong and easy questions right, so in general, very easy and very difficult questions will not be as helpful as mid-difficulty questions in discriminating between participants, and will therefore contribute less when you use the assessment for measurement purposes.

Here are three reasons why you might decide to include very difficult questions in an assessment:

  1. Sometimes your test blueprint requires questions on a topic and the only ones you have available are difficult ones – if so, you need to use them until you can write more.
  2. If a job has high performance needs and you need to filter out a few participants from many, then very difficult questions can be useful. This might apply for example if you are selecting potential astronauts or special forces team members.
  3. If you need to assess a wide range of ability within a single assessment, then you may need some very difficult questions to be able to distinguish ability levels among the top-performing participants.

And here are five reasons why you might decide to include very easy questions in an assessment:

  1. Answering questions gives retrieval practice and helps participants remember things in the future – so including easy questions still helps reduce forgetting.
  2. In compliance or health and safety, you may choose to include basic questions that almost everyone gets right. This is because if someone gets it wrong, you want to know and be able to intervene.
  3. More broadly, sometimes a test blueprint requires you to cover topics that almost everyone knows and that it is not practical to write difficult questions about.
  4. Easy questions at the start of an assessment can build confidence and reduce test anxiety. See my blog post Ten tips on reducing test anxiety for online test-takers for other ways to deal with test anxiety.
  5. If the purpose of your assessment is to measure someone’s ability to process information quickly and accurately under time pressure, then including many low-difficulty questions that need to be answered in a short time might be appropriate.

If you want to learn more about Item Analysis, search this blog for other articles. You might also find the Questionmark user conference useful: as well as my session on Item Analysis, there are many other useful sessions, including ones on setting cut scores in a fair, defensible way and on identifying knowledge gaps. The conference also gives you the opportunity to learn from and network with other assessment practitioners – I look forward to seeing some of you there.

The Importance of Safety in the Utilities Industry: A Q&A with PG&E


Posted by Julie Delazyn

Wendy Lau is a Psychometrician at Pacific Gas and Electric Company (PG&E). She will be leading a discussion at Questionmark Conference 2016 in Miami about Safety and the Utilities Industry: Why Assessments Matter.


Wendy Lau, Psychometrician, PG&E

Wendy’s session will describe a day in the life of a psychometrician in the utilities industry. It will explore the role assessments play at PG&E, and how Questionmark has helped the company focus on safety and train its employees.

I recently asked her about her session:

Tell me about PG&E and its use of assessments:

PG&E is a utilities company that provides natural gas and electricity to most of the northern two-thirds of California. Over the years, we have evolved into a more data-driven company, and Questionmark has been a part of that for the past 7 years. Having assessments readily available and secured within a platform that we can trust is very important to PG&E. We are also glad to have found a testing tool that offers such a wide variety of question types.

Why is safety important in the utilities industry?

Depending on the activity that our employees perform, most of the work has serious safety implications — whether it is a lineman climbing up a pole to perform live-line work or a utility worker digging near a major gas pipeline. Our technical training must have safety in mind and, more importantly, it must ensure that after going through training, employees are competent to perform their tasks safely and proficiently. To ensure workforce capability, we rely heavily on testing to prove that our workforce is in fact safe and proficient, and that the community we serve and our employees are safe and receiving reliable services.

What role does Questionmark play in ensuring that safety?

Questionmark helps us focus on safety-related questions by allowing special assessment strategies, such as identifying critical versus coachable assessment items and setting cut scores for each accordingly. Questionmark also provides a secure platform, so we can ensure our test items are never compromised and that our employees are truly being assessed under fair circumstances.

To find out more about the role Questionmark plays in ensuring safety, you’ll just have to attend my session at Questionmark Conference 2016 in Miami!

What are you looking forward to at the conference?

I am very much looking forward to ‘talking shop’ with other Psychometricians and sharing best practices with others in the utilities industry and other companies alike!

Thank you, Wendy, for taking time out of your busy schedule to discuss your session with us!

If you have not already done so, you still have a chance to attend this important learning event. Click here to register.


Checklists for Test Development

Posted by Austin Fossey

There are many fantastic books about test development, and there are many standards systems for test development, such as The Standards for Educational and Psychological Testing. There are also principled frameworks for test development and design, such as evidence-centered design (ECD). But it seems that the supply of qualified test developers cannot keep up with the increased demand for high-quality assessment data, leaving many organizations to piece together assessment programs, learning as they go.

As one might expect, this scenario leads to new tools targeted at these rookie test developers—simplified guidance documents, trainings, and resources attempting to idiot-proof test development. As a case in point, Questionmark seeks to distill information from a variety of sources into helpful, easy-to-follow white papers and blog posts. At an even simpler level, there appears to be increased demand for checklists that new test developers can use to guide test development or evaluate assessments.

For example, my colleague, Bart Hendrickx, shared a Dutch article from the Research Center for Examination and Certification (RCEC) at the University of Twente describing their Beoordelingssysteem (evaluation system). He explained that this system provides a rubric for evaluating educational assessments in areas like representativeness, reliability, and standard setting. The Buros Center for Testing addresses similar needs for users of mental assessments. In the Assessment Literacy section of their website, Buros has documents with titles like “Questions to Ask When Evaluating a Test”—essentially an evaluation checklist (though Buros also provides its own professional ratings of published assessments). There are even assessment software packages that seek to operationalize a test development checklist by creating a rigid workflow that guides the test developer through the different steps of the design process.

The benefit of these resources is that they can help guide new test developers through basic steps and considerations as they build their instruments. It is certainly a step up from a company compiling a bunch of multiple choice questions on the fly and setting a cut score of 70% without any backing theory or test purpose. On the other hand, test development is supposed to be an iterative process, and without the flexibility to explore the nuances and complexities of the instrument, the results and the inferences may fall short of their targets. An overly simple, standardized checklist for developing or evaluating assessments may not consider an organization’s specific measurement needs, and the program may be left with considerable blind spots in its validity evidence.

Overall, I am glad to see that more organizations want to improve the quality of their measurements, and it is encouraging to see more training resources to help new test developers tackle the learning curve. Checklists may be a very helpful tool for many applications, and test developers frequently create their own checklists to standardize practices within their organization, such as item reviews.

What do our readers think? Are checklists the way to go? Do you use a checklist from another organization in your test development?


G Theory and Reliability for Assessments with Randomly Selected Items

Posted by Austin Fossey

One of our webinar attendees recently emailed me to ask if there is a way to calculate reliability when items are randomly selected for delivery in a classical test theory (CTT) model.

As with so many things, the answer comes from Lee Cronbach—but it’s not Cronbach’s Alpha. In 1963, Cronbach, along with Goldine Gleser and Nageswari Rajaratnam, published a paper on generalizability theory, which is often called G theory for brevity or to sound cooler. G theory is a very powerful set of tools, but today I am focusing on one aspect of it: the generalizability coefficient, which describes the degree to which observed scores might generalize to a broader set of measurement conditions. This is helpful when the conditions of measurement will change for different participants, as is the case when we use different items, different raters, different administration dates, etc.

In G theory, measurement conditions are called facets. A facet might include items, test forms, administration occasions, or human raters. Facets can be random (i.e., they are a sample of a much larger population of potential facets), or they might be fixed, such as a condition that is controlled by the researcher. The hypothetical set of conditions across all possible facets is called, quite grandly, the universe of generalization. A participant’s average measurement across the universe of generalization is called their universe score, which is similar to a true score in CTT, except that we no longer need to assume that all measurements in the universe of generalization are parallel.

In CTT, the concept of reliability is defined as the ratio of true score variance to observed score variance. Observed scores are just true scores plus measurement error, so as measurement error decreases, reliability increases toward 1.00.
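
In symbols, writing σ²T for true score variance, σ²E for error variance, and σ²X for observed score variance, the familiar CTT definition is:

```latex
\text{reliability} \;=\; \frac{\sigma^2_T}{\sigma^2_X} \;=\; \frac{\sigma^2_T}{\sigma^2_T + \sigma^2_E}
```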

The generalizability coefficient is defined as the ratio of universe score variance to expected score variance, which is similar to the concept of reliability in CTT. The generalizability coefficient is made of variance components, which differ depending on the design of the study, and which can be derived from an analysis of variance (ANOVA) summary table. We will not get into the math here, but I recommend Linda Crocker and James Algina’s Introduction to Classical and Modern Test Theory for a great introduction and easy-to-follow examples of how to calculate generalizability coefficients under multiple conditions. For now, let’s return to our randomly selected items.

In his chapter in Educational Measurement, 4th Edition, Edward Haertel illustrated the overlaps between G theory and CTT reliability measures. When all participants see the same items, the generalizability coefficient is made up of the variance components for the participants and for the residual scores, and it yields the exact same value as Cronbach’s Alpha. If the researcher wants to use the generalizability coefficient to generalize to an assessment with more or fewer items, then the result is the same as the Spearman-Brown formula.
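
To make this concrete, here is a rough Python sketch (my own naming, not Haertel’s notation) of how the variance components and the generalizability coefficient for a fully crossed persons-by-items design could be estimated from a scored response matrix. Crocker and Algina’s worked examples are the better guide for real analyses:

```python
import numpy as np

def variance_components(scores):
    """Estimate person, item, and residual variance components from a
    fully crossed persons-by-items matrix of scores (rows = persons)."""
    n_p, n_i = scores.shape
    grand = scores.mean()
    person_means = scores.mean(axis=1)
    item_means = scores.mean(axis=0)

    # Mean squares from the two-way ANOVA with no replication
    ms_p = n_i * np.sum((person_means - grand) ** 2) / (n_p - 1)
    ms_i = n_p * np.sum((item_means - grand) ** 2) / (n_i - 1)
    resid = scores - person_means[:, None] - item_means[None, :] + grand
    ms_res = np.sum(resid ** 2) / ((n_p - 1) * (n_i - 1))

    # Solve the expected mean square equations for the variance components
    var_p = (ms_p - ms_res) / n_i   # universe score variance (persons)
    var_i = (ms_i - ms_res) / n_p   # items
    var_res = ms_res                # person-by-item interaction plus error
    return var_p, var_i, var_res

def g_coefficient_crossed(scores):
    """Generalizability coefficient when everyone answers the same items;
    for this design it equals Cronbach's Alpha."""
    var_p, _, var_res = variance_components(scores)
    n_i = scores.shape[1]
    return var_p / (var_p + var_res / n_i)
```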

But when our participants are each given a random set of items, they are no longer receiving parallel assessments. The generalizability coefficient has to be modified to include a variance component for the items, and the observed score variance is now a function of three things:

  • Error variance.
  • Variance in the item mean scores.
  • Variance in the participants’ universe scores.

Note that error variance is not the same as measurement error in CTT. In the case of a randomly generated assessment, the error variance includes measurement error and an extra component that reflects the lack of perfect correlation between the items’ measurements.

For those of you randomly selecting items, this makes a difference! Cronbach’s Alpha may yield low or even meaningless results (such as negative values) when items are randomly selected. In an example dataset, 1,000 participants answered the same 200 items. For this assessment, Cronbach’s Alpha is equivalent to the generalizability coefficient: 0.97. But if each of those participants had answered 50 randomly selected items from the same set, Cronbach’s Alpha would no longer be appropriate. If we tried to use Cronbach’s Alpha, we would see a depressing number: 0.50. However, the generalizability coefficient is 0.65 – still too low, but better than the alpha value.
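
Continuing the illustrative sketch above, the modification described by Haertel amounts to moving the item variance component into the error term, divided by the number of items each participant actually answers (50 in the example). The variance components would still be estimated from data where the crossed structure is known, such as the 1,000-by-200 dataset:

```python
def g_coefficient_random_items(scores, n_delivered):
    """Generalizability coefficient when each participant answers a
    different random set of n_delivered items drawn from the bank.
    The item variance component now joins the error term."""
    var_p, var_i, var_res = variance_components(scores)
    return var_p / (var_p + (var_i + var_res) / n_delivered)
```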

Finally, it is important to report your results accurately. According to the Standards for Educational and Psychological Testing, you can report generalizability coefficients as reliability evidence if it is appropriate for the design of the assessment, but it is important not to use these terms interchangeably. Generalizability is a distinct concept from reliability, so make sure to label it as a generalizability coefficient, not a reliability coefficient. Also, the Standards require us to document the sources of variance that are included (and excluded) from the calculation of the generalizability coefficient. Readers are encouraged to refer to the Standards’ chapter on reliability and precision for more information.

Is There Value in Reporting Changes in Subscores?

Posted by Austin Fossey

I had the privilege of meeting with an organization that is reporting subscores to show how their employees are improving across multiple areas of their domain, as determined by an assessment given before and after training. They have developed some slick reports to show these scores, including the participant’s first score, second score (after training is complete), and the change in those scores.

At first glance, these reports are pretty snazzy and seem to suggest huge improvements resulting from the training, but looks can be deceiving. I immediately noticed one participant had made a subscore gain of 25%, which sounds impressive—like he or she is suddenly 25% better at the tasks in that domain—but here is the fine print: that subscore was measured with only four items. To put it another way, that 25% improvement means that the participant answered one more item correctly. Other subscores were similarly underrepresented—most with four or fewer items in their topic.

In a previous post, I reported on an article by Richard Feinberg and Howard Wainer about how to determine if a subscore is worth reporting. My two loyal readers (you know who you are) may recall that a reported subscore has to be reliable, and it must contain information that is sufficiently different from the information contained in the assessment’s total score (AKA “orthogonality”).

In an article titled Comments on “A Note on Subscores” by Samuel A. Livingston, Sandip Sinharay and Shelby Haberman defended against a critique that their previous work (which informed Feinberg and Wainer’s proposed Value Added Ratio (VAR) metric) indicated that subscores should never be reported when examining changes across administrations. Sinharay and Haberman explained that in these cases, one should examine the suitability of the change scores, not the subscores themselves. One may then find that the change scores are suitable for reporting.

A change score is the difference in scores from one administration to the next. If a participant gets a subscore of 12 on their first assessment and a subscore of 30 on their next assessment, their change score for that topic is 18. This can then be thought of as the subscore of interest, and one can then evaluate whether or not this change score is suitable for reporting.

Change scores are also used to determine if a change in scores is statistically significant for a group of participants. If we want to know whether a group of participants is performing statistically better on an assessment after completing training (at a total score or subscore level), we do not compare average scores on the two tests. Instead, we look to see if the group’s change scores across the two tests are significantly greater than zero. This is typically analyzed with a dependent samples t-test.
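
For illustration, a dependent samples t-test on change scores might look like this in Python. The pre- and post-training subscores are invented, and the `alternative` argument (which makes the test one-sided) requires a reasonably recent version of SciPy:

```python
import numpy as np
from scipy import stats

# Invented pre- and post-training subscores for the same ten participants
pre = np.array([12, 15, 9, 20, 14, 11, 16, 13, 10, 18])
post = np.array([18, 17, 14, 25, 16, 15, 21, 15, 13, 22])

change_scores = post - pre

# Dependent samples t-test: are the change scores significantly greater than zero?
t_stat, p_value = stats.ttest_rel(post, pre, alternative="greater")
print(f"mean change = {change_scores.mean():.1f}, t = {t_stat:.2f}, p = {p_value:.4f}")
```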

The reliability, orthogonality, and significance of changes in subscores are statistical concerns, but scores must be interpretable and actionable to make a claim about the validity of the assessment. This raises the concern of domain representation. Even if the statistics are fine, a subscore cannot be meaningful if the items do not sufficiently represent the domain they are supposed to measure. Making an inference about a participant’s ability in a topic based on only four items is preposterous—you do not need to know anything about statistics to come to that conclusion.

To address the concern of domain representation, high-stakes assessment programs that report subscores will typically set a minimum for the number of items that are needed to sufficiently represent a topic before a subscore is reported. For example, one program I worked for required (perhaps somewhat arbitrarily) a minimum of eight items in a topic before generating a subscore. If this domain representation criterion is met, one can presumably use methods like the VAR to then determine if the subscores meet the statistical criteria for reporting.

4 Ways to Identify a Content Breach

Posted by Austin Fossey

In my last post, I discussed five ways you can limit the use of breached content so that a person with unauthorized access to your test content will have limited opportunities to put that information to use; however, those measures only control the problem of a content breach. Our next goal is to identify when a content breach has occurred so that we can remedy the problem through changes to the assessment or disciplinary actions against the parties involved in the breach.


Interested in learning about item analysis or how to take your test planning to the next level? I will be presenting a series of workshops at Questionmark Conference 2016: Shaping the Future of Assessment in Miami, April 12-15.

Channel for Reporting

In most cases, you (the assessment program staff) will not be the first to find out that content has been stolen. You are far more likely to learn about the problem through a tip from another participant or stakeholder. One of the best things your organization can do to identify a content breach is to have a clear process for letting people report these concerns, as well as a detailed policy for what to do if a breach is found.

For example, you may want to have a disciplinary policy to address the investigation process, potential consequences, and an appeals process for participants who allegedly gained unauthorized access to the content (even if they did not pass the assessment). You may want to have legal resources lined up to help address non-participant parties who may be sharing your assessment content illegally (e.g., so-called “brain dump” sites). Finally, you should have an internal plan in place for what you will do if content is breached. Do you have backup items that can be inserted in the form? Can you release an updated form ahead of your republishing schedule? Will your response be different depending on the extent of the breach?

Web Patrol Monitoring

Several companies offer a web patrol service that will search the internet for pages where your assessment content has been posted without permission. Some of these companies will even purchase unauthorized practice exams that claim to have your assessment content and look for item breaches within them. Some of Questionmark’s partners provide web patrol services.

Statistical Models

There are several publicly available statistical models that can be used to identify abnormalities in participants’ response patterns or matches between a response pattern and a known content breach, such as the key patterns posted on a brain-dump site. Several companies, including some of Questionmark’s partners, have developed their own statistical methods for identifying cases where a participant may have used breached content.

In their chapter in Educational Measurement (4th ed.), Allan Cohen and James Wollack explain that all of these models tend to explore whether the amount of similarity between two sets of responses can be explained by chance alone. For example, one could look for two participants who had similar responses, possibly suggesting collusion or indicating that one participant copied the other. One could also look for similarity between a participant’s responses and the keys given in a leaked assessment form. Models also exist for identifying patterns within groups, as might be the case when a teacher chooses to provide answers to an entire class.
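
The published indices are far more sophisticated than this, but the following toy Python sketch shows the underlying logic: count how often two participants chose identical responses and ask whether that much agreement is plausible under a crude chance baseline (here, uniform guessing over the answer options, which real models replace with carefully estimated per-item probabilities):

```python
import numpy as np
from scipy.stats import binom

def match_surprise(responses_a, responses_b, n_options=4):
    """Toy similarity check: how surprising is the number of identical
    responses if the two participants answered independently?"""
    responses_a = np.asarray(responses_a)
    responses_b = np.asarray(responses_b)
    matches = int(np.sum(responses_a == responses_b))
    n_items = len(responses_a)
    # Probability of at least this many matches under uniform random agreement
    p_value = binom.sf(matches - 1, n_items, 1.0 / n_options)
    return matches, p_value

# Example: two participants' answer choices (letters are illustrative)
a = list("ABCDABCDABCDABCDABCD")
b = list("ABCDABCDABCDABCDABCA")
print(match_surprise(a, b))  # (19, a very small probability)
```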

These models are a sophisticated way to look for breaches in content, but they are not foolproof. None of them prove that a participant was cheating, though they can provide weighty statistical evidence. Cohen and Wollack warn that several of the most popular models have been shown to suffer from liberal or conservative Type I error rates, though new models continue to improve in this area.

Item Drift

When considering content breaches, you might also be interested in cases where an item appears to become easier (or harder) for everyone over time. Consider a situation where your participant population has global access to information that changes how they respond to an item. This could be for some unsavory reasons (e.g., a lot of people stole your content), or it could be something benign, like a newsworthy event that caused your population to learn more about content related to your assessment. In these cases, you might expect certain items to become easier for everyone in the population.

To detect whether an item is becoming easier over time, we do not use the p-value from Classical Test Theory. Instead, we use Item Response Theory (IRT) and a differential item functioning (DIF) analysis to detect item drift, which is a change in an item’s IRT parameters over time. This is done with the likelihood ratio test that Thissen, Steinberg, and Wainer detailed in Test Validity. Creators of IRT assessments use item parameter drift analyses to see whether an item has become easier over time, and this information helps test developers make decisions about cycling items out of production or planning new calibration studies.
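
Thissen, Steinberg, and Wainer’s test compares the likelihoods of IRT models in which an item’s parameters are constrained to be equal across groups versus allowed to differ. As a rough, simplified stand-in for that idea (not the published procedure), one could compare logistic models for a single item with and without a term for the administration period, using a rest score or theta estimate as the ability proxy:

```python
import numpy as np
import statsmodels.api as sm
from scipy.stats import chi2

def drift_lr_test(correct, ability, period):
    """Illustrative likelihood-ratio check for drift on a single item.
    correct: 0/1 responses to the item; ability: an ability proxy
    (e.g., rest score or IRT theta); period: 0 = earlier window, 1 = later.
    Compares a model in which the item behaves the same in both periods
    with one in which its difficulty is allowed to shift."""
    X0 = sm.add_constant(np.column_stack([ability]))
    X1 = sm.add_constant(np.column_stack([ability, period]))
    fit0 = sm.Logit(correct, X0).fit(disp=0)
    fit1 = sm.Logit(correct, X1).fit(disp=0)
    lr = 2 * (fit1.llf - fit0.llf)
    p_value = chi2.sf(lr, df=1)
    return lr, p_value
```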

Interested in learning about item analysis or how to take your test planning to the next level? I will be presenting a series of workshops at Questionmark Conference 2016: Shaping the Future of Assessment in Miami, April 12-15. I look forward to seeing you there! Click here to register and learn more about this important learning event.