Item Analysis for Beginners – Getting Started

Do you use assessments to make decisions about people? If so, you should regularly run Item Analysis on your results. Item Analysis can help find questions which are ambiguous or mis-keyed, or which have choices that are rarely chosen. Improving or removing such questions will improve the validity and reliability of your assessment, and so help you make better decisions from its results. If you don’t use Item Analysis, you risk relying on poor questions that make your assessments less accurate.

Some people are wary of Item Analysis because they worry it involves too much statistics. This blog post introduces Item Analysis for people who are unfamiliar with it, and I promise no maths or stats! I’m also giving a free webinar on Item Analysis with the same promise.

An assessment contains many items (another name for questions), as figuratively shown below. You can use Item Analysis to look at how each item performs within the assessment and flag potentially weak items for review. By keeping only the stronger questions, you make the assessment more effective.

Picture of a series of items with one marked as being weak

Item Analysis looks at the performance of all your participants on the items, and calculates how easy or hard people find each item (“item difficulty” or “p-value”) and how well scores on each item correlate with scores on the assessment as a whole (“item discrimination” or correlation). Some of the problematic questions that Item Analysis can identify are listed below, followed by a short sketch of how these statistics can be calculated:

  • Questions almost all participants get right, and so which are very easy. You might want to review these to see if they are appropriate for the assessment. See my earlier post Item Analysis for Beginners – When are very Easy or very Difficult Questions Useful? for more information.
  • Questions which are difficult, where a lot of participants get the question wrong. You should check such questions in case they are mis-keyed or ambiguous.
  • Multiple choice questions where some choices are rarely picked. You might want to improve such questions to make the wrong choices more plausible.
  • Questions where there is a poor correlation between getting the question right and doing well on the assessment overall. For example, Item Analysis will flag questions that high-performing participants tend to get wrong. You should look at such questions in case they are ambiguous, mis-keyed or off-topic.
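For those who are curious how these statistics are worked out (and will forgive a brief break in my no-stats promise), here is a minimal sketch in Python. The response matrix, the thresholds and the flagging rule are all made-up assumptions for illustration, not how Questionmark’s report does it:

```python
import numpy as np

# Hypothetical response matrix: one row per participant, one column per item,
# 1 = correct, 0 = incorrect. Real data would come from your assessment results.
responses = np.array([
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [0, 1, 0, 1],
    [1, 1, 1, 1],
    [0, 0, 0, 0],
    [1, 1, 0, 1],
])

# Item difficulty (p-value): the proportion of participants answering each item correctly.
difficulty = responses.mean(axis=0)

# Item discrimination: correlation between each item and the total score on the
# rest of the test (a corrected item-total correlation).
totals = responses.sum(axis=1)
discrimination = np.array([
    np.corrcoef(responses[:, i], totals - responses[:, i])[0, 1]
    for i in range(responses.shape[1])
])

# Illustrative flagging thresholds -- choose your own in practice.
for i, (p, r) in enumerate(zip(difficulty, discrimination), start=1):
    flag = "  <- review?" if p > 0.90 or p < 0.20 or r < 0.20 else ""
    print(f"Item {i}: difficulty={p:.2f}, discrimination={r:.2f}{flag}")
```

In practice you would run something like this on your full results export, and pick thresholds that suit your own assessment.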

There is a wealth of information in an Item Analysis report, and assessment experts will delve into it in detail. But much of the key information is useful to anyone creating and delivering quizzes, tests and exams.

The Questionmark Item Analysis report includes a graph which plots the difficulty of items against their discrimination, as in the example below. It flags questions by marking them amber or red if they fall into categories which may need review. For example, in the illustration below, four questions are marked in amber as having low discrimination and so are potentially worth looking at.

Illustration of Questionmark item analysis report showing some questions green and some amber
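If you would like a rough do-it-yourself version of this kind of graph (not the Questionmark report itself), a few lines of Python with matplotlib will plot difficulty against discrimination. The values and the amber threshold below are invented for illustration:

```python
import matplotlib.pyplot as plt

# Hypothetical difficulty (p-value) and discrimination values for six items,
# e.g. as computed in the earlier sketch. These numbers are made up for illustration.
difficulty     = [0.95, 0.80, 0.72, 0.65, 0.55, 0.35]
discrimination = [0.05, 0.32, 0.10, 0.41, 0.28, 0.12]

# Simple illustrative rule: items with low discrimination are coloured amber for review.
colours = ["orange" if r < 0.20 else "green" for r in discrimination]

plt.scatter(difficulty, discrimination, c=colours)
plt.xlabel("Item difficulty (p-value)")
plt.ylabel("Item discrimination (item-total correlation)")
plt.title("Difficulty vs. discrimination (illustrative)")
plt.show()
```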

If you are running an assessment program and not using Item Analysis regularly, this throws doubt on the trustworthiness of your results. By using it to identify and improve weak questions, you should be able to improve the validity and reliability of your assessments.

Item Analysis is surprisingly effective in practice. I’m one of the team at Questionmark responsible for managing our data security test, which all employees have to take annually to check their understanding of information security and data protection. We recently reviewed the test and ran Item Analysis. Very quickly we found one question with poor statistics where the technology had changed but we’d not updated the wording, and another where two of the choices could be considered right, which made it hard to answer. Item Analysis made our review faster and more effective and helped us improve the quality of the test.

If you want to learn a little more about Item Analysis, I’m running a free webinar on the subject “Item Analysis for Beginners” on May 2nd. You can see details and register for the webinar at https://www.questionmark.com/questionmark_webinars. I look forward to seeing some of you there!

 

Can you be GDPR compliant without testing your employees?

Posted by John Kleeman

The GDPR is a new extra-territorial data protection law which imposes obligations on anyone who processes personal data about European residents. It impacts companies with employees in Europe, awarding bodies and test publishers who test candidates in Europe, universities and colleges with students in Europe and many others. Many North American and other non-European organizations will need to comply.

See my earlier post How to use assessments for GDPR compliance for an introduction to GDPR. The question this blog post addresses is whether it’s practical for a large organization to be compliant with the GDPR without giving tests and assessments to its employees.

I’d argue that for most organizations with hundreds or thousands of employees, you will need to test your employees on your policies and procedures for data protection and the GDPR. Putting it simply, if you don’t and your people make mistakes, fines are likely to be higher.

Here are four things the GDPR law says (I’ve paraphrased the language and linked to the full text for those interested):


1. Organizations must take steps to ensure that everyone who works for them only processes personal data based on proper instructions. (Article 32.4)

2. Organizations must conduct awareness-raising and training of staff who process personal data (Article 39.1). This is extended to include “monitoring training” for some organizations in Article 47.2.

3. Organizations must put in place risk-based security measures to ensure confidentiality and integrity and must regularly test, assess and evaluate the effectiveness of these measures. (Article 32.1)

4. If you don’t follow the rules, you could be fined up to 20 million Euros or 4% of annual worldwide turnover. How well you’ve implemented the measures in Article 32 (including those above) will affect how big these fines might be. (Article 83.2d)


So let’s join up the dots.

Firstly, a large company has to ensure that everyone who works for it only processes personal data based on proper instructions. Since personal data, processing and instructions each have particular meanings under the GDPR, training is needed to help people understand them. You could just train and not test, but given that the concepts are not simple, it seems sensible to test or otherwise check understanding.

A company is required to train its employees under Article 39, but for most companies the requirement in Article 32 is stronger. For most large organizations, the risk of employees making mistakes and the risk of insider threat to confidentiality and integrity is considerable, so you have to put in place training and other security measures to reduce this risk. Given that you have to regularly assess and evaluate the effectiveness of these measures, it is hard to envisage an efficient way of doing this without testing your personnel. Delivering regular online tests or quizzes to your employees is the obvious way to check that training has been effective and that your people know, understand and can apply your processes and procedures.

Lastly, imagine your company makes a mistake and one of your employees causes a breach of personal data or commits another infraction under the GDPR. How are you going to show that you took all the steps you could to minimize the risk? An obvious question is whether you did your best to train that employee in good practice and in your processes and procedures. If you didn’t train, it’s hard to argue that you took the proper steps to be compliant. But even if you trained, a regulator will ask how you are evaluating the effectiveness of your training. As a regulator in another context has stated:

“where staff understanding has not been tested, it is hard for firms to judge how well the relevant training has been absorbed”

So yes, you can imagine a way in which a large company might manage to be compliant with the GDPR without testing employees. There are other ways of checking understanding, for example 1:1 interviews, but they are very time consuming and hard to roll out in time for May 2018. Or you may be lucky and have personnel who don’t make mistakes! But for most of us, testing our employees on knowledge of our processes and procedures under the GDPR will be wise.

Questionmark OnDemand is a trustable, easy to use and easy to deploy system for creating and delivering compliance tests and assessments to your personnel. For more information on using assessments to help ensure GDPR compliance visit this page of our website or register for our upcoming webinar on 29 June.

Secrets to Measuring & Enhancing Learning Results: Webinar

Posted by Julie Delazyn

Research has shown that assessments play an important role in learning and retention — and the benefits vary before, during and after a learning experience. No matter where learning occurs, the goal remains the same: ensuring people have the knowledge, skills and abilities to perform well.

So, how can you use assessments to measure and enhance learning within your organization?

Check out our newest 30-minute webinar – and register today!

  • The Secrets to Measuring and Enhancing Learning Results
  • Date & Time: Wed, Dec 7 at 4:00 p.m. UK GMT / 11:00 a.m. US EST

Join us as we discuss the important role assessments play within the learning process and explore the benefits of using them before, during and after learning. We’ll also give you some useful pointers and resources to take away.

Register for the webinar now. We look forward to seeing you at the session!

The tips and tools you need to get the most out of your assessments [Webinars]

Posted by Chloe Mendonca

What’s the big deal about assessments anyway? Though assessments have been around for decades, the assessment and eLearning industry is showing no sign of slowing down. Organisations large and small are using a wide variety of assessment types to measure knowledge, skills, abilities, personality and more.

Join us for one of our upcoming 60-minute webinars and discover the tools, technologies and processes organisations are using worldwide to increase the effectiveness of their assessment programs.

How to transform recruitment and hiring with online testing

This webinar, presented by Dr. Glen Budgell, Senior Strategic HR Advisor at Human Resource Systems Group (HRSG), will discuss the importance and effectiveness of using online testing within HR. This is a must-attend event for anyone exploring the potential of online testing for improving recruitment.

How to Build a Highly Compliant Team in a Fast Moving Market

Organisations across highly regulated industries contend with both stringent regulatory requirements and the need for rigorous assessment programs. With life, limb, and livelihood on the line, safety and compliance require much more than “checking a box”. During this webinar, hosted by Questionmark and SAP, we will examine ways in which organisations can use online assessment to enhance and strengthen their compliance initiatives.

Introduction to Questionmark’s Assessment Management System

Join us for a live demonstration and learn how Questionmark’s online assessment platform provides organisations with the tools to efficiently develop and deliver assessments.

You can also catch this introductory webinar in Portuguese!

Conhecendo a Questionmark e seu Portal de Gestão de Avaliações [Portuguese]

 

5 Steps to Better Tests

Posted by Julie Delazyn

Creating fair, valid and reliable tests requires starting off right: with careful planning. With that foundation in place, you will save time and effort while producing tests that yield trustworthy results.

Five essential steps for producing high-quality tests:

1. Plan: What elements must you consider before crafting the first question? How do you identify key content areas?

2. Create: How do you write items that increase the cognitive load and avoid bias and stereotyping?

3. Build: How should you build the test form and set accurate pass/fail scores?

4. Deliver: What methods can be implemented to protect test content and discourage cheating?

5. Evaluate: How do you use item-, topic-, and test-level data to assess reliability and improve quality?

Download this complimentary white paper full of best practices for test design, delivery and evaluation.

 

An argument against using negative item scores in CTT

Posted by Austin Fossey

Last year, a client asked for my opinion about whether or not to use negative scores on test items. For example, if a participant answers an item correctly, they would get one point, but if they answer the item incorrectly, they would lose one point. This means the item would be scored dichotomously [-1,1] instead of in the more traditional way [0,1].

I believe that negative item scores are really useful if the goal is to confuse and mislead participants. They are not appropriate for most classical test theory (CTT) assessment designs, because they do not add measurement value, and they are difficult to interpret.

Interested in learning more about classical test theory and applying item analysis concepts? Join Psychometrician Austin Fossey for a free 75-minute online workshop — Item Analysis: Concepts and Practice — Tuesday, June 23, 2015. *Space is limited.

Measurement value of negative item scores

Changing the item scoring format from [0,1] to [-1,1] does not change anything about your ability to measure participants—after all, the dichotomous scores are just symbols. You are simply using a different total score scale.

Consider a 60-item assessment made up of dichotomously scored items. If the items are scored [0,1], the total score scale ranges from 0 to 60 points. If scored [-1,1], the score range doubles, now ranging from -60 to 60 points.

From a statistical standpoint, nothing has changed. The item-total discrimination statistics will be the same under both designs, as will the assessment’s reliability. The standard error of measurement will double, but that is to be expected because the score range has doubled. Thus there is no change in the precision of scores or misclassification rates. How you score the items does not matter as long as they are scored dichotomously on the same scale.

The figure below illustrates the score distributions for 1,000 normally distributed assessment scores that were simulated using WinGen. This sample’s item responses have been scored with three different models: [-1,1], [0,1], and [0,2]. While this shifts and stretches the distribution of scores onto different scales, there is no change in reliability or the standard error of measurement (as a percentage of the score range).

Distribution and assessment statistics for 1,000 simulated test scores with items dichotomously scored three ways: [-1,1], [0,1], and [0,2]
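If you want to check this invariance yourself, here is a minimal sketch that simulates dichotomous responses (a rough stand-in for the WinGen simulation; the simple logistic response model and all numbers are my own assumptions) and confirms that Cronbach’s alpha is identical when the same responses are scored [0,1] or [-1,1], while the standard error of measurement doubles along with the score range:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate 1,000 participants x 60 dichotomous items; the probability of a correct
# answer depends on participant ability and item difficulty (illustrative model only).
ability = rng.normal(size=(1000, 1))
item_difficulty = rng.normal(size=(1, 60))
p_correct = 1 / (1 + np.exp(-(ability - item_difficulty)))
right_wrong = (rng.random((1000, 60)) < p_correct).astype(float)   # 0/1 responses

def alpha_and_sem(item_scores):
    """Cronbach's alpha and the standard error of measurement for a score matrix."""
    k = item_scores.shape[1]
    total = item_scores.sum(axis=1)
    alpha = k / (k - 1) * (1 - item_scores.var(axis=0, ddof=1).sum() / total.var(ddof=1))
    sem = total.std(ddof=1) * np.sqrt(1 - alpha)
    return alpha, sem

scored_01 = right_wrong            # items scored [0, 1]
scored_neg = 2 * right_wrong - 1   # the same responses scored [-1, 1]

print(alpha_and_sem(scored_01))    # (alpha, SEM) on the 0-to-60 scale
print(alpha_and_sem(scored_neg))   # same alpha; the SEM is exactly doubled
```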

Interpretation issues of negative item scores

If the item scores do not make a difference statistically, and they are just symbols, then why not use negative scores? Remember that an item is a mechanism for collecting and quantifying evidence to support the student model, so how we score our items (and the assessment as a whole) plays a big role in how people interpret the participant’s performance.

Consider an item scored [0,1]. In a CTT model, a score of 1 represents accumulated evidence about the presence or magnitude of a construct, whereas a score of 0 suggests that no evidence was found in the response to this item.

Now suppose we took the same item and scored it [-1,1]. A score of 1 still suggests accumulated evidence, but now we are also changing the total score based on wrong answers. The interpretation is that we have collected evidence about the absence of the construct. To put it another way, the test designer is claiming to have positive evidence that the participant does not know something.

This is not an easy claim to make. In psychometrics, we can attempt to measure the presence of a hypothetical construct, but it is difficult to make a claim that a construct is not there. We can only make inferences about what we observe, and I argue that it is very difficult to build an evidentiary model for someone not knowing something.

Furthermore, negative scores negate evidence we have collected in other items. If a participant gets one item right and earns a point but then loses that point on the next item, we have essentially canceled out the information about the participant from a total score perspective. By using negative scores in a CTT model, we also introduce the possibility that someone can get a negative score on the whole test, but what would a negative score mean? This lack of interpretability is one major reason people do not use negative scores.

Consider a participant who answers 40 items correctly on the 60-item assessment I mentioned earlier. When scored [0,1], the raw score (40 points) corresponds to the number of correct responses provided by the participant. This scale is useful for calculating percentage scores (40/60 = 67% correct), setting cut scores, and supporting the interpretation of the participant’s performance.

When the same items are scored [-1,1], the participant’s score is more difficult to interpret. The participant answered 40 questions correctly, but they only get a score of 20 (40 correct minus 20 incorrect). They know the maximum score on the assessment is 60 points, yet their raw score of 20 corresponds to a correct response rate of 67%, not 33%, since 20 points sits 67% of the way along the range from -60 to 60 points.
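To make the arithmetic concrete, here is the same worked example in a few lines of Python, using the numbers from the paragraph above:

```python
n_items = 60
n_correct = 40
n_wrong = n_items - n_correct

score_01  = n_correct              # 40 points on the 0-to-60 scale
score_neg = n_correct - n_wrong    # 20 points on the -60-to-60 scale

pct_01  = score_01 / n_items                        # 40/60  = 0.67
pct_neg = (score_neg + n_items) / (2 * n_items)     # 80/120 = 0.67
print(pct_01, pct_neg)   # both come to the same 67% correct-response rate
```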

There are times when items need to be scored differently from other items on the assessment. Polytomous items clearly need different scoring models (though similar interpretive arguments could be leveled against people who try to score items in fractions of points), and there are times when an item may need to be weighted differently from other items. (We’ll discuss that in my next post.) Some assessments, like the SAT, use negative points on wrong answers to correct for guessing, but this should only be done if you can demonstrate improved model fit and you have a theory and evidence to justify doing so. In general, when using CTT, negative item scores only serve to muddy the water.

Interested in learning more about classical test theory and item statistics? Psychometrician Austin Fossey will be delivering a free 75-minute online workshop — Item Analysis: Concepts and Practice — Tuesday, June 23, 2015, 11:00 AM – 12:15 PM EDT. *Spots are limited.