Six tips to increase reliability in competence tests and exams

Posted by John Kleeman

Reliability (how consistent an assessment is in measuring something) is a vital criterion on which to judge a test, exam or quiz. This blog post explains what reliability is and why it matters, and gives a few tips on how to increase it when using competence tests and exams within regulatory compliance and other work settings.

What is reliability?

Picture of a kitchen scale

An assessment is reliable if it measures the same thing consistently and reproducibly.

If you were to deliver an assessment with high reliability to the same participant on two occasions, you would be very likely to reach the same conclusions about the participant’s knowledge or skills. A test with poor reliability might result in very different scores across the two instances.

It’s useful to think of a kitchen scale. If the scale is reliable, then when you put a bag of flour on the scale today and the same bag of flour on it tomorrow, it will show the same weight both times. But if the scale is not working properly and is not reliable, it could give you a different weight each time.

Why does reliability matter?

Just like a kitchen scale that doesn’t work, an unreliable assessment does not measure anything consistently and cannot provide a trustworthy measure of competence.

As well as being reliable, it’s also important that an assessment is valid, i.e. that it measures what it is supposed to measure. Continuing the kitchen scale metaphor, a scale might consistently show the wrong weight; in such a case, the scale is reliable but not valid. To learn more about validity, see my earlier post Six tips to increase content validity in competence tests and exams.

How can you increase the reliability of your assessments?

Here are six practical tips to help increase the reliability of your assessment:

  1. Use enough questions to assess competence. Although you need a sensible balance to avoid tests being too long, reliability increases with test length. In their excellent book, Criterion-Referenced Test Development, Shrock and Coscarelli suggest a rule of thumb of 4-6 questions per objective, with more for critical objectives. You can also get guidance from an earlier post on this blog, How many questions do I need on my assessment?
  2. Have a consistent environment for participants. For test results to be consistent, it’s important that the test environment is consistent – try to ensure that all participants have the same amount of time to take the test and a similar environment in which to take it. For example, if some participants are taking the test in a hurry in a public and noisy place and others are taking it at leisure in their office, this could affect reliability.
  3. Ensure participants are familiar with the assessment user interface. If a participant is new to the user interface or the question types, then they may not show their true competence due to the unfamiliarity. It’s common to provide practice tests to participants to allow them to become familiar with the assessment user interface. This can also reduce test anxiety which also influences reliability.
  4. If using human raters, train them well. If you are using human raters, for example in grading essays or in observational assessments that check practical skills, make sure to define your scoring rules very clearly and as objectively as possible. Train your observers/raters, review their performance, give practice sessions and provide exemplars.
  5. Measure reliability. There are a number of ways of doing this, but the most common is to calculate what is called “Cronbach’s Alpha”, which measures internal consistency reliability (the higher it is, the better). It’s particularly useful if all questions on the assessment measure the same construct. You can easily calculate this for Questionmark assessments using our Test Analysis Report; a minimal sketch of the calculation also appears after this list.
  6. Conduct regular item analysis to weed out ambiguous or poor performing questions. Item analysis is an automated way of flagging weak questions for review and improvement. If questions are developed through sound procedures, and so are well crafted and unambiguously worded, they are more likely to discriminate well and so contribute to a reliable test. Running regular item analysis is the best way to identify poorly performing questions. If you want to learn more about item analysis, I recently gave a webinar on “Item Analysis for Beginners”, and you can access the recording of this here.
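If you’d like to see what goes into these measures, here is a minimal sketch (in Python, and not Questionmark’s own implementation) of how the Cronbach’s Alpha mentioned in tip 5 can be computed from a matrix of item scores, together with the Spearman-Brown formula that underlies tip 1’s point that reliability increases with test length. The response data and function names are invented for illustration.

```python
import numpy as np

def cronbach_alpha(scores):
    """Cronbach's alpha for a participants x items matrix of item scores."""
    scores = np.asarray(scores, dtype=float)
    k = scores.shape[1]                          # number of items
    item_vars = scores.var(axis=0, ddof=1)       # variance of each item
    total_var = scores.sum(axis=1).var(ddof=1)   # variance of total scores
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

def spearman_brown(alpha, length_factor):
    """Predicted reliability if the test were made length_factor times longer."""
    return (length_factor * alpha) / (1 + (length_factor - 1) * alpha)

# Tiny illustrative example: 5 participants, 4 dichotomously scored questions (1 = correct)
responses = np.array([
    [1, 1, 1, 0],
    [1, 0, 1, 0],
    [1, 1, 1, 1],
    [0, 0, 1, 0],
    [1, 1, 0, 1],
])
alpha = cronbach_alpha(responses)
print(f"Cronbach's alpha: {alpha:.2f}")
print(f"Predicted alpha if the test were doubled in length: {spearman_brown(alpha, 2):.2f}")
```

A real assessment would of course involve many more participants and questions; the point is only to show which quantities feed into the statistic.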


I hope this blog post reminds you why reliability matters and gives some ideas on how to improve reliability. There is lots more information on how to improve reliability and write better assessments on the Questionmark website – check out our resources at www.questionmark.com/learningresources.

Six predictions now the GDPR is in place

Posted by John Kleeman
So the European GDPR is in place now. Questionmark, like most other companies, has been working hard over the last two years to ensure that we are compliant and that our customers in and outside Europe can be compliant with the GDPR. See our trust center or summary for information on Questionmark’s compliance.

Is it all done and dusted? My email inbox seems to have a few fewer promotional emails in it. But is that because of the holiday weekend, or have companies really taken my name off their mailing lists? Here are six predictions for what we’ll see going forwards with the GDPR.

1. The May 25th 2018 date will matter much less going forwards than backwards

A picture of a dog with a Christmas hat

Companies have been rushing to meet the May 25th date, but GDPR compliance and privacy are a journey, not a destination. There is a famous slogan, “a dog is for life, not just for Christmas”, encouraging people to look after their dog and not just buy it as a cute puppy. Similarly, the GDPR is not just something you get compliant with and then ignore. You need to include privacy and compliance in your processes forever.

No one will care much whether you were compliant on May 25th 2018. But everyone will care whether you are meeting their privacy needs and following the law when they interact with you.

2. History will judge the GDPR as a watershed moment where privacy became more real

Nevertheless, I do think that history will judge the GDPR as a seminal moment for privacy. Back in the early 2000s, Microsoft popularized the concepts of security by design and security by default when they delayed all their products for a year as they improved their security. Nowadays almost everyone builds security into their systems and makes it the default, because you have to in order to survive.

Similarly, the GDPR encourages us to think about privacy when we design products and to make privacy the default, not an afterthought. For example, when we collect data, we should plan how long to keep it and how to erase it later. I suspect that in ten years’ time, privacy by design will be as commonplace as security by design – and the GDPR will be the key reason it became popular.
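To make that “plan how long to keep it and how to erase it later” idea concrete, here is a minimal, hypothetical sketch of privacy by design in code: each record of personal data carries its collection date and a planned retention period, and a purge step erases anything past its window. The class, field names and retention period are invented for the example and are not from any particular system.

```python
from dataclasses import dataclass
from datetime import date, timedelta

@dataclass
class PersonalRecord:
    participant_id: str
    collected_on: date
    retention_days: int  # decided at design time, when the data is collected

    def expired(self, today: date) -> bool:
        return today > self.collected_on + timedelta(days=self.retention_days)

def purge_expired(records: list[PersonalRecord], today: date) -> list[PersonalRecord]:
    """Return only the records still within their retention period."""
    return [r for r in records if not r.expired(today)]

records = [
    PersonalRecord("p-001", date(2016, 1, 10), retention_days=365),
    PersonalRecord("p-002", date(2018, 4, 1), retention_days=365),
]
print(purge_expired(records, today=date(2018, 5, 25)))  # only p-002 survives
```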

3. Many other jurisdictions will adopt GDPR-like laws

Although the GDPR is over-complex, it contains some great concepts that I’m sure other countries will adopt. It is appropriate that organizations have to take care when processing people’s data. It is appropriate that when you pass people’s data on to a third party, there should be safeguards. And if you breach that data, it is appropriate that you are held accountable.

We can expect lawmakers in other countries to make GDPR-like laws.

4. Supply chain management will become more important

Diagram showing one data controller with two data processors; one data processor has two sub-processors and the other has one sub-processor

Under the GDPR, a Data Controller contracts with Data Processors and those Data Processors must disclose their Sub-processors (sub-contractors). There is positive encouragement to choose expert Data Processors and Sub-processors, and there are consequences if processors fail their customers. This will encourage organizations to choose reputable suppliers and to review processors down the chain to make sure that everyone is following the rules. Choosing suppliers and Sub-processors that get themselves audited for security, e.g. under ISO 27001, is going to become more commonplace.

This will mean that some suppliers who do not have good enough processes in place for security, privacy and reliability will struggle to survive.

5. People will be the biggest cause of compliance failures

Organizations set up processes and procedures and put in place systems and technology to run their operations, but people are needed to design and run those processes and technology. Some GDPR compliance failures are going to be down to technology failures, but I predict the majority will be down to people. People will make mistakes or judgement errors and cause privacy and GDPR breaches.

If you are interested in this subject, Amanda Maguire of SAP and I gave a webinar last week entitled “GDPR is almost here – are your people ready?”, which should shortly be available to view on the SAP website. The message we shared is that if you want to stay compliant with the GDPR, you need to check that your people know what to do with personal data. Testing them regularly is a good way of checking their knowledge and understanding.

6. The GDPR and privacy concerns will encourage more accurate assessments

Last but not least, I think that the GDPR will encourage people to expect more accurate and trustworthy tests and exams. The GDPR requires that we pay attention to the accuracy of personal data: “every reasonable step must be taken to ensure that personal data that are inaccurate … are erased or rectified without delay”.

There is a strong argument that this means that if someone creates a test or exam to measure competence, the assessment should be accurate in what it claims to measure. So it needs to be authored using appropriate procedures to make it valid, reliable and trustworthy. If someone takes an assessment which is invalid or unfair, and fails it, they might reasonably argue that the results are not an accurate indication of their competence and so that personal data is inaccurate and needs correcting.

For some help on how you can make more accurate assessments, check out Questionmark white papers at www.questionmark.com/learningresources including “Assessment Results You Can Trust”.


Washington DC – OnDemand for Government Briefing Recap

Posted by Kristin Bernor

Last Thursday, May 17, Questionmark hosted a briefing in Washington, DC that powerfully presented the journey to delivering the Questionmark OnDemand for Government assessment management system to our government customers. We would like to thank the industry experts who made this possible, including speakers from the Department of State, FedRAMP, Microsoft and Schellman. In just 3 1/2 hours, they presented the comprehensive process of bringing the Questionmark OnDemand for Government system to market. This new government community cloud-based service, dedicated to the needs of U.S. governmental and defense agencies, is currently wrapping up the onsite portion of its audit. The auditors will be finalizing their testing and review, culminating in an assessment report. The complete security package is expected to be available in July. Questionmark OnDemand for Government is designed to be compliant with FedRAMP and hosted in a FedRAMP certified U.S. data center.

Highlights from the briefing included:

  • Eric Shepherd, CEO of Questionmark, hosted the event.
  • Ted Stille, Department of State, discussed the agency’s motivations and experience as project sponsor for Questionmark OnDemand for Government.
  • Stacy Poll and David Hunt, Public Sector Business Development Manager and Information Security Officer of Questionmark respectively, presented a system overview including demonstration screens, migration paths and detailed next steps to plan for implementation.
  • Christina McGhee, Schellman audit team, spoke about the 3PAO role in the FedRAMP authorization process.
  • Zaree Singer and Laurie Southerton, FedRAMP PMO Support, explained the FedRAMP ATO approval process.
  • Ganesh Shenbagaraman, Microsoft, discussed Microsoft Azure’s government cloud service.

This unique opportunity to learn about the OnDemand for Government assessment management system, meet with peers and other customers, and hear directly from FedRAMP and our 3PAO proved invaluable to attendees, who rated the briefing nearly 5 out of 5.

Please reach out to Stacy Poll at stacy.poll@questionmark.com for more information.

Judgement is at the Heart of nearly every Business Scandal: How can we Assess it?

Posted by John Kleeman
How does an organization protect itself from serious mistakes and resultant corporate fines?

An excellent Ernst & Young report on risk reduction explains that an organization needs rules and that they are immensely important in defining the parameters in which teams and individuals operate. But the report suggests that rules alone are not enough; what matters is how people apply them when making decisions. Culture is a key part of such decision making, and ultimately, when things go wrong, “judgement is at the heart of nearly every business scandal that ever occurred”.

Clearly judgement is important for almost every job role, not just to prevent scandals but to improve results. But how do you measure it? Is it possible to test individuals to identify how they would react in dilemmas and what judgement they would apply? And is it possible to survey an organization to discover what people think their peers would do in difficult situations? One answer to these questions is that you can use Situational Judgement Assessments (SJAs) to measure judgement, both for individuals and across an organization.

Questionmark has published a white paper on Situational Judgement Assessments, written by Eugene Burke and me. The white paper describes how to assess judgement, where you:

  1. Identify job roles and competencies or aspects of that role in your organization or workforce where judgement is important.
  2. Identify dilemmas which are relevant to your organization, each of which requires a choice to be made, where that choice is linked to the relevant job role.
  3. Build questions based on the dilemmas which ask someone to select from the choices – SJA (Situational Judgement Assessment) questions.

There are two ways of presenting such questions, either to survey someone or to assess individuals on their judgement.

  • You can present the dilemma and survey your workforce on how they think others would behave in such a situation. For example: “Rate how you think people in the organization are likely to behave in a situation like this. Use the following scale to rate each of the options below: 1 = Very Unlikely 2 = Unlikely 3 = Neutral 4 = Likely 5 = Very Likely”.
  • You can present the dilemma and test individuals on what they personally would do in such a situation, for example as shown in the screenshot below.

You work in the back office in the team approving new customers, ensuring that the organization’s procedures have been followed (such as credit rating and know your customer). Your manager is away on holiday this week. A senior manager in the company (three levels above you) comes into your office and says that there is an important new customer who needs to be approved today. They want to place a big order, and he can vouch that the customer is good. You review the customer details, and one piece of information required by your procedures is not present. You tell the senior manager and he says not to worry, he is vouching for the customer. You know this senior manager by reputation and have heard that he got a colleague fired a few months ago when she didn’t do what he asked. You would:

A. Take the senior manager’s word and approve the customer
B. Call your manager’s cellphone and interrupt her holiday to get advice
C. Tell the manager you cannot approve the customer without the information needed
D. Ask the manager for signed written instructions to override standard procedures to allow you to approve the customer

You can see this question “live” with other examples of SJA questions in one of our example assessments on the Questionmark website at www.questionmark.com/go/example-sja.

Once you deliver such questions, you can easily report on the results segmented by attributes of participants (such as business function, location and seniority, as well as demographics such as age, gender and tenure), as in the minimal sketch below. Such reports can help indicate whether compliance will be acted on in the workplace, evaluate where compliance professionals need to focus their efforts and measure whether compliance programs are gaining traction.
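As a rough illustration of that kind of segmented reporting (a sketch, not the Questionmark reports themselves), the snippet below aggregates hypothetical SJA survey ratings by business function using pandas; all column names and data are invented for the example.

```python
import pandas as pd

# Hypothetical SJA survey results: one row per participant per dilemma option,
# with the 1-5 likelihood rating described above.
responses = pd.DataFrame({
    "business_function": ["Sales", "Sales", "Back office", "Back office", "Finance"],
    "dilemma": ["Approve customer"] * 5,
    "option": ["A", "C", "C", "D", "A"],
    "rating": [4, 2, 5, 4, 3],  # 1 = Very Unlikely ... 5 = Very Likely
})

# Average rating for each option within each business function
summary = (
    responses
    .groupby(["business_function", "dilemma", "option"])["rating"]
    .mean()
    .reset_index(name="mean_rating")
)
print(summary)
```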

SJAs can be extremely useful as a tool in a compliance programme to reduce regulatory risk. If you’re interested in learning more about SJAs, read Questionmark’s white paper “Assessing for Situational Judgment”, available free (with registration) at https://www.questionmark.com/sja-whitepaper.

Special Briefing: Cloud-based Assessment Management for Government and Defense Agencies

Posted by Kristin Bernor

In just two weeks, the special briefing about Questionmark OnDemand for Government, a new cloud-based service dedicated to the needs of U.S. governmental and defense agencies, takes place. You don’t want to miss it!

Join us on Thursday, May 17th in Washington, DC, to learn about how this new service enables agencies to securely author, deliver and analyze assessments, and hear from dynamic speakers including:

  • Jono Poltrack, contributor to the Sharable Content Object Reference Model (SCORM) while at Advanced Distributed Learning (ADL)
  • Ted Stille, Department of State, will discuss the agency’s motivations and experience as project sponsor for Questionmark OnDemand for Government
  • Christina McGhee, Schellman audit team, will discuss the 3PAO role in the FedRAMP authorization process
  • Zaree Singer, FedRAMP PMO Support, will explain the FedRAMP ATO approval process
  • Ganesh Shenbagaraman, Microsoft, will discuss Microsoft Azure’s government cloud service

Space is limited, so register today!

Questionmark has finalized its FedRAMP System Security Plan. This plan, which documents our security systems and processes, is now being reviewed by an accredited FedRAMP Third Party Assessment Organization (3PAO); this means that we are officially in audit. Once audited, the plan becomes part of the FedRAMP library for Security Officers to review when providing individual agencies with an “Authorization to Operate” (ATO). Note: Briefing attendees will be eligible to receive a pre-release copy of the FedRAMP System Security Plan.

Questionmark is widely deployed by U.S. governmental and defense agencies to author, deliver and report on high-stakes advancement exams, post-course tests for distance learning, job task analysis, medical training and education, competency testing, course evaluations and more. For government agencies currently using the installed, on-premises Questionmark Perception, OnDemand for Government provides a cost-effective option to upgrade to a secure, best-in-class cloud-based assessment management system.

We look forward to seeing you in Washington for a morning of learning and networking!

Item Analysis for Beginners – Getting Started

Posted by John Kleeman
Do you use assessments to make decisions about people? If so, then you should regularly run Item Analysis on your results.  Item Analysis can help find questions which are ambiguous, mis-keyed or which have choices that are rarely chosen. Improving or removing such questions will improve the validity and reliability of your assessment, and so help you use assessment results to make better decisions. If you don’t use Item Analysis, you risk using poor questions that make your assessments less accurate.

Sometimes people can be fearful of Item Analysis because they are worried it involves too much statistics. This blog post introduces Item Analysis for people who are unfamiliar with it, and I promise no maths or stats! I’m also giving a free webinar on Item Analysis with the same promise.

An assessment contains many items (another name for questions) as figuratively shown below. You can use Item Analysis to look at how each item performs within the assessment and flag potentially weak items for review. By keeping only stronger questions in the assessment, the assessment will be more effective.

Picture of a series of items with one marked as being weak

Item Analysis looks at the performance of all your participants on the items, and calculates how easy or hard people find each item (“item difficulty” or “p-value”) and how well the scores on each item correlate with, i.e. show a relationship with, the scores on the assessment as a whole (“item discrimination” or correlation). A minimal sketch of these calculations appears after the list below. Some of the problematic questions that Item Analysis can identify are:

  • Questions almost all participants get right, and so which are very easy. You might want to review these to see if they are appropriate for the assessment. See my earlier post Item Analysis for Beginners – When are very Easy or very Difficult Questions Useful? for more information.
  • Questions which are difficult, where a lot of participants get the question wrong. You should check such questions in case they are mis-keyed or ambiguous.
  • Multiple choice questions where some choices are rarely picked. You might want to improve such questions to make the wrong choices more plausible.
  • Questions where there is a poor correlation between getting the question right and doing well on the assessment as a whole. For example, Item Analysis will flag questions that high-performing participants perform poorly on. You should look at such questions in case they are ambiguous, mis-keyed or off-topic.
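To make the two statistics above concrete, here is a minimal sketch showing how item difficulty (p-value) and item discrimination (item-total correlation) can be computed from a matrix of scored responses, with an illustrative flagging rule. The thresholds and data are examples only, not the logic used in Questionmark’s report.

```python
import numpy as np

# Rows = participants, columns = items; 1 = correct, 0 = incorrect
responses = np.array([
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [1, 1, 1, 1],
    [0, 1, 0, 0],
    [1, 1, 0, 1],
    [1, 0, 1, 1],
])

total_scores = responses.sum(axis=1)

for item in range(responses.shape[1]):
    item_scores = responses[:, item]
    p_value = item_scores.mean()  # item difficulty: proportion answering correctly

    # Item discrimination: correlation between the item score and the total
    # score on the rest of the test (excluding the item itself)
    rest_scores = total_scores - item_scores
    discrimination = np.corrcoef(item_scores, rest_scores)[0, 1]

    # Illustrative review flags only
    flags = []
    if p_value > 0.9:
        flags.append("very easy")
    if p_value < 0.3:
        flags.append("very hard - check keying/ambiguity")
    if discrimination < 0.2:
        flags.append("low discrimination")

    print(f"Item {item + 1}: p = {p_value:.2f}, "
          f"discrimination = {discrimination:.2f} {flags}")
```

In practice you would run this kind of analysis over far more participants, but the same two quantities drive the review flags described above.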

There is a huge wealth of information available in an Item Analysis report, and assessment experts will delve into the report in detail. But much of the key information in an Item Analysis report is useful to anyone creating and delivering quizzes, tests and exams.

The Questionmark Item Analysis report includes a graph which shows the difficulty of items plotted against their discrimination, as in the example below. It flags questions by marking them amber or red if they fall into categories which may need review. For example, in the illustration below, four questions are marked in amber as having low discrimination and so are potentially worth looking at.

Illustration of Questionmark item analysis report showing some questions green and some amber

If you are running an assessment program and not using Item Analysis regularly, then this throws doubt on the trustworthiness of your results. By using it to identify and improve weak questions, you should be able to improve your validity and reliability.

Item Analysis is surprisingly effective in practice. I’m one of the team responsible at Questionmark for managing our data security test which all employees have to take annually to check their understanding of information security and data protection. We recently reviewed the test and ran Item Analysis and very quickly found a question with poor stats where the technology had changed but we’d not updated the wording, and another question where two of the choices could be considered right, which made it hard to answer. It made our review faster and more effective and helped us improve the quality of the test.

If you want to learn a little more about Item Analysis, I’m running a free webinar on the subject “Item Analysis for Beginners” on May 2nd. You can see details and register for the webinar at https://www.questionmark.com/questionmark_webinars. I look forward to seeing some of you there!