Is Safe Harbor still safe for assessment data?

Posted by John Kleeman

A European legal authority advised last week that the Safe Harbor framework, which allows European organizations to send personal data to the US, should no longer be considered legal. I’d like to explain what this means and discuss the potential consequences for those delivering assessments and training in Europe.

What European data protection law says about transfers outside Europe

According to European data protection law, personal data such as assessment results or course completion data can only leave Europe if an adequate level of protection is guaranteed. All organizations with European participants must ensure that they follow strict rules if they allow personal data to be transferred outside Europe. Data controllers can be fined if they don’t comply.

[Diagram: a data controller has data processors, which in turn have sub-processors]

A few countries, including Canada, are considered to have an adequate level of protection. But in order to send information to the United States and most other countries outside Europe, it’s necessary to ensure that each data processor who has access to the data guarantees its protection. This includes every processor and sub-processor with access to the data, including data centers, backup storage vendors and any organization that accesses the data for support or troubleshooting purposes. Even if data is hosted in Europe, the rules must still be followed if there is any access to it or any copy of it in the US.

There are two main ways in which US organizations can bind themselves to follow data protection rules and so be legitimate processors of European data: the EU Model Clauses or Safe Harbor.

EU Model Clauses

The EU Model Clauses are a standard set of contractual clauses, several pages long, which a data processor can sign with each data controller. Signing signifies a commitment to following EU data protection law when processing data. These clauses cannot be changed or negotiated in any way. Questionmark uses these EU Model Clauses with all our sub-processors for Questionmark OnDemand data to ensure that our customers will be compliant with EU data protection law.

Safe Harbor

An alternative to the EU Model Clauses in the US is Safe Harbor. Safe Harbor (formal name: the US-EU Safe Harbor Framework) is run by the US Department of Commerce and allows US companies to certify that they will follow EU rules for EU data without needing to sign the EU Model Clauses. You can certify once, and the certification then applies to all your customers. It’s very widely used, and most large US organizations in assessment and learning are Safe Harbor certified, including Questionmark’s US company, Questionmark Corporation. You can see a full list of certified organizations on the US Department of Commerce’s Safe Harbor website.

There is some concern, particularly in Germany, that Safe Harbor is not well enough enforced, so some organizations, like Questionmark, also use the EU Model Clauses; Microsoft, for example, offers them for its cloud products. But Safe Harbor remains widely used to ensure the legality and safety of European data sent to the US.

The legal threat to Safe Harbor

Last week, an Advocate General of the Court of Justice of the European Union issued an opinion that the Safe Harbor scheme should no longer be considered legal. He argues that the widespread government surveillance by the US is incompatible with the privacy rights set out in the EU Data Protection Directive, so the whole of Safe Harbor should be invalidated. His opinion is not binding, but opinions of Advocates General are often confirmed by the court, so there is a genuine threat that Safe Harbor could be suspended.

Negotiations on data protection are underway between the US and Europe, and it is likely that this will be resolved in some way. But there are significant differences in attitude on data protection between Europe and the US. Much anger remains about Edward Snowden’s revelations about US surveillance, so the situation is hard to predict.

What can organizations do to protect themselves?

It’s likely that a deal will be found and that Safe Harbor will remain safe. And if it is ruled illegal, this is going to affect the whole technology sector, not just learning and assessment. But it is a further argument for using a European vendor for your assessment and learning needs, and/or one that is familiar with the EU Model Clauses and has its suppliers signed up to them.

For more information and background on data protection, see Questionmark’s white paper: Responsibilities of a Data Controller When Assessing Knowledge, Skills and Abilities. John Kleeman will also be presenting at the Questionmark Conference 2016: Shaping the Future of Assessment in Miami, April 12-15. Click here to register and learn more about this important learning event.

Item analysis: Selecting items for the test form – Part 2

Posted by Austin Fossey

In my last post, I talked about how item discrimination is the primary statistic used for item selection in classical test theory (CTT). In this post, I will share an example from my item analysis webinar.

The assessment below is fake, so there’s no need to write in comments telling me that the questions could be written differently or that the test is too short or that there is not good domain representation or that I should be banished to an island.

Interested in learning more about item analysis? I will be presenting a series of workshops on this topic at the Questionmark Conference 2016: Shaping the Future of Assessment in Miami, April 12-15.

In this example, we have field tested 16 items and collected item statistics from a representative sample of 1,000 participants. In this hypothetical scenario, we have been asked to create an assessment that has 11 items instead of 16. We will begin by looking at the item discrimination statistics.

Since this test has fewer than 25 items, we will look at the item-rest correlation discrimination. The screenshot below shows the first five items from the summary table in Questionmark’s Item Analysis Report (I have omitted some columns to help display the table within the blog).

[Screenshot: summary table from Questionmark’s Item Analysis Report showing the first five items]
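For readers who want to experiment with this statistic, here is a minimal sketch of how an item-rest correlation can be computed from a 0/1 scored response matrix. The randomly generated data and function are purely illustrative and are not the calculation behind the report shown above.

```python
# Sketch: item-rest correlation discrimination for dichotomously scored items.
# Rows are participants, columns are items; the 0/1 data is randomly generated
# purely for illustration and is not the data behind the report above.
import numpy as np

rng = np.random.default_rng(7)
responses = rng.integers(0, 2, size=(1000, 16))    # 1,000 participants x 16 items

def item_rest_correlation(scores: np.ndarray) -> np.ndarray:
    """Correlate each item with the total score of the remaining items."""
    total = scores.sum(axis=1)
    correlations = np.empty(scores.shape[1])
    for i in range(scores.shape[1]):
        rest = total - scores[:, i]       # exclude the item from its own criterion
        correlations[i] = np.corrcoef(scores[:, i], rest)[0, 1]
    return correlations

for item, r in enumerate(item_rest_correlation(responses), start=1):
    print(f"Item {item:2d}: item-rest correlation = {r:+.2f}")
```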

The test’s reliability (as measured by Cronbach’s Alpha) for all 16 items is 0.58. Note that one would typically want a reliability of at least 0.70 for low-stakes assessments and 0.90 or higher for high-stakes assessments. When reliability is too low, adding extra items can often help, but removing items with poor discrimination can also improve reliability.

If we remove the five items with the lowest item-rest correlation discrimination (items 9, 16, 2, 3, and 13 shown above), the remaining 11 items have an alpha value of 0.67. That is still not high enough for even low-stakes testing, but it illustrates how items with poor discrimination can lower the reliability of an assessment. Low reliability also increases the standard error of measurement, so by increasing the reliability of the assessment, we might also increase the accuracy of the scores.
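As a rough illustration of that before-and-after comparison, the sketch below computes Cronbach’s alpha for all 16 items and again for the 11 retained items. Because the response matrix is randomly generated, the printed values will not reproduce the 0.58 and 0.67 quoted above; the point is only the mechanics of the comparison.

```python
# Sketch: Cronbach's alpha for the full form vs. a shortened form with the
# weakest-discriminating items removed. The 0/1 response matrix is randomly
# generated, so the printed values will NOT reproduce the 0.58 and 0.67
# quoted in the post; only the before/after comparison is illustrated.
import numpy as np

rng = np.random.default_rng(42)
responses = rng.integers(0, 2, size=(1000, 16))   # participants x items

def cronbach_alpha(scores: np.ndarray) -> float:
    """alpha = k/(k-1) * (1 - sum of item variances / variance of total score)."""
    k = scores.shape[1]
    item_vars = scores.var(axis=0, ddof=1)
    total_var = scores.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

drop = [9, 16, 2, 3, 13]                           # 1-based item numbers to remove
keep = [i for i in range(responses.shape[1]) if (i + 1) not in drop]

print(f"Alpha, all 16 items:      {cronbach_alpha(responses):.2f}")
print(f"Alpha, 11 retained items: {cronbach_alpha(responses[:, keep]):.2f}")
```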

Notice that these five items have poor item-rest correlation statistics, yet four of those items have reasonable item difficulty indices (items 16, 2, 3, and 13). If we had made selection decisions based on item difficulty, we might have chosen to retain these items, though closer inspection would uncover some content issues, as I demonstrated during the item analysis webinar.

For example, consider item 3, which has a difficulty value of 0.418 and an item-rest correlation discrimination value of -0.02. The screenshot below shows the option analysis table from the item detail page of the report.

[Screenshot: option analysis table for item 3]
The option analysis table shows that, when asked about the easternmost state in the United States, many participants are selecting the key, “Maine,” but 43.3% of our top-performing participants (defined by the upper 27% of scores) selected “Alaska.” This indicates that some of the top-performing participants might be familiar with Pochnoi Point, an Alaskan island which happens to sit on the other side of the 180th meridian. Sure, that is a technicality, but across the entire sample, 27.8% of the participants chose this option. This item clearly needs to be sent back for revision and clarification before we use it for scored delivery. If we had only looked at the item difficulty statistics, we might never have reviewed this item.
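To show the kind of tabulation behind an option analysis table, here is a minimal sketch that computes the proportion of participants choosing each option, overall and within the upper 27% of scorers. The data frame, score values and option labels are invented and are not taken from the report above.

```python
# Sketch: a simple option analysis for one multiple-choice item. The
# DataFrame of total scores and selected options is invented; the upper
# group is defined as the top 27% of scorers, as in the post.
import pandas as pd

data = pd.DataFrame({
    "total_score": [14, 13, 12, 11, 10, 9, 9, 8, 7, 6, 5, 4],
    "selected":    ["Alaska", "Maine", "Alaska", "Maine", "Maine", "Hawaii",
                    "Maine", "Florida", "Maine", "Hawaii", "Florida", "Maine"],
})

cutoff = data["total_score"].quantile(0.73)        # boundary of the top 27%
upper = data[data["total_score"] >= cutoff]

print("All participants:")
print(data["selected"].value_counts(normalize=True).round(3))
print("\nUpper group (top 27% of scores):")
print(upper["selected"].value_counts(normalize=True).round(3))
```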

Interested in learning more about item analysis? I will be presenting a series of workshops on this topic at the Questionmark Conference 2016: Shaping the Future of Assessment in Miami, April 12-15. I look forward to seeing you there! Click here to register and learn more about this important learning event.

Item analysis: Selecting items for the test form – Part 1

Posted by Austin Fossey

Regular readers of our blog know that we ran an initial series on item analysis way back in the day, and then I did a second item analysis series building on that a couple of years ago, and then I discussed item analysis in our item development series, and then we had an amazing webinar about item analysis, and then I named my goldfish Item Analysis and wrote my senator requesting that our state bird be changed to an item analysis. So today, I would like to talk about . . . item analysis.

But don’t worry, this is actually a new topic for the blog.

Interested in learning more about item analysis? I will be presenting a series of workshops on this topic at the 2016 Questionmark Conference: Shaping the Future of Assessment in Miami, April 12-15. 

Today, I am writing about the use of item statistics for item selection. I was surprised to learn from feedback from many of our webinar participants that a lot of people do not look at their item statistics until after the test has been delivered. Reviewing statistics after delivery is still a great practice (so keep it up), but if you can try out the questions as unscored field test items before building your final test form, you can use the item analysis statistics to build a better instrument.

When building a test form, item statistics can help us in two ways.

  • They can help us identify items that are poorly written, miskeyed, or irrelevant to the construct.
  • They can help us select the items that will yield the most reliable instrument, and thus a more accurate score.

In the early half of the 20th century, it was a common belief that good test instruments should have a mix of easy, medium, and hard items, but this thinking began to change after two studies in 1952, one by Fred Lord and one by Lee Cronbach and Willard Warrington. These researchers (and others since) demonstrated that items with higher discrimination values create instruments whose total scores discriminate better among participants across all ability levels.

Sometimes easy and hard items are useful for measurement, such as in an adaptive aptitude test where we need to measure all abilities with similar precision. But in criterion-referenced assessments, we are often interested in correctly classifying those participants who should pass and those who should fail. If this is our goal, then the best test form will be one with a range of medium-difficulty items that also have high discrimination values.

Discrimination may be the primary statistic used for selecting items, but item reliability is also occasionally useful, as I explained in an earlier post. Item reliability can be used as a tie breaker when we need to choose between two items with the same discrimination, or it can be used to predict the reliability or score variance for a set of items that the test developer wants to use for a test form.

Difficulty is still useful for flagging items, though an item flagged for being too easy or too hard will often have a low discrimination value too. If an easy or hard item has good discrimination, it may be worth reviewing for item flaws or other factors that may have affected the statistics (e.g., was it given at the end of a timed test that did not leave participants enough time to respond carefully?).
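To make these last few paragraphs concrete, here is a minimal sketch of one way such rules of thumb could be combined: rank hypothetical items by discrimination, break ties with an item reliability index (item standard deviation times discrimination), and flag any selected item with an extreme difficulty value for review. All statistics and thresholds are invented for illustration and are not fixed standards or Questionmark’s selection algorithm.

```python
# Sketch: ranking field-test items by discrimination, breaking ties with an
# item reliability index (item SD x discrimination), and flagging selected
# items whose difficulty is extreme despite acceptable discrimination.
# All values and thresholds below are illustrative assumptions.
import pandas as pd

items = pd.DataFrame({
    "item": [1, 2, 3, 4, 5, 6],
    "difficulty": [0.52, 0.91, 0.42, 0.48, 0.12, 0.60],   # proportion correct
    "discrimination": [0.41, 0.35, -0.02, 0.35, 0.38, 0.29],
    "item_sd": [0.50, 0.29, 0.49, 0.50, 0.33, 0.49],
})
items["reliability_index"] = items["item_sd"] * items["discrimination"]

# Select the top items by discrimination, using the reliability index as a tie breaker.
selected = items.sort_values(
    ["discrimination", "reliability_index"], ascending=False
).head(4)

# Flag selected items that are very easy or very hard for a content review.
flagged = selected[(selected["difficulty"] > 0.90) | (selected["difficulty"] < 0.20)]

print(selected[["item", "difficulty", "discrimination"]])
print("\nFlag for review:", flagged["item"].tolist())
```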

In my next post, I will share an example from the webinar of how item selection using item discrimination improves the test form reliability, even though the test is shorter. I will also share an example of a flawed item that exhibits poor item statistics.

Interested in learning more about item analysis? I will be presenting a series of workshops on this topic at the 2016 Questionmark Conference: Shaping the Future of Assessment in Miami, April 12-15. I look forward to seeing you there! Click here to register and learn more about this important learning event.



SAP to present their global certification program at London briefing

Posted by Chloe Mendonca

A key to SAP’s success is ensuring that the professional learning path of skilled SAP practitioners is continually supported – thereby making qualified experts on their cloud solutions readily available to customers, partners and consultants.

In a world where current knowledge and skills are more important than ever, SAP needed a way to verify that their cloud consultants around the world were keeping their knowledge and skills up to date with rapidly changing technology. A representative of the certification program at SAP comments:

“It became clear that a certification that lasted for two or three years didn’t cut it any longer – in all areas of the portfolio. Everything is evolving so quickly, and SAP has to always support current, validated knowledge.”

Best Practices from SAP

The move to the cloud required some fundamental changes to SAP’s existing certification program. What challenges did they face? What technologies are they using to ensure the security of the program? Join us on the 21st of October for a breakfast briefing in London, where Ralf Kirchgaessner, Manager of Global Certification at SAP, will discuss the answers to these questions. Ralf will describe how the SAP team planned the program, explain its benefits and share lessons learned.

Click here to learn more and register for this complimentary breakfast briefing. *Seats are limited.

High-Stakes Assessments

The briefing will include a best-practice seminar on the types of technologies and techniques to consider using as part of your assessment program to securely create, deliver and report on high-stakes tests around the world. It will highlight technologies such as online invigilation, secure browsers and item banking tools that alleviate the testing centre burden and allow organisations and test publishers to securely administer trustable tests and exams and protect valuable assessment content.

What’s a breakfast briefing?

You can expect a morning of networking, best practice tips and live demonstrations of the newest assessment technologies. The event will include a complimentary breakfast at 8:45 a.m. followed by presentations and discussions until about 12:30 p.m.

Who should attend?

These gatherings are ideal for people involved in certification, compliance and/or risk management, and learning and development.

When? Where?

Wednesday 21st October at Microsoft’s Office in London, Victoria — 8:45 a.m. – 12:30 p.m.

Click here to learn more and register to attend

Agree or disagree? 10 tips for better surveys — part 3

Posted by John Kleeman

This is the third and last post in my “Agree or disagree” series on writing effective attitude surveys. In the first post I explained the process survey participants go through when answering questions and the concept of satisficing – where some participants give what they think is a satisfactory answer rather than stretching themselves to give the best answer.

In the second post I shared these five tips based on research evidence on question and survey design.

Tip #1 – Avoid Agree/Disagree questions

Tip #2 – Avoid Yes/No and True/False questions

Tip #3 – Each question should address one attitude only

Tip #4 – Minimize the difficulty of answering each question

Tip #5 – Randomize the responses if order is not important

Here are five more:

Tip #6 – Pretest your survey

Just as with tests and exams, you need to pretest or pilot your survey before it goes live. Participants may interpret questions differently than you intended, so it’s important to get the language right so that each question prompts the judgement you intend. Here are some good pre-testing methods:

  • Get a peer or expert to review the survey.
  • Pre-test with participants and measure the response time for each question (shown in some Questionmark reports); a longer response time could indicate a more confusing question (see the sketch after this list).
  • Allow participants to provide comments on questions they think are confusing.
  • Follow up with your pretesting group by asking them why they gave particular answers or what they thought you meant by your questions.
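As a small illustration of the response-time check mentioned above, here is a minimal sketch that flags pretest questions whose average response time is well above the others. The table, timings and 1.5x-median threshold are invented assumptions; a flagged question is only a candidate for closer review, not automatically a bad one.

```python
# Sketch: flagging pretest questions with unusually long average response
# times, which may signal confusing wording. The per-question times (in
# seconds) and the threshold are made up for illustration.
import pandas as pd

times = pd.DataFrame({
    "question": ["Q1", "Q2", "Q3", "Q4", "Q5"],
    "mean_seconds": [12.4, 11.8, 29.6, 13.1, 10.9],
})

# Flag questions well above the typical response time (here, > 1.5x the median).
threshold = 1.5 * times["mean_seconds"].median()
flagged = times[times["mean_seconds"] > threshold]

print(flagged)   # candidates for rewording or follow-up interviews
```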

Tip #7 – Make survey participants realize how useful the survey is

The more motivated a participant is, the more likely he or she is to answer optimally rather than just satisficing and choosing a good enough answer. To quote Professor Krosnick in his paper The Impact of Satisficing on Survey Data Quality:

“Motivation to optimize is likely to be greater among respondents who think that the survey in which they are participating is important and/or useful”

Ensure that you communicate the goal of the survey and make participants feel that filling it in carefully will benefit something they believe in or value.

Tip #8 – Don’t include a “don’t know” option

Including a “don’t know” option usually does not improve the accuracy of your survey. In most cases it reduces it. To those of us used to the precision of testing and assessment, this is surprising.

Part of the reason is that providing a “don’t know” or “no opinion” option allows participants to disengage from your survey and so diminishes useful responses. Also, people are better at guessing or estimating than they think they are, so they will tend to choose an appropriate answer if they do not have a “don’t know” option. See this paper by Mondak and Davis, which illustrates this in the political field.

Tip #9 – Ask questions about the recent past only

The further back in time they are asked to remember, the less accurately participants will answer your questions. We all have a tendency to “telescope” the timing of events and imagine that things happened earlier or later than they did. If you can, ask about the last week or the last month, not about the last year or further back.

Tip #10 – Trends are good

Error can creep into survey results in many ways. Participants can misunderstand the question. They can fail to recall the right information. Their judgement can be influenced by social pressures. And they are limited by the choices available. But if you use the same questions over time with a similar population, you can be pretty sure that changes over time are meaningful.

For example, if you deliver an employee attitude survey with the same questions for two years running, then changes in the results to a question (if statistically significant) probably mean a change in employee attitudes. If you can use the same or similar questions over time and can identify trends or changes in results, such data can be very trustworthy.
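If you want to check whether a year-on-year change like this is statistically significant for a single question, a simple approach is a two-proportion z-test, sketched below with invented counts; in practice you would also want to consider sampling design and multiple comparisons.

```python
# Sketch: two-proportion z-test for a year-over-year change in the share of
# favourable responses to the same survey question. All counts are invented.
from math import sqrt
from statistics import NormalDist

favourable = [312, 371]       # favourable responses in year 1 and year 2
totals = [500, 520]           # respondents in year 1 and year 2

p1, p2 = favourable[0] / totals[0], favourable[1] / totals[1]
pooled = sum(favourable) / sum(totals)
se = sqrt(pooled * (1 - pooled) * (1 / totals[0] + 1 / totals[1]))
z = (p2 - p1) / se
p_value = 2 * (1 - NormalDist().cdf(abs(z)))       # two-sided test

print(f"Year 1: {p1:.1%}, Year 2: {p2:.1%}, z = {z:.2f}, p = {p_value:.4f}")
```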

I hope you’ve found this series of articles useful. For more information on how Questionmark can help you create, deliver and report on surveys, see the Questionmark website. I’ll also be presenting at Questionmark’s 2016 Conference: Shaping the Future of Assessment in Miami, April 12-15. Check out the conference page for more information.

Know what your questions are about before you deliver the test

Posted by Austin Fossey

A few months ago, I had an interesting conversation with an assessment manager at an educational institution—not a Questionmark customer, mind you. Finding nothing else in common, we eventually began discussing assessment design.

At this institution (which will remain anonymous), he admitted that they are often pressed for time in their assessment development cycle. There is not enough time to do all of the item development work they need to do before their students take the assessment. To get around this, their item writers draft all of the items, conduct an editorial review, and then deliver the items. The items are assigned topics after administration, and students’ total scores and topic scores are calculated from there. He asked me if Questionmark software allows test developers to assign topics and calculate topic scores after assessing the students, and I answered truthfully that it does not.

But why not? Is there a reason test developers should not do what is being practiced at this institution? Yes, there are in fact two reasons. Get ready for some psychometric finger-wagging.

Consider what this institution is doing. The items are drafted and subjected to an editorial review, but no one ever classifies the items within a topic until after the test has been administered. Recall what people typically do during a content review prior to administration:

  • Remove items that are not relevant to the domain.
  • Ensure that the blueprint is covered.
  • Check that items are assigned to the correct topic.

If topics are not assigned until after the participants have already tested, we risk the validity of the results and the legal defensibility of the test. If we have delivered items that are not relevant to the domain, we have wasted participants’ time and will need to adjust their total score. Okay, we can manage that by telling the participants ahead of time that some of the test items might not count. But if we have not asked the correct number of questions for a given area of the blueprint, the entire assessment score will be worthless—a threat to validity known as construct underrepresentation or construct deficiency in The Standards for Educational and Psychological Testing.

For example, if we were supposed to deliver 20 items from Topic A, but find out after the fact that only 12 items have been classified as belonging to Topic A, then there is little we can do about it besides rebuilding the test form and making everyone take the test again.
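One way to catch this kind of shortfall before delivery is a simple blueprint coverage check. The sketch below compares the number of items per topic on a draft form against the required counts; the topic names and numbers are made up to mirror the example above.

```python
# Sketch: checking a draft test form against the blueprint before delivery.
# Topic names and required counts are made up for illustration.
from collections import Counter

blueprint = {"Topic A": 20, "Topic B": 15, "Topic C": 10}   # required items per topic
form_topics = ["Topic A"] * 12 + ["Topic B"] * 15 + ["Topic C"] * 10  # topics on draft form

counts = Counter(form_topics)
for topic, required in blueprint.items():
    actual = counts.get(topic, 0)
    status = "OK" if actual >= required else f"SHORT by {required - actual}"
    print(f"{topic}: required {required}, on form {actual} -> {status}")
```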

The Standards provide helpful guidance in these matters. For this particular case, the Standards point out that:

“The test developer is responsible for documenting that the items selected for the test meet the requirements of the test specifications. In particular, the set of items selected for a new test form . . . must meet both content and psychometric specifications.” (p. 82)

Publications describing best practices for test development also specify that the content must be determined before delivering an operational form. For example, in their chapter in Educational Measurement (4th Edition), Cynthia Schmeiser and Catherine Welch note the importance of conducting a content review of items before field testing, as well as a final content review of a draft test form before it becomes operational.

In Introduction to Classical and Modern Test Theory, Linda Crocker and James Algina also made an interesting observation about classroom assessments, noting that students expect to be graded on all of the items they have been asked to answer. Even if notified in advance that some items might not be counted (as one might do in field testing), students might not consider it fair that their score is based on a yet-to-be-determined subset of items that may not fully represent the content that is supposed to be covered.

This is why Questionmark’s software is designed the way it is. When creating an item, item writers must assign an item to a topic, and items can be classified or labeled along other dimensions (e.g., cognitive process) using metatags. Even if an assessment program cannot muster any further content review, at least the item writer has classified items by content area. The person building the test form then has the information they need to make sure that the right questions get asked.

We have a responsibility as test developers to treat our participants fairly and ethically. If we are asking them to spend their time taking a test, then we owe them the most useful measurement that we can provide. Participants trust that we know what we are doing. If we postpone critical, basic development tasks like content identification until after participants have already given us their time, we are taking advantage of that trust.
