Ten Tips to Translate Tests Thoughtfully

Posted by John Kleeman

Tests and exams are used for serious purposes and have a significant impact on people’s lives. If they are translated badly, it can cause real distress. As a topical illustration, poor translation of an important medical admissions test in India was the subject of a major case ruled on by the Indian Supreme Court last week.

Because languages and cultures vary, translating tests and exams fairly is hard. I recently attended a seminar organized by the OECD on translating large-scale assessments, which gave me a lot of insight into the test translation process. If you are interested in the OECD seminar, Steve Dept of Questionmark partner cApStAn has written a blog post here, and the seminar presentations are available on the OECD website.

Here are some tips from what I’ve learned at the seminar and elsewhere on good practice in translating tests and exams.

  1. Put together a capable translation management team. A team approach works well when translating tests. For example a subject matter expert, a linguist/translator, a business person and a testing expert would work well together as a review and management committee.
  2. Think through the purpose of your translation. Experts say that achieving perfect equivalence of a test in two languages is close to impossible, so you need to define your goals. For example, are you seeking to adapt the test to measure the same thing, or are you looking for a literal translation? The former may be more realistic, especially if your test includes some culturally specific examples or context. Usually what you want is for the two language versions to be comparable, so that a pass score in either language means a similar level of competence.
  3. Define a glossary for your project. If your test is on a specialist or technical subject, it will have some words specific to the content area. You can save time and increase the quality of the translation if you identify the expected translation of these words in advance. This will guide the translating team and ensure that test takers see consistent vocabulary.
  4. Use a competent translator (or translation company). A translator must be a native speaker of the target language, but also needs current cultural knowledge, ideally from living in the target locale. A translator who is not a native speaker will not be effective, and a translator who does not know the culture may miss references in question content (e.g. local names or slang). An ideal translator will also have subject matter knowledge and assessment knowledge.
  5. Export to allow a translator to use their own tools. Translators have many automated tools available to them, including translation memories, glossaries and automated checking systems. For a simple translation you can translate interactively within an assessment system, but you will get more professional results if you export from your assessment management system (for example to XLIFF XML), let the translator work in their own system, and then re-import the result (a sketch of such an export appears after this list).
  6. Put in place a verification procedure. Translators are human and make mistakes, and questions can rely on context or knowledge that a translator may not have. A verification process involves manual review by stakeholders, looking at things like accuracy, style, country and cultural issues, whether any choice gives away a clue, whether the right choice is obviously longer than the other choices, and whether the stem and choices use consistent vocabulary.
  7. Also review by piloting and looking at item difficulty. Linguistic review is helpful, but you should also look at item performance in practice. The difficulty of a translated item will vary slightly between languages; generally, small errors go both ways and roughly cancel out. You want to catch the big errors, where ambiguity or mis-translation makes a material difference to test accuracy. You can catch some of these by running a small pilot with 50 (or even 25) participants per language and comparing the p-value (item difficulty, the proportion who answer correctly) across languages. This can flag questions with significant differences in difficulty; such questions need review as they may well be badly translated (a sketch of this kind of comparison also appears after this list).
  8. Consider using bilingual reviewers. If you have access to bilingual people (who speak the target and source language), it can be worth asking them to look at both versions of the questions and comment. This shouldn’t be your only verification procedure but can be very helpful and spot issues.
  9. Update translations as questions change. In any real-world test, questions in your item bank get updated over time, and that means you need to update the translations and keep track of which ones have been updated in which languages. It can be helpful to use a translation management system, for example the one included within Questionmark OnDemand, to help you manage this process, as it is challenging and error-prone to manage manually.
  10. Read community guidelines. The International Test Commission have produced well-regarded guidelines on adapting/translating tests – you can access them here. The OECD PISA guidelines, although specific to the international PISA tests, contain good practice applicable to other programs. I personally like the heading of one of the sections in the PISA guidance: “Keep in mind that some respondents will misunderstand anything that can be misunderstood”!
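
To illustrate tip 5, here is a minimal sketch of exporting question text to XLIFF 1.2 and reading translations back, using Python’s standard library. The question data, file names and the decision to export only the stem are illustrative assumptions; your assessment management system’s own export will look different.

```python
import xml.etree.ElementTree as ET

# Hypothetical questions pulled from an item bank export (illustrative data)
questions = [
    {"id": "Q001", "stem": "Which organ pumps blood around the body?"},
    {"id": "Q002", "stem": "What process do plants use to make food from sunlight?"},
]

XLIFF_NS = "urn:oasis:names:tc:xliff:document:1.2"
xliff = ET.Element("xliff", {"version": "1.2", "xmlns": XLIFF_NS})
file_el = ET.SubElement(xliff, "file", {
    "original": "item-bank-export",
    "datatype": "plaintext",
    "source-language": "en",
    "target-language": "fr",
})
body = ET.SubElement(file_el, "body")
for q in questions:
    unit = ET.SubElement(body, "trans-unit", {"id": q["id"]})
    ET.SubElement(unit, "source").text = q["stem"]
    ET.SubElement(unit, "target").text = ""  # the translator's own tools fill this in

ET.ElementTree(xliff).write("export.xlf", encoding="utf-8", xml_declaration=True)

# Later, re-import: read the translated targets back from the returned file
ns = {"x": XLIFF_NS}
for unit in ET.parse("export.translated.xlf").iterfind(".//x:trans-unit", ns):
    print(unit.get("id"), unit.findtext("x:target", namespaces=ns))
```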
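
And for tip 7, here is a rough sketch of comparing pilot p-values across two language versions using a two-proportion z-test as a screening rule. The response counts and the flagging threshold are invented for illustration; with pilots of only 25 to 50 participants this is a coarse screen for large differences, not a precise test.

```python
from math import sqrt

def compare_difficulty(correct_a, n_a, correct_b, n_b, threshold=2.0):
    """Flag an item whose p-values differ across languages more than chance would suggest."""
    p_a, p_b = correct_a / n_a, correct_b / n_b
    pooled = (correct_a + correct_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se if se else 0.0
    return p_a, p_b, abs(z) > threshold

# e.g. 38 of 50 correct in the source language vs 21 of 50 in the translation
p_src, p_tgt, needs_review = compare_difficulty(38, 50, 21, 50)
print(f"p-value source={p_src:.2f}, target={p_tgt:.2f}, flag for review: {needs_review}")
```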

I hope you found this post interesting – all suggestions are personal and not validated by the OECD or others. If you did find it interesting, you may also want to read my earlier blog post: Twelve tips to make questions translation ready.

To learn more about Questionmark OnDemand and Questionmark’s translation management system, see here or request a demo.

Item Analysis for Beginners – Getting Started

Posted by John Kleeman
Do you use assessments to make decisions about people? If so, then you should regularly run Item Analysis on your results.  Item Analysis can help find questions which are ambiguous, mis-keyed or which have choices that are rarely chosen. Improving or removing such questions will improve the validity and reliability of your assessment, and so help you use assessment results to make better decisions. If you don’t use Item Analysis, you risk using poor questions that make your assessments less accurate.

Sometimes people can be fearful of Item Analysis because they are worried it involves too much statistics. This blog post introduces Item Analysis for people who are unfamiliar with it, and I promise no maths or stats! I’m also giving a free webinar on Item Analysis with the same promise.

An assessment contains many items (another name for questions) as figuratively shown below. You can use Item Analysis to look at how each item performs within the assessment and flag potentially weak items for review. By keeping only stronger questions in the assessment, the assessment will be more effective.

Picture of a series of items with one marked as being weak

Item Analysis looks at the performance of all your participants on the items, and calculates how easy or hard people find each item (“item difficulty” or “p-value”) and how well scores on an item correlate with scores on the assessment as a whole (“item discrimination” or item-total correlation). Some of the problematic questions that Item Analysis can identify are listed below; a short sketch of these calculations follows the list.

  • Questions that almost all participants get right, and so are very easy. You might want to review these to see if they are appropriate for the assessment. See my earlier post Item Analysis for Beginners – When are very Easy or very Difficult Questions Useful? for more information.
  • Questions which are difficult, where a lot of participants get the question wrong. You should check such questions in case they are mis-keyed or ambiguous.
  • Multiple choice questions where some choices are rarely picked. You might want to improve such questions to make the wrong choices more plausible.
  • Questions where scores on the question correlate poorly with scores on the assessment as a whole, for example questions that high-performing participants tend to get wrong. You should look at such questions in case they are ambiguous, mis-keyed or off-topic.
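
Here is a minimal sketch of the two statistics just described, written in Python. This is not Questionmark’s implementation: the simulated responses, the 0/1 scoring and the flagging thresholds are all illustrative assumptions.

```python
import numpy as np

def item_analysis(responses: np.ndarray):
    """Per-item difficulty (p-value) and item-rest correlation (discrimination).

    responses: participants x items array of 0/1 scores.
    """
    difficulty = responses.mean(axis=0)                  # proportion answering each item correctly
    discrimination = np.empty(responses.shape[1])
    for i in range(responses.shape[1]):
        rest = responses.sum(axis=1) - responses[:, i]   # total score excluding this item
        discrimination[i] = np.corrcoef(responses[:, i], rest)[0, 1]
    return difficulty, discrimination

# Simulate 200 participants x 10 items so that more able participants tend to score higher
rng = np.random.default_rng(0)
ability = rng.normal(size=(200, 1))
thresholds = rng.normal(size=(1, 10))
responses = (ability + rng.normal(size=(200, 10)) > thresholds).astype(int)

difficulty, discrimination = item_analysis(responses)
for i, (p, d) in enumerate(zip(difficulty, discrimination), start=1):
    flag = "review" if p > 0.95 or p < 0.20 or d < 0.15 else "ok"   # illustrative cut-offs
    print(f"item {i:2d}  p-value={p:.2f}  discrimination={d:+.2f}  {flag}")
```

In a real review you would apply whatever thresholds your own program has agreed on and inspect flagged items by hand rather than discarding them automatically.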

There is a huge wealth of information available in an Item Analysis report, and assessment experts will delve into the report in detail. But much of the key information in an Item Analysis report is useful to anyone creating and delivering quizzes, tests and exams.

The Questionmark Item Analysis report includes a graph which plots the difficulty of items against their discrimination, as in the example below. It flags questions by marking them amber or red if they fall into categories which may need review. For example, in the illustration below, four questions are marked in amber as having low discrimination and so are potentially worth looking at.

Illustration of Questionmark item analysis report showing some questions green and some amber

If you are running an assessment program, and not using Item Analysis regularly, then this throws doubt on the trustworthiness of your results. By using it to identify and improve weak questions you should be able to improve your validity and reliability.

Item Analysis is surprisingly effective in practice. I’m one of the team at Questionmark responsible for managing our data security test, which all employees take annually to check their understanding of information security and data protection. We recently reviewed the test and ran Item Analysis. It very quickly found a question with poor statistics where the technology had changed but we had not updated the wording, and another question where two of the choices could be considered right, which made it hard to answer. It made our review faster and more effective and helped us improve the quality of the test.

If you want to learn a little more about Item Analysis, I’m running a free webinar on the subject “Item Analysis for Beginners” on May 2nd. You can see details and register for the webinar at https://www.questionmark.com/questionmark_webinars. I look forward to seeing some of you there!

 

Psychometrics 101: Sample size and question difficulty (p-values)


Posted by Greg Pope

With just a week to go before the Questionmark Users Conference, here’s a little taste of the presentation I will be doing on psychometrics. I will also be running a session on Item Analysis and Test Analysis.

So, let’s talk about sample size and question difficulty!

How does the number of participants that take a question relate to the robustness/stability of the question difficulty statistic (p-value)? Basically, the smaller the number of participants tested, the less robust/stable the statistic. So if 30 participants take a question and the p-value that appears in the Item Analysis Report is 0.600, the range that the theoretical “true” p-value (the one you would get if all participants in the world took the question) could fall into 95% of the time is between 0.425 and 0.775. This means that if another 30 participants were tested, you could get a p-value on the Item Analysis Report anywhere from 0.425 to 0.775 (a 95% confidence range). The takeaway is that if high-stakes decisions are being made using p-values (e.g., whether to drop a question from a certification exam), the more participants that can be tested the better, to get more robust results. Similarly, if you are conducting beta testing and want to know which questions to include in your test form based on the beta test results, the more participants you can beta test, the more confidence you will have in the stability of the statistics. Below is a graph that illustrates this relationship.

Graph showing how sample size influences the p-value confidence range
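
As a side note, the 0.425 to 0.775 range quoted above is consistent with a simple normal-approximation confidence interval for a proportion. Here is a minimal sketch of that calculation in Python, assuming the basic Wald interval; this is my reconstruction, not necessarily the exact method behind the chart.

```python
import math

def p_value_interval(p, n, z=1.96):
    """Approximate 95% confidence interval for an observed item p-value,
    using the normal (Wald) approximation for a proportion."""
    se = math.sqrt(p * (1 - p) / n)          # standard error of the proportion
    return max(0.0, p - z * se), min(1.0, p + z * se)

low, high = p_value_interval(0.600, 30)
print(f"n=30, observed p-value 0.600 -> roughly {low:.3f} to {high:.3f}")
# prints roughly 0.425 to 0.775, matching the range quoted above
```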

This relationship between sample size and stability applies to other common statistics used in psychometrics. For example, the item-total correlation (point-biserial correlation coefficient) can vary a great deal when small sample sizes are used to calculate it. In the example below we see that an observed correlation of 0 can actually vary by over 0.8 (plus or minus).

Graph showing how sample size influences the stability of the correlation
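
To make the correlation example concrete, here is a small sketch using the Fisher z transformation to approximate a 95% confidence interval around an observed correlation of 0. The sample sizes in the loop are illustrative choices, not necessarily the ones plotted in the chart; at very small sample sizes the interval is wider than plus or minus 0.8, as described above.

```python
import math

def correlation_interval(r, n, z=1.96):
    """Approximate 95% confidence interval for a correlation via Fisher's z."""
    if n <= 3:
        raise ValueError("need more than 3 participants")
    fz = math.atanh(r)              # Fisher z transform of the observed correlation
    se = 1.0 / math.sqrt(n - 3)     # standard error on the z scale
    return math.tanh(fz - z * se), math.tanh(fz + z * se)

for n in (6, 10, 30, 100, 1000):
    low, high = correlation_interval(0.0, n)
    print(f"n={n:5d}  observed r=0.00  95% CI: {low:+.2f} to {high:+.2f}")
```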