White Paper: Delivering Assessments Safely and Securely

Posted by Julie Chazyn

We have just updated our white paper on Delivering Assessments Safely and Securely, which helps people choose security measures that match the types of assessments they’re delivering, from low stakes to high stakes.

This new paper takes into account changes in technologies and standards over the last few years, as well as new testing environments and methods. We’ve also added some tips to help prevent cheating.

You can download the paper here.

Psychometrics 101: Sample size and question difficulty (p-values)


Posted by Greg Pope

With just a week to go before the Questionmark Users Conference, here’s a little taste of the presentation I will be giving on psychometrics. I will also be running a session on Item Analysis and Test Analysis.

So, let’s talk about sample size and question difficulty!

How does the number of participants who take a question relate to the robustness, or stability, of the question difficulty statistic (p-value)? Basically, the smaller the number of participants tested, the less robust and stable the statistic. For example, if 30 participants take a question and the p-value in the Item Analysis Report is 0.600, the 95% confidence range for the theoretical “true” p-value (the value you would get if all participants in the world took the question) runs from 0.425 to 0.775. In practical terms, if another 30 participants were tested, you could see a p-value on the Item Analysis Report anywhere in roughly that range. The takeaway is that if high-stakes decisions are being made using p-values (e.g., whether to drop a question from a certification exam), the more participants you can test, the more robust your results will be. Similarly, if you are beta testing and want to use the results to decide which questions to include in your test form, the more participants you beta test, the more confidence you can have in the stability of the statistics. Below is a graph that illustrates this relationship.

[Figure: how sample size influences the confidence range of the p-value]
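The confidence range quoted above is consistent with the normal-approximation (Wald) interval for a proportion, p ± 1.96 × √(p(1 − p)/n). The post does not say which formula the Item Analysis Report or the chart uses, so treat the sketch below as an assumption rather than Questionmark's method; the function name `p_value_interval` is hypothetical. It reproduces the 0.425 to 0.775 range for p = 0.600 and n = 30 and shows how the range narrows as more participants are tested.

```python
import math


def p_value_interval(p, n, z=1.96):
    """Approximate 95% confidence range for a question's p-value.

    Assumption: uses the normal-approximation (Wald) interval for a
    proportion, p +/- z * sqrt(p * (1 - p) / n), clipped to 0..1.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be between 0 and 1")
    if n <= 0:
        raise ValueError("n must be positive")
    half_width = z * math.sqrt(p * (1.0 - p) / n)
    return max(0.0, p - half_width), min(1.0, p + half_width)


if __name__ == "__main__":
    # Same observed p-value, increasing numbers of participants.
    for n in (30, 100, 300, 1000):
        low, high = p_value_interval(0.600, n)
        print(f"n={n:5d}: {low:.3f} to {high:.3f}")
```

With 30 participants this prints 0.425 to 0.775; with 1,000 participants the same observed p-value gives a range of about 0.570 to 0.630.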

The same relationship between sample size and stability holds for other common statistics used in psychometrics. For example, the item-total correlation (point-biserial correlation coefficient) can vary a great deal when it is calculated from a small sample. In the example below, an observed correlation of 0 can actually vary by more than 0.8 in either direction.

[Figure: how sample size influences the item-total correlation]
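One common way to put a confidence range around a correlation is Fisher's z transformation. The post does not state which method or sample sizes its chart uses, so the sketch below is an assumption, and the function name `correlation_interval` is hypothetical. Under this approximation, an observed correlation of 0 from only 5 participants has a 95% range wider than plus or minus 0.8, and the range shrinks steadily as the sample grows.

```python
import math


def correlation_interval(r, n, z=1.96):
    """Approximate 95% confidence range for a correlation coefficient.

    Assumption: Fisher's z transformation, where atanh(r) is treated as
    roughly normal with standard error 1 / sqrt(n - 3). This is an
    approximation for point-biserial correlations.
    """
    if not -1.0 < r < 1.0:
        raise ValueError("r must be strictly between -1 and 1")
    if n <= 3:
        raise ValueError("n must be greater than 3")
    center = math.atanh(r)
    half_width = z / math.sqrt(n - 3)
    # Transform the interval back from the z scale to the correlation scale.
    return math.tanh(center - half_width), math.tanh(center + half_width)


if __name__ == "__main__":
    # An observed item-total correlation of 0 at several sample sizes.
    for n in (5, 10, 30, 100, 1000):
        low, high = correlation_interval(0.0, n)
        print(f"n={n:5d}: {low:+.3f} to {high:+.3f}")
```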

Which Question Type To Use?

Posted by Julie Chazyn

Test writers often say it can be hard to know which question type to use in a given situation. I was pleased to find an article on this subject by Monique Donahue of the American Hotel & Lodging Educational Institute. Monique explains appropriate uses for various question types such as true/false, multiple choice, matching and fill-in-the-blank. Test Writing 101: Making The Grade is worth a look. The article covers testing fundamentals and some do’s and don’ts of test and question writing. Check it out!

You can get more question-writing advice in this PowerPoint presentation from the Questionmark Learning Cafe: Creating Effective Assessments.

Psychometrics 101: Item Total Correlation


Posted by Greg Pope

I’ll be talking about a subject dear to my heart, psychometrics, at the Questionmark Users Conference, April 5-8. Here’s a sneak preview of one of my topics: item-total correlation! What is it, and what does it mean?

The item-total correlation is the correlation between the question score (e.g., 0 or 1 for a multiple choice question) and the overall assessment score (e.g., 67%). We expect that participants who get a question correct will, in general, have higher overall assessment scores than participants who get it wrong. Similarly, for an essay question scored from 0 to 5, participants who did a really good job on the essay (scored 4 or 5) should have higher overall assessment scores (perhaps 85-90%). This relationship is shown in an example graph below.
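As a minimal sketch, the item-total correlation can be computed as an ordinary Pearson correlation between each participant's question score and overall score; with 0/1 question scores this equals the point-biserial correlation. The function name and the scores below are hypothetical, made up for illustration, and this is not necessarily how any particular report calculates it.

```python
import math


def item_total_correlation(item_scores, total_scores):
    """Pearson correlation between one question's scores and overall scores.

    With 0/1 question scores this is the point-biserial correlation; the
    same formula works for polytomous scores such as a 0-5 essay rating.
    """
    n = len(item_scores)
    if n != len(total_scores) or n < 2:
        raise ValueError("need two equal-length lists with at least 2 scores")
    mean_item = sum(item_scores) / n
    mean_total = sum(total_scores) / n
    cov = sum((x - mean_item) * (y - mean_total)
              for x, y in zip(item_scores, total_scores))
    var_item = sum((x - mean_item) ** 2 for x in item_scores)
    var_total = sum((y - mean_total) ** 2 for y in total_scores)
    if var_item == 0 or var_total == 0:
        raise ValueError("correlation is undefined when scores do not vary")
    return cov / math.sqrt(var_item * var_total)


if __name__ == "__main__":
    # Hypothetical data: 1 = question answered correctly, 0 = incorrect,
    # paired with each participant's overall assessment score (%).
    item = [1, 1, 1, 0, 1, 0, 0, 1]
    total = [88, 92, 75, 54, 81, 60, 67, 85]
    print(f"item-total correlation: {item_total_correlation(item, total):.3f}")
```

In this made-up data the participants who answered correctly have higher overall scores, so the correlation comes out strongly positive, which is the pattern a well-behaved question should show.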


In psychometrics this relationship is called ‘discrimination’: it refers to how well a question differentiates between participants who know the material and those who do not. Participants who know the material taught to them should get high scores on questions and high overall assessment scores. Participants who did not master the material should get low scores on questions and lower overall assessment scores. The item-total correlation captures this relationship and so helps evaluate the performance of questions. We want lots of highly discriminating questions on our tests because they are the most finely tuned measurements of what participants know and can do.

When looking at item-total correlations, negative values are generally a major red flag: it is unexpected that participants who get low scores on a question get high scores on the assessment. A negative value could indicate a mis-keyed question, or a question that was highly ambiguous and confusing to participants. Item-total correlations (point-biserial) between 0 and 0.19 may indicate that the question is not discriminating well, values between 0.2 and 0.39 indicate good discrimination, and values of 0.4 and above indicate very good discrimination.
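The guidelines above can be expressed as a small lookup, for example to flag questions in a batch of item analysis results. This is a sketch; the function name is hypothetical, and it assumes values between 0.19 and 0.2 fall into the lower band, since the post's ranges leave that gap unspecified.

```python
def interpret_item_total_correlation(r):
    """Map an item-total (point-biserial) correlation to the bands in this post.

    Assumption: anything below 0.2 (including 0.19 to 0.2) counts as
    "may not be discriminating well".
    """
    if not -1.0 <= r <= 1.0:
        raise ValueError("a correlation must be between -1 and 1")
    if r < 0:
        return "red flag: check for a mis-keyed or ambiguous question"
    if r < 0.2:
        return "may not be discriminating well"
    if r < 0.4:
        return "good discrimination"
    return "very good discrimination"


if __name__ == "__main__":
    for r in (-0.15, 0.05, 0.25, 0.45):
        print(f"{r:+.2f}: {interpret_item_total_correlation(r)}")
```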

Looking forward to Breakfast Briefings this spring!

Posted by Julie Chazyn

Come to London, Manchester and Edinburgh in May

It won’t be long before our UK team sets out on their annual round of Breakfast Briefings in England and Scotland. We started the briefings several years ago and have found they’re a great way to connect with people and show them all the new technologies that are coming on board. For people who’ve never used online quizzes and tests, it’s a chance to learn the basics. There will also be tips about blended delivery, the growing use of smartphones and other mobile devices, and new ways to harvest content from subject matter experts.

This year’s briefings are set for Manchester on 12 May, London on 13 May, and Edinburgh on 19 May. We hope you’ll come along for breakfast and a good morning’s conversation. Please register ahead of time so we will know to expect you!

Welcome to Getting Results—The Questionmark Blog!

Posted by Joan Phaup

We have always believed in promoting best practices in assessment and enjoy lively dialog about the important role assessments play in measuring learning and improving performance. This blog gives us yet another way to keep the conversation going!

Here you will find news about learning events, products, and trends in learning and assessment. We will also post technical pointers, case studies and advice about effective item writing, secure delivery, reporting, analyzing results and other essentials. Check in here for links to learning resources including white papers and podcast interviews with assessment professionals and experts on best practices.

Stay connected with us and with the wider testing and assessment community right here. Ask questions, post comments and take the opportunity to spark discussions.

Together, let’s explore how best to create, deliver and report on assessments that help individuals and organizations work more effectively. And let’s have some fun in the process!

We’ll update the blog often, and you can stay in touch easily by subscribing to this blog through our RSS feed. We hope you will contribute to the blog, too, by submitting your comments and sharing any particular post you like!