Psychometrics 101: Sample size and question difficulty (p-values)


Posted by Greg Pope

With just a week to go before the Questionmark Users Conference, here’s a little taste of the presentation I will be doing on  psychometrics. I will also be running a session on Item Analysis and Test Analysis.

So, let’s talk about sample size and question difficulty!

How does the number of participants that take a question relate to the robustness/stability of the question difficulty statistic (p-value)? Basically the smaller the number of participants tested the less robust/stable the statistic. So if 30 participants take a question and the p-value that appears in the Item Analysis Report is 0.600 the range that the theoretical “true” p-value (if all participants in the world took the question) could fall into 95% of the time is between 0.775 and 0.425. This means that if another 30 participants were tested you could get a p-value on the Item Analysis Report anywhere from 0.775 to 0.425 (95% confidence range). The take away is that if high stakes decisions are being made using p-values (e.g., whether to drop a question from a certification exam) the more participants that can be tested the better to get more robust results. Another example is that if you are conducting beta testing and you want to know which questions to include in your test form based on the beta test results the more participants you can beta test the better in terms of the confidence you will have in the stability of the statistics. Below is a graph that illustrates this relationship.sample-size-influences-p-value-chart1

This relationship between sample size and the stability of other statistics applies to other common statistics used in psychometrics. For example the item-total correlation (point biserial correlation coefficient) can vary a great deal when small sample sizes are used to calculate it. In the example below we see that an observed correlation of 0 can actual vary by over 0.8 (plus or minus).sample-sixe-influences-chart1