Psychometrics 101: Item Total Correlation

Posted by Greg Pope

I’ll be talking about a subject dear to my heart — psychometrics — at the Questionmark Users Conference April 5-8. Here’s a sneak preview of one of my topics: item-total correlation! What is it, and what does it mean?

The item-total correlation is the correlation between the question score (e.g., 0 or 1 for a multiple choice question) and the overall assessment score (e.g., 67%). We expect that participants who get a question correct will, in general, have higher overall assessment scores than participants who get the question wrong. The same holds for essay-type questions: if a question is scored between 0 and 5, participants who did a really good job on the essay (scoring a 4 or 5) should have higher overall assessment scores (perhaps 85-90%). This relationship is shown in the example graph below.

[Figure: example graph showing the relationship between question scores and overall assessment scores]

In psychometrics this relationship is called ‘discrimination’, referring to how well a question differentiates between participants who know the material and those who do not. Participants who have mastered the material taught to them should get high scores on questions and high overall assessment scores; participants who have not mastered the material should get low question scores and lower overall assessment scores. This is the relationship that the item-total correlation captures, helping us evaluate the performance of questions. We want lots of highly discriminating questions on our tests because they are the most fine-tuned measurements of what participants know and can do. When looking at an item-total correlation, negative values are generally a major red flag: it is unexpected for participants who get low scores on a question to get high scores on the assessment. This could indicate a mis-keyed question, or a question that was highly ambiguous and confusing to participants. Item-total correlation (point-biserial) values between 0 and 0.19 may indicate that the question is not discriminating well, values between 0.2 and 0.39 indicate good discrimination, and values of 0.4 and above indicate very good discrimination.
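
For readers who want to try this on their own data, here is a minimal sketch in Python with made-up scores (reporting tools such as Questionmark’s item analysis reports calculate this statistic automatically):

    import numpy as np

    # Item scores: 1 if each participant answered the question correctly, 0 otherwise.
    item = np.array([1, 1, 0, 1, 0, 1, 0, 0, 1, 1])
    # Overall assessment scores (percentages) for the same participants.
    total = np.array([88, 92, 55, 75, 60, 81, 47, 52, 95, 70])

    # The point-biserial item-total correlation is the Pearson correlation between
    # the dichotomous item score and the overall assessment score.
    r = np.corrcoef(item, total)[0, 1]

    if r < 0:
        print(f"r = {r:.2f}: red flag - possible mis-key or ambiguous question")
    elif r < 0.2:
        print(f"r = {r:.2f}: may not be discriminating well")
    elif r < 0.4:
        print(f"r = {r:.2f}: good discrimination")
    else:
        print(f"r = {r:.2f}: very good discrimination")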

22 Responses to “Psychometrics 101: Item Total Correlation”

  1. Cheryl Lamerson says:

    Greg:
    I’d really like to use all the blogs you’ve posted this year as a reference document, but when I try to print them I get the typed info, but not the great graphs and figures. Is there anything you can send me that is more easily printed?

    You may recognize my name. I was previously a Colonel in the Canadian Forces in charge of the Directorate of Human Resources Research and Evaluation. You had done some work for us when you were working with Bruno Zumbo. I ran across your name again at NOCA last year and have been enjoying info from Questionmark ever since. Hope all is well with you.

  2. Greg Pope says:

    Hi Cheryl, great to hear from you! It is really nice to hear that you have been enjoying my posts and I would be happy to send them to you. I will get them packaged up into one document and email them to you.

    I will be at NOCA again this year with several presentations so if you are attending NOCA this year it would be great to see you there!

    All the best,

    Greg

  3. […] of the question. Extremely easy or extremely hard questions have a harder time obtaining those high discrimination statistics that we look for. In the graph below, I show the relationship between question difficulty p-values […]

  4. Arthi Veerasamy says:

    Hi Greg Pope,
    I have a question regarding this. I have used item-total correlation to test unidimensionality, following a study that used the same questionnaire as mine. My supervisor said it is wrong and that unidimensionality can be measured only using factor analysis. Please give your suggestion. Thanks.

  5. Greg Pope says:

    Hello Arthi, yes, your supervisor is correct: exploratory or confirmatory factor analysis (FA; http://en.wikipedia.org/wiki/Factor_analysis) and principal component analysis (PCA; http://en.wikipedia.org/wiki/Principal_component_analysis) are the most typical ways of conducting dimensionality analyses. Statistical programs like SPSS provide these analysis features. I was not advocating using item-total correlations directly for dimensionality research, although one would expect higher item-total correlations for questions on assessments that all measure the same construct.
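
    As a rough illustration of the idea (not a substitute for a proper FA or CFA), here is a minimal Python sketch that looks at the eigenvalues of the inter-item correlation matrix: a single dominant eigenvalue is loose evidence of one underlying dimension. The response matrix below is just placeholder data:

        import numpy as np

        # responses: participants x items matrix of item scores (0/1 or Likert values).
        # Placeholder random data for illustration; substitute your own matrix.
        rng = np.random.default_rng(0)
        responses = rng.integers(0, 2, size=(200, 10)).astype(float)

        # Eigenvalues of the inter-item correlation matrix. If the first eigenvalue is
        # much larger than the rest (a sharp "scree" drop), that is loose evidence of a
        # single dominant dimension; a full EFA/CFA is still the appropriate analysis.
        corr = np.corrcoef(responses, rowvar=False)
        eigenvalues = np.sort(np.linalg.eigvalsh(corr))[::-1]
        print("Eigenvalues:", np.round(eigenvalues, 2))
        print("Proportion of variance on first component:",
              round(eigenvalues[0] / eigenvalues.sum(), 2))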

  6. Gregory Poon says:

    Hi Greg,

    What is the basis for the cutoff ranges (0 to 0.2, 0.2 to 0.4, 0.4 to 1) in item-total correlation? If they are arbitrary, do you know who the source is? Many thanks.

  7. Greg Pope says:

    Hi Gregory, thanks for your question. Yes, the cut-offs for item-total correlations are semi-arbitrary, in that different organizations can use different ranges, and there are a lot of factors to consider when analyzing Classical Test Theory item statistics. The ranges I stated are fairly common among organizations that conduct item analyses. In previous places where I have worked, the cut-off for an acceptable question (i.e., whether it should continue on into the actual assessment for large-scale administration) in terms of discrimination was around 0.300.

    Academic references are not always easy to find, as many books don’t suggest ranges but rather state “the higher the better above zero.” For example, in Shrock and Coscarelli’s 2007 book on criterion-referenced test development (http://www.amazon.com/Criterion-referenced-Test-Development-Technical-Guidelines/dp/0787988502) they discuss item-total correlations as an important part of item analysis but do not provide suggested ranges of values. They are not alone; many well-written and well-respected books on the subject avoid stating specific values.

    However, there are some academic references out there if you dig for them:
    • Nunnally & Bernstein (1994). Psychometric Theory (3rd ed.). New York: McGraw-Hill.
    o Page 304: “A cutoff of .3 is an arbitrary guide to defining a discriminating item.” However, on the next page they suggest that items with item-total correlation values greater than 0.300 are “discriminating.”
    o Page 306: The authors state that “…very poorly discriminating items (r < 0.2)…”
    • Traub (1994). Reliability for the Social Sciences: Theory and Applications. Thousand Oaks, CA: Sage.
    o Page 108: “…relatively large indices of discrimination (say 0.30 or more)…”
    • de Vaus (2002). Analyzing Social Science Data: 50 Key Problems in Data Analysis. Thousand Oaks, CA: Sage.
    o Page 128: “To remain in a scale an item should have an item-total correlation of at least 0.3.”
    • Leong & Austin (Eds.) (2006). The Psychology Research Handbook: A Guide for Graduate Students and Research Assistants. Thousand Oaks, CA: Sage. Chapter 9 (Scale Development; Lounsbury, Gibson, Saudardas).
    o Page 144: The authors recommend corrected item-total correlations of 0.400 and higher.

    In a previous post I provided more range suggestions based on my experience and based on discussions with other psychometric professionals from a range of organizations: http://blog.questionmark.com/item-analysis-analytics-part-6-determining-whether-a-question-makes-the-grade

    I hope this helps and thanks again for your question!

    Greg

  8. Kathleen says:

    Hi Greg,
    I am doing an assignment on psychometrics at the moment, and a question asks: “When the item-total correlation is _________ the slope of the item characteristic curve is ____.”
    The choices are: a. negative, negative; b. negative, positive; c. positive, negative; d. positive, zero.
    Everything I have read points to it being negative, negative. I have come up with this because I have ruled out positive, zero and positive, negative. It also doesn’t make sense that an item-total correlation could be negative and the ICC positive. What are your thoughts on this? I can’t find any resource that talks about negative item-total correlations.
    Many thanks,
    Kathleen

  9. Fransisca Sidjaja says:

    thank you SOOO MUCH!! Your SIMPLE and CLEAR explanation really helped me to understand!

  10. […] Psychometrics 101: Item Total Correlation – Questionmark BlogMar 26, 2009 … I’ll be talking about a subject dear to my heart — psychometrics — at the … This could indicate a mis-keyed question or that the question was … […]

  11. “Psychometrics: Item Total Correlation | Getting Results — The Questionmark Blog” was
    indeed actually engaging and helpful! Within modern universe that’s tricky to
    accomplish. I am grateful, Maxine

  12. Suzanne Czech says:

    Hi Greg,

    A Canadian myself, I moved from Vancouver to Australia a year ago to obtain my PhD in psychology. I am currently employed as a lecturer at USQ. I have greatly appreciated your responses that come up periodically in my searches, and I have a specific question I am hoping you can assist with:

    I am conducting item analysis on a scale I designed to (be a better) measure the construct of juvenile psychopathy (for the purposes of questioning the validity of the construct itself, not simply the measurement of it).

    I have computed the item difficulty indices and wish to retain only the ‘difficult’ items. I know that this is a practice that is sometimes used (rather than the traditional p = [0.3-0.7] as optimal in an achievement test for example), but I cannot find a reference.

    Can you help me?

    Thank you,
    Suzanne

  13. Austin Fossey says:

    Dear Suzanne,
    Your use case is very interesting! If I understand correctly, the construct your instrument is measuring is “validity of a test” not “juvenile psychopathy.” You are correct that we normally want medium difficulty items because they provide the greatest discrimination in the population, but I assume that you wish to only use difficult items in order to provide maximum discrimination for the high performing members of the population (in this case, the tests with the most validity).
    This is a common issue in education where high performing students have scores with large standard errors of measurement because they are not asked enough difficult questions. When testing students, this issue can be addressed with computer adaptive testing and an item response theory measurement model, thus ensuring that high performing students are asked more difficult questions.
    While I am not familiar with any particular studies where this approach to item selection has been used with classical test theory statistics, I am sure that there will be examples in fields like psychology, sociology, or even marketing. You may be able to defend the decision by explaining your reasoning for sacrificing score accuracy for low or medium validity scores in order to obtain maximum discrimination between high validity scores. You could perhaps even demonstrate how the scores and reliability change when your instrument is only comprised of difficult items.
    Though I do not know of any studies that can be used as examples, it may be helpful to cite references on test reliability. Haertel has a good chapter in Educational Measurement, 4th Edition (Eds. Brennan), and Crocker and Algina also have a good chapter about reliability in their book, Introduction to Classical and Modern Test Theory.
    Good luck with the study, and I hope you will keep us posted on the results!
    Sincerely,
    Austin Fossey
    Reporting and Analytics Manager
    Questionmark

  14. Austin Fossey says:

    Dear Kathleen,
    Great question! You are correct that when the item-total correlation is negative, the slope of the ICC is also negative. We are used to most items having a positive item-total correlation; i.e., higher performing students are more likely to answer the item correctly.
    However, we can also imagine an item where this is the opposite, either by design or by mistake: an item where poorly performing students get the answer right more often than higher performing students. In this scenario, the item-total correlation would be negative, and the ICC would have a negative slope.
    A common example of this would be if you had an item that was miskeyed. All of the poorly performing students would select the wrong answer, but they would get a point since the item is miskeyed.
    Good luck on your course!
    Sincerely,
    Austin Fossey
    Reporting and Analytics Manager
    Questionmark

  15. akhila says:

    Sir,
    Your post is very informative. I want to know how we can carry out this item-total correlation procedure.

  16. Austin Fossey says:

    Hi Akhila,

    Thanks for the kind feedback! The item-total correlation statistic is the Pearson product-moment correlation between item scores and total test scores. This value is automatically calculated in Questionmark’s Question Statistics Report and Item Analysis Report.
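
    If you want to compute it yourself, a minimal sketch in Python with made-up scores is below. It also shows the “corrected” item-total correlation (the item is removed from the total so that it is not correlated with itself), which some of the references in the comments above use:

        import numpy as np

        # scores: participants x items matrix of item scores (0/1 for multiple choice).
        scores = np.array([
            [1, 1, 0, 1],
            [1, 0, 0, 1],
            [0, 0, 1, 0],
            [1, 1, 1, 1],
            [0, 1, 0, 0],
        ], dtype=float)

        total = scores.sum(axis=1)

        for i in range(scores.shape[1]):
            item = scores[:, i]
            # Item-total correlation: Pearson correlation of item score with total score.
            r_raw = np.corrcoef(item, total)[0, 1]
            # Corrected item-total correlation: the item is excluded from the total.
            r_corrected = np.corrcoef(item, total - item)[0, 1]
            print(f"Item {i + 1}: r = {r_raw:.2f}, corrected r = {r_corrected:.2f}")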

    Sincerely,

    Austin Fossey
    Reporting and Analytics Manager
    Questionmark

  17. Ana Flor Orna says:

    Hi Greg. I am in charge of training in our organization. As part of determining the effectiveness of the training we conduct, we give post-tests to participants. I am in charge of determining the passing score for each test. Can you give me ways to determine the passing score? Right now, I am trying item analysis, specifically determining the item difficulty level. Please also provide me with references which I can read to fully understand it. I will highly appreciate your kind assistance. Thank you very much.
    Ana

  18. Ana Flor Orna says:

    Hi. Good day! I am in charge of determining the passing score for our training post-tests. I have a little knowledge of item analysis and reliability coefficients, but I calculate them manually. Can you help me with other ways to obtain those values more simply? Our company does not allow us to download or install software, so I do not know if I could use SPSS. Besides, I do not have a licensed copy and I do not know how to use it.
    My other concern is: is it OK to compute the item difficulty level only? I have a hard time with item discrimination.
    What do you use as reference values for item difficulty? I am using values from the book by Reynolds and Livingston. Can you give me reference values for multiple choice and constructed-response tests?
    My major concern is: can you help me with methods for determining a passing score for our post-tests? I really have a hard time with this one. If you have references which I can read, I would really appreciate it. Thank you very much.
    Ana

  19. Austin Fossey says:

    Hi Ana,

    Unfortunately, Greg no longer works at Questionmark. My name is Austin Fossey, and I am Questionmark’s current psychometrician, so I hope you don’t mind me stepping in and providing a reply on Greg’s post.

    Greg and I have both written a couple of posts about determining passing scores (also known as standard setting or cut score studies). These Questionmark blog articles might be a good place to start to get an idea about what each method involves:

    Standard Setting: Methods for Establishing Cut Scores
    Standard Setting: Angoff Method Considerations
    Standard Setting: Compromise and Normative Methods

    Alan Wheaton and Jim Parry also did a webinar about the Angoff Method, which is publicly available (with a Communities account) on the Questionmark website.

    The Modified Angoff Method is probably the most used method in Classical Test Theory approaches, but there are a lot of different ways to set standards. These different methods are sometimes classified as content-based methods, participant-based methods, and compromise methods. There are two good references that I would recommend which provide overviews and details of many common standard setting methods:

    Ron Hambleton and Mary Pitoniak’s chapter, “Setting Performance Standards” in Educational Measurement, 4th Edition
    Greg Cizek’s chapter, “Standard Setting” in the Handbook of Test Development (a new edition may be coming out soon though)

    I hope this helps!

    Cheers,

    Austin

  20. Austin Fossey says:

    Hi Ana,

    SPSS could be used to calculate item statistics and test form statistics, but some of the item statistics are not available out of the box in SPSS, so the calculation would need to be created by the user. If test development is the only use case, it is likely just as simple to create the calculations in a spreadsheet application like Excel rather than paying for an SPSS license. If you can get permission to download software and you are comfortable working with code, R software is free, open source, and has a package for item analysis called CTT (though I have never used this particular package, so I cannot recommend it either way).

    If your company does not allow employees to download software, you could also use a cloud-based solution where the software is hosted online. Our software, Questionmark OnDemand, has an item analysis report which handles all of these calculations for our customers and can be accessed through a browser. I am sure there are other companies with similar solutions.

    Another option is to hire a psychometric consultant to do the analysis for you. There are many small consulting companies or freelancers that conduct item analyses, form analyses, and other research for test programs (they sometimes call these “in service reports”). These consultants typically have their own software.

    In general, it is not recommended that test developers use item difficulty as the primary statistic for item selection for criterion-referenced assessments using classical test theory (CTT). Item discrimination is a much better statistic to use for analyzing items because items with higher discrimination create instruments that discriminate better at a total score level. This was demonstrated in two studies in 1952: one by Fred Lord and one by Lee Cronbach and Willard Warrington, though the findings have also been replicated in later studies. Item difficulty is informative, but item discrimination is usually the most valuable statistic for making decisions about item retention and revision.

    The range of acceptable difficulty values depends somewhat on the assessment design (e.g., aptitude assessments may need a wider range of item difficulty). If discrimination is the primary goal, Henryssen had a study in 1971 which said that the difficulty values should be between 0.40 and 0.60 when discrimination values are between 0.30 and 0.40, though he said that wider ranges of item difficulty could be used for higher discrimination statistics. In my experience, a lot of test developers will go by a rule of thumb out of necessity–a range between 0.30 and 0.70 is common. I have even seen some organizations with limited item pools stretch that to 0.25 – 0.90. For multiple choice items, you want to make sure that lower bound is not lower than the correct response rate we would observe by random guessing.
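
    To make that concrete, here is a small Python sketch of the rule of thumb described above (the ranges are illustrative, not a standard):

        def flag_difficulty(p_value, n_options, low=0.30, high=0.70):
            # Raise the lower bound to at least the chance rate for a multiple choice
            # item (1 / number of options), since a p-value near that level is
            # consistent with random guessing.
            lower = max(low, 1.0 / n_options)
            if p_value < lower:
                return "too hard (below the acceptable lower bound)"
            if p_value > high:
                return "too easy"
            return "acceptable difficulty"

        # A 4-option multiple choice item answered correctly by 22% of participants
        # is flagged: 0.22 is below the 0.30 floor and close to the 0.25 chance rate.
        print(flag_difficulty(0.22, n_options=4))
        # An item answered correctly by 55% of participants falls within 0.30-0.70.
        print(flag_difficulty(0.55, n_options=4))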

    I left some suggestions for standard setting methods in my other response, which I hope are helpful. Depending on your needs and resources, you might also consider working with a psychometric consultant on a standard setting, either to plan the study, provide training, or run the entire study. This is not to say that you can’t do a standard setting yourself, but when I was new to standard setting, I found it helpful to observe a couple of studies before I did my own.

    Cheers,

    Austin

  21. Ana Flor Orna says:

    Thank you very much, Sir Austin, for your response.

    The book by Reynolds and Livingston (2012) suggests some item difficulty values depending on the number of response options. I am not sure if they are for norm-referenced tests only or can also be applied to mastery tests. The values vary depending on whether the test is selected-response or constructed-response. Is it advisable to use varying values? Our training post-tests usually have only 15 or 20 items, and the types of questions also differ; for example, one test has true-false items and short-answer items. Do you have item difficulty values appropriate for mastery/criterion-referenced tests with different question types? I find it cumbersome to refer to different values for each type of test. For discrimination, I am using 0.3 and above = good, 0.1-0.3 = fair, below 0.1 = poor.

    I have another question with regard to setting the cut score. Are there other methods that do not require SMEs or other people to provide ratings? If you could suggest a method which I can do on my own, I would greatly appreciate it. Hiring consultants is not feasible either.

    Also, how many examinees should have taken the test before I can use the data? Is it okay if n=50 minimum?

    Thank you very much.

    Regards,
    Ana

  22. Austin Fossey says:

    Hi Ana,

    It is hard to say for sure since I have not read Reynolds and Livingston’s book, but I am going to guess they know what they are talking about since they are accomplished psychometricians with numerous publications, whereas I am just a psychometrician who writes a blog for a software company. As I said before, item difficulty is usually not the primary item statistic used for selection, but Reynolds and Livingston likely have some solid research backing their suggestions, and I will defer to them. Henryssen did a study in 1971 suggesting that the ideal item difficulty range for criterion-referenced assessments is 0.40 – 0.60, though that range could be extended if the average item-total correlation discrimination exceeded 0.60.

    I could see where one might adjust the range of acceptable difficulty values if they were accommodating a guessing parameter. For example, a multiple choice item with four options would return a difficulty value of 0.25 just based on random guessing. Anything lower than that would be a red flag (though it would also likely show up as a poor discrimination statistic anyway). That lower bound could be adjusted depending on what the probability of a correct response would be for a participant who was randomly guessing. Most of the classical test theory assessments I have worked on have been multiple choice or multiple response items anyway, so I have directed clients to focus on the item discrimination statistics.

    Someone else might jump in to correct me, but to my knowledge, there are no standard setting methods for criterion-referenced assessments that do not require some aspect of judgment, and those judgments have to be made by a qualified panel of judges representing the relevant stakeholders. Hambleton and Pitoniak’s chapter in Educational Measurement (4th ed) supports this observation, and there is a lot of literature about the judgment process and controlling for threats to validity. Even in K12 classroom assessments, teachers do not set cut scores in a vacuum. Classroom assessment cut scores are linked back to performance and grading standards often set by researchers and SMEs at a district or state level. Some corporations do the same thing for their training programs. In high stakes assessments, cut scores for every assessment are determined by a panel of judges reviewing content, participants, work products, or score profiles, and many programs will actually use multiple panels and multiple standard setting methods to help validate and compare their recommended cut scores.

    As for your last question, you can use the data with any sample size as long as you understand the limitations of the statistics. A small sample size might still be useful for some interpretations, especially if the population is not that big. In general, the required sample size should depend on a statistical power analysis, but if you are looking for an all-encompassing rule of thumb, you can kind of take your pick. International Credentialing Associates (who no longer exist) released a brief at an ATP conference (I forget which one) citing sources that differed in their recommendations. Some experts said 50 participants, others said 100. One even said 20, but I am guessing that is assuming some enormous statistical power. Questionmark’s software reports the confidence intervals for item statistics, so test developers can make an informed decision about the stability of their statistics when conducting an item review or building a test form.
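
    As a rough illustration of why sample size matters, here is a small Python sketch of a normal-approximation 95% confidence interval around an item difficulty (p-value) estimate at different sample sizes:

        import math

        def difficulty_ci(p, n, z=1.96):
            # Normal-approximation 95% confidence interval for an item difficulty
            # (proportion correct) estimated from n participants.
            se = math.sqrt(p * (1 - p) / n)
            return max(0.0, p - z * se), min(1.0, p + z * se)

        for n in (20, 50, 100):
            low, high = difficulty_ci(0.60, n)
            print(f"n = {n}: p = 0.60, 95% CI = ({low:.2f}, {high:.2f})")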

    Cheers,

    Austin
