When to weight items differently in CTT

Posted by Austin Fossey

In my last post, I explained the statistical futility and interpretive quagmires that result from using negative item scores in Classical Test Theory (CTT) frameworks. In this post, I wanted to address another question I get from a lot of customers: when can we make one item worth more points?

This question has come up in a couple of cases. One customer wanted to make “hard” items on the assessment worth more points (with difficulty being determined by subject-matter experts). Another customer wanted to make certain item types worth more points across the whole assessment. In both cases, I suggested they weight all of the items equally.

Interested in learning more about classical test theory and applying item analysis concepts? Join Psychometrician Austin Fossey for a free 75 minute online workshop — Item Analysis: Concepts and Practice — Tuesday, June 23, 2015 *space is limited

Before I reveal the rationale behind the recommendation, please permit me a moment of finger-wagging. The impetus behind these questions was that these test developers felt that some items were somehow better indicators of the construct, thus certain items seemed like more important points of evidence than others. If we frame the conversation as a question of relative importance, then one recognizes that the test blueprint document should contain all of the information about the importance of domain content, as well as how the assessment should be structured to reflect those evaluations. If the blueprint cannot answer these questions, then it may need to be modified. Okay, wagging finger back in its holster.

In general, weights should be applied at a subscore level that corresponds to the content or process areas on the blueprint. A straightforward way to achieve this structure is to write enough items that each area’s share of the items matches its blueprint weight. For example, if Topic A is supposed to be 60% of the assessment score and Topic B is supposed to be 40% of the assessment score, it might be best to ask 60 questions about Topic A and 40 questions about Topic B, all scored dichotomously [0,1].

There are times when this is not possible. Certain item formats may be scored differently or be too complex to deliver in bulk. For example, if Topic B is best assessed with long-format essay items, it might be necessary to have 60 selected response items in Topic A and four essays in Topic B, each worth ten points and scored on a rubric. Topic A then still contributes 60 points and Topic B contributes 40 points, which matches the 60/40 blueprint weights.

Figure: Example of a simple blueprint where items are worth more points due to their topic’s relative importance (weight)

The critical point is that the content areas (e.g., Topics) are driving the weighting, and all items within the content area are weighted the same. Thus, an item is not worth more because it is hard or because it is a certain format; it is worth more because it is in a topic that has fewer items, and all items within the topic are weighted more because of the topic’s relative importance on the test blueprint.
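The arithmetic behind topic-level weighting can be sketched in a few lines of Python. This is only an illustration: the function name `composite_score`, the dictionary layout, and the participant's points are made up for the example, while the 60/40 weights and item counts come from the Topic A and Topic B example above.

```python
def composite_score(topic_scores, blueprint):
    """Combine topic subscores into a percentage using blueprint weights.

    topic_scores: topic -> (points earned, points possible)
    blueprint:    topic -> weight, with the weights summing to 1.0
    """
    total = 0.0
    for topic, (earned, possible) in topic_scores.items():
        # Each topic contributes its proportion correct times its weight.
        total += blueprint[topic] * earned / possible
    return 100.0 * total


# Blueprint weights from the example: Topic A is 60%, Topic B is 40%.
blueprint = {"Topic A": 0.60, "Topic B": 0.40}

# Hypothetical participant: 45 of 60 selected response points in Topic A,
# 30 of 40 rubric points across the four ten-point essays in Topic B.
scores = {"Topic A": (45, 60), "Topic B": (30, 40)}

weighted = composite_score(scores, blueprint)
raw = 100.0 * (45 + 30) / (60 + 40)
print(round(weighted, 2), round(raw, 2))
```

Because the points available in each topic already match the blueprint weights, the weighted composite and the simple raw percentage agree here. That is the appeal of this design: no item-level weighting is needed.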

One final word of caution. If you do choose to weight certain dichotomous items differently, regardless of your rationale, remember that it may bias the item-total correlation discrimination. In these cases, it is best to use the item-rest correlation discrimination statistic, which is provided in Questionmark’s Item Analysis Report.
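To see why weighting affects the item-total correlation, here is a minimal sketch using NumPy. It is not the calculation used in Questionmark’s Item Analysis Report; the function name, the tiny response matrix, and the weight of 5 are all made up for illustration. The item-rest correlation simply removes the item’s own score from the total before correlating.

```python
import numpy as np


def item_total_and_rest(responses, weights=None):
    """Return (item-total, item-rest) Pearson correlations for each item.

    responses: rows are participants, columns are items scored 0 or 1.
    weights:   optional points per item (hypothetical; defaults to 1 each).
    """
    responses = np.asarray(responses, dtype=float)
    n_items = responses.shape[1]
    if weights is None:
        weights = np.ones(n_items)
    scored = responses * np.asarray(weights, dtype=float)  # weighted item scores
    total = scored.sum(axis=1)  # total score, including every item
    item_total, item_rest = [], []
    for j in range(n_items):
        rest = total - scored[:, j]  # total score with item j removed
        item_total.append(np.corrcoef(scored[:, j], total)[0, 1])
        item_rest.append(np.corrcoef(scored[:, j], rest)[0, 1])
    return np.array(item_total), np.array(item_rest)


# Made-up responses from six participants on four dichotomous items.
data = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 1, 0, 0],
    [0, 0, 0, 0],
]

tot_u, rest_u = item_total_and_rest(data)                        # equal weights
tot_w, rest_w = item_total_and_rest(data, weights=[5, 1, 1, 1])  # item 1 worth 5 points
print("item 1 item-total:", tot_u[0], "->", tot_w[0])
print("item 1 item-rest: ", rest_u[0], "->", rest_w[0])
```

In this sketch, giving the first item more points raises its item-total correlation because the item makes up a larger share of the total it is being correlated with. Its item-rest correlation does not change, since the item’s own score is excluded from the rest score.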


When and where should I use randomly delivered assessments?


Posted by Greg Pope

I am often asked my psychometric opinion regarding when and where random administration of assessments is most appropriate.

To refresh memories, this is a feature in Questionmark Perception Authoring Manager that allows you to select questions at random from one or more topics when creating an assessment. Rather than administering the same 10 questions to all participants, you can give each participant a different set of questions that are pulled at random from the bank of questions in the repository.

So when is it appropriate to use random administration? I think that depends on the answer to this question: What are the assessment’s stakes and purpose? If the stakes are low and the assessment scores are used to help reinforce information learned, or to give participants a rough guess as to how they are doing in an area, I would say that using random administration is defensible. However, if the stakes are medium/high and the assessment scores are used for advancing or certifying participants, I usually caution against random administration. Here are a few reasons why:

  • Expert review of the assessment form(s) cannot be conducted in advance (each participant gets a unique form)
      • Generally, SMEs, psychometricians, and other experts will thoroughly review a test form before it is put into live production. This is to ensure that the form meets difficulty, content and other criteria before being administered to participants in a medium/high stakes context. In the case of randomly administered assessments, this review in advance is not possible as every participant obtains a different set of questions.
  • Issues with the calculation of question statistics using Classical Test Theory (CTT)
      • Smaller numbers of participants will be answering each individual question. (Rather than all 200 participants answering all 50 questions in a fixed form test, randomly administered tests generated from a bank of 100 questions may only have a few participants answering each question.)
      • As we saw in a previous blog post, sample size has an effect on the robustness of item statistics. With fewer participants taking each question it becomes difficult to have confidence in the stability of the statistics generated.
  • Equivalency of assessment scores is difficult to achieve and prove
      • An important assumption of CTT is equivalence of forms or parallel forms. In assessment contexts where more than one form of an exam is administered to participants, a great deal of time is spent ensuring that the forms of the assessment are parallel in every way possible (e.g., difficulty of questions, blueprint coverage, question types, etc.) so that the scores participants obtain are equivalent.
      • With random administration it is not possible to control and verify in advance of an assessment session that the forms are parallel because the questions are pulled at random. This leads to the following problem in terms of the equivalence of participant scores:
          • If one participant got 2/10 on a randomly administered assessment and another participant got 8/10 on the same randomly administered assessment, it would be difficult to know whether the participant who got 2/10 scored low because they (by chance) got harder questions than the participant who got 8/10 or whether the low-scoring participant actually did not know the material and therefore scored low.
      • Using meta tags one can mitigate this issue to some degree (e.g., by randomly administering questions within topics by difficulty ranges and other meta tag data) but this would not completely guarantee randomly equivalent forms.
  • Issues with calculation of test reliability statistics using CTT
      • Statistics such as Cronbach’s Alpha are difficult to compute for randomly administered assessments. Random administration produces a lot of missing data for questions (e.g., not all participants answer all questions), which psychometric statistics rarely handle well.
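A small simulation can make the sample-size and equivalence concerns above concrete. This is a sketch under stated assumptions, not a model of how Questionmark Perception selects questions: the function name, the evenly spread p-values in the bank, and the form length of 10 are hypothetical, while the 200 participants and the bank of 100 questions echo the figures in the list.

```python
import random
import statistics


def simulate_random_forms(n_participants, bank_difficulty, form_length, seed=0):
    """Draw one random form per participant.

    Returns (exposure, form_means):
      exposure[q]  - how many participants were given question q
      form_means   - mean p-value (proportion correct) of each participant's form
    """
    rng = random.Random(seed)
    bank_size = len(bank_difficulty)
    exposure = [0] * bank_size
    form_means = []
    for _ in range(n_participants):
        form = rng.sample(range(bank_size), form_length)  # no repeats within a form
        for q in form:
            exposure[q] += 1
        form_means.append(statistics.mean(bank_difficulty[q] for q in form))
    return exposure, form_means


# Hypothetical bank: 100 questions with p-values spread evenly from 0.30 to 0.90.
bank = [0.30 + 0.60 * i / 99 for i in range(100)]
exposure, form_means = simulate_random_forms(200, bank, form_length=10)

print("average responses per question:", statistics.mean(exposure))
print("fewest responses to any question:", min(exposure))
print("hardest vs. easiest form (mean p-value):", min(form_means), max(form_means))
```

With these numbers each question is seen by 20 participants on average, compared with all 200 on a fixed form, and the hardest and easiest forms differ in average difficulty purely by chance.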

There are other alternatives to random administration depending on what the needs are. For example, if random administration is being looked at to curb cheating, options such as shuffling answer choices and randomizing presentation order could serve this need, making it very difficult for participants to copy answers off of one another.
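As a rough illustration of shuffling answer choices, the sketch below reorders the choices for each participant while keeping track of where the correct answer ends up. It is not how any Questionmark product implements shuffling; the function name, the seeding scheme, and the sample question are assumptions made for the example.

```python
import random


def shuffled_question(choices, correct_index, participant_id, question_id):
    """Return (shuffled choices, new index of the correct answer).

    Seeding with the participant and question IDs (a hypothetical scheme)
    makes the order reproducible if the participant reloads the question.
    """
    rng = random.Random(f"{participant_id}:{question_id}")
    order = list(range(len(choices)))
    rng.shuffle(order)
    shuffled = [choices[i] for i in order]
    return shuffled, order.index(correct_index)


# Made-up question content for illustration.
choices = ["Reliability", "Validity", "Difficulty", "Discrimination"]
shown, answer_at = shuffled_question(choices, 0, participant_id="p001", question_id="q17")
print(shown, "-> correct choice is at position", answer_at)
```

Every participant still answers the same questions, so expert review of the form and the item statistics are unaffected; only the presentation differs.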

It is important for an organization to look at their context to determine what is best for them. Questionmark provides many options for our customers when it comes to assessment solutions and invites them to work with us in adopting workable solutions.

Item Analysis Analytics Part 1: What is Classical Test Theory?


Posted by Greg Pope

Item analysis is a hot-button topic for social conversation (okay, maybe just for some people). I thought it might be useful to talk about Classical Test Theory (CTT) and item analysis analytics in a series of blog posts over the next few weeks. This first one today will focus on some of the theory and background of CTT. In subsequent posts on this topic I will lay out a high-level overview of item analysis and then drill down into details. Other testing theories include Item Response Theory (IRT), which might be fun to talk about in another post (at least fun for me).

CTT is a body of theory and research regarding psychological testing that predicts/explains the difficulty of questions, provides insight into the reliability of assessment scores, and helps us represent what examinees know and can do. In a similar manner to theories regarding weather prediction or ocean current flow, CTT provides a theoretical framework for understanding educational and psychological measurement. The essential basis of CTT is that many questions combine to produce a measurement (assessment score) representing what a test taker knows and can do.

CTT has been around a long time (since the early 20th century) and is probably the most widely used theory in the area of educational and psychological testing. CTT works well for most assessment applications because it can work with smaller sample sizes (e.g., 100 or fewer) and because its statistics are relatively simple to compute and understand.

The general CTT model is based on the notion that the observed score that test takers obtain from assessments is composed of a theoretical, unmeasurable “true score” and error. Just as most measurement devices have some error inherent in their measurement (e.g., a thermometer may be accurate to within 0.1 degree 9 times out of 10), so too do assessment scores. For example, if a participant’s observed score (the score reported back to them) on an exam was 86%, their “true score” may actually be between 80% and 92%.
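In symbols, the model described above is usually written as follows, where X is the observed score, T is the true score, and E is the error. The variance decomposition and the reliability ratio are the standard CTT results under the usual assumption that true scores and errors are uncorrelated; the notation is the conventional one rather than anything specific to this post.

```latex
X = T + E, \qquad
\sigma_X^2 = \sigma_T^2 + \sigma_E^2, \qquad
\rho_{XX'} = \frac{\sigma_T^2}{\sigma_X^2}
```

Here the reliability coefficient is the proportion of observed-score variance that is due to true-score variance, so less error variance means higher reliability.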

Measurement error can be estimated and relates back to reliability: greater assessment score reliability means less error of measurement. Why does error relate so directly to reliability? Well, reliability has to do with measurement consistency. If a participant could take the same assessment an infinite number of times with no memory effects, the average of all the scores they obtained would be their true score. The more reliable the measurement, the less the scores would vary each time a participant took that assessment over eternity. (This would be a great place for an afterlife joke but I digress…)
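One common way to turn reliability into an error estimate is the standard error of measurement (SEM), the score standard deviation multiplied by the square root of one minus the reliability. The sketch below applies it to the 86% example; the standard deviation of 10 and the reliability of 0.91 are made-up values chosen only because they roughly reproduce the 80% to 92% band mentioned earlier, and the function names are hypothetical.

```python
import math


def standard_error_of_measurement(score_sd, reliability):
    """SEM = SD * sqrt(1 - reliability), the standard CTT formula."""
    return score_sd * math.sqrt(1.0 - reliability)


def score_band(observed, score_sd, reliability, z=1.96):
    """Approximate band around an observed score (z = 1.96 for about 95%)."""
    sem = standard_error_of_measurement(score_sd, reliability)
    return observed - z * sem, observed + z * sem


# Assumed values for illustration: SD of 10 percentage points, reliability 0.91.
sem = standard_error_of_measurement(10.0, 0.91)
low, high = score_band(86.0, 10.0, 0.91)
print(round(sem, 2), round(low, 1), round(high, 1))
```

Raising the reliability in this sketch shrinks the SEM and narrows the band, which is the relationship between reliability and error described above.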

For a more detailed overview of CTT that won’t make your lobes fall off, try Chapter 5 in Dr. Theresa Kline’s book, “Psychological Testing: A Practical Approach to Design and Evaluation.”

In my next post I will provide a high-level picture of item analysis to continue this conversation.