Writing JTA Task Statements

Posted by Austin Fossey

One of the first steps in an evidence-centered design (ECD) approach to assessment development is a domain analysis. If you work in credentialing, licensure, or workplace assessment, you might accomplish this step with a job task analysis (JTA) study.

A JTA study gathers examples of tasks that potentially relate to a specific job. These tasks are typically harvested from existing literature or observations, reviewed by subject matter experts (SMEs), and rated by practitioners or other stakeholder groups across relevant dimensions (e.g., applicability to the job, frequency of the task). The JTA results are often used later to determine the content areas, cognitive processes, and weights that will be on the test blueprint.

 Questionmark has tools for authoring and delivering JTA items, as well as some limited analysis tools for basic response frequency distributions. But if we are conducting a JTA study, we need to start at the beginning: how do we write task statements?

One of my favorite sources on the subject is Mark Raymond and Sandra Neustel’s chapter, “Determining the Content of Credentialing Examinations,” in The Handbook of Test Development. The chapter provides information on how to organize a JTA study, how to write tasks, how to analyze the results, and how to use the results to build a test blueprint. The chapter is well-written, and easy to understand. It provides enough detail to make it useful without being too dense. If you are conducting a JTA study, I highly recommend checking out this chapter.

Raymond and Neustel explain that a task statement can refer to a physical or cognitive activity related to the job/practice. The format of a task statement should always follow a subject/verb/object format, though it might be expanded to include qualifiers for how the task should be executed, the resources needed to do the task, or the context of its application. They also underscore that most task statements should have only one action and one object. There are some exceptions to this rule, but if there are multiple actions and objects, they typically should be split into different tasks. As a hint, they suggest critiquing any task statement that has the words “and” or “or” in it.

Here is an example of a task statement from the Michigan Commission on Law Enforcement Standards’ Statewide Job Analysis of the Patrol Officer Position: Task 320: “[The patrol officer can] measure skid marks for calculation of approximate vehicle speed.”

I like this example because it is quite specific, certainly better than just saying “determine vehicle’s speed.” It also provides a qualifier for how precise the measurement needs to be (“approximate”). The statement might be improved by adding more context (e.g., “using a tape measure”), but that may already be understood by their participant population.

Raymond and Neustel also caution researchers to avoid words that might have multiple meanings or vague meanings. For example, the verb “instruct” could mean many different things—the practitioner might be giving some on-the-fly guidance to an individual or teaching a multi-week lecture. Raymond and Neustel underscore the difficult balance of writing task statements at a level of granularity and specificity that is appropriate for accomplishing defined goals in the workplace, but at a high enough level that we do not overwhelm the JTA participants with minutiae. The authors also advise that we avoid writing task statements that describe best practice or that might otherwise yield a biased positive response.

Early in my career, I observed a JTA SME meeting for an entry-level credential in the construction industry. In an attempt to condense the task list, the psychometrician on the project combined a bunch of seemingly related tasks into a single statement—something along the lines of “practitioners have an understanding of the causes of global warming.” This is not a task statement; it is a knowledge statement, and it would be better suited for a blueprint. It is also not very specific. But most important, it yielded a biased response from the JTA survey sample. This vague statement had the words “global warming” in it, which many would agree is a pretty serious issue, so respondents ranked it as of very high importance. The impact was that this task statement heavily influenced the topic weighting of the blueprint, but when it came time to develop the content, there was not much that could be written. Item writers were stuck having to write dozens of items for a vague yet somehow very important topic. They ended up churning out loads of questions about one of the few topics that were relevant to the practice: refrigerants. The end result was a general knowledge assessment with tons of questions about refrigerants. This experience taught me how a lack of specificity and the phrasing of task statements can undermine the entire content validity argument for an assessment’s results.

If you are new to JTA studies, it is worth mentioning that a JTA can sometimes turn into a significant undertaking. I attended one of Mark Raymond’s seminars earlier this year, and he observed anecdotally that his JTA studies have taken anywhere from three months to over a year. Many psychometricians specialize in JTA studies, and it may be helpful to work with them on some aspects of the project, especially when conducting a JTA for the first time. However, even if we use a psychometric consultant to conduct or analyze the JTA, learning about the process can make us better-informed consumers and allow us to handle some of the work internally, potentially saving time and money.


Example of task input screen for a JTA item in Questionmark Authoring.

For more information on JTA and other reporting tools that are available with Questionmark, check out this Reporting & Analytics page.

Question Type Report: Use Cases

Posted by Austin Fossey

A client recently asked me if there is a way to count the number of each type of item in their item bank, so I pointed them toward the Question Type Report in Questionmark Analytics. While this type of frequency data can also be easily pulled using our Results API, it can be useful to have a quick overview of the number of items (split out by item type) in the item bank.

The Question Type Report does not need to be run frequently (and Analytics usage stats reflect that observation), but the data can help indicate the robustness of an item bank.

This report is most valuable when applied to the topics that make up a specific assessment or set of related assessments. While it might be nice to know that we have a total of 15,000 multiple choice (MC) items in the item bank, such counts are trivial unless we have a system-wide practical application—for example, planning a full program translation or selling content to a partner.

This report can provide a quick profile of the population of the item bank or a topic when needed, though more detailed item tracking by status, topic, metatags, item type, and exposure is advisable for anyone managing a large-scale item development project. Below are some potential use cases for this simple report.

Test Development and Maintenance:
The Question Type Report’s value is primarily its ability to count the number of each type of item within a topic. If we know we have 80 MC items in a topic for a new assessment, and they all need to be reviewed by a bias committee, then we can plan accordingly.

Form Building:
If we are equating multiple forms using a common-item design, the report can help us determine how many items go on each form and the degree to which the forms can overlap. Even if we only have one form, knowing the number of items can help a test developer check that enough items are available to match the blueprint.

Item Development:
If the report indicates that there are plenty of MC items ready for future publications, but we only have a handful of essay items to cover our existing assessment form, then we might instruct item writers to focus on developing new essay questions for the next publication of the assessment.
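If you prefer to tabulate these counts yourself rather than run the report, the same frequencies are easy to pull from an item-bank export. Below is a minimal sketch in Python, assuming a hypothetical CSV export with question_type and topic columns; the file name and column names are illustrative, not Questionmark's actual schema.

```python
# Minimal sketch: tally items by type (and by type within topic) from a
# hypothetical item-bank export. Column names are illustrative only.
import pandas as pd

items = pd.read_csv("item_bank_export.csv")  # hypothetical export file

# Overall frequency of each item type (e.g., MC, MR, essay)
type_counts = items["question_type"].value_counts()
print(type_counts)

# Counts of each item type within each topic, similar to what the
# Question Type Report shows for a selected topic
type_by_topic = (
    items.groupby(["topic", "question_type"])
         .size()
         .unstack(fill_value=0)
)
print(type_by_topic)
```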


Example of a Question Type Report showing the frequency distribution by item type.

 

When to Give Partial Credit for Multiple-Response Items

Posted by Austin Fossey

Three different customers recently asked me how to decide between scoring a multiple-response (MR) item dichotomously or polytomously; i.e., when should an MR item be scored right/wrong, and when should we give partial credit? I gave some garrulous, rambling answers, so the challenge today is for me to explain this in a single blog post that I can share the next time it comes up.

In their chapter on multiple-choice and matching exercises in Educational Assessment of Students (5th ed.), Anthony Nitko and Susan Brookhart explain that matching items (which we may extend to include MR item formats, drag-and-drop formats, survey-matrix formats, etc.) are often a collection of single-response multiple choice (MC) items. The advantage of the MR format is that it saves space and lets you leverage dependencies among the questions (e.g., relationships between responses) that might be redundant if broken into separate MC items.

Given that an MR item is often a set of individually scored MC items, a polytomously scored format almost always makes sense. From an interpretation standpoint, there are a couple of advantages for you as a test developer or instructor. First, you can differentiate between participants who know some of the answers and those who know none of the answers, which can improve the item discrimination. Second, you have more flexibility in how you choose to score and interpret the responses. In the drag-and-drop example below (a special form of an MR item), the participant has all of the dates wrong; however, the instructor may still be interested in knowing that the participant knows the correct order of events for the Stamp Act, the Townshend Act, and the Boston Massacre.


Example of a drag-and-drop item in Questionmark where the participant’s responses are wrong, but the order of responses is partially correct.

Are there exceptions? You know there are. This is why it is important to have a test blueprint document, which can help clarify which item formats to use and how they should be evaluated. Consider the following two variations of a learning objective on a hypothetical CPR test blueprint:

  • The participant can recall the actions that must be taken for an unresponsive victim requiring CPR.
  • The participant can recall all three actions that must be taken for an unresponsive victim requiring CPR.

The second example is likely the one that the test developer would use for the test blueprint. Why? Because someone who knows two of the three actions is not going to cut it. This is a rare all-or-nothing scenario where knowing some of the answers is essentially the same (from a qualifications standpoint) as knowing none of the answers. The language in this learning objective (“recall all three actions”) is an indicator to the test developer that if they use an MR item to assess this learning objective, they should score it dichotomously (no partial credit). The example below shows how one might design an item for this hypothetical learning objective with Questionmark’s authoring tools:


Example of a Questionmark authoring screen for an MR item that is scored dichotomously (right/wrong).

To summarize, a test blueprint document is the best way to decide if an MR item (or variant) should be scored dichotomously or polytomously. If you do not have a test blueprint, think critically about what you are trying to measure and the interpretations you want reflected in the item score. Partial-credit scoring is desirable in most use cases, though there are occasional scenarios where an all-or-nothing scoring approach is needed—in which case the item can be scored strictly right/wrong. Finally, do not forget that you can score MR items differently within an assessment. Some MR items can be scored polytomously and others can be scored dichotomously on the same test, though it may be beneficial to notify participants when scoring rules differ for items that use the same format.
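To make the two scoring rules concrete, here is a minimal sketch in Python of polytomous versus dichotomous scoring of a single MR response. The scoring convention (one point per correctly judged option, with every option judged correctly required for the dichotomous case) and the option sets are hypothetical illustrations, not Questionmark's scoring engine.

```python
# Minimal sketch of the two scoring rules discussed above. The data structures
# and point values are hypothetical.

def score_mr(selected, correct, all_options, partial_credit=True):
    """Score a multiple-response item.

    selected:    set of option IDs the participant chose
    correct:     set of option IDs keyed as correct
    all_options: set of all option IDs presented
    """
    # Treat each option as a right/wrong judgment: count correct selections
    # plus correct omissions.
    options_judged_correctly = (
        len(correct & selected) + len((all_options - correct) - selected)
    )

    if partial_credit:
        # Polytomous: one point per correctly judged option
        return options_judged_correctly
    # Dichotomous: full credit only if every option is judged correctly
    return 1 if options_judged_correctly == len(all_options) else 0


options = {"A", "B", "C", "D"}
key = {"A", "C", "D"}      # e.g., three required actions; "B" is a distractor
response = {"A", "C"}      # participant missed one of the correct options

print(score_mr(response, key, options, partial_credit=True))   # 3 (partial credit)
print(score_mr(response, key, options, partial_credit=False))  # 0 (all-or-nothing)
```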

If you are interested in understanding and applying some basic principles of item development and enhancing the quality of your results, download the free white paper written by Austin: Managing Item Development for Large-Scale Assessment

Simpson’s Paradox and the Steelyard Graph

Posted by Austin Fossey

If you work with assessment statistics or just about any branch of social science, you may be familiar with Simpson’s paradox—the idea that data trends between subgroups change or disappear when the subgroups are aggregated. There are hundreds of examples of Simpson’s paradox (and I encourage you to search some on the internet for kicks), but here is a simple example for the sake of illustration.

Simpson’s Paradox Example

Let us say that I am looking to get trained as a certified window washer so that I can wash windows on Boston’s skyscrapers. Two schools in my area offer training, and both had 300 students graduate last year. Graduates from School A had an average certification test score of 70.7%, and graduates from School B had an average score of 69.0%. Ignoring for the moment whether these differences are significant, as a student I will likely choose School A due to its higher average test scores.

But here is where the paradox happens. Consider now that I have a crippling fear of heights, which may be a hindrance for my window-washing aspirations. It turns out that School A and School B also track test scores for their graduates based on whether or not they have a fear of heights. The table below reports the average scores for these phobic subgroups.

[Table: average certification test scores for each school, broken out by fear-of-heights subgroup]

Notice anything? The average scores for students with and without a fear of heights are both higher in School B than for the same groups in School A. The paradox is that School A has a higher overall average test score, yet School B can boast better average test scores for students with a fear of heights and for students without a fear of heights. School B’s overall average is lower because it simply had more students with a fear of heights. If we want to test the significance of these differences, we can do so with ANOVA.
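A small numeric sketch can make the reversal concrete. The subgroup counts and means below are hypothetical values, chosen only so that the weighted overall averages come out to the 70.7% and 69.0% in the example; they are not real school data.

```python
# Minimal sketch reproducing the paradox numerically with hypothetical
# subgroup counts and means.
import pandas as pd

data = pd.DataFrame({
    "school":     ["A", "A", "B", "B"],
    "fear":       ["no", "yes", "no", "yes"],
    "n":          [250, 50, 150, 150],
    "mean_score": [72.0, 64.0, 73.0, 65.0],
})

# Overall (aggregated) average per school: School A comes out on top...
overall = data.groupby("school").apply(
    lambda g: (g["n"] * g["mean_score"]).sum() / g["n"].sum()
)
print(overall)       # A ≈ 70.7, B = 69.0

# ...yet within each subgroup, School B has the higher average.
by_subgroup = data.pivot(index="fear", columns="school", values="mean_score")
print(by_subgroup)   # B > A for both "yes" and "no"
```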

Gaviria and González-Barbera’s Steelyard Graph

Simpson’s paradox occurs in many different fields, but it is sometimes difficult to explain to stakeholders. Tables (like the one above) are often used to illustrate the subgroup differences, but in the Fall 2014 issue of Educational Measurement, José-Luis Gaviria and Coral González-Barbera from the Universidad Complutense de Madrid won the publication’s data visualization contest with their Steelyard Graph, which illustrates Simpson’s paradox with a graph resembling a steelyard balance. The publication’s visual editor, ETS’s Katherine Furgol Castellano, wrote the discussion piece for the Steelyard Graph, praising Gaviria and González-Barbera for the simplicity of the approach and the novel yet astute strategy of representing averages with balanced levers.

The figure below illustrates the same data from the table above using Gaviria and González-Barbera’s Steelyard Graph approach. The size of the squares corresponds to the number of students, the location on the lever indicates the average subgroup score, and the triangular fulcrum represents the school’s overall average score. Notice how clear it is that the subgroups in School B have higher average scores than their counterparts in School A. The example below has only two subgroups, but the same approach can be used for more subgroups.


Example of Gaviria and González-Barbera’s Steelyard Graph to visualize Simpson’s paradox for subgroups’ average test scores.
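For readers who want to experiment with the visualization, here is a rough matplotlib approximation of the steelyard idea, reusing the hypothetical numbers from the earlier sketch: squares sit at the subgroup means and are sized by subgroup counts, and a triangle marks each school's overall average (the fulcrum). It is only a sketch of the concept, not a reproduction of Gaviria and González-Barbera's figure.

```python
# Rough sketch of the steelyard idea: subgroup means as squares (sized by n)
# on a lever, with a triangle at the school's overall mean. Hypothetical data.
import matplotlib.pyplot as plt

schools = {
    # school: (subgroup_means, subgroup_ns, overall_mean, lever_height)
    "A": ([72.0, 64.0], [250, 50], 70.7, 2.0),
    "B": ([73.0, 65.0], [150, 150], 69.0, 1.0),
}

fig, ax = plt.subplots()
for name, (means, ns, overall, y) in schools.items():
    ax.hlines(y, min(means), max(means))                 # the lever
    ax.scatter(means, [y] * len(means), marker="s", s=ns,
               label=f"School {name} subgroups")
    ax.scatter([overall], [y - 0.15], marker="^", s=80)  # the fulcrum (overall mean)
    ax.text(overall, y - 0.35, f"{name}: {overall}", ha="center")

ax.set_xlabel("Average certification test score (%)")
ax.set_yticks([])
ax.legend()
plt.show()
```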

Making a Decision when Faced with Simpson’s Paradox

When one encounters Simpson’s paradox, decision-making can be difficult, especially if there are no theories to explain why the relational pattern is different at a subgroup level. This is why exploratory analysis often must be driven by and interpreted through a lens of theory. One could come up with arbitrary subgroups that reverse the aggregate relationships, even though there is no theoretical grounding for doing so. On the other hand, relevant subgroups may remain unidentified by researchers, though the aggregate relationship may still be sufficient for decision-making.

For example, as a window-washing student seeing the phobic subgroups’ performances, I might decide that School B is the superior school for teaching the trade, regardless of which subgroup a student belongs to. This decision is based on a theory that a fear of heights may impact performance on the certification assessment, in which case School B does a better job at preparing both subgroups for their assessments. If that theory is not tenable, it may be that School A is really the better choice, but as an acrophobic would-be window washer, I will likely choose School B after seeing this graph . . . as long as the classroom is located on the ground floor.

When to weight items differently in CTT

Posted by Austin Fossey

In my last post, I explained the statistical futility and interpretive quagmires that result from using negative item scores in Classical Test Theory (CTT) frameworks. In this post, I wanted to address another question I get from a lot of customers: when can we make one item worth more points?

This question has come up in a couple of cases. One customer wanted to make “hard” items on the assessment worth more points (with difficulty being determined by subject-matter experts). Another customer wanted to make certain item types worth more points across the whole assessment. In both cases, I suggested they weight all of the items equally.


Before I reveal the rationale behind the recommendation, please permit me a moment of finger-wagging. The impetus behind these questions was that these test developers felt that some items were somehow better indicators of the construct, so certain items seemed like more important pieces of evidence than others. If we frame the conversation as a question of relative importance, then we recognize that the test blueprint document should contain all of the information about the importance of domain content, as well as how the assessment should be structured to reflect those evaluations. If the blueprint cannot answer these questions, then it may need to be modified. Okay, wagging finger back in its holster.

In general, weights should be applied at a subscore level that corresponds to the content or process areas on the blueprint. A straightforward way to achieve this structure is to make the number of items in each area proportional to its weight. For example, if Topic A is supposed to be 60% of the assessment score and Topic B is supposed to be 40% of the assessment score, it might be best to ask 60 questions about Topic A and 40 questions about Topic B, all scored dichotomously [0,1].

There are times when this is not possible. Certain item formats may be scored differently or be too complex to deliver in bulk. For example, if Topic B is best assessed with long-format essay items, it might be necessary to have 60 selected response items in Topic A and four essays in Topic B—each worth ten points and scored on a rubric.

Example of a simple blueprint where items are worth more points due to their topic’s relative importance (weight)

The critical point is that the content areas (e.g., Topics) are driving the weighting, and all items within the content area are weighted the same. Thus, an item is not worth more because it is hard or because it is a certain format; it is worth more because it is in a topic that has fewer items, and all items within the topic are weighted more because of the topic’s relative importance on the test blueprint.
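As a quick sanity check, the blueprint arithmetic from the example above can be verified in a few lines of Python. The form structure below is hypothetical and simply mirrors the 60-item/4-essay example.

```python
# Minimal sketch: confirm that topic point totals match the blueprint weights
# when per-item points differ by topic. The form structure is hypothetical.
form = [
    # (topic, item_format, max_points)
    *([("Topic A", "MC", 1)] * 60),     # 60 selected-response items, 1 point each
    *([("Topic B", "Essay", 10)] * 4),  # 4 essays, 10 points each, rubric scored
]

totals = {}
for topic, _, points in form:
    totals[topic] = totals.get(topic, 0) + points

grand_total = sum(totals.values())
for topic, points in totals.items():
    print(f"{topic}: {points} points ({points / grand_total:.0%} of the total score)")
# Topic A: 60 points (60% of the total score)
# Topic B: 40 points (40% of the total score)
```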

One final word of caution. If you do choose to weight certain dichotomous items differently, regardless of your rationale, remember that it may bias the item-total correlation discrimination. In these cases, it is best to use the item-rest correlation discrimination statistic, which is provided in Questionmark’s Item Analysis Report.

Interested in learning more about classical test theory and applying item analysis concepts? Join Psychometrician Austin Fossey for a free 75 minute online workshop — Item Analysis: Concepts and Practice — Tuesday, June 23, 2015 *space is limited

Item Development – Psychometric review

Posted by Austin Fossey

The final step in item development is the psychometric review. You have drafted the items, edited them, sent them past your review committee, and tried them out with your participants. The psychometric review will use item statistics to flag any items that may need to be removed before you build your assessment forms for production. It is common to look at statistics relating to difficulty, discrimination, and bias.

As with other steps in the item development process, you should assemble an independent, representative, qualified group of subject matter experts (SMEs) to look at flagged items. If you are short on time, you may only want to have them review the items with statistical flags. Their job is to figure out what is wrong with items that return poor statistics.

Difficulty – Items that are too hard or too easy are often not desirable on criterion-referenced assessments because they do not discriminate well. However, they may be desirable for norm-referenced assessments or aptitude tests where you want to accurately measure a wide spectrum of ability.

If using classical test theory, you will flag items based on their p-value (difficulty index). Remember that lower values are harder items, and higher values are easier items. I typically flag anything over 0.90 or 0.95 and anything under 0.25 or 0.20, but others have different preferences.
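As a simple illustration, the difficulty flagging described above can be scripted against a 0/1 scored response matrix. The file name is hypothetical, the layout (participants in rows, items in columns) is an assumption, and the thresholds use the wider 0.90/0.25 cutoffs mentioned above.

```python
# Minimal sketch: classical difficulty (p-values) from a hypothetical 0/1
# scored response matrix, flagging items outside the chosen thresholds.
import pandas as pd

scores = pd.read_csv("scored_responses.csv")   # rows = participants, columns = items

p_values = scores.mean()                        # proportion correct per item
flags = p_values[(p_values > 0.90) | (p_values < 0.25)]
print(flags.sort_values())                      # items to send to the SME review
```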

If an item is flagged for its difficulty, there are several things to look for. If an item is too hard, it may be that the content has not yet been taught to your population, or that the content is obscure. This may not be justification for removing the item from the assessment if it aligns well with your blueprint. However, it could also be that the item is confusing, mis-keyed, or has overlapping options, in which case you should consider removing it from the assessment before you go to production.

If an item is too easy, it may be that the population of participants has mastered this content, though it may still be relevant to the blueprint. You will need to make the decision about whether or not that item should remain. However, there could be other reasons an item is too easy, such as item cluing, poor distractors, identifiable key patterns, or compromised content. Again, in these scenarios you should consider removing the item before using it on a live form.

Discrimination – If an item does not discriminate well, it means that it does not help differentiate between high- and low-performing participants. These items do not add much to the information available in the assessment, and if they have negative discrimination values, they may actually be adding construct-irrelevant variance to your total scores.

If using classical test theory, you will flag your items based on their item-total discrimination (Pearson product-moment correlation) or their item-rest correlation (item-remainder correlation). The latter is most useful for short assessments (25 items or fewer), small sample sizes, or assessments with items weighted differently. I typically flag items with discrimination values below 0.20 or 0.15, but again, other test developers will have their own preferences.
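Here is a companion sketch for the discrimination flags, computed from the same hypothetical 0/1 response matrix: the item-total correlation uses the full total score, while the item-rest correlation removes the item in question from the total before correlating.

```python
# Minimal sketch: item-total and item-rest (item-remainder) discrimination for
# a hypothetical 0/1 scored response matrix, flagging values below 0.20.
import pandas as pd

scores = pd.read_csv("scored_responses.csv")    # rows = participants, columns = items
total = scores.sum(axis=1)

item_total = scores.apply(lambda item: item.corr(total))
item_rest = scores.apply(lambda item: item.corr(total - item))  # exclude the item itself

flagged = item_rest[item_rest < 0.20]
print(flagged.sort_values())                    # candidates for SME review
```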

If an item is flagged for discrimination, it may have some of the same issues that cause problems with item difficulty, such as a mis-keyed response or overlapping options. Easy items and difficult items will also tend to have lower discrimination values due to the lack of variance in the item scores. There may be other issues impacting discrimination, such as when high-performing participants overthink an item and end up getting it wrong more often than lower-performing participants.

Statistical Bias – In earlier posts, we talked about using differential item functioning (DIF) to identify statistical bias in items. Recall that this can only be done with item response theory (IRT) models, so you cannot use classical test theory statistics to determine statistical bias. Logistic regression can be used to identify both uniform and non-uniform DIF. DIF software will typically classify DIF effect sizes as A, B, or C. If possible, review any item flagged with DIF, but if there are too many items or you are short on time, you may want to focus on the items that fall into categories B or C.
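For readers curious what a logistic-regression DIF screen looks like in practice, here is a minimal sketch of a standard setup: an ability proxy (here, total score), a group main effect for uniform DIF, and a group-by-ability interaction for non-uniform DIF. The data file and column names are hypothetical, and dedicated DIF software applies additional effect-size rules (such as the A/B/C classifications) on top of these comparisons.

```python
# Minimal sketch of logistic-regression DIF screening for a single item,
# assuming hypothetical columns: item_score (0/1), total_score, group.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("dif_data.csv")

base    = smf.logit("item_score ~ total_score", data=df).fit(disp=0)
uniform = smf.logit("item_score ~ total_score + group", data=df).fit(disp=0)
nonunif = smf.logit("item_score ~ total_score * group", data=df).fit(disp=0)

# Likelihood-ratio-style comparisons: a large improvement from adding the group
# term suggests uniform DIF; from adding the interaction, non-uniform DIF.
print("Uniform DIF evidence:    ", 2 * (uniform.llf - base.llf))
print("Non-uniform DIF evidence:", 2 * (nonunif.llf - uniform.llf))
```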

DIF can occur from bias in the content or response process, which are the same issues your bias review committee was looking for. Sometimes DIF statistics help uncover content bias or response process bias that your bias review committee missed; however, you may have an item flagged for DIF, but no one can explain why it is performing differently between demographic groups. If you have a surplus of items, you may still want to discard these flagged items just to be safe, even if you are not sure why they are exhibiting bias.

Remember, not all items flagged in the psychometric review need to be removed. This is why you have your SMEs there. They will help determine whether there is a justification to keep an item on the assessment even though it may have poor item statistics. Nevertheless, expect to cull a lot of your flagged items before building your production forms.


Example of an item flagged for difficulty (p = 0.159) and discrimination (item-total correlation = 0.088). Answer option information table shows that this item was likely mis-keyed.