How should we measure an organization’s level of psychometric expertise?


Posted by Greg Pope

A colleague recently asked for my opinion on an organization’s level of knowledge, experience, and sophistication in applying psychometrics to its assessment program. I found it surprisingly difficult to summarize in words, which got me thinking about why. I concluded that there currently is no common language for describing how advanced an organization is in terms of the psychometric expertise it has and the rigour it applies to its assessment program. Maybe if there were such a common vocabulary, conversations like the one I had would be a whole lot easier.

I thought it might be fun (and perhaps helpful) to come up with a proposed first cut of a shared vocabulary around levels of psychometric expertise. I wanted to keep it simple, yet effective in allowing people to quickly and easily communicate where an organization falls in terms of its level of psychometric sophistication. It seemed to make sense to break this out by areas (I thought of seven) and assign points according to the expertise and rigour an organization possesses and applies in each. Not all areas are always led by psychometricians directly, but psychometricians usually play a role.

Areas
1.    Item and test level psychometric analysis

  • Classical Test Theory (CTT) and/or Item Response Theory (IRT) (see the CTT sketch after this list)
  • Pre hoc analysis (beta testing analysis)
  • Ad hoc analysis (actual assessment)
  • Post hoc analysis (regular reviews over time)

2.    Psychometric analysis of bias and dimensionality

  • Factor analysis or principal component analysis to evaluate dimensionality
  • Differential Item Functioning (DIF) analysis to ensure that items are performing similarly across groups (e.g., gender, race, age, etc.)

3.    Form assembly processes

  • Blueprinting
  • Expert review of forms or item banks
  • Fixed forms, computerized adaptive testing (CAT), automated test assembly

4.    Equivalence of scores and performance standards

  • Standard setting
  • Test equating
  • Scaling scores

5.    Test security

  • Test security plan in place
  • Regular security audits are conducted
  • Statistical analyses are conducted regularly (e.g., collusion and plagiarism detection analysis)

6.    Validity studies

  • Validity studies conducted on new assessment programs and ongoing programs
  • Industry experts review and provide input on study design and findings
  • Improvements are made to the program if required as a result of studies

7.    Reporting

  • Provide information clearly and meaningfully to all stakeholders (e.g., students, parents, instructors, etc.)
  • High quality supporting documentation designed for non-experts (interpretation guides)
  • Frequently reviewed by assessment industry experts and improved as required
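
To make the first area a little more concrete, here is a minimal Python sketch (my own illustration, not a description of any particular product) of two classical item statistics: item difficulty (the proportion of test takers answering correctly) and a simple discrimination index (the corrected item-total correlation). The response matrix is invented for the example.

```python
# A minimal sketch of classical (CTT) item analysis, using an invented
# response matrix: rows are test takers, columns are items (1 = correct).
import numpy as np

responses = np.array([
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [1, 1, 1, 1],
    [0, 0, 0, 1],
    [1, 1, 0, 0],
    [0, 1, 1, 1],
])

total_scores = responses.sum(axis=1)

for item in range(responses.shape[1]):
    # Item difficulty: the proportion of test takers answering correctly
    p_value = responses[:, item].mean()
    # Corrected item-total correlation: correlate the item with the total
    # score excluding that item (a simple discrimination index)
    rest_score = total_scores - responses[:, item]
    discrimination = np.corrcoef(responses[:, item], rest_score)[0, 1]
    print(f"Item {item + 1}: difficulty = {p_value:.2f}, "
          f"discrimination = {discrimination:.2f}")
```

In practice a full item analysis would also look at distractor performance, sample sizes, and confidence intervals, but these two statistics are the usual starting point for pre hoc and post hoc reviews.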

Expertise/rigour points
0.    None: Not rigorous, no expertise whatsoever within the organization
1.    Some: Some rigour, marginal expertise within the organization
2.    Full: Highly rigorous, extensive expertise within the organization

So an organization that has decades of expertise in each area would be at the top level of 14 (7 areas x 2 for expertise/rigour in each area = 14). An elementary school doing simple formative assessment would probably be at the lowest level (7 areas x 0 expertise/rigour = 0). I have provided some examples of how organizations might fall into various ranges in the illustration below.
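
To make the arithmetic concrete, here is a minimal Python sketch of the proposed scoring, where each of the seven areas is rated 0 (none), 1 (some), or 2 (full); the example ratings are invented purely for illustration.

```python
# A minimal sketch of totalling the proposed 0/1/2 ratings across the seven areas.
AREAS = [
    "Item and test level psychometric analysis",
    "Psychometric analysis of bias and dimensionality",
    "Form assembly processes",
    "Equivalence of scores and performance standards",
    "Test security",
    "Validity studies",
    "Reporting",
]

def total_score(ratings):
    """Sum one 0/1/2 rating per area; the maximum possible is 14."""
    if len(ratings) != len(AREAS):
        raise ValueError("Expected exactly one rating per area")
    if any(r not in (0, 1, 2) for r in ratings):
        raise ValueError("Each rating must be 0 (none), 1 (some), or 2 (full)")
    return sum(ratings)

# Invented example: strong on analysis, security, and reporting; weaker elsewhere
example_ratings = [2, 1, 2, 1, 2, 1, 2]
print(f"{total_score(example_ratings)} out of {2 * len(AREAS)}")  # 11 out of 14
```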

There are obviously lots of caveats and considerations here. One thing to keep in mind is that not all organizations need full expertise in all areas. For example, an elementary school that administers formative tests to facilitate learning doesn’t need 20 psychometricians on staff conducting DIF analysis and equipercentile test equating; such an organization sitting low on the scale is expected. Another consideration is expense: achieving the highest level requires a major investment (and maintaining an army of psychometricians isn’t cheap!). One would therefore expect an organization conducting high-stakes testing, where people’s lives or futures depend on assessment scores, to be at the highest level. It’s also important to remember that some areas are more basic than others and serve as a starting place. For example, it would be pretty rare for an organization to have a great deal of expertise in the psychometric analysis of bias and dimensionality but no expertise in item and test analysis.

I would love to get feedback on this idea and start a dialogue. Does this seem roughly on target? Would it be useful? Is there something similar out there that is better that I don’t know about? Or am I just plain out to lunch? Please feel free to comment to me directly or on this blog.

On a related note, Questionmark CEO Eric Shepherd has given considerable thought to the concept of an “Assessment Maturity Model,” which focuses on a broader assessment context. Interested readers should check out: http://www.assessmentmaturitymodel.org/

Understanding Assessment Validity: An Introduction


Posted by Greg Pope

In previous posts I discussed some of the theory and applications of classical test theory and test score reliability. For my next series of posts, I’d like to explore the exciting realm of validity. I will discuss some of the traditional thinking in the area of validity as well as some new ideas, and I’ll share applied examples of how your organization could undertake validity studies.

According to the “standards bible” of educational and psychological testing, the Standards for Educational and Psychological Testing (AERA, APA, & NCME, 1999), validity is defined as “The degree to which evidence and theory support the interpretations of test scores entailed by proposed uses of tests.”

The traditional thinking around validity, familiar to most people, is that there are three main types: criterion-related validity, content-related validity, and construct-related validity.

The most recent thinking on validity takes a more unifying approach, which I will go into in more detail in upcoming posts.

Now here is something you may have heard before: “In order for an assessment to be valid it must be reliable.” What does this mean? Well, as we learned in previous Questionmark blog posts, test score reliability refers to how consistently an assessment measures the same thing. One of the criteria for making the statement, “Yes, this assessment is valid,” is that the assessment must have acceptable test score reliability, such as high Cronbach’s Alpha test reliability index values like those found in the Questionmark Test Analysis Report and Results Management System (RMS). Other criteria for making that statement are to show evidence for criterion-related validity, content-related validity, and construct-related validity.
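
To show what that index captures, here is a minimal Python sketch of the standard Cronbach’s Alpha formula applied to an invented, dichotomously scored response matrix; it is only an illustration of the calculation, not a reproduction of the Questionmark reports mentioned above.

```python
# A minimal sketch of Cronbach's Alpha: rows are test takers, columns are items.
# The data below are invented for illustration only.
import numpy as np

responses = np.array([
    [1, 1, 0, 1, 1],
    [1, 0, 0, 1, 0],
    [1, 1, 1, 1, 1],
    [0, 0, 0, 1, 0],
    [1, 1, 0, 0, 1],
    [0, 1, 1, 1, 1],
])

k = responses.shape[1]                              # number of items
item_variances = responses.var(axis=0, ddof=1)      # variance of each item
total_variance = responses.sum(axis=1).var(ddof=1)  # variance of total scores

# Alpha rises as items covary (measure the same thing) relative to total variance
alpha = (k / (k - 1)) * (1 - item_variances.sum() / total_variance)
print(f"Cronbach's Alpha = {alpha:.2f}")
```

With real data one would, of course, use a much larger sample and interpret the resulting value against the stakes of the assessment.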

In my next posts on this topic I will provide some illustrative examples of how organizations may undertake investigating each of these traditionally defined types of validity for their assessment program.