Meet Questionmark’s New Product Owner for Analytics

Posted by Jim Farrell

On Monday, Austin Fossey takes over as Questionmark’s Product Owner of Reporting and Analytics – the person in charge of our reporting tools. Austin will be working with our customers and product development teams to make sure our reporting and analytics tools solve real business problems.


Austin Fossey

The other day I asked Austin some questions to help you see what a tremendous addition he will be to the Questionmark team.

What has been your career track so far?

Most recently, I have been developing assessment and value-added reporting systems at the American Institutes for Research. Before that, I spent three years doing test development and psychometric work for a certification company in the construction industry. I also spent a year at the Independent Evaluation Group at the World Bank, which does program evaluation.

What are some of your career highlights?

I was very excited to be a member on a team that developed data training for K-12 educators. This training built upon previous programs I had implemented, and it really took data analysis to a higher level. The training focused on using data to drive instructional decisions at the class, school, and district level. What I liked about this project was that we did not just cover what their assessment data meant. Instead, we thought critically about the valid uses and limitations of the data, and we worked through real-life problems where data could be used to inform instruction.

Another highlight was the work I did developing a credential based on a portfolio assessment. This was particularly challenging because the portfolio products took many different forms, even though they all reflected the same domain. I got to work with some very creative psychometricians and subject matter experts to research and implement an assessment process that was standardized enough to be defensible, yet flexible enough to accommodate a wide range of candidate profiles. The project was especially fun for me because of the unique measurement challenges inherent in a portfolio assessment.

What attracted you to the position at Questionmark?

Questionmark stood out for me because they are a company that takes reporting and analytics seriously. Some companies seem to treat assessment reporting as an afterthought–a byproduct of the assessment. But in many ways, the reports are the manifestation of the goals of assessment. We collect data so that we can make valid, informed decisions, and this can only be done with efficient, comprehensive reporting strategies. When I saw that Questionmark had a position dedicated solely to reporting and analytics, I knew that this was a company I wanted to work for.

Psychometrics: that’s quite a word! Can you describe what it means in terms that all of us will understand?

It is quite a word–one that even spell checkers can’t handle. The most concise and comprehensive description I have read was from Mislevy et al. in the Journal of Educational Data Mining, 4(1): psychometrics measures educational and psychological constructs with a probabilistic model. In short, psychometricians must take observable evidence (e.g., answers on a test) and make a probabilistic inference about something unobservable (e.g., the test taker’s true ability).
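To make that inference concrete, here is a minimal sketch in Python (my own illustration, not any Questionmark tool) of one common probabilistic model, the one-parameter (Rasch) model: the chance of a correct answer depends on the gap between the test taker's ability and the item's difficulty.

```python
import math

def rasch_probability(ability, difficulty):
    """One-parameter (Rasch) model: the probability of a correct answer
    as a function of the gap between ability and item difficulty."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))

# A test taker whose ability exactly matches the item's difficulty
# has a 50% chance of answering correctly.
print(rasch_probability(0.0, 0.0))  # 0.5

# A stronger test taker facing the same item is more likely to succeed.
print(rasch_probability(1.5, 0.0))  # about 0.82
```

Fitting such a model runs the inference in reverse: from the observed pattern of right and wrong answers, estimate the unobservable ability and difficulty parameters.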

What do you hope to achieve as Product Owner of Analytics?

I am going to continue to build on Questionmark’s reports so that they best suit the needs of our clients and the changing practices in the assessment industry. I plan on using my background in psychometrics and reporting to make sure that our reports are designed and organized in a way that lets users quickly and intuitively leverage their data to make good decisions about their tests and testing programs. I am very fortunate because Questionmark already has a great reporting and analytics structure, and I am excited to continue that work.

How has the field of measurement and evaluation changed in the last 10 years?

In my opinion, measurement and evaluation has changed substantially over the past ten years as a result of improved technology. Fantastic work is being done to improve the accuracy, validity, and reliability of assessment. It is great to see organizations researching new item types, methods for shorter tests, and models that provide diagnostic feedback to test takers. Unfortunately, with such easy access to testing software, there are the occasional cases of “testing for testing’s sake,” where tests are administered without consideration for design or how the results will be used.

Overall, I am optimistic about the course of measurement. There is increased accountability, more support for methods like Evidence Centered Design, and better training and tools available for professional test developers.

How do you see psychometrics contributing to the future of learning?

Psychometrics is contributing to the future of learning by helping educators and students make sense of patterns in the data. While statistical models will never be a replacement for the expertise and instincts of a trained educator, assessments can be a handy tool for understanding students’ strengths, weaknesses, and work strategies. Tests no longer have to be about classifying who passed and who failed. They can now help provide diagnostic feedback so that educators and students can interpret their performance and adjust their learning accordingly. In this way, I think psychometrics has a large role to play in bridging the gap between grades and learning.

What are some of your current research interests?

Right now, I am interested in evidence models for task assessments like games and simulations. Technology lets us create some stunning virtual environments, but the research around how to score these assessments is still in an inchoate stage. While expert ratings and rubrics remain the standard for scoring these assessments, I am researching how we can model difficulty and discrimination in a complex task environment, especially as they relate to the student’s observed process for solving the task.

Tell me some things that interest you outside of work.

I love being around good friends, good food, and good music. I love to travel and go camping with my family, but most weekends you will find us puttering around our plot in our community garden or riding our bikes around the city. I also like working with my hands a lot–I help plant trees with a local nonprofit, and I try to fix stuff around the house (with mixed results).

If you’re attending the Questionmark Users Conference in Baltimore next week, please introduce yourself to Austin!

How should we measure an organization’s level of psychometric expertise?


Posted by Greg Pope

A colleague recently asked for my opinion on an organization’s level of knowledge, experience, and sophistication applying psychometrics to their assessment program. I came to realize that it was difficult to summarize in words, which got me thinking why. I concluded that it was because there currently is not a common language to describe how advanced an organization is regarding the psychometric expertise they have and the rigour they apply to their assessment program. I thought maybe if there were such a common vocabulary, it would make conversations like the one I had a whole lot easier.

I thought it might be fun (and perhaps helpful) to come up with a proposed first cut of a shared vocabulary around the levels of psychometric expertise. I wanted to keep it simple, yet effective in allowing people to quickly and easily communicate about where an organization would fall in terms of their level of psychometric sophistication. I thought it might make sense to break it out by areas (I thought of seven) and assign points according to the expertise/rigour an organization contains/applies. Not all areas are always led by psychometricians directly, but usually psychometricians play a role.

1.    Item and test level psychometric analysis

  • Classical Test Theory (CTT) and/or Item Response Theory (IRT)
  • Pre hoc analysis (beta testing analysis)
  • Ad hoc analysis (actual assessment)
  • Post hoc analysis (regular reviews over time)

2.    Psychometric analysis of bias and dimensionality

  • Factor analysis or principal component analysis to evaluate dimensionality
  • Differential Item Functioning (DIF) analysis to ensure that items are performing similarly across groups (e.g., gender, race, age, etc.)

3.    Form assembly processes

  • Blueprinting
  • Expert review of forms or item banks
  • Fixed forms, computerized adaptive testing (CAT), automated test assembly

4.    Equivalence of scores and performance standards

  • Standard setting
  • Test equating
  • Scaling scores

5.    Test security

  • Test security plan in place
  • Regular security audits are conducted
  • Statistical analyses are conducted regularly (e.g., collusion and plagiarism detection analysis)

6.    Validity studies

  • Validity studies conducted on new assessment programs and ongoing programs
  • Industry experts review and provide input on study design and findings
  • Improvements are made to the program if required as a result of studies

7.    Reporting

  • Provide information clearly and meaningfully to all stakeholders (e.g., students, parents, instructors, etc.)
  • High quality supporting documentation designed for non-experts (interpretation guides)
  • Frequently reviewed by assessment industry experts and improved as required

Expertise/rigour points
0.    None: Not rigorous, no expertise whatsoever within the organization
1.    Some: Some rigour, marginal expertise within the organization
2.    Full: Highly rigorous, organization has a large amount of experience

So an organization that has decades of expertise in each area would be at the top level of 14 (7 areas x 2 for expertise/rigour in each area = 14). An elementary school doing simple formative assessment would probably be at the lowest level (7 areas x 0 expertise/rigour = 0). I have provided some examples of how organizations might fall into various ranges in the illustration below.
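The arithmetic can be sketched in a few lines of Python; the area names are abbreviated here and the scores assigned to this imaginary organization are purely hypothetical:

```python
# Hypothetical self-assessment against the seven areas above, scored
# 0 (none), 1 (some), or 2 (full) for expertise/rigour in each area.
areas = {
    "item and test level analysis": 2,
    "bias and dimensionality": 1,
    "form assembly": 2,
    "equivalence of scores and standards": 1,
    "test security": 2,
    "validity studies": 1,
    "reporting": 2,
}

assert all(score in (0, 1, 2) for score in areas.values())
total = sum(areas.values())
print(f"Psychometric expertise level: {total} / {len(areas) * 2}")  # 11 / 14
```

An organization could use a tally like this to see at a glance which areas pull its level down.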

There are obviously lots of caveats and considerations here. One thing to keep in mind is that not all organizations need to have full expertise in all areas. For example, an elementary school that administers formative tests to facilitate learning doesn’t need to have 20 psychometricians working for them doing DIF analysis and equipercentile test equating. Their organization being low on the scale is expected. Another consideration is expense: To achieve the highest level requires a major investment (and maintaining an army of psychometricians isn’t cheap!). Therefore, one would expect an organization that is conducting high stakes testing where people’s lives or futures are at stake based on assessment scores to be at the highest level. It’s also important to remember that some areas are more basic than others and are a starting place. For example, it would be pretty rare for an organization to have a great deal of expertise in the psychometric analysis of bias and dimensionality but no expertise in item and test analysis.

I would love to get feedback on this idea and start a dialog. Does this seem roughly on target? Would it be useful? Is something similar out there that is better that I don’t know about? Or am I just plain out to lunch? Please feel free to comment to me directly or on this blog.

On a related note, Questionmark CEO Eric Shepherd has given considerable thought to the concept of an “Assessment Maturity Model,” which focuses on a broader assessment context. Interested readers should check out:

When and where should I use randomly delivered assessments?


Posted by Greg Pope

I am often asked my psychometric opinion regarding when and where random administration of assessments is most appropriate.

To refresh memories, this is a feature in Questionmark Perception Authoring Manager that allows you to select questions at random from one or more topics when creating an assessment. Rather than administering the same 10 questions to all participants, you can give each participant a different set of questions that are pulled at random from the bank of questions in the repository.
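Conceptually, the random selection works something like the sketch below (a simplified illustration in Python, not the product’s actual mechanism; the topic names and question IDs are made up):

```python
import random

# Hypothetical repository: question IDs keyed by topic.
bank = {
    "algebra":  [f"ALG-{i:03d}" for i in range(1, 41)],
    "geometry": [f"GEO-{i:03d}" for i in range(1, 61)],
}

def build_random_form(bank, per_topic):
    """Draw a fixed number of questions at random from each topic,
    so each participant sees a different form."""
    form = []
    for topic, count in per_topic.items():
        form.extend(random.sample(bank[topic], count))
    return form

# A 10-question form: 6 algebra questions plus 4 geometry questions,
# drawn without replacement; different on every run.
form = build_random_form(bank, {"algebra": 6, "geometry": 4})
print(form)
```

Each participant’s form is the same length and covers the same topics, but the specific questions differ, which is exactly where the equivalence problems discussed below come from.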

So when is it appropriate to use random administration? I think that depends on the answer to this question: What are the assessment’s stakes and purpose? If the stakes are low and the assessment scores are used to help reinforce information learned, or to give participants a rough idea of how they are doing in an area, I would say that using random administration is defensible. However, if the stakes are medium/high and the assessment scores are used for advancing or certifying participants, I usually caution against random administration. Here are a few reasons why:

  • Expert review of the assessment form(s) cannot be conducted in advance, because each participant gets a unique form.
      • Generally SMEs, psychometricians, and other experts thoroughly review a test form before it goes into live production, to ensure that the form meets difficulty, content, and other criteria before being administered in a medium/high stakes context. With randomly administered assessments this advance review is not possible, as every participant receives a different set of questions.
  • Question statistics calculated using Classical Test Theory (CTT) become unstable.
      • Smaller numbers of participants answer each individual question. Rather than all 200 participants answering all 50 questions on a fixed form, a randomly administered test generated from a bank of 100 questions may have only a few participants answering each question.
      • As we saw in a previous blog post, sample size has an effect on the robustness of item statistics. With fewer participants taking each question, it becomes difficult to have confidence in the stability of the statistics generated.
  • Equivalency of assessment scores is difficult to achieve and prove.
      • An important assumption of CTT is equivalence of forms, or parallel forms. In assessment contexts where more than one form of an exam is administered to participants, a great deal of time is spent ensuring that the forms are parallel in every way possible (e.g., difficulty of questions, blueprint coverage, question types) so that the scores participants obtain are equivalent.
      • With random administration it is not possible to control and verify in advance of an assessment session that the forms are parallel, because the questions are pulled at random. This creates a problem for the equivalence of participant scores: if one participant got 2/10 on a randomly administered assessment and another got 8/10 on the same assessment, it would be difficult to know whether the first participant scored low because they happened, by chance, to draw harder questions, or because they actually did not know the material.
      • Using meta tags one can mitigate this issue to some degree (e.g., by randomly administering questions within topics by difficulty range and other meta tag data), but this does not completely guarantee randomly equivalent forms.
  • Test reliability statistics calculated using CTT run into trouble.
      • Statistics such as Cronbach’s Alpha have trouble with randomly administered assessments. Random administration produces a lot of missing data (not all participants answer all questions), which psychometric statistics rarely handle well.
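The reliability issue is easy to see in the formula for Cronbach’s Alpha, sketched below on toy data (my own illustration, not Questionmark output). The statistic needs item variances and a total-score variance computed across the same respondents, which a sparse randomly administered response matrix cannot supply without discarding or imputing data:

```python
def cronbachs_alpha(scores):
    """Cronbach's Alpha for a complete respondents-by-items score matrix:
    alpha = k/(k-1) * (1 - sum(item variances) / variance(total scores))
    """
    k = len(scores[0])

    def variance(values):
        mean = sum(values) / len(values)
        return sum((v - mean) ** 2 for v in values) / (len(values) - 1)

    item_vars = [variance([row[i] for row in scores]) for i in range(k)]
    total_var = variance([sum(row) for row in scores])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)

# Complete data: every participant answered every question (1 = correct).
complete = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [0, 1, 0, 0],
    [0, 0, 0, 0],
]
print(round(cronbachs_alpha(complete), 3))  # 0.8

# With random administration, rows contain holes (None): the item and
# total variances above are no longer defined, so the statistic breaks down.
sparse = [[1, None, 1, None], [None, 1, None, 0]]
```

Workarounds exist (listwise deletion, imputation), but they either shrink the sample further or introduce assumptions of their own.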

There are alternatives to random administration, depending on what the needs are. For example, if random administration is being considered as a way to curb cheating, options such as shuffling answer choices and randomizing question presentation order could serve this need, making it very difficult for participants to copy answers from one another.

It is important for an organization to look at their context to determine what is best for them. Questionmark provides many options for our customers when it comes to assessment solutions and invites them to work with us in adopting workable solutions.

Item Analysis Analytics Part 8: Some problematic questions


Posted by Greg Pope

In my last post, I showed a few more examples of item analyses where we drilled down into why some questions had problems. I thought it might be useful to show a few examples of questions with bad and downright terrible psychometric performance: the ugly side of item analysis.

Below is an example of a question that is fairly terrible in terms of psychometric performance. Here are some reasons why:

  • Going from left to right, first we see that the “Number of Results” is 65, which is not so good: there are too few participants in the sample to be able to make sound judgements about the psychometric performance of the question
  • Next we see that 25 participants didn’t answer the question (“Number not Answered” = 25), which means there was a problem with people not finishing or finding the question confusing and giving up.
  • The “P Value Proportion Correct” shows us that this question is hard, with only 20% of participants ‘getting it right.’
  • The “Item Discrimination” indicates very low discrimination, with the difference between the Upper and Lower group in terms of the proportion selecting the correct answer of ‘More than 40’ at only 5%. This means that of the participants with high overall exam scores, 27% selected the correct answer versus 22% of the participants with the lowest overall exam scores. This is a very small difference between the Upper and Lower groups. Participants who know the material should have got the question right more often.
  • The “Item Total Correlation” reflects the Item Discrimination with a negative value of -0.01. A value like this would definitely not meet most organizations’ internal criteria in terms of what is considered an acceptable item. Negative item-total correlations are a major red flag!
  • Finally we look at the Outcome information to see how the distracters perform. We find that participants are all over the map, selecting distracters in an erratic way. When I look at the question wording I realize how vague and arbitrary this question is: the number of questions that should be in an assessment depends on numerous factors and contexts. It is impossible to say that in any context a certain number of questions is required. It looks like the Upper Group is selecting the ’21-40’ and ‘More than 40’ response options more than the other two options, which offer smaller numbers of questions. This makes sense from a participant guessing perspective, because in many assessment contexts having more questions rather than fewer is better for reliability.

The psychometricians, SMEs, and test developers reviewing this question would need to send the SME who wrote this question back to basic authoring training to ensure that they know how to write questions that are clear and concise. This question does not really have a correct answer and needs to be re-written to clarify the context and provide many more details to the participants. I would even be tempted to throw out questions along this content line, because how long an assessment should be has no one “right answer.” How long an assessment should be depends on so many things that there will always be room for ambiguity, so it would be quite challenging to write a question that performs well statistically on this topic.


Below is an example of a question that is downright awful in terms of psychometric performance. Here are some reasons why:

  • Going from left to right, first we see that the “Number of Results” is 268, which is really good. That is a nice healthy sample. Nothing wrong here, let’s move on.
  • Next we see that 56 participants didn’t answer the question (“Number not Answered” = 56), which means there was a problem with people not finishing or finding the question confusing and giving up. It gets worse, much, much worse.
  • The “P Value Proportion Correct” shows us that this question is really hard, with 16% of participants ‘getting it right.’
  • The “Item Discrimination” indicates a negative discrimination, with the difference between the Upper and Lower group in terms of the proportion selecting the correct answer of ‘44123’ at -23%. This means that of the participants with high overall exam scores, 12% selected the correct answer versus 35% of the participants with the lowest overall exam scores. What the heck is going on? In other words, participants with the highest overall assessment scores are selecting the correct answer LESS OFTEN than participants with the lowest overall assessment scores. That is not good at all; let’s dig deeper.
  • The “Item Total Correlation” reflects the Item Discrimination with a large negative value of -0.26. This is a clear indication that there is something incredibly wrong with this question.
  • Finally we look at the Outcome information to see how the distracters perform. This is where the true psychometric horror of this question is manifested. There is neither rhyme nor reason here: participants, regardless of their performance on the overall assessment, are all over the place in terms of selecting response options. You might as well have blindfolded everyone taking this question and had them randomly select their answers. This must have been extremely frustrating for the participants who had to take this question and would have likely led to many participants thinking that the organization administering this question did not know what they were doing.

The psychometricians, SMEs, and test developers reviewing this question would need to provide a pink slip to the SME who wrote this question immediately. Clearly the SME failed basic question authoring training. This question makes no sense and was written in such a way as to suggest that the author was under the influence, or otherwise not in a right state of mind, when crafting it. What is this question testing? How can anyone possibly make sense of this and come up with a correct answer? Is there a correct answer? This question is not salvageable and should be stricken from the Perception repository without a second thought. A question like this should never have gotten in front of a participant, let alone 268 participants. The panel reviewing questions should review their processes to ensure that in the future questions like this are weeded out before an assessment goes live.


Item Analysis Analytics Part 6: Determining Whether a Question Makes the Grade


Posted by Greg Pope

In my previous blog post I talked about outcome discrimination and outcome correlation and their relationship to one another. Now I will provide some criteria that can be used for outcome discrimination and outcome correlation coefficients to judge whether a question is making the grade in terms of psychometric quality.

Outcome discrimination (high-low)


Outcome correlation (Point-biserial correlation)


I’ll be back with more juicy psychometrics soon!

Item Analysis Analytics Part 5: Outcome Discrimination and Outcome Correlation


Posted by Greg Pope

In my previous blog post I dived into some details of item analysis, looking at example questions and how to use the Questionmark Perception Item Analysis Report in an applied context. I thought it might be useful in this post to talk about outcome discrimination and outcome correlation, as people sometimes ask me how these differ, when to use one or the other, and so on. The fact of the matter is that you can use either one, and it often comes down to preference, as they both yield quite similar results.

Outcome discrimination is the proportion of the top 27% of participants (ranked by assessment score) who selected a response option minus the proportion of the bottom 27% who selected that option. What you would expect is that participants with the highest assessment scores should select the correct response option more often than participants with the lowest assessment scores. Similarly, participants with the highest assessment scores should select the incorrect distracters less often than the participants with the lowest assessment scores.

Outcome correlation is a point-biserial correlation between the outcome scores that participants achieve on the question and the assessment scores that they achieve overall. So rather than comparing only the top and bottom 27% of participants, the outcome correlation looks at all participants using a standard correlation approach.
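As a rough illustration (my own sketch on toy data, not the report’s actual implementation), both statistics can be computed in a few lines:

```python
def outcome_discrimination(item_correct, test_scores, fraction=0.27):
    """High-low discrimination: proportion correct in the top 27% of
    participants (by test score) minus the proportion in the bottom 27%."""
    order = sorted(range(len(test_scores)), key=lambda i: test_scores[i])
    n = max(1, round(len(test_scores) * fraction))
    low, high = order[:n], order[-n:]

    def proportion_correct(group):
        return sum(item_correct[i] for i in group) / len(group)

    return proportion_correct(high) - proportion_correct(low)

def point_biserial(item_correct, test_scores):
    """Point-biserial correlation between a 0/1 item score and test scores:
    an ordinary Pearson correlation where one variable is dichotomous."""
    n = len(test_scores)
    mx = sum(item_correct) / n
    my = sum(test_scores) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(item_correct, test_scores))
    sx = sum((x - mx) ** 2 for x in item_correct) ** 0.5
    sy = sum((y - my) ** 2 for y in test_scores) ** 0.5
    return cov / (sx * sy)

# Toy data: 10 participants; the stronger ones tend to answer correctly.
correct = [1, 1, 1, 1, 0, 1, 0, 0, 0, 0]
scores  = [95, 90, 85, 80, 75, 70, 60, 55, 50, 40]
print(outcome_discrimination(correct, scores))      # 1.0
print(round(point_biserial(correct, scores), 3))    # 0.808
```

Note that the first statistic uses only the extreme groups while the second uses every participant, which is exactly the difference described above.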

If you are thinking that outcome discrimination and outcome correlation sound like they might be related to one another, you are right! High outcome discrimination statistics generally will result in high outcome correlations. In other words, outcome discrimination and outcome correlation statistics are highly correlated with one another. How correlated are they? Well, I looked at many real-life questions from Item Analysis Reports that customers have shared with me and found a positive correlation of 0.962, which is really high.


In my next post I will provide some criteria that can be used with outcome discrimination and outcome correlation coefficients to judge whether a question is making the grade in terms of psychometric quality.