Q&A: High-stakes online tests for nurses

Posted by Julie Delazyn

I spoke recently with Leanne Furby, Director of Testing Services at the National League for Nursing (NLN), about her case study presentation at the Questionmark 2015 Users Conference in Napa Valley March 10-13.

Leanne’s presentation, Transitioning 70 Years of High-Stakes Testing to Questionmark, explains NLN’s switch from a proprietary computer- and paper-based test delivery engine to Questionmark OnDemand for securely delivering standardized exams worldwide. I’m happy to share a snippet from our conversation:

Tell me about the NLN

The NLN is a national organization for faculty nurses and leaders in nurse education. We offer faculty development, networking opportunities, testing services, nursing research grants and public policy initiatives to more than 26,000 members.

Why did you switch to Questionmark?

Our main concern was delivering our tests and exams to a variety of different devices. We wanted our students to be able to take a test on a tablet or take a quiz on their own mobile devices, and this wasn’t something we could do with our proprietary test delivery engine.

Our second major reason to go with Questionmark was the Customized Assessment Reports and the analytics tools. Before making the switch, we had to create reports and analyze results manually, which took time and resources. Now this is all integrated in Questionmark.

How do you use Questionmark assessments?

We have 90 different exam lines and deliver approximately 75,000 to 100,000 secure exams a year, both nationally and internationally, in multiple languages. The NLN partnered with Questionmark in 2014 to transition the delivery of these exams through a custom-built portal. Questionmark is now NLN’s turnkey solution—from item banking and test development with SMEs all over the world to inventory control, test delivery and analytics.

This transition has had positive outcomes for both our organization and our customers. We have developed a new project management policy, procedures for system transition and documentation for training at all levels. This has transformed the way we develop, deliver and analyze exams and the way we collect data for business and education purposes.

What are you looking forward to at the conference?

I am most looking forward to the opportunity to speak to other users and product developers to learn tips, tricks and little secrets surrounding the product. It’s so important to speak to people who have experience and can share ways of utilizing the software in ways you hadn’t thought of.

Thank you, Leanne, for taking time out of your busy schedule to discuss your session with us!

***

You have the opportunity to save $100 on your own conference registration: Just sign up by January 29 to receive this special early-bird discount.

An easier approach to job task analysis: Q&A

Posted by Julie Delazyn

Part of the assessment development process is understanding what needs to be tested. When you are testing what someone needs to know in order for them to do their job well, subject matter experts can help you harvest evidence for your test items by observing people at work. That traditionally manual process can take a lot of time and money.

Questionmark’s new job task analysis (JTA) capabilities enable SMEs to harvest information straight from the person doing the job. These tools also offer an easier way to see the frequency, importance, difficulty and applicability of a task in order to know if it’s something that needs to be included in an assessment.

Now that JTA question authoring, assessment creation and reporting are available to users of Questionmark OnDemand and Questionmark Perception 5.7, I wanted to understand what makes this special and important. Questionmark Product Manager Jim Farrell, who has been working on the JTA question since its conception, was kind enough to speak to me about its value, why it was created, and how it can now benefit our customers.

Here is a snippet of our conversation:

So … first things first … what exactly IS job task analysis and how would our customers benefit from using it?

Job task analysis, or JTA, is a survey you send out containing a list of tasks, which are broken down into dimensions. Those dimensions are typically difficulty, importance, frequency and applicability. From the people who fill out the survey, you want to find out things like: Do they find the task difficult? Do they deem it important? How frequently do they do it? When you correlate all this data you’ll quickly see which items are more important to test on and collect information about.

We have a JTA question type in Questionmark Live where you can either build your task list and your dimensions by hand or import your tasks through a simple import process—so if you have a spreadsheet with all of your tasks you can easily import it. You would then add those to a survey and send it out to collect information. We also have two JTA reports that allow you to break down results by a single dimension—say, just the difficulty of all the tasks—or to look at a summary view of all of your tasks and all the dimensions at one time, as a snapshot.
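
To make the idea concrete, here is a minimal sketch of the kind of roll-up a summary report provides. This is not Questionmark’s implementation: the tasks, the 1–5 rating scale and the flagging rule are all invented for illustration.

```python
import statistics

# Hypothetical JTA survey responses: three SMEs rate each task on each
# dimension using an invented 1-5 scale (5 = most difficult/important/frequent).
responses = {
    "Calibrate infusion pump": {"difficulty": [4, 5, 4], "importance": [5, 5, 4], "frequency": [2, 3, 2]},
    "File shift report":       {"difficulty": [2, 1, 2], "importance": [3, 3, 2], "frequency": [5, 5, 4]},
}

# The "snapshot" view: mean rating per task per dimension.
for task, dims in responses.items():
    means = {dim: statistics.mean(ratings) for dim, ratings in dims.items()}
    # An invented flagging rule: tasks rated both difficult and important
    # are strong candidates for inclusion in an assessment.
    flag = "  <- candidate for testing" if means["difficulty"] >= 3.5 and means["importance"] >= 3.5 else ""
    print(f"{task}: " + ", ".join(f"{dim}={m:.1f}" for dim, m in means.items()) + flag)
```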

That sounds very interesting and easy to use! I’m interested in how this question type actually came to be.

We initially developed the job task analysis survey for the US Navy. Prior to this, trainers would have to travel with paper and clipboards to submarines, battleships and aircraft carriers and watch sailors and others in the Navy do their jobs. We developed the JTA survey to help them collect this data more easily and a lot more quickly than they could before.

What do you think is most valuable and exciting about JTA?

To me, the value comes in the ease of creating the questions and sending them out. And I am probably most excited for our customers. Most customers probably harvest information by walking around with paper and clipboards and watching people do their jobs. That’s a very expensive and time-consuming task, so by being able to send this survey out directly to subject matter experts you’re getting more authentic data, because you are getting it right from the SMEs rather than from someone observing the behavior.

It was fascinating for me to understand how JTA was created and how it works … Do you find this kind of question type interesting? How do you see yourself using it? Please share your thoughts below!

How can a randomized test be fair to all?

Posted by Joan Phaup

James Parry, who is test development manager at the U.S. Coast Guard Training Center in Yorktown, Virginia, will answer this question during a case study presentation at the Questionmark Users Conference in San Antonio March 4 – 7. He’ll be co-presenting with LT Carlos Schwarzbauer, IT Lead at the USCG Force Readiness Command’s Advanced Distributed Learning Branch.

James and I spoke the other day about why tests created from randomly drawn items can be useful in some cases—but also about their potential pitfalls and some techniques for avoiding them.

When are randomly designed tests an appropriate choice?


There are several reasons to use randomized tests. Randomization is appropriate when you think there’s a possibility of participants sharing the contents of their test with others who have not taken it. Another reason would be a computer-lab-style testing environment where you are testing many participants on the same subject at the same time with no blinders between the computers. So even if participants look at the screens next to them, chances are they won’t see the same items.

How are you using randomly designed tests?

We use randomly generated tests at all three levels of testing: low-, medium- and high-stakes. The low- and medium-stakes tests are used primarily at the schoolhouse level for knowledge- and performance-based quizzes and tests. We are also generating randomized tests for on-site testing using tablet computers or locally installed workstations.

Our most critical use is for our high-stakes enlisted advancement tests, which are administered both on paper and by computer. Participants are permitted to retake this test every 21 days if they do not achieve a passing score. Before we were able to randomize the test, there were only three parallel paper versions. Candidates knew this, so some would “test sample” without studying to get an idea of every possible question. They would retake the first version, then the second, and so forth until they passed. With randomization, the word has gotten out that this is not possible anymore.

What are the pitfalls of drawing items randomly from an item bank?

The biggest pitfall is the potential for producing tests that have different levels of difficulty or that don’t present a balance of questions on all the subjects you want to cover. A completely random test can be unfair.  Suppose you produce a 50-item randomized test from an entire test item bank of 500 items.   Participant “A” might get an easy test, “B” might get a difficult test and “C” might get a test with 40 items on one topic and 10 on the rest and so on.
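
A toy simulation makes the problem visible. The bank size, difficulty spread and form length below are invented, but the drift in form difficulty is the point:

```python
import random
import statistics

random.seed(1)  # reproducible illustration

# Invented 500-item bank: each item's difficulty is its p value
# (0.2 = hard, 0.9 = easy).
bank = [random.uniform(0.2, 0.9) for _ in range(500)]

# Build 1,000 fully random 50-item forms and compare their average difficulty.
form_means = [statistics.mean(random.sample(bank, 50)) for _ in range(1000)]
print(f"easiest form: mean p = {max(form_means):.2f}")
print(f"hardest form: mean p = {min(form_means):.2f}")
# The gap between these two forms is the unfairness that a stratified,
# blueprint-driven draw (described below) is designed to remove.
```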

How do you equalize the difficulty levels of your questions?

This is a multi-step process. The item author has to make sure they develop sufficient numbers of items in each topic that will provide at least 3 to 5 items for each enabling objective.  They have to think outside the box to produce items at several cognitive levels to ensure there will be a variety of possible levels of difficulty. This is the hardest part for them because most are not trained test writers.

Once the items are developed, edited and approved in workflow, we set up an Angoff rating session to assign a cut score for the entire bank of test items. Based upon the Angoff score, each item is assigned a difficulty level of easy, moderate or hard and given a matching metatag within Questionmark. We use a spreadsheet to calculate the number and percentage of available items at each level of difficulty in each topic. Based upon the results, the spreadsheet tells us how many items to select from the database at each difficulty level and from each topic. The test is then designed to match these numbers so that each time it is administered it will be parallel, with the same level of difficulty and the same cut score.
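
The spreadsheet itself isn’t shown in this post, but a minimal sketch of the kind of calculation it performs might look like the following; the topics, counts and test length are hypothetical.

```python
from collections import Counter

# Hypothetical item bank tallied by (topic, difficulty), where difficulty
# comes from the Angoff-based easy/moderate/hard metatags.
bank = Counter({
    ("Navigation", "easy"): 40, ("Navigation", "moderate"): 35, ("Navigation", "hard"): 25,
    ("Seamanship", "easy"): 30, ("Seamanship", "moderate"): 45, ("Seamanship", "hard"): 25,
})

TEST_LENGTH = 50
total_items = sum(bank.values())

# Select from each topic/difficulty cell in proportion to its share of the
# bank, so every generated form has the same difficulty mix and topic balance.
draw = {cell: round(TEST_LENGTH * count / total_items) for cell, count in bank.items()}

for (topic, difficulty), n in sorted(draw.items()):
    print(f"{topic:11s} {difficulty:9s} -> select {n} items")
# Rounding can leave the total an item or two off TEST_LENGTH; a real
# spreadsheet would adjust the largest cells to hit the exact form length.
print("total selected:", sum(draw.values()))
```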

Is there anything audience members should do to prepare for this session?

Come with an open mind and a willingness to think outside of the box.

How will your session help audience members ensure their randomized tests are fair?

I will give them the tools to use, starting with a quick review of using the Angoff method to set a cut score, and then discuss the inner workings of the spreadsheet that I developed to ensure each test is fair and equal.

***

See more details about the conference program here and register soon.

Using OData for dynamic, customized reporting: Austin Fossey Q&A

Posted by Joan Phaup

We’ll be exploring the power of the Open Data Protocol (OData) and its significance for assessment and measurement professionals during the Questionmark 2014 Users Conference in San Antonio March 4 – 7.

Austin Fossey, our reporting and analytics manager, will explain the ins and outs of using the Questionmark OData API, which makes it possible to access assessment results freely and use third-party tools to create dynamic, customized reports. Participants in a breakout session about the OData API, led by Austin along with Steve Lay, will have the opportunity to try it out for themselves.


I got some details about all this from Austin the other day:

What’s the value of learning about the OData API?

The OData API gives you access to raw data. It’s an option for accessing data from your assessment results warehouse without having to know how to program, query databases or even host the database yourself. By having access to those data, you are not limited to the reports Questionmark provides: You can do data merges and create your own custom reports.

OData is really good for targeting specific pieces of info people want. The biggest plus is that it doesn’t just provide data access. It provides a flow of data. If you know the data you need and you want to set up a report, a spreadsheet, or just have it in the web browser, you can get those results updated as new data become available. This flow of data is what makes OData reports truly dynamic, and this is what distinguishes OData reports from reports that are built from manually generated data exports.
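
As a concrete illustration, here is a minimal sketch of pulling an OData feed programmatically. The base URL, entity set name and field names are placeholders rather than Questionmark’s actual feed; $filter, $orderby and $top are standard OData query options.

```python
import requests

BASE_URL = "https://example.com/odata"  # placeholder: your results warehouse feed

# Standard OData query options; the entity and field names here are invented.
params = {
    "$filter": "AssessmentName eq 'Safety Quiz'",
    "$orderby": "WhenFinished desc",
    "$top": "100",
    "$format": "json",
}

response = requests.get(f"{BASE_URL}/Results", params=params, auth=("user", "password"))
response.raise_for_status()

# Older OData services wrap rows in d/results; newer ones use a top-level "value".
payload = response.json()
rows = payload.get("d", {}).get("results") or payload.get("value", [])
for row in rows:
    print(row)
```

Re-running the same request on a schedule is what turns this from a one-off export into the dynamic flow of data described above.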

What third-party tools can people use with the OData API?

Lots! Key applications include Microsoft Excel PowerPivot, Tableau, the Sesame Data Browser, SAP BusinessObjects and Logi Analytics, but there are plenty to choose from. People can also do their own programming if they prefer. The OData.org website includes a helpful listing of the OData ecosystem, which includes applications that generate and consume OData feeds.

Can you share some examples of custom reports that people can create with OData?

We have some examples of OData reportlets on our Open Assessment Platform website for developers, which also includes some tutorials. I’ve blogged about using the OData API to create a response matrix and to create a frequency table of item keys in Microsoft PowerPivot for Excel. There are so many different ways to use this!

What about merging data from assessments with data from other sources? What are some scenarios for doing that?

It could be any research where you want to cross-reference your assessment data with another data source. If you have another data set in which you can identify participants – say, an HR database showing the coursework people have done – you could compare their test results with their course activity. Reports don’t necessarily have to be about test scores. They can be about items and answer choices – anything you want.

Tell me about the hands-on element of this breakout session.

We will be working through a fairly simple example using Microsoft PowerPivot for Excel  in order to cement the concepts of using OData. We’re encouraging people to bring their laptops with Excel and the PowerPivot add-in already installed. If they don’t have that, they can either work with someone else or watch the exercise onscreen. We will provide a handout explaining everything so they can try this when they are back at work.

What do you want people to take away from this breakout?

We want to make sure people know how to construct an OData URL and that they understand both the possibilities and the limitations of using OData. It won’t be a panacea for everything. We want to be sure they know they have another tool in their toolbox to answer the research questions or business questions they encounter day to day.

Our conference keynote speaker, Learning Strategist Bryan Chapman, will share insights about OData and examples of how organizations are using it during his presentation on Transforming Data into Meaning and Action.

Click here to see the complete conference program. And don’t forget to sign up by January 30th if you want to save $100 on your registration.

Item Analysis Report – Item Difficulty Index

Posted by Austin Fossey

In classical test theory, a common item statistic is the item’s difficulty index, or “p value.” Given many psychometricians’ notoriously poor spelling, might this be due to thinking that “difficulty” starts with p?

Actually, the p stands for the proportion of participants who got the item correct. For example, if 100 participants answered the item, and 72 of them answered the item correctly, then the p value is 0.72. The p value can take on any value between 0.00 and 1.00. Higher values denote easier items (more people answered the item correctly), and lower values denote harder items (fewer people answered the item correctly).

Typically, test developers use this statistic as one indicator for detecting items that could be removed from delivery. They set thresholds for items that are too easy and too difficult, review them, and often remove them from the assessment.

Why throw out the easy and difficult items? Because they are not doing as much work for you. When calculating the item-total correlation (or “discrimination”) for unweighted items, Crocker and Algina (Introduction to Classical and Modern Test Theory) note that discrimination is maximized when p is near 0.50 (about half of the participants get it right).

Why is discrimination so low for easy and hard items? An easy item means that just about everyone gets it right, no matter how proficient they are in the domain; the item does not discriminate well between high and low performers. The same logic applies to a very hard item, which almost everyone gets wrong. (We will talk more about discrimination in subsequent posts.)

Sometimes you may still need to use a very easy or very difficult item on your test form. You may have a blueprint that requires a certain number of items from a given topic, and all of the available items might happen to be very easy or very hard. I also see this scenario in cases with non-compensatory scoring of a topic. For example, a simple driving test might ask, “Is it safe to drink and drive?” The question is very easy and will likely have a high p value, but the test developer may include it so that if a participant gets the item wrong, they automatically fail the entire assessment.

You may also want very easy or very hard items if you are using item response theory (IRT) to score an aptitude test, though it should be noted that item difficulty is modeled differently in an IRT framework. IRT yields standard errors of measurement that are conditional on the participant’s ability, so having hard and easy items can help produce better estimates of high- and low-performing participants’ abilities, respectively. This is different from classical test theory, where the standard error of measurement is the same for all observed scores on an assessment.

While simple to calculate, the p value requires cautious interpretation. As Crocker and Algina note, the p value is a function of the number of participants who know the answer to the item plus the number of participants who were able to correctly guess the answer to the item. In an open response item, that latter group is likely very small (absent any cluing in the assessment form), but in a typical multiple choice item, a number of participants may answer correctly, based on their best educated guess.

Recall also that p values are statistics—measures from a sample. Your interpretation of a p value should be informed by your knowledge of the sample. For example, if you have delivered an assessment, but only advanced students have been scheduled to take it, then the p value will be higher than it might be when delivered to a more representative sample.

Since the p value is a statistic, we can calculate the standard error of that statistic to get a sense of how stable the statistic is. The standard error will decrease with larger sample sizes. In the example below, 500 participants responded to this item, and 284 participants answered the item correctly, so the p value is 284/500 = 0.568. The standard error of the statistic is ± 0.022. If these 500 participants were to answer this item over and over again (and no additional learning took place), we would expect the p value for this item to fall in the range of 0.568 ± 0.022 about 68% of the time.
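
A quick sketch of both calculations, assuming the familiar binomial standard error sqrt(p(1 − p)/n), reproduces the figures above:

```python
import math

def item_difficulty(correct: int, respondents: int) -> tuple[float, float]:
    """Classical item difficulty (p value) and its binomial standard error."""
    p = correct / respondents
    se = math.sqrt(p * (1 - p) / respondents)
    return p, se

# Figures from the example above: 284 of 500 participants answered correctly.
p, se = item_difficulty(284, 500)
print(f"p = {p:.3f}, standard error = ±{se:.3f}")  # p = 0.568, SE = ±0.022
```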

[Image: Item p value and standard error of the statistic from Questionmark’s Item Analysis Report]

A streamlined system for survey administration and reporting

Posted by Joan Phaup

It’s great to talk to customers who will be presenting case studies at the Questionmark 2014 Users Conference. They all bring to their presentations the lessons they’ve learned from experience.

Conference participants have always taken a keen interest in how to use surveys effectively, so I was glad to speak with Scott Bybee, a training manager from Verizon who will be talking at the conference about Leveraging Questionmark’s Survey Capabilities Within a Multi-system Model.

What will you be sharing during your conference presentation?

A lot of it will be about our surveys, which are mostly Level 1 evaluations for training events and Level 3 self-assessments. I will explain how we use one generic survey template for all the courses that are being evaluated. We do this by passing parameters from our LMS into the special fields in Questionmark. I’ll also talk about how we integrate data from our LMS with the survey data to create detailed reports in a custom reporting system we built: We have everything we need to get very specific demographic reporting out of the system.
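
The general pattern is easy to sketch, though this is not the actual integration syntax Scott’s session covers; the field names and launch URL below are invented for illustration.

```python
from urllib.parse import urlencode

# Hypothetical demographic fields pulled from the LMS user record. Passing
# them on the launch URL means participants never re-type them (and so
# can't get them wrong), and they arrive ready for demographic reporting.
lms_record = {
    "course_id": "CUST-101",
    "region": "Northeast",
    "job_role": "Customer Service Rep",
}

base_url = "https://example.com/survey/launch"  # placeholder survey URL
launch_url = f"{base_url}?{urlencode(lms_record)}"
print(launch_url)
# https://example.com/survey/launch?course_id=CUST-101&region=Northeast&job_role=Customer+Service+Rep
```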


How is this approach helping you?

This system integrates reporting for all Level 1 and Level 3 surveys. It provides us with a single solution for all of our training-related reporting needs. Prior to this, we had to collect data from multiple systems and manually tie it all together. We also had a lot of different surveys being used across the business, and it became hard to match up results due to variances in questions. With this approach, everyone sees the same set of questions and the quality of the reporting is much higher.

The alternative would have been to collect demographic information using drop-down lists, which we’d have to constantly update and maintain. There’s also the issue of the participant possibly choosing the wrong options from the drop-downs. This way, we are passing everything along for them. They can’t make a mistake. Another advantage is that automatically including that information means it takes less time for them to complete the survey.

Do you have a key piece of advice about how to get truly useful data from surveys?

Make sure you are asking the right kinds of questions and are not trying to put too much into one question. Also, consider passing information directly from your LMS into Questionmark, so participants can’t make a mistake filling out a drop-down.

What do you hope people will take away from your session?

I hope they find out there are some really creative ways to use Questionmark to get what you want. For instance, we realized that by using the Perception Integration Protocol (PIP), we could pass in all the variables needed for the user interface as well as for alignment with back-end reporting. I also want them to appreciate how much can be done by tying different systems together. The investment to make Questionmark work for surveys as well as assessments dramatically increased our return on investment (ROI).

What do you hope to take away from the conference?

This will be my fourth one to go to. Every time I go I learn something from the people who are there – things I’d never even thought about. I want to learn from people who are using the tool in innovative ways, and I also want to hear about where things are going in the future.

The conference agenda is taking shape here. You can save $200 if you register for the conference by December 12.