Badging and Assessment: If they know it, let them show it!

Posted by Brian McNamara

We are delighted to announce the availability of Questionmark Badging!

With Questionmark Badging and Questionmark OnDemand, you can grant “badges” to participants based on the outcomes they achieve on assessments such as certification exams, post-course tests or advancement exams. Badges associated with Questionmark assessments provide participants with portable, verifiable digital credentials.

Badges aligned with Questionmark assessments can be tied in with competencies and achievements, helping organizations provide recognition and motivation for increasing knowledge and skills. For credentialing and awarding bodies, they can increase the visibility and value of certification programs.

The new app couples Questionmark’s capabilities in delivering valid, reliable and trustworthy assessments with the industry-leading digital credentialing platform from Credly. More than just a visual representation of accomplishment, digital badges provide participants with verifiable, portable credentials that can be shared and displayed across the web, including social networking sites such as LinkedIn, Facebook and Twitter.

Find more info about Questionmark Badging right here!

New white paper: Questionmark and Microsoft Office 365

Posted by John Kleeman

I’m pleased to announce a new white paper, fresh off the press, on Questionmark and Microsoft Office 365.

This white paper explains how Microsoft Office 365 complements the Questionmark OnDemand assessment management system: how you can use Office 365 to launch Questionmark surveys, quizzes, tests and exams; how to use Office 365 resources within Questionmark; and how Office 365 can help you analyze assessment results. You can download the white paper here.

The white paper also describes some of the reasons organizations use assessments and why it is important for assessments to be valid, reliable and trustworthy.

Launching assessments from Office 365

Being able to call assessments from within Office 365 allows you to closely connect an assessment to content, for example to check understanding after learning. The white paper describes how you can:

  • Call Questionmark assessments from the Office 365 app launcher
  • Launch an assessment from within a Word, Excel or other Office document
  • Embed an assessment inside a PowerPoint presentation
  • Launch or embed assessments from SharePoint
  • Use SAML to share common identities and enable seamless authentication between Office 365 and Questionmark OnDemand. The benefit is that test-takers can log in once to Office 365 and then take tests in Questionmark OnDemand without needing to log in again.

Using Office 365 resources within assessments

Assessments of competence are generally more accurate when the questions simulate the performance environment being measured. By putting video, sound, graphics and other media within the question stimulus, you put the participant’s mind into an environment closer to the one they will experience when performing a real-world job task, which makes the question a more accurate measure of performance on such tasks.

To help you take advantage of this, a common use of Office 365 with Questionmark OnDemand is to create media and other resources for use within assessments. The white paper describes how you can use Office 365 Video, PowerPoint, SmartArt and other Office 365 tools to create videos and other useful question content.

Using Office 365 to help analyze results of assessments

People have been using Microsoft Excel to analyze assessment results since the 1980s, and the white paper offers suggestions on how to do this most effectively with Questionmark OnDemand.

Newer Microsoft tools can also provide powerful insight into assessment results. Questionmark OnDemand makes assessment data available in an OData feed, which can be consumed by business intelligence systems such as Power BI. OData is an open protocol for creating and consuming queryable, interoperable data in a simple and standard way. The white paper also describes how to use OData and Power BI to get further analysis and visualizations from Questionmark OnDemand.
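As a rough illustration of what consuming such a feed can look like outside of Power BI, here is a minimal Python sketch. The feed URL, entity set and field names are hypothetical placeholders, while the $select, $filter and $top query options are part of the OData standard itself:

```python
# Minimal sketch: pulling assessment results from an OData feed with Python.
# The feed URL, entity set name ("Results") and field names are hypothetical
# placeholders; substitute the endpoint and credentials your own OnDemand
# area exposes.
import requests

FEED_URL = "https://example.ondemand.example.com/odata/Results"  # hypothetical

params = {
    "$select": "ParticipantName,AssessmentName,TotalScore",  # standard OData query options
    "$filter": "TotalScore ge 50",
    "$top": "100",
    "$format": "json",
}

response = requests.get(FEED_URL, params=params, auth=("user", "password"))
response.raise_for_status()

for row in response.json().get("value", []):
    print(row)
```

The same query options work whether the consumer is a script like this, Excel or Power BI, which is exactly what makes an OData feed convenient for downstream analysis.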


The white paper is easy to read and gives practical advice. I recommend it if your organization uses Office 365 and Questionmark, or if you are considering doing so. You can download it (free with registration) from the Questionmark website. You can also find other white papers, eBooks and helpful resources at www.questionmark.com/learningresources.


G Theory and Reliability for Assessments with Randomly Selected Items

Posted by Austin Fossey

One of our webinar attendees recently emailed me to ask if there is a way to calculate reliability when items are randomly selected for delivery in a classical test theory (CTT) model.

As with so many things, the answer comes from Lee Cronbach—but it’s not Cronbach’s Alpha. In 1963, Cronbach, along with Goldine Gleser and Nageswari Rajaratnam, published a paper on generalizability theory, which is often called G theory for brevity or to sound cooler. G theory is a very powerful set of tools, but today I am focusing on one aspect of it: the generalizability coefficient, which describes the degree to which observed scores might generalize to a broader set of measurement conditions. This is helpful when the conditions of measurement will change for different participants, as is the case when we use different items, different raters, different administration dates, etc.

In G theory, measurement conditions are called facets. A facet might include items, test forms, administration occasions, or human raters. Facets can be random (i.e., they are a sample of a much larger population of potential facets), or they might be fixed, such as a condition that is controlled by the researcher. The hypothetical set of conditions across all possible facets is called, quite grandly, the universe of generalization. A participant’s average measurement across the universe of generalization is called their universe score, which is similar to a true score in CTT, except that we no longer need to assume that all measurements in the universe of generalization are parallel.

In CTT, the concept of reliability is defined as the ratio of true score variance to observed score variance. Observed scores are just true scores plus measurement error, so as measurement error decreases, reliability increases toward 1.00.
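In symbols (standard CTT notation, not anything specific to the webinar):

```latex
% Classical test theory: observed score X decomposes into true score T plus error E
X = T + E, \qquad
\rho_{XX'} = \frac{\sigma^2_T}{\sigma^2_X} = \frac{\sigma^2_T}{\sigma^2_T + \sigma^2_E}
```

As the error variance shrinks toward zero, the ratio approaches 1.00, which is the point made in words above.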

The generalizability coefficient is defined as the ratio of universe score variance to expected score variance, which is similar to the concept of reliability in CTT. The generalizability coefficient is made of variance components, which differ depending on the design of the study, and which can be derived from an analysis of variance (ANOVA) summary table. We will not get into the math here, but I recommend Linda Crocker and James Algina’s Introduction to Classical and Modern Test Theory for a great introduction and easy-to-follow examples of how to calculate generalizability coefficients under multiple conditions. For now, let’s return to our randomly selected items.

In his chapter in Educational Measurement, 4th Edition, Edward Haertel illustrated the overlaps between G theory and CTT reliability measures. When all participants see the same items, the generalizability coefficient is made up of the variance components for the participants and for the residual scores, and it yields the exact same value as Cronbach’s Alpha. If the researcher wants to use the generalizability coefficient to generalize to an assessment with more or fewer items, then the result is the same as the Spearman-Brown formula.

But when our participants are each given a random set of items, they are no longer receiving parallel assessments. The generalizability coefficient has to be modified to include a variance component for the items, and the observed score variance is now a function of three things:

  • Error variance.
  • Variance in the item mean scores.
  • Variance in the participants’ universe scores.

Note that error variance is not the same as measurement error in CTT. In the case of a randomly generated assessment, the error variance includes measurement error and an extra component that reflects the lack of perfect correlation between the items’ measurements.
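To make the bookkeeping concrete, here is a minimal sketch assuming a fully crossed persons × items design with a complete score matrix. The function and variable names are mine; the variance-component estimators are the standard random-effects ANOVA ones described in texts such as Crocker and Algina:

```python
# Minimal sketch: estimating variance components for a persons x items (p x i)
# design and assembling two generalizability coefficients.
import numpy as np

def variance_components(scores: np.ndarray):
    """scores: n_persons x n_items matrix of item scores (no missing data)."""
    n_p, n_i = scores.shape
    grand = scores.mean()
    person_means = scores.mean(axis=1)
    item_means = scores.mean(axis=0)

    # Mean squares from the two-way ANOVA without replication
    ms_p = n_i * np.sum((person_means - grand) ** 2) / (n_p - 1)
    ms_i = n_p * np.sum((item_means - grand) ** 2) / (n_i - 1)
    residual = scores - person_means[:, None] - item_means[None, :] + grand
    ms_res = np.sum(residual ** 2) / ((n_p - 1) * (n_i - 1))

    var_res = ms_res                   # person x item interaction plus error
    var_p = (ms_p - ms_res) / n_i      # universe (person) score variance
    var_i = (ms_i - ms_res) / n_p      # item (difficulty) variance
    return var_p, var_i, var_res

def g_coefficients(var_p, var_i, var_res, n_items):
    # Everyone answers the same items (crossed design): equivalent to Cronbach's Alpha
    crossed = var_p / (var_p + var_res / n_items)
    # Each person answers a random sample of items: item variance joins the error
    random_items = var_p / (var_p + (var_i + var_res) / n_items)
    return crossed, random_items
```

Calling g_coefficients with the number of items each participant actually sees shows how the item variance component pulls the random-items coefficient below the crossed-design value.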

For those of you randomly selecting items, this makes a difference! Cronbach’s Alpha may yield low or even meaningless results (e.g., negative values) when items are randomly selected. In an example dataset, 1,000 participants answered the same 200 items. For this assessment, Cronbach’s Alpha is equivalent to the generalizability coefficient: 0.97. But if each of those participants had answered 50 randomly selected items from the same set, Cronbach’s Alpha would no longer be appropriate. If we used it anyway, we would see a depressing number: 0.50. The generalizability coefficient, by contrast, is 0.65: still too low, but better than the alpha value.

Finally, it is important to report your results accurately. According to the Standards for Educational and Psychological Testing, you can report generalizability coefficients as reliability evidence if it is appropriate for the design of the assessment, but it is important not to use these terms interchangeably. Generalizability is a distinct concept from reliability, so make sure to label it as a generalizability coefficient, not a reliability coefficient. Also, the Standards require us to document the sources of variance that are included (and excluded) from the calculation of the generalizability coefficient. Readers are encouraged to refer to the Standards’ chapter on reliability and precision for more information.

Item Development – Planning your field test study

Posted by Austin Fossey

Once the items have passed their final editorial review, they are ready to be delivered to participants, but they are not quite ready to be delivered as scored items. For large-scale assessments, it is best practice to deliver your new items as unscored field test items so that you can gather item statistics for review before using the items to count toward a participant’s score. We discussed field test studies in an earlier post, but today we will focus more on the operational aspects of this task.

If you are embedding field test items, there is little you need to do to plan for the field test, other than collecting data on your participants to ensure representativeness and making sure that enough participants respond to each item to yield stable statistics. You can collect data for representativeness by using demographic questions in Questionmark’s authoring tools.

If you are field testing an entire form, you will need to plan more carefully. Schmeiser and Welch (Educational Measurement, 4th ed.) recommend field testing twice as many items as you will need for your operational form.

To check representativeness, you may want to survey your participants in advance to help you select your participant sample. For example, if your participant population is 60% female and 40% male, but your field test sample is 70% male, then that may impact the validity of your field test results. It will be up to you to decide which factors are relevant (e.g., sex, ethnicity, age, level of education, location, level of experience). You can use Questionmark’s authoring tools and reports to deliver and analyze these survey results.
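As one way of making that check concrete, you could compare the sample’s demographic breakdown against known population proportions. The sketch below is purely illustrative (the counts, categories and significance cut-off are invented, not a Questionmark feature):

```python
# Minimal sketch: checking whether a field test sample mirrors the participant
# population on one demographic factor. Counts and proportions are invented
# for illustration; in practice they come from your survey results.
from scipy.stats import chisquare

population_proportions = {"female": 0.60, "male": 0.40}  # known population mix
sample_counts = {"female": 30, "male": 70}                # observed in the field test sample

total = sum(sample_counts.values())
observed = [sample_counts[k] for k in population_proportions]
expected = [population_proportions[k] * total for k in population_proportions]

stat, p_value = chisquare(f_obs=observed, f_exp=expected)
if p_value < 0.05:
    print(f"Sample may not be representative (chi-square = {stat:.1f}, p = {p_value:.3f})")
else:
    print(f"No evidence against representativeness (chi-square = {stat:.1f}, p = {p_value:.3f})")
```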

You will also need to entice participants to take your field test. Most people will not want to take a test if they do not have to, but you will likely want to conduct the field test expeditiously. You may want to offer an incentive to test, but that incentive should not bias the results.

For example, I worked on a certification assessment where the assessment cost participants several hundred dollars. To incentivize participation in the field test study of multiple new forms, we offered the assessment free of charge and told participants that their results would be scored once the final forms were assembled. We surveyed volunteers and selected a representative sample to field test each of the forms.

The number of responses you need for each item will depend on your scoring model and your organization’s policies. If you are using Classical Test Theory, some organizations will feel comfortable with 80 to 100 responses per item, but Item Response Theory models may require 200 to 500 responses to yield stable item parameters.

More is always better, but it is not always possible. For instance, if an assessment is for a very small population, you may not have very many field test participants. You will still be able to use the item statistics, but they should be interpreted cautiously in conjunction with their standard errors. In the next post, we will talk about interpreting item statistics in the psychometric review following the field test.
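To see why that caution matters, here is a small, generic illustration (not from the post) of how the standard error of a classical item difficulty shrinks as responses accumulate:

```python
# Minimal sketch: the standard error of a classical item difficulty (p-value)
# shrinks as the number of field test responses grows. The formula is the
# usual standard error of a proportion, sqrt(p * (1 - p) / n).
import math

def p_value_standard_error(p: float, n_responses: int) -> float:
    return math.sqrt(p * (1 - p) / n_responses)

for n in (30, 80, 100, 200, 500):
    se = p_value_standard_error(p=0.6, n_responses=n)
    print(f"n = {n:3d}: item difficulty 0.60 +/- {1.96 * se:.3f} (95% interval)")
```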

Item Development – Conducting the final editorial review

Posted by Austin Fossey

Once you have completed your content review and bias review, it is best to conduct a final editorial review.

You may have already conducted an editorial review prior to the content and bias reviews to cull items with obvious item-writing flaws or inappropriate item types—so by the time you reach this second editorial review, your items should only need minor edits.

This is the time to put the final polish on all of your items. If your content review committee and bias review committee were authorized to make changes to the items, go back and make sure they followed your style guide and that they used accurate grammar and spelling. Make sure they did not make any drastic changes that violate your test specifications, such as adding a fourth option to a multiple choice item that should only have three options.

If you have resources to do so, have professional editors review the items’ content. Ask the editors to identify issues with language, but review their suggestions rather than letting them make direct edits to the items. The editors may suggest changes that violate your style guide, they may not be familiar with language that is appropriate for your industry, or they may wish to make a change that would drastically impact the item content. You should carefully review their changes to make sure they are each appropriate.

As with other steps in the item development process, documentation and organization are key. Using item authoring software like that provided by Questionmark can help you track revisions to items, document changes, and make sure each item is reviewed.

Do not approve items with a rubber stamp. If an item needs major content revisions, send it back to the item writers and begin the process again. Faulty items can undermine the validity of your assessment and can result in time-consuming challenges from participants. If you have planned ahead, you should have enough extra items to allow for some attrition while retaining enough items to meet your test specifications.

Finally, be sure that you have the appropriate stakeholders sign off on each item. Once the item passes this final editorial review, it should be locked down and considered ready to deliver to participants. Ideally, no changes should be made to items once they are in delivery, as this may impact how participants respond to the item and perform on the assessment. (Some organizations require senior executives to review and approve any requested changes to items that are already in delivery.)

When you are satisfied that the items are perfect, they are ready to be field tested. In the next post, I will talk about item try-outs, selecting a field test sample, assembling field test forms, and delivering the field test.

Check out our white paper: 5 Steps to Better Tests for best practice guidance and practical advice for the five key stages of test and exam development.

Austin Fossey will discuss test development at the 2015 Users Conference in Napa Valley, March 10-13. Register before Jan. 29 and save $100.

Acronyms, Abbreviations and APIs

Posted by Steve Lay

As Questionmark’s integrations product owner, it is all too easy to speak in acronyms and abbreviations. Of course, with the advent of modern day ‘text-speak,’ acronyms are part of everyday speech. But that doesn’t mean everyone knows what they mean. David Cameron, the British prime minister, was caught out by the everyday ‘LOL’ when it was revealed during a recent public inquiry that he’d used it thinking it meant ‘lots of love’.

In the technical arena things are not so simple. Even spelling out an acronym like SOAP (which stands for Simple Object Access Protocol) doesn’t necessarily make the meaning any clearer. In this post, I’m going to do my best to explain the meanings of some of the key acronyms and abbreviations you are likely to hear talked about in relation to Questionmark’s Open Assessment Platform.

API

At a recent presentation (on Extending the Platform), while I was talking about ways of integrating with Questionmark technologies, I asked the audience how many people knew what ‘API’ stood for. The response prompted me to write this blog article!

The term API is used so often that it is easy to forget that it is not widely known outside the computing world.

API stands for Application Programming Interface. In this case the ‘application’ refers to some external software that provides functionality beyond that which is available in the core platform. For example, it could be a custom registration application that collects information in a special way that makes it possible to automatically create a user and schedule them to a specified assessment.

The API is the information that the programmer needs to write this registration application. ‘Interface’ refers to the join between the external software and the platform it is extending. (Our own APIs are documented on the Questionmark website and can be reached directly from developer.questionmark.com.)

APIs and Standards

APIs often build on technical standards. Using standards helps the designer of an API focus on the things that are unique to the platform concerned without going into too much incidental detail. Using a common standard also helps programmers develop applications more quickly, because pre-written code that implements the underlying standard will often be available for them to use.

To use a physical analogy, some companies will ask you to send them a self-addressed stamped envelope when requesting information from them. The company doesn’t need to explain what an envelope is, what a stamp is and what they mean by an address! These terms act a bit like technical standards for the physical world. The company can simply ask for one because they know you understand this request. They can focus their attention on describing their services, the types of requests they can respond to and the information they will send you in return.

QMWISe

QMWISe stands for Questionmark Web Integration Services Environment. This API allows programmers to exchange information with the Questionmark OnDemand software-as-a-service or Questionmark Perception on-premise software. QMWISe is based on an existing standard called SOAP (see above).

SOAP defines a common structure used for sending and receiving messages; it even defines the concept of a virtual ‘envelope’. Referring to the SOAP standard allows us to focus on the contents of the messages being exchanged such as creating participants, creating schedules, fetching results and so on.
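To make the “virtual envelope” idea concrete, here is a generic sketch of what a SOAP request looks like on the wire. The endpoint URL is a placeholder and the body is left empty rather than showing a real QMWISe operation:

```python
# Minimal sketch: the generic shape of a SOAP request -- an XML "envelope"
# whose body carries the actual message -- posted over HTTP. The endpoint URL
# and body contents are placeholders, not real QMWISe operations.
import requests

SOAP_ENVELOPE = """<?xml version="1.0" encoding="utf-8"?>
<soap:Envelope xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/">
  <soap:Body>
    <!-- the service-specific request (e.g. a QMWISe operation) goes here -->
  </soap:Body>
</soap:Envelope>"""

response = requests.post(
    "https://example.example.com/qmwise.asmx",  # hypothetical endpoint
    data=SOAP_ENVELOPE.encode("utf-8"),
    headers={"Content-Type": "text/xml; charset=utf-8"},
)
print(response.status_code)
```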

REST

REST stands for REpresentational State Transfer and must qualify as one of the more obscure acronyms! In practice, REST represents something of a back-to-basics approach to APIs when contrasted with those based on SOAP. It is not, in itself, a standard but merely a set of stylistic guidelines for API designers, defined in the doctoral dissertation of Roy Fielding, a co-author of the HTTP standard (see below).

As a result, APIs are sometimes described as ‘RESTful’, meaning they adhere to the basic principles defined by REST. These days, publicly exposed APIs are more likely to be RESTful than SOAP-based. Central to the idea of a RESTful API is that the things your API deals with are identified by a URL (Uniform Resource Locator), the web’s equivalent of an address. In our case, that would mean that each participant, schedule, result, etc. would be identified by its own URL.
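As a purely illustrative sketch (the URLs below are placeholders, not documented Questionmark endpoints), the RESTful pattern looks like this in code:

```python
# Minimal sketch of the RESTful pattern: each resource (a participant, a
# result, ...) has its own URL, and standard HTTP verbs act on it. The URLs
# are hypothetical placeholders, not documented Questionmark endpoints.
import requests

BASE = "https://example.ondemand.example.com/api"  # hypothetical

# GET reads a single result, identified entirely by its URL
result = requests.get(f"{BASE}/results/12345", auth=("user", "password"))
print(result.json())

# POST to a collection URL creates a new resource, e.g. a schedule
new_schedule = requests.post(
    f"{BASE}/schedules",
    json={"participant": f"{BASE}/participants/67890", "assessment": f"{BASE}/assessments/42"},
    auth=("user", "password"),
)
print(new_schedule.status_code)
```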

HTTP

RESTful APIs draw heavily on HTTP. HTTP stands for HyperText Transfer Protocol. It was invented by Tim Berners-Lee and is one of the key inventions that underpin the web as we know it. Although conceived as a way of publishing HyperText documents (i.e., web pages), the underlying protocol is really just a way of sending messages; it defines the virtual envelope into which those messages are placed. HTTP is familiar as the prefix to most URLs.

OData

Finally, this brings me to OData, which simply stands for Open Data. This standard makes it much easier to publish RESTful APIs. I recently wrote about OData in the post What is OData, and why is it important?

Although arguably simpler than SOAP, OData provides an even more powerful platform for defining APIs. For some applications, OData itself is enough, and tools can be integrated with no additional programming at all. The PowerPivot plugin for Microsoft Excel is a good example. Using Excel you can extract and analyse data using the Questionmark Results API (itself built on OData) without any Questionmark-specific programming at all.

For more about OData, check out this presentation on Slideshare.
