Managing Item Development for Large-Scale Assessment

Posted by Julie Delazyn

Whether you work with low-stakes assessments, small-scale classroom assessments, or large-scale, high-stakes assessments, understanding and applying some basic principles of item development will greatly enhance the quality of your results.

What began as an 11-part blog series by Reporting and Analytics Manager Austin Fossey has, by popular demand, morphed into a white paper.

Managing Item Development for Large-Scale Assessment offers sound advice on how to organize and execute the item development steps that will help you create defensible assessments.

Download the free white paper: Managing Item Development for Large-Scale Assessment

Cloud vs. On-premise: A customer’s viewpoint

Posted by Chloe Mendonca

The cloud or on-premise: which is best? Many organisations are asking this question, and migration to the cloud is undoubtedly one of the biggest trends in the technology industry right now.

Many organisations are finding vast benefits from moving to the cloud. A recent research study by the School of Computing, Creative Technologies & Engineering at Leeds Beckett University highlights how the cloud plays an instrumental role in reducing environmental and financial costs – often a 50% saving on costs related to the installation and maintenance of IT infrastructure in academic institutions. I was interested to learn more about the customer experience of moving from an on-premise system into the cloud.

My recent conversation with Paul Adriaans, IT Coordinator in Education at the Faculty of Law of the University of Maastricht in the Netherlands, gave me some insight into what prompted their move from Questionmark Perception, our on-premise assessment management system, to Questionmark OnDemand, our cloud-based solution. Paul explained how the transition worked for them.

What is your history with Questionmark?
We began using Questionmark Perception in the Faculty of Law at the University in 2007, and in December 2013 we moved to OnDemand.

What made you decide to go OnDemand?
With the on-premise system we always needed to get someone onsite to do the upgrades, as it required us to install the latest version of the software and set up our servers. This took planning, time and resources. We deployed the OnDemand system within a few hours, and the preparation was easy. Cost was another factor: the money spent internally to maintain our systems was higher than the cost of going OnDemand. Moving to OnDemand also gives us greater flexibility to expand and grow our usage in the future.

Did you have any concerns about moving to OnDemand?
Understandably we were cautious about security. With our on-premise installation, we were accustomed to being fully in control of our own data security. Moving to the cloud meant entrusting Questionmark with data protection, but this new approach provides excellent security while still giving us complete access to our data. The university worked with SURF, the collaborative ICT organisation for Dutch higher education and research, and online learning services provider Up Learning, to test and approve Questionmark’s security. (The Questionmark OnDemand environment is located in an ISO-accredited EU data center with multiple layers of security.)

How did you find the transition?
The upgrade process didn’t take a lot of work from our end, as we started with a clean database. We chose not to transfer old exams, users and schedules, but only to keep questions and a selection of the most recent exams. So we added all of the questions and exams that we still needed by exporting and importing QPacks (encrypted zip files) ourselves. We were able to have support from Up Learning as well as support from Questionmark’s help desk and customer care teams. They were very supportive and provided useful emails to guide us throughout the process.

How has the switch to Questionmark OnDemand affected your work?
OnDemand gives us access to all of the latest features as soon as they’re available. Due to the internal IT resources required to carry out an upgrade for our on-premise system, we were often several versions behind the OnDemand system. Now we always have the latest version and don’t have to worry about upgrading. If and when we decide to grow our assessment programme across the university, we know that OnDemand is flexible enough to accommodate that.

Online or test center proctoring: Which is more secure?

Posted by John Kleeman

As explained in my previous post Online or test center proctoring: Which is best?, a new way of proctoring certification exams is rapidly gaining traction. With online proctoring, candidates take exams at their office or home, with a proctor observing via video camera over the Internet.

[Image: parents scaling the walls of a building to help their children cheat]

The huge advantage of online proctoring is that the candidate doesn’t need to travel to a test center. This is fairer and saves a lot of time and cost. But how secure is online proctoring? You might at first sight think that test center proctoring is more secure – as it sounds easier to spot cheating in a controlled environment and face-to-face than online. But it’s not as simple as that.

The stakes for a candidate to pass an exam are often high, and there are many examples where proctors at test centers coach candidates or otherwise breach the integrity of the exam process. A proctor in a test center can witness the same test being taken over and over again, and they can start to memorize, and potentially sell, the content that they see. For example, according to a 2011 article in the Economist, one major test center company at that time was shutting down five test centers a week due to security concerns.

Test center vulnerabilities are not always as obvious as they are in the picture above, but they are myriad. This recent photo shows parents in India climbing the walls of a building to help their children pass exams, with proctors bribed to help. According to Standard Digital:

“Supervisors stationed at notorious test centres vie for the postings, enticed by the prospect of bribes from parents eager to have their wards scrape through.”

Proxy test taking – where one person takes a test impersonating another – is also a big concern in the industry. A 2014 Computerworld article quotes an expert saying:

“In some cases, proxies have been able to skirt security protocols by visiting corrupt testing facilities overseas that operate both a legitimate “front room” test area and a fraudulent “back room” operation.”

This doesn’t just happen in a few parts of the world: there are examples worldwide. For instance, there was a prominent case in the UK in 2014 where proctors were dishonest in a test used to check English knowledge for candidates seeking visas. According to a BBC report, in some tests the proctor read out the correct answers to candidates. And in another test, a candidate came to the test center and had their picture taken, but then a false sitter went on to take the test. An undercover investigator posing as a candidate was told:

“Someone else will sit the exam for you. But you will have to have your photo taken there to prove you were present.”

This wasn’t a small-scale affair – the UK government announced that at least 29,000 exam results were invalid due to this fraud.

Corrupt test centers have also been found in the US. In May 2015, a New York man was sentenced to jail for his role in a scheme in which five New York test centers allowed applicants for a commercial driving license to pay their way to passing the test. According to a newspaper report:

“The guards are accused of taking bribes to arrange for customers to leave the testing room with their exams, which they gave to a surrogate test-taker outside who looked up the answers on a laptop computer. The guards would allow the test-takers to enter and leave the testing rooms.”

There are many other examples of this kind of cheating at test centers – a good source of information is Caveon’s blog about cheating in the news. Caveon and Questionmark recently announced a partnership to enhance the security of high-stakes testing programs. The partnership with Caveon will also provide Questionmark’s customers with easy access to consulting services to help them enhance the security of their exams.
Of course, most test center proctors are honest and most test center exams are fair, but there are enough problems to raise concerns. Online proctoring has some security disadvantages, too:

  • Although improvements are being developed, it is harder for the proctor to check whether an ID is genuine when looking at it through a camera.
  • A remote camera in the candidate’s own environment is less capable of spotting some forms of cheating than a controlled environment in a test center.

But there are also genuine security advantages:

  • It is much harder for an online proctor to get to know a candidate well enough to coach him or her or to receive a payment to help in other ways.
  • Because proctors can be assigned randomly and without any geographic connection, it’s much less likely that the proctor and candidate will be able to pre-arrange any bad behavior.
  • All communication between proctor and candidate is electronic and can be logged, so the candidate cannot easily make an inappropriate approach during the exam.
  • While test center proctors have easy access to exam content which can lead to various types of security breaches, online proctors can be restricted from viewing the exam content through the use of such technologies as secure browsers.
  • Because there is less difficulty and cost involved in online proctoring than when the candidate travels to a physical test center, it’s practical to test more frequently – and this is a security benefit. If there is frequent testing, it may be simpler for a candidate to learn the material and pass the test honestly than to put a lot of effort into cheating. If you have several exams, you can also compare the pictures of a candidate at each exam to reduce the chance of impersonation.

In summary, the main reason for online proctoring is that it saves time and money over going to a bricks-and-mortar test center. The security advantages and disadvantages of test center versus online proctoring are open to debate, and dealing with security vulnerabilities requires constant vigilance. With new online proctoring technologies enhancing exam security, many certification programs are now transitioning away from test centers. Traditionally a test center was considered a secure place to administer exams, but in practice there have been so many incidents of proctor dishonesty over the years that online proctoring is likely justifiable for security reasons.

Simpson’s Paradox and the Steelyard Graph

Posted by Austin Fossey

If you work with assessment statistics or just about any branch of social science, you may be familiar with Simpson’s paradox—the idea that data trends between subgroups change or disappear when the subgroups are aggregated. There are hundreds of examples of Simpson’s paradox (and I encourage you to search some on the internet for kicks), but here is a simple example for the sake of illustration.

Simpson’s Paradox Example

Let us say that I am looking to get trained as a certified window washer so that I can wash windows on Boston’s skyscrapers. Two schools in my area offer training, and both had 300 students graduate last year. Graduates from School A had an average certification test score of 70.7%, and graduates from School B had an average score of 69.0%. Ignoring for the moment whether these differences are significant, as a student I will likely choose School A due to its higher average test scores.

But here is where the paradox happens. Consider now that I have a crippling fear of heights, which may be a hindrance for my window-washing aspirations. It turns out that School A and School B also track test scores for their graduates based on whether or not they have a fear of heights. The table below reports the average scores for these phobic subgroups.

[Table: average certification test scores at School A and School B for graduates with and without a fear of heights]
Notice anything? The average scores for people with and without a fear of heights in School B are higher than those for the same groups in School A. The paradox is that School A has a higher average test score overall, yet School B can boast better average test scores both for students with a fear of heights and for students without one. School B’s overall average is lower simply because it had more students with a fear of heights. If we want to test the significance of these differences, we can do so with ANOVA.
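
To make the arithmetic behind the paradox concrete, here is a minimal Python sketch. The subgroup counts and means are hypothetical (the actual subgroup values live in the table above); they were chosen only so that the weighted averages reproduce the overall scores of 70.7% and 69.0%. A formal significance test such as ANOVA would need the individual graduates’ scores rather than these summary figures.

```python
# Hypothetical subgroup counts and mean scores (not the real table values),
# chosen to reproduce the overall averages from the example:
# School A = 70.7%, School B = 69.0%.
subgroups = {
    "School A": {"fear of heights": (100, 60.0), "no fear": (200, 76.0)},
    "School B": {"fear of heights": (200, 63.0), "no fear": (100, 81.0)},
}

for school, groups in subgroups.items():
    total_students = sum(n for n, _ in groups.values())
    overall = sum(n * mean for n, mean in groups.values()) / total_students
    print(school)
    for label, (n, mean) in groups.items():
        print(f"  {label:<16} n={n:<3} mean={mean:.1f}")
    print(f"  overall mean = {overall:.1f}")

# School B beats School A within both subgroups, yet School A's overall mean
# is higher because School B has more graduates in the lower-scoring
# fear-of-heights subgroup.
```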

Gaviria and González-Barbera’s Steelyard Graph

Simpson’s paradox occurs in many different fields, but it is sometimes difficult to explain to stakeholders. Tables (like the one above) are often used to illustrate the subgroup differences, but in the Fall 2014 issue of Educational Measurement, José-Luis Gaviria and Coral González-Barbera from the Universidad Complutense de Madrid won the publication’s data visualization contest with their Steelyard Graph, which illustrates Simpson’s paradox with a graph resembling a steelyard balance. The publication’s visual editor, ETS’s Katherine Furgol Castellano, wrote the discussion piece for the Steelyard Graph, praising Gaviria and González-Barbera for the simplicity of the approach and the novel yet astute strategy of representing averages with balanced levers.

The figure below illustrates the same data from the table above using Gaviria and González-Barbera’s Steelyard Graph approach. The size of the squares corresponds to the number of students, the location on the lever indicates the average subgroup score, and the triangular fulcrum represents the school’s overall average score. Notice how clear it is that the subgroups in School B have higher average scores than their counterparts in School A. The example below has only two subgroups, but the same approach can be used for more subgroups.


Example of Gaviria and González-Barbera’s Steelyard Graph to visualize Simpson’s paradox for subgroups’ average test scores.
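
For readers who want to experiment with this kind of display, here is a rough matplotlib sketch of the steelyard idea, using the same hypothetical counts and means as the earlier sketch rather than the authors’ actual data or code: square markers sized by subgroup n sit on a lever at the subgroup means, and a triangle marks the fulcrum at each school’s overall mean.

```python
import matplotlib.pyplot as plt

# Same hypothetical subgroup data as the earlier sketch (not the authors' data).
schools = {
    "School A": {"fear of heights": (100, 60.0), "no fear": (200, 76.0)},
    "School B": {"fear of heights": (200, 63.0), "no fear": (100, 81.0)},
}

fig, ax = plt.subplots(figsize=(8, 3))
for y, (school, groups) in enumerate(schools.items()):
    counts = [n for n, _ in groups.values()]
    means = [m for _, m in groups.values()]
    overall = sum(n * m for n, m in groups.values()) / sum(counts)

    ax.hlines(y, min(means), max(means), colors="gray")             # the lever
    ax.scatter(means, [y] * len(means), s=[n * 2 for n in counts],
               marker="s", zorder=3)                                # subgroup "weights"
    ax.scatter([overall], [y - 0.15], marker="^", s=150,
               color="black", zorder=3)                             # fulcrum = overall mean

ax.set_yticks(range(len(schools)))
ax.set_yticklabels(list(schools))
ax.set_ylim(-0.6, len(schools) - 0.4)
ax.set_xlabel("Average test score (%)")
ax.set_title("Steelyard-style view of Simpson's paradox (sketch)")
plt.tight_layout()
plt.show()
```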

Making a Decision when Faced with Simpson’s Paradox

When one encounters Simpson’s paradox, decision-making can be difficult, especially if there are no theories to explain why the relational pattern is different at a subgroup level. This is why exploratory analysis often must be driven by and interpreted through a lens of theory. One could come up with arbitrary subgroups that reverse the aggregate relationships, even though there is no theoretical grounding for doing so. On the other hand, relevant subgroups may remain unidentified by researchers, though the aggregate relationship may still be sufficient for decision-making.

For example, as a window-washing student seeing the phobic subgroups’ performances, I might decide that School B is the superior school for teaching the trade, regardless of which subgroup a student belongs to. This decision is based on a theory that a fear of heights may impact performance on the certification assessment, in which case School B does a better job at preparing both subgroups for their assessments. If that theory is not tenable, it may be that School A is really the better choice, but as an acrophobic would-be window washer, I will likely choose School B after seeing this graph . . . as long as the classroom is located on the ground floor.

When to weight items differently in CTT

Posted by Austin Fossey

In my last post, I explained the statistical futility and interpretive quagmires that result from using negative item scores in Classical Test Theory (CTT) frameworks. In this post, I wanted to address another question I get from a lot of customers: when can we make one item worth more points?

This question has come up in a couple of cases. One customer wanted to make “hard” items on the assessment worth more points (with difficulty being determined by subject-matter experts). Another customer wanted to make certain item types worth more points across the whole assessment. In both cases, I suggested they weight all of the items equally.

Interested in learning more about classical test theory and applying item analysis concepts? Join Psychometrician Austin Fossey for a free 75 minute online workshop — Item Analysis: Concepts and Practice — Tuesday, June 23, 2015 *space is limited

Before I reveal the rationale behind the recommendation, please permit me a moment of finger-wagging. The impetus behind these questions was that these test developers felt that some items were somehow better indicators of the construct, so certain items seemed like more important pieces of evidence than others. If we frame the conversation as a question of relative importance, then we recognize that the test blueprint document should contain all of the information about the importance of domain content, as well as about how the assessment should be structured to reflect those evaluations. If the blueprint cannot answer these questions, then it may need to be modified. Okay, wagging finger back in its holster.

In general, weights should be applied at a subscore level that corresponds to the content or process areas on the blueprint. A straightforward way to achieve this structure is to present a lot of items. For example, if Topic A is supposed to be 60% of the assessment score and Topic B is supposed to be 40% of the assessment score, it might be best to ask 60 questions about Topic A and 40 questions about Topic B, all scored dichotomously [0,1].

There are times when this is not possible. Certain item formats may be scored differently or be too complex to deliver in bulk. For example, if Topic B is best assessed with long-format essay items, it might be necessary to have 60 selected response items in Topic A and four essays in Topic B—each worth ten points and scored on a rubric.

Example of a simple blueprint where items are worth more points due to their topic’s relative importance (weight)

The critical point is that the content areas (e.g., Topics) are driving the weighting, and all items within the content area are weighted the same. Thus, an item is not worth more because it is hard or because it is a certain format; it is worth more because it is in a topic that has fewer items, and all items within the topic are weighted more because of the topic’s relative importance on the test blueprint.
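
As a concrete illustration, here is a minimal Python sketch (using the hypothetical 60/40 blueprint from the example above) showing how per-item point values fall out of the topics’ blueprint weights and item counts rather than out of judgments about individual items.

```python
# Hypothetical blueprint from the example: topic weight (% of total score)
# and the number of items delivered for that topic.
blueprint = {
    "Topic A (selected response)": (60, 60),
    "Topic B (essays)": (40, 4),
}

for topic, (weight, n_items) in blueprint.items():
    points_per_item = weight / n_items
    print(f"{topic}: {n_items} items x {points_per_item:g} points each "
          f"= {weight} points ({weight}% of the total score)")

# Every item within a topic carries the same weight; an essay is worth ten
# points only because Topic B's 40% weight is spread across four items.
```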

One final word of caution. If you do choose to weight certain dichotomous items differently, regardless of your rationale, remember that it may bias the item-total correlation discrimination. In these cases, it is best to use the item-rest correlation discrimination statistic, which is provided in Questionmark’s Item Analysis Report.
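
For reference, here is a small NumPy sketch of the two statistics using a made-up response matrix in which the last item is worth more points; these are the standard CTT formulas, not the code behind Questionmark’s Item Analysis Report.

```python
import numpy as np

# Hypothetical scored responses (rows = test takers, columns = items).
# The first four items are dichotomous [0,1]; the last is worth 2 points.
scores = np.array([
    [1, 1, 0, 1, 2],
    [1, 0, 0, 1, 1],
    [0, 1, 1, 0, 2],
    [1, 1, 1, 1, 2],
    [0, 0, 0, 0, 0],
    [1, 0, 1, 1, 1],
], dtype=float)

total = scores.sum(axis=1)
for j in range(scores.shape[1]):
    item = scores[:, j]
    item_total = np.corrcoef(item, total)[0, 1]        # item vs. full total score
    item_rest = np.corrcoef(item, total - item)[0, 1]  # item vs. total minus the item
    print(f"item {j + 1}: item-total r = {item_total:.2f}, "
          f"item-rest r = {item_rest:.2f}")

# A heavily weighted item contributes a larger share of the total score, which
# tends to inflate its own item-total correlation; correlating against the
# rest score (total minus the item) removes that bias.
```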


Exams and social media: is it really spying?

Posted by Steve Lay

While I was traveling back from our US Users Conference several weeks ago, a debate was raging on social media following news that a testing company had been monitoring Twitter to detect evidence of leaked content. The Guardian newspaper, for example, reported that a New Jersey superintendent had found this ‘disturbing’.

In case you haven’t read about this case, here are the basics: after school, a student tweeted information about a test administered earlier that day. An automated Web monitoring system discovered the tweet, and the school was notified. The student later deleted the offending tweet.

According to the test provider, administrators are supposed to tell participants that sharing any test question online is prohibited. It isn’t clear from the press reports whether this warning was issued prior to the test or whether the student would have considered the tweet prohibited or not. Whatever the case may be, enough information was shared to trigger the automated warning.

Perhaps more interesting than the story itself is the reaction to it. Strong words have been used, but should monitoring social media really be regarded as spying?

The monitoring of online forums to check for exam leaks is not new; it goes back to the very earliest days of the Internet. When I first read about this case, my first reaction was that this type of thing is happening all the time. Indeed, brand owners constantly monitor social media to help them understand the public’s reaction to their products and services and to target their advertising more effectively. Copyright owners also monitor the web to check for infringement. Trademark owners must proactively monitor for misuse to prevent their trademarks from becoming unenforceable. So if an organization has such rights, wouldn’t monitoring the web, including social media, to enforce them be expected?

This assumption is probably naive. Many people are not aware that this information is available in a form that can be subscribed to. They do not understand the subtle difference between a comment being made in a ‘public place’ like Twitter and it being instantly discoverable. In our everyday experience, a conversation that happens in a public place like a café or store is not recorded, transcribed and then made instantly available to business partners of the venue. In this case, the student, the student’s parents and even the superintendent were surprised and shocked by the level of surveillance. They reacted as if a private conversation had been overheard.

It is interesting to contrast this recent case with one reported by TechCrunch in 2009, when information from Facebook was used to hold students to account for cheating. But in the Facebook case, the information was discovered by other students and brought to the attention of the test authorities. Why would the students do that? Likely because test takers are key stakeholders too! If cheating becomes commonplace, then the test will become worthless. So both the test publisher and the test taker have an interest in ensuring fair practice.

Coming back to the rogue tweet, what’s frustrating here is that there is no suggestion that the test taker was trying to cheat or to help someone else cheat. I haven’t seen the 140 characters in question, but it seems likely that the tweet was just a trivial extension of the type of verbal conversation that people frequently have after taking tests.

The mismatch in privacy expectations and the feeling that the student was being accused of malpractice were a toxic mix. Both of these can be avoided.

When monitoring people using CCTV or similar technologies, it is good practice to inform people that they are being monitored, and for what purpose. In many jurisdictions this may also be a legal requirement. Likewise, why not inform test takers of the type of monitoring that is taking place and why? This may have the added advantage of helping to inform them about the risks to their own privacy that over-sharing on social media can pose.

Also, when issues are flagged by monitoring services, test publishers should think carefully about any follow-up actions. Are these actions consistent with the stated purpose of the monitoring? Are they proportionate? Remember, the test taker and the test publisher should be on the same side!
