Discussing data mining at NCME

Posted by Austin Fossey

We will wrap up our discussion of themes at the National Council on Measurement in Education (NCME) annual meeting with an overview of the inescapable discussion about working with complex, and often messy, data sets.

It was clear from many of the presentations and poster sessions that technology is driving the direction of assessment, for better or for worse (or as Damian Betebenner put it, “technology eats statistics”). Advances in technology have allowed researchers to examine new statistical models for scoring participants, identify aberrant responses, score performance tasks, identify sources of construct-irrelevant variance, diversify item formats, and improve reporting methods.

As the symbiotic knot between technology and assessment grows tighter, many researchers and test developers are in the unexpected position of having too much data. This is especially true in complex assessment environments that yield log files with staggering amounts of information about a participant’s actions within an assessment.

Log files can track many types of data in an assessment, such as responses, click streams, and system states. All of these data are time-stamped, and if they capture the right data, they can illuminate some of the cognitive processes that manifest themselves through the participant’s interaction with the assessment. Raw assessment data like Questionmark’s Results API OData Feeds can also be coupled with institutional data, greatly expanding the range of research questions we can pursue within a single organization.
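To make this concrete, here is a rough sketch of how a time-stamped event log might be coupled with an institutional roster using Python and pandas. The file names, column names, and event structure are invented for illustration; they are not Questionmark’s actual Results API OData schema.

```python
# Illustrative sketch only: the file names, column names, and event types here
# are hypothetical, not Questionmark's actual Results API OData schema.
import pandas as pd

# Time-stamped event log: one row per participant action (response, click, system state)
events = pd.read_csv("assessment_log.csv", parse_dates=["timestamp"])

# Institutional data keyed by the same participant identifier
roster = pd.read_csv("institutional_roster.csv")

# Couple the two sources so behavioral data can be analyzed alongside
# institutional variables (e.g., program, prior coursework)
combined = events.merge(roster, on="participant_id", how="left")

# Example derived variable: time elapsed between consecutive actions per participant
combined = combined.sort_values(["participant_id", "timestamp"])
combined["seconds_since_last_action"] = (
    combined.groupby("participant_id")["timestamp"].diff().dt.total_seconds()
)
print(combined.head())
```

Even a simple derived variable like time between actions can become the raw material for research questions about pacing, engagement, or aberrant behavior once it is joined to institutional data.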

NCME attendees learned about hardware and software that capture both response variables and behavioral variables from participants as they complete an online learning task.

Several presenters discussed issues and strategies for addressing less-structured data, with many papers tackling log file data gathered as participants interact with an online assessment or other online task. Ryan Baker (International Educational Data Mining Society) gave a talk about combining data mining of log files with field observations to identify hard-to-capture domains, like student engagement.

Baker focused on the positive aspects of having oceans of data, choosing to remain optimistic about what we can do rather than dwell on the difficulties of iterative model building in these types of research projects. He shared examples of intelligent tutoring systems designed to teach students while also gathering data about the student’s level of engagement with the lesson. These examples were peppered with entertaining videos of researchers in classrooms playing with their phones so that individual students would not realize they were being subtly observed via sidelong glances.

Evidence-centered design (ECD) emerged as a consistent theme: there was a lot of conversation about how researchers are designing assessments so that they yield fruitful data for the intended inferences. Nearly every presentation about assessment development referenced ECD. Valerie Shute (Florida State University) observed that five years ago, only a fraction of participants would have known about ECD, but today it is widely used by practitioners.

Understanding Assessment Validity: An Introduction

Posted by Greg Pope

In previous posts I discussed some of the theory and applications of classical test theory and test score reliability. For my next series of posts, I’d like to explore the exciting realm of validity. I will discuss some of the traditional thinking in the area of validity as well as some new ideas, and I’ll share applied examples of how your organization could undertake validity studies.

According to the “standards bible” of educational and psychological testing, the Standards for Educational and Psychological Testing (AERA, APA, & NCME, 1999), validity is defined as “The degree to which evidence and theory support the interpretations of test scores entailed by proposed uses of tests.”

The traditional thinking around validity, familiar to most people, is that there are three main types:

[Figure: the three traditional types of validity: content-related, criterion-related, and construct-related]

The most recent thinking on validity takes a more unified approach, which I will explore in more detail in upcoming posts.

Now here is something you may have heard before: “In order for an assessment to be valid, it must be reliable.” What does this mean? Well, as we learned in previous Questionmark blog posts, test score reliability refers to how consistently an assessment measures the same thing. One criterion for making the statement, “Yes, this assessment is valid,” is that the assessment must have acceptable test score reliability, such as high Cronbach’s Alpha test reliability index values like those found in the Questionmark Test Analysis Report and Results Management System (RMS). Other criteria for making that statement are to show evidence of criterion-related validity, content-related validity, and construct-related validity.
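For readers curious about what is behind that index, here is a minimal sketch of the Cronbach’s Alpha calculation in Python, using a small, invented score matrix. In practice the Test Analysis Report and RMS compute this value for you; the data and function below are purely illustrative.

```python
# Minimal sketch of the Cronbach's Alpha calculation; the score matrix below
# is invented for illustration, not real assessment data.
import numpy as np

def cronbachs_alpha(scores: np.ndarray) -> float:
    """scores: participants x items matrix of item scores."""
    k = scores.shape[1]                          # number of items
    item_vars = scores.var(axis=0, ddof=1)       # variance of each item
    total_var = scores.sum(axis=1).var(ddof=1)   # variance of participants' total scores
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Five participants answering four dichotomously scored items (hypothetical data)
scores = np.array([
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [1, 1, 1, 1],
    [0, 0, 0, 0],
])
print(f"Cronbach's Alpha: {cronbachs_alpha(scores):.2f}")  # -> Cronbach's Alpha: 0.80
```

The calculation compares the sum of the individual item variances to the variance of the total scores: the more consistently the items vary together, the higher the Alpha value.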

In my next posts on this topic I will provide some illustrative examples of how organizations might investigate each of these traditionally defined types of validity for their assessment programs.