Since the dawn of behavioral research, psychologists have faced the problem of missing data--data unavailable for any number of reasons. Sometimes study participants fail to complete all the items on a survey or all the assigned conditions in an experiment. Other times, there are problems with data recording.
Deciding how to handle those gaps can be tricky business. Traditionally, researchers have simply deleted participants with missing data from their analyses. But that could skew conclusions, particularly in studies that rely on random samples to draw conclusions about entire populations. Delete more than a few participants because of missing data, and suddenly your sample is no longer random.
"Missing data is one of the most important statistical and design problems in research," says University of Memphis researcher and methodologist William Shadish, PhD.
Now, a statistician and a psychologist from the Methodology Center at Pennsylvania State University are proposing solutions that would allow researchers to retain the integrity of their data sets. In the June issue of Psychological Methods (Vol. 7, No. 2) Joseph Schafer, PhD, and John Graham, PhD, argue that psychologists' research would benefit from the use of one of two sophisticated, new statistical techniques--Bayesian multiple imputation and maximum likelihood--designed to deal with missing data.
In fact, without them, study conclusions could be severely biased. Other methodologists, along with seasoned researchers and journal editors, couldn't agree more.
"Routine implementation of these new methods of addressing missing data will be one of the major changes in research over the next decade," says Arizona State University psychologist and Psychological Methods Editor Stephen West, PhD.
Ignoring the problem
The methods proposed by Schafer and Graham have been around for more than a decade--versions of them were originally developed by Harvard University statistician Donald B. Rubin, PhD--but few researchers use them. Instead, most researchers use simplistic methods such as pair-wise and list-wise deletion to drop participants without complete data sets.
"It's safe to say that the traditional treatment of missing data is to pretend that the problem will just go away by itself," says University of Rochester researcher Harry Reis, PhD, chair of APA's Board of Scientific Affairs.
The problem with that tactic is the risk that study conclusions could become biased when what started out as a random sample becomes sullied with the deletion of certain participants.
"The concern is that individuals who refuse to respond to questions or questionnaires may be systematically different from people who agree to take part," explains University of Missouri psychologist Harris Cooper, PhD, editor of Psychological Bulletin. "And those differences could be related to variables of interest in the study."
The problem exists for all types of research, but it's particularly an issue for studies looking at real-world problems, whether conducted in the field or in the laboratory.
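The bias Cooper describes is easy to reproduce in a small simulation. The sketch below is illustrative only: the survey scores, the `skip_probability` rule and the sample size are all assumptions made up for the example, not data from any study. It shows how the average shifts once participants whose answers are tied to their chance of responding are dropped.

```python
import random
import statistics


def skip_probability(score):
    """Assumed nonresponse rule: higher scorers skip the item more often."""
    return min(max((score - 30) / 40, 0.0), 1.0)


def simulate_listwise_bias(n=20000, seed=0):
    """Return (full-sample mean, mean after deletion, participants kept)."""
    rng = random.Random(seed)
    # Hypothetical scores for every sampled participant (true mean 50).
    scores = [rng.gauss(50, 10) for _ in range(n)]
    # Listwise deletion: keep only the participants who answered.
    observed = [s for s in scores if rng.random() >= skip_probability(s)]
    return statistics.fmean(scores), statistics.fmean(observed), len(observed)


full_mean, complete_case_mean, n_kept = simulate_listwise_bias()
print(f"mean of everyone sampled:     {full_mean:.2f}")
print(f"mean after listwise deletion: {complete_case_mean:.2f}")
print(f"participants kept:            {n_kept}")
```

Because the likeliest nonrespondents in this made-up sample are the high scorers, the mean of the remaining participants falls below the mean of the full sample, even though the original sample was drawn at random.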
The methods Schafer and Graham recommend can preserve the integrity of a data set by using statistics to fill in probable values for missing information, thereby allowing researchers to make more accurate conclusions about the population under study.
NASA uses techniques of this sort to fill in holes in the pictures sent back from space probes, explains Graham. "Some of the pictures coming back from the ships--like the ones that circle Mars--are quite crude," he says. "They use enhancement procedures to fill in the blanks in order to get a better idea of what a particular area might look like."
In a similar way, multiple imputation and maximum likelihood effectively "fill in the blanks" left by study participants to allow researchers a more accurate picture of the study sample as a whole. "We're not trying to find out what a single person would have said," explains Graham. "We're preserving aspects of the data so we can generalize from the sample."
These analyses allow researchers to draw conclusions that are more accurate than those they could draw if they simply omitted participants from their analysis.
Getting it right
The researchers describe two general missing data techniques: multiple imputation and maximum likelihood. Both attempt to fill in missing data with plausible values.
Multiple imputation examines the range of plausible values for each missing item and draws several of them at random. A researcher ends up with several credible data sets on which to do his or her analyses. The results are then averaged to produce sound estimates of the parameters of interest, along with standard errors and confidence intervals that reflect the uncertainty added by the missing data.
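The fill-in, analyze and combine steps described above can be sketched in a few lines. This is a simplified illustration, not the procedure in Schafer and Graham's article or in any statistical package: it assumes one fully observed variable `x` and one partly missing variable `y`, uses a bootstrap as a stand-in for drawing the imputation model at random, and pools the results with Rubin's combining rules. The function names and the simulated data are hypothetical.

```python
import random
import statistics


def fit_line(xs, ys):
    """Least-squares fit of y = a + b*x; returns (a, b, residual SD)."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    a = my - b * mx
    sse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    return a, b, (sse / (len(xs) - 2)) ** 0.5


def multiply_impute_mean(xs, ys, m=20, seed=0):
    """Pooled mean of y and its standard error (None marks a missing y)."""
    rng = random.Random(seed)
    complete = [(x, y) for x, y in zip(xs, ys) if y is not None]
    estimates, variances = [], []
    for _ in range(m):
        # Resample the complete cases so each imputation model differs.
        boot = [rng.choice(complete) for _ in complete]
        a, b, sd = fit_line([p[0] for p in boot], [p[1] for p in boot])
        # Fill each gap with a prediction plus random noise.
        filled = [y if y is not None else a + b * x + rng.gauss(0, sd)
                  for x, y in zip(xs, ys)]
        # Analyze the completed data set as if nothing had been missing.
        estimates.append(statistics.fmean(filled))
        variances.append(statistics.variance(filled) / len(filled))
    pooled = statistics.fmean(estimates)
    within = statistics.fmean(variances)      # average sampling variance
    between = statistics.variance(estimates)  # spread across imputations
    total_variance = within + (1 + 1 / m) * between
    return pooled, total_variance ** 0.5


# Hypothetical data: the true mean of y is 2, and y goes missing far more
# often when x is above zero.
rng = random.Random(1)
xs = [rng.gauss(0, 1) for _ in range(2000)]
true_ys = [2 + 1.5 * x + rng.gauss(0, 1) for x in xs]
ys = [None if rng.random() < (0.8 if x > 0 else 0.1) else y
      for x, y in zip(xs, true_ys)]

complete_case_mean = statistics.fmean(y for y in ys if y is not None)
pooled_mean, pooled_se = multiply_impute_mean(xs, ys)
print(f"listwise deletion mean:   {complete_case_mean:.2f}")
print(f"multiple imputation mean: {pooled_mean:.2f} (SE {pooled_se:.2f})")
```

In this made-up example the deletion-based mean lands well below the true value of 2, while the pooled estimate sits close to it, and its standard error includes the extra spread between imputed data sets.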
Maximum likelihood folds the handling of missing data and whatever analysis a researcher wants to run into a single step. It's faster, but rather than standing on its own like imputation, it has to be built separately into each statistical technique, such as multiple regression analysis and structural equation modeling. So far, only a few statistical packages have integrated the technique, but Graham expects that to change over the next few years. And at least one of the major statistical packages, SAS, has included multiple imputation as an "experimental" technique.
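For a sense of what "a single step" means, consider one textbook special case, offered here as an illustration rather than as the method any package uses: two roughly normal variables, with `x` always recorded and `y` sometimes missing for reasons related only to `x`. In that case the maximum likelihood estimate of the mean of `y` has a closed form that draws on every participant's `x`. The function name and simulated data below are hypothetical.

```python
import random
import statistics


def ml_mean_with_missing_y(xs, ys):
    """ML estimate of mean(y) when only y has gaps (None marks missing)."""
    complete = [(x, y) for x, y in zip(xs, ys) if y is not None]
    cx = [x for x, _ in complete]
    cy = [y for _, y in complete]
    mx_c, my_c = statistics.fmean(cx), statistics.fmean(cy)
    # Slope of y on x, estimated from the complete cases.
    slope = (sum((x - mx_c) * (y - my_c) for x, y in complete)
             / sum((x - mx_c) ** 2 for x in cx))
    # Shift the complete-case mean of y by how far the complete cases'
    # x values sit from the x mean of the whole sample.
    return my_c + slope * (statistics.fmean(xs) - mx_c)


# Hypothetical data: the true mean of y is 2, and y goes missing far more
# often when x is above zero.
rng = random.Random(1)
xs = [rng.gauss(0, 1) for _ in range(2000)]
true_ys = [2 + 1.5 * x + rng.gauss(0, 1) for x in xs]
ys = [None if rng.random() < (0.8 if x > 0 else 0.1) else y
      for x, y in zip(xs, true_ys)]

complete_case_mean = statistics.fmean(y for y in ys if y is not None)
ml_mean = ml_mean_with_missing_y(xs, ys)
print(f"listwise deletion mean:  {complete_case_mean:.2f}")
print(f"maximum likelihood mean: {ml_mean:.2f}")
```

No values are filled in and no data sets are averaged. The estimate comes straight from the observed data, which is why the approach is fast but has to be worked out anew for each kind of analysis.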
Integrating these techniques into mainstream software will be key if researchers are going to begin using them, says Shadish. "The truth is, these kinds of statistical developments don't get used until they get built into the most popular statistical packages."
That said, the techniques proposed by Schafer and Graham are not difficult, particularly if people use free software available on the Internet (see Graham's resource page at http://methodology.psu.edu/resources.html), says Shadish. And the benefits can't be overestimated.
"If we want to get it right, this is the way to do that," says Reis. "It's clear that [the issue of missing data] is problematic. How much so, we'll never know until we start using these new techniques."
Schafer and Graham hope their article is a first step in that direction. Now that they've laid the issue out in the premier psychology methods journal, they expect journal editors and reviewers to at least be more aware of the hazards of missing data when examining articles submitted for publication.
Shadish admits that he already rejects articles he reviews if he sees significant amounts of missing data and the researchers aren't using sophisticated methods to deal with them. Few journal editors have taken a stance on the issue, but that could change, he says.
"Journal editors are the gatekeepers," says Reis. "If they don't permit [people to delete data], it will stop."
Beth Azar is a writer in Portland, Ore.