From the Science Student Council
Gaps in the spreadsheet
By Candace Corbeil
Missing data are a common occurrence in social and behavioral science research. For instance, participants may not answer all of the items in a questionnaire. Researchers sometimes drop incomplete cases from the data set entirely (listwise deletion) or drop incomplete cases on an analysis-by-analysis basis (pairwise deletion). Although these procedures are standard options in many statistical software packages, they rely on strict statistical assumptions about why the data are missing. Unfortunately, these assumptions are rarely met. Consequently, using these procedures can produce inaccurate parameter estimates (e.g., means, standard deviations). And even if the assumptions are met, listwise and pairwise deletion can substantially reduce the power of the statistical tests performed.
Modern missing data procedures, such as multiple imputation and full information maximum likelihood estimation, provide a much better way to deal with the issue of missing data. These procedures rely on less strict assumptions about why the data are missing, and will produce unbiased parameter estimates. They also do not negatively impact statistical power. With recent advances in software, modern missing data procedures can now be performed in many statistical software packages (e.g., SPSS, SAS and R).
Addressing Missing Data: A Short How-to Guide
As a first step, you should examine the missing data patterns within your data set, and determine which variables have missing data. Then, to get a better idea of why the values are missing, it is important to consider the three main mechanisms for missing data:
- Missing Completely at Random (MCAR): Data are MCAR when missing values for one variable are completely unrelated to other observed variables in the data set and are unrelated to the values of that given variable. For example, in a data set, missing values for the variable "binge drinking" may be completely unrelated to other variables in the data set. Finding that your data are MCAR is generally an optimal, but often unrealistic, scenario.
- Missing at Random (MAR): Data are MAR when missing values for a variable are related to other observed variable(s) but not to the values of that given variable. As an example, in a given data set, the variable "binge drinking" may have a significant amount of missing data. The data would be MAR if participants with high levels of another variable, such as religiosity, were found to be less likely to complete the item related to binge drinking. In other words, highly religious participants tended to not answer the binge drinking item, but responding to the binge drinking item was unrelated to participants’ binge drinking behavior. As you can see, the term MAR is misleading as it seems to suggest that the data are missing completely at random, which is not the case.
- Missing Not at Random (MNAR): Data are MNAR when missing values for a variable are related to the values of that given variable, after controlling for other variables. For example, "drug use" may be MNAR if participants who engage in high levels of drug use are more likely to miss class on the day of a school-based survey, and therefore they do not complete the survey at all. In this case, missing values on the "drug use" items are directly related to participants’ drug use behavior.
Of the three missing data mechanisms, MCAR is the only one that can be tested (Enders, 2010). There are several methods to test MCAR (Enders, 2010). One way to determine if your data are MCAR is to conduct a series of independent t-tests with all variables that you intend to include in your statistical analysis. More specifically, you would separate the missing and complete cases for a given variable and conduct t-tests to examine mean differences for other main variables. Under the MCAR mechanism, cases with observed data should be similar to cases with missing values. Therefore, a nonsignificant t-test is evidence that the data are MCAR, whereas a significant t-test suggests that the data are MAR or MNAR.
Here is a practical example showing an MCAR and non-MCAR case using the same data set. Say you have data for 50 participants on height and weight. You have weight data for all 50 participants, but you only have height data for 25. To determine if "height" is MCAR, you would first separate participants on the basis of whether they reported their height. Therefore, you would have one group of participants who reported their height and another group of participants who did not. Next, you would conduct a t-test to examine whether these two groups of participants differ with regard to their mean weight.
Case 1. The data are MCAR: Participants who did not report their height have a comparable mean weight compared to participants who did report their height. In other words, weight has no relation to whether participants reported their height. The t-test statistic is not significant, and you conclude that "height" is MCAR.
Case 2. The data are not MCAR: For reasons that are unclear, participants who did not report their height have a lower mean weight compared to participants who did report their height. In other words, weight is related to whether participants reported their height. The t-test statistic is significant, and you conclude that "height" is not MCAR.
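The height-and-weight check above can be sketched in a few lines of code. This is a minimal illustration in Python (the article's own tools are SPSS, SAS and R; Python with scipy is used here only as a stand-in), and the simulated data are hypothetical — missingness is assigned at random, so the example mimics Case 1:

```python
import numpy as np
from scipy import stats

# Hypothetical data for 50 participants: weight observed for everyone,
# height missing for 25 of them (np.nan marks a missing value).
rng = np.random.default_rng(42)
weight = rng.normal(70, 10, size=50)
height = rng.normal(170, 8, size=50)
height[rng.choice(50, size=25, replace=False)] = np.nan

# Split weight according to whether height was reported.
reported = ~np.isnan(height)
w_reported = weight[reported]
w_missing = weight[~reported]

# Independent t-test on mean weight between the two groups.
t_stat, p_value = stats.ttest_ind(w_reported, w_missing)

# A nonsignificant result is consistent with "height" being MCAR;
# a significant result suggests the data are MAR or MNAR.
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")
```

In practice you would repeat this comparison for each of the other main variables in your analysis, as described above.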
Pairwise or listwise deletion can technically be used if data are MCAR; however, using these methods can significantly reduce statistical power. Therefore, Enders (2010) recommends using modern missing data procedures even if data are MCAR.
Dealing with MCAR or MAR Data: Using Multiple Imputation
Multiple imputation is one state-of-the-art missing data procedure (Schafer & Graham, 2002). Methodologists strongly recommend multiple imputation over traditional missing data procedures, such as listwise and pairwise deletion, because it produces unbiased estimates with MCAR and MAR data, and cases do not have to be discarded (Baraldi & Enders, 2010).
Multiple imputation assumes that data are MAR, but it is still appropriate to use with MCAR data (Enders, 2010). It also assumes that data are normally distributed. However, non-normal data should not discourage you from using multiple imputation; studies have found that using multiple imputation with non-normal data produces accurate parameter estimates and standard errors if the sample size is relatively large (i.e., 400+) (Demirtas, Freels, & Yucel, 2008).
There are three steps involved in multiple imputation: imputation, analysis and pooling. During the imputation phase, you must decide which variables to include in the imputation model. You should include any variable that you intend to use in further statistical analyses. In addition, it is also important to include any variables that were related to “missingness” in the data set. For example, perhaps while conducting your MCAR independent t-tests, you find that participants who are missing values for the variable "binge drinking" have a lower mean age. Therefore, along with your main variables of interest ("binge drinking" and any other main variables), you would also include age in the imputation model.
After selecting variables for the imputation model, your statistical software program will ask you how many copies of the data set it should create. It will then create this number of data sets (m copies), each of which contains unique, but plausible estimates of the missing values. How many imputed data sets are needed? Graham, Olchowski and Gilreath (2007) recommend 20 imputed data sets for 10-30 percent missing data, 40 imputed data sets for 50 percent missing data, and 100 for 70 percent missing data.
In the analysis phase, you will conduct the statistical analysis of choice (e.g., logistic regression) and the program will analyze each of the m imputed data sets. Therefore, if you have 20 imputed data sets, the program will generate 20 parameter estimates and standard errors.
Instead of using the results from any single imputed data set, a multiple imputation analysis pools, or averages, the m parameter values into a single point estimate. You then report the pooled results in your research presentation or manuscript.
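The three phases — imputation, analysis and pooling — can be sketched end to end. This is an illustrative Python sketch, not one of the packages named below: scikit-learn's IterativeImputer stands in for a dedicated multiple imputation routine, the regression of y on x stands in for your analysis of choice, and the data and variable names are hypothetical. The pooling step follows Rubin's rules (average the m estimates; combine within- and between-imputation variance):

```python
import numpy as np
from scipy import stats
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Hypothetical data: y depends on x, and x has roughly 30% missing values.
rng = np.random.default_rng(0)
n = 200
x = rng.normal(0, 1, n)
y = 2.0 * x + rng.normal(0, 1, n)
x_obs = x.copy()
x_obs[rng.random(n) < 0.3] = np.nan

data = np.column_stack([x_obs, y])
m = 20  # number of imputed data sets (per Graham et al., 2007, for 10-30% missing)

estimates, variances = [], []
for i in range(m):
    # Imputation phase: each run draws different plausible values
    # (sample_posterior=True adds the needed randomness between copies).
    imputer = IterativeImputer(sample_posterior=True, random_state=i)
    completed = imputer.fit_transform(data)
    # Analysis phase: the analysis of choice, run on each imputed data set.
    result = stats.linregress(completed[:, 0], completed[:, 1])
    estimates.append(result.slope)
    variances.append(result.stderr ** 2)

# Pooling phase (Rubin's rules): average the m estimates, and combine
# within-imputation and between-imputation variance for the pooled SE.
q_bar = np.mean(estimates)            # pooled point estimate
u_bar = np.mean(variances)            # within-imputation variance
b = np.var(estimates, ddof=1)         # between-imputation variance
total_var = u_bar + (1 + 1 / m) * b
print(f"pooled slope = {q_bar:.3f}, pooled SE = {np.sqrt(total_var):.3f}")
```

The packages described below automate all three of these phases, so you would rarely pool by hand; the sketch is only meant to make the logic of each phase concrete.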
Software Options
Software packages, such as SPSS, SAS and R, are capable of performing multiple imputation. The missing values add-on in SPSS performs the series of independent t-tests. Additionally, SPSS automates the analysis and pooling phases. The SPSS imputation procedure presents the imputations in a single file, with an identification variable attached to each data set. Although the pooling feature does not work with all statistical procedures in SPSS, it does work for many common analyses (e.g., multiple regression).
SAS also has a multiple imputation procedure (PROC MI). Similar to SPSS, SAS includes the imputations in a single file and assigns an identification number to each data set. The MIANALYZE procedure is able to pool the estimates and standard errors from the data sets.
There are several packages in R that are able to perform multiple imputation, such as mice (van Buuren & Groothuis-Oudshoorn, 2011), mi (Su, Gelman, Hill & Yajima, 2011) and Amelia II (Honaker, King & Blackwell, 2011).
See Enders (2010) for a discussion of other statistical software packages that can perform multiple imputation and other modern missing data procedures.
Reporting the Results
Although the use of multiple imputation and other missing data procedures is increasing, many modern missing data procedures are still largely misunderstood. As such, it is advisable to include a brief description in the results section that details the missing data procedure that was used (Enders, 2010). In the case of multiple imputation, researchers could provide information about the imputation, analysis and pooling phases.
Moving Forward
This article contained steps to get you started in applying one modern missing data procedure — multiple imputation. Readers interested in learning more about multiple imputation or other modern missing data procedures should consult more in-depth sources (e.g., Enders, 2010).
References
Baraldi, A.N., & Enders, C.K. (2010). An introduction to modern missing data analyses. Journal of School Psychology, 48(1), 5-37. doi: 10.1016/j.jsp.2009.10.001.
Demirtas, H., Freels, S.A., & Yucel, R.M. (2008). Plausibility of multivariate normality assumption when multiply imputing non-Gaussian continuous outcomes: A simulation assessment. Journal of Statistical Computation and Simulation, 78(1), 69-84. doi: 10.1080/10629360600903866.
Enders, C.K. (2010). Applied missing data analysis. New York: The Guilford Press.
Graham, J.W., Olchowski, A.E. & Gilreath, T.D. (2007). How many imputations are really needed? Some practical clarifications of multiple imputation theory. Prevention Science, 8, 206-213. doi: 10.1007/s11121-007-0070-9.
Honaker, J., King, G., & Blackwell, M. (2011). Amelia II: A program for missing data. Journal of Statistical Software, 45(7), 1-47. doi: 10.18637/jss.v045.i07.
Schafer, J.L., & Graham, J.W. (2002). Missing data: Our view of the state of the art. Psychological Methods, 7, 147-177. doi: 10.1037//1082-989X.7.2.147.
Su, Y., Gelman, A., Hill, J., & Yajima, M. (2011). Multiple imputation with diagnostics (mi) in R: Opening windows into the black box. Journal of Statistical Software, 45(2), 1-31. doi: 10.18637/jss.v045.i02.
van Buuren, S., & Groothuis-Oudshoorn, K. (2011). mice: Multivariate imputation by chained equations in R. Journal of Statistical Software, 45(3), 1-67. doi: 10.18637/jss.v045.i03.
About the Author
Candace Corbeil is the methodology representative on the APA Student Science Council. She is a second-year PhD candidate in psychology at the University of Rhode Island.
PSA is the monthly e-newsletter of the APA Science Directorate. It is read by psychologists, students, academic administrators, journalists and policymakers in Congress and federal science agencies.
