Summary of Saunders et al. (2006), Imputing Missing Data: A Comparison of Methods for Social Work Researchers

For social work researchers, dealing with missing data is always a challenge. Missing values are often simply ignored, but doing so can distort the analysis and undermine valid, efficient inferences about a population. In this article, Saunders et al. (2006) present six methods for handling missing data and apply them to two data sets.

The most common and easiest method is listwise deletion. When it is used, the computer program automatically deletes any case that has missing data for any bivariate or multivariate analysis. However, this method shrinks the sample, so it is appropriate only with a large sample and a relatively small amount of missing data. The second method is mean substitution, which replaces every missing value of a variable with the mean of that variable in the total sample. It is appropriate only when few cases have missing values, because it reduces the estimated standard deviation and variance and results in biased, deflated standard errors. The third method is hotdecking. With this method, the missing values for variable X are replaced with a value from a case that has similar characteristics. Hotdecking approximates the standard deviation better than mean substitution does, but bias is still likely to occur in regression equations.

The fourth method is regression imputation, also called conditional mean imputation. The first step is to select the best predictors with complete data, that is, the variables most highly correlated with the variable that has missing values. In the regression equation, the predictors serve as independent variables and the variable with missing values serves as the dependent variable. For example, if the income variable has missing values, they could be predicted from a regression equation using other variables such as age, education, or occupation, provided these are complete and highly correlated with income. Because this method assumes a linear relationship among the variables used in the regression equation, it may result in overestimated model statistics and lower significance values. The last two imputation methods in this study are more sophisticated than the methods mentioned above.
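
To make the first four methods concrete, here is a minimal sketch of each on a toy data set. It is not code from Saunders et al. (2006): the variable names (age, income), the simulated data, the 30% missing rate, and the nearest-neighbour donor rule used for hotdecking are all assumptions made for illustration, and only one predictor is used for the regression.

```python
import numpy as np

def listwise_delete(x, y):
    """Drop every case whose y is missing (NaN)."""
    keep = ~np.isnan(y)
    return x[keep], y[keep]

def mean_substitute(y):
    """Replace each missing y with the mean of the observed y values."""
    filled = y.copy()
    filled[np.isnan(y)] = np.nanmean(y)
    return filled

def hot_deck(x, y):
    """Replace each missing y with the y of the observed case whose x is closest
    (one simple way to pick a 'similar' donor; an assumption of this sketch)."""
    filled = y.copy()
    observed = np.flatnonzero(~np.isnan(y))
    for i in np.flatnonzero(np.isnan(y)):
        donor = observed[np.argmin(np.abs(x[observed] - x[i]))]
        filled[i] = y[donor]
    return filled

def regression_impute(x, y):
    """Replace each missing y with its predicted value from a regression of y on x."""
    obs = ~np.isnan(y)
    slope, intercept = np.polyfit(x[obs], y[obs], 1)
    filled = y.copy()
    filled[~obs] = intercept + slope * x[~obs]
    return filled

# Hypothetical toy data: income depends linearly on age, 30% of incomes are missing.
rng = np.random.default_rng(0)
age = rng.normal(40, 10, 500)
income = 20 + 0.8 * age + rng.normal(0, 5, 500)
income_missing = income.copy()
income_missing[rng.random(500) < 0.3] = np.nan

_, income_listwise = listwise_delete(age, income_missing)
income_mean = mean_substitute(income_missing)
income_hotdeck = hot_deck(age, income_missing)
income_regression = regression_impute(age, income_missing)

for name, values in [("listwise", income_listwise), ("mean", income_mean),
                     ("hotdeck", income_hotdeck), ("regression", income_regression)]:
    print(f"{name:10s} n={values.size:3d} sd={values.std(ddof=1):.2f}")
```

Comparing the printed standard deviations shows the point made above: mean substitution always gives a smaller standard deviation than the observed cases alone, because every filled-in value sits exactly at the mean.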

I found these two methods hard to understand from this article alone, so I had to do a lot of searching (honestly, I am still not sure I fully understand them). The EM algorithm exploits the relationship between the missing values and the model parameters, since each provides useful information about the other. It consists of two steps. In the E step, the parameters are estimated from the observed data, and the missing values are imputed using those estimates. In the M step, the parameters are re-estimated from the observed data together with the values imputed in the E step. The algorithm iterates these steps until the change in the estimated parameters becomes negligible. Although the estimated parameters are unbiased and efficient, the EM algorithm generally tends to underestimate the standard errors and overstate the precision of the inference (Kang & Kim, 2006). The final method is introduced as multiple implicates in this article, but other articles call it Multiple Imputation (MI). Among the definitions of MI that I found, the explanations by Wayman (2003) and Graham (2009) are the easiest to understand. Wayman (2003) puts it this way: “In multiple imputation, missing values for any variable are predicted using existing values from other variables. The predicted values, called ‘imputes’, are substituted for the missing values, resulting in a full data called an ‘imputed data set.’ This process is performed multiple times, producing multiple imputed data sets. Standard statistical analysis is carried out on each imputed data set, producing multiple analysis results. These analysis results are then combined to produce one overall analysis.” Graham (2009) presents the key points of the multiple imputation method step by step. “The key to any MI program is to restore the error variance lost from regression-based single imputation. 
In order to restore this lost variance, the first part of imputation is to add random error variance. The second part of restoring lost variance relates to the fact that each imputed value is based on a single regression equation. In order to adjust the lost error completely, one should obtain multiple random draws from the population and impute multiple times.”
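
Since these two methods were the hardest for me to follow, here is a rough sketch of both on one toy problem. None of it comes from the articles cited: the simulated data, the function names, and the missingness rule are assumptions. The EM function follows the usual textbook form, in which the E step fills in the expected value of each missing y (and of its square) under the current parameters and the M step re-estimates the parameters. The MI function adds random error to each imputed value, refits the regression on a bootstrap resample before each imputation (a simple stand-in for drawing a new regression equation each time), and combines the results with Rubin's rules.

```python
import numpy as np

def em_mean(x, y, n_iter=500, tol=1e-10):
    """EM estimate of the mean of y when x is complete and y has gaps (NaN)."""
    miss = np.isnan(y)
    mu_x, var_x = x.mean(), x.var()
    # Starting values come from the complete cases only.
    mu_y, var_y = y[~miss].mean(), y[~miss].var()
    cov_xy = np.mean((x[~miss] - x[~miss].mean()) * (y[~miss] - mu_y))
    for _ in range(n_iter):
        slope = cov_xy / var_x
        intercept = mu_y - slope * mu_x
        resid_var = var_y - slope * cov_xy
        # E step: expected y and y**2 for the missing cases, given current parameters.
        pred = intercept + slope * x
        ey = np.where(miss, pred, y)
        ey2 = np.where(miss, pred**2 + resid_var, y**2)
        # M step: re-estimate the parameters from the completed statistics.
        new_mu_y = ey.mean()
        new_var_y = ey2.mean() - new_mu_y**2
        new_cov = np.mean(x * ey) - mu_x * new_mu_y
        change = abs(new_mu_y - mu_y) + abs(new_var_y - var_y) + abs(new_cov - cov_xy)
        mu_y, var_y, cov_xy = new_mu_y, new_var_y, new_cov
        if change < tol:  # stop once the parameters no longer move
            break
    return mu_y

def mi_mean(x, y, m=20, seed=0):
    """Multiple imputation of y from x; returns the pooled mean and its standard error."""
    rng = np.random.default_rng(seed)
    obs = ~np.isnan(y)
    idx_obs = np.flatnonzero(obs)
    estimates, variances = [], []
    for _ in range(m):
        # A bootstrap resample gives each imputation its own regression equation.
        boot = rng.choice(idx_obs, size=idx_obs.size, replace=True)
        slope, intercept = np.polyfit(x[boot], y[boot], 1)
        sigma = (y[boot] - (intercept + slope * x[boot])).std(ddof=2)
        filled = y.copy()
        # Random error restores the variance that single regression imputation loses.
        filled[~obs] = intercept + slope * x[~obs] + rng.normal(0, sigma, (~obs).sum())
        estimates.append(filled.mean())
        variances.append(filled.var(ddof=1) / filled.size)
    within = np.mean(variances)            # average within-imputation variance
    between = np.var(estimates, ddof=1)    # between-imputation variance
    total = within + (1 + 1 / m) * between  # Rubin's rules
    return float(np.mean(estimates)), float(np.sqrt(total))

# Hypothetical toy data: true mean of y is 2.0; y is missing whenever x > 0.5.
rng = np.random.default_rng(1)
x = rng.normal(0, 1, 2000)
y = 2.0 + 1.5 * x + rng.normal(0, 1, 2000)
y_missing = np.where(x > 0.5, np.nan, y)

complete_case_mean = float(np.nanmean(y_missing))
em_estimate = float(em_mean(x, y_missing))
mi_estimate, mi_se = mi_mean(x, y_missing)
print("complete cases:", round(complete_case_mean, 3))
print("EM:", round(em_estimate, 3))
print("MI:", round(mi_estimate, 3), "+/-", round(mi_se, 3))
```

In this setup the complete-case mean is pulled well below the true value because the high-x cases are the ones missing, while EM and MI both use x to correct for that. Only the MI function returns a standard error that accounts for the uncertainty of the imputed values.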

To compare the results of these methods, Saunders et al. (2006) take variables with missing values from two data sets and conduct statistical analyses on them. Despite some variation in the F values and slope coefficients, I think this study fails to reveal significant differences among the five imputation methods because of the characteristics of the examples: a large sample size with only a small percentage of missing values. According to Graham (2009), when the amount of missing data is small (i.e., under 5%), multiple imputation can be applied but is not essential. Therefore, in my opinion (and as the authors admit), this comparison should be conducted in a more sophisticated way, such as a Monte Carlo simulation, and should use other data sets that can generate statistically significant results.
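
As a rough idea of what such a Monte Carlo comparison could look like, here is a minimal sketch. It is only an illustration under my own assumptions, not the design Saunders et al. (2006) or Graham (2009) propose: the data are simulated from a known linear model, values are deleted completely at random, and only three of the methods are compared on the regression slope and the standard deviation of the imputed variable.

```python
import numpy as np

def simulate(n_reps=500, n=200, p_missing=0.3, true_slope=0.8, seed=0):
    """Average slope and SD of y under three ways of handling missing y values."""
    rng = np.random.default_rng(seed)
    results = {name: {"slope": [], "sd": []}
               for name in ("listwise", "mean", "regression")}
    for _ in range(n_reps):
        x = rng.normal(0, 1, n)
        y = 1.0 + true_slope * x + rng.normal(0, 1, n)
        miss = rng.random(n) < p_missing  # values missing completely at random
        obs = ~miss
        fit_slope, fit_intercept = np.polyfit(x[obs], y[obs], 1)
        versions = {
            # Listwise deletion: analyse the observed cases only.
            "listwise": (x[obs], y[obs]),
            # Mean substitution: fill gaps with the observed mean.
            "mean": (x, np.where(miss, y[obs].mean(), y)),
            # Regression imputation: fill gaps with predicted values.
            "regression": (x, np.where(miss, fit_intercept + fit_slope * x, y)),
        }
        for name, (xs, ys) in versions.items():
            results[name]["slope"].append(np.polyfit(xs, ys, 1)[0])
            results[name]["sd"].append(ys.std(ddof=1))
    return {name: {k: float(np.mean(v)) for k, v in stats.items()}
            for name, stats in results.items()}

summary = simulate()
for name, stats in summary.items():
    print(f"{name:10s} slope={stats['slope']:.3f} sd={stats['sd']:.3f}")
```

Because the true slope is known in a simulation, the bias of each method can be read off directly, which the two real data sets in the article cannot offer. With a missing rate as high as the assumed 30%, the differences between methods become visible in a way they are not with only a few percent missing.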


< List of References >

Graham, J.W. (2009). Missing data analysis: Making it work in the real world. Annual Review of Psychology, 60, 549-576. doi: 10.1146/annurev.psych.58.110405.085530

Kang & Kim. (2006). Review for imputing missing data methods in public administration and policy research. Korean Public Administration Review, 40(2), 31-52.

Wayman, J.C. (2003). Multiple imputation for missing data: What is it and how can I use it? Paper presented at the 2003 Annual Meeting of the American Educational Research Association, Chicago, IL. Retrieved from
