JPSM/MPSM Seminar
December 9, 2020
Micha Fischer
"Sequential Imputation with Integrated Method Selection: A Novel Approach to Missing Value Imputation in High-Dimensional (Survey) Data"
"The issue of incomplete observations due to various reasons (i.e. item nonresponse, unit nonresponse, failure to link records, and panel attrition) is an inevitable problem in survey data sets. Multiple sequential imputation is often used to impute those missing values. However, in data sets where many variables are affected by missing values, appropriate specifications of sequential regression models can be burdensome and time consuming, as a separate model needs to be developed by a human imputer for each incomplete variable. This task is even more complex, because survey data typically consists of many different kinds of variables (i.e. continuous, binary, and multi-categorical) with possibly nontrivial and non-linear relationships. Available software packages for automated imputation procedures (e.g. MICE, IVEware) require model specifications for each variable containing missing values. Additionally, default models in this software can lead to bias in imputed values, for example when variables are non-normally distributed.
This research aims to automate the process of sequential imputation of missing values in high-dimensional data sets consisting of potentially non-normally distributed variables and potentially complex and non-linear interactions. To achieve this goal, we propose modifying the sequential imputation procedure. First, model specification via an automated variable selection procedure (e.g. adaptive LASSO, elastic net) is performed. Second, the process carries out model selection from a pool of several parametric and non-parametric models in each step of the sequential imputation procedure. The model selection is based on prediction accuracy and the similarity between imputed and observed values conditional on the response propensity score for the outcome variable.
A simulation study based on a survey data set (NHANES) investigates in which situations this automated procedure can outperform the approaches implemented in currently available software (MICE, IVEware). Evaluation of the proposed method focuses on two aspects: 1) differences in quantitative properties of hypothetical models of interest fitted to the imputed data are used to compare the accuracy of the different imputation procedures; 2) the methods are assessed in terms of run time."
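
As a rough illustration of the model-selection idea described in the abstract, the following Python sketch performs a single imputation step for one incomplete variable: each candidate model in a small pool is scored by cross-validated prediction error on the observed cases, and the best-scoring model fills in the missing values. This is not the speaker's implementation. The function names (`fit_mean`, `fit_linear`, `fit_knn`, `cv_error`, `impute_step`), the three-model pool, and the toy data are all assumptions made for illustration. The sketch uses a single predictor, so the variable selection step (adaptive LASSO, elastic net) is omitted. It selects on prediction accuracy only, leaving out the comparison of imputed and observed values conditional on the response propensity score, and it fills in deterministic predictions rather than drawing multiple imputations.

```python
import statistics

# Hypothetical pool of candidate models. Each fit function takes the
# observed predictor and outcome values and returns a prediction function.

def fit_mean(xs, ys):
    # Baseline: always predict the observed mean of the outcome.
    m = statistics.fmean(ys)
    return lambda x: m

def fit_linear(xs, ys):
    # Parametric candidate: simple least-squares line with one predictor.
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx if sxx else 0.0
    return lambda x: my + slope * (x - mx)

def fit_knn(xs, ys, k=3):
    # Non-parametric candidate: average outcome of the k nearest neighbours.
    pairs = list(zip(xs, ys))
    def predict(x):
        nearest = sorted(pairs, key=lambda p: abs(p[0] - x))[:k]
        return statistics.fmean(y for _, y in nearest)
    return predict

POOL = {"mean": fit_mean, "linear": fit_linear, "knn": fit_knn}

def cv_error(fit, xs, ys, folds=5):
    # Mean squared prediction error from a simple k-fold cross-validation.
    n = len(xs)
    total = 0.0
    for f in range(folds):
        train = [i for i in range(n) if i % folds != f]
        test = [i for i in range(n) if i % folds == f]
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        total += sum((model(xs[i]) - ys[i]) ** 2 for i in test)
    return total / n

def impute_step(xs, ys):
    # Impute the None entries of ys from xs with the best model in the pool.
    observed = [i for i, y in enumerate(ys) if y is not None]
    ox = [xs[i] for i in observed]
    oy = [ys[i] for i in observed]
    scores = {name: cv_error(fit, ox, oy) for name, fit in POOL.items()}
    best = min(scores, key=scores.get)
    model = POOL[best](ox, oy)
    completed = [y if y is not None else model(x) for x, y in zip(xs, ys)]
    return best, completed

# Toy data (assumed): a non-linear relationship with every 7th value missing.
xs = [i / 10 for i in range(-20, 21)]
ys = [None if i % 7 == 3 else x ** 2 for i, x in enumerate(xs)]
best, completed = impute_step(xs, ys)
print(best, sum(y is None for y in completed))
```

In a full sequential procedure, a step like `impute_step` would be repeated for each incomplete variable in turn, conditioning on the current imputations of the other variables, and cycled until the imputations stabilise.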