
MULTIPLE IMPUTATION TO CORRECT FOR MEASUREMENT ERROR: Application to Chronic Disease Case Ascertainment in Administrative Health Databases











Diagnosis codes in administrative health databases (AHDs) are commonly used to ascertain chronic disease cases for research and surveillance. Many studies that validate AHDs against a gold standard data source, in which the true disease status is known, have demonstrated low sensitivity of diagnosis codes. Low sensitivity results in misclassification of disease status, which can lead to biased prevalence estimates and loss of power to detect associations between disease status and health outcomes. Model-based case detection algorithms, combined with multiple imputation (MI) methods in validation dataset/main dataset designs, could be used to correct for misclassification of chronic disease status in AHDs. Under this approach, a predictive model of disease status (e.g., a logistic model) is constructed in the validation dataset, the model parameters are estimated, and MI methods are used to impute true disease status in the main dataset. This research considered two scenarios: misclassification of the observed disease status that is independent of disease predictors, and misclassification that is dependent on disease predictors. For the independent scenario, MI methods based on a Frequentist logistic model (with and without bias correction) and a Bayesian logistic model were compared. For the dependent scenario, MI methods based on Frequentist logistic models with different variables as covariates were compared. Monte Carlo techniques were used to investigate the effects of the following data and model characteristics on bias and error in chronic disease prevalence estimates from AHDs: sensitivity of observed disease status based on diagnosis codes, size of the validation dataset, number of imputations, and the magnitude of measurement error in covariates of the predictive model.
Relative bias, root mean squared error (RMSE), and coverage of the 95% confidence interval were used to measure performance. Without bias correction, the Bayesian MI model had lower RMSE than the Frequentist MI model. In the simulation study, the Frequentist MI model with bias correction outperformed both the Bayesian MI model and the Frequentist MI model without bias correction. The results indicate that MI works well for measurement error correction provided that the missing true values are not missing not at random (MNAR), regardless of whether the observed disease diagnosis depends on other disease predictors. Increasing the size of the validation dataset improved the performance of MI more than increasing the number of imputations.
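The validation dataset/main dataset approach described above can be illustrated with a minimal sketch. It is not the thesis's implementation: the simulated data, the sensitivity and specificity values, the sample sizes, the helper `fit_logistic`, and the normal approximation used both for drawing model parameters and for the confidence interval are all assumptions made for illustration. The sketch fits a logistic model of true disease status in the validation data, imputes true status in the main data several times, and pools the prevalence estimates with Rubin's rules.

```python
import numpy as np

rng = np.random.default_rng(0)

def fit_logistic(X, y, n_iter=25):
    """Maximum-likelihood logistic regression via Newton-Raphson (hypothetical helper).

    Returns the coefficient estimates and their estimated covariance matrix.
    """
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        w = p * (1.0 - p)
        info = X.T @ (X * w[:, None])          # observed information matrix
        beta = beta + np.linalg.solve(info, X.T @ (y - p))
    p = 1.0 / (1.0 + np.exp(-X @ beta))
    w = p * (1.0 - p)
    cov = np.linalg.inv(X.T @ (X * w[:, None]))
    return beta, cov

def simulate(n, sensitivity=0.6, specificity=0.98):
    """Simulate a covariate x, true status y, and an error-prone diagnosis code d.

    Misclassification here is independent of x (an assumed scenario).
    """
    x = rng.normal(size=n)
    y = rng.binomial(1, 1.0 / (1.0 + np.exp(-(-2.0 + x))))
    d = np.where(y == 1,
                 rng.binomial(1, sensitivity, size=n),
                 rng.binomial(1, 1.0 - specificity, size=n))
    return x, y, d

# Validation data: true status known. Main data: true status treated as missing.
x_val, y_val, d_val = simulate(1000)
x_main, y_main, d_main = simulate(10000)   # y_main is kept only to check the result

X_val = np.column_stack([np.ones_like(x_val), x_val, d_val])
X_main = np.column_stack([np.ones_like(x_main), x_main, d_main])

beta_hat, beta_cov = fit_logistic(X_val, y_val)

M = 20                                      # number of imputations (assumed)
q = np.empty(M)                             # prevalence estimate per imputation
u = np.empty(M)                             # within-imputation variance
for m in range(M):
    # Draw parameters from their approximate sampling distribution so that
    # the imputations reflect uncertainty in the fitted model.
    beta_m = rng.multivariate_normal(beta_hat, beta_cov)
    p_m = 1.0 / (1.0 + np.exp(-X_main @ beta_m))
    y_imp = rng.binomial(1, p_m)            # imputed true disease status
    q[m] = y_imp.mean()
    u[m] = q[m] * (1.0 - q[m]) / len(y_imp)

# Rubin's rules: pool the point estimates and combine the variance components.
q_bar = q.mean()
within = u.mean()
between = q.var(ddof=1)
total_var = within + (1.0 + 1.0 / M) * between

# Normal-approximation interval; a t reference distribution is the usual choice.
half_width = 1.96 * np.sqrt(total_var)
ci = (q_bar - half_width, q_bar + half_width)

naive = d_main.mean()                       # prevalence from diagnosis codes alone
print(f"naive={naive:.3f}  MI={q_bar:.3f}  CI=({ci[0]:.3f}, {ci[1]:.3f})  "
      f"true={y_main.mean():.3f}")
```

Repeating this procedure over many simulated datasets, while varying the sensitivity, the validation sample size, and `M`, is one way to compute the relative bias, RMSE, and interval coverage used as performance measures.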



Keywords: disease prevalence, measurement error, multiple imputation



Master of Science (M.Sc.)


School of Public Health



