
Feature Selection Bias in Assessing the Predictivity of SNPs for Alzheimer's Disease

dc.contributor.advisor: Li, Longhai
dc.contributor.advisor: Balbuena, Lloyd
dc.contributor.committeeMember: Feng, Cindy
dc.contributor.committeeMember: Wu, Fangxiang
dc.contributor.committeeMember: Wang, JC
dc.creator: Dong, Mei
dc.date.accessioned: 2019-06-07T20:07:17Z
dc.date.available: 2019-06-07T20:07:17Z
dc.date.created: 2019-05
dc.date.issued: 2019-06-07
dc.date.submitted: May 2019
dc.date.updated: 2019-06-07T20:07:18Z
dc.description.abstract: In the context of identifying SNPs related to a phenotype of interest (e.g., a disease status), we consider the problem of assessing the predictivity of SNPs selected by performing genome-wide association studies. Internal cross-validation (ICV) is an incorrect but frequently used method for this assessment. With ICV, a subset of SNPs is pre-selected based on all samples, and cross-validation (CV) is then applied to assess the predictivity of the pre-selected SNPs. The predictivity estimate given by ICV is upwardly biased; this is often called the feature selection bias problem. The cause of the bias is that the feature selection procedure, which is part of the training procedure, is not external to the test samples in ICV. The correct method, called external cross-validation (ECV), is to re-select features based only on the training samples in each fold of CV, so that feature selection is external to the test samples. The feature selection bias of ICV has been discussed in a few articles in the context of cancer diagnosis with microarray data. However, the problem has not received sufficient attention in the literature, especially in the context of predicting with SNP data: many articles use ICV or do not state explicitly that their feature selection is external to the test samples. In this thesis, we use the example of predicting late-onset Alzheimer's disease (LOAD) from selected SNPs to demonstrate that ICV can lead to severe false discovery. We use a real SNP dataset related to LOAD and two synthetic datasets (simulated responses with real SNPs) for this demonstration. For the prediction, we compare the performance of three regularized logistic regression methods: LASSO, elastic-net, and a fully Bayesian hyper-LASSO method.
For the LOAD dataset, we find that, except for APOE, no other SNPs improve the prediction of LOAD under ECV; in contrast, the predictivity estimate of the selected SNPs given by ICV can reach an $R^{2}$ as high as 80\%. For the synthetic datasets, we obtain results similar to those for the real dataset; additionally, we see that the predictivity estimate obtained with ICV can exceed even the oracle predictivity of the truly related SNPs used to generate the response. In this study, we also find that the hyper-LASSO method achieves better predictive performance than LASSO and elastic-net. We recommend that ICV not be used to measure the predictivity of selected SNPs, and that research articles state explicitly that their feature selection is external to the test samples.
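The ICV-versus-ECV distinction described in the abstract can be sketched with a small simulation. This is a hypothetical illustration, not the thesis's actual analysis: it uses pure-noise features and a deliberately simple correlation-sign classifier (not the LASSO/elastic-net/hyper-LASSO methods the thesis compares), so that any accuracy above chance under ICV is attributable to selection bias alone.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, k = 100, 1000, 20           # samples, noise features, features kept

# Pure-noise data: the "SNPs" carry no information about the phenotype,
# so honest predictive accuracy should hover around 50%.
X = rng.standard_normal((n, p))
y = rng.integers(0, 2, n)

def top_k(X_tr, y_tr, k):
    """Indices of the k features most correlated (in magnitude) with y_tr."""
    score = np.abs((X_tr - X_tr.mean(0)).T @ (y_tr - y_tr.mean()))
    return np.argsort(score)[-k:]

def cv_accuracy(feats=None, folds=5):
    """5-fold CV with a simple correlation-weighted classifier.
    feats=None  -> ECV: features re-selected inside each training fold.
    feats given -> ICV: features pre-selected on ALL samples (leaks test info).
    """
    idx = np.arange(n)
    correct = 0
    for f in range(folds):
        test = idx % folds == f
        tr = ~test
        sel = top_k(X[tr], y[tr], k) if feats is None else feats
        # Weight each selected feature by its covariance with y on the training fold.
        w = X[tr][:, sel].T @ (y[tr] - y[tr].mean())
        cut = np.median(X[tr][:, sel] @ w)      # decision threshold from training scores
        pred = (X[test][:, sel] @ w > cut).astype(int)
        correct += (pred == y[test]).sum()
    return correct / n

icv = cv_accuracy(feats=top_k(X, y, k))  # selection saw the test samples
ecv = cv_accuracy()                      # selection external to test samples
print(f"ICV accuracy: {icv:.2f}, ECV accuracy: {ecv:.2f}")
```

Because the features are pure noise, ECV accuracy stays near chance, while ICV reports a markedly inflated accuracy: the top-20-of-1000 features were picked for their spurious correlation with the full response, test folds included.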
dc.format.mimetype: application/pdf
dc.identifier.uri: http://hdl.handle.net/10388/12118
dc.subject: Feature Selection Bias
dc.subject: Cross-validation
dc.subject: Predictive Analysis
dc.subject: GWAS
dc.subject: Alzheimer's Disease
dc.title: Feature Selection Bias in Assessing the Predictivity of SNPs for Alzheimer's Disease
dc.type: Thesis
dc.type.material: text
thesis.degree.department: Mathematics and Statistics
thesis.degree.discipline: Mathematics
thesis.degree.grantor: University of Saskatchewan
thesis.degree.level: Masters
thesis.degree.name: Master of Science (M.Sc.)

Files

Original bundle
Name: DONG-THESIS-2019.pdf
Size: 1.09 MB
Format: Adobe Portable Document Format

License bundle
Name: LICENSE.txt
Size: 2.26 KB
Format: Plain Text