A comparative study of genotype imputation programs

dc.contributor.advisor: Kusalik, Anthony
dc.contributor.committeeMember: McQuillan, Ian
dc.contributor.committeeMember: Links, Matthew
dc.contributor.committeeMember: Konkin, David
dc.contributor.committeeMember: Keil, Mark
dc.creator: Yang, Nathan 1991-
dc.date.accessioned: 2020-01-06T15:17:01Z
dc.date.available: 2020-01-06T15:17:01Z
dc.date.created: 2020-06
dc.date.issued: 2020-01-06
dc.date.submitted: June 2020
dc.date.updated: 2020-01-06T15:17:02Z
dc.description.abstract:
Background: Genotype imputation infers missing genotypic data computationally and has been reported to be highly useful in various genetic studies, e.g., genome-wide association studies and genomic selection.
Motivation: Genotype imputation programs have been evaluated with a variety of measures, but some, such as Pearson correlation, may not be appropriate for a given context and may give misleading results. Further, most evaluations of genotype imputation programs focus on human data. Finally, the most commonly used measure, concordance, cannot detect a difference in performance in some cases.
Research Questions: (1) How do popular genotype imputation programs (i.e., Minimac and Beagle) perform on plant data as compared to human data? (2) Can we find measures that better discriminate imputation performance when concordance does not? (3) What do alternate measures indicate for the performance of these imputation programs?
Methods: Because Kullback-Leibler divergence (K-L divergence) and Hellinger distance can aid in ranking statistical inference methods, they are well suited to our study. To amplify the signals from K-L divergence and Hellinger distance, we take their negative logarithms (i.e., negative logarithmic K-L divergence (NLKLD) and negative logarithmic Hellinger distance (NLHD)), so that larger values indicate better imputation results. With NLKLD and NLHD, we investigate the performance of two existing genotype imputation programs (i.e., Beagle and Minimac) on data from plants, specifically Arabidopsis thaliana and rice, as well as on human data. For each pair of organisms to be compared, we select data from one chromosome of each organism such that approximately the same number of samples/participants and SNPs are present for each organism. Finally, for a sensitivity analysis of the imputation programs, we apply different missing rates to the target datasets and different sample size ratios between the reference and target datasets.
Results: We demonstrate that in a general case where single nucleotide polymorphisms (SNPs) with different minor allele frequencies (MAFs) are imputed at the same concordance, both NLKLD and NLHD capture a difference in imputation performance. This difference reflects not only the difference in correspondence between the known and imputed MAFs, but also the difference in chance agreement between the known and imputed genotypes. Additionally, neither Minimac nor Beagle performs better on either A. thaliana or human data. However, Beagle performs better on human data than on rice data. Finally, the majority of the NLKLD and NLHD results from all experimental data indicate that Minimac outperforms Beagle.
Conclusions: (1) Although neither Minimac nor Beagle consistently performs better on either plant or human data, Beagle evidently performs better on human data than on rice data; (2) NLKLD and NLHD can be more discriminating than concordance and should be considered when comparing genotype imputation programs to determine superior imputation methods; and (3) the NLKLD and NLHD results suggest that Minimac’s imputation method is superior to Beagle’s. Further study can involve confirming these trends with runs on more experimental data.
dc.format.mimetype: application/pdf
dc.identifier.uri: http://hdl.handle.net/10388/12506
dc.subject: genotype imputation; K-L divergence; Hellinger distance; measures
dc.title: A comparative study of genotype imputation programs
dc.type: Thesis
dc.type.material: text
thesis.degree.department: Computer Science
thesis.degree.discipline: Computer Science
thesis.degree.grantor: University of Saskatchewan
thesis.degree.level: Masters
thesis.degree.name: Master of Science (M.Sc.)
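
Illustrative sketch (not part of the repository record): the abstract describes NLKLD and NLHD only in words, so the following Python sketch shows one way such measures could be computed for a single SNP and compared with concordance. Every detail here is an assumption made for illustration and is not taken from the thesis: genotypes are coded as minor-allele counts 0/1/2, the distributions compared are the genotype frequency distributions of the known and imputed calls, the K-L divergence is taken from known to imputed, the logarithm is base 10, and a small pseudocount avoids division by zero. The two toy SNPs are synthetic, not thesis data.

```python
import math
from collections import Counter

GENOTYPES = (0, 1, 2)  # assumed coding: copies of the minor allele


def genotype_distribution(calls, pseudocount=1e-6):
    """Relative genotype frequencies, smoothed so no probability is exactly zero."""
    counts = Counter(calls)
    total = len(calls) + pseudocount * len(GENOTYPES)
    return [(counts[g] + pseudocount) / total for g in GENOTYPES]


def concordance(known, imputed):
    """Fraction of samples whose imputed genotype equals the known genotype."""
    return sum(k == i for k, i in zip(known, imputed)) / len(known)


def kl_divergence(p, q):
    """K-L divergence D(p || q) for two discrete distributions (natural log)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


def hellinger_distance(p, q):
    """Hellinger distance between two discrete distributions, in [0, 1]."""
    squared = sum((math.sqrt(pi) - math.sqrt(qi)) ** 2 for pi, qi in zip(p, q))
    return math.sqrt(squared / 2)


def negative_log(value, floor=1e-12):
    """Negative base-10 log, floored so identical distributions stay finite."""
    return -math.log10(max(value, floor))


# Synthetic SNPs, 100 samples each, both imputed with exactly 5 wrong calls.
toy_snps = {
    # Common minor allele: five heterozygotes are imputed as homozygous major.
    "common": ([0] * 50 + [1] * 40 + [2] * 10,
               [0] * 55 + [1] * 35 + [2] * 10),
    # Rare minor allele: all five heterozygotes are imputed as homozygous major.
    "rare": ([0] * 95 + [1] * 5,
             [0] * 100),
}

results = {}
for name, (known, imputed) in toy_snps.items():
    p = genotype_distribution(known)
    q = genotype_distribution(imputed)
    results[name] = {
        "concordance": concordance(known, imputed),
        "nlkld": negative_log(kl_divergence(p, q)),
        "nlhd": negative_log(hellinger_distance(p, q)),
    }

for name, scores in results.items():
    print(name, {key: round(value, 3) for key, value in scores.items()})
```

In this toy case both SNPs have the same concordance (0.95), yet the rare SNP, whose minor allele is never recovered, scores lower on both NLKLD and NLHD under the assumptions above. This mirrors the kind of difference the abstract says concordance alone does not capture.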

Files

Original bundle
Showing 5 of 18 files:
Name: YANG-THESIS-2020.pdf; Size: 1.81 MB; Format: Adobe Portable Document Format
Name: README.txt; Size: 1.13 KB; Format: Plain Text
Name: rice.chr12.nlkld.txt; Size: 3.9 KB; Format: Plain Text
Name: rice.chr12.nlhd.txt; Size: 3.92 KB; Format: Plain Text
Name: rice.chr12.iqs.txt; Size: 5.64 KB; Format: Plain Text

License bundle
Name: LICENSE.txt; Size: 2.27 KB; Format: Plain Text