A comparative study of genotype imputation programs

Date
2020-01-06
Author
Yang, Nathan 1991-
Type
Thesis
Degree Level
Masters
Abstract
Background Genotype imputation infers missing genotypic data computationally, and has been reported
to be highly useful in various genetic studies; e.g., genome-wide association studies and genomic selection.
Motivation While various genotype imputation programs have been evaluated via different measurements,
some, such as Pearson correlation, may not be appropriate for a given context and may lead to misleading
conclusions. Further, most evaluations of genotype imputation programs are focused on human data. Finally, the
most commonly used measurement, concordance, is unable to determine a difference in performance in some
cases.
Research Questions (1) How do popular genotype imputation programs (i.e., Minimac and Beagle) perform
on plant data as compared to human data? (2) Can we find measures that better discriminate imputation
performance when concordance does not? and (3) What do alternate measures indicate for the performance
of these imputation programs?
Methods Since Kullback-Leibler divergence (K-L divergence) and Hellinger distance can aid in ranking
statistical inference methods, they can be highly useful in our study. To amplify signals from K-L divergence
and Hellinger distance, we obtain their negative logarithmic values (i.e., negative logarithmic K-L divergence
(NLKLD) and negative logarithmic Hellinger distance (NLHD)) so that larger values indicate better imputation
results. With NLKLD and NLHD, we investigate the performance of two existing genotype imputation
programs (i.e., Beagle and Minimac) on data from plants, specifically Arabidopsis thaliana and rice, as well
as human. For each pair of organisms to be compared, we select data from one chromosome of each organism
such that approximately the same number of samples/participants and SNPs are present for each organism.
Finally, we apply different missing rates for target datasets and different sample size ratios between reference
and target datasets for sensitivity analysis of the imputation programs.
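
The two measures described above can be sketched in a few lines of Python. This is a minimal illustration, not the thesis's implementation: it assumes the divergences are computed between the genotype frequency distributions of the known and the imputed genotypes at one SNP, that genotypes are coded 0/1/2, that a small pseudocount is used to avoid division by zero, and that the logarithm is base 10. All function names are hypothetical.

```python
import math
from collections import Counter

GENOTYPES = (0, 1, 2)  # assumed coding: number of minor-allele copies


def genotype_distribution(genotypes, pseudocount=1e-9):
    """Relative frequency of each genotype class (pseudocount is an assumption)."""
    counts = Counter(genotypes)
    total = len(genotypes) + pseudocount * len(GENOTYPES)
    return [(counts.get(g, 0) + pseudocount) / total for g in GENOTYPES]


def kl_divergence(p, q):
    """Kullback-Leibler divergence D(p || q) for discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def hellinger_distance(p, q):
    """Hellinger distance between discrete distributions, in [0, 1]."""
    squared = sum((math.sqrt(pi) - math.sqrt(qi)) ** 2 for pi, qi in zip(p, q))
    return math.sqrt(squared / 2)


def neg_log(value):
    """Negative base-10 logarithm; a distance of zero maps to +infinity."""
    return math.inf if value <= 0 else -math.log10(value)


def nlkld(known, imputed):
    p, q = genotype_distribution(known), genotype_distribution(imputed)
    return neg_log(kl_divergence(p, q))


def nlhd(known, imputed):
    p, q = genotype_distribution(known), genotype_distribution(imputed)
    return neg_log(hellinger_distance(p, q))


# Toy data (made up for illustration only).
known = [0, 0, 1, 2, 1, 0, 0, 1]
imputed_close = [0, 0, 1, 2, 1, 0, 0, 2]
imputed_far = [1, 1, 1, 2, 1, 0, 2, 2]

print("NLKLD close:", nlkld(known, imputed_close))
print("NLKLD far:  ", nlkld(known, imputed_far))
print("NLHD close: ", nlhd(known, imputed_close))
print("NLHD far:   ", nlhd(known, imputed_far))
```

Because both underlying quantities shrink toward zero as the imputed distribution approaches the known one, taking the negative logarithm makes larger values correspond to better imputation, as the abstract states.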
Results We demonstrate that in a general case where single nucleotide polymorphisms (SNPs) with different
minor allele frequencies (MAFs) are imputed at the same concordance, both NLKLD and NLHD capture a
difference in the imputation performance. Such a difference reflects not only the difference of correspondence
between the known and imputed MAFs, but also the difference of chance agreement between the known
and imputed genotypes. Additionally, neither Minimac nor Beagle performs better on either A. thaliana or
human data. However, Beagle performs better on human data than on rice data. Finally, the majority of
both NLKLD and NLHD results from all experimental data indicate that Minimac outperforms Beagle.
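
To illustrate why equal concordance can hide a difference, the following sketch compares two made-up SNPs that are both imputed at a concordance of 0.9 but have different minor allele frequencies. It assumes "chance agreement" means the agreement expected from the marginal genotype frequencies alone (the quantity used in Cohen's kappa); the abstract does not give the thesis's exact definition, and the function names and the 0/1 coding are hypothetical.

```python
from collections import Counter


def concordance(known, imputed):
    """Fraction of positions where the imputed genotype equals the known one."""
    matches = sum(k == i for k, i in zip(known, imputed))
    return matches / len(known)


def chance_agreement(known, imputed):
    """Agreement expected if known and imputed genotypes were independent."""
    n = len(known)
    known_counts = Counter(known)
    imputed_counts = Counter(imputed)
    return sum(
        (known_counts[g] / n) * (imputed_counts[g] / n) for g in known_counts
    )


# Rare variant: the imputation calls every sample as the major genotype.
rare_known = [0] * 18 + [1] * 2
rare_imputed = [0] * 20

# Common variant: two of the ten minor genotypes are miscalled.
common_known = [0] * 10 + [1] * 10
common_imputed = [0] * 10 + [1] * 8 + [0] * 2

for name, known, imputed in [
    ("rare", rare_known, rare_imputed),
    ("common", common_known, common_imputed),
]:
    print(name, concordance(known, imputed), chance_agreement(known, imputed))
```

In this toy case the rare SNP reaches its concordance purely by always predicting the major genotype, so its concordance equals its chance agreement, whereas the common SNP's concordance is well above chance.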
Conclusions (1) Although neither Minimac nor Beagle consistently performs better on either plant or
human data, Beagle evidently performs better on human data than on rice data; (2) NLKLD and NLHD
can be more discriminating than concordance and should be considered in comparing different genotype
imputation programs to determine superior imputation methods; and (3) the NLKLD and NLHD results
suggest that Minimac’s imputation method is superior to Beagle’s. Further study can involve confirming
these trends with runs on more experimental data.
Degree
Master of Science (M.Sc.)
Department
Computer Science
Program
Computer Science
Supervisor
Kusalik, Anthony
Committee
McQuillan, Ian; Links, Matthew; Konkin, David; Keil, Mark
Copyright Date
June 2020
Subject
genotype imputation
K-L divergence
Hellinger distance
measures