University of Saskatchewan HARVEST

      A comparative study of genotype imputation programs

      View/Open
      YANG-THESIS-2020.pdf (1.808Mb)
      README.txt (1.125Kb)
      rice.chr12.nlkld.txt (3.900Kb)
      rice.chr12.nlhd.txt (3.923Kb)
      rice.chr12.iqs.txt (5.643Kb)
      rice.chr12.concordance.txt (4.750Kb)
      human.chr13.nlkld.txt (3.907Kb)
      human.chr13.nlhd.txt (3.916Kb)
      human.chr13.iqs.txt (5.775Kb)
      human.chr13.concordance.txt (4.724Kb)
      human.chr22.nlkld.txt (3.929Kb)
      human.chr22.nlhd.txt (3.869Kb)
      human.chr22.iqs.txt (5.599Kb)
      human.chr22.concordance.txt (4.758Kb)
      A.thaliana.chr4.nlkld.txt (3.926Kb)
      A.thaliana.chr4.nlhd.txt (3.895Kb)
      A.thaliana.chr4.iqs.txt (5.504Kb)
      A.thaliana.chr4.concordance.txt (4.685Kb)
      Date
      2020-01-06
      Author
      Yang, Nathan 1991-
      Type
      Thesis
      Degree Level
      Masters
      Abstract
      Background: Genotype imputation infers missing genotypic data computationally and has been reported to be highly useful in various genetic studies, e.g., genome-wide association studies and genomic selection.

      Motivation: While various genotype imputation programs have been evaluated via different measurements, some of these measurements, such as Pearson correlation, may not be appropriate for a given context and may yield misleading results. Further, most evaluations of genotype imputation programs focus on human data. Finally, the most commonly used measurement, concordance, is unable to determine a difference in performance in some cases.

      Research Questions: (1) How do popular genotype imputation programs (i.e., Minimac and Beagle) perform on plant data as compared to human data? (2) Can we find measures that better discriminate imputation performance when concordance does not? (3) What do alternate measures indicate for the performance of these imputation programs?

      Methods: Since Kullback-Leibler divergence (K-L divergence) and Hellinger distance can aid in ranking statistical inference methods, they can be highly useful in our study. To amplify signals from K-L divergence and Hellinger distance, we obtain their negative logarithmic values (i.e., negative logarithmic K-L divergence (NLKLD) and negative logarithmic Hellinger distance (NLHD)) so that larger values indicate better imputation results. With NLKLD and NLHD, we investigate the performance of two existing genotype imputation programs (i.e., Beagle and Minimac) on data from plants, specifically Arabidopsis thaliana and rice, as well as humans. For each pair of organisms to be compared, we select data from one chromosome of each organism such that approximately the same number of samples/participants and SNPs are present for each organism. Finally, we apply different missing rates for target datasets and different sample size ratios between reference and target datasets for sensitivity analysis of the imputation programs.

      Results: We demonstrate that in a general case where single nucleotide polymorphisms (SNPs) with different minor allele frequencies (MAFs) are imputed at the same concordance, both NLKLD and NLHD capture a difference in the imputation performance. Such a difference reflects not only the difference in correspondence between the known and imputed MAFs, but also the difference in chance agreement between the known and imputed genotypes. Additionally, neither Minimac nor Beagle performs consistently better on either A. thaliana or human data. However, Beagle performs better on human data than on rice data. Finally, the majority of both NLKLD and NLHD results from all experimental data indicate that Minimac outperforms Beagle.

      Conclusions: (1) Although neither Minimac nor Beagle consistently performs better on either plant or human data, Beagle evidently performs better on human data than on rice data; (2) NLKLD and NLHD can be more discriminating than concordance and should be considered in comparing different genotype imputation programs to determine superior imputation methods; and (3) the NLKLD and NLHD results suggest that Minimac’s imputation method is superior to Beagle’s. Further study can involve confirming these trends with runs on more experimental data.
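      To make the measures named in the abstract concrete, the following is a minimal Python sketch of concordance, NLKLD, and NLHD for a single SNP. It is not the thesis's implementation, and the abstract does not give the exact definitions. Everything here is an assumption made for illustration: genotypes coded as minor-allele dosages 0/1/2, the two measures computed between the known and imputed genotype frequency distributions, a natural logarithm, a small pseudocount to avoid division by zero, and all function and variable names.

```python
import math
from collections import Counter

GENOTYPES = (0, 1, 2)  # assumed coding: minor-allele dosage per sample


def genotype_distribution(calls, pseudocount=1e-9):
    """Relative frequency of each genotype; the pseudocount (an assumption)
    keeps every probability strictly positive."""
    counts = Counter(calls)
    total = len(calls) + pseudocount * len(GENOTYPES)
    return [(counts.get(g, 0) + pseudocount) / total for g in GENOTYPES]


def concordance(known, imputed):
    """Fraction of samples whose imputed genotype equals the known genotype."""
    return sum(k == i for k, i in zip(known, imputed)) / len(known)


def kl_divergence(p, q):
    """Kullback-Leibler divergence D(p || q) with natural logarithm."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


def hellinger_distance(p, q):
    """Hellinger distance between two discrete distributions (range 0 to 1)."""
    squared = sum((math.sqrt(pi) - math.sqrt(qi)) ** 2 for pi, qi in zip(p, q))
    return math.sqrt(0.5 * squared)


def neg_log(value, floor=1e-12):
    """Negative logarithm, so larger means closer; the floor (an assumption)
    avoids log(0) when the two distributions are identical."""
    return -math.log(max(value, floor))


# Two made-up SNPs, each with one imputation error in ten samples.
# SNP A has a low MAF; SNP B has a high MAF.
known_a, imputed_a = [0] * 9 + [1], [0] * 10
known_b = [0, 0, 0, 1, 1, 1, 1, 2, 2, 2]
imputed_b = [0, 0, 0, 1, 1, 1, 1, 2, 2, 1]

results = {}
for name, known, imputed in (("A", known_a, imputed_a), ("B", known_b, imputed_b)):
    p = genotype_distribution(known)
    q = genotype_distribution(imputed)
    results[name] = {
        "concordance": concordance(known, imputed),
        "nlkld": neg_log(kl_divergence(p, q)),
        "nlhd": neg_log(hellinger_distance(p, q)),
    }
    print(name, results[name])
```

      In this toy data both SNPs have the same concordance, yet NLKLD and NLHD are both larger for SNP B. This is the kind of case the Results describe, where the two measures separate imputations that concordance cannot. The direction and size of the gap depend on the assumed pseudocount and logarithm base.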
      Degree
      Master of Science (M.Sc.)
      Department
      Computer Science
      Program
      Computer Science
      Supervisor
      Kusalik, Anthony
      Committee
      McQuillan, Ian; Links, Matthew; Konkin, David; Keil, Mark
      Copyright Date
      June 2020
      URI
      http://hdl.handle.net/10388/12506
      Subject
      genotype imputation; K-L divergence; Hellinger distance; measures
      Collections
      • Graduate Theses and Dissertations