A comparative study of genotype imputation programs

Date

2020-01-06

Type

Thesis

Degree Level

Masters

Abstract

Background: Genotype imputation computationally infers missing genotype data and has been reported to be highly useful in genetic studies such as genome-wide association studies and genomic selection.

Motivation: Genotype imputation programs have been evaluated with a variety of measures, but some of these, such as Pearson correlation, may not be appropriate for a given context and may yield misleading results. Further, most evaluations of genotype imputation programs focus on human data. Finally, the most commonly used measure, concordance, fails to distinguish differences in performance in some cases.

Research Questions: (1) How do popular genotype imputation programs (i.e., Minimac and Beagle) perform on plant data as compared to human data? (2) Can we find measures that discriminate imputation performance when concordance does not? (3) What do these alternative measures indicate about the performance of these imputation programs?

Methods: Kullback-Leibler divergence (K-L divergence) and Hellinger distance can aid in ranking statistical inference methods, so we adopt them as evaluation measures. To amplify the signals from K-L divergence and Hellinger distance, we take their negative logarithms (i.e., negative logarithmic K-L divergence (NLKLD) and negative logarithmic Hellinger distance (NLHD)), so that larger values indicate better imputation results. With NLKLD and NLHD, we investigate the performance of two existing genotype imputation programs (i.e., Beagle and Minimac) on data from plants, specifically Arabidopsis thaliana and rice, as well as on human data. For each pair of organisms to be compared, we select data from one chromosome of each organism such that approximately the same number of samples/participants and SNPs are present for each organism. Finally, for a sensitivity analysis of the imputation programs, we vary the missing rate of the target datasets and the sample size ratio between the reference and target datasets.

Results: We demonstrate that, in the general case where single nucleotide polymorphisms (SNPs) with different minor allele frequencies (MAFs) are imputed at the same concordance, both NLKLD and NLHD capture a difference in imputation performance. This difference reflects not only how closely the imputed MAFs correspond to the known MAFs, but also the difference in chance agreement between the known and imputed genotypes. Additionally, neither Minimac nor Beagle performs consistently better on A. thaliana data than on human data, or vice versa. However, Beagle performs better on human data than on rice data. Finally, the majority of the NLKLD and NLHD results across all experimental data indicate that Minimac outperforms Beagle.

Conclusions: (1) Although neither Minimac nor Beagle consistently performs better on either plant or human data, Beagle evidently performs better on human data than on rice data; (2) NLKLD and NLHD can be more discriminating than concordance and should be considered when comparing genotype imputation programs to determine superior imputation methods; and (3) the NLKLD and NLHD results suggest that Minimac's imputation method is superior to Beagle's. Further study could confirm these trends with runs on more experimental data.
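The measures named in the abstract can be illustrated with a minimal sketch. It is not the thesis implementation: the abstract does not state which distributions are compared, the direction of the K-L divergence, or the logarithm base. The sketch assumes per-SNP genotype frequency distributions over dosages 0/1/2, K-L divergence taken from the known to the imputed distribution, and base-10 logarithms. All function names, the pseudocount, and the toy genotype vectors are hypothetical.

```python
import math
from collections import Counter

GENOTYPES = (0, 1, 2)  # assumed coding: minor-allele dosage per sample


def genotype_distribution(genotypes, pseudocount=1e-9):
    """Relative genotype frequencies, smoothed so no category is exactly zero."""
    counts = Counter(genotypes)
    total = len(genotypes) + pseudocount * len(GENOTYPES)
    return [(counts.get(g, 0) + pseudocount) / total for g in GENOTYPES]


def concordance(known, imputed):
    """Fraction of samples whose imputed genotype equals the known genotype."""
    return sum(k == i for k, i in zip(known, imputed)) / len(known)


def kl_divergence(p, q):
    """K-L divergence D(p || q) using natural logarithms."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


def hellinger_distance(p, q):
    """Hellinger distance between two discrete distributions (range 0 to 1)."""
    squared = sum((math.sqrt(pi) - math.sqrt(qi)) ** 2 for pi, qi in zip(p, q))
    return math.sqrt(squared / 2)


def neg_log10(value, floor=1e-12):
    """Negative base-10 logarithm; the floor avoids log(0) for identical inputs."""
    return -math.log10(max(value, floor))


def evaluate(known, imputed):
    """Return (concordance, NLKLD, NLHD) for one SNP under the assumptions above."""
    p = genotype_distribution(known)
    q = genotype_distribution(imputed)
    return (
        concordance(known, imputed),
        neg_log10(kl_divergence(p, q)),
        neg_log10(hellinger_distance(p, q)),
    )


# Toy SNP A (lower MAF): 90 samples with dosage 0 and 10 with dosage 1.
known_a = [0] * 90 + [1] * 10
imputed_a = [0] * 88 + [1] * 2 + [0] * 8 + [1] * 2  # 10 errors in total

# Toy SNP B (higher MAF): 50 samples with dosage 0 and 50 with dosage 1.
known_b = [0] * 50 + [1] * 50
imputed_b = [0] * 50 + [0] * 10 + [1] * 40  # 10 errors in total

for name, known, imputed in (("A", known_a, imputed_a), ("B", known_b, imputed_b)):
    conc, nlkld, nlhd = evaluate(known, imputed)
    print(f"SNP {name}: concordance={conc:.2f} NLKLD={nlkld:.3f} NLHD={nlhd:.3f}")
```

Both toy SNPs have 90 of 100 genotypes imputed correctly, so concordance cannot separate them, while NLKLD and NLHD take different values for each. The toy data only illustrates the mechanism and says nothing about Minimac or Beagle.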

Keywords

genotype imputation, K-L divergence, Hellinger distance, measures

Degree

Master of Science (M.Sc.)

Department

Computer Science

Program

Computer Science
