Predicting Phenotypes From Novel Genomic Markers Using Deep Learning
Genomic selection (GS) predicts the phenotypes of individuals from genome-wide markers in order to select candidates for the next breeding cycle. Previous GS studies have used single nucleotide polymorphism (SNP) markers to predict phenotypes with conventional statistical or deep learning models. However, these predictive models are challenged by the high dimensionality of genome-wide SNP marker data and by interactions between alleles. Thanks to recent breakthroughs in DNA sequencing and falling sequencing costs, the study of novel genomic variants such as structural variations (SVs) and transposable elements (TEs) has become increasingly prevalent. Here, we present NovGMDeep, a one-dimensional deep convolutional neural network that predicts phenotypes from novel genomic markers such as SVs and TEs. The model is designed to use these markers to mitigate the curse of dimensionality associated with SNP genotypic data in GS. The proposed model is trained and tested on samples of Arabidopsis thaliana and Oryza sativa using 3-fold cross-validation. Prediction accuracy is evaluated on the test sets using Pearson’s correlation coefficient (PCC), mean absolute error (MAE), and the standard deviation (SD) of MAE. The predicted results showed a higher correlation when the model was trained with SVs and TEs than with SNPs. NovGMDeep also achieved higher prediction accuracy than conventional statistical models. An extended study describes the effect of sample size when the proposed model is trained on different numbers of samples for SVs. PCC values were better when the model was trained on more than 700 samples. This work sheds light on the unrecognized function of SVs and TEs in genotype-to-phenotype associations, as well as their broad significance and value in crop development. Moreover, the predictions obtained here using SVs and TEs will be useful for investigating the evolution and trait architecture of A. thaliana and O. sativa.
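The evaluation protocol named in the abstract (3-fold cross-validation scored with PCC, MAE, and the SD of MAE) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration: the marker matrix is synthetic presence/absence data, the predictor is a closed-form ridge regression standing in for NovGMDeep, the SD is taken as the population standard deviation of MAE across folds, and the function names are hypothetical.

```python
import numpy as np

def pearson_cc(y_true, y_pred):
    """Pearson's correlation coefficient between observed and predicted values."""
    yt = y_true - y_true.mean()
    yp = y_pred - y_pred.mean()
    return float((yt @ yp) / np.sqrt((yt @ yt) * (yp @ yp)))

def mean_abs_error(y_true, y_pred):
    """Mean absolute error between observed and predicted values."""
    return float(np.mean(np.abs(y_true - y_pred)))

def ridge_fit_predict(X_train, y_train, X_test, alpha=1.0):
    # Stand-in predictor (assumption): closed-form ridge regression,
    # NOT the NovGMDeep convolutional network described in the abstract.
    x_mu = X_train.mean(axis=0)
    y_mu = y_train.mean()
    Xc = X_train - x_mu
    A = Xc.T @ Xc + alpha * np.eye(Xc.shape[1])
    w = np.linalg.solve(A, Xc.T @ (y_train - y_mu))
    return (X_test - x_mu) @ w + y_mu

def k_fold_evaluate(X, y, k=3, seed=0):
    """Shuffle samples, split into k folds, and score each held-out fold."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), k)
    pccs, maes = [], []
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        pred = ridge_fit_predict(X[train], y[train], X[test])
        pccs.append(pearson_cc(y[test], pred))
        maes.append(mean_abs_error(y[test], pred))
    return {
        "pcc_per_fold": pccs,
        "mae_per_fold": maes,
        "pcc_mean": float(np.mean(pccs)),
        "mae_mean": float(np.mean(maes)),
        # Assumption: SD of MAE is the spread of per-fold MAE values (ddof=0).
        "mae_sd": float(np.std(maes)),
    }

# Synthetic data (assumption): 300 samples, 50 presence/absence markers.
rng = np.random.default_rng(42)
X = rng.integers(0, 2, size=(300, 50)).astype(float)
true_effects = rng.normal(size=50)
y = X @ true_effects + rng.normal(scale=0.5, size=300)

results = k_fold_evaluate(X, y, k=3)
print(results)
```

Swapping `ridge_fit_predict` for any other model with the same train/predict shape leaves the scoring loop unchanged, which is the point of the sketch: the metrics are independent of the predictor.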
Genomic Selection, Deep Learning, Structural Variants, Transposable Elements, Computational Genomics.
Master of Science (M.Sc.)