Exploring Structural Variant Identification using Current Software, Whole-Genome Alignment Methods, and a Preliminary Study into Graph-based Alternatives

Date

2023-12-21

ORCID

0000-0003-2522-2833

Type

Thesis

Degree Level

Masters

Abstract

Structural variants (SVs) are genetic sequence rearrangements that play a significant role in many critical biological traits; however, current SV identification tools often produce substantially different outputs. Additionally, because of low alignment accuracy, most SV identification methods struggle in complex or repetitive genetic regions, which introduces errors into the SV results. This is especially concerning given the highly repetitive nature of plant genomes. Consequently, this thesis addresses four research objectives: a comparative study of several state-of-the-art SV tools, the creation of a whole genome alignment-based SV calling model, the construction of a quantitative and automated evaluation process to measure the accuracy of SV results, and a preliminary study into the patterns created by simulated SV sequences when modelled using sequence graphs.

First, this thesis proposes a Snakemake pipeline named Structural Variants - Jaccard Index Measure, or SV-JIM, to identify SVs using multiple SV callers and then reduce the disparity and improve the confidence of SV results. SV-JIM contains several existing SV callers that take raw sequencing reads or genome assemblies as input. It uses these callers as a foundation to generate SV sets supported by multiple types of evidence and results. Further, this work evaluates inter-caller consistency and examines several patterns produced by the callers' results through an aggregation approach. SV-JIM was validated using datasets from several species, including Brassica nigra, Arabidopsis thaliana, and Homo sapiens, which permitted a detailed survey of its results across genomes of different sizes. The human genome data allowed SV-JIM to be benchmarked against known SV locations to assess its precision, recall, and F1 scores. On this benchmark, the SV callers contained in SV-JIM achieved precision and recall rates as high as 67% and 90%, respectively. The benchmark served to identify top performers and provided insights into finding the optimal amount of consensus between SV callers. SV-JIM is available under the MIT license through GitHub at https://github.com/USask-BINFO/SV-JIM.

Second, this thesis proposes a software pipeline named Structural Variant Pattern Scan, or SVPS, to explore using whole genome alignment for SV detection. SVPS takes whole genome alignments (WGA) as input and detects SV locations based on patterns found in the input WGA. SVPS incorporates several quantitative and automated processes that make the verification of SV results more thorough; these were used to evaluate the precision of its results on Brassica nigra and Arabidopsis thaliana data. Using these data, SVPS demonstrated high precision rates above 90% for most SV types. In addition, the experiments used multiple whole genome alignment software configurations to study the effect of alignment sensitivity on SV results, suggesting that differences in sensitivity can reduce the granularity of alignment gaps and distort which regions are reported. SVPS is available under the MIT license through GitHub at https://github.com/USask-BINFO/SVPS.

Last, this thesis explores using k-mer and string graphs to model biological sequences and examine any patterns created by variations at known SV locations. Several basic k-mer and string graphs were constructed using simulated sequences containing a single SV to identify graph patterns that could be used to detect SV locations algorithmically. These experiments also revealed several complexities in the graphs' construction, including a string graph's tendency to represent identical subsequences using different vertices. This led to a greedy approach to their construction. Further, these experiments identified several desirable graph features to explore in future research, including providing single-base SV breakpoint resolution between vertices and allowing genetic sequences to traverse vertices in both the forward and reverse orientations.
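
As an illustration of the measures named in the abstract, the following is a minimal sketch of a Jaccard index between two SV call sets and of precision, recall, and F1 against a truth set. It assumes each SV call is reduced to a hypothetical (chromosome, type, start, end) tuple and that calls match only when identical. The function names and example data are invented for illustration; the matching rules SV-JIM actually applies (for example, any breakpoint tolerance) are not described in this abstract.

```python
from typing import Iterable, Tuple

# Hypothetical representation of one SV call: (chromosome, SV type, start, end).
SVCall = Tuple[str, str, int, int]


def jaccard_index(calls_a: Iterable[SVCall], calls_b: Iterable[SVCall]) -> float:
    """Return |A ∩ B| / |A ∪ B| for two SV call sets, using exact matching."""
    a, b = set(calls_a), set(calls_b)
    union = a | b
    if not union:
        return 0.0  # convention chosen here when both sets are empty
    return len(a & b) / len(union)


def precision_recall_f1(called: Iterable[SVCall], truth: Iterable[SVCall]):
    """Return (precision, recall, F1) of a call set against a truth set."""
    called, truth = set(called), set(truth)
    true_positives = len(called & truth)
    precision = true_positives / len(called) if called else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Invented example data, not taken from the thesis.
caller_a = {("chr1", "DEL", 100, 200), ("chr1", "INS", 500, 500), ("chr2", "INV", 1000, 2000)}
caller_b = {("chr1", "DEL", 100, 200), ("chr2", "INV", 1000, 2000), ("chr2", "DUP", 3000, 3500)}
truth = {("chr1", "DEL", 100, 200), ("chr2", "DUP", 3000, 3500)}

print(jaccard_index(caller_a, caller_b))     # 2 shared calls out of 4 distinct calls
print(precision_recall_f1(caller_a, truth))  # 1 true positive among 3 calls, 2 truth SVs
```

Exact tuple matching keeps the sketch short; real SV comparisons usually need a tolerance on breakpoint positions, which would replace the set intersection with an overlap-based matching step.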

Keywords

Structural Variants, Whole Genome Alignment, Comparative Genomics, Genetic Variation

Degree

Master of Science (M.Sc.)

Department

Computer Science

Program

Computer Science
