Application of Wavelet-based Denoising to Improve the Accuracy of Nanopore Sequencing Data

Date

2022-08-18

ORCID

0000-0001-5654-0231

Type

Thesis

Degree Level

Masters

Abstract

DNA sequencing methods in biology are divided into three generations based on their time of invention and the technology used. First-generation sequencing technologies, introduced in the 1970s, sequenced short strands of DNA, with the longest strands reaching 300 to 1000 base pairs with the Sanger method. Second-generation technologies improved on the first generation by being high-throughput, scalable and parallel. After successful genome assemblies of small and large organisms using first- and second-generation sequencing methods, the last two decades brought about third-generation sequencing technologies. Third-generation sequencing technologies focus on sequencing single nucleic acid molecules; they produce real-time, high-throughput basecalls and are scalable, low-cost and portable. Nanopore sequencing is a third-generation sequencing technology that works by measuring the change in electric current in an ionic membrane as a DNA strand passes through a nanopore embedded in the membrane. A major limitation that has prevented mass commercial adoption of nanopore sequencing is its lower accuracy compared to second-generation sequencing technologies. The aim of this project was to improve the accuracy of nanopore sequencing by reducing noise in the nanopore signal. Wavelets were used to decompose the nanopore signal, remove noise and then reconstruct the signal. The modified signal was used to train a new basecalling model. A significant difference in basecall quality, measured as mean percentage identity, was observed between the default model used by Oxford Nanopore Technologies' Guppy basecaller and our custom model trained on denoised data. An increase of 5.3% in mean percentage identity was achieved while maintaining the mean read quality of basecalls for the bacteriophage lambda dataset. Both mean percentage identity and mean read quality were more consistent overall for the custom model, with fewer low-scoring outliers.
The Haar wavelet, with a decomposition level of 4 and a threshold value of 0.04, was shown to be the most suitable wavelet candidate for denoising nanopore sequencing data. Results were validated by training and testing with and without wavelet denoising on three existing nanopore datasets.
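
The decompose, threshold and reconstruct steps described above can be illustrated with a minimal sketch. It uses the Haar wavelet, decomposition level 4 and threshold 0.04 stated in the abstract. Everything else is an assumption made for illustration and is not taken from the thesis: the function names are hypothetical, soft thresholding is assumed (the abstract does not say whether soft or hard thresholding was used), the signal is assumed to be already normalized so that a threshold of 0.04 is meaningful, and the signal length is assumed to be a multiple of 2^level. Libraries such as PyWavelets provide equivalent transforms; the abstract does not state which implementation was used.

```python
import math

SQRT2 = math.sqrt(2.0)


def haar_forward(signal):
    """One level of the orthonormal Haar transform: (approximation, detail)."""
    approx = [(signal[i] + signal[i + 1]) / SQRT2 for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) / SQRT2 for i in range(0, len(signal), 2)]
    return approx, detail


def haar_inverse(approx, detail):
    """Invert one Haar level, interleaving the reconstructed sample pairs."""
    out = []
    for a, d in zip(approx, detail):
        out.append((a + d) / SQRT2)
        out.append((a - d) / SQRT2)
    return out


def soft_threshold(coeffs, threshold):
    """Shrink each coefficient toward zero by `threshold` (assumed soft rule)."""
    return [math.copysign(max(abs(c) - threshold, 0.0), c) for c in coeffs]


def haar_denoise(signal, level=4, threshold=0.04):
    """Decompose, threshold the detail coefficients, then reconstruct."""
    if len(signal) % (2 ** level) != 0:
        raise ValueError("signal length must be a multiple of 2**level")
    approx = list(signal)
    details = []
    for _ in range(level):
        approx, detail = haar_forward(approx)
        details.append(soft_threshold(detail, threshold))
    # Rebuild from the coarsest level back to the original resolution.
    for detail in reversed(details):
        approx = haar_inverse(approx, detail)
    return approx


# Toy step signal with small alternating noise (synthetic, not nanopore data).
clean = [0.0] * 16 + [1.0] * 16
noisy = [c + (0.02 if i % 2 == 0 else -0.02) for i, c in enumerate(clean)]
denoised = haar_denoise(noisy, level=4, threshold=0.04)
```

In this toy case the noise only produces small first-level detail coefficients, which fall below the threshold and are removed, while the step between the two current levels is preserved because it is carried by the approximation coefficients.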

Keywords

nanopore sequencing, wavelets, denoising, basecall accuracy

Degree

Master of Science (M.Sc.)

Department

Computer Science

Program

Computer Science
