Comparison of DNA sequence assembly algorithms using mixed data sources

dc.contributor.advisorKusalik, Anthonyen_US
dc.contributor.committeeMemberSharpe, Andrewen_US
dc.contributor.committeeMemberZiola, Barryen_US
dc.contributor.committeeMemberMcquillan, Ianen_US
dc.creatorBamidele-Abegunde, Tejumoluwaen_US
dc.date.accessioned2010-04-14T13:33:41Zen_US
dc.date.accessioned2013-01-04T04:29:11Z
dc.date.available2011-04-15T08:00:00Zen_US
dc.date.available2013-01-04T04:29:11Z
dc.date.created2010-04en_US
dc.date.issued2010-04en_US
dc.date.submittedApril 2010en_US
dc.description.abstractDNA sequence assembly is one of the fundamental areas of bioinformatics. It involves the correct formation of a genome sequence from its DNA fragments ("reads") by aligning and merging the fragments. There are different sequencing technologies -- some support long DNA reads and the others, shorter DNA reads. There are sequence assembly programs specifically designed for these different types of raw sequencing data. This work explores and experiments with these different types of assembly software in order to compare their performance on the type of data for which they were designed, as well as their performance on data for which they were not designed, and on mixed data. Such results are useful for establishing good procedures and tools for sequence assembly in the current genomic environment where read data of different lengths are available. This work also investigates the effect of the presence or absence of quality information on the results produced by sequence assemblers. Five strategies were used in this research for assembling mixed data sets and the testing was done using a collection of real and artificial data sets for six bacterial organisms. The results show that there is a broad range in the ability of some DNA sequence assemblers to handle data from various sequencing technologies, especially data other than the kind they were designed for. For example, the long-read assemblers PHRAP and MIRA produced good results from assembling 454 data. The results also show the importance of having an effective methodology for assembling mixed data sets. It was found that combining contiguous sequences obtained from short-read assemblers with long DNA reads, and then assembling this combination using long-read assemblers was the most appropriate approach for assembling mixed short and long reads. 
It was found that the results from assembling the mixed data sets were better than the results obtained from separately assembling individual data from the different sequencing technologies. DNA sequence assemblers which do not depend on the availability of quality information were used to test the effect of the presence of quality values when assembling data. The results show that regardless of the availability of quality information, good results were produced in most of the assemblies. In more general terms, this work shows that the approach or methodology used to assemble DNA sequences from mixed data sources makes a lot of difference in the type of results obtained, and that a good choice of methodology can help reduce the amount of effort spent on a DNA sequence assembly project.en_US
dc.identifier.urihttp://hdl.handle.net/10388/etd-04142010-133341en_US
dc.language.isoen_USen_US
dc.subjectSanger sequencingen_US
dc.subjectNext generation sequencing technologiesen_US
dc.subjectDNA sequence assemblyen_US
dc.titleComparison of DNA sequence assembly algorithms using mixed data sourcesen_US
dc.type.genreThesisen_US
dc.type.materialtexten_US
thesis.degree.departmentComputer Scienceen_US
thesis.degree.disciplineComputer Scienceen_US
thesis.degree.grantorUniversity of Saskatchewanen_US
thesis.degree.levelMastersen_US
thesis.degree.nameMaster of Science (M.Sc.)en_US

Files

Original bundle
Name:
my-thesis.pdf
Size:
4.67 MB
Format:
Adobe Portable Document Format
License bundle
Name:
license.txt
Size:
905 B
Format:
Plain Text
Description: