
Exploring the Behaviour of the Hidden Markov Model on CpG Island Prediction

dc.contributor.advisorKusalik, Tonyen_US
dc.contributor.advisorHarkness, Troyen_US
dc.contributor.committeeMemberMcQuillan, Ianen_US
dc.contributor.committeeMemberWu, FangXiangen_US
dc.creatorBerg, Arnieen_US
dc.date.accessioned2013-05-28T12:00:17Z
dc.date.available2013-05-28T12:00:17Z
dc.date.created2013-04en_US
dc.date.issued2013-05-27en_US
dc.date.submittedApril 2013en_US
dc.description.abstractDNA can be represented abstractly as a language with only four nucleotides represented by the letters A, C, G, and T, yet the arrangement of those four letters plays a major role in determining the development of an organism. Understanding the significance of certain arrangements of nucleotides can unlock the secrets of how the genome achieves its essential functionality. Regions of DNA particularly enriched with cytosine (C nucleotides) and guanine (G nucleotides), especially the CpG di-nucleotide, are frequently associated with biological function related to gene expression, and concentrations of CpGs referred to as "CpG islands" are known to collocate with regions upstream from gene coding sequences within the promoter region. The pattern of occurrence of these nucleotides, relative to adenine (A nucleotides) and thymine (T nucleotides), lends itself to analysis by machine-learning techniques such as Hidden Markov Models (HMMs) to predict the areas of greater enrichment. HMMs have been applied to CpG island prediction before, but often without an awareness of how the outcomes are affected by the manner in which the HMM is applied. Two main findings of this study are: 1. The outcome of an HMM is highly sensitive to the setting of the initial probability estimates. 2. Without the appropriate software techniques, HMMs cannot be applied effectively to large data such as whole eukaryotic chromosomes. Both of these factors are rarely considered by users of HMMs, but are critical to a successful application of HMMs to large DNA sequences. In fact, these shortcomings were discovered through a close examination of published results of CpG island prediction using HMMs, and without being addressed, can lead to an incorrect implementation and application of HMM theory. A first-order HMM is developed and its performance compared to two other historical methods, the Takai and Jones method and the UCSC method from the University of California Santa Cruz. 
The HMM is then extended to a second-order to acknowledge that pairs of nucleotides define CpG islands rather than single nucleotides alone, and the second-order HMM is evaluated in comparison to the other methods. The UCSC method is found to be based on properties that are not related to CpG islands, and thus is not a fair comparison to the other methods. Of the other methods, the first-order HMM method and the Takai and Jones method are comparable in the tests conducted, but the second-order HMM method demonstrates superior predictive capabilities. However, these results are valid only when taking into consideration the highly sensitive outcomes based on initial estimates, and finding a suitable set of estimates that provide the most appropriate results. The first-order HMM is applied to the problem of producing synthetic data that simulates the characteristics of a DNA sequence, including the specified presence of CpG islands, based on the model parameters of a trained HMM. HMM analysis is applied to the synthetic data to explore its fidelity in generating data with similar characteristics, as well as to validate the predictive ability of an HMM. Although this test fails to meet expectations, a second test using a second-order HMM to produce simulated DNA data using frequency distributions of CpG island profiles exhibits highly accurate predictions of the pre-specified CpG islands, confirming that when the synthetic data are appropriately structured, an HMM can be an accurate predictive tool. One outcome of this thesis is a set of software components (CpGID 2.0 and TrackMap) capable of efficient and accurate application of an HMM to genomic sequences, together with visualization that allows quantitative CpG island results to be viewed in conjunction with other genomic data. 
CpGID 2.0 is an adaptation of a previously published software component that has been extensively revised, and TrackMap is a companion product that works with the results produced by the CpGID 2.0 program. Executing these components allows one to monitor output aspects of the computational model such as number and size of the predicted CpG islands, including their CG content percentage and level of CpG frequency. These outcomes can then be related to the input values used to parameterize the HMM.en_US
dc.identifier.urihttp://hdl.handle.net/10388/ETD-2013-04-1030en_US
dc.language.isoengen_US
dc.subjectCpG islandsen_US
dc.subjectHidden Markov Modelen_US
dc.subjectsynthetic dataen_US
dc.subjectBaum-Welchen_US
dc.subjectViterbien_US
dc.subjectmethylationen_US
dc.titleExploring the Behaviour of the Hidden Markov Model on CpG Island Predictionen_US
dc.type.genreThesisen_US
dc.type.materialtexten_US
thesis.degree.departmentComputer Scienceen_US
thesis.degree.disciplineComputer Scienceen_US
thesis.degree.grantorUniversity of Saskatchewanen_US
thesis.degree.levelMastersen_US
thesis.degree.nameMaster of Science (M.Sc.)en_US
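
Illustrative note on the abstract's second finding: the record does not say which software techniques the thesis uses to scale an HMM to whole chromosomes. One widely used technique, shown here only as a sketch and not as the method of CpGID 2.0, is to run the Viterbi algorithm on log-probabilities so that products of many small numbers do not underflow. The two-state model (I = CpG island, B = background), every probability value, and the name viterbi_log are assumptions made for this example.

```python
import math

# Hypothetical two-state model: "I" = CpG island, "B" = background.
# All probabilities are illustrative assumptions, not values from the thesis.
STATES = ("I", "B")
START = {"I": 0.5, "B": 0.5}
TRANS = {
    "I": {"I": 0.9, "B": 0.1},
    "B": {"I": 0.1, "B": 0.9},
}
EMIT = {
    "I": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
    "B": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4},
}


def viterbi_log(seq, start=START, trans=TRANS, emit=EMIT):
    """Return the most likely state path for seq, computed in log space."""
    if not seq:
        return ""
    # score[s]: best log-probability of any state path ending in state s.
    score = {s: math.log(start[s]) + math.log(emit[s][seq[0]]) for s in STATES}
    backptrs = []
    for sym in seq[1:]:
        new_score, ptr = {}, {}
        for s in STATES:
            # Best predecessor of s; sums of logs replace products of probabilities.
            prev = max(STATES, key=lambda p: score[p] + math.log(trans[p][s]))
            ptr[s] = prev
            new_score[s] = (score[prev] + math.log(trans[prev][s])
                            + math.log(emit[s][sym]))
        score = new_score
        backptrs.append(ptr)
    # Trace back from the best final state to recover the full path.
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for ptr in reversed(backptrs):
        state = ptr[state]
        path.append(state)
    return "".join(reversed(path))


if __name__ == "__main__":
    print(viterbi_log("ATATATAT" + "CGCGCGCG" + "ATATATAT"))
```

Changing START, TRANS, or EMIT and re-running on the same sequence is a simple way to observe the sensitivity to initial estimates that the abstract's first finding describes; in a full pipeline these values would typically be re-estimated (for example with Baum-Welch) rather than fixed by hand.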

Files

Original bundle
Name:
BERG-THESIS.pdf
Size:
1.87 MB
Format:
Adobe Portable Document Format
License bundle
Name:
license.txt
Size:
1003 B
Format:
Plain Text