Multiple sequence alignment augmented by expert user constraints
Sequence alignment has become one of the most common tasks in bioinformatics. Most of the existing sequence alignment methods use general scoring schemes. But these alignments are sometimes not completely relevant because they do not necessarily provide the desired information. It would be extremely difficult, if not impossible, to include any possible objective into an algorithm. Our goal is to allow a working biologist to augment a given alignment with additional information based on their knowledge and objectives.In this thesis, we will formally define constraints and compatible constraint sets for an alignment which require some positions of the sequences to be aligned together. Using this approach, one can align some specific segments such as domains within protein sequences by inputting constraints (the positions of the segments on the sequences), and the algorithm will automatically find an optimal alignment in which the segments are aligned together.A necessary prerequisite of calculating an alignment is that the constraints inputted be compatible with each other, and we will develop algorithms to check this condition for both pairwise and multiple sequence alignments. The algorithms are based on a depth-first search on a graph that is converted from the constraints and the alignment. We then develop algorithms to perform pairwise and multiple sequence alignments satisfying these compatible constraints.Using straightforward dynamic programming for pairwise sequence alignment satisfying a compatible constraint set, an optimal alignment corresponds to a path going through the dynamic programming matrix, and as we are only using single-position constraints, a constraint can be represented as a point on the matrix, so a compatible constraint set is a set of points. We try to determine a new path, rather than the original path, that achieves the highest score which goes through all the compatible constraint set points. The path is a concatenation of sub-paths, so that only the scores in the sub-matrices need to be calculated. This means the time required to get the new path decreases as the number of constraints increases, and it also varies as the positions of the points change. It can be further reduced by using the information from the original alignment, which can offer a significant speed gain.We then use exact and progressive algorithms to find multiple sequence alignments satisfying a compatible constraint set, which are extensions of pairwise sequence alignments. With exact algorithms for three sequences, where constraints are represented as lines, we discuss a method to force the optimal path to cross the constraint lines. And with progressive algorithms, we use a set of pairwise alignments satisfying compatible constraints to construct multiple sequence alignments progressively. Because they are more complex, we leave some extensions as future work.
multiple sequence alignment, constraint, compatible constraint set
Master of Science (M.Sc.)