Next: Progressive Alignment:
Near Optimal Multiple Sequence Alignments Using a Traveling Salesman Problem Approach
Chantal Korostensky and Gaston Gonnet
Swiss Federal Institute of Technology
Institute of Scientific Computing
ETH Zürich, Switzerland
korosten, gonnet@inf.ethz.ch
Abstract:
We present a new method for the calculation of multiple sequence
alignments (MSAs). The input to our problem are
n protein
sequences. We assume that the sequences are related with each
other and that there exists some unknown evolutionary tree that
corresponds to the MSA. One advantage of our method is that the
scoring can be done with reference to this phylogenetic tree, even
though the tree structure itself may remain unknown. Instead of
computing an evolutionary tree, we only need to compute a circular
tour of the tree which is determined via a Traveling Salesman
Problem (TSP) algorithm. Our algorithm can calculate a near
optimal MSA and has a performance guarantee of
(where
opt is the optimal score of the MSA). The algorithm
runs in
O(
k^{2} n^{2}) time, where
k is the length of the longest
input sequence. From there we improve the alignment further.
Experimental results are shown at the end.
Introduction The computation of multiple
sequence alignments (MSAs) remains one of the major open problems
in computational biology. Possibly this is due to the complexity
as most problems associated with MSAs are NPcomplete
[20]. It is therefore desirable to develop heuristics
that try to come as close to the optimum as possible but terminate
within reasonable time and space bounds. An MSA consists of a set
of strings, where each string represents an amino acid, DNA or RNA
sequence (see Definition 1.1).
Figure:
The sequences
of the MSA on the left are related by the evolutionary tree on the
right.
A: RPCVCP___VLRQAAQ__QVLQRQIIQGPQQLRRLF_AA
B: RPCACP___VLRQVVQ__QALQRQIIQGPQQLRRLF_AA
C: KPCLCPKQAAVKQAAH__QQLYQGQLQGPKQVRRAFRLL
D: KPCVCPRQLVLRQAAHLAQQLYQGQ____RQVRRAF_VA
E: KPCVCPRQLVLRQAAH__QQLYQGQ____RQVRRLF_AA



Example: In the MSA in Figure
1 five sequences are aligned (n=5). We
assume that the sequences are related and that we have the correct
tree. The internal nodes represent unknown ancestor sequences.
Definitions
Definition 0.1
Given is a set of sequences
with
where
is a finite alphabet. A
Multiple
Sequence Alignment (MSA) consists of a set of sequences
with
where
.
.
The sequence obtained from
by removing all
gap characters is equal to
s_{i}.
Definition 0.2
The character
or any contiguous sequence of
within an aligned sequence
is called a
gap. A gap
g has length
len(
g), where the
length is the number of contiguous dashes
.
It starts at a position
pos(
g) in sequence
and ends at position
end(
g).
Definition 0.4
Let
be the set of all possible MSAs that can be
generated for a given set of sequences
.
The
optimal MSA
is an MSA such that
w.l.o.g.
.
In many scoring functions the minimum is the optimal.
Problem 0.1 (MSA problem)
Given is a set of sequences
.
Find the
optimal MSA
for
S.
Scoring Pairwise Sequence Alignments To determine if
two sequences
are related and have a common
ancestor the sequences are usually aligned, and the problem is to
find the alignment that maximizes the probability that the two
sequences are related:
To actually calculate these probabilities, one applies a Markovian
model for sequence evolution [25,3].
This begins with an alignment of the two sequences, e.g.
VNRLQQNIVSL____________EVDHKVANYKP
VNRLQQSIVSLRDAFNDGTKLLEELDHRVLNYKP
The gaps arise from insertions (or their counterpart
deletions) during divergent evolution. The alignment is normally
done by a dynamic programming (DP) algorithm using Dayhoff
matrices
[12,32,28,1], which
finds the alignment that maximizes the probability that the two
sequences evolved from an ancestral sequence as opposed to being
random sequences. An affine gap cost is used according to the
formula
,
where a is a fixed gap cost, l is the
length of the gap and b is the incremental cost
[1,4]. More precisely, we are comparing
two possibilities
 a)
 that the two sequences arose
independently of each other (implying that the alignment is entirely arbitrary,
with amino acid i in one protein being aligned to amino acid j
in the other is occurring no more frequently than expected by
chance, which is equal to the product of the individual
frequencies with which amino acids i and j occur in the
database)

(1) 
 b)
 that the two sequences have evolved
from some common ancestral sequence after t units of evolution
where t is measured in PAM units [9].

(2) 
Definition 0.5
The score of an
optimal pairwise alignment
OPA(
s_{1},
s_{2}) of two
sequences
s_{1},
s_{2} is the score of an alignment with the
maximum score where a probabilistic scoring method
[
6,
10] is used. We refer to a pairwise
alignment of two sequences
s_{1},
s_{2} with
.

(3) 
The entries of the Dayhoff matrix are the logarithm of the
quotient of these two probabilities. Note that scores represent
the probabilities that the two sequences have a common ancestor.
The larger the score is the more likely it is that the two
sequences are homologous and therefore have a common ancestor.
Related work There are two main methods for finding
an optimal MSA:
 global methods which construct an alignment throughout the
length of the entire sequence (see below), and
 local methods which only attempt to identify an ordered series
of motifs (homologous regions) while ignoring regions between
motifs [17,18].
We are concerned with global methods and we briefly describe the
different strategies used along with some of their drawbacks.
Heuristics
Next: Progressive Alignment:
Chantal Korostensky
19990714