Multiple sequence alignment is the process of aligning several related sequences so that the conserved and unconserved residues across all of the sequences are shown simultaneously. These conserved and unconserved residues form a pattern that can often be used to retrieve sequences that are distantly related to the original group of sequences. These distant relatives are extremely helpful in understanding the role that the group of sequences plays in the process of life.
                 Sequence A
        1   2   3   4   5   6   7   8
 S     -------------------------------
 e  1 |   |   |   |   |   |   |   |   |
 q    |---|---|---|---|---|---|---|---|
 u  2 |   |   |   |   |   |   |   |   |
 e    |---|---|---|---|---|---|---|---|
 n  3 |   |   |   |   |   |   |   |   |
 c    |---|---|---|---|---|---|---|---|
 e  4 |   |   |   |   |   |   |   |   |
      |---|---|---|---|---|---|---|---|
 B  5 |   |   |   |   |   |   |   |   |
       -------------------------------
If we were to align a third sequence, sequence C, of 3 residues with the original two sequences, we would need 8x5x3=120 memory cells:
                             Sequence A
                    1   2   3   4   5   6   7   8
                    -------------------------------
                3 /   /   /   /   /   /   /   /   /|
                 /---/---/---/---/---/---/---/---/ |
  Sequence C  2 /   /   /   /   /   /   /   /   /|/|
               /---/---/---/---/---/---/---/---/ | |
            1 /   /   /   /   /   /   /   /   /|/|/|
    S        |-------------------------------| | | |
    E      1 |   |   |   |   |   |   |   |   |/|/|/|
    Q        |---|---|---|---|---|---|---|---| | | |
    U      2 |   |   |   |   |   |   |   |   |/|/|/|
    E        |---|---|---|---|---|---|---|---| | | |
    N      3 |   |   |   |   |   |   |   |   |/|/|/
    C        |---|---|---|---|---|---|---|---| | |
    E      4 |   |   |   |   |   |   |   |   |/|/
             |---|---|---|---|---|---|---|---| |
    B      5 |   |   |   |   |   |   |   |   |/
              -------------------------------
This approach is not practical for more than three average-sized protein sequences. Let's look at the memory required to align average-sized (300-residue) protein sequences:
| Sequences | Cells | Memory (4 bytes/cell) |
| 2 | 300^2 = 90000 | 351 KB |
| 3 | 300^3 = 27000000 | 105 MB |
| 4 | 300^4 = 8.1x10^9 | 31640 MB |
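The cell counts in the table follow directly from multiplying the sequence lengths together. The following is a minimal Python sketch of that arithmetic, not part of any alignment program; the function names `lattice_cells` and `lattice_bytes` are hypothetical, and the 4 bytes per cell is taken from the table header.

```python
import math


def lattice_cells(lengths):
    """Number of cells in a full alignment lattice: the product of the sequence lengths."""
    return math.prod(lengths)


def lattice_bytes(lengths, bytes_per_cell=4):
    """Raw memory for the full lattice, assuming a fixed number of bytes per cell."""
    return lattice_cells(lengths) * bytes_per_cell


# The small example from the figures: 8 x 5 x 3 residues.
print(lattice_cells([8, 5, 3]))

# Two, three and four sequences of 300 residues each.
for k in (2, 3, 4):
    lengths = [300] * k
    print(k, lattice_cells(lengths), lattice_bytes(lengths))
```

Each additional 300-residue sequence multiplies the requirement by 300, which is why the full lattice stops being usable so quickly.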
                          SEQ#03
                         /
                        /
                   278-/
                      /
                     /
                    *
                23-/ \
                  *   \
             118-/ \   \-256
                /   \   \
               * 238-\   \
              / \     \   SEQ#02
         230-/   \     SEQ#05
            / 205-\
           /       \
     SEQ#01         SEQ#04
After the joining order has been determined, sequences close to each other are aligned first. In the example above, SEQ#01 and SEQ#04 are the first two sequences to be aligned. The third sequence, SEQ#05, is then aligned with the two previously aligned sequences, SEQ#01 and SEQ#04. SEQ#02 is then aligned, followed by SEQ#03. While this approach produces adequate results for many sets of sequences, the alignment it produces depends on the joining order. Thus, joining the sequences in this order:
[ [ [ [SEQ#03 + SEQ#02] + SEQ#05] + SEQ#04] + SEQ#01]
may not produce the same alignment as joining the sequences in the original order:
[ [ [ [SEQ#01 + SEQ#04] + SEQ#05] + SEQ#02] + SEQ#03]
The advantages of this approach are that it requires only modest computer resources and that it is capable of aligning hundreds of sequences.
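The idea of joining the closest sequences first can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the procedure used by any particular program: the pairwise distances are assumed to be the sums of the branch lengths along the tree in the figure, the next sequence is chosen by smallest average distance to the sequences already joined, and the name `greedy_joining_order` is hypothetical.

```python
from itertools import combinations


def greedy_joining_order(names, dist):
    """Start from the closest pair, then repeatedly add the sequence with the
    smallest average distance to the sequences already joined."""
    def d(a, b):
        return dist[frozenset((a, b))]

    joined = list(min(combinations(names, 2), key=lambda pair: d(*pair)))
    remaining = [n for n in names if n not in joined]
    while remaining:
        nxt = min(remaining, key=lambda n: sum(d(n, j) for j in joined) / len(joined))
        joined.append(nxt)
        remaining.remove(nxt)
    return joined


names = ["SEQ#01", "SEQ#02", "SEQ#03", "SEQ#04", "SEQ#05"]

# Assumption: distance between two sequences = sum of the branch lengths
# on the path between them in the example tree.
pairs = {
    ("SEQ#01", "SEQ#04"): 230 + 205,
    ("SEQ#01", "SEQ#05"): 230 + 118 + 238,
    ("SEQ#04", "SEQ#05"): 205 + 118 + 238,
    ("SEQ#01", "SEQ#02"): 230 + 118 + 23 + 256,
    ("SEQ#04", "SEQ#02"): 205 + 118 + 23 + 256,
    ("SEQ#05", "SEQ#02"): 238 + 23 + 256,
    ("SEQ#01", "SEQ#03"): 230 + 118 + 23 + 278,
    ("SEQ#04", "SEQ#03"): 205 + 118 + 23 + 278,
    ("SEQ#05", "SEQ#03"): 238 + 23 + 278,
    ("SEQ#02", "SEQ#03"): 256 + 278,
}
dist = {frozenset(p): v for p, v in pairs.items()}

order = greedy_joining_order(names, dist)
print(order)  # ['SEQ#01', 'SEQ#04', 'SEQ#05', 'SEQ#02', 'SEQ#03']
```

Under these assumptions the sketch reproduces the original joining order described in the text. A different distance measure or selection rule could give a different order, and therefore a different alignment.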
The MSA program uses a clever approach to restrict the amount of memory by computing bounds that approximate the center of a multi-dimensional hypercube. The first bound is produced by computing pairwise alignments between the sequences in the set. Weights are usually applied to this value to produce the lower bound used by the program. Next, a heuristic alignment is produced for the sequences by a procedure similar to the progressive pairwise approach outlined above. Weights are usually applied to this value to produce the upper bound used by the program. A delta value is then computed as the difference between these two values. The epsilon values shown by the program are the computed delta value broken down per pairwise alignment. To produce good optimal alignments, epsilon and delta are the two most important parameters to pay attention to. They are preliminary measures of the divergence within the set of sequences: closely related sequences will have low epsilons and deltas, while distantly related sequences will have high epsilons and deltas.
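The relationship between the two bounds, delta, and the per-pair epsilons can be illustrated with a small Python sketch. It is not the MSA program's code: it assumes unweighted costs, the cost values are made up for illustration, and the name `bounds_summary` is hypothetical.

```python
def bounds_summary(optimal_pair_cost, heuristic_pair_cost):
    """Return (lower, upper, delta, epsilons) for unweighted pairwise costs.

    optimal_pair_cost:   cost of the best pairwise alignment for each pair
    heuristic_pair_cost: cost of each pair as it appears in the heuristic
                         multiple alignment
    """
    lower = sum(optimal_pair_cost.values())
    upper = sum(heuristic_pair_cost.values())
    epsilons = {
        pair: heuristic_pair_cost[pair] - optimal_pair_cost[pair]
        for pair in optimal_pair_cost
    }
    return lower, upper, upper - lower, epsilons


# Hypothetical costs for three sequences A, B and C (illustrative values only).
optimal = {("A", "B"): 100, ("A", "C"): 140, ("B", "C"): 120}
heuristic = {("A", "B"): 104, ("A", "C"): 151, ("B", "C"): 120}

lower, upper, delta, epsilons = bounds_summary(optimal, heuristic)
print(lower, upper, delta)  # 360 375 15
print(epsilons)
```

In this unweighted sketch the epsilons sum to delta, which mirrors the description of epsilon as delta broken down per pairwise alignment. A pair with an epsilon of zero is already aligned optimally within the heuristic alignment.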
Even though MSA dramatically reduces the space required to produce a multiple alignment, it still uses much more memory than the progressive pairwise technique. Generally speaking, MSA will produce better alignments than most multiple sequence alignment programs, such as Clustal or Pileup. The drawback of using MSA is that it requires an enormous amount of both computer time and memory to align more than a few distantly related sequences. However, we have been able to use MSA to optimally align 20 Phospholipase A2 sequences (approximately 130 residues), 14 Cytochrome C sequences (approximately 110 residues), 6 Aspartyl proteases (approximately 350 residues), and 8 Lipid binding proteins (approximately 480 residues) on our computers. All of these problems approached the limits of what can be solved optimally by the MSA program. The size of the problems that MSA can solve is directly related to the sequence lengths, the number of sequences, and the amount of sequence diversity.
| # | Sequences | Elapsed | CPU Time | Memory |
| 1 | humcetp | na | na | na |
| 2 | hupltp | 00:00:00 | 00:00:00 | 608,056 |
| 3 | rrrya3 | 00:00:24 | 00:00:24 | 632,863 |
| 4 | bovbpi | 00:01:53 | 00:01:53 | 20,432,143 |
| 5 | ratlbp | 00:20:52 | 00:20:51 | 75,296,490 |
| 6 | rry2g5 | 16:48:04 | 16:04:10 | 10,129,449,583 |