Sequence Analysis Tutorials

Multiple Sequence Alignments and Patterns

Multiple sequence alignment is the process of aligning several related sequences, showing the conserved and unconserved residues across all of the sequences simultaneously. These conserved/unconserved residues form a pattern that can often be used to retreive sequences that are distantly related to the original group of sequences. These distant relatives are extreemly helpful in understanding the role that the group of sequences play in the process of life.

Global Multiple Sequence Alignments

Global multiple sequence alignments are sequence alignments that require the participation of all sequence residues. A multiple sequence alignment shows the residue juxtaposition across the entire set of sequences; thus showing the conserved and unconserved residues across all of the sequences simultaneously.

Dynamic Programming Approach

In order to understand the multiple sequence alignment algorithms, we first, need to review aligning two sequences using the dynamic programming approach. (See the database searching tutorial for a full explaination of dynamic programming) In a straightforward implimentation, this problem would require memory proportional to the lengths of the sequences. Thus if sequence A had length L and sequence B had length M this problem could be solved using N x M cells of memory. For example if sequence A was 8 residues long and sequence B was 5 residues long, 40 memory cells would be needed:

                             Sequence  A
                     1   2   3   4   5   6   7   8           
            S       -------------------------------
            e    1 |   |   |   |   |   |   |   |   |
            q      |---|---|---|---|---|---|---|---|
            u    2 |   |   |   |   |   |   |   |   |
            e      |---|---|---|---|---|---|---|---|
            n    3 |   |   |   |   |   |   |   |   |
            c      |---|---|---|---|---|---|---|---|
            e    4 |   |   |   |   |   |   |   |   |
            B    5 |   |   |   |   |   |   |   |   |
If we were to align 3 residue third sequence, sequence C, with the original two sequences we would need 8x5x3=120 memory cells:
                                     Sequence  A
                           1   2   3   4   5   6   7   8    
                      3 /   /   /   /   /   /   /   /   /| 
                       /---/---/---/---/---/---/---/---/ |     
       Sequence C   2 /   /   /   /   /   /   /   /   /|/|
                     /---/---/---/---/---/---/---/---/ | |    
                  1 /   /   /   /   /   /   /   /   /|/|/| 
            S      |-------------------------------| | | |
            E    1 |   |   |   |   |   |   |   |   |/|/|/|
            Q      |---|---|---|---|---|---|---|---| | | |
            U    2 |   |   |   |   |   |   |   |   |/|/|/|
            E      |---|---|---|---|---|---|---|---| | | |
            N    3 |   |   |   |   |   |   |   |   |/|/|/
            C      |---|---|---|---|---|---|---|---| | | 
            E    4 |   |   |   |   |   |   |   |   |/|/ 
                   |---|---|---|---|---|---|---|---| |
            B    5 |   |   |   |   |   |   |   |   |/

This approach is not pratical for more than three average sized protein sequences. Lets look at the memory required to align average sized (300 residue) protein sequences:

Cells Memory (4 bytes/cells)
2 300^2 = 90000 351Kb
3 300^3 = 27000000 105Mb
4 300^4 = 8.1x10^9 31640Mb


Progressive pairwise approach

The progressive pairwise approach relies on exhaustive pairwise alignments between all of the sequences to produce a measure of sequence relatedness. From this measure, an algorithm (UPGMA in Pileup, Neighbor Joining in ClustalW) is used to develop a joining order. This joining order corresponds to a tree that is used to produce the multiple sequence alignment. It should be noted that this tree is not an evolutionary tree - Do not make the mistake of using it as one.

                          23-/ \
                            *   \
                       118-/ \   \-256
                          /   \   \ 
                         * 238-\   \
                        / \     \ SEQ#02  
                   230-/   \   SEQ#05
                      / 205-\
                     /       \
                  SEQ#01    SEQ#04

After the joining order has been determined, sequences close to each other are aligned first. In the example above, SEQ#01 and SEQ#04 are the first two sequences to be aligned. The third sequence, SEQ#05, is then aligned with the two previously aligned sequences,SEQ#01 and SEQ#04. SEQ#02 is then aligned, followed by SEQ#03.

While this approach produces adequate results for many sets of sequences, The alignment produced by the procedure will vary dependent on the joining order. Thus, joining the sequences in this order:

      [ [ [ [SEQ#03 + SEQ#02] + SEQ#05] + SEQ#04] + SEQ#01]

may not produce the same alignment as joining the sequences in the original order:

      [ [ [ [SEQ#01 + SEQ#04] + SEQ#05] + SEQ#02] + SEQ#03]
The advantages to this approach are that it requires only modest computer resources and that it is capable of aligning hundreds of sequences.

Modified dynamic programming approach

If we had a good global alignment between two related sequences, the memory cells that contain this alignment would be near the center diagonal of the two dimensional grid. Likewise if we had a good global alignment between three related sequences we would expect its alignment path to make use of the memory cells near the center of the cube.

The MSA program uses a clever approach to restrict the amount of memory by computing bounds that approximate the center of a multi-dimensional hypercube. The first bound is producing by computing pairwise alignments between the set of sequences. Weights are usually applied to this value to produce the lower bound used by the program. Next a heuristic alignment is produced for the sequences. This heuristic alignment is produced by a procedure similar to progressive pairwise approach outlined above. Weights are usually applied to this value to produce the upper bound used by the program. A delta value is then computed to be the difference between these two values. The epsilon values shown by the program is the computed delta value broke down per pairwise alignment. To produce good optimal alignments, epsilon and delta are the two most important parameters that you need to pay attention to. The delta and epsilon values are preliminary measures of the divergence between the set of sequences. Thus, closely related sequences will have low epsilons and deltas while distantly related sequences will have high epsilons and deltas.

Even though MSA reduces the space required to produce a multiple alignment dramatically, it is still uses much more memory than the progressive pairwise technique. Generally speaking, MSA will produce better alignments than most multiple sequence alignment programs such as Clustal or Pileup. The drawback with using MSA is that it requires an enormous amount of both computer time and memory to align more than a few distantly related sequences. However, we have been able to use MSA to optimally align 20 Phospholipase A2 sequences (approximately 130 residues), 14 Cytochrome C sequences (approximately 110 residues), 6 Aspartal proteases (approximately 350 residues), and 8 Lipid binding proteins (approximately 480 residues) on our computers. All of these problems approached the limits of the problems that can be solved optimally by the MSA program. The size of the problems solved by MSA are directly related to the sequence lengths, the number of sequences, and the amount of sequence diversity

An Example For best results running the program, first produce only a heuristic alignment with your set of sequences. Examine the alignment and the epsilons. If you are only using a few short sequences and the computed epsilons are relatively low (epsilons < 50) then the sequences are closely related. Continue on to produce an optimal alignment. If you are using a more complicated set of sequences, a ramping strategy is suggested. First, align three of your sequences optimally, than four, then five ... etc., until you exceed 32Mw of memory. Then double, triple, or quadruple the memory requirements and add one last sequence. You may or may not be able to get this last sequence to align. If you are dealing with long sequences, you may want to divide your sequences into two or three subregions, and align the subregions sepparately. The chart below illustrates using this strategy on a DEC 8400:

# Sequences Elapsed CPU Time Memory
1 humcetp na na na
2 hupltp 00:00:00 00:00:00 608,056
3 rrrya3 00:00:24 00:00:24 632,863
4 bovbpi 00:01:53 00:01:53 20,432,143
5 ratlbp 00:20:52 00:20:51 75,296,490
6 rry2g5 16:48:04 16:04:10 10,129,449,583


Patterns dervied from aligned families

Early patterns were first reported as consensus sequences These patterns were essentially composite sequences consisting of the most common residue ocouring at a position in an alignment. Today, these patterns are of little use unless if the sequences are highly conserved. A later approach stored the pattern as a regular expression. A regular expression is much more flexible than a consensus sequence because more than one residue can be stored at each position. There are many patterns that can be described as regular expressions. Many of these excellent patterns can be found in the PROSITE dictionary of sites and patterns. The GCG program motifs makes use of this data. The findpatterns program can be used to search a database for an ambiguous pattern. A more recent approach is to use a weight matrix to represent the pattern. This approach is much more sensitive to patterns that are not strongly conserved. The profile analysis method can be used to create a weight matrix patterns from a sequence alignment. These patterns, (called profiles) can then be used to search the sequence databases for additional sequences that contain the pattern. The profilemake program can be used to create a profile withe the profilesearch or PROFILE-SS programs can be used to scan a database for a profile. The profilescan program can be used to see if a sequence contains a pattern placed in the profile library.

Patterns derived from unaligned sequences

Deriving patterns from unaligned sequences is a new area of research. One successful method uses an information content measure to discover patterns. This approach is incorporated into the MEME program. MEME essentially finds a position-dependent letter frequency matrix that is similar to a gapless profile. This gapless profile can then be used to search a database for copies of the profile. An experimental approach being used today is that of a hidden markov model. At the current time software using this technique is still very experimental. We anticipate that in a few years this approach may become useful technique.

[an error occurred while processing this directive]