Sequence Analysis Tutorials

A Tutorial on Searching Sequence Databases and Sequence Scoring Methods

Developed by the Biomedical Supercomputing Initiative of the Pittsburgh Supercomputing Center, An NIH Supported Resource Center

Hugh B. Nicholas Jr., David W. Deerfield II., and Alexander J. Ropelewski

January 1997 (Revised March 1998)

Database Searching: An Overview

Database searching is different from sequence alignment. Although a database search typically yields several alignments as part of its results these alignments are not really our goal. Our goal is the scores accompanying these alignments. We want to use these scores to distinguish sequences related to our query sequence from sequences that are not related to our query sequence. Because of this we will select parameters for database searching that give us the scores that best distinguish the related from the unrelated sequences in the database. Generally, these are not the parameters that will give us the most accurate alignment, the alignment that is our best estimate of the actual history of substitutions, insertions, and deletions in the two sequences.

Database searching is the application of knowledge gained from previous experiments to the problem of discovering the biochemistry and physiology of a newly discovered gene or its protein product. This is exactly analogous to using previously determined knowledge to carry out a laboratory experiment. Thus database searching is a scientific activity in its own right. As such it demands the same careful thought and execution as our bench or laboratory investigations. Careless work, just as in the laboratory, will lead to wrong answers and missed discoveries.

We apply this previous knowledge to finding a gene or protein related to our newly determined sequence through a common evolutionary ancestor, an homologous gene or protein. Evolution is a conservative process. Although the sequence residues may change, the chemical and physical properties necessary to maintain very similar biochemical and physiological processes are conserved. Thus, if we are able to discover an already characterized sequence in the database that is homologous to our newly determined gene or protein we may conclude that our newly determined sequence carries out biochemical and physiological processes very much like those of the well characterized sequence in the database.

Homology, relatedness through a common evolutionary ancestor, is not directly observable. What we observe in sequence database searching is sequence likeness or similarity. If the similarity is great enough we are allowed to make the scientific inference that the two sequences are homologous. So, much of the previously determined knowledge we apply in database searching involves how to best measure sequence likeness and how to determine whether the observed degree of sequence similarity or likeness is sufficient to allow us to infer that the sequences are homologous.

Sources of Previous Knowledge in Database Searching

The similarity scores used to assess sequence likeness are the most important source of previous knowledge in database searching. These include the widely used PAM and Blosum similarity matrices used in protein database searches as well as PAM matrices and other scores used to search nucleic acid databases. These matrices incorporate many observations of which amino acids have replaced each other while the proteins were evolving in different species but still maintaining the same biochemical and physiological functions. They rescue us from the ignorance of having to assume that all amino acid changes are equally likely and equally harmful. Different similarity matrices are appropriate for different degrees of evolutionary divergence. Any matrix is most likely to find good matches with other sequences that have diverged from your query sequence to the extent for which the matrix is suited (4). Similar matrices are available, if not widely used, for DNA. The DNA matrices can incorporate knowledge about differential rates of transitions and transversions in the same way that some substitutions are judged more favorable than others in protein similarity matrices.

The computer algorithm or program used in searching the database is the second most important source of previous knowledge. The different algorithms, Smith-Waterman, FASTA, or BLAST add different restrictions to the simple model of sequence evolution on which database searching is based. Smith-Waterman is the most rigorous algorithm and does not place any heuristic restrictions on the evolutionary model. Smith-Waterman is both the most sensitive and the least selective algorithm. The actual pattern of evolutionary changes between your newly determined query sequence and any homologue in the database can be incompatible with the heuristic restrictions imposed by either BLAST or FASTA. Alternatively the additional selectivity that results from these restrictions can sometimes be an advantage. There is no one program that is always best at finding distantly related sequences for all gene and protein families, although the Smith-Waterman algorithm is most often the best (2,3).

Finally, the sequence database itself represents a large store of previously acquired knowledge. Making the best use of this knowledge can save us many months of expensive laboratory experimentation and allow us to use our limited resources to acquire truly new knowledge. The size of this potential gain is the determining factor in deciding how much effort to devote to any particular database search.

Database Searching

The Assumptions

Applied Considerations


Database Searching Programs: Fitting the Model

The dynamic programming algorithm provides a rigorous mathematical approach towards sequence alignment. It is guaranteed to find the best alignment between a pair of sequences given a particular choice of scoring matrix and gap penalties (17,18,21,22). There are several variants of the dynamic programming algorithm that yield different kinds of alignments. The dynamic programming algorithm was first proposed for sequence alignment by Needleman and Wunsch (21) and independently by Sellers (22). These equivalent variants of dynamic programming provide a global alignment between two sequences. A global alignment aligns the entire length of both sequences.

The variant known as the Smith-Waterman algorithm (17,18) yields a local alignment. A local alignment aligns the pair of regions within the sequences that are most similar given the choice of scoring matrix and gap penalties. Database searches generally use local alignments as opposed to the global alignment. This allows the database search to focus on the most highly conserved regions of the sequences without having to overcome interference from less well conserved regions of the sequences. It also allows similar domains within sequences to be identified, such as nucleotide binding domains or kringle domains, even though the sequences may not be related over their entire length. Thus, in what follows, we will focus on the Smith-Waterman algorithm and the local alignments that result from its use. Bear in mind that it may be appropriate to use the Needleman-Wunsch or Sellers algorithm for the final alignment with your query sequence of sequences discovered by a local alignment database search.

Smith-Waterman: A Rigorous Fit

The Smith-Waterman algorithm does not impose any additional restrictions on the model of sequence evolution used in database searching. The Smith-Waterman algorithm places no restriction on the alignment it reports other than that it have a positive score in terms of the similarity table used to score the alignment (17). Biologically, this means that the weights or scores assigned to replacements that occur more frequently than expected at random must outweigh those assigned to the replacements that occur less often than expected at random. In other words the preponderance of the evidence is in favor of the two aligned sequence sections being homologous (although the preponderance may not be great enough to justify inferring that the sequences are homologous). Both BLAST (20) and FASTA (19) place additional restrictions on the alignments that they report in order to speed up their operation. Because of this Smith-Waterman is more sensitive than either BLAST or FASTA (3).

Smith-Waterman is mathematically rigorous, it is guaranteed to find the best scoring alignment between the pair of sequences being compared. (17) It does this by constructing a two dimensional table of partial alignment scores. The tables, as show below has one dimension or axis for each sequence. Each cell in the table contains the score for the best partial alignment that terminates with the pair of sequence residues (one from each sequence) that correspond to that cell in the table. That best scoring partial alignment will be extended to subsequent cells in the table only when it is the prior cell that results in the best scoring partial alignment for the subsequent cell. In this way all possible alignments are considered until they are proven inferior to a competing alignment that also involves aligning at least one of the same pairs of sequence residues. The final alignment is thus the best scoring alignment possible.

Because of this mathematical rigor and lack of restrictions the Smith-Waterman algorithm is more sensitive than either BLAST or FASTA. This additional sensitivity comes at the price of being a very much slower way to search a sequence database than are either BLAST or FASTA (4). Because of this Smith-Waterman is most often run on either a supercomputer or sometimes special purpose hardware is purchased. The examples shown here with a 470 amino acids query sequence searching 89,912 sequences with 28,507,787 amino acids took between 20 and 25 minutes on a single processor of the Cray C-90 at the Pittsburgh Supercomputing Center.

How Smith-Waterman is computed

It is worth the effort to understand how Smith-Waterman alignments are actually computed. First, users should be convinced that the algorithm is mathematically rigorous in sense described here, that it does test all possible alignments, and always does find the best scoring alignment. Smith-Waterman also provides the standard against which we compare the heuristics involved in using BLAST or FASTA. Both BLAST and FASTA are heuristic approximations to rigorous algorithms. FASTA (19) is an approximation of Smith-Waterman and BLAST (20) is an approximation to a simplified version of Smith-Waterman with the effect of gaps removed, called maximal segment pair alignments.

The Smith-Waterman is easily described in a recursive, mathematical equation (17,18). Recursive means that the results are computed in steps with any subsequent step depending on the answers to previous steps. The Equation is written:

SWi,j = max{ SWi-1,j-1 + s(ai,bj); SWi-k,j + gj; SWi,j-k + gi; 0 }

SWi,j is the Smith-Waterman score for the partial alignment ending at residue i of sequence a and residue j of sequence b. In computing the Smith-Waterman score we must consider the four terms in the equation separated by semicolons and select the term with the maximum value, the highest possible score at that point. The first term, SWi-1,j-1 + s(ai,bj), corresponds to extending the alignment by one residue from each sequence. This is illustrated by the black arrow in the figure below which schematically shows the two dimensional table, described above, used in computing Smith-Waterman alignments. The second term, SWi-k,j + gj, describes extending the alignment by including residue j from sequence b and inserting a gap of k residues in length, aligned to end with residue j of sequence b, into sequence a. This corresponds to the vertical green and gold arrows in the schematic below. The third term,SWi,j-k + gi, is the equivalent term for inserting a gap into sequence b and corresponds to the horizontal green and gold arrows in the schematic below. The fourth term, zero, is what distinguishes the Smith-Waterman algorithm from the currently used version of the Needleman-Wunsch algorithm. For a global alignment, the fourth term, the zero, is eliminated (21,22). That is, the partial scores within the table are allowed to become negative.

The Needleman-Wunsch algorithm aligns a pair of sequences over their entire lengths while the Smith-Waterman algorithm finds the best matching regions in the same pair of sequences. Putting the zero in the recursion is saying that if the partial alignment score becomes negative during the calculations we want to ignore that as well as ignore the preceding calculations and start over from a neutral score. Thus the best scoring regions do not have to overcome the effects of surrounding regions of low similarity in order to achieve a high score. The best scoring alignment is the alignment that ends at the cell in the table with the highest score.

The table below shows a completely filled in two dimensional table for computing the alignment between two short DNA sequences using a simplified version of the Smith-Waterman algorithm. The simplification is that the form of the gap penalty was simplified by using an open gap penalty of zero. This allows us to ignore the parts of the calculation corresponding to the green arrows in the implementation schematic above and limit ourselves to those calculations corresponding only to the black and gold arrows.

Smith-Waterman Alignment Table
Similarity Scores: DNA PAM 47, Match = 5, Mismatch = -4;
Open Gap = 0, Extend Gap = -7

            -   g   c   t   g   g   a   a   g   g   c   a   t
- 0 0 0 0 0 0 0 0 0 0 0 0 0
g 0 5 0 0 5 5 0 0 5 5 0 0 0
c 0 0 10 3 0 1 1 0 0 1 10 3 0
a 0 0 3 6 0 0 6 6 0 0 3 15 8
g 0 5 0 0 11 5 0 2 11 5 0 8 11
a 0 0 1 0 4 7 10 5 4 7 1 5 4
g 0 5 0 0 5 9 3 6 10 9 3 0 1
c 0 0 10 3 0 2 5 0 3 6 14 7 0
a 0 0 3 6 0 0 7 10 3 0 7 19 12
c 0 0 5 0 2 0 0 3 6 0 5 12 15
t 0 0 0 10 3 0 0 0 0 2 0 5 17
g a a g - g c a
g c a g a g c a

The cells in the table above corresponding to the best scoring alignment between the two DNA sequences are highlighted in red. The alignment is shown below the table lined up with the columns in the table corresponding to the alignment. It is a useful exercise to calculate by hand some of the values for specific cells in the table to be sure of understanding the algorithm. In doing this example the value of k in the Smith-Waterman equation above will always be one. Notice that there are an initial row and column added to the table that are composed entirely of zeros. This provides the necessary starting point for the recursive calculations described by the equation above.

FASTA: A fast approximation to Smith-Waterman

FASTA is a two step algorithm (19). The first step is a word search with a specific word size which finds regions in a two dimensional table similar to that shown for the Smith-Waterman algorithm above that are likely to correspond to highly similar segments of the two sequences. These regions are a diagonal or a few closely spaced diagonals in the table which have a high number of identical word matches between the sequences. The second step is a Smith-Waterman alignment centered on the diagonals that correspond to the alignment of the highly similar sequence segments. The region for the Smith-Waterman alignment is bounded by the window size. The window size limits the number of insertions or deletions one sequence can accumulate with respect to the other sequence in the alignment. Thus, the significant speedups in observed in a FASTA search relative to a full Smith-Waterman search is due to the prior restriction in alignment space as well as skipping the Smith-Waterman step when no diagonals are found that correspond to aolignments between highly similar sequence segments.

The FASTA algorithm is a heuristic approximation to the Smith-Waterman algorithm. The heuristics used by FASTA allow it to run much faster that the Smith-Waterman algorithm but at the cost of some sensitivity. Two heuristics are employed. Both can be interpreted as restrictions on the model of sequence evolution that is used in comparing the sequences. The first restriction is implemented by the word size parameter, usually two for proteins and six for nucleic acids. This means that FASTA constrains the evolution between a pair of sequences to preserve a number of unchanged dipeptides or hexanucleotides. The effect of this word size restriction can be seen in the dot plots shown below.

In the dot plots below each asterisks marks the beginning of a string of matches in the two short nucleotide sequence shown at the edges of the dot plot. The length of the matches thus varies between one and three nucleotides. As can be seen with a word size of one the dot plot is so noisy that it is difficult to pick out any region that might correspond to an alignment that accurately reflects homology between the sequences. However, when we shift to a word size of two the region more or less in the center of the dot plot stands out as offering the most possibilities for a long alignment with a high degree of sequence identity. In cases like this the loss of sensitivity resulting from increasing the word size has also resulted in a worthwhile increase in selectivity. In the final dot plot, with a word size of three, not only has most of the noise in the dot plot been removed but so has most of the information indicating the regions involved in the best alignment between the two sequences. Given the same scoring parameters the best alignment should be the same alignment found by the Smith-Waterman algorithm and shown above.

Dot Plot: Word Size = 1

    g  c  t  g  g  a  a  g  g  c  a  t
g * * * * *
c * *
a * * *
g * * * * *
a * * *
g * * * * *
c * *
a * * *
c * *
t * *

Dot Plot: Word Size = 2

    g  c  t  g  g  a  a  g  g  c  a  t
g * *
c *
a *
g *
a *
g * *
c *
a
c *
t

Dot Plot: Word Size = 3

    g  c  t  g  g  a  a  g  g  c  a  t
g *
c
a
g
a
g *
c
a
c
t

The first step in the FASTA algorithm is to divide the query sequence into its constituent overlapping words of length two for proteins or six for nucleic acids. This process is shown in the table below. Then as each sequence is read from the database it is also divided into its constituent overlapping words. These two list of words are compared to find the identical words in both sequences. An initial score is computed based on the number of identities concentrated within small regions of the dot plot. If this initial score is high enough a second score is computed by evaluating which of the initial identities can be joined into a consistent alignment using only gaps of less than some maximum length. This maximum lengths for gaps is set by the window size parameter. Finally, if the secondary score is high enough then a Smith-Waterman alignment is performed within the region of the dot plot defined by the concentrated identities and the window size. This third score is reported as the optimal score.

Creating a Word List for FASTA, Word Size = 6


g c t g g a a g g c a t
g c t g g a
c t g g a a
t g g a a g
g g a a g g
g a a g g c
a a g g c a
a g g c a t

The weaknesses in the FASTA approach can be shown in two pathological examples. The first example would have two proteins that share 50% identity - but the proper alignment consists of alternating match and mismatches. With a word size of two, there would be no word matches along the main diagonal of the dot plot for the sequences (although there will potentially be random or spurious word matches on the off-diagonals) and the proper alignment would probably not be found. The second case consists of two proteins that are almost identical, except the second protein has a 20 residue insertion into the middle of the sequence. If the window size is 15, then the Smith-Waterman alignment phase of FASTA will not have enough alignment space to jump the 20 residue insertion. Thus, the resulting alignment will be either the sequence prior to or after the insertion (whichever had the higher diagonal scores) and the fact that the proteins were basically identical (with only one long insertion) will be missed.

The window size discussed in the previous paragraph is the second heuristic used by FASTA. Its effect is more variable than that of the word size. If the best alignment lies entirely within the window defined by the window size and the concentrated identities there is no effect. However if the best alignment, as defined by a full Smith-Waterman analysis, goes outside the window then a lower scoring alignment will be found by FASTA. This can lead the user to conclude that the sequences are not homologous when in fact they are and homology could have been inferred from a full Smith-Waterman alignment.

In practice these pathological cases are very unlikely. However, similar cases do occur and the loss of sensitivity caused by the use of these heuristics will be seen in the examples presented at the end of the tutorial.

BLAST: A simplification of Smith-Waterman

The BLAST algorithm (20) uses a word based heuristic similar to that of FASTA to approximate a simplification of the Smith-Waterman algorithm known as the maximal segment pairs algorithm. Maximal segment pairs alignments do not allow gaps and are specified by an equation that includes only the first and fourth terms of the Smith-Waterman equation presented above. Maximal segment pair alignments have the very valuable property that their statistics are well understood (15). Thus, we can readily compute a significance probability for a maximal segment pair alignment. Recent advances in maximal segment pairs statistics allow the use of several independent segment alignments to be used in evaluating the significance probability.

The price for being able to readily compute a significance probability is that the alignments can not have gaps (15). Thus, the evolutionary model requires that there be a fairly long stretch of sequence that has evolved without insertions or deletions, or at least with a complimentary pattern of insertions and deletions that do not significantly disrupt the alignment.

The BLAST algorithm is less sensitive than Smith-Waterman and in appropriate circumstances more selectivity. For proteins the BLAST word based heuristic is more sensitive than the FASTA heuristic even though BLAST uses a word size of three for proteins while FASTA uses a word size of two. However, BLAST uses a very long word size, eleven, for nucleic acid sequences. Also the modifications to the heuristic that improve the sensitivity for protein sequences do not work as well for nucleic acid sequences because have only a four letter alphabet and the similarity scores are usually calculated with equal rates of replacement for all of the nucleotides. Thus, FASTA is more sensitive than BLAST for nucleic acid sequences and should used instead of BLAST.

The BLAST word based heuristic uses a default word size of three for proteins and eleven for nucleic acid sequences. The tables below illustrate how the BLAST heuristic differs from the FASTA heuristic using a word size of two applied to a short protein query sequence. A word size of two is used in the example to keep it to a managable size.

Creating a Word List for BLAST, Word Size = 2

Adipokinetic hormone II - Migratory locust

 q l n f s a g w

q l l n n f f s s a a g g w

In the BLAST heuristic this list of words is expanded in order to recover the sensitivity lost by matching only identical words. Any word that scores at least a minimum threshold when aligned with any of the initial list of words is added to the list. BLAST then examines the database sequences for words that exactly match any of the expanded list. The expanded below to a total list of 47 words derived from the initial 7. The expanded list contains any word that scores at least eight when aligned with the initial word and scored with the PAM 120 similarity table.

Initial
Word Expanded List
ql: ql, qm, hl, zl
ln: ln, lb
nf: nf, af, ny, df, qf, ef, gf, hf, kf, sf, tf, bf, zf
fs: fs, fa, fn, fd, fg, fp, ft, fb, ys
sa: no words score 8 or more, including the initial word sa
ag: ag
gw: gw, aw, rw, nw, dw, qw, ew, hw, iw, kw, mw, pw, sw,
tw, vw,bw, zw, xw

One noteworthy feature of the table above is that there is no word that scores eight or more when aligned with the initial word "sa". This situation does occur in actual BLAST searches. The user has the option to force the initial word into the final list. The default is not to include such low scoring words because they contain so little information that they are unlikely to be critical in finding the maximal segment pair alignment. In practice, BLAST has used a word length of 3 for protein database searches with a required threshold score of 13 using the Blosum62 similarity scoring matrix.

Recent Developments in BLAST: More Sensitivity and Gaps

The inventors of the BLAST algorithm have continued to look for ways to improve both the sensitivity and the speed of their algorithm (23). As we will see later, the growth of the sequence databases has raised the miniimum score and hence the length of alignment that must be found by BLAST for a match to be significant. For these longer alignments both the speed and the sensitivity of BLAST can be improved by requiring that the algorithm find two matches above some (lower) threshold rather than one match. Both matches must be on the same diagonal. The increase in speed results from improved specificity so that fewer sequences are completely evaluated In practice BLAST now looks for two words of length 3 that each score at least 11 using the Blosum62 similarity scoring matrix. The two matches must be within 40 amino acids of each other on the same diagonal.

Another recent improvement in the BLAST algorithm is to provide a rigorous gapped alignment as part of the results. The first steps that BLAST made towards providing gapped alignments was to find a number of high scoring segments in addition to the maximal scoring segment. A later version used a strategy similar to that found in FASTA. In this strategy the maximal scoring segment is used to define a band in the path graph and the Smith-Waterman (17) algorithm is used to find a gapped alignment that lays entirely within the band. However, as noted in the discussion of FASTA, this strategy can produce a less than optimal Smith-Waterman alignment if the number of gaps needed in the best gapped alignment is greater than the width of the band.

The new gapped BLAST gets around this problem and still avoids the high computational cost of a full Smith-Waterman alignment on the pair of sequences by building the alignment out from a central high scoring pair of aligned amino acids in a way analogous to BLAST extends the initial maximal segment pair alignment. The initial pair of aligned amino acids is chosen as the middle pair of the highest scoring window of 11 amino acids in the high scoring segment pair alignments. The Smith-Waterman alignment is then extended in all directions in the path graph until it falls below a fixed percentage of the highest score yet computed in the Smith-Waterman phase.

This method of computing the Smith-Waterman gapped alignment will always find the best scoring Smith-Waterman alignment if two conditions are met. First, the calculation is extended until a lower bounds of a score of zero is reached. Using a higher threshold for stopping the calculation accepts a very small risk of not finding the complete alignment in return for a very large savings in computer resources. The second criterion that must be met is that the initial pair of amino acids selected as the midpoint from which to extend the alignment must actually be part of the alignment that would be reported as the best alignment by a complete Smith-Waterman alignment of the pair of sequences. Again the selection criteria used by gapped BLAST appear to be very effective in selecting an appropriate pair of amino acids from which to extend the alignment.

The gapped BLAST is an superb database searching tool doing a very good job of identifying homologous sequences and yielding a very informative gapped alignment. However, it is probably prudent to use a Smith-Waterman program to do a complete Smith-Waterman (17) analysis on the pair of sequences before publishing the alignment. Indeed, we would suggest that a complete analysis would make use of the Waterman-Eggert (18) extensions to Smith-Waterman, the MaxSegs algorithm, to look at the best several independent local alignments and to examine each sequence for repeated motifs.


Similarity Scores

In order to use any of these search programs, one must have a quantification of whether the substitution of one amino acid for another is likely to conserve the physical and chemical properties necessary to maintain the structure and function of the protein or is more likely to disrupt essential structural and functional features of the protein. Numerous bases have been used in creating similarity tables: explicit or implicit (empirical) evolutionary models, structural properties such as Chou-Fasman propensities, chemical properties such as charge, polarity, and shape, as well as combinations like those used in the Structural-Genetics Matrix. Regardless of the underlying bases, all similarity tables are attempts to quantify whether a mutation preserves or disrupts the function of a protein. In the following discussion we will describe the bases for several early similarity matrices followed by a presention of the theory and methodology for the creation of the two most commonused empirical evolution based tables. The same methodology and approaches can be used to construct similarity tables from different bases.

Early Matrices

Early sequence alignment programs used the unitary scoring matrix. A unitary matrix scores all matches the same and penalizes all mismatches the same. Although this scoring is sometimes appropriate for DNA and RNA comparisons, for protein alignments using a unitary matrix amounts to proclaiming ignorance about protein evolution and structure. Thirty years of research in aligning protein sequences have shown that different matches and mismatches among the 400 amino acid pairs that are found in alignments require different scores.

Many alternatives to the unitary scoring matrix have been suggested. One of the earliest suggestions was scoring matrix based on the minimum number of bases that must be changed to convert a codon for one amino acid into a codon for a second amino acid. This matrix, known as the minimum mutation distance matrix, has succeeded in identifying more distant relationships among protein sequences than the unitary matrix approach. The minimal mutation distance matrix is an improvement because it incorporates knowledge about the process of generating mutations from one amino acid into another. However it still ignores the processes of selection that determine which mutations will survive in a population.

Another improvement over the unitary matrix is a scoring matrix based on selected physical, chemical, or structural properties shared and not shared by the 400 pairs of amino acids. Specific instances of this approach work well for some sequences, but not so well for other sequences. The approach works best if the matrix is based upon properties that have been strongly conserved during the evolution of your sequences. This reflects that the properties' matrix attempts to specify the criteria that determine whether or not a mutation can survive and be fixed in a population. However, this approach suffers from problems with balancing the contributions of the different properties to the positive selection of mutations and from ignoring the different rates at which different mutations are generated.

Empirical Evolutionary Matrices and log-odds scores

The biggest improvement achieved over the unitary matrix (and other theoretically based matrices) was based on the empirical study of the evolutionary replacements of amino acids. Margaret Dayhoff pioneered this approach in the 1970's. She made an extensive study of the frequencies in which amino acids substituted for each other during evolution. The studies involved carefully aligning all of the proteins in several families of proteins and then constructing phylogenetic trees for each family. This approach incorporates both the generation and selection of mutations and has been very successful sequence alignment applications. Below we present more details of two widely used families of these empirically based matrices. .

Summary of properties

Amino Acid Similarity Scores

Calculating Similarity Scores

To help us understand the knowledge incorporated in amino acid similarity scores we should briefly look at how they are calculated (4). First we compute an amino acid similarity ratio, Rij for every pair of amino acids i and j.

Rij = qij / pipj

Where qij is the relative frequency with which amino acids i and j are observed to replace each other in homologous proteins. pi and pj are the frequencies at which amino acids i and j occur in the set of proteins in which the substitutions are observed. Their product, pipj, is the frequency at which they would be expected replace each other if the replacements were random. If the observed replacement rate is equal to the theoretical replacement rate, then the ratio is one ( Rij = qij / pipj = 1.0 ). If the replacements are favored during evolution (i.e. a conservative replacement) the ratio will be greater than one and if there is selection against the replacement the ratio will be less than one.

The similarity reported in the evolutionary-based tables for any pair of amino acids i and j, Sij is the logarithm to the base 2 of this ratio, Rij, although it is often scaled by some constant factor.

Sij = log2( Rij ) = log2( qij / pipj )

Scores above zero (Sij > 0.0) indicate that two amino acids replace each other more often during evolution than we would expect if the replacements were random. Likewise, scores below zero indicate that amino acids replace each other less often than we would expect if the replacements were random. Thus a positive alignment score means that the pattern of identities and substitutions described by an alignment are more likely to result from previously observed evolutionary processes than to result from random replacements.

Similarity Scores computed on a Basis Other than Empirically Observed Replacements

It is important to note that any matrix of similarity scores is implicitly a matrix of log-odds scores like those described above (4). This includes scores that were explicitly developed from other kinds of data such as physical and chemical properties or the minimum number of differences between codons specifying the two amino acids. To see this, consider that given the amino acid composition of an appropriately chosen sequence database, perhaps the one we are going to search, we can derive a random replacement model of pi and pj. Given this and the matrix of similarity scores we can solve the equations above for the implicit replacement frequencies, qij. This set of replacement frequencies determine which amino acids will be aligned with each other during a database search. Searching a database with any set of similarities is equivalent to making the assumption that evolution has substituted amino acids for each other in the pattern described by the implied replacement frequencies, qij. Thus the sequences that have the best scores in a database search will be those related to the query sequence by the pattern of substitutions implied in the scores. If this pattern is different from the pattern of replacements that actually occurred during the evolution of the sequences, the wrong sequences will be reported as possibly homologous as a result of the database search.

Counting the Replacements

In light of the critical role of the observed replacement frequencies in computing similarity scores the manner in which the replacements are counted takes on crucial importance. Differences in the way replacements are counted is one of the biggest differences between the two most widely used families of similarity matrices, the PAM matrices and the more recently developed Blosum matrices. The PAM matrices use counts derived from an explicitly tree like, branching evolutionary model. The Blosum matrices use counts directly derived from highly conserved blocks within an alignment. In either case the first step in counting the replacements is to create an accurate alignment.

The PAM or Dayhoff Family of Matrices

Margaret Dayhoff and her co-workers performed the first careful, systematic study to create the first amino acid similarity matrix, the Point Accepted Mutation (PAM) similarity matrix. In computing the PAM matrices the alignment was created from a limited set of closely related sequences. The alignment was a global alignment, that is, it encompassed the entire length of the sequences. Thus both highly conserved regions and highly variable regions are included in the alignments and used in counting replacements. After the sequences are carefully aligned an explicit evolutionary tree is constructed. Next, ancestral sequences are computed for each of the internal nodes in the tree using the principle of minimum replacements or maximum parsimony. Finally, replacements are counted along each leg of the tree as shown in the figure below.

The parentheses in the ancestral sequences in the figure above mark ambiguous sequence residues that are not uniquely defined. The highlighted single letter amino acids codes connected by arrows show unambiguous amino acid replacements along the branch of the tree marked by the arrow. Note that in this process the amino acid replacements are not symmetrical but are directional, that is the replacement is the descendant amino acid replacing the ancestral amino acid. After the directional table is completed it is made symmetrical by averaging reciprocal entries. These symmetrical values are then normalized to create the relative replacement frequencies, qij used in computing the PAM similarity scores (5).

The calculation of the PAM similarity matrices is normalized so that the initial calculation yields the PAM 1 matrix. The PAM 1 matrix incorporates the amino acid replacements that would be expected if one mutation had occurred per 100 amino acids of sequence. This is a very small amount of change and the sequences would be almost identical. For database searching or aligning less closely related sequences it is better to have a similarity matrix that reflects more underlying mutations. These matrices can be computed from the PAM 1 matrix by raising it to the appropriate power. The PAM 250 matrix, originally created by Margaret Dayhoff is shown below. This matrix is appropriate for searching for sequence that have diverged by 250 PAMs, 250 mutations per 100 amino acids of sequence. Because of back mutations and silent mutations this corresponds to sequences that are about 20 percent identical.

PAM 250 Amino Acid Similarity Matrix

Cys  12
Gly -3 5
Pro -3 -1 6
Ser 0 1 1 1
Ala -2 1 1 1 2
Thr -2 0 0 1 1 3
Asp -5 1 -1 0 0 0 4
Glu -5 0 -1 0 0 0 3 4
Asn -4 0 -1 1 0 0 2 1 2
Gln -5 -1 0 -1 0 -1 2 2 1 4
His -3 -2 0 -1 -1 -1 1 1 2 3 6
Lys -5 -2 -1 0 -1 0 0 0 1 1 0 5
Arg -4 -3 0 0 -2 -1 -1 -1 0 1 2 3 6
Val -2 -1 -1 -1 0 0 -2 -2 -2 -2 -2 -2 -2 4
Met -5 -3 -2 -2 -1 -1 -3 -2 0 -1 -2 0 0 2 6
Ile -2 -3 -2 -1 -1 0 -2 -2 -2 -2 -2 -2 -2 4 2 5
Leu -6 -4 -3 -3 -2 -2 -4 -3 -3 -2 -2 -3 -3 2 4 2 6
Phe -4 -5 -5 -3 -4 -3 -6 -5 -4 -5 -2 -5 -4 -1 0 1 2 9
Tyr 0 -5 -5 -3 -3 -3 -4 -4 -2 -4 0 -4 -5 -2 -2 -1 -1 7 10
Trp -8 -7 -6 -2 -6 -5 -7 -7 -4 -5 -3 -3 2 -6 -4 -5 -2 0 0 17
Cys Gly Pro Ser Ala Thr Asp Glu Asn Gln His Lys Arg Val Met Ile Leu Phe Tyr Trp

The PAM 250 matrix above has been arranged so that similar amino acids are close to each other. This gives rise to regions along the diagonal of the matrix that contain only positive scores. The colored regions in the figure above mark one possible grouping of such positive scores. These regions provide an objective basis for defining conservative substitutions, namely as amino acids that replace each other more frequently than would be expected from random replacements. Note that the amino acids that make up these regions can change at different levels of sequence divergence. The diagonal terms of the matrix vary appreciably. This variation reflects both how often an amino acid is found in protein sequences and how often it is observed to be replaced by other amino acids. Thus rare amino acids which are replaced infrequently have the highest scores when they are conserved in an alignment.

The Blosum Family of Matrices

There are three principal differences between the Blosum and PAM matrices (7). The first difference is that the PAM matrices are based on an explicit evolutionary model (that is, replacements are counted on the branches of a phylogenetic tree), whereas the Blosum matrices are based on an implicit rather than explicit model of evolution. The second difference is the sequence variability in the alignments used to count replacements. The PAM matrices are based on mutations observed throughout a global alignment, this includes both highly conserved and highly mutable regions. The Blosum matrices are based only on highly conserved regions in series of alignments forbidden to contain gaps. The last difference is in the method used to count the replacements. The Blosum procedure uses groups of sequences within which not all mutations are counted the same. A small block of lipid binding proteins, taken from the Blocks database, is shown below and used to illustrate the process of counting replacements that underlies the Blosum similarity matrices.

Counting Replacements for a Blosum 70 Similarity Matrix

 Bpi Bovine  npGivaRItqkgLdyacqqgvltlQkele
 Bpi Human   npGvvvRIsqkgLdyasqqgtaalQkelk
Cept Human   eaGivcRItkpaLlvlnhetakviQtafq
 Lbp Human   npGlvaRItdkgLqyaaqegllalQsell

Lbp Rabbit npGlitRItdkgLeyaareg
llalQrkll
^
Replacement         Counts
v » t = 0.0 (within thesame group)
v » a = 0.5 * 1.0 = 0.5
v » l = 0.5 * (0.5 + 0.5) = 0.5
t » a = 0.5 * 1.0 = 0.5
t » l = 0.5 * (0.5 + 0.5) = 0.5
a » l = 1.0 * (0.5 + 0.5) = 1.0
l » l = 0.0 (within the same group)

The replacement counts in the table above correspond to the column in the alignment highlighted in red. The number 70 in the similarity matrix name, Blosum 70, indicates that sequences that are more than 70 percent identical within the block where replacements are being counted are grouped together and treated as a single sequence for the purpose of counting replacements. In the alignment above the two Bpi (green) sequences are grouped together as are the two Lbp (blue) sequences. Each of the three groups is shown in a different color. Replacements within a group are not counted. Thus the v,t replacement is given a count of 0.0 since they are both within the Bpi group. The v,a and t,a replacements are given a count of 0.5 since v and t are each one-half of the Bpi group. The a,l replacement gets a count of 1.0 since a and l are the only amino acids present in their respective groups at the position being counted. This counting procedure is the sum of pairs evolutionary model.

Blosum 45 Amino Acid Similarity Matrix

Gly   7
Pro  -2   9
Asp  -1  -1   7
Glu  -2   0   2   6 
Asn   0  -2   2   0   6
His  -2  -2   0   0   1  10
Gln  -2  -1   0   2   0   1   6
Lys  -2  -1   0   1   0  -1   1   5
Arg  -2  -2  -1   0   0   0   1   3   7
Ser   0  -1   0   0   1  -1   0  -1  -1   4
Thr  -2  -1  -1  -1   0  -2  -1  -1  -1   2   5
Ala   0  -1  -2  -1  -1  -2  -1  -1  -2   1   0   5
Met  -2  -2  -3  -2  -2   0   0  -1  -1  -2  -1  -1   6
Val  -3  -3  -3  -3  -3  -3  -3  -2  -2  -1   0   0   1   5
Ile  -4  -2  -4  -3  -2  -3  -2  -3  -3  -2  -1  -1   2   3   5
Leu  -3  -3  -3  -2  -3  -2  -2  -3  -2  -3  -1  -1   2   1   2   5
Phe  -3  -3  -4  -3  -2  -2  -4  -3  -2  -2  -1  -2   0   0   0   1   8
Tyr  -3  -3  -2  -2  -2   2  -1  -1  -1  -2  -1  -2   0  -1   0   0   3   8
Trp  -2  -3  -4  -3  -4  -3  -2  -2  -2  -4  -3  -2  -2  -3  -2  -2   1   3  15
Cys  -3  -4  -3  -3  -2  -3  -3  -3  -3  -1  -1  -1  -2  -1  -3  -2  -2  -3  -5  12
     Gly Pro Asp Glu Asn His Gln Lys Arg Ser Thr Ala Met Val Ile Leu Phe Tyr Trp Cys  

Summary: PAM and Blosum Matrices

The Blosum and PAM matrices are the most widely used amino acids similarity matrices for database searching and sequence alignment. In empirical tests of the effectiveness of the matrices both generally perform well (8,9). However, the Blosum matrices have most often been the better performers (8,9,3). This likely reflects the fact that the Blosum matrices are based on the replacement patterns found in more highly conserved regions of the sequences. This appears to be an advantage because these more highly conserved regions are those discovered in database searches and they serve as anchor points in alignments involving complete sequences. It is reasonable to expect that the replacements that occur in highly conserved regions will be more restricted than those that occur in highly variable regions of the sequence. This is supported by the different pattern of positive and negative scores in the two families of matrices. Some of the difference is also likely to be because the Blosum matrices are based on much more data than the PAM matrices.

The PAM matrices still perform relatively well despite the small amount of data underlying them. The most likely reasons for this are the care used in constructing the alignments and phylogenetic trees used in counting replacements and the fact that they are explicitly based on a simple model of evolution. Thus they still perform better than some of the more modern matrices that are less carefully constructed. Both the PAM and Blosum matrices generally perform better than matrices explicitly based on criteria other than observed replacement frequencies (8).

We can see the concrete result of these differences in the PAM 250 and Blosum 45 matrices shown above. These two similarity tables are directly comparable and are suitable for searching for sequences the same degree of divergence from the query sequence. They have the same amount of information per alignment position (entropy) for determining whether or not sequences are homologous. Thus differences in these scores should reflect differences in the data and model used in counting substitutions rather than any other effects.

One striking difference is among the amino acids with carboxylate and amide side chains. At physiological pH the carboxylate side chains can only act as a proton acceptor while the amide side chain can simultaneously accept and donate a proton. In the PAM 250 table these four amino acids, Asp, Asn, Glu, and Gln (along with His) form a single conservative substitution group, that is the similarity score for any pair is greater than zero. His, has two nitrogen atoms in its side chain and, like the amide side chains of Asn and Gln, can simultaneously both donate and accept a hydrogen bond. In the Blosum 45 similarity table this single group must be split into two groups: One with the carboxylate amino acids Asp and Glu, and the other with Asn, Gln, and His.

Another noticeable difference is the relationship between Ser and Thr, the alcoholic side chain amino acids, and other small amino acids. In the PAM 250 table any of the three amino acids Gly, Pro, or Ala can be added to the Ser, Thr pair to form a three member conservative substitution group. In the Blosum 45 table Ser and Thr form a two member conservative substitution group with no other possible members. The last difference we will mention is that only Phe and Tyr are members of an aromatic conservative group in the PAM 250 table while Trp is also a member in the Blosum 45 table.

Which Similarity Scores to Use

Similarity scores for database searches or sequence alignments perform much better if the similarity scores are based on replacement patterns that correspond to the degree of divergence of the sequences being aligned or discovered in the database search. In database searching using a PAM or Blosum table corresponding to an inappropriate degree of divergence can cause you to fail to discover homologous sequences that are present in the database. Therefore, a thorough database search will involve using at least 2 and most likely 3 different matrices. Using different matrices usually has a higher payoff than using different programs and search algorithms.

Comparable Blosum and PAM Tables

                                                
Percent

Sequence
Blosum PAM Identity
Tables (Entropy) Tables (Entropy) PAM Tables
 Blosum 90  (1.18)       PAM 100  (1.18)           43

Blosum 80 (0.99) PAM 120 (0.98) 38
Blosum 60 (0.66) PAM 160 (0.70) 30
Blosum 52 (0.52) PAM 200 (0.51) 25
Blosum 45 (0.38) PAM 250 (0.36) 20

The comparability of the Blosum and PAM tables shown above is based on the table entropy. The entropy as defined by information theory is the average amount of information per position in a sequence alignment that is available to determine whether or not the sequences are homologous. This amount of entropy is available only if the similarity scores used in the database search or alignment are matched for the appropriate degree of sequence divergence. We will later be able to make use of this to get a rough indication of whether or not a specific database search result is significant.

Scores for Nucleic Acids

If possible it is generally better to use an amino acid sequence for a database search (1). There are several reasons why amino acid based searches are more effective. First, because of the redundancy in the genetic code with up to six codons translated as the same amino acid there is more information in amino acid sequences than in nucleic acid sequences once the sequences have diverged beyond about 50 PAMs. While nucleic acid sequences contain more information below this limit there are other factors that can interfere with nucleic acid based searches. One of these is the compositional bias found in many organisms and organelles. Another is that some nucleic acid sequences are derived from messengers while others are genomic DNA with exons and the introns may be too short to give a significant alignment with a messenger derived sequence.

There are circumstances when there is no choice but to search with nucleic acid sequences. When you find yourself in these conditions there are additional considerations about the choice of programs. BLAST uses a very long word size, eleven, for nucleic acid sequences. Also the modifications to the heuristic that improve the sensitivity for protein sequences do not work as well for nucleic acid sequences because nucleic acids have only a four letter alphabet and the similarity scores are usually calculated with equal rates of replacement for all of the nucleotides. Thus, FASTA is more sensitive than BLAST for nucleic acid sequences and should used instead of BLAST if you want to use one of the faster searching programs.

It is also possible to use different sets of scores for nucleic acids just as been recommended for proteins. Commonly used scores for nucleic acid sequences are the PAM 47 scores assuming equal rates of transitions and transversions. This assumption leaves us with only two scores, 5 for identities or matches and -4 for nonidentities or mismatches. It is possible to create nucleic acid scores that do not assume equal rates of transitions and transversions. A PAM 50 scoring matrix that assumes a three to one transition to transversion ratio is shown below (6). If necessary it can be scaled by an appropriate constant factor to generate integer scores.

Nucleic Acid PAM 50 Scores, 3 to 1 Transition to Transversion Ratio


 A   1.36 
 G  -0.37  1.36 
 C  -1.60 -1.60  1.36 
 T  -1.60 -1.60 -0.37  1.36
      A     G     C     T

Scoring Insertions and Deletions

In most alignment and search programs, the gap penalty consists of two terms, the cost to open the gap and the cost to extend the gap. This section describes what one should consider when choosing these penalties. This form of gap penalty has been shown to give better results in database searches (3) and in producing accurate alignments (11) then a form of the penalty that includes only one of the terms.

The selection of appropriate scores for insertions and deletions, the gap penalties, is as important as selecting the similarity scores for the success of a database search. The intuititive choice of gap penalties is to use penalties designed to create the most accurate possible alignment between your query sequence and any homologues in the database. To do this we may need to accommodate several small gaps in highly variable regions of the sequence. This implies fairly small penalties for opening a gap in the alignment. However, more knowledge, experience, and thinking about the experiment you are performing with a database search indicates that the most accurate alignment may not be the correct goal.

A little thought should convince us that the appropriate goal is to distinguish any homologous sequences in the database from the vast majority of nonhomologous sequences in the database. The score for homologous sequences will be dominated by the relatively long sections of the alignment uninterrupted by gaps. Thus changing the gap penalty will have only a minor effect on the score for an alignment of your query sequence with an homologous database sequence.

Gap penalties, however, are critical in determining the expected maximum score for random sequences being compared with your query sequence (10). The number of nonhomologous sequences that achieve a particular score decreases exponentially as the score increases as long as the gap penalties are sufficiently large relative to the similarity scores used to compare the sequences. If the gap penalties are too small relative to the similarity scores the number of sequences achieving a particular score decrease as the square of the number of sequences (10). This leads to an increase in the number of relatively high scoring nonhomologous sequences reported by a database search. Unfortunately, there is no analogous effect on the scores for homologous sequences. Thus choosing gap penalties that are too small leads to matches with homologous sequences being obscured by spurious matches with nonhomologous sequences.

While we want to force alignments to have relatively few gaps we need to allow them to be fairly long. This allows a few gaps to accommodate a larger number of insertion and deletion events in the evolutionary history of the sequences. To accomplish this we should match a large penalty for creating or opening a new gap with a small penalty for lengthening or extending a gap. In this context "large" will typically be two to three times the largest negative value in the similarity table and at least as large as the largest positive value. For an integer similarity table will typically be one and should not be zero.


Measuring Statistical Significance

Using the Database Search to Create a Reference Distribution

A significance probability is an answer to the question: How often will an event at least as extreme as the one just observed happen if these events are the result of a well defined, specific, random process? For database searches the pertinent random process is evolution from independent, unrelated ancestral sequences rather than a common ancestral sequence. With respect to any newly determined query sequence most of the sequences in the database are such randomly evolved sequences.

We expect the scores for random sequences aligned with our query sequence to decay exponentially (10,12,13). That is, for any group of sequences that achieve at least some minimum score we expect only some constant fraction of them to achieve a given somewhat higher score. If the gap penalty is high enough this exponential decay is observed in practice. This exponential decay process gives a straight line when we plot the log of the number of sequences that achieve a particular score against the score. Extrapolating this line to its intercept with the score axis gives us a score higher than we might reasonably expect to see from a group of random sequences the size of the database (12,13). Thus any appreciably higher score is a statistically rare event deserving careful evaluation.

Empirically this result depends on the gap penalty being high enough (10). If the Gap penalty is too low the scores have been observed to decay as the square of the number sequences achieving at least a given score. This slower decay broadens the distribution of scores from the random majority of the database and obscures most homologous matches. It also invalidates the fitting of the scores to a logarithmic function and yields a much too high extrapolated score.

Using Information Theory and the Length of the Alignment

Every alignment has a certain amount of information per position about whether or not the aligned sequences are homologous. The amount of this information per position, called entropy, is determined by the similarity table used in computing the alignment. Every similarity table has an associated amount of entropy. This amount of entropy is a maximum and is achieved only in alignments between sequences that have diverged to the degree appropriate to the matrix (4).

Unfortunately, it is so far impossible to assign an entropy value to alignment gaps corresponding to insertions or deletions in the evolutionary history of the aligned sequences. Thus the information can be computed only for regions of the alignment uninterrupted by gaps (15,16). The amount of information associated with an uninterrupted region of the alignment is given by the length of the alignment in sequence residues multiplied by the entropy value for the scoring table. It is measured in units called bits. One bit is the amount of information needed to answer a yes or no question.

In order to make use of this amount of information in the alignment we need to know how much information we expect to find in alignments between random sequences. This answer obviously depends on the length of the query sequence and the size of the database. Without going into the details of the derivation this is the amount of information required to repeatedly divide the database into the remaining half containing the best scoring segment. This minimum amount of information for a significant match is given by the logarithm to the base 2 of the product of the length of the query sequence and the number of residues in the database (4). Thus if there is more than this amount of information in the alignment the match is statistically significant and should be investigated farther.

This statistical theory was initially developed for use with maximal segment pair alignments like those produced by the BLAST program. However, the longest uninterrupted section of an alignment produced by the Smith-Waterman or the FASTA algorithms cannot be longer than the maximum segment pair alignment between the same pair of sequences (although it may be shorter). Thus this method can safely be used to evaluate a Smith-Waterman or FASTA alignment as long as it is applied only to the longest uninterrupted section of the alignment.


Conclusions: What should I do?


Acknowledgements

This tutorial was developed as part of the Pittsburgh Supercomputing Center's role as an NIH Research Resource in biomedical supercomputing which is supported by grant RR06009 from the Division of Research Resources. This material has been presented in workshops supported by that grant on several different university compuses. The material has been presented as part of a workshop on nucleic acid and protein sequence analysis presented annually at the Pittsburgh Supercomputing Center. This workshop is funded by grant number T-15 HG00015 from the NIH's National Human Genome Research Institute. The authors of the tutorial have greatly benefited from hearing and talking with the other instructors at this workshop, Dr. Michael Gribskov from the San Diego Supercompter Center and Dr. Gary Churchill from Cornell University. We should also acknowledge Dr. Stephen Altschul whose participation in early editions of the workshop was invaluable.

Annotated Bibliography

1    Altschul, S.F., Boguski, M.S., Gish, W., and Wooten, J.C.
     1994.  "Issues in Searching Molecular Sequence Databases"
     Nature Genetics. 6: 119 - 129
     An overview of database searching and description of
     potential pitfalls with a very complete and useful
     bibliography.

2    Pearson, W.R.  1991.  "Searching Protein Sequence Libraries:
     Comparison of the Sensitivity  and Selectivity of the Smith-
     Waterman and FASTA algorithms"  Genomics.  11: 635 - 650

3    Pearson, W.R.  1995.  "Comparison of metnods for searching protein
     sequences databases"  Protein Science.  4: 1145-1160.
     A comparison of the performance of the FASTA, BLAST, and
     Smith-Waterman algorithms on a number of well characterized
     families.  Different scoring matrices and gap penalties are
     also tested as is scaling the scores to adjust for the lengths
     of the sequences.

4    Altschul, S.F.  1991.  "Amino acid substitution matrices from an
     information theoretic perspective." Journal of Molecular Biology,
     219: 555-665.
     This paper looks at the PAM and Blosum scoring matrices in the
     context of information theory and develops guidelines making
     effective use of the information encapsulated in scoring matrices.

5    Dayhoff, M.O., Schwartz, R.M., Orcutt, B.C.  1978.  "A model of
     evolutionary change in proteins."   In "Atlas of Protein Sequence
     and Structure" 5(3)  M.O. Dayhoff (ed.), 345 - 352, National
     Biomedical Research Foundation, Washington.
     This paper describes the development of the PAM family of protein
     scoring matrices.

6    States, D.J., Gish, W., Altschul, S.F.  1991.  "Improved Sensitivity
     of Nucleic Acid Database Search Using Application-Specific Scoring
     Matrices"    Methods: A companion to Methods in Enzymology 3(1):
     66 - 77.
     Scoring matrices for nucleic acid sequence that take into account
     different levels of sequence divergence and different rates of
     transversions and transitions.

7    Steven Henikoff and Jorja G. Henikoff.  1992  "Amino acid
     substitution matrices from protein blocks."  Proc. Natl.
     Acad. Sci. USA. 89(biochemistry): 10915 - 10919.
     This paper describes the calculation of the Blosum family of
     protein scoring matrices.

8    M.S. Johnson and J.P. Overington.  1993.  "A Structural Basis
     of Sequence Comparisons:  An evaluation of scoring methodologies." 
     Journal of Molecular Biology.  233: 716 - 738.
     Comparison of Amino Acid substitution matrices with visual
     representation of the the important features of the matrices for
     protein similarity and the differences between them.

9    Steven Henikoff and Jorja G. Henikoff.  1993.  "Performance
     Evaluation of Amino Acid Substitution Matrices."  Proteins:
     Structure, Function, and Genetics.  17: 49 - 61.
     Comparison of Amino Acid substitution matrices.

10   Vingron, M. and Waterman, M.S.  1994.  "Sequence Alignment and
     Penalty Choice"  Journal of Molecular Biology.  235: 1 - 12
     This paper describes how the choice of gap penalties affects the
     distribution of scores from database searching

11   Fitch, W.M. and Smith, T.F.  1983.  "Optimal sequence alignments"
     Proc. Natl. Acad. Sci. USA. 80: 1382 - 1386
     This paper describes how the choice of gap penalties affects the
     the accuracy of alignment

12   Collins, J.F., Coulson, A.F.W., and Lyall, A.  1988.  "The
     significance of protein sequence similarities"  Computer
     Applications in the Biosciences,  4: 67 - 71
     Papers 12 and 13 describe using the database itself to generate
     a distribution of random alignment scores that can be used to
     assess whether or not high scoring sequences merit further
     investigation.

13   Collins, J.F. and Coulson, A.F.W.  1988.  "The
     significance of protein sequence similarities"  Methods
     in Enzymology, vol. 183 - Molecular Evolution:  Computer
     Analysis of Protein and Nucleic Acid Sequences.  R.F.
     Doolittle (ed.)  Chapter 30, pp. 474 - 487

14   Smith, T.F., Waterman, M.S., and Burkes, C.  1985.  "The
     statistical distribution of nucleic acid similarities"
     Nucleic Acids Research  13: 645 - 656
     An empirical look at the distribution of scores to be expected
     when searching nucleic acid databases.
     
15   Karlin, S. and Altschul, S.F.  1990.  "Methods for assessing
     the statistical significance of molecular sequence features
     by using general scoring schemes"  Proc. Natl. Acad. Sci.
     USA.  87: 2264 - 2268.

16   Karlin, S. and Altschul, S.F.  1993.  "Applications and
     statistics for multiple high-scoring segments in molecular
     sequences"  Proc. Natl. Acad. Sci. USA.  90:  5873 - 5877


17   Smith, T.F. and Waterman, M.S.  1981.  "Identification of
     common molecular subsequences"  Journal of Molecular
     Biology  147: 195 - 197
     The Smith-Waterman algorithm.

18   Waterman, M.S. and Eggert, M.  1987.  "A new algorithm for
     subsequence alignments with application to tRNA-rRNA
     comparisons"  Journal of Molecular Biology.  197: 723 - 728
     Extensions to the Smith-Waterman algorithm.

19   Pearson, W.R. and Lipman, D.J.  1988.  "Improved tools for
     Biological Sequence Comparison"  Proc. Natl. Acad. Sci. USA.
     85: 2444 - 2448
     The most recent version of the FASTA algorithm.

20   Altschul, S.F., Gish, W., Miller, W., Myers, E.W., and Lipman,
     D.J.  1990.  "Basic local alignment search tool"  Journal of
     Molecular Biology.  215: 403 - 410
     The BLAST algorithm

21   Needleman, S.B. and Wunsch, C.D.  1970.  "A general method
     applicable to the search for similarities in the amino acid
     sequences of two proteins"  Journal of Molecular Biology.
     48:  443-453.
     The original Needleman-Wunsch algorithm which uses only an open
     gap penalty.

22.  Sellers, P.H.  1974.  "On the Theory and computation of
     evolutionary distances"  SIAM J. Appl. Math.  26:  787 - 793.
     A global alignment algorithm based on distance or sequence
     dissimilarity.

23.  Altschul, S.F., Madden, T.L., Schaffer, A.A., Zhang, J., Zhang, Z.,
     Miller, W., and Lipman, D.J.  1997.  "Gapped BLAST and
     PSI-BLAST: a new generation of protein database search programs"
     Nucleic Acids Research.  25:  3389 - 3402.