GeneDoc's Similarity Tables

When Similarity Groups are enabled, GeneDoc assigns amino acids to substitution groups: sets of amino acids that are treated as equivalent to each other for the purpose of measuring the degree of conservation in each column of the alignment. We have attempted to place the selection of members of each equivalence group on an objective and rational basis. The members of each equivalence group are therefore a set of amino acids that have mutually positive scores in the similarity representation of the scoring matrix selected for use with an alignment. The default scoring matrix is the Blosum 62 matrix, so the default equivalence groups reflect the scores in this matrix. Note that GeneDoc, in order to deal properly with multiple sequence alignments, uses the similarity matrices in their distance representation. The discussion below is in terms of the similarity representation, and the matrices below are shown in the similarity form, while those displayed in GeneDoc will be in their distance representation. The two forms can be interconverted with the equation:

Distance(i,j) = Maximum Similarity - Similarity(i,j).

Here (i,j) designates a specific pair of amino acids, so Similarity(i,j) and Distance(i,j) are the scores for that pair.
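The conversion can be sketched in a few lines of Python. This is only an illustration, not GeneDoc's own code: the function name and the dictionary layout are hypothetical, and it assumes that "Maximum Similarity" means the largest score in the table being converted. The four scores are taken from the PAM 250 matrix shown later in this document.

```python
def similarity_to_distance(similarity):
    """Convert a {(i, j): score} similarity table to its distance form."""
    # Assumption: "Maximum Similarity" is the largest score in the table.
    max_similarity = max(similarity.values())
    return {pair: max_similarity - score for pair, score in similarity.items()}


# A few PAM 250 similarity scores, copied from the matrix in this document.
pam250_excerpt = {("C", "C"): 12, ("D", "E"): 3, ("W", "W"): 17, ("C", "W"): -8}

distance = similarity_to_distance(pam250_excerpt)
print(distance[("W", "W")])  # 0, the most similar pair has the smallest distance
print(distance[("C", "W")])  # 25, the least similar pair has the largest distance
```

Under this assumption every distance is zero or positive, and the ordering of the pairs is simply reversed.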

The choice of similarity matrix determines both the pattern and the extent of substitutions that will be considered as favorable in evaluating the alignment.

To evaluate an alignment objectively, one must quantify whether the substitution of one amino acid for another is likely to conserve the physical and chemical properties necessary to maintain the structure and function of the protein, or is more likely to disrupt essential structural and functional features. Similarity tables have been built on numerous bases: explicit or implicit (empirical) evolutionary models; structural properties such as Chou-Fasman propensities; chemical properties such as charge, polarity, and shape; and combinations of these, like those used in the Structural-Genetics Matrix. Regardless of the underlying basis, every similarity table is an attempt to quantify whether a mutation preserves or disrupts the function of a protein.

Similarity scores used in GeneDoc are based on observed substitutions of one amino acid (or nucleotide) for another in homologous proteins or genes. The scores contrast the observed pattern of substitutions in homologous proteins with the random pattern of substitutions we would expect to observe in unrelated proteins. Modern similarity scores, computed as log-odds scores, have been shown to be the most efficient way to use the observed substitution data to detect homologous sequences.

If a replacement is favored during evolution (i.e., a conservative replacement), the similarity score will be greater than zero; if there is selection against the replacement (i.e., a nonconservative replacement), the similarity score will be less than zero. Thus similarity scores above zero indicate that two amino acids replace each other more often during evolution than we would expect if the replacements were random. Likewise, similarity scores below zero indicate that two amino acids replace each other less often than we would expect if the replacements were random.
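The sign behavior of a log-odds score can be illustrated with a short Python sketch. The function name, the scale factor, and all of the frequencies below are hypothetical values chosen for easy arithmetic; they are not taken from any published matrix.

```python
import math


def log_odds_score(observed_pair_freq, freq_a, freq_b, scale=2.0):
    """Return scale * log2(observed / expected) for one ordered residue pair.

    observed_pair_freq: how often a is seen aligned with b in homologous sequences
    freq_a, freq_b:     background frequencies of the two residues
    scale:              assumed scaling factor (2.0 gives half-bit units)
    """
    expected_by_chance = freq_a * freq_b
    return scale * math.log2(observed_pair_freq / expected_by_chance)


# Hypothetical frequencies; expected_by_chance is 0.25 * 0.125 = 0.03125.
favored = log_odds_score(0.0625, 0.25, 0.125)      # seen twice as often as chance
neutral = log_odds_score(0.03125, 0.25, 0.125)     # seen exactly as often as chance
disfavored = log_odds_score(0.015625, 0.25, 0.125) # seen half as often as chance

print(favored, neutral, disfavored)
```

A pair observed more often than chance gets a positive score, a pair observed exactly as often as chance scores zero, and a pair observed less often gets a negative score.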

The way replacements are counted is one of the biggest differences between the two most widely used families of similarity matrices, the PAM matrices and the more recently developed Blosum matrices. The PAM matrices use counts derived from an explicitly tree-like, branching evolutionary model. The Blosum matrices use counts derived directly from highly conserved blocks within an alignment.

Margaret Dayhoff and her co-workers performed the first careful, systematic study of amino acid replacements and used it to create the first amino acid similarity matrix, the Point Accepted Mutation (PAM) similarity matrix. In computing the PAM matrices, the alignment was created from a limited set of closely related sequences. The alignment was a global alignment, that is, it encompassed the entire length of the sequences. Thus both highly conserved regions and highly variable regions were included in the alignment and used in counting replacements.

The PAM 250 matrix, originally created by Margaret Dayhoff, is shown below. This matrix is appropriate for searching for alignments of sequences that have diverged by 250 PAMs, that is, 250 mutations per 100 amino acids of sequence. Because of back mutations and silent mutations, this corresponds to sequences that are about 20 percent identical.

PAM 250 Amino Acid Similarity Matrix

C 12
G -3   5
P -3  -1   6
S  0   1   1   1
A -2   1   1   1   2
T -2   0   0   1   1   3
D -5   1  -1   0   0   0   4
E -5   0  -1   0   0   0   3   4
N -4   0  -1   1   0   0   2   1   2
Q -5  -1   0  -1   0  -1   2   2   1   4
H -3  -2   0  -1  -1  -1   1   1   2   3   6
K -5  -2  -1   0  -1   0   0   0   1   1   0   5
R -4  -3   0   0  -2  -1  -1  -1   0   1   2   3   6
V -2  -1  -1  -1   0   0  -2  -2  -2  -2  -2  -2  -2   4
M -5  -3  -2  -2  -1  -1  -3  -2   0  -1  -2   0   0   2   6
I -2  -3  -2  -1  -1   0  -2  -2  -2  -2  -2  -2  -2   4   2   5
L -6  -4  -3  -3  -2  -2  -4  -3  -3  -2  -2  -3  -3   2   4   2   6
F -4  -5  -5  -3  -4  -3  -6  -5  -4  -5  -2  -5  -4  -1   0   1   2   9
Y  0  -5  -5  -3  -3  -3  -4  -4  -2  -4   0  -4  -5  -2  -2  -1  -1   7  10
W -8  -7  -6  -2  -6  -5  -7  -7  -4  -5  -3  -3   2  -6  -4  -5  -2   0   0  17
   C   G   P   S   A   T   D   E   N   Q   H   K   R   V   M   I   L   F   Y   W

The PAM 250 matrix above has been arranged so that similar amino acids are close to each other. This gives rise to regions along the diagonal of the matrix that contain only positive scores. These regions provide an objective basis for defining conservative substitutions, namely as amino acids that replace each other more frequently than would be expected from random replacements. Note that the amino acids that make up these regions can change at different levels of sequence divergence; that is, different similarity score matrices correspond to different sets of conservative substitutions. The diagonal terms of the matrix vary appreciably. This variation reflects both how often an amino acid is found in protein sequences and how often it is observed to be replaced by other amino acids. Thus rare amino acids that are replaced infrequently have the highest scores.
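Whether a set of amino acids forms such an all-positive region can be checked mechanically. The sketch below is one possible way to do this and is not GeneDoc's own code; the function names are hypothetical, and the scores are copied from the PAM 250 matrix above for six residues only.

```python
from itertools import combinations

# Off-diagonal PAM 250 scores for D, E, N, Q, H and K, copied from the matrix above.
PAM250_EXCERPT = {
    ("D", "E"): 3, ("D", "N"): 2, ("D", "Q"): 2, ("D", "H"): 1, ("D", "K"): 0,
    ("E", "N"): 1, ("E", "Q"): 2, ("E", "H"): 1, ("E", "K"): 0,
    ("N", "Q"): 1, ("N", "H"): 2, ("N", "K"): 1,
    ("Q", "H"): 3, ("Q", "K"): 1,
    ("H", "K"): 0,
}


def pair_score(table, a, b):
    """Look up a symmetric score stored under either ordering of the pair."""
    return table[(a, b)] if (a, b) in table else table[(b, a)]


def is_mutually_positive(table, members):
    """True if every pair of distinct members has a score greater than zero."""
    return all(pair_score(table, a, b) > 0 for a, b in combinations(members, 2))


print(is_mutually_positive(PAM250_EXCERPT, "DENQH"))   # True
print(is_mutually_positive(PAM250_EXCERPT, "DENQHK"))  # False: D-K, E-K and H-K score 0
```

The test here is strictly greater than zero, so a pair that scores exactly zero keeps a residue out of the group; relaxing the comparison to `>= 0` would give larger groups.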

The Blosum Family of Matrices

There are three principal differences between the Blosum and PAM matrices. The first is that the PAM matrices are based on an explicit evolutionary model (that is, replacements are counted on the branches of a phylogenetic tree), whereas the Blosum matrices are based on an implicit model of evolution. The second is the sequence variability in the alignments used to count replacements. The PAM matrices are based on mutations observed throughout a global alignment, which includes both highly conserved and highly mutable regions. The Blosum matrices are based only on highly conserved regions in a series of alignments that are not allowed to contain gaps. The last difference is in the method used to count the replacements. The Blosum procedure uses groups of sequences within which not all mutations are counted the same.
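Counting replacements directly from an ungapped block, with no phylogenetic tree, can be sketched as follows. This is a deliberately simplified illustration: it counts every pair of residues in each column equally and omits the grouping of sequences mentioned above, so it is not the full Blosum procedure. The function name and the toy block are hypothetical.

```python
from collections import Counter
from itertools import combinations


def count_column_pairs(block):
    """Count unordered residue pairs in every column of an ungapped block."""
    if len({len(seq) for seq in block}) != 1:
        raise ValueError("an ungapped block needs sequences of equal length")
    counts = Counter()
    for column in zip(*block):                 # walk the block column by column
        for a, b in combinations(column, 2):   # every pair of sequences in the column
            counts[tuple(sorted((a, b)))] += 1
    return counts


# Hypothetical block of three sequences and two columns.
toy_block = ["AV", "AI", "SV"]
pair_counts = count_column_pairs(toy_block)

print(pair_counts[("A", "S")])  # 2: column 1 holds A, A, S
print(pair_counts[("I", "V")])  # 2: column 2 holds V, I, V
```

Pair counts of this kind are the raw observations that a log-odds calculation then compares with the counts expected by chance.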

Blosum 45 Amino Acid Similarity Matrix

G  7
P -2   9
D -1  -1   7
E -2   0   2   6
N  0  -2   2   0   6
H -2  -2   0   0   1  10
Q -2  -1   0   2   0   1   6
K -2  -1   0   1   0  -1   1   5
R -2  -2  -1   0   0   0   1   3   7
S  0  -1   0   0   1  -1   0  -1  -1   4
T -2  -1  -1  -1   0  -2  -1  -1  -1   2   5
A  0  -1  -2  -1  -1  -2  -1  -1  -2   1   0   5
M -2  -2  -3  -2  -2   0   0  -1  -1  -2  -1  -1   6
V -3  -3  -3  -3  -3  -3  -3  -2  -2  -1   0   0   1   5
I -4  -2  -4  -3  -2  -3  -2  -3  -3  -2  -1  -1   2   3   5
L -3  -3  -3  -2  -3  -2  -2  -3  -2  -3  -1  -1   2   1   2   5
F -3  -3  -4  -3  -2  -2  -4  -3  -2  -2  -1  -2   0   0   0   1   8
Y -3  -3  -2  -2  -2   2  -1  -1  -1  -2  -1  -2   0  -1   0   0   3   8
W -2  -3  -4  -3  -4  -3  -2  -2  -2  -4  -3  -2  -2  -3  -2  -2   1   3  15
C -3  -4  -3  -3  -2  -3  -3  -3  -3  -1  -1  -1  -2  -1  -3  -2  -2  -3  -5  12
   G   P   D   E   N   H   Q   K   R   S   T   A   M   V   I   L   F   Y   W   C

Summary: PAM and Blosum Matrices

The Blosum and PAM matrices are the most widely used amino acid similarity matrices for database searching and sequence alignment. In empirical tests of their effectiveness, both families generally perform well. However, the Blosum matrices have most often been the better performers. This likely reflects the fact that the Blosum matrices are based on the replacement patterns found in more highly conserved regions of the sequences. This appears to be an advantage because these more highly conserved regions are those discovered in database searches, and they serve as anchor points in alignments involving complete sequences. It is reasonable to expect that the replacements that occur in highly conserved regions will be more restricted than those that occur in highly variable regions of the sequence. This is supported by the different patterns of positive and negative scores in the two families of matrices. These patterns reflect different estimates of what constitutes a conservative or nonconservative substitution in the evolution of proteins, and they arise from the differences in how the two families of matrices are constructed. Some of the difference is also likely to be because the Blosum matrices are based on much more data than the PAM matrices.

The PAM matrices still perform relatively well despite the small amount of data underlying them. The most likely reasons for this are the care used in constructing the alignments and phylogenetic trees used in counting replacements and the fact that they are explicitly based on a simple model of evolution. Thus they still perform better than some of the more modern matrices that are less carefully constructed. Both the PAM and Blosum matrices generally perform better than matrices explicitly based on criteria other than observed replacement frequencies.

We can see the concrete result of these differences in the PAM 250 and Blosum 45 matrices shown above. These two similarity tables are directly comparable and are suitable for alignments of sequences with the same degree of divergence from each other. They have the same amount of information per alignment position (entropy) for determining whether or not sequences are homologous. Thus differences in these scores should reflect differences in the data and model used in counting substitutions rather than any other effects.

One striking difference is among the amino acids with carboxylate and amide side chains. At physiological pH the carboxylate side chain can only act as a hydrogen bond acceptor, while the amide side chain can simultaneously accept and donate a hydrogen bond. In the PAM 250 table these four amino acids, Asp, Asn, Glu, and Gln (along with His), form a single conservative substitution group; that is, the similarity score for any pair is greater than zero. His has two nitrogen atoms in its side chain and, like the amide side chains of Asn and Gln, can simultaneously donate and accept a hydrogen bond. In the Blosum 45 similarity table this single group must be split into two groups: one with the carboxylate amino acids Asp and Glu, and the other with Asn, Gln, and His.

Another noticeable difference is the relationship between Ser and Thr, the amino acids with alcohol side chains, and other small amino acids. In the PAM 250 table any of the three amino acids Gly, Pro, or Ala can be added to the Ser-Thr pair to form a three-member conservative substitution group. In the Blosum 45 table Ser and Thr form a two-member conservative substitution group with no other possible members. The last difference we will mention is that only Phe and Tyr are members of an aromatic conservative group in the PAM 250 table, while Trp is also a member in the Blosum 45 table.

Which Similarity Scores to Use

Similarity scores for sequence alignments perform much better if they are based on replacement patterns that correspond to the degree of divergence of the aligned sequences. More information about similarity scores is available in the references below or in the on-line tutorial on sequence database searching at the Pittsburgh Supercomputing Center. The URL for this tutorial is:

http://www.psc.edu/biomed/TUTORIALS/SEQUENCE/DBSEARCH/tutorial.html

Altschul, S.F. 1991. "Amino acid substitution matrices from an information theoretic perspective." Journal of Molecular Biology 219: 555 - 565. This paper looks at the PAM and Blosum scoring matrices in the context of information theory and develops guidelines for making effective use of the information encapsulated in scoring matrices.

Dayhoff, M.O., Schwartz, R.M., Orcutt, B.C. 1978. "A model of evolutionary change in proteins." In "Atlas of Protein Sequence and Structure" 5(3), M.O. Dayhoff (ed.), 345 - 352. National Biomedical Research Foundation, Washington. This paper describes the development of the PAM family of protein scoring matrices.

States, D.J., Gish, W., Altschul, S.F. 1991. "Improved sensitivity of nucleic acid database search using application-specific scoring matrices." Methods: A Companion to Methods in Enzymology 3(1): 66 - 77. Scoring matrices for nucleic acid sequences that take into account different levels of sequence divergence and different rates of transversions and transitions.

Henikoff, S., Henikoff, J.G. 1992. "Amino acid substitution matrices from protein blocks." Proc. Natl. Acad. Sci. USA 89: 10915 - 10919. This paper describes the calculation of the Blosum family of protein scoring matrices.

Johnson, M.S., Overington, J.P. 1993. "A structural basis of sequence comparisons: An evaluation of scoring methodologies." Journal of Molecular Biology 233: 716 - 738. Comparison of amino acid substitution matrices, with a visual representation of the important features of the matrices for protein similarity and the differences between them.

Henikoff, S., Henikoff, J.G. 1993. "Performance evaluation of amino acid substitution matrices." Proteins: Structure, Function, and Genetics 17: 49 - 61. Comparison of amino acid substitution matrices.

Karlin, S., Altschul, S.F. 1990. "Methods for assessing the statistical significance of molecular sequence features by using general scoring schemes." Proc. Natl. Acad. Sci. USA 87: 2264 - 2268.