Group and Secondary Structure Features

Increasingly, families of genes and proteins are organized into superfamilies. Database organizations often use the percent of residue identity as the criterion for deciding whether a pair of sequences should be assigned to the same family or to different families within a superfamily. Perhaps a more useful criterion is whether or not a gene duplication has taken place in the common evolutionary history of the two sequences. Evolutionary biologists use this classification and refer to homologous sequences that have only speciation events, and no gene duplication events, in their common evolutionary history as orthologous. Homologous sequences that have a gene duplication event in their common evolutionary history are referred to as paralogous. Orthologous genes or proteins generally carry out the same biochemical and physiological functions, while paralogous proteins generally carry out related but distinct functions. For instance, mammalian myoglobins, which carry oxygen within cells, are an orthologous family. They are part of a superfamily that includes the alpha hemoglobins and the beta hemoglobins, both of which are also orthologous families. These three homologous families are part of the same paralogous superfamily, as are other globin genes and proteins.

Families and superfamilies can be organized around functional criteria as well as evolutionary criteria and sequence divergence. Although all sixty-plus transfer RNA sequences in E. coli are paralogous with each other, they can be grouped by their twenty amino acid acceptor activities.

The group functions in GeneDoc are designed to allow users to work with and analyze groups based on any of the above criteria, or on any user-determined criteria for dividing a set of sequences into groups. The first step in working with groups is to open the group configuration dialog, either by selecting the "edit sequence groups" item on the Groups menu or by clicking the groups button, the button marked with an upper case G on the upper toolbar. The group configuration dialog allows the user to allocate the sequences to groups and to select a color to be associated with each group. For the purposes of most of the GeneDoc group analyses, sequences that are not explicitly assigned to a group are treated as if each unassigned sequence were the only member of its own group. These implicit groups are not themselves analyzed, but their sequences are used as "other" groups in the analyses of the defined groups and thus still contribute to the results.
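The treatment of unassigned sequences can be pictured with a small sketch. This is only an illustration of the rule described above, not GeneDoc's internal data structure; the function name, the dictionary layout, and the "(implicit)" label are all hypothetical.

```python
def effective_groups(sequence_names, assigned):
    """Return the groups an analysis would see.

    `assigned` maps a user-defined group name to its member sequence names.
    Every sequence not listed in any group becomes the sole member of its own
    implicit group (the "(implicit) ..." label is a made-up convention).
    """
    groups = {name: list(members) for name, members in assigned.items()}
    taken = {seq for members in assigned.values() for seq in members}
    for seq in sequence_names:
        if seq not in taken:
            groups[f"(implicit) {seq}"] = [seq]
    return groups


# Hypothetical example: three sequences, only two of them assigned.
example = effective_groups(["a", "b", "c"], {"g1": ["a", "b"]})
print(example)
```

In this sketch the implicit group for "c" would be skipped as a subject of analysis but would still appear among the "other" sequences when "g1" is analyzed.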

The group analyses result in different shading for individual groups. These shadings highlight different degrees and kinds of conservation of residues or properties within and between groups. One analysis, referred to as the Dstat analysis, measures how different the groups are from one another and whether this difference is statistically significant. The Dstat analysis presents its results as a graph and numerical values. The simplest analysis is performed by the "shade group conserved" entry on the Groups menu. This analysis first highlights positions within each group that are completely conserved, that is, positions where only one residue occurs within the group. This highlighting is done in the color assigned to the group in the group configuration dialog. This measurement of conservation within the groups does not take into account any equivalency groups, even if they are active. Second, the analysis highlights the positions that are completely conserved across all of the groups, that is, positions where only one residue occurs in all of the sequences in the alignment. This part of the analysis does take the equivalency groups into account if they are in effect. The final action is to compute a consensus sequence based on the entire alignment.
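The two conservation checks made by "shade group conserved" can be sketched as follows. This is a minimal illustration, not GeneDoc's code: the toy alignment, the group assignment, and the function name are hypothetical, gap characters are treated like any other symbol (an assumption), equivalency groups are ignored throughout, and the consensus sequence is not computed.

```python
def conserved_columns(rows):
    """Return the indices of columns in which every row has the same symbol."""
    if not rows:
        return []
    length = len(rows[0])
    return [i for i in range(length) if len({row[i] for row in rows}) == 1]


# Hypothetical toy alignment: sequence name -> aligned residues.
alignment = {
    "seqA": "MKVLA",
    "seqB": "MKILA",
    "seqC": "MRVLG",
    "seqD": "MRVLG",
}
# Hypothetical group assignment: group name -> member sequence names.
groups = {"g1": ["seqA", "seqB"], "g2": ["seqC", "seqD"]}

# Columns completely conserved within each group.
within = {
    name: conserved_columns([alignment[s] for s in members])
    for name, members in groups.items()
}
# Columns completely conserved across the whole alignment.
across = conserved_columns(list(alignment.values()))

print(within)  # {'g1': [0, 1, 3, 4], 'g2': [0, 1, 2, 3, 4]}
print(across)  # [0, 3]
```

Column indices here are zero-based positions in the toy alignment. Columns 1 and 4 are conserved within each group but not across the alignment, which is the kind of pattern the group shading is meant to draw attention to.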

The most useful result of this analysis is the identification of regions of the alignment where structural or functional requirements may have been relaxed or eliminated (or, alternatively, added as the group evolved a new function) for some groups relative to others. For this kind of information to be reliable, the conserved groups need to be both large and drawn from a diverse range of organisms. Otherwise the observed conservation may simply be the result of a small data set with highly dependent observations.

A more stringent analysis is performed by the "shade group PCR contrast" entry on the Groups menu. Sites highlighted by this analysis meet two criteria. First, a single residue is completely conserved within the group. Second, this conserved residue does not appear, at that position, in any sequence outside of the group in which it is conserved. This analysis marks unique sequence features of the group that can be useful in defining a group motif and possibly in defining a primer sequence to be used in a polymerase chain reaction (PCR) amplification of the gene.
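The two criteria of the PCR contrast can be written down directly. The sketch below is an illustration under stated assumptions rather than GeneDoc's implementation: the toy alignment, the group assignment, and the function name are hypothetical, and gap characters are treated as ordinary symbols.

```python
def pcr_contrast_columns(alignment, groups, group):
    """Columns where `group` has one residue that no outside sequence shares."""
    members = set(groups[group])
    inside = [seq for name, seq in alignment.items() if name in members]
    outside = [seq for name, seq in alignment.items() if name not in members]
    hits = []
    for i in range(len(inside[0])):
        residues = {seq[i] for seq in inside}
        if len(residues) != 1:
            continue  # criterion 1 fails: not completely conserved in the group
        residue = next(iter(residues))
        if all(seq[i] != residue for seq in outside):
            hits.append(i)  # criterion 2 holds: residue is absent outside the group
    return hits


# Hypothetical toy alignment and grouping.
alignment = {
    "seqA": "MKVLA",
    "seqB": "MKILA",
    "seqC": "MRVLG",
    "seqD": "MRVLG",
}
groups = {"g1": ["seqA", "seqB"], "g2": ["seqC", "seqD"]}

print(pcr_contrast_columns(alignment, groups, "g1"))  # [1, 4]
print(pcr_contrast_columns(alignment, groups, "g2"))  # [1, 4]
```

Column 2 is conserved within g2 (V) but is not reported for it, because seqA in g1 also carries V at that position.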

The "shade group contrast" entry on the Groups menu performs an analysis similar to that of the "shade group PCR contrast" entry. This analysis uses the scoring table designated for alignment scoring to divide the scores for pairs of amino acids into three classes: positive, negative, and neutral. The positive scores are those that are positive numbers in the similarity form of the table. Similarly, the negative scores are those that are negative numbers in the similarity form of the table, while the neutral scores are zero. The scores are stored in GeneDoc as distance or dissimilarity scores and hence must be converted to the similarity form. This is done by subtracting the table score for a pair of residues from a constant called the zero cost distance, which is stored with the table. Thus large distances become negative similarities and small distances become positive similarities. The interpretation of the scoring tables is that positive similarities are conservative substitutions and are favored over random substitutions in the evolutionary process relating the sequences.
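The conversion from distance form to similarity form, and the resulting three-way classification, amount to a single subtraction. In the sketch below the zero cost distance and the pair distances are made-up numbers chosen only for illustration; they are not taken from any GeneDoc scoring table, and the function names are hypothetical.

```python
def to_similarity(distance, zero_cost_distance):
    """Convert a stored distance score to its similarity form."""
    return zero_cost_distance - distance


def classify(distance, zero_cost_distance):
    """Label a residue pair as positive, negative, or neutral."""
    similarity = to_similarity(distance, zero_cost_distance)
    if similarity > 0:
        return "positive"
    if similarity < 0:
        return "negative"
    return "neutral"


ZERO_COST = 8  # hypothetical constant stored with the table
# Hypothetical distance scores for three residue pairs.
toy_distances = {("I", "V"): 5, ("A", "S"): 8, ("G", "W"): 15}

for pair, distance in toy_distances.items():
    print(pair, to_similarity(distance, ZERO_COST), classify(distance, ZERO_COST))
```

With these invented numbers the I/V pair has similarity 3 (positive), A/S has similarity 0 (neutral), and G/W has similarity -7 (negative).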

The analysis performed by the "shade group contrast" entry on the Groups menu is less restrictive about the degree of conservation within the group than is the analysis performed by the "shade group PCR contrast" entry. All of the residues found at a position within the group are required to have a positive similarity score with each other, and thus to be conservative substitutions. This analysis is, however, more restrictive than the "shade group PCR contrast" analysis when dealing with residues outside the group. The residues outside of the group must have a negative similarity score with every residue from within the group; thus they are not allowed to be either conservative or neutral substitutions. An example of using this kind of analysis to study the recognition of transfer RNAs by aminoacyl tRNA synthetase enzymes can be found in McClain and Nicholas, 1987. Nicholas et al., 1987 describes using the contrasts to plan site directed mutagenesis experiments to confirm the analysis of the tRNAs.
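The test applied to a single alignment column by the group contrast can be sketched as below. The similarity values are invented for illustration and do not come from any real scoring table; the assumptions that identical residues score positively and that unlisted pairs are neutral, as well as all names, are hypothetical.

```python
from itertools import combinations

# Hypothetical similarity scores for unordered residue pairs.
SIM = {
    frozenset("IV"): 3, frozenset("IL"): 2, frozenset("LV"): 1,
    frozenset("ID"): -3, frozenset("VD"): -3, frozenset("LD"): -4,
    frozenset("IA"): 0, frozenset("VA"): 0,
}


def similarity(a, b):
    """Look up a pair score; identity is assumed positive, unlisted pairs neutral."""
    if a == b:
        return 4
    return SIM.get(frozenset((a, b)), 0)


def is_contrast_column(inside, outside):
    """Apply the two group contrast criteria to one column.

    `inside` holds the residues of the group at this position, `outside`
    the residues of all other sequences at the same position.
    """
    inside, outside = set(inside), set(outside)
    # Every pair of distinct residues within the group must score positively.
    conservative = all(similarity(a, b) > 0 for a, b in combinations(inside, 2))
    # Every outside residue must score negatively against every inside residue.
    contrasting = all(similarity(a, b) < 0 for a in inside for b in outside)
    return conservative and contrasting


print(is_contrast_column("IVL", "DD"))  # True
print(is_contrast_column("IV", "DA"))   # False: A is neutral against I and V
print(is_contrast_column("ID", "K"))    # False: I and D are not conservative
```

The second call shows where this analysis is stricter than the PCR contrast: A never occurs inside the group, yet the column is rejected because A is only neutral, not negative, against the group's residues.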

The analysis called the Dstat analysis is the Kolmogorov-Smirnov test for the equality of two distributions. The Dstat analysis is carried out by first selecting a region (or all) of the alignment for use in the test calculations. Then you can either select the analysis under the Dstat menu or click the Dstat toolbar button. The Dstat toolbar button is near the right end of the upper toolbar and is marked by a pair of "S" shaped curves representing the cumulative distributions used in the test. As noted above, the Dstat analysis is a statistical test of whether the groups defined by the user are significantly different from each other.

The first step in the test is to compute an alignment score for each pair of sequences over the region selected by the user. These scores are then partitioned into two distributions. The first distribution is composed entirely of scores where the two sequences used to compute the score are members of different user defined groups. This is called the between groups distribution. The second distribution is composed entirely of scores where both of the sequences used to compute the score are members of the same user defined group. Note that this includes scores from every group with two or more sequences. This distribution is called the within groups distribution.
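The partitioning step can be sketched as follows. The pairwise score used here is a simple mismatch count standing in for the real alignment score, which GeneDoc derives from the designated scoring table; that substitution, the toy alignment, the group labels, and the function names are all assumptions made for illustration. Every sequence is given a group label here; an unassigned sequence could be given a label of its own.

```python
from itertools import combinations


def pair_distance(a, b):
    """Hypothetical stand-in for the alignment score: count mismatched columns."""
    return sum(x != y for x, y in zip(a, b))


def partition_scores(alignment, group_of):
    """Split all pairwise scores into between groups and within groups lists."""
    between, within = [], []
    for (name1, seq1), (name2, seq2) in combinations(alignment.items(), 2):
        score = pair_distance(seq1, seq2)
        if group_of[name1] == group_of[name2]:
            within.append(score)
        else:
            between.append(score)
    return between, within


# Hypothetical toy alignment (the selected region) and group labels.
alignment = {
    "seqA": "MKVLA",
    "seqB": "MKILA",
    "seqC": "MRVLG",
    "seqD": "MRVLG",
}
group_of = {"seqA": "g1", "seqB": "g1", "seqC": "g2", "seqD": "g2"}

between, within = partition_scores(alignment, group_of)
print(between)  # [2, 2, 3, 3]
print(within)   # [1, 0]
```

Four sequences give six pairs: four between groups and two within groups, one from each group with two members.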

These two distributions are plotted as cumulative distributions. That is, each score is plotted against the fraction of the scores in the distribution that are less than or equal to it. The Kolmogorov-Smirnov D statistic (Dstat) is the maximum difference between the two cumulative distributions, measured along the fractional axis. Recent advances in the understanding of the distribution of values taken on by Dstat allow us to compute its one-sided significance probability. The one-sided significance probability is used rather than the two-sided significance probability because we are only interested in the case where the between groups distribution is composed of larger scores than the within groups distribution. The other situation, where the within groups distribution is composed of larger scores than the between groups distribution, corresponds to either convergent evolution or some sort of selection in favor of divergence, situations that are not usually part of the hypothesis.
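The D statistic itself is straightforward to compute from the two lists of scores. The sketch below builds the two empirical cumulative distributions and reports both the largest absolute gap and the largest signed gap in the direction of interest (within groups curve above the between groups curve, which is what larger between groups scores produce). The sample values and function names are hypothetical, and the significance probability is not computed because the text does not give the formula GeneDoc uses.

```python
from bisect import bisect_right


def ecdf(sample):
    """Return a function giving the fraction of `sample` that is <= x."""
    ordered = sorted(sample)
    n = len(ordered)
    return lambda x: bisect_right(ordered, x) / n


def ks_d(within, between):
    """Return (two-sided D, one-sided D) for the two score samples.

    The one-sided value is the largest amount by which the within groups
    cumulative curve exceeds the between groups curve.
    """
    f_within, f_between = ecdf(within), ecdf(between)
    # The gap can only change at observed scores, so checking those is enough.
    points = sorted(set(within) | set(between))
    diffs = [f_within(x) - f_between(x) for x in points]
    return max(abs(d) for d in diffs), max(diffs)


# Hypothetical score samples.
within_scores = [0, 1, 1, 2]
between_scores = [2, 3, 3, 4]
print(ks_d(within_scores, between_scores))  # (0.75, 0.75)
```

A general statistics library routine, for example the two-sample Kolmogorov-Smirnov function in SciPy, could be used to attach a significance probability to such samples, though whether it would agree with GeneDoc's calculation is not stated in the text.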

The Kolmogorov-Smirnov test was selected instead of the more common Student's t test or the F test because it is sensitive to both the location of the distributions along the scores axis and the shape of the distributions. Student's t test is sensitive only to the location of the distributions, and the F test is sensitive only to differences in the variance of the distributions, which is only one of several aspects affecting their shape. Thus the Kolmogorov-Smirnov test can find the distributions to be different when either Student's t test or the F test might have failed. Because of this, it is necessary for the user to examine the plot carefully to determine the exact nature of the differences between the two distributions being tested. The user should take care that the biological hypothesis being examined would actually lead to the type of difference observed.

Examples of testing biological hypotheses with sequence data and the Kolmogorov-Smirnov test can be found in Nicholas and Graves, 1983 and in Nicholas and McClain, 1995. The Nicholas and Graves paper contains an extended discussion of formulating Kolmogorov-Smirnov tests that correspond to different kinds of biological hypotheses.