Exercises for bioinformatics.psc.edu:
Analysis of Sequence Subfamilies
Finding Distinctive Features in Groups of Sequences
This exercise teaches the use of the Pittsburgh Supercomputing Center's Group Entropy program, GEnt, for analyzing a multiple sequence alignment of a superfamily of sequences that has been subdivided into mutually exclusive groups. The analysis assumes that all of the sequences in the superfamily are homologous and that different subsets have developed distinctive features such as different substrate specificities or specific interactions with different macromolecules. The analysis is intended to identify the sequence residues most likely to be responsible for those distinctive features. It asks the scientific question: What makes each defined subset of sequences different from all of the other sequences in the alignment?
The exercise contains two parts. The first presents an ordination technique, Principal Components analysis, using the program SeqSpace. Principal Components analysis takes distance measures among a set of sequences and converts them into a self-consistent set of coordinates in some arbitrary number of dimensions. These coordinates are then plotted and the plots examined for clusters of sequences. Sequences in the same cluster share a similar pattern of substitutions, and this is presumed to reflect some common biochemical or physiological property or function.
The second presents a cross entropy analysis of the columns in a multiple sequence alignment using the PSC-developed program GEnt (Group Entropy). GEnt's cross entropy analysis contrasts the amino acid or nucleotide composition of a single alignment column within a defined group of sequences with the composition of the same column in all of the sequences outside the group. Alignment positions where the residue composition inside the group is very different from that outside the group are expected to be associated with distinctive properties of the sequences within the group. The results will be displayed using the xgobi graphics program.
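The column-by-column contrast described above can be illustrated with a small Python sketch. This is not GEnt's code, and the exercise does not give GEnt's exact formula: the sketch assumes a standard relative entropy (Kullback-Leibler) measure and a flat pseudocount of 1 per residue, whereas GEnt derives its pseudocounts from a transition frequency matrix. The function names are hypothetical.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def column_frequencies(residues, alphabet=AMINO_ACIDS, pseudocount=1.0):
    """Smoothed residue frequencies for one alignment column.

    Assumption: a flat pseudocount is added to every residue so that
    no frequency is zero. Gap characters are ignored.
    """
    kept = [r for r in residues if r in alphabet]
    counts = Counter(kept)
    total = len(kept) + pseudocount * len(alphabet)
    return {a: (counts.get(a, 0) + pseudocount) / total for a in alphabet}

def relative_entropy(p, q):
    """Sum over residues of p * log2(p / q), in bits."""
    return sum(p[a] * math.log2(p[a] / q[a]) for a in p)

def group_vs_outside_entropy(column, in_group):
    """Contrast the composition of one column inside and outside a group.

    column   -- one character per sequence (a single alignment column)
    in_group -- one boolean per sequence, True if it belongs to the group
    """
    column = column.upper()
    inside = [r for r, g in zip(column, in_group) if g]
    outside = [r for r, g in zip(column, in_group) if not g]
    return relative_entropy(column_frequencies(inside),
                            column_frequencies(outside))

# Toy columns for six sequences; the first three form the group.
membership = [True, True, True, False, False, False]
conserved = group_vs_outside_entropy("GGGGGG", membership)
distinctive = group_vs_outside_entropy("YYYHHH", membership)
print(conserved < distinctive)
```

A column that is identical inside and outside the group scores zero, while a column that separates the group from the rest scores higher, which is the property the exercise relies on when it looks for informative positions.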
In both analyses the results will be presented graphically using remote graphics displays that will be sent to your local computer or workstation. Please make sure that you are using a local workstation capable of receiving this information from the host bioinformatics computer, and that it is set up to receive information from the host computer.
Also note that in these exercises items are sometimes enclosed in double quotation marks or brackets. This is only to make the exercise easier to read; do not include either the double quotation marks or the brackets when you enter responses on the computer.
- Use SSH to log in to the sequence analysis computer (bioinformatics.psc.edu)
Set up the sequence analysis computer.
- Instruct the computer where to send the remote graphics display by giving it the name of your local computer with the command: setenv DISPLAY localcomputer:0.0 where localcomputer is the Internet name of your local computer and will have a form similar to computer.site.sitetype such as bioinformatics.psc.edu. If you are unsure of the local address of the computer that you are using, issue the command who -m. The localcomputer that you are using will be listed within parentheses at the end of the line starting with your user id. If you are doing this exercise at a workshop held on-site at the Pittsburgh Supercomputing Center, this name will be something like ctc01.psc.edu.
- Define the location of the postscript definitions that the xgobi graphics program needs to write a postscript file or to print its graphs with the command: setenv XGOBID /biomed/lib/xgobi (If you have already done the UNIX operating system hands-on, you can skip this step as you have already added this line to your .login file)
- Create a new, empty subdirectory for this exercise with the command: mkdir Xent
- Move into the Xent subdirectory with the command: cd Xent
Carry out the SeqSpace analysis and visualize the clusters of sequences with distinct patterns of residue substitution.
- Get a sample MSF file. Copy a multiple sequence alignment of 30 serine proteases from the Trypsin, Elastase and Chymotrypsin families into your current working directory (which should be Xent) with the command: cp /biomed/lib/example/sprot30.msf sprot30.msf
- Perform the SeqSpace analysis with the command: SeqSpace -msf sprot30.msf The program prints messages marking its progress with the analysis. The last things printed are the names of the three files of results.
- Write the file ending in _protein.sc here:
- Write the file ending in _aa_pos.sc here:
- Write the file ending in _alignment.sc here:
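The idea behind this step, turning pairwise distances into plottable coordinates, can be sketched in Python. This is a generic principal coordinates calculation (classical multidimensional scaling) written for illustration only. It is not SeqSpace's code, and the exercise does not say which distance measure or numerical method SeqSpace uses. The function name and the power-iteration approach are assumptions chosen to keep the example free of external libraries.

```python
import math

def principal_coordinates(dist, n_axes=2, iterations=200):
    """Coordinates whose Euclidean distances approximate `dist`.

    dist -- symmetric matrix (list of lists) of pairwise distances
    Returns one list of n_axes coordinates per item.
    """
    n = len(dist)
    # Double-centre the squared distances to get an inner-product matrix.
    sq = [[d * d for d in row] for row in dist]
    row_mean = [sum(row) / n for row in sq]
    grand_mean = sum(row_mean) / n
    b = [[-0.5 * (sq[i][j] - row_mean[i] - row_mean[j] + grand_mean)
          for j in range(n)] for i in range(n)]

    coords = [[] for _ in range(n)]
    for axis in range(n_axes):
        # Power iteration: repeated multiplication converges on the
        # eigenvector with the largest remaining eigenvalue.
        v = [1.0 + 0.1 * i for i in range(n)]
        converged = True
        for step in range(iterations):
            w = [sum(b[i][j] * v[j] for j in range(n)) for i in range(n)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm < 1e-12:  # no variation left to extract
                converged = False
                break
            v = [x / norm for x in w]
        if converged:
            bv = [sum(b[i][j] * v[j] for j in range(n)) for i in range(n)]
            eigenvalue = sum(v[i] * bv[i] for i in range(n))
        else:
            eigenvalue = 0.0
        scale = math.sqrt(max(eigenvalue, 0.0))
        for i in range(n):
            coords[i].append(scale * v[i])
        # Deflate so the next axis finds the next eigenvector.
        for i in range(n):
            for j in range(n):
                b[i][j] -= eigenvalue * v[i] * v[j]
    return coords

# Toy distances for three items that lie on a line at 0, 1 and 3.
toy = [[0.0, 1.0, 3.0],
       [1.0, 0.0, 2.0],
       [3.0, 2.0, 0.0]]
for point in principal_coordinates(toy):
    print(point)
```

The first axis carries the most variation and each later axis carries less, which is why plotting the first two or three dimensions is usually enough to see clusters. The fixed iteration count is a simplification; a production implementation would use a proper eigenvalue solver.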
Visualize the results of the SeqSpace analysis.
- Run the SeqSpace Viewer by entering the command: SeqSpaceView
- Open the file menu by putting the cursor on the word File at the top of the Sequence Space Browser window and depressing the left mouse button. Highlight the word Open in the menu and release the left mouse button. A list of file names will appear. Click on the one that ends in _aa_pos.sc (The file that was listed in step 3.4). The complete file name should appear in the dialogue box at the bottom of the "Open" window. Click on the OK button at the bottom of the dialogue box. A list of the names of the analyzed sequences should appear in the main "Sequence Space Browser" window. You may want to enlarge this window so that all of the sequence names are displayed.
- Open the viewer menu by putting the cursor on the word Viewer at the top of the Sequence Space Browser window and depressing the left mouse button. Highlight the phrase SeqSpace 2D proteins in the menu and release the left mouse button. A new window, titled Sequence Space 2D Proteins will open. In this window each small open box represents one of the sequences in the data set. Move this new window so that it does not cover the Sequence Space Browser window.
- Highlight, by clicking on it, the sequence name tryp_xenla in the Sequence Space Browser window. Note that in the Sequence Space 2D Proteins window the color of one of the small boxes (lower right of the window) is turned from black to red. This box corresponds to the tryp_xenla sequence.
- In the Sequence Space 2D Proteins window place your cursor on a different black box. If the cursor is placed on the box without clicking the left mouse button a small text field will appear that displays the name of the sequence or sequences pointed to by the cursor. If the left mouse button is clicked the sequence name will be highlighted in the Sequence Space Browser window. Note that the display of the text field in the Sequence Space 2D Proteins window is not always readable on all computers. Very small differences in the graphics hardware or software settings can sometimes cause the text not to be visible. If this happens to you, try a different computer or change the color display settings on your computer and perhaps it will work.
- Open the Sequence Space 2D Residues window (menu item SeqSpace 2D Residues) in the same way that you opened the Sequence Space 2D Proteins window. Each small black box in this plot represents the combination of a single, specific amino acid residue at a specific position in the alignment. If you place the cursor on a box without clicking the information will again be displayed in a small text label within the window. If you click on the box a vector (a line with a specified direction) will be drawn from the origin of the plot through that box. The same vector will be drawn in the Sequence Space 2D Proteins window as well. If this vector passes through a cluster of sequences those sequences are likely to have the specified amino acid in the designated alignment position and few other sequences will have the same combination of a specified amino acid in a designated position. You can experiment with other viewers and dimensions in the plot. The alignment viewer is linked to the other windows. Unfortunately, because of differences in the fonts available on different computers the alignment viewer does not always provide a useful view.
- After you finish experimenting, close the windows and program by clicking Exit under the File menu in the Sequence Space Browser window. (Note: if the browser does not end cleanly, you may have to force termination with [Ctrl] c)
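The geometric reading of the residue vector described in the steps above can be sketched in Python: a sequence lies "along" the vector when the angle between its plotted point and the residue point, both measured from the origin, is small. The coordinates, sequence names, and the 0.95 cosine threshold below are made up for illustration and are not taken from SeqSpace output.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two vectors drawn from the origin."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def sequences_along_vector(residue_point, protein_points, min_cosine=0.95):
    """Names of sequences whose points lie close to the residue vector.

    min_cosine is a hypothetical cut-off; the viewer itself leaves this
    judgement to the eye.
    """
    return [name for name, point in protein_points.items()
            if cosine(residue_point, point) >= min_cosine]

# Hypothetical 2D coordinates, for illustration only.
proteins = {"seq_a": (2.0, 0.1), "seq_b": (1.5, -0.1), "seq_c": (-1.0, 1.0)}
residue = (1.0, 0.0)
print(sequences_along_vector(residue, proteins))
```

The same test works in any number of dimensions, so it can be applied to the higher dimensions that the viewer lets you select.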
Use the GEnt program to assign sequences that were not previously assigned to a group to the group whose profile they match best.
- Copy the group definition file for the 30 Serine proteases from the Trypsin, Elastase and Chymotrypsin families into your current working directory (which should be Xent) with the command: cp /biomed/lib/example/sprot30m3.groups sprot30m3.groups
- Examine the group definitions file, with the command: more sprot30m3.groups Note that the last group defined in the file is named "TOBECLASSIFIED". It contains the sequences ctr2_vesor, el3b_human, and trya_rat. They will be added to one of the other three groups, "Trypsin", "Elastase", or "Chymotrypsin" defined in the file.
- Run the GEnt program interactively and respond to its requests for data files and other information. The program is started by entering the program name: GEnt
- The GEnt program starts with a message about the analyses it conducts and a request for the name of the data file containing the aligned sequences. Respond with the name of the file you copied in step 3.1, (sprot30.msf)
- The GEnt program will then request the name of the groups definition file. Respond with the name of the file you copied in step 5.1, (sprot30m3.groups).
- The GEnt program will solicit the number of gap characters to allow in any column of the alignment that is to be written to a file. This allows you to easily isolate the part of the alignment that contains few or no insertions and deletions. This part of the alignment can be useful as the basis for a phylogenetic bootstrap analysis to help you assign sequences to well delineated, robust groups. Enter the numeral zero (0). The results will appear in a file named trimmed.aa
- The GEnt program will ask you to select an amino acid transition frequency matrix to be used in generating pseudocounts for the cross entropy and PSSM (profile) calculations. The pseudocounts are computed by the Henikoff and Henikoff method discussed in the workshop lecture on representing patterns and motifs. Select a matrix that represents a reasonable degree of divergence for sequences within each group. In this context and for this data, select the PAM160 transition frequencies by entering 11.
- The GEnt program will solicit a value for the multiplier to use in determining the weight to give the Henikoff and Henikoff pseudocounts relative to the observed data tabulated from the sequences. Enter 4.0
- The GEnt program will next solicit whether you want to assign the sequences previously unassigned to groups in the current interactive session or if you would prefer to have the assignment data written to a file that you can examine later. Choose interactive assignment during the current session by entering the letter I
- The GEnt program will then ask if you want to base the assignment on scores from Henikoff style PSSMs that incorporate all of the positions in the alignment without gaps or on a restricted subset of these positions selected by the amount of information the positions contain about group membership. Choose all ungapped (analyzed) positions by entering the letter A
- The GEnt program will ask if you want to have it compute a symmetric cross entropy distance between every pair of sequences in the alignment. Choose to do this by entering the letter Y. In addition to the distances file, this will cause the program to write the files between.plt and within.plt. The within.plt file contains the distribution of cross entropy distances between every pair of sequences where both members of the pair are assigned to the same group. The between.plt file contains the distribution of cross entropy distances between every pair of sequences where each member of the pair belongs to a different group. Plotting these files provides an overview of the distinctness of the defined groups in terms of the composition of the sequences assigned to each group.
- The GEnt program will solicit which of four different analyses and results you want computed and written to files. Select the cross validation files, the plot files, and the GeneDoc files by entering the numbers 1 2 3 (separated by spaces). At this point the GEnt program will begin its calculations and write brief messages to the screen describing what it is computing.
- After the calculations the GEnt program will write to the screen the scores needed to assign the unassigned sequence ctr2_vesor to one of the three possible groups, Trypsin, Elastase, or Chymotrypsin. A sequence can be reliably assigned to a group if the score for that group is positive while the scores for the other groups are negative and there is a substantial difference between the two highest scores. A positive score indicates that the sequence being classified looks substantially like the sequences already assigned to that group and differs substantially from the sequences assigned to other groups. A negative score indicates that the sequence being classified is substantially more like the sequences already assigned to other groups and is substantially different from the sequences assigned to the group. Assign the sequence ctr2_vesor to the Chymotrypsin group by entering the number 3
- The GEnt program will then ask you to assign the sequences el3b_human and trya_rat to groups. Assign el3b_human to the Elastase group by entering the number 2
- Assign trya_rat to the Trypsin group by entering 1
- The GEnt program will ask if you want to write the new group assignments to a file. You should do this if you want to stop at this point without completing the analysis. For this exercise answer no by entering the letter N
- The GEnt program will again solicit whether you want to assign the sequences previously unassigned to groups in the current interactive session or if you would prefer to have the assignment data written to a file that you can examine later. This time choose later, off-line assignment by entering the letter O
- The GEnt program will again ask if you want to have it compute a symmetric cross entropy distance between every pair of sequences in the alignment. Again choose to do this by entering the letter Y.
- The GEnt program will again solicit which of four different analyses and results you want computed and written to files. Select the cross validation files, the plot files, and the GeneDoc files by entering the numbers 1 2 3 (separated by spaces). At this point the GEnt program will begin its calculations and write brief messages to the screen describing what it is computing. It will also provide a statistical summary of the between and within group cross entropy distance distributions. For compact, well-defined groups the average distance between sequences within the same group should be substantially smaller than the average distance between sequences in different groups.
- The last thing the GEnt program will ask is for the name of a file to which it can write the new groups assignments. (You may want to call this file name new.groups). Enter the name of the file selected below:
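Two rules of thumb from the steps above, the criterion for a reliable group assignment and the comparison of within-group and between-group distances, can be written down as a short Python sketch. The function names, the example scores and distances, and the `min_gap` threshold are all hypothetical: the exercise only asks for a "substantial" difference between the two highest scores and does not give a number, and none of the values below come from GEnt output.

```python
from statistics import mean

def reliable_assignment(scores, min_gap=1.0):
    """Return the group to assign, or None if the call is not clear-cut.

    scores  -- dict mapping group name to score (at least two groups)
    min_gap -- assumed minimum difference between the two highest scores
    """
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    best_group, best_score = ranked[0]
    second_score = ranked[1][1]
    others_negative = all(score < 0 for _, score in ranked[1:])
    if best_score > 0 and others_negative and best_score - second_score >= min_gap:
        return best_group
    return None

def within_between_means(distances, group_of):
    """Mean distance for same-group pairs and for different-group pairs.

    distances -- dict mapping (name_a, name_b) to a distance
    group_of  -- dict mapping sequence name to group name
    Both kinds of pair must be present, otherwise mean() raises an error.
    """
    within, between = [], []
    for (a, b), d in distances.items():
        if group_of[a] == group_of[b]:
            within.append(d)
        else:
            between.append(d)
    return mean(within), mean(between)

# Hypothetical scores for one unassigned sequence.
print(reliable_assignment({"Trypsin": -12.0, "Elastase": -8.5, "Chymotrypsin": 20.3}))
print(reliable_assignment({"Trypsin": 3.0, "Elastase": 2.5, "Chymotrypsin": -4.0}))

# Hypothetical distances between four sequences in two groups.
groups = {"a1": "A", "a2": "A", "b1": "B", "b2": "B"}
pairs = {("a1", "a2"): 0.2, ("b1", "b2"): 0.4,
         ("a1", "b1"): 1.0, ("a2", "b2"): 1.2}
within_mean, between_mean = within_between_means(pairs, groups)
print(within_mean < between_mean)
```

In the second set of scores two groups are positive, so the sketch declines to make a call; that is the situation in which the exercise would have you look more closely before assigning a sequence.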
Examine the results of the Group Cross Entropy calculation using the xgobi display program. It will display an interactive plot of the results from the GEnt program. Each of the three groups can be displayed. The necessary files are those ending with the extensions .dat, .col, and .row.
- Start the xgobi program and give it the name of the group whose analysis you want to display using the command: xgobi Chymotrypsin
- An Xwindow should appear on your screen that contains a plot of the Chymotrypsin results with Group Entropy on the vertical axis and Family Entropy on the horizontal axis. Interchange the axes by placing your cursor in the circle at the right of the screen under the Group Entropy label. Place the cursor to the right and even with the center of the circle while remaining in the circle. Click the left mouse button. The cursor should become an X and after the click the plot should redraw with the axes interchanged.
- Resize the plot. This requires two steps. First, place the cursor on the lower right corner of the entire window, press the left mouse button, and drag the cursor down and to the right. Release the mouse button.
- Next place the cursor on the small colored box near the bottom of the vertical line just to the right of the plot area. The cursor should change shape and become a double headed arrow. Again hold down the left mouse button and drag the cursor to the right. Be sure to leave room for the labels to the right of the plot area.
- Activate the labeling of individual points in the plot. Place the cursor in the second menu box at the top left of the screen. This box should be labeled View: XYPlot. Hold down the left mouse button and highlight the menu entry Identify. Release the mouse button. The text box at the bottom left of the plot should now read "L:Toggle sticky labels".
- The most informative points in the plot are those with high group entropy values, which should now be on the right side of the plot. Place the cursor over the rightmost point in the plot and leave it there. A label should appear near the point that reads 205y-h. The label has the form NNNg-f, where NNN is the alignment index, the location in the alignment of the column analyzed. The letter g (y for tyrosine in this case) is the single letter code for the amino acid that makes the largest contribution to the group entropy. The letter f (h for histidine in this case) is the single letter code for the amino acid that makes the largest contribution to the family entropy. The label can be made persistent by clicking the left mouse button.
- Click, with the left mouse button, on all of the points that you would like to have labeled.
- To save this plot to a postscript file that you can either print or import into other programs, place the cursor on the file menu button at the top left of the window. Press the left mouse button and move the cursor to highlight the Print option. Then release the mouse button and the Print dialogue box will open.
- Place the cursor in the box that contains the file name foo.ps which is just to the right of the box labeled Filename:. The cursor should be placed just to the right of the ".ps". Use the backspace key to erase "foo.ps" and type in a new file name, chymotrypsin.ps. Click, with the left mouse button, the box marked write to file. Close the print dialogue box by clicking in the box at the bottom of the dialogue box, marked click here to dismiss.
- To print this plot on a printer at the PSC (do this only if you are currently attending a workshop on-site at the PSC) open the print dialogue box by placing the cursor on the file menu button at the top left of the window. Press the left mouse button and move the cursor to highlight the Print option. Then release the mouse button and the Print dialogue box will open. Place the cursor in the box that reads lp. This box is just to the right of the box labeled Postscript printer. Use the backspace key to erase the text lp and enter lpr -Ptoast in its place. Note that there is a space between "lpr" and "-P", but not between "-P" and "toast". The "P" in "-P" is uppercase. Then click on the box labeled Send to printer. This will print the plot at a printer called Toast that is connected to the sequence analysis computer at the PSC. Close the print dialogue box by clicking in the box at the bottom of the dialogue box, marked click here to dismiss.
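If you copy the point labels out of the plot for later use, the NNNg-f form described in the steps above is easy to split into its parts. The Python sketch below is a hypothetical helper, not part of GEnt or xgobi, and it assumes that both residue codes are always single letters.

```python
import re

# Alignment index, then the group residue, a hyphen, and the family residue.
LABEL_PATTERN = re.compile(r"^(\d+)([A-Za-z])-([A-Za-z])$")

def parse_label(label):
    """Split a label such as 205y-h into (index, group_residue, family_residue)."""
    match = LABEL_PATTERN.match(label.strip())
    if match is None:
        raise ValueError("not in NNNg-f form: " + repr(label))
    index, group_residue, family_residue = match.groups()
    return int(index), group_residue.upper(), family_residue.upper()

print(parse_label("205y-h"))  # (205, 'Y', 'H')
```

For the label in the exercise this gives alignment column 205, with tyrosine as the main contributor to the group entropy and histidine as the main contributor to the family entropy.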