Exercises for bioinformatics.psc.edu:
Multiple Sequence Alignment
A multiple sequence alignment shows the juxtaposition of residues between a set of sequences. The juxtaposition can help highlight which residues are conserved or selectively mutated, showing residues that may be structurally/functionally important. A multiple sequence alignment can be thought of as representing the best guess as to the detailed evolutionary history of the sequences being aligned.
This exercise is designed to take you through the steps of performing multiple sequence alignments. It is not a complete step-by-step example. You will have to think about the problem and what you are trying to accomplish before moving on to the next step. Please read the entire step before typing anything on the computer. Also, please fill in the blank lines, because your recorded responses are often referred to in later steps. This exercise assumes that you are using the C shell (csh) and have successfully completed the Unix Operating System Hands-On exercises.
Get related sequences to align
- In this example we will align a set of aspartyl protease sequences. Copy the file containing these sequences, called asp.pir, from the directory /biomed/lib/examples. Enter "cp /biomed/lib/examples/asp.pir asp.pir". Write the name of the file in your directory containing the aspartyl protease sequences here:
________________________________________________________________________
- Copy the file containing additional aspartyl protease sequences, called addasp.pir, from the directory /biomed/lib/examples. Enter "cp /biomed/lib/examples/addasp.pir addasp.pir". Write the name of the file in your directory containing the additional aspartyl protease sequences here:
________________________________________________________________________
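Before moving on, it can be worth confirming that both copies arrived and contain sequences. The sketch below is written in Bourne shell (sh) syntax rather than csh, so save it in a file and run it with sh. It assumes that every NBRF-PIR entry begins with a header line starting with ">" (for proteins, ">P1;" followed by an identifier). The function name count_pir_entries is made up for illustration.

```shell
#!/bin/sh
# Count the entries in an NBRF-PIR file.
# Assumption: each entry starts with one header line beginning with ">".
count_pir_entries() {
    # grep -c prints the number of matching lines (0 if none match)
    grep -c '^>' "$1"
}

# Report on the two files copied in this section, if they are present.
for f in asp.pir addasp.pir; do
    if [ -f "$f" ]; then
        echo "$f: $(count_pir_entries "$f") sequences"
    else
        echo "$f: not found in the current directory"
    fi
done
```

If a count comes back as 0, the copy probably failed or the file is not in PIR format, and it is better to find out now than after a job has been queued.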
Use the MSA program to create a multiple sequence alignment
- Use the MAKSEQ program to write a script file. Enter makseq to start the Makseq program
- Select Multiple sequence alignment.
- Select MSA.
- Enter the file written down in step 1.1 as the query file.
- Enter YES; these sequences are proteins.
- Enter NO. This will run both the heuristic alignment and the optimal alignment. Normally one would run the optimal alignment only after reviewing the results of the heuristic alignment.
- Enter NO. You will not need the heuristic alignment returned as an MSF file.
- Answer YES to the question asking for an optimal alignment returned as an MSF file.
- Elect not to penalize terminal gaps the same as internal gaps.
- Enter YES to use the evolutionary tree.
- Enter NO. We are not specifying a maximum divergence.
- Enter NO. We are not defining the epsilons.
- Select the MSA-250 matrix.
- Select NO. We are not adjusting the epsilons.
- Use the default output file. Write that file name below:
___________________________________________________________
- Write the script file name below:
___________________________________________________________
- Write the filename for the optimal alignment in MSF format below:
___________________________________________________________
- Submit the script file to the PBS queue. Enter "qsub [scriptfile] -o [logfile]", where [scriptfile] is the filename from step 2.16 and [logfile] is a file name of your own choosing. A good practice is to name the [logfile] by substituting .log for .job. For example, if your [scriptfile] is named Fasta.job, a good [logfile] name is Fasta.log, and you would submit the job with the command "qsub Fasta.job -o Fasta.log". Write the name of the [logfile] below:
___________________________________________________________
- When the script file is successfully submitted, the system will respond with an identifier (e.g. 132.codon.psc.edu). Write that identifier here:
___________________________________________________________
- The script file will take several minutes to run, depending on how many jobs other workshop participants are running. Unless the system is particularly heavily loaded, the script should complete within about fifteen minutes. (Remember, you can check on the status of your job by entering "qstat".) When your job is complete, examine the log file (step 2.18) for errors. Next, examine the optimal multiple alignment file (step 2.17).
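The log file naming convention and the log check from the steps above can be sketched as follows. This is written in Bourne shell (sh) syntax rather than csh, so save it in a file and run it with sh. The function names are hypothetical, and the script only prints the qsub command rather than submitting anything. The check for the word "error" is an assumption about how problems show up in the log; an empty result does not prove that the run succeeded.

```shell
#!/bin/sh
# Derive a log file name from a script file name by replacing a
# trailing ".job" with ".log" (Fasta.job -> Fasta.log).
log_name_for() {
    case "$1" in
        *.job) echo "${1%.job}.log" ;;
        *)     echo "$1.log" ;;   # no .job suffix: just append .log
    esac
}

# Print the qsub command for a script file instead of running it,
# so the line can be checked before it is typed.
show_submit_command() {
    echo "qsub $1 -o $(log_name_for "$1")"
}

# List the lines of a log file that mention "error" in any letter case,
# each prefixed with its line number.
# Assumption: problems are reported using the word "error".
scan_log() {
    grep -i -n 'error' "$1"
}

show_submit_command Fasta.job   # prints: qsub Fasta.job -o Fasta.log
```

Once the job has finished, calling scan_log with the log file name you recorded gives a quick first look, but reading the whole log is still the safer check.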
Use the T-Coffee Program to produce a multiple sequence alignment
- Use the MAKSEQ program to write a script file. Enter makseq to start the Makseq program
- Select Multiple sequence alignment
- Select T-COFFEE
- Enter the file written down in step 1.1 as the query file.
- Enter YES; these sequences are proteins
- Select Gotoh's;
- Select the Slow tree computational method
- Select the Slow normalize method
- Accept the default output file name. Write that file name below:
________________________________________________________
- Write the script file name below:
________________________________________________________
- Write the MSF filename below:
________________________________________________________
- Write the Score filename below:
________________________________________________________
- Write the dendrogram filename below:
________________________________________________________
- Submit the script file to the PBS queue. Enter "qsub [scriptfile] -o [logfile]", where [scriptfile] is the filename from step 3.10 and [logfile] is a file name of your own choosing. A good practice is to name the [logfile] by substituting .log for .job. For example, if your [scriptfile] is named Fasta.job, a good [logfile] name is Fasta.log, and you would submit the job with the command "qsub Fasta.job -o Fasta.log". Write the name of the [logfile] below:
________________________________________________________
- When the script file is successfully submitted, the system will respond with an identifier (e.g. 132.codon.psc.edu). Write that identifier here:
________________________________________________________
- The script file will take several minutes to run, depending on how many jobs other workshop participants are running. Unless the system is particularly heavily loaded, the script should complete within about fifteen minutes. (Remember, you can check on the status of your job by entering "qstat".) When your job is complete, examine the log file (step 3.14) for errors. Next, examine the multiple alignment file (step 3.11). Finally, examine the score file (step 3.12). Note that the score file is a PostScript file. Review it by printing it on a PostScript printer or by viewing it with a PostScript previewer such as Ghostscript.
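Before sending the score file to a printer, a quick check that it really is PostScript can save paper. The sketch below is in Bourne shell (sh) syntax rather than csh, so save it in a file and run it with sh. It relies on the fact that PostScript files begin with the characters "%!". The file name score.ps and the function name is_postscript are hypothetical; substitute the score file name you recorded. The gs command is the usual name of the Ghostscript previewer, but whether it is installed on the workshop machine is an assumption.

```shell
#!/bin/sh
# Succeed (exit status 0) if the first line of the file begins with
# "%!", the marker that opens a PostScript file.
is_postscript() {
    case "$(head -n 1 "$1")" in
        '%!'*) return 0 ;;
        *)     return 1 ;;
    esac
}

# Hypothetical file name; substitute the score file name you recorded.
score=score.ps
if [ -f "$score" ]; then
    if is_postscript "$score"; then
        echo "$score looks like PostScript; preview it with: gs $score"
    else
        echo "$score does not start with %!, so it may not be PostScript"
    fi
fi
```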
Use the Clustal Program to produce a multiple sequence alignment & phylogenetic tree.
- Enter clustalw
- Enter 1 to load your sequences.
- Enter the filename containing the sequences in the NBRF-PIR format (see step 1.1)
- Enter 2 for the multiple alignment menu.
- Enter 9 for the output format menu.
- Select 3 to produce an MSF file.
- Hit the Enter key to leave the output format menu.
- Enter 8 to turn the screen display off.
- Enter 1 to produce a complete multiple sequence alignment.
- Take the default Clustal output file name. Write that name below: ________________________________________________________________________
- Take the default GCG MSF file name. Write that name here: ________________________________________________________________________
- Take the default Clustal guide tree name. Write that name here: ________________________________________________________________________
- Hit the Enter key to leave the multiple alignment menu.
- Select 4 for the phylogenetic trees menu.
- Select 2 to exclude positions with gaps.
- Select 3 to correct for multiple substitutions.
- Draw the tree. Enter 4.
- Take the default phylip tree output name. Write that name here: ________________________________________________________________________
- Hit the Enter key to leave the phylogenetic tree menu.
- Enter X to leave the program.
- Examine the MSF output file. (The name is written in step 4.11.)
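One simple way to begin examining an MSF file is to list the sequences it contains and confirm that none are missing. The sketch below is in Bourne shell (sh) syntax rather than csh, so save it in a file and run it with sh. It assumes the usual GCG MSF layout, in which each sequence has a header line beginning with "Name:" and the header ends at a line containing only "//". The file name asp.msf and the function name msf_names are hypothetical; substitute the MSF file name you recorded.

```shell
#!/bin/sh
# Print the sequence names listed in the header of a GCG MSF file.
# Assumption: each sequence has a header line of the form
#   " Name: <id>  Len: <n>  Check: <n>  Weight: <w>"
# and the header ends at a line consisting of "//".
msf_names() {
    awk '$1 == "//" { exit } $1 == "Name:" { print $2 }' "$1"
}

# Hypothetical file name; substitute the MSF file name you recorded.
msf=asp.msf
if [ -f "$msf" ]; then
    msf_names "$msf"
fi
```

The number of names printed should match the number of sequences in the input file. The same check can be applied to the MSF files produced by the other alignment programs in this exercise.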
Add more sequences to an already existing alignment
- Edit the optimal alignment file (.msf) produced by the MSA program in step 2.17 so that the Clustal W program can identify it as a GCG .msf file. Open the file using your preferred text editor and insert the word "PileUp" (without the quotes) at the left edge of the first line in the file. Save the file.
- Start the CLUSTAL W program by entering the command: clustalw
- Select Profile / Structure Alignments
- From the Profile Alignment menu, select Input 1st profile
- You will be asked to enter the name of a file containing a multiple sequence alignment. Enter the name of the file you edited in step 5.1 above (originally recorded in step 2.17). The menu should reappear with the notation "loaded" after item 1 in the menu.
- Now select Input 2nd profile/sequences from the Profile Alignment menu.
- You will be asked for the name of a file containing a multiple sequence alignment or individual sequences. Type in the name of the file containing the additional aspartyl protease sequences (step 1.2). The names of the sequences should scroll across the screen, and the menu should reappear with the notation "loaded" after item 2 in the menu.
- Select Output format options from the menu.
- Select Toggle GCG/MSF format output from the output format menu so that the item is marked "ON".
- Press the Enter key to return to the Profile Alignment menu.
- Select Align sequences to 1st profile from the menu. You will be asked to enter names for the alignment output files and the guide tree output file. Write the name of the Clustal output file below:
_________________________________________________________
- Write the name of the Clustal GCG MSF file below:
_________________________________________________________
- Write the name of the Clustal guide tree file below:
_________________________________________________________
- Exit the Clustal W program.
- An optional but informative exercise is to put all eight aspartyl proteases included in this alignment into a single file, align them as a single set using the Clustal W program, and compare the result to that obtained in step 5.12 above.
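Two of the file manipulations in this section can be done without a text editor: adding the word "PileUp" to the first line of the MSA output, and building the combined file for the optional exercise. The sketch below is in Bourne shell (sh) syntax rather than csh, so save it in a file and run it with sh. The function names and the file name allasp.pir are hypothetical. Unlike the editing step described above, add_pileup writes a new file and leaves the original untouched, and it puts a space after the inserted word, which is an assumption.

```shell
#!/bin/sh
# Write a copy of an MSF file ($1) to a new file ($2) with "PileUp "
# prepended to the first line. If the first line already starts with
# "PileUp", the file is copied unchanged.
add_pileup() {
    if head -n 1 "$1" | grep -q '^PileUp'; then
        cp "$1" "$2"
    else
        sed '1s/^/PileUp /' "$1" > "$2"
    fi
}

# Concatenate PIR files into one file.
# Usage: combine_pir OUTPUT INPUT1 INPUT2 ...
combine_pir() {
    out=$1
    shift
    cat "$@" > "$out"
}

# Build the combined file for the optional exercise, if both inputs exist.
if [ -f asp.pir ] && [ -f addasp.pir ]; then
    combine_pir allasp.pir asp.pir addasp.pir
    echo "allasp.pir: $(grep -c '^>' allasp.pir) sequences"
fi
```

For the combined file, the count printed should be eight. To use add_pileup, call it with the MSA output file name you recorded and a new name of your choosing, then give the new name to Clustal W when it asks for the first profile.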





