Exercises for bioinformatics.psc.edu:
Database Searching and Alignment

The primary purpose of searching a database is to find related sequences. These related sequences may be sequences that share a common domain, or may be sequences that belong to the same family or superfamily. Finding related sequences can give us indirect insight into the possible structure or function of an unknown query sequence (e.g. If the related sequence is known to bind calcium, does the query sequence also bind calcium?)

This exercise is designed to take you through the steps of searching sequence libraries. This exercise is not a complete step-by-step example, you will have to think about the problem and what you are trying to acomplish before moving on to the next step. Please read the entire step before typing in anything on the computer. You can make a hardcopy of this exercise and fill in the blank lines. Your responses will often be referred to in later steps. Note that you will need to substitute the actual filenames for names enclosed in square brackets (eg. [infilename]). This exercise assumes that the user is in the c shell (csh) and has completed both the Unix Operating System and the Retreiving Sequences from Databases Hands-On exercises.


Create a subdirectory for this exercise

  1. Be sure you are in your home directory. Enter the command: cd $HOME
  2. Create a subdirectory called "Trydbs". This is done by typing: mkdir Trydbs
  3. Move into this directory to carry out the exercise. cd Trydbs

Get Search Sequence

  1. In this example we will use as our search sequence RHIZOPUSPEPSIN PRECURSOR (EC 3.4.23.21). Copy this sequence, which is in the file named carp_rhich.swiss in the directory /biomed/lib/examples Enter cp /biomed/lib/examples/carp_rhich.swiss carp_rhich.swiss Write the name of the file in your directory containing the RHIZOPUSPEPSIN PRECURSOR here:
    __________________________________________________________________

Use the FASTA program to search the UniRef90 database

  1. Enter makseq
  2. Select Fast database searching with the Word search algorithm
  3. Enter the sequence file. (See step 2.1)
  4. Enter YES, This is a protein sequence.
  5. Select UniRef90 data library
  6. Select the BLOSUM62 scoring matrix.
  7. Enter -12 as the open gap penalty (default).
  8. Enter -1 as the extend gap penalty (default).
  9. Enter 600 as the number of alignments to return (default).
  10. Enter 2 as the ktup (For best results, you would want to select 1 as the ktup; we are selecting two to save run time.)
  11. Enter 16 as the band width (For best results, you would want to set this higher, to perhaps 32; we are selecting 16 to save run time - default value.)
  12. Take the default output name. Write that name below:
    ________________________________________________________________________
  13. Write the script file name (the file ending in .job) below:
    ________________________________________________________________________
  14. Submit the script file to the PBS queue. Enter: qsub [scriptfile] -o [logfile] where [scriptfile] is the filename in step 3.13 and [logfile] is a file name that you made up. A good practice in naming a [logfile] is to simply substitute .log for .job -- For example if your [scriptfile] was named Fasta.job then a good [logfile] name would be Fasta.log. (In this example, one would submit the job with the command "qsub Fasta.job -o Fasta.log"). Write the name of the [logfile] below:
    ________________________________________________________________________
  15. When the script file is successfully submitted, the system will respond with an identifier (e.g. 132.codon.psc.edu). Write that identifier here:
    ________________________________________________________________________
  16. The script file will take several minutes to run, depending on how many other workshop participants are running items. (Remember, you can check on the status of your job by typing in "qstat"). When your job is complete, examine the log file (step 3.14) for errors. Next examine the alignment file (step 3.12).

Use the MAXSEGS program to search the PDB sequence database

  1. Enter makseq
  2. Select Exhaustive database searching with the Dynamic programming algorithm
  3. Select MAXSEGS
  4. Enter the sequence file. (See step 2.1)
  5. Enter YES, This is a protein sequence.
  6. Select PDB Sequences
  7. Enter a descriptive title for the run (can be anything)
  8. Select the PAM120 matrix.
  9. Enter -16 as the open gap penalty.
  10. Enter -1 as the extend gap penalty. (default)
  11. Accept the default cutoff (let the program compute).
  12. Enter 5 as the number of subalignments per pair. (default)
  13. Select LOCAL alignments.(default)
  14. Elect to rescale the scores. (default)
  15. Take the default output name. Write that name below:
    ________________________________________________________________________
  16. Enter YES to sort the output. (default)
  17. Enter YES to get a listing file. (default)
  18. Write the script file name (the file ending in .job) below:
    ________________________________________________________________________
  19. Write the output file name (the file ending in .out) below:
    ________________________________________________________________________
  20. Write the listing file name (the file ending in .lst) below:
    ________________________________________________________________________
  21. Submit the script file to the PBS queue. Enter: qsub [scriptfile] -o [logfile] where [scriptfile] is the filename in step 4.18 and [logfile] is a file name that you made up. A good practice in naming a [logfile] is to simply substitute .log for .job -- For example if your [scriptfile] was named Max.job then a good [logfile] name would be Max.log. (In this example, one would submit the job with the command "qsub Max.job -o Max.log"). Write the name of the [logfile] below:
    ________________________________________________________________________
  22. When the script file is successfully submitted, the system will respond with an identifier (e.g. 132.codon.psc.edu). Write that identifier here:
    ________________________________________________________________________
  23. The script file will take several minutes (about 5x longer than the above fasta job) to run, depending on how many other workshop participants are running items. (Remember, you can check on the status of your job by typing in "qstat"). When your job is complete, examine the log file (step 4.19) for errors. Next examine the listing file (listed in step 4.20) and the alignment file (listed in step 4.15 and 4.19).

Use the TFASTA program to search translated Genbank

  1. Enter makseq
  2. Select Fast database searching with the Word search algorithm
  3. Enter the sequence file. (See step 2.1)
  4. Enter YES, This is a protein sequence.
  5. Select Saccharomyces cerevisiae genome
  6. Enter NO to not allow frameshifts. (Normally you would want to do this; we are not allowing frameshifts to save time.)
  7. Enter NO to not force the use of unlimited Smith-Waterman alignment. (default)
  8. Enter YES to use all six reading frames. (Normally you would want to search all six frames for complete coverage; Answering no will select only the three forward frames).
  9. Select the BLOSUM62 scoring matrix.
  10. Enter -12 as the open gap penalty. (default)
  11. Enter -1 as the extend gap penalty. (default)
  12. Enter 100 as the number of alignments to return.
  13. Enter 1 as the ktup.
  14. Enter 32 as the band width
  15. Take the default output name. Write that name below:
    ________________________________________________________________________
  16. Write the script file name (the file ending in .job) below:
    ________________________________________________________________________
  17. Submit the script file to the PBS queue. Enter: qsub [scriptfile] -o [logfile] where [scriptfile] is the filename in step 5.16 and [logfile] is a file name that you made up. A good practice in naming a [logfile] is to simply substitute .log for .job -- For example if your [scriptfile] was named Tfasta.job then a good [logfile] name would be Tfasta.log. (In this example, one would submit the job with the command "qsub Tfasta.job -o Tfasta.log"). Write the name of the [logfile] below:
    ________________________________________________________________________
  18. When the script file is successfully submitted, the system will respond with an identifier (e.g. 132.codon.psc.edu). Write that identifier here:
    ________________________________________________________________________
  19. The script file will take a few minutes to run, depending on how many other workshop participants are running items. (Remember, you can check on the status of your job by typing in "qstat"). When your job is complete, examine the log file (step 5.17) for errors. Next examine the alignment file (step 5.15).

Search NRBSC


NRBSC Gateways

Microphysiology Gateway image.

Volumetric Data and Viz Gateway Analysis.

Quantum Mechanics/Molecular Mechanics Simulation Gateway.


NRBSC projects are made possible by these sponsors:

NIH logo. Pittsburgh Supercomputing Center logo. NCRR logo.