Exercises for bioinformatics.psc.edu:
Retrieving Sequences from Databases

This exercise is designed to help you familiarize yourself with the steps involved in retrieving sequences from databases. Please read each step in full before typing anything. The EMBOSS programs "textsearch," "seqret," and "entret" are used to locate and retrieve sequences locally. The exercise also covers how to retrieve sequences from the UniProt and Entrez web sites.

Connect to bioinformatics.psc.edu

  1. Use ssh to connect to bioinformatics.psc.edu

Initialize the software

  1. People who have successfully completed the Unix Operating System Hands-On will have the EMBOSS software initialized automatically when they log in to the computer. If you haven't done the Unix Operating System Hands-On and need to initialize the EMBOSS package, enter: source /home/biomed/bin/emboss
  2. To run the readseq program, place /home/biomed/bin in your UNIX path. Otherwise, you will have to substitute /home/biomed/bin/readseq wherever "readseq" is mentioned below (Step 6.1). People who have successfully completed the Unix Operating System Hands-On will already have /home/biomed/bin in their UNIX paths.
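
The two initialization steps above can be sketched as shell commands. This is a minimal sketch, assuming a Bourne-compatible login shell (sh or bash) and a setup file written for that shell; csh/tcsh users would need different syntax. The helper name add_to_path is hypothetical, and the setup file is only read if it exists on the machine.

```shell
# Paths taken from the steps above.
EMBOSS_SETUP=/home/biomed/bin/emboss
BIOMED_BIN=/home/biomed/bin

# Hypothetical helper: append a directory to PATH unless it is already listed.
add_to_path() {
    case ":$PATH:" in
        *":$1:"*) ;;                 # already present, nothing to do
        *) PATH="$PATH:$1" ;;
    esac
}

# Step 1: initialize EMBOSS, but only if the setup file exists here.
# (Assumption: the file uses Bourne-shell syntax.)
if [ -r "$EMBOSS_SETUP" ]; then
    . "$EMBOSS_SETUP"
fi

# Step 2: make readseq reachable by its bare name.
add_to_path "$BIOMED_BIN"
export PATH
```

Calling add_to_path a second time with the same directory leaves PATH unchanged, so the snippet is safe to place in a login script.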

Using EMBOSS textsearch program to find sequence identifiers

  1. First use the textsearch command to search the UniProt database for human plasminogen tissue activator precursor. Enter: textsearch
  2. Enter uniprot as the input sequences to be searched through.
  3. Enter tissue-type plasminogen activator as the search pattern.
  4. Select a meaningful output file name such as "tpa.textsearch." The program may take a few minutes to run; when it is done, the system will give you your prompt back.
  5. Use the more command to view the output file named in the previous step. The accession number (the six-character identifier) for the correct sequence (human plasminogen tissue activator precursor) appears right before the text describing the sequence.
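
The interactive prompts above can also be answered on the command line. The following is a sketch, assuming the local database is registered under the name uniprot (as in step 2) and that EMBOSS has been initialized; the helper human_hits is hypothetical and simply narrows the hit list to lines mentioning "human".

```shell
DB='uniprot:*'                                # every entry in the local UniProt database
PATTERN='tissue-type plasminogen activator'   # search pattern from step 3
OUT=tpa.textsearch                            # output file name from step 4

# Run the search only when EMBOSS is available in this shell.
if command -v textsearch >/dev/null 2>&1; then
    textsearch -sequence "$DB" -pattern "$PATTERN" -outfile "$OUT"
fi

# Hypothetical helper: keep only hits that mention "human"
# (case-insensitive), for example an entry name ending in _HUMAN.
human_hits() {
    grep -i 'human' "$1"
}

# Show the narrowed list once the output file exists.
if [ -f "$OUT" ]; then
    human_hits "$OUT"
fi
```

The filter is only a convenience; the full output file should still be inspected with more to confirm which hit is the precursor.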

Using EMBOSS seqret program to retrieve a sequence in fasta format

  1. Enter: seqret
  2. For the input sequence, enter "uniprot:" followed by the identifier of the sequence that you want to retrieve (see step 3.5), e.g. uniprot:P00750
  3. By default seqret creates an output file in the fasta sequence file format.
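
As a non-interactive counterpart to the steps above, seqret accepts the input address and output file as qualifiers. This is a minimal sketch: the output file name tpa_human.fasta and the helper make_usa are illustrative assumptions, and the command only runs if seqret is on the PATH.

```shell
ACC=P00750                    # accession found with textsearch
OUT=tpa_human.fasta           # hypothetical output file name

# Build an EMBOSS sequence address of the form database:identifier.
make_usa() {
    printf '%s:%s\n' "$1" "$2"
}

USA=$(make_usa uniprot "$ACC")

# Retrieve the sequence; FASTA is the default output format.
if command -v seqret >/dev/null 2>&1; then
    seqret -sequence "$USA" -outseq "$OUT"
fi

# A FASTA file starts with a header line beginning with ">".
if [ -f "$OUT" ]; then
    head -n 1 "$OUT"
fi
```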

Using EMBOSS entret program to retrieve all info about a sequence contained in a data library

  1. Enter: entret
  2. For the input sequence, enter "uniprot:" followed by the identifier of the sequence that you want to retrieve (see step 3.5), e.g. uniprot:P00750
  3. Use the more command to view the two files created: the seqret output (step 4.3) and the entret output from this section.
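
The entret steps can be sketched the same way. The output file name tpa_human.entret and the helper id_and_ac are assumptions made for illustration; the helper relies on the UniProt flat-file convention that the entry name and accession lines start with the codes ID and AC.

```shell
USA=uniprot:P00750
ENTRY=tpa_human.entret        # hypothetical output file name

# Fetch the complete database entry (annotation plus sequence).
if command -v entret >/dev/null 2>&1; then
    entret -sequence "$USA" -outfile "$ENTRY"
fi

# Hypothetical helper: print only the ID (entry name) and AC (accession)
# lines of a UniProt flat-file entry.
id_and_ac() {
    grep -E '^(ID|AC) ' "$1"
}

if [ -f "$ENTRY" ]; then
    id_and_ac "$ENTRY"
fi
```

Comparing this file with the FASTA file from seqret shows the difference between a full entry and the bare sequence.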

Reformat a sequence into the nbrf/pir file format.

  1. Enter: readseq
  2. Enter an output file name (such as tpa_human.pir).
  3. Select choice 3, the nbrf/pir file format.
  4. Enter the name of the file containing the sequence to convert (the entret output file viewed in step 5.3).
  5. Make sure that the program correctly identified the file format of your sequence file. (UniProt uses the EMBL format.)
  6. Hit the Enter key to quit.
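
For reference, the same conversion can be attempted without the menus. This sketch assumes the installed program is the classic command-line readseq, whose help text lists the options -all, -format= and -output=; other readseq versions use different options, so check the local help before relying on it. The file names and the helper looks_like_pir are hypothetical.

```shell
IN=tpa_human.entret           # hypothetical name of the entret output file
OUT=tpa_human.pir             # output file name suggested in step 2

# Assumed classic readseq options: -all (every sequence in the file),
# -format=3 (NBRF/PIR, matching menu choice 3), -output=FILE.
if command -v readseq >/dev/null 2>&1 && [ -f "$IN" ]; then
    readseq -all -format=3 -output="$OUT" "$IN"
fi

# An NBRF/PIR entry begins with a header such as ">P1;NAME"
# (a two-character type code followed by a semicolon).
looks_like_pir() {
    head -n 1 "$1" | grep -q '^>[A-Z][A-Z0-9];'
}

if [ -f "$OUT" ]; then
    if looks_like_pir "$OUT"; then
        echo "$OUT: NBRF/PIR header found"
    else
        echo "$OUT: no NBRF/PIR header"
    fi
fi
```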

OPTIONAL - Using the UniProt web site to identify a sequence and save it to a fasta file

  1. Start your favorite web browser.
  2. Go to: http://www.pir.uniprot.org/search/textSearch.shtml
  3. Click on Add input box, followed by +box, followed by +box, followed by +box.
  4. In the first box to the right of the Any Field box, enter Phospholipase
  5. In the box to the right of the second Any Field box, enter A2
  6. In the box to the right of the third Any Field box, enter rattlesnake
  7. In the box to the right of the fourth Any Field box, enter western
  8. Click the Search box.
  9. Note: As an alternative, instead of performing steps 7.10 - 7.13 below, simply locate the accession and retrieve the sequence on bioinformatics by using the EMBOSS seqret program. (See Steps 4.1 - 4.3 above.)
  10. Check the box next to PA2_CROAT
  11. Locate the Save Options section on the right hand side of the page in the middle. Click on the FASTA button.
  12. Click Save in the pop-up window.
  13. A second pop-up window will appear. You may want to take the opportunity to change the filename to something more meaningful. When you are satisfied with the filename selected, click Save.
  14. Use scp or Kerberos ftp to transfer the file to bioinformatics.
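
The transfer in the last step might look like the following when scp is used. This is a sketch only: username is a placeholder for your own login name, PA2_CROAT.fasta is a hypothetical file name chosen in the save dialog, and the helper scp_command merely prints the command so it can be checked before it is run.

```shell
REMOTE_HOST=bioinformatics.psc.edu
REMOTE_USER=username          # placeholder: replace with your own login name
LOCAL_FILE=PA2_CROAT.fasta    # hypothetical name given to the saved file

# Hypothetical helper: print the scp command that copies a file into the
# remote home directory (the empty path after the trailing colon).
scp_command() {
    printf 'scp %s %s@%s:\n' "$1" "$REMOTE_USER" "$REMOTE_HOST"
}

scp_command "$LOCAL_FILE"
# prints: scp PA2_CROAT.fasta username@bioinformatics.psc.edu:
```

Run the printed command from the directory where the browser saved the file. File names containing spaces would need quoting.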

OPTIONAL - Using the NCBI Entrez web site to identify a sequence and save it to a fasta file

  1. Start your favorite web browser.
  2. Go to: http://www.ncbi.nlm.nih.gov/Entrez/
  3. Enter Phospholipase a2 rattlesnake in the Search Across Databases line.
  4. Click GO
  5. Click on the protein sequence database. Note that you can also use Entrez to search GenBank and a number of other sequence data libraries and other data sources.
  6. Click on the entry for the western diamondback rattlesnake
  7. Notice that the sequence is displayed in GenBank format.
  8. Display the entry in fasta format. Locate the box titled default directly to the right of the Display button (at the top of the page, right before the sequence entry is shown). Change this menu from default to FASTA, then click Display.
  9. Save the entry as a fasta file. Locate the box titled File directly to the right of the Send to button (at the top of the page, right before the sequence entry is shown). Change the menu from File to Text, then click Send to.
  10. Use the browser's Save As option to save the file. Under the browser's File menu, select Save As ...
  11. In the pop-up window, change the Save as type: to text. You may also want to change the file name to something more meaningful.
  12. Use scp or Kerberos ftp to transfer the file to bioinformatics.
  13. Note: As an alternative, instead of performing steps 8.8 - 8.12 above, simply locate the accession and retrieve the sequence on bioinformatics by using the EMBOSS seqret program. (See Steps 4.1 - 4.3 above.) Also note that this will work for most entries (such as GenBank and SwissProt/UniProt) but not for all entries.
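
Because the file in steps 8.10 - 8.11 is saved from a browser, it is worth checking that it really is plain-text fasta before using it. The sketch below is one way to do that; the file name and both helper names are hypothetical.

```shell
SAVED=pa2_croat.txt           # hypothetical name given in the Save As dialog

# Hypothetical helper: count fasta records (header lines starting with ">").
count_fasta_records() {
    grep -c '^>' "$1"
}

# Hypothetical helper: succeed if the file contains an HTML tag, which
# suggests the page was saved as a web page rather than as text.
saved_as_html() {
    grep -qi '<html' "$1"
}

if [ -f "$SAVED" ]; then
    if saved_as_html "$SAVED"; then
        echo "$SAVED looks like HTML; save it again with type text"
    else
        echo "fasta records in $SAVED: $(count_fasta_records "$SAVED")"
    fi
fi
```

A single saved entry should report one record.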
