Exercises for bioinformatics.psc.edu:
Pattern Identification and Matching

Identifying patterns or motifs (consisting of conserved and selectively mutated regions) can help us discover the residues that are essential to the structure/function of the family/domain. These motifs can also be used to probe a sequence database for distant relatives belonging to the same family/domain. Finding distant relatives gives valuable insight into the structure/function because the residues critical to structure/function will be conserved or selectively mutated in these distant relatives.
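
As a rough illustration of probing sequences with a pattern, the sketch below scans a few sequences for a short motif written as a regular expression. The pattern used here (Asp, then Thr or Ser, then Gly) is a deliberately simplified stand-in for an aspartyl protease signature, and the sequences are made-up toy strings, not entries from asp.pir. The programs used later in this exercise build much richer models than a single regular expression.

```python
import re

# Simplified, illustrative pattern: Asp, then Thr or Ser, then Gly.
# Real family signatures are longer and more selective than this.
MOTIF = re.compile(r"D[TS]G")

# Made-up toy sequences (assumption: not taken from asp.pir).
sequences = {
    "toy_relative_1": "MKLVVDTGSSNLWV",
    "toy_relative_2": "AQEFIFDSGTTLTY",
    "toy_unrelated": "MKKLLPAAGHHWEE",
}


def find_motif(seq):
    """Return the 1-based start position of every motif match in seq."""
    return [m.start() + 1 for m in MOTIF.finditer(seq)]


hits = {name: find_motif(seq) for name, seq in sequences.items()}
for name, positions in hits.items():
    print(name, positions)
```

A sequence with an empty list of positions does not contain the pattern. A pattern this short will also match many unrelated proteins by chance, which is one reason the statistical models in the following sections are used instead.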

This exercise is designed to familiarize you with the process of creating patterns and searching the databases with those patterns. This exercise is not a complete step-by-step example; you will have to think about the problem and what you are trying to accomplish before moving on to the next step. Please read the entire step before typing anything on the computer. Also, print this exercise and fill in the blank lines; your recorded responses are often referred to in later steps. This exercise assumes that the user is in the C shell (csh). Please do the Multiple Sequence Alignment Hands-on session before attempting this example.


Get sequences

  1. In this example we will align a set of aspartyl protease sequences. Copy the file containing these sequences, which is called asp.pir, from the directory /biomed/lib/examples. Enter "cp /biomed/lib/examples/asp.pir asp.pir". Write the name of the file in your directory containing the aspartyl protease sequences here:
    __________________________________________________________________
  2. Copy the file containing the alignment of these sequences, which is called asp.msf, from the directory /biomed/lib/examples. Enter "cp /biomed/lib/examples/asp.msf asp.msf". Write the name of the file in your directory containing the alignment of the aspartyl protease sequences here:
    __________________________________________________________________
  3. Reformat the file listed in step 1.2. Enter tomsf [inputfile] [outputfile] where [inputfile] is the file listed in step 1.2 and [outputfile] is a filename that you made up. The output file from this procedure will be in the GCG MSF file format. Note that the MSF file produced by clustalw does not have the correct header information. You will need to run tomsf on the clustalw MSF file to fix this. tomsf may produce warning messages even when run successfully. Please check the output file and write the filename below:
    ____________________________________________________________________
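
Before moving on, it can help to confirm that the file from step 1.1 holds the sequences you expect. The sketch below is a minimal reader for NBRF/PIR-style text, assuming the usual layout: a header line such as ">P1;CODE", one description line, then sequence lines ending in "*". The function name read_pir and the toy records are hypothetical; they are not part of the workshop software.

```python
import io


def read_pir(handle):
    """Parse NBRF/PIR-style text into a {code: sequence} dictionary."""
    records = {}
    code = None
    skip_description = False
    for raw in handle:
        line = raw.strip()
        if line.startswith(">"):
            # Header looks like ">P1;CODE"; keep the part after the semicolon.
            code = line.split(";", 1)[-1]
            records[code] = ""
            skip_description = True  # the next line is a free-text description
        elif skip_description:
            skip_description = False
        elif code is not None:
            # Sequence lines; a trailing "*" marks the end of the record.
            records[code] += line.rstrip("*").replace(" ", "")
    return records


# Made-up toy records (assumption: not the contents of asp.pir).
TOY_PIR = """>P1;TOY1
toy description one
MKLVVDTGSS
NLWV*
>P1;TOY2
toy description two
AQEFIFDSGTTLTY*
"""

records = read_pir(io.StringIO(TOY_PIR))
for code, seq in records.items():
    print(code, len(seq))
```

To try this on your own copy, replace io.StringIO(TOY_PIR) with open("asp.pir") and compare the number of records with the number of sequences in the alignment.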

Hidden Markov Models
The first step is to create a hidden Markov model using the hmmt program. Once the model is created, we can search a sequence database with the hmmsw program.

  1. Run makseq to create the HMM file. Enter: makseq
  2. Select Pattern Identification and Matching
  3. Select Search a database with a hidden Markov model
  4. Enter the MSF filename (See step 1.3)
  5. Accept the default query name
  6. Select P (or N - whichever is appropriate)
  7. Select option 2 multidomain/seq
  8. Select Dirichlet mixture as the prior to use
  9. Select Sibbald/Argos Voronoi sequence weighting algorithm
  10. Select NO - Do NOT calibrate the HMM (default).
  11. Now we are going to search a database with the hidden Markov model.
  12. Select the PDB sequences.
  13. Enter NO - Do NOT change the E-value Cutoffs (default).
  14. Write the name of the script (.job) file below:
    _________________________________________________________
  15. Write the name of the database search results file below:
    _______________________________________________________
  16. Submit the script file to the PBS queue. Enter: qsub [scriptfile] -o [logfile] where [scriptfile] is the filename in step 2.14 and [logfile] is a file name that you make up. A good practice in naming a [logfile] is to simply substitute .log for .job. For example, if your [scriptfile] was named Fasta.job then a good [logfile] name would be Fasta.log. (In this example, one would submit the job with the command "qsub Fasta.job -o Fasta.log"). Write the name of the [logfile] below:
    ________________________________________________________________
  17. When the script file is successfully submitted, the system will respond with an identifier (e.g. 132.codon.psc.edu). Write that identifier here:
    _________________________________________________________________
  18. The script file will take several minutes to run, depending on how many other workshop participants are running jobs. Unless the system is particularly heavily loaded, the script should complete within about fifteen minutes. (Remember, you can check on the status of your job by typing "qstat".) When your job is complete, examine the log file (step 2.16) for errors. Next, examine the database search file (step 2.15).
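
The log-file naming convention from step 2.16 (substitute .log for .job) can be sketched as a small helper. The function name qsub_command is hypothetical. The sketch only builds and prints the command line shown in the exercise; it does not submit anything to the queue.

```python
import shlex
from pathlib import Path


def qsub_command(scriptfile):
    """Build the qsub command line with a .log name derived from the .job name."""
    script = Path(scriptfile)
    logfile = script.with_suffix(".log")  # Fasta.job -> Fasta.log
    return shlex.join(["qsub", str(script), "-o", str(logfile)])


# Dry run only: print the command instead of executing it.
print(qsub_command("Fasta.job"))
```

Deriving the log name from the script name keeps each job's script and log side by side, which makes it easier to match them up when several jobs from this exercise are in the same directory.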

EM model - MEME
The first step is to use the meme program to search a group of sequences for patterns. Once the patterns are found, we can use the mast program to search a sequence database for additional examples of the patterns.

  1. Run makseq to create the meme script file. Enter: makseq
  2. Select Pattern Identification and Matching
  3. Select MEME
  4. Enter the filename containing the sequences. (See step 1.1)
  5. Enter YES, These are protein sequences.
  6. Select the ZOOPS model
  7. Enter 7 as the maximum number of motifs
  8. We do NOT wish to define the width of the motif (default).
  9. We do NOT want to define the starting widths (default).
  10. Accept the default output filename. Write that name below:
    ________________________________________________________________
  11. Write the name of the script file below:
    ________________________________________________________________
  12. Submit the script file to the PBS queue. Enter: qsub [scriptfile] -o [logfile] where [scriptfile] is the filename in step 3.11 and [logfile] is a file name that you make up. A good practice in naming a [logfile] is to simply substitute .log for .job. For example, if your [scriptfile] was named Fasta.job then a good [logfile] name would be Fasta.log. (In this example, one would submit the job with the command "qsub Fasta.job -o Fasta.log"). Write the name of the [logfile] below:
    ________________________________________________________________
  13. When the script file is successfully submitted, the system will respond with an identifier (e.g. 132.codon.psc.edu). Write that identifier here:
    ____________________________________________________________
  14. The script file will take a few minutes to run, depending on how many other workshop participants are running jobs. (Remember, you can check on the status of your job by typing "qstat".) When your job is complete, examine the log file (step 3.12) for errors. Next, examine the output file (step 3.10).

EM model - MAST search of a database with MEME results
The second step is to use the mast program to search a sequence database for additional examples of the patterns.

  1. Run makseq to create the mast script file. Enter: makseq
  2. Select Pattern Identification and Matching
  3. Select MAST
  4. Enter the meme output name. (See step 3.10)
  5. Enter asp as the query name.
  6. Enter YES The patterns are from protein sequences (default).
  7. Select the PDB sequences.
  8. Enter YES to use all motifs (default)
  9. Accept the default output name. Write that name below:
    ____________________________________________________________
  10. Write the name of the script file below:
    _________________________________________________________________
  11. Submit the script file to the PBS queue. Enter: qsub [scriptfile] -o [logfile] where [scriptfile] is the filename in step 4.10 and [logfile] is a file name that you make up. A good practice in naming a [logfile] is to simply substitute .log for .job. For example, if your [scriptfile] was named Fasta.job then a good [logfile] name would be Fasta.log. (In this example, one would submit the job with the command "qsub Fasta.job -o Fasta.log"). Write the name of the [logfile] below:
    __________________________________________________________________
  12. When the script file is successfully submitted, the system will respond with an identifier (e.g. 132.codon.psc.edu). Write that identifier here:
    ____________________________________________________________
  13. The script file will take a few minutes to run, depending on how many other workshop participants are running jobs. (Remember, you can check on the status of your job by typing "qstat".) When your job is complete, examine the log file (step 4.11) for errors. Next, examine the output file (step 4.9).

Position-specific scoring matrix (Profile)
The makseq program will write a script to call the makepssm program to build a profile from an MSF file then call the profiless program to search a sequence database with the profile.

  1. Run makseq to create the PROFILE-SS script file. Enter: makseq
  2. Select Pattern Identification and Matching
  3. Select Profile-SS
  4. Enter the filename containing the MSF file. (See step 1.3)
  5. Accept the default query name.
  6. Enter YES The profile was built from protein sequences (default).
  7. Select the PDB protein sequences.
  8. Select the Henikoff method (default).
  9. Choose Blosum60 for the matrix.
  10. Select Gribskov's model for the gap penalty.
  11. Enter a title for the run (can be anything).
  12. Enter -10 as the default open gap penalty.
  13. Enter -1 as the default extend gap penalty.
  14. Enter no; We want to enter our own cutoff.
  15. Enter 30 as the cutoff.
  16. Enter 5 as the number of subalignments per pair (default).
  17. Enter 25 as the minimum length.
  18. Select a LOCAL search (default).
  19. Enter NO do not adjust scores for sequence composition (default).
  20. Enter NO to produce the alignments (default)
  21. Accept the default output file name. Write that name below:
    _____________________________________________________________
  22. Elect to have the output sorted. Enter YES (default)
  23. Enter NO for the listing file.
  24. Write the name of the script file created below:
    _______________________________________________________________
  25. Write the name of the zscore file created below:
    ____________________________________________________________________
  26. Submit the script file to the PBS queue. Enter: qsub [scriptfile] -o [logfile] where [scriptfile] is the filename in step 5.24 and [logfile] is a file name that you make up. A good practice in naming a [logfile] is to simply substitute .log for .job. For example, if your [scriptfile] was named Fasta.job then a good [logfile] name would be Fasta.log. (In this example, one would submit the job with the command "qsub Fasta.job -o Fasta.log"). Write the name of the [logfile] below:
    ___________________________________________________________________
  27. When the script file is successfully submitted, the system will respond with an identifier (e.g. 132.codon.psc.edu). Write that identifier here:
    ______________________________________________________________
  28. The script file will take a few minutes to run, depending on how many other workshop participants are running jobs. (Remember, you can check on the status of your job by typing "qstat".) When your job is complete, examine the log file (step 5.26) for errors. Next, examine the zscore file (step 5.25) and the alignment output file (step 5.21).
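
To make the idea of a position-specific scoring matrix more concrete, the sketch below builds a small log-odds profile from a toy gapless alignment and slides it along a sequence. It is only an illustration of the general technique. It assumes uniform background frequencies and add-one pseudocounts, and it uses no sequence weighting, substitution matrix, or gap penalties, so it is not how makepssm or profiless compute their scores. All names and sequences here are hypothetical.

```python
import math
from collections import Counter

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"
BACKGROUND = 1.0 / len(ALPHABET)  # assumption: uniform background frequencies
PSEUDOCOUNT = 1.0                 # assumption: add-one smoothing


def build_pssm(alignment):
    """Return one {residue: log-odds score} dictionary per alignment column."""
    pssm = []
    for column in zip(*alignment):
        counts = Counter(res for res in column if res in ALPHABET)
        total = sum(counts.values()) + PSEUDOCOUNT * len(ALPHABET)
        pssm.append({
            res: math.log2(((counts[res] + PSEUDOCOUNT) / total) / BACKGROUND)
            for res in ALPHABET
        })
    return pssm


def best_window(pssm, seq):
    """Slide the profile along seq (no gaps); return (best score, 1-based start).

    Assumes seq is at least as long as the profile and contains only
    the 20 standard residues.
    """
    width = len(pssm)
    scored = [
        (sum(col[res] for col, res in zip(pssm, seq[i:i + width])), i + 1)
        for i in range(len(seq) - width + 1)
    ]
    return max(scored)


# Made-up toy alignment and target (assumption: not taken from asp.msf).
toy_alignment = ["VDTGSS", "IDTGAS", "FDSGTT", "LDTGSD"]
pssm = build_pssm(toy_alignment)
score, start = best_window(pssm, "MKLVVDTGSSNLWV")
print(round(score, 2), start)
```

Columns where one residue dominates the alignment contribute the largest positive scores, so the best-scoring window is the one that lines up with the conserved positions. A database search repeats this kind of scoring against every sequence and reports the hits above a cutoff, such as the value entered in step 5.15.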
