

Profile HMMs for protein searches and other applications

I. Introduction

Hidden Markov models (HMMs) are probabilistic models that are well suited to detecting homology among evolutionarily related sequences. To align a set of sequences, an HMM considers all possible combinations of matches, mismatches and gaps.

There are three possible “states” for each amino acid position in a sequence alignment: a “main” state, where an amino acid can match or mismatch; an “insert” state, where an amino acid is added to one of the sequences; and a “delete” state, where an amino acid is deleted from one of the sequences. Probabilities are assigned to each of these states based on how often each event occurs in the sequence alignment. In a diagram of the model, each arrow represents a transition from one state to another and is also associated with a probability. The greater the number and diversity of sequences included in the alignment, the better the model will be at identifying related (especially distantly related) sequences. Including a large number of divergent sequences in the alignment is referred to as “training” the model. HMMs that represent a sequence profile of a group of related proteins are called profile HMMs.
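
As a rough illustration of how state probabilities can be derived from event counts, the sketch below estimates main-state amino acid probabilities for one alignment column and the fraction of sequences that would need the delete state there. The toy alignment, the function names and the flat pseudocount are all assumptions made for this example; HMMER itself uses more sophisticated priors and sequence weighting.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def column_emissions(alignment, column, pseudocount=1.0):
    """Estimate amino acid probabilities for one alignment column.

    Gap characters ('-' or '.') are skipped. The flat pseudocount is an
    assumption of this sketch; it keeps unseen residues above zero.
    """
    counts = Counter(
        seq[column].upper() for seq in alignment if seq[column] not in "-."
    )
    total = sum(counts[aa] for aa in AMINO_ACIDS) + pseudocount * len(AMINO_ACIDS)
    return {aa: (counts[aa] + pseudocount) / total for aa in AMINO_ACIDS}

def delete_fraction(alignment, column):
    """Fraction of sequences with a gap in this column, a rough stand-in
    for how often the delete state would be used at this position."""
    gaps = sum(1 for seq in alignment if seq[column] in "-.")
    return gaps / len(alignment)

# Hypothetical toy alignment: four sequences, five columns.
toy_alignment = ["ACDEG", "ACDEG", "AC-EG", "SCDQG"]

first_column = column_emissions(toy_alignment, 0)
third_column_gaps = delete_fraction(toy_alignment, 2)
```

In this toy case the first column is dominated by A, with S less likely and every other residue kept at a small nonzero probability, while one of the four sequences has a gap in the third column.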


An adequately “trained” profile HMM has many uses. It can align a group of related sequences more accurately than progressive alignment methods, search databases for distantly related sequences more effectively than local alignment algorithms like BLAST, and identify subfamily-specific signatures within large protein superfamilies (Truong and Ikura 2002). Several software packages implement profile HMMs (listed and briefly discussed in Eddy 1998), and two of them, SAM and HMMER, are available on the ISU EGG server. The performance of these two packages is compared in a recent paper (Madera and Gough 2002). This document summarizes how to use HMMER (Eddy 1998) to create and use profile HMMs.

II. Preparing to use HMMER

To build a profile HMM with HMMER you must first select a group of homologous sequences using an appropriate method. These methods might include BLAST, PSI-BLAST or PHI-BLAST searches against the available protein databases with a sequence of interest as the query. With BLAST, it may be difficult to determine an E-value cutoff that identifies the orthologs or paralogs of your query sequence, especially if you are looking for remote homologies. PSI-BLAST improves upon the BLAST algorithm by iteratively building a position-specific scoring matrix, which allows the identification of more distantly related sequences. PHI-BLAST accepts a motif as input and limits the search to protein sequences containing this motif.

You may also use the Pfam database (Bateman, Birney et al. 2000), which is a large curated collection (7316 families as of January 2004) of multiple sequence alignments and HMMs that identify protein families and domains. One of the HMMER programs, hmmpfam, is specifically built to search this database for existing HMMs that show significant similarity to one or more query sequences.

Once you have chosen your group of homologous sequences you must align them with a multiple sequence alignment program. The most commonly used program for multiple alignment is ClustalW (Thompson, Higgins et al. 1994), which is available on the ISU EGG server and also from the EMBL-EBI website at http://www.ebi.ac.uk/clustalw/. The alignment must be accurate, because the reliability of the HMM sequence profile depends directly on the quality of the alignment. Once you have produced a correct alignment of your sequences, the alignment file is used to build a profile HMM.

III. Using HMMER

HMMER is a package of programs developed by Sean Eddy that is freely available from Washington University School of Medicine at http://hmmer.wustl.edu. The package consists of eight programs designed for building and/or using profile HMMs. Each is described below in the order in which you might use them to first build and then use a profile HMM.

  1. hmmbuild: hmmbuild reads a multiple sequence alignment file and builds a new profile HMM from it. The program automatically detects and reads several alignment file formats, including ClustalW, GCG MSF, SELEX, Stockholm and aligned FASTA. The new profile HMM is saved in hmmfile. By default, the model is configured for multiple global alignments with respect to the model and local alignments with respect to the sequence. The -f option configures the model for multiple local alignments with respect to the model and local with respect to the sequence; it will find multiple domains per sequence, where each domain can be a local (fragmentary) alignment. The -g option configures the model to find a single global alignment to a target sequence. The -s option configures the model to find a single local alignment per target sequence, which is analogous to the standard Smith-Waterman algorithm. The last two options are rarely used. Other parameters are available but should be changed only by expert users.

    The output of hmmbuild is a matrix of scores that represent the probability of a match, mismatch, insertion or deletion at each position in the sequence alignment. The capital letters along the top edge of the matrix represent the possible amino acids. Negative scores indicate a low probability and positive scores a high probability.



  2. hmmcalibrate: hmmcalibrate reads the profile HMM (hmmfile), scores a large number of synthesized random sequences with it, and then fits an extreme value distribution (EVD) to a histogram of those scores. The profile HMM is then resaved with the EVD parameters included. The only options are for expert users, such as changing the length or number of synthetic sequences.

    The output of hmmcalibrate is a revised matrix of scores that reflects the extreme value distribution parameters. Additionally, the histogram of scores with the EVD is appended to the bottom of the revised matrix. The histogram can also be saved as a separate file.



  3. hmmsearch: hmmsearch reads an HMM from hmmfile and then searches seqfile for significantly similar sequence matches. The seqfile can be a text file containing a custom set of sequences to search against, or it can be a database such as an existing BLAST database. hmmsearch first looks for seqfile in the current working directory, then in the directory named by the environment variable BLASTDB. There are options to change the database searched, the name of the output file, the number of alignments reported, and the E-value and bit score cutoffs.

    Output from hmmsearch is a list of sequences with significant scores and E-values, followed by the alignments. This output looks very similar to the output from a BLAST search.



  4. hmmemit: hmmemit reads an HMM file and generates either one majority-rule consensus sequence or multiple sequences that are consistent with a sequence family consensus. The program cannot emit both a single majority-rule consensus sequence and a group of consensus sequences in the same run.

    The output for a single majority-rule consensus uses capital letters for amino acids with high support and lowercase letters for amino acids that occur less frequently.


    The output for multiple consensus sequences can be in FASTA format or in SELEX (aligned) format. The aligned format allows easy comparison of the emitted sequences. You can also change the number of sequences emitted.


  5. hmmalign: hmmalign reads an HMM file (hmmfile), aligns a set of sequences (seqfile) to the profile HMM, and outputs a multiple sequence alignment. The seqfile may be in any unaligned or aligned file format accepted by HMMER. If it is in a multiple alignment format (e.g. Stockholm, MSF, SELEX, ClustalW), the existing alignment is ignored; that is, the sequences are read as if they were unaligned and hmmalign realigns them to the model. The default output is in SELEX format but can be changed to GCG MSF format. You can also choose to include only the symbols aligned to match states in the alignment.
  6. hmmpfam: hmmpfam reads a sequence file and compares each sequence in it, one at a time, against all the HMMs in an HMM database, looking for significantly similar sequence matches. You can choose the Pfam database and the number of sequences reported, name the output file, and choose the E-value and bit score cutoffs.
  7. hmmfetch: hmmfetch is a small utility that retrieves an HMM from an HMM database, such as Pfam, and prints it in standard HMMER 2 ASCII format. There are options to enter the name of the HMM, choose the database, and name the output file.
  8. hmmconvert: hmmconvert is a small utility that converts an HMM file from one format to another. By default, the new format is HMMER 2 ASCII. You can also append the new HMM file to an existing HMM file.
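
The first three programs above are typically run in sequence: build, calibrate, then search. The sketch below only assembles the corresponding command lines as argument lists and prints them; nothing is executed. The helper name and all file names are hypothetical, and the argument order follows the HMMER 2 conventions described above (hmmfile before the alignment or sequence file), so check it against the documentation for your installed version.

```python
import shlex

def hmmer_pipeline(alignment_file, hmm_file, seq_file, evalue_cutoff=None):
    """Return the command lines (as argument lists) for a basic
    build -> calibrate -> search run. Nothing is executed here."""
    search = ["hmmsearch"]
    if evalue_cutoff is not None:
        # -E sets the E-value cutoff for reported hits.
        search += ["-E", str(evalue_cutoff)]
    search += [hmm_file, seq_file]
    return [
        ["hmmbuild", hmm_file, alignment_file],  # alignment -> profile HMM
        ["hmmcalibrate", hmm_file],              # adds EVD parameters in place
        search,                                  # profile HMM vs. sequences
    ]

# Hypothetical file names, chosen only for illustration.
commands = hmmer_pipeline(
    "family.aln", "family.hmm", "proteins.fasta", evalue_cutoff=0.01
)
for argv in commands:
    print(shlex.join(argv))
```

To actually run the steps on a machine with HMMER 2 installed, each argument list could be passed to `subprocess.run(argv, check=True)` in order.
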

IV. Summary

Using hidden Markov models to create protein profiles (profile HMMs) is an underutilized but excellent way to align a group of related sequences, determine gene family signatures, and search databases for distantly related homologous sequences. The profile HMM is a matrix of values that describe the probability of finding a specific amino acid, insertion or deletion at every position in a given sequence, as compared to the model. Searching a database or aligning a set of sequences with a profile HMM is akin to using a position-specific scoring matrix to produce an alignment. Because its scores are specific to each position, it is better suited to finding homologies or aligning sequences than methods that apply a single general scoring matrix such as BLOSUM or PAM. In addition, HMM databases such as Pfam make large collections of profile HMMs for many different protein families readily available for download and use by researchers.
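
To make the comparison with a position-specific scoring matrix concrete, the sketch below scores a short sequence against a three-position profile using log-odds scores, so that residues favored at a position score positive and disfavored ones score negative. It is a deliberately simplified illustration: the profile values, the uniform background and the function names are assumptions, and insert and delete states are left out entirely, so this is not how HMMER computes its scores.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
BACKGROUND = 1.0 / len(AMINO_ACIDS)  # uniform background (an assumption)

# Hypothetical per-position residue probabilities for a three-position
# profile; residues not listed share the leftover probability equally.
PROFILE = [
    {"A": 0.70, "S": 0.20},
    {"C": 0.90},
    {"D": 0.50, "E": 0.40},
]

def position_probability(position, residue):
    """Probability of a residue at one profile position."""
    listed = PROFILE[position]
    if residue in listed:
        return listed[residue]
    leftover = 1.0 - sum(listed.values())
    return leftover / (len(AMINO_ACIDS) - len(listed))

def log_odds_score(sequence):
    """Sum of per-position log2(profile probability / background)."""
    return sum(
        math.log2(position_probability(i, aa) / BACKGROUND)
        for i, aa in enumerate(sequence)
    )

family_like = log_odds_score("ACD")  # matches the favored residues
unrelated = log_odds_score("WWW")    # residues the profile disfavors
```

Here the sequence that follows the profile receives a positive total score and the unrelated one a negative total. The same residue can score differently at each position, which is what a single general matrix such as BLOSUM or PAM cannot express.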

V. References

  1. Bateman, A., E. Birney, et al. (2000). "The Pfam protein families database." Nucleic Acids Research 28: 263-266.
  2. Eddy, S. R. (1998). "Profile hidden Markov models." Bioinformatics 14: 755-763.
  3. Madera, M. and J. Gough (2002). "A comparison of profile hidden Markov model procedures for remote homology detection." Nucleic Acids Research 30 (19): 4321-4328.
  4. Thompson, J. D., D. G. Higgins, et al. (1994). "CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice." Nucleic Acids Research 22 (22): 4673-4680.
  5. Truong, K. and M. Ikura (2002). "Identification and characterization of subfamily-specific signatures in a large protein superfamily by a hidden Markov model approach." BMC Bioinformatics 3 (1).