A User’s Guide to predicting protein secondary structure using iNquiry tools on EGG
INTRODUCTION
Understanding the secondary structure of proteins is the foundation for elucidating tertiary and quaternary structure and, ultimately, protein function. While laboratory methods (such as X-ray diffraction) are the cornerstone of protein structure determination, they are not accessible to everyone and are not easy to conduct. With the abundant generation of protein sequence data during the past 10 years, there is an increased need to predict secondary structure from amino acid residue data. Predicting protein secondary structure from amino acid sequence is challenging. However, with more empirical protein data, elegant prediction algorithms, and greater computing power, fairly accurate predictions (76% accurate) can be generated (Rost 2001).
Protein secondary structure prediction programs can be applied to a variety of uses. Newly discovered protein sequences can be easily examined to assess basic structural features that may elucidate the functional role of the protein. Structural changes of proteins from distantly related taxa can also be assessed, providing information on how proteins have changed through evolutionary time and helping to identify regions under selection.
Secondary structures of proteins are analyzed by using seven iNquiry tools available on EGG. Two of these tools (PEPNET and PEPWHEEL) strictly display amino acid residues in a graphical format, allowing the user to easily see residue patterns. Patterns such as amphipathicity may prompt the user to further investigate regions of the protein. The other five programs make predictions about secondary structure (alpha helix, beta sheet, and bends or turns). These programs are: HELIXTURNHELIX to identify helix-turn-helix structures, PEPCOIL to predict coiled coils, HMOMENT to measure hydrophobic moments, which help distinguish alpha helix from beta sheet structures, TMAP to predict transmembrane regions of proteins, and GARNIER to predict overall secondary structure (helices, sheets, turns, and coils).
AMINO ACID RESIDUE DISPLAY PROGRAMS
Program information: PEPNET and PEPWHEEL
The EMBOSS programs PEPNET and PEPWHEEL should be used first when trying to assess protein secondary structure. These display programs allow for easy visualization of the amino acid composition of the protein being studied and are especially useful for spotting patterns. For example, both programs allow the user to see patterns of amphipathicity (both polar and apolar), aliphaticity (side chains consist only of hydrogen or carbon), and charged residues. Through the use of these programs, areas of interest may be identified.
How to use PEPNET and PEPWHEEL
Both programs display only one protein at a time. Input your protein into both programs by pasting the amino acid sequence into the data window, or by selecting the name of a file that contains your protein in FASTA or plain format. In the output section of both programs, you have the option of changing the graphics displayed around the residues. Be aware that if you check the “amphipathic residue marking” box, only the amphipathic residues will have a graphic (square) and all other residues will be unmarked. Results are output as a graphic, and a variety of display options exist to view the results. PEPWHEEL has an additional section in which you can manipulate the way the residues are displayed on the wheel.
Results: PEPNET and PEPWHEEL
PEPNET displays protein residues as a helical net in a repeating 3, 4, 3, 4 pattern (approximating the positions of the amino acids around an alpha helix). PEPWHEEL displays amino acids as a helical wheel, as if viewed down the axis of the helix (Figure 1).
Figure 1. PEPWHEEL display of the amino acid residues of cytochrome oxidase II for Balaenoptera musculus. Aliphatic residues are indicated by the squares, positively charged residues by octagons, and residues D,E,N,Q,S,T by diamonds.
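The geometry behind a helical wheel can be sketched in a few lines. This is a minimal illustration and not the PEPWHEEL source code: it assumes an ideal alpha helix in which each residue is rotated 100 degrees from the previous one (3.6 residues per turn), and the function name and the example peptide are hypothetical.

```python
import math

def wheel_positions(sequence, step_degrees=100.0):
    """Return (residue, angle, x, y) for each residue placed on a unit circle.

    Assumes an ideal alpha helix: each residue is rotated by step_degrees
    (100 degrees by default) relative to the previous residue.
    """
    positions = []
    for index, residue in enumerate(sequence):
        angle = (index * step_degrees) % 360.0
        radians = math.radians(angle)
        positions.append((residue, angle, math.cos(radians), math.sin(radians)))
    return positions

# Hypothetical peptide, used only to show the layout.
for residue, angle, x, y in wheel_positions("MAYPFQLGLQDA"):
    print(f"{residue} {angle:6.1f} {x:6.2f} {y:6.2f}")
```

With this spacing the 19th residue lands back on the starting angle (18 steps of 100 degrees make five full turns), which is why residues that are three or four positions apart in the sequence appear on the same face of the wheel.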
PROTEIN SECONDARY STRUCTURE PREDICTION PROGRAMS
Program information: GARNIER
GARNIER predicts the secondary structure of proteins using the GOR I algorithm (Garnier et al. 1978). This algorithm is based on known X-ray crystallography structures and yields four-state predictions: alpha helix, extended chain (beta sheet), reverse turn, and coil. The probability that a residue belongs to a structure class depends on the amino acid itself and on its neighboring amino acids. This method, however, is only about 65% accurate (Garnier et al. 1978) and should not be used exclusively in characterizing protein secondary structure.
How to use GARNIER
GARNIER can read one or more sequences at a time. Individual protein sequences can be pasted into the data window, or a file of protein sequences in FASTA format can be used.
Results: GARNIER
The output file is a standard EMBOSS report file. The file contains the amino acid residues and indicates whether each residue is part of a helix, sheet, turn, or coil. Again, these results are 65% accurate at best and should not be used alone. GARNIER is a solid second step (after PEPNET and PEPWHEEL) in identifying areas of potential interest. For example, regions predicted as helices may merit a closer look with PEPCOIL if you are interested in coiled coils (see below); note that the coil state reported by GARNIER denotes irregular coil, not a coiled coil.
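A per-residue prediction of this kind is often easier to read once it is collapsed into segments. The sketch below assumes the four states have been transcribed into a single string of one-letter codes (H for helix, E for sheet, T for turn, C for coil); the function name and the example string are hypothetical, and the code does not parse the EMBOSS report file itself.

```python
from itertools import groupby

def state_segments(states):
    """Collapse a per-residue state string into (state, start, end) runs.

    Positions are 1-based and inclusive, matching residue numbering.
    """
    segments = []
    position = 1
    for state, run in groupby(states):
        length = len(list(run))
        segments.append((state, position, position + length - 1))
        position += length
    return segments

# Hypothetical prediction string, for illustration only.
example_states = "CCHHHHHHTTEEEEC"
for state, start, end in state_segments(example_states):
    print(state, start, end)
```

Listing segments this way makes it simple to pick out, for example, long helical stretches to examine further with the other programs.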
Program information: HELIXTURNHELIX
HELIXTURNHELIX detects the alpha helix/beta turn/alpha helix motif (also called helix-turn-helix). This motif enables the proteins that contain it to bind DNA in a very specific manner. First discovered in λ Cro proteins (Anderson et al. 1981) and E. coli CAP proteins (McKay & Steitz 1981), the motif has since been found in a large class of proteins (Sauer et al. 1982). This motif is important in DNA-binding regulatory proteins.
HELIXTURNHELIX uses the method of Dodd and Egan (1987) to detect helix-turn-helix motifs. A reference set of 91 known helix-turn-helix sequences is used to create a scoring matrix (Dodd & Egan 1990). The scoring matrix is applied to all segments of the protein of interest to find areas of high scores. Any region with a score more than 2.5 standard deviations above 238.71 (mean value of high scoring non-HTH proteins) denotes a helix-turn-helix motif. A cautionary note: other DNA-binding motifs (that are not helix-turn-helix) exist and will not be identified by this method.
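The reporting threshold described above amounts to a simple conversion of a raw matrix score into standard deviation units. The sketch below is illustrative only: the mean (238.71) and the 2.5 SD cutoff come from the text, while the standard deviation of 293.61 is an assumption (it is believed to be the program default but is not stated in this guide), and the function names are hypothetical.

```python
MEAN_SCORE = 238.71  # mean value given in the text
SD_SCORE = 293.61    # ASSUMED standard deviation; check the program's default
MIN_SD = 2.5         # minimum number of SDs above the mean, from the text

def sd_units(raw_score, mean=MEAN_SCORE, sd=SD_SCORE):
    """Express a raw weight-matrix score as standard deviations above the mean."""
    return (raw_score - mean) / sd

def is_helix_turn_helix(raw_score, min_sd=MIN_SD):
    """Return True if the score lies more than min_sd SDs above the mean."""
    return sd_units(raw_score) > min_sd

for score in (900.0, 1000.0, 1500.0):
    print(score, round(sd_units(score), 2), is_helix_turn_helix(score))
```

Under these assumed values a raw score must exceed roughly 973 (238.71 + 2.5 × 293.61) before a segment is reported.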
How to use HELIXTURNHELIX
HELIXTURNHELIX can read one or more sequences at a time. Individual protein sequences can be pasted into the data window, or a file of protein sequences in FASTA format can be used. Under the “advanced section” of the program there is an option to change the default values of Mean value, Standard deviation value, and Minimum SD. However, these default values have been determined to work well in detecting true helix-turn-helix motifs (Dodd & Egan 1990). An option is available to use the 1987 weight data, with motif lengths of 20 residues. The default (box not checked) is Ehth.dat, which uses motifs of 22 residues. Weight matrices using the 22-position motif are slightly better at finding helix-turn-helix motifs (Dodd & Egan 1990).
Results: HELIXTURNHELIX
The output is a standard EMBOSS report file, but you have the option to output the results in other styles. A protein region with a score greater than 2.5 standard deviations (SD) above the mean value 238.71 is reported as a helix-turn-helix. The output reports helix-turn-helix positions, sequences, and scores.
Program information: PEPCOIL
Coiled coils are two or three alpha helices in parallel that cross at an angle of approximately 20° and are strongly amphipathic, with hydrophobic and hydrophilic residues repeating every seven residues. Coiled coils have a mechanical role in which they stabilize alpha helices in proteins, and they are typically found in structural fibrous proteins both inside and outside of cells (keratins, myosin, epidermin).
The occurrence of amino acids at each position of the coiled coil was determined and compared against non-coiled coil structures (Lupas et al. 1991), resulting in a relative frequency of amino acids in coiled coils. PEPCOIL uses this relative frequency to calculate the probability of a coiled coil using a gliding window of 28 residues. A window length of 28 residues is used because it is the shortest stable coiled coil structure (4-5 heptads long).
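The gliding-window idea can be sketched as follows. This is a simplified illustration, not the PEPCOIL implementation: the weights are placeholders that merely favor hydrophobic residues at two positions of the heptad (they are NOT the relative frequencies of Lupas et al. 1991), combining the per-residue values with a geometric mean is an assumption, and the final conversion of a score into a probability is omitted. All names are hypothetical.

```python
import math

HYDROPHOBIC = set("AILMV")

def toy_weight(residue, heptad_index):
    """Placeholder weight (NOT the published table).

    Heptad positions 0 and 3 (a and d) reward hydrophobic residues and
    penalize others; the remaining positions are neutral.
    """
    if heptad_index in (0, 3):
        return 2.0 if residue in HYDROPHOBIC else 0.5
    return 1.0

def window_score(window, frame):
    """Geometric mean of the weights for one window in one heptad frame."""
    log_sum = sum(math.log(toy_weight(residue, (i + frame) % 7))
                  for i, residue in enumerate(window))
    return math.exp(log_sum / len(window))

def best_window_scores(sequence, window=28):
    """For each window start (1-based), return (start, best_frame, best_score)."""
    results = []
    for start in range(len(sequence) - window + 1):
        segment = sequence[start:start + window]
        frame, score = max(((f, window_score(segment, f)) for f in range(7)),
                           key=lambda pair: pair[1])
        results.append((start + 1, frame, score))
    return results

# Hypothetical 28-residue sequence with a perfect heptad repeat.
print(best_window_scores("LKKLEEK" * 4))
```

Trying all seven frames for every window is what allows a method of this kind to notice a shift of the heptad register along the sequence.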
How to use PEPCOIL
PEPCOIL can read one or more sequences at a time. Individual protein sequences can be pasted into the data window, or a file of protein sequences in FASTA format can be used. “Window size” is set to 28 (recommended but can be changed). In the output section you can have non-coiled coil regions reported as well as coil frame shifts shown.
Results: PEPCOIL
The output is a SwissProt annotation. The file reports areas where coiled coils are predicted, the number of residues involved, a maximum score, and a probability value for the coiled coil structure. If you indicated in the output section that non-coiled coil regions should be reported, those results are also included.
Program information: HMOMENT
Segments of proteins can be amphiphilic (one side is more polar than the other), which affects the folding of the protein (Ptitsyn & Rashin 1975). HMOMENT calculates the hydrophobic moment, the hydrophobicity (amphiphilicity) of a peptide measured for a specified angle of rotation. Proteins of known tertiary structure containing alpha helices have strong periodicity every 3.6 residues, with an angle of rotation of 100°. Proteins containing beta sheets have strong periodicity every 2.3 residues, with an angle of rotation of 160°. Thus, the hydrophobic moment of a protein can be estimated from its primary structure when the period is known (Eisenberg et al. 1984).
HMOMENT uses the method of Eisenberg et al. (1984) to report the hydrophobic moment of a protein. It uses a moving window with a default angle of rotation of 100° for alpha helices and an angle of rotation of 160° for beta structures. HMOMENT reports the hydrophobic moment (uH) at a specified angle across the residues of a protein. The magnitude of uH is affected by several factors. Helices and beta structures on the surface of proteins are amphiphilic and have large hydrophobic moments, while interior portions will have smaller values (Eisenberg et al. 1982). Longer segments of secondary structure will also have smaller hydrophobic moment values because they are unlikely to have uniform secondary structure (Eisenberg et al. 1984).
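The calculation can be sketched as a vector sum: each residue contributes its hydrophobicity along a direction that advances by the chosen angle per residue, and uH is the length of the resulting vector. The code below is a minimal sketch under stated assumptions, not the HMOMENT source: the two-valued hydrophobicity scale is a placeholder (a published scale would be used in practice), no normalization by window length is applied (the program may scale its values differently), and all names are hypothetical.

```python
import math

# Placeholder scale: +1 for a few hydrophobic residues, -1 for the rest.
# This is an ASSUMPTION for illustration, not a published hydrophobicity scale.
TOY_SCALE = {res: (1.0 if res in "AILMFVWC" else -1.0)
             for res in "ACDEFGHIKLMNPQRSTVWY"}

def hydrophobic_moment(window, scale=TOY_SCALE, angle_degrees=100.0):
    """uH = sqrt((sum H_n sin(n*d))^2 + (sum H_n cos(n*d))^2) for one window."""
    delta = math.radians(angle_degrees)
    sin_sum = sum(scale[res] * math.sin(n * delta) for n, res in enumerate(window))
    cos_sum = sum(scale[res] * math.cos(n * delta) for n, res in enumerate(window))
    return math.hypot(sin_sum, cos_sum)

def moment_profile(sequence, scale=TOY_SCALE, window=10, angle_degrees=100.0):
    """Hydrophobic moment for every window position along the sequence."""
    return [hydrophobic_moment(sequence[i:i + window], scale, angle_degrees)
            for i in range(len(sequence) - window + 1)]

# Hypothetical peptide; compare the helix (100) and beta (160) angles.
peptide = "LKELLKKLLEELLKKA"
print([round(v, 2) for v in moment_profile(peptide, angle_degrees=100.0)])
print([round(v, 2) for v in moment_profile(peptide, angle_degrees=160.0)])
```

A segment whose hydrophobic residues recur with the period implied by the angle gives contributions that add up, whereas a uniformly hydrophobic or uniformly polar segment largely cancels out.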
How to use HMOMENT
HMOMENT can read one or more sequences at a time. Individual protein sequences can be pasted into the data window, or a file of protein sequences in FASTA format can be used. Under the advanced section the window size can be adjusted (default is 10) and the angles of rotation can be changed for both the alpha helix and beta sheet. Changing these values is only necessary if a different periodicity is being examined. Under the output section there is the option to have results for both the alpha helix and beta sheet structure. There are two outfiles: a graph of the hydrophobic moment, and an outfile with hydrophobic moment values.
Results: HMOMENT
The graphical plot of the hydrophobic moment (uH) is shown for the specified alpha helix and beta sheet angle (Figure 2). The magnitude of the peaks shows the strength of the hydrophobic moment across the protein for the angles of rotation (alpha helix or beta sheet).
Figure 2. Graphical output of HMOMENT for alpha helices (100 degrees) and beta structures (160 degrees) for the cytochrome oxidase II gene in Balaenoptera musculus. The strength of the hydrophobic moment is indicated by the magnitude of the peaks.
Program information: TMAP
Receptors for neurotransmitters and hormones, respiratory chain proteins, transport proteins, and ion channels are examples of transmembrane proteins. Their membrane-spanning segments have a signature of hydrophobicity, with positively charged residues on the inner (cytoplasmic) side of the membrane. TMAP predicts transmembrane portions of proteins using the algorithm of Persson & Argos (1994). The prediction algorithm uses information from 28 families of membrane proteins to obtain two sets of propensity values. One set of propensity values is used for the middle (hydrophobic) portion, and one set is used for the terminal residues of the transmembrane region.
How to use TMAP
TMAP can read one or more protein sequences at a time. Individual protein sequences can be pasted into the data window, or a file of protein sequences in FASTA format can be used. In the output section you can choose the format in which the graph is displayed.
Results: TMAP
The graphical output displays the mean propensity values (mean propensity values of the middle and terminal residues of the transmembrane region) as a continuous line against the amino acid residue (Figure 3). The propensity values of the terminal residues are indicated by the dashed line. Sections of the protein with high propensity values are predicted to be transmembrane, and indicated by the black bar above the peaks. A text file is also produced which summarizes the transmembrane areas.
Figure 3. Graphical output of TMAP with amino acid residue numbers on the x axis and propensity values on the y axis for cytochrome oxidase II of Balaenoptera musculus. Areas of the protein with high propensity values are predicted to be transmembrane and are indicated by the black horizontal lines.
CONCLUDING REMARKS
The programs listed above predict basic secondary structures of proteins. These results can help the researcher characterize newly discovered proteins, thus providing insight into possible functions. In addition, by comparing protein structures from highly divergent taxa, areas of proteins that are under selection can be recognized. Because prediction programs rely heavily on the empirical data set on which the algorithms were built, it is prudent to use several types of programs to characterize protein secondary structure.
REFERENCES
- Anderson WF, Ohlendorf DH, Takeda Y, Matthews BW (1981) Nature, 290:754-758.
- Dodd IB, Egan JB (1987) Journal of Molecular Biology, 194:557-564.
- Dodd IB, Egan JB (1990) Nucleic Acids Research, 18(17):5019-5026.
- Eisenberg D, Weiss RM, Terwilliger TC, Wilcox W (1982) Faraday Symposia of the Chemical Society, 17:109-120.
- Eisenberg D, Weiss RM, Terwilliger TC (1984) Proceedings of the National Academy of Sciences, 81:140-144.
- Garnier J, Osguthorpe DJ, Robson B (1978) Journal of Molecular Biology, 120:97-120.
- Lupas A, Van Dyke M, Stock J (1991) Science, 252:1162-1164.
- McKay DB, Steitz TA (1981) Nature, 290:744-749.
- Persson B, Argos P (1994) Journal of Molecular Biology, 237:182-192.
- Ptitsyn OB, Rashin AA (1975) Biophysical Chemistry, 3:1-20.
- Rost B (2001) Journal of Structural Biology, 134:204-218.
- Sauer RT, Yocum RR, Doolittle RF, Lewis M, Pabo CO (1982) Nature, 298:447-451.