Course project, Bioinformatics for Biologists (BIOS 599), Spring 2004

A user's guide to iNquiry tools for protein sequence analysis on EGG

Introduction

Protein sequence analysis predicts gene function from the protein sequence translated from the nucleotide sequence of a gene. Compared with the nucleotide sequence, the protein sequence has several advantages for studying gene function, protein localization and phylogenetics. Variability in the nucleotide sequence, such as codon degeneracy, the wobble position, and codon bias and preference, makes it difficult to detect sequence similarity among genes from different species that may have related functions. Protein sequence analysis offers an alternative way to study gene functions that are determined by protein structure, taking advantage of the universal amino acid set and the basic properties of each amino acid.

The iNquiry tools for protein sequence analysis on EGG can be divided into four categories: composition, motif, 2D structure and 3D structure tools. This summary covers the composition and motif tools. The composition tools analyze the basic properties of a protein sequence, and the motif tools search a protein sequence for specific patterns. Together they can be used to predict the functions of a potential gene product.

Composition tools

Classification and functions

The iNquiry tools for protein composition analysis on EGG can be divided into four categories according to their functions (table 1). The first three categories are used to analyze a single protein sequence. The fourth is used for protein analysis from mass spectrometry, which involves a series of protein sequences.

First, for an uncharacterized protein sequence, the protein statistics tools can be used to calculate the basic amino acid composition. The checktrans program identifies potential protein sequences between start and stop codons in a raw sequence. The pepstats and pepinfo programs report the molecular weight, the number and percentage of each type of residue grouped by physico-chemical property, the isoelectric point, and related statistics.

Second, the electrical properties of a protein, such as the isoelectric point, can be calculated by the iep program. The isoelectric point is the pH of an aqueous protein solution at which the numbers of positive and negative charges on the protein are equal. It is useful for protein isolation, purification and crystallization (Kantardjieff and Rupp 2003).

Third, the protein hydropathy tools can be used to predict potential functions of an uncharacterized protein by recognizing the hydrophobic and hydrophilic regions of its sequence. The transmembrane regions of a protein have a thermodynamic preference for the hydrophobic environment inside the membrane lipid bilayer. The octanol program calculates two free energy differences according to White and Wimley (1999), which indicate the potential transmembrane regions of a protein.

Table 1. Protein composition tools on EGG

Protein Composition       iNquiry Tools  Functions
Basic protein statistics  pepstats       Protein statistics
                          pepinfo        Plots simple amino acid properties in parallel
                          checktrans     Reports STOP codons and ORF statistics of a protein
Electrical properties     charge         Protein charge plot
                          iep            Calculates the isoelectric point of a protein
Protein hydropathy        octanol        Displays protein hydropathy
                          pepwindow      Displays protein hydropathy
Mass protein analysis     mwfilter       Removes unwanted (noisy) data from mass spectrometry output in proteomics
                          emowse         Protein identification by mass spectrometry
                          pepwindowall   Displays protein hydropathy of a set of sequences

Usages and examples

The following examples demonstrate the usage of these analysis tools. The sequence of the Drosophila D1 protein, an uncharacterized non-histone DNA binding protein, is analyzed with the iNquiry tools.

pepstats

The pepstats program is used to calculate the basic amino acid composition of the D1 protein (figure 1). The output shows the molecular weight, number of residues, average residue weight, charge and isoelectric point. For each type of amino acid, the number, molar percent and DayhoffStat are displayed. DayhoffStat is the amino acid's molar percent divided by its Dayhoff statistic. The Dayhoff statistic is the amino acid's relative occurrence per 1000 aa normalised to 100 by rls@ebi.ac.uk (original work from 1993). The amino acids are also grouped by physico-chemical property, and the number and molar percentage of each class are shown. The probability of expression in E. coli inclusion bodies is calculated as a type of solubility measure. The molar extinction coefficient (A280) and the extinction coefficient at 1 mg/ml (A280) are also calculated. Extinction coefficients for proteins are generally reported for absorbance measured at a wavelength of 280 nm. The molar extinction coefficient allows the molar concentration of a protein solution to be estimated from its measured absorbance.

Figure 1. The basic amino acid composition of the D1 protein calculated by pepstats.
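
The per-residue statistics described above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not the pepstats implementation: the function names and the toy sequence are hypothetical, and the reference (Dayhoff) frequency is passed in by the caller rather than taken from the table pepstats uses.

```python
from collections import Counter


def composition(sequence):
    """Return {residue: (count, mole_percent)} for a protein sequence."""
    seq = "".join(sequence.split()).upper()  # drop whitespace and line breaks
    counts = Counter(seq)
    total = len(seq)
    return {aa: (n, 100.0 * n / total) for aa, n in sorted(counts.items())}


def dayhoff_stat(mole_percent, reference_percent):
    """Observed mole percent divided by a reference (Dayhoff) frequency.

    The reference value is supplied by the caller; no table is assumed here.
    """
    return mole_percent / reference_percent


if __name__ == "__main__":
    toy = "MKKRGRPAAA"  # hypothetical toy sequence, not the D1 protein
    for aa, (n, pct) in composition(toy).items():
        print(f"{aa}\t{n}\t{pct:.1f}")
```

A DayhoffStat above 1 then means the residue is more frequent in the query than in the reference set, and a value below 1 means it is less frequent.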

iep

In addition to pepstats, the iep program can calculate the isoelectric point of the D1 protein and display it graphically (figure 2). The pI of the D1 sequence is 6.57, indicating that the protein is nearly neutral at physiological pH.

Figure 2. The isoelectric point of the D1 protein calculated by iep.
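
To make the definition of the isoelectric point concrete, the following sketch finds the pH at which the modelled net charge of a sequence is zero. It assumes the Henderson-Hasselbalch relation for each ionizable group and uses an illustrative set of pKa values; the table iep actually uses may differ, so the numbers it prints should not be expected to match iep exactly. All names are hypothetical.

```python
# Illustrative pKa values (an assumption; iep's own table may differ).
POSITIVE_PKA = {"K": 10.8, "R": 12.5, "H": 6.5}
NEGATIVE_PKA = {"D": 3.9, "E": 4.1, "C": 8.5, "Y": 10.1}
N_TERM_PKA = 8.6
C_TERM_PKA = 3.6


def net_charge(seq, ph):
    """Modelled net charge of a protein sequence at a given pH."""
    # Terminal amino group (positive) and carboxyl group (negative).
    charge = 1.0 / (1.0 + 10 ** (ph - N_TERM_PKA))
    charge -= 1.0 / (1.0 + 10 ** (C_TERM_PKA - ph))
    for aa in seq.upper():
        if aa in POSITIVE_PKA:
            charge += 1.0 / (1.0 + 10 ** (ph - POSITIVE_PKA[aa]))
        elif aa in NEGATIVE_PKA:
            charge -= 1.0 / (1.0 + 10 ** (NEGATIVE_PKA[aa] - ph))
    return charge


def isoelectric_point(seq, tol=1e-4):
    """Bisect on pH in [0, 14]; net charge decreases as pH rises."""
    lo, hi = 0.0, 14.0
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if net_charge(seq, mid) > 0:
            lo = mid  # still positively charged: pI lies at higher pH
        else:
            hi = mid
    return (lo + hi) / 2.0


if __name__ == "__main__":
    toy = "MKKRGRPDDE"  # hypothetical toy sequence, not the D1 protein
    print(round(isoelectric_point(toy), 2))
```

Bisection works here because the modelled charge falls monotonically as the pH increases, so there is a single crossing of zero.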

octanol

To determine the potential transmembrane regions of the D1 protein, the octanol program is used to display the classic White and Wimley hydropathy plot. The octanol program calculates two free energy differences. The first is the free energy difference between solution in water and association with the interface (glycerol group) of a POPC (palmitoyloleoylphosphocholine) bilayer. The second is the free energy difference between water and octanol, which is equivalent to the environment inside a lipid bilayer. The sequences of transmembrane regions are assumed to have a thermodynamic preference for a hydrophobic environment over an aqueous one. The location of probable transmembrane regions is indicated by a sliding window of either free energy difference (White and Wimley, 1999). A window size of 19 residues, the size of a membrane-spanning alpha helix, is used to analyze the D1 protein sequence with octanol. Peaks above zero represent transmembrane regions. The plot in figure 3 indicates that the D1 protein sequence does not have any transmembrane regions.

Figure 3. The White and Wimley hydropathy plot of the D1 protein calculated by octanol.
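
The sliding-window idea can be sketched as follows. Note the substitution: this example uses the Kyte and Doolittle (1982) hydropathy scale, not the White and Wimley free energies that octanol uses, so its values and sign convention differ from the octanol plot. The function names and the default threshold of 1.6 are assumptions made for illustration.

```python
# Kyte-Doolittle hydropathy values (a swapped-in scale, not White-Wimley).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def window_averages(seq, window=19, scale=KYTE_DOOLITTLE):
    """Mean hydropathy of every full window along the sequence.

    Raises KeyError for residue codes that are missing from the scale.
    """
    values = [scale[aa] for aa in seq.upper()]
    if len(values) < window:
        return []
    total = sum(values[:window])
    averages = [total / window]
    for i in range(window, len(values)):
        total += values[i] - values[i - window]  # slide by one residue
        averages.append(total / window)
    return averages


def hydrophobic_windows(seq, window=19, threshold=1.6):
    """1-based start positions of windows whose mean exceeds the threshold."""
    return [i + 1 for i, avg in enumerate(window_averages(seq, window))
            if avg > threshold]


if __name__ == "__main__":
    toy = "K" * 5 + "L" * 19 + "K" * 5  # hypothetical toy sequence
    print(hydrophobic_windows(toy))
```

A 19-residue window matches the length of a membrane-spanning alpha helix, so a run of windows above the threshold suggests a candidate transmembrane segment.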

Motif tools

Classification and functions

Motif tools are used to find motifs and cleavage sites in a protein sequence in order to predict potential protein functions. The motif tools on EGG can be divided into two categories: the cleavage site search tools and the pattern search tools. The motif tools and their functions are shown in table 2.

First, the cleavage site search tools (digest and sigcleave) can be used to find proteolytic enzyme or reagent cleavage sites and signal cleavage sites, which are indicated by specific patterns.

Second, the pattern search tools (pscan and patmatmotifs) search a database of defined motifs that indicate the potential function of the protein. The fuzzpro program can be used to search for a user-defined motif within a protein sequence. Potential antigenic sites can be predicted with the antigenic program, according to the method of Kolaskar and Tongaonkar (1990). The oddcomp program compares a given protein sequence with a standard protein sequence to evaluate differences in amino acid composition.

Table 2. Protein motif tools on EGG

Protein Motifs         iNquiry Tools  Functions
Cleavage sites search  digest         Protein proteolytic enzyme or reagent cleavage digest
                       sigcleave      Reports protein signal cleavage sites
Pattern search         antigenic      Finds antigenic sites in protein sequences
                       fuzzpro        Protein pattern search
                       pscan          Scans proteins using PRINTS
                       preg           Regular expression search of a protein sequence
                       patmatmotifs   Search a PROSITE motif database with a protein sequence
                       patmatdb       Search a protein sequence with a motif
                       oddcomp        Finds protein sequence regions with a biased composition

Usages and examples

To demonstrate the use of the motif tools, an uncharacterized protein sequence from the Drosophila gene CG3281 is used. The digest, antigenic, fuzzpro and pscan programs are used to analyze the protein sequence.

digest

The digest program is used to find potential proteolytic enzyme or reagent cleavage sites in a protein. The output is a file containing the positions where the agent cuts and the peptides produced. Trypsin will not normally cut after a K if it is followed by another K or a P; choosing the "unfavoured" parameter shows those unfavoured cuts as well as the favoured ones. A partial digestion can be emulated with the "overlap" qualifier. In a partial digestion, all cut sites are used and one site at a time is not cut. The "allpartials" qualifier emulates a very partial digestion and shows all possible fragments from all possible combinations of cutting sites. Trypsin is selected as the required parameter for analyzing the sample protein sequence. The overlap, unfavoured and allpartials qualifiers are not selected. The resulting trypsin cleavage sites are listed in figure 4.

Figure 4. The trypsin cleavage sites of the CG3281 protein calculated by digest.
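
The cutting logic described above can be sketched as follows. This is a simplified model, not the digest implementation: it assumes trypsin cuts after K or R, and treats a cut as unfavoured when the next residue is P or when a K is followed by another K. The rules digest applies may be more detailed. The function names and the toy sequence are hypothetical.

```python
def cut_sites(seq, include_unfavoured=False):
    """1-based positions after which trypsin is modelled to cut."""
    sites = []
    for i in range(len(seq) - 1):  # a cut after the last residue is meaningless
        aa, nxt = seq[i], seq[i + 1]
        if aa not in "KR":
            continue
        # Simplified assumption: K-K and K/R followed by P are unfavoured.
        unfavoured = nxt == "P" or (aa == "K" and nxt == "K")
        if include_unfavoured or not unfavoured:
            sites.append(i + 1)
    return sites


def fragments(seq, sites, missed=0):
    """Peptides as (start, end, sequence), allowing up to `missed` uncut sites.

    missed=0 models a complete digest; missed=1 adds the fragments that span
    one uncut site, which mimics a partial digestion.
    """
    bounds = [0] + list(sites) + [len(seq)]
    peptides = []
    for i in range(len(bounds) - 1):
        for j in range(i + 1, min(i + 2 + missed, len(bounds))):
            peptides.append((bounds[i] + 1, bounds[j], seq[bounds[i]:bounds[j]]))
    return peptides


if __name__ == "__main__":
    toy = "AKRPGKKLR"  # hypothetical toy sequence, not the CG3281 protein
    for start, end, peptide in fragments(toy, cut_sites(toy)):
        print(start, end, peptide)
```

Passing include_unfavoured=True to cut_sites corresponds to showing the unfavoured cuts as well, and increasing missed moves the model toward the "all partials" case.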

antigenic

The antigenic program is used to predict antigenic sites in protein sequences. The hydrophobic residues Cys, Leu and Val are likely to be part of antigenic sites if they occur on the surface of a protein. The physicochemical properties of amino acid residues and their frequencies of occurrence in experimentally known segmental epitopes can be used to predict antigenic determinants on proteins (Kolaskar and Tongaonkar, 1990). The antigenic sites of the CG3281 protein are found with the antigenic program (figure 5).

Figure 5. The antigenic sites of the CG3281 protein calculated by antigenic.

fuzzpro

The fuzzpro program searches a given protein sequence for a specific pattern, which is a specification of a short length of sequence written as a PROSITE style pattern. A pattern can describe an exact sequence or allow various ambiguities. The motif xxgrpxx, a DNA binding motif found in the D1 protein, is used to search the uncharacterized CG3281 protein. The result is shown in figure 6.

Figure 6. The motif xxgrpxx found in the uncharacterized protein by fuzzpro.
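
A PROSITE style pattern can be understood as a compact regular expression. The sketch below translates a subset of the PROSITE syntax (x wildcards, [..] allowed sets, {..} excluded sets, and repeat counts in parentheses) into a Python regular expression and searches a sequence with it. It is not how fuzzpro is implemented, and it reports only exact, non-overlapping matches. Writing the motif xxgrpxx as x(2)-G-R-P-x(2) is an assumed rendering, and the function names and toy sequence are hypothetical.

```python
import re

# One pattern element: x, a residue, [allowed] or {excluded}, optional (n) or (n,m).
_ELEMENT = re.compile(r"(x|[A-Z]|\[[A-Z]+\]|\{[A-Z]+\})(?:\((\d+)(?:,(\d+))?\))?")


def prosite_to_regex(pattern):
    """Translate a PROSITE style pattern into a Python regular expression."""
    pattern = pattern.strip().rstrip(".")
    anchored_start = pattern.startswith("<")  # '<' pins the N-terminus
    anchored_end = pattern.endswith(">")      # '>' pins the C-terminus
    pattern = pattern.strip("<>")
    parts = []
    for element in pattern.split("-"):
        match = _ELEMENT.fullmatch(element)
        if match is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        core, low, high = match.groups()
        if core == "x":
            regex = "."                        # any residue
        elif core.startswith("{"):
            regex = "[^" + core[1:-1] + "]"    # any residue except these
        else:
            regex = core                       # single residue or [set]
        if low:
            regex += "{" + low + ("," + high if high else "") + "}"
        parts.append(regex)
    return ("^" if anchored_start else "") + "".join(parts) + ("$" if anchored_end else "")


def find_motif(seq, pattern):
    """Return (1-based start, matched text) for each non-overlapping match."""
    regex = re.compile(prosite_to_regex(pattern))
    return [(m.start() + 1, m.group()) for m in regex.finditer(seq.upper())]


if __name__ == "__main__":
    toy = "MAAGRPKKL"  # hypothetical toy sequence, not the CG3281 protein
    print(find_motif(toy, "x(2)-G-R-P-x(2)"))
```

The flanking x positions only require that two residues exist on each side of GRP, so the search effectively locates GRP cores that are not at the very ends of the sequence.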

pscan

The pscan program searches a given protein sequence for potential motifs using the PRINTS database, a database of diagnostic protein signatures, or fingerprints. Fingerprints are groups of conserved motifs or elements that together form a diagnostic signature for a particular protein family. The pscan program finds matches between a query protein sequence and the motifs or elements in the PRINTS database. The athook, alphaamylase and profilin motifs are found in the uncharacterized sample sequence (figure 7).

Figure 7. The motif search of the uncharacterized protein in PRINTS by pscan.

Conclusion

The iNquiry tools for protein composition and motif searches are used to analyze the potential properties and functions of a given protein sequence. The protein composition tools analyze its basic properties, electrical properties and hydropathy. The protein motif tools predict potential properties and functions by searching for cleavage sites and specific patterns in the sequence.

References

Attwood, T.K., Flower, D.R., Lewis, A.P., Mabey, J.E., Morgan, S.R., Scordis, P., Selley, J. and Wright, W. (1999). PRINTS prepares for the new millennium. Nucleic Acids Research 27(1): 220-225.

Kolaskar, A.S. and Tongaonkar, P.C. (1990). A semi-empirical method for prediction of antigenic determinants on protein antigens. FEBS Letters 276: 172-174.

Kyte, J. and Doolittle, R.F. (1982). A simple method for displaying the hydropathic character of a protein. J. Mol. Biol. 157(1): 105-132.

White, S.H. and Wimley, W.C. (1999). Membrane protein folding and stability: physical principles. Ann. Rev. Biophys. Biomol. Struct. 28: 319-365.

EMBOSS TOOLS http://www.hgmp.mrc.ac.uk/Software/EMBOSS/Apps/index.htm

PRINTS database http://www.bioinf.man.ac.uk/dbbrowser/PRINTS/