A user's guide to iNquiry tools for protein sequence analysis on EGG
Introduction
Protein sequence analysis predicts gene function from the protein sequence translated from the nucleotide sequence of a gene. Compared with the nucleotide sequence, the protein sequence has several advantages for studying gene function, protein localization and phylogenetics. The variability of nucleotide sequences, caused by codon degeneracy, the wobble position and codon bias, makes it difficult to detect sequence similarity among genes from different species that may have related functions. Protein sequence analysis provides an alternative way to study gene functions, which are determined by protein structure, by taking advantage of the universal amino acid set and the basic properties of each amino acid.
iNquiry tools for protein sequence analysis on EGG can be divided into four categories: composition, motif, 2D and 3D structure tools. The composition and motif tools will be discussed in this summary. The protein composition tools are used for analyzing the basic properties of a protein sequence, and the motif tools can be utilized for searching a specific pattern in a protein sequence. The functions of the potential gene product can be predicted by utilizing the composition and motif tools.
Composition tools
Classification and functions
The iNquiry tools for protein composition analysis on EGG can be divided into four categories according to their functions (table 1). The first three categories are used for analyzing a single protein sequence, and the fourth is used for analyzing mass spectrometry data, which involve a series of protein sequences.
First, for an uncharacterized protein sequence, the protein statistics tools can be used to calculate the basic amino acid composition. The checktrans program can be used to identify the potential protein sequence between start and stop codons from a raw sequence. The pepstats and pepinfo programs can be used to calculate the molecular weight, the number and percentage of each residue type, the physico-chemical classes of the residues and the isoelectric point.
Second, the electrical properties of a protein, such as the isoelectric point, can be calculated by the iep program. The isoelectric point is the pH of an aqueous protein solution at which the numbers of positive and negative charges on the protein are equal. It is useful for protein isolation, purification and crystallization (Kantardjieff and Rupp 2003).
Third, protein hydropathy tools can be used to predict the potential functions of an uncharacterized protein by recognizing the hydrophobic and hydrophilic regions of a protein sequence. The transmembrane regions of a protein sequence have a thermodynamic preference for the hydrophobic environment inside the membrane lipid bilayer. The octanol program calculates two free energy differences according to White and Wimley (1999), which indicate the potential transmembrane regions of a protein.
| Protein Composition | iNquiry Tools | Functions |
|---|---|---|
| Basic protein statistics | pepstats | Protein statistics |
| | pepinfo | Plots simple amino acid properties in parallel |
| | checktrans | Reports STOP codons and ORF statistics of a protein |
| Electrical properties | charge | Protein charge plot |
| | iep | Calculates the isoelectric point of a protein |
| Protein hydropathy | octanol | Displays protein hydropathy |
| | pepwindow | Displays protein hydropathy |
| Mass protein analysis | mwfilter | Removes unwanted (noisy) data from mass spectrometry output in proteomics |
| | emowse | Protein identification by mass spectrometry |
| | pepwindowall | Displays protein hydropathy of a set of sequences |
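The kind of report that checktrans produces can be illustrated with a short Python sketch. This is not the EMBOSS implementation: the function name `find_orfs`, the `min_length` parameter and its default are assumptions made for illustration. The sketch takes a translated sequence in which `*` marks a STOP codon and lists every stop-free stretch that is long enough.

```python
def find_orfs(translation, min_length=100):
    """Return (start, end, length) for each stop-free stretch.

    Positions are 1-based and inclusive. '*' marks a translated STOP codon.
    min_length is a hypothetical cut-off, not necessarily the tool's default.
    """
    orfs = []
    start = 0
    # A trailing sentinel '*' closes the last stretch of the sequence.
    for pos, residue in enumerate(translation + "*"):
        if residue == "*":
            length = pos - start
            if length >= min_length:
                orfs.append((start + 1, pos, length))
            start = pos + 1
    return orfs


if __name__ == "__main__":
    for start, end, length in find_orfs("MKT*AAAAAG*MM", min_length=3):
        print(start, end, length)
```

Stretches shorter than the cut-off are dropped, so the last two residues of the example sequence are not reported.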
Usages and examples
The following examples are used to demonstrate the usage of these analysis tools. The sequence of the Drosophila D1 protein, which is an uncharacterized non-histone DNA binding protein, is analyzed by using iNquiry tools.
pepstats
The pepstats program is used to calculate the basic amino acid composition of the D1 protein (figure 1). The molecular weight, number of residues, average residue weight, charge and isoelectric point are shown in the output. For each type of amino acid, the number, molar percent and DayhoffStat are displayed. DayhoffStat is the amino acid's molar percent divided by its Dayhoff statistic. The Dayhoff statistic is the amino acid's relative occurrence per 1000 aa normalised to 100 by rls@ebi.ac.uk (original work from 1993). The amino acids are also grouped by physico-chemical property, and the number and molar percentage of each class are shown. The probability of protein expression in E. coli inclusion bodies is calculated as a type of solubility measure. The molar extinction coefficient (A280) and the extinction coefficient at 1 mg/ml (A280) are also calculated. Extinction coefficients for proteins are generally reported for absorbance measured at a wavelength of 280 nm. The molar extinction coefficient allows the molar concentration of a protein solution to be estimated from its measured absorbance.
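Two of the quantities described above are simple enough to sketch in Python. The code below is an illustration only, not the pepstats implementation. The function names are hypothetical, and the concentration estimate assumes the Beer-Lambert law with a path length of 1 cm unless stated otherwise.

```python
from collections import Counter


def composition(sequence):
    """Return {residue: (count, molar_percent)} for a protein sequence."""
    sequence = sequence.upper()
    counts = Counter(sequence)
    total = len(sequence)
    return {aa: (n, 100.0 * n / total) for aa, n in sorted(counts.items())}


def molar_concentration(absorbance_280, molar_extinction, path_length_cm=1.0):
    """Beer-Lambert law: c = A / (epsilon * l), giving mol/L."""
    return absorbance_280 / (molar_extinction * path_length_cm)


if __name__ == "__main__":
    for aa, (count, percent) in composition("AAGK").items():
        print(aa, count, round(percent, 1))
    # Hypothetical reading: A280 = 0.5 with epsilon = 10000 M^-1 cm^-1.
    print(molar_concentration(0.5, 10000.0))
```

Dividing each molar percent by the corresponding Dayhoff statistic would give the DayhoffStat column. The Dayhoff table is omitted here rather than reproduced from memory.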

iep
In addition to pepstats, the iep program calculates the isoelectric point of the D1 protein and displays it graphically (figure 2). The pI of the D1 sequence is 6.57, indicating that this DNA-binding protein is approximately neutral.
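The definition of the isoelectric point as the pH of zero net charge can be turned into a small numerical sketch. The pKa values below are illustrative assumptions and may differ from the table that iep uses, so the result will not necessarily reproduce the value reported above. The function names are hypothetical. The net charge is computed with the Henderson-Hasselbalch relation and the pI is located by bisection.

```python
# Illustrative pKa values (assumed; iep may use a different table).
POSITIVE_PKA = {"K": 10.8, "R": 12.5, "H": 6.5}
NEGATIVE_PKA = {"D": 3.9, "E": 4.1, "C": 8.5, "Y": 10.1}
N_TERM_PKA = 8.6
C_TERM_PKA = 3.6


def net_charge(sequence, ph):
    """Net charge of an unmodified, linear peptide at the given pH."""
    charge = 1.0 / (1.0 + 10 ** (ph - N_TERM_PKA))   # protonated N-terminus
    charge -= 1.0 / (1.0 + 10 ** (C_TERM_PKA - ph))  # deprotonated C-terminus
    for aa in sequence.upper():
        if aa in POSITIVE_PKA:
            charge += 1.0 / (1.0 + 10 ** (ph - POSITIVE_PKA[aa]))
        elif aa in NEGATIVE_PKA:
            charge -= 1.0 / (1.0 + 10 ** (NEGATIVE_PKA[aa] - ph))
    return charge


def isoelectric_point(sequence, tolerance=1e-4):
    """Bisect on pH in [0, 14]; net charge decreases as pH rises."""
    low, high = 0.0, 14.0
    while high - low > tolerance:
        mid = (low + high) / 2.0
        if net_charge(sequence, mid) > 0:
            low = mid
        else:
            high = mid
    return (low + high) / 2.0


if __name__ == "__main__":
    print(round(isoelectric_point("MKRGRPKKDDEE"), 2))
```

Bisection works here because every term in the net charge falls monotonically with pH, so the charge crosses zero exactly once.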

octanol
To determine the potential transmembrane regions of the D1 protein, the octanol program is used to display the classic White and Wimley hydropathy plot. The octanol program calculates two free energy differences. The first is the free energy difference between solution in water and association with the interface (glycerol group) of a POPC (palmitoyloleoylphosphocholine) bilayer. The second is the free energy difference between water and octanol, which is equivalent to the environment inside a lipid bilayer. Transmembrane regions of a protein sequence are assumed to have a thermodynamic preference for a hydrophobic environment over an aqueous one. The location of probable transmembrane regions is indicated by a sliding-window average of either free energy difference (White and Wimley, 1999). A window size of 19 residues, which is the length of a membrane-spanning alpha helix, is used to analyze the D1 protein sequence with the octanol program. Peaks above zero represent transmembrane regions. The plot in figure 3 indicates that the D1 protein sequence does not have any transmembrane regions.
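The sliding-window calculation behind such a plot can be sketched in Python. Note the substitution: this sketch uses the Kyte and Doolittle (1982) hydropathy scale, not the White and Wimley free energies that octanol uses, so its values are not comparable with figure 3. The function name `window_hydropathy` is hypothetical.

```python
# Kyte-Doolittle hydropathy scale (a swapped-in scale, not octanol's).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def window_hydropathy(sequence, window=19):
    """Mean hydropathy of every full window along the sequence.

    Returns one value per window position, or an empty list if the
    sequence is shorter than the window. Raises KeyError on residues
    outside the 20 standard amino acids.
    """
    values = [KYTE_DOOLITTLE[aa] for aa in sequence.upper()]
    if len(values) < window:
        return []
    total = sum(values[:window])
    means = [total / window]
    # Slide the window one residue at a time, updating the running sum.
    for i in range(window, len(values)):
        total += values[i] - values[i - window]
        means.append(total / window)
    return means


if __name__ == "__main__":
    profile = window_hydropathy("MKRGRPKK" + "L" * 19 + "DDEEKK")
    print(max(profile), profile.index(max(profile)) + 1)
```

The example sequence is artificial: the run of 19 leucines produces the highest window mean, which is how a hydrophobic segment shows up as a peak in the plot.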

Motif tools
Classification and functions
Motif tools are used to find motifs and cleavage sites in a protein sequence for predicting the potential protein functions. The motif tools on EGG can be divided into two categories: the cleavage sites search tools and the pattern search tools. The motif tools and their functions are shown in table 2.
First, the cleavage sites search tools (digest and sigcleave) can be used to find proteolytic enzyme cleavage sites, reagent cleavage sites and signal cleavage sites, each of which is indicated by a specific pattern.
Second, the pattern search tools pscan and patmatmotifs are used for searching a database for defined motifs that indicate the potential function of the protein. The fuzzpro program can be used for searching for a user-defined motif within a protein sequence. Potential antigenic sites can be predicted with the antigenic program according to the method of Kolaskar and Tongaonkar (1990). The oddcomp program can be used for comparing a given protein sequence with a standard protein sequence to evaluate differences in amino acid composition.
| Protein Motifs | iNquiry Tools | Functions |
|---|---|---|
| Cleavage sites search | digest | Protein proteolytic enzyme or reagent cleavage digest |
| | sigcleave | Reports protein signal cleavage sites |
| Pattern search | antigenic | Finds antigenic sites in protein sequences |
| | fuzzpro | Protein pattern search |
| | pscan | Scans proteins using PRINTS |
| | preg | Regular expression search of a protein sequence |
| | patmatmotifs | Search a PROSITE motif database with a protein sequence |
| | patmatdb | Search a protein sequence with a motif |
| | oddcomp | Finds protein sequence regions with a biased composition |
Usages and examples
To demonstrate the use of the motif tools, an uncharacterized protein sequence from the Drosophila gene CG3281 is used. The digest, antigenic, fuzzpro and pscan programs are used to analyze the protein sequence.
digest
The digest program is used to find potential proteolytic enzyme or reagent cleavage sites in a protein. The output is a file containing the positions where the agent cuts and the peptides produced. Trypsin will not normally cut after a K if it is followed by another K or a P; choosing the “unfavoured” qualifier shows these unfavoured cuts as well as the favoured ones. A partial digestion can be emulated with the “-overlap” qualifier. In this mode, the fragments from a complete digestion are reported together with the fragments produced when one cut site at a time is left uncut. The “allpartials” qualifier emulates a very partial digestion and shows all possible fragments from all possible combinations of cut sites. Trypsin is selected as the required parameter for analyzing the sample protein sequence. The overlap, unfavoured and allpartials qualifiers are not selected. The resulting trypsin cleavage sites are listed in figure 4.
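The cutting rule described above can be sketched in Python. This is a simplified illustration, not the digest implementation. It assumes that trypsin cuts after K or R, and that a cut is unfavoured when the next residue is P or when a K is followed by another K. The function names and the `unfavoured` flag are hypothetical and only mirror the qualifier of the same name.

```python
def trypsin_sites(sequence, unfavoured=False):
    """Return 1-based positions after which trypsin cuts.

    Simplified rule (an assumption): cut after K or R; a cut is
    unfavoured if the next residue is P, or if a K is followed by K.
    """
    sequence = sequence.upper()
    sites = []
    for i, aa in enumerate(sequence[:-1]):  # no cut after the last residue
        if aa not in "KR":
            continue
        nxt = sequence[i + 1]
        is_unfavoured = nxt == "P" or (aa == "K" and nxt == "K")
        if unfavoured or not is_unfavoured:
            sites.append(i + 1)
    return sites


def digest(sequence, unfavoured=False):
    """Return the peptides of a complete digestion at the selected sites."""
    sequence = sequence.upper()
    bounds = [0] + trypsin_sites(sequence, unfavoured) + [len(sequence)]
    return [sequence[a:b] for a, b in zip(bounds, bounds[1:])]


if __name__ == "__main__":
    print(digest("AAKPGGRTTKKLLR"))
    print(digest("AAKPGGRTTKKLLR", unfavoured=True))
```

With the default setting, the K followed by P and the K followed by K in the example are left uncut. Setting `unfavoured=True` adds those sites and yields shorter peptides.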

antigenic
The antigenic program is used to predict antigenic sites on protein sequences. The hydrophobic residues Cys, Leu and Val are likely to be a part of antigenic sites if they occur on the surface of a protein. The physicochemical properties of amino acid residues and their frequencies of occurrence in experimentally known segmental epitopes can be used to predict antigenic determinants on proteins (Kolaskar and Tongaonkar, 1990). The antigenic sites of the CG3281 protein are found by using the antigenic program (figure 5).

fuzzpro
The fuzzpro program is used to search a given protein sequence for a specific pattern, which is a specification of a short length of sequence written in PROSITE style. A pattern can specify an exact sequence or can allow various ambiguities. The motif xxgrpxx, a DNA binding motif from the D1 protein, is used to search the uncharacterized CG3281 protein. The result is shown in figure 6.
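A PROSITE-style search of this kind can be sketched by translating the pattern into a regular expression. The sketch below is not how fuzzpro is implemented and covers only a subset of the PROSITE syntax. Writing the motif xxgrpxx as `x(2)-G-R-P-x(2)` is an assumption about its PROSITE form, and the function names are hypothetical.

```python
import re

# One pattern element: optional '<' anchor, a residue specification,
# an optional repeat count such as (2) or (1,3), and an optional '>' anchor.
ELEMENT = re.compile(
    r"(<?)(x|[A-Z]|\[[A-Z]+\]|\{[A-Z]+\})(?:\((\d+)(?:,(\d+))?\))?(>?)"
)


def prosite_to_regex(pattern):
    """Translate a subset of PROSITE pattern syntax into a regex string."""
    parts = []
    for element in pattern.strip(".").split("-"):
        match = ELEMENT.fullmatch(element)
        if match is None:
            raise ValueError("unsupported pattern element: " + element)
        start, core, low, high, end = match.groups()
        if core == "x":               # any residue
            core = "."
        elif core.startswith("{"):    # any residue except those listed
            core = "[^" + core[1:-1] + "]"
        count = ""
        if low:
            count = "{" + low + ("," + high if high else "") + "}"
        parts.append(("^" if start else "") + core + count + ("$" if end else ""))
    return "".join(parts)


def find_matches(pattern, sequence):
    """Return (1-based start, matched text) for non-overlapping matches."""
    regex = re.compile(prosite_to_regex(pattern))
    return [(m.start() + 1, m.group()) for m in regex.finditer(sequence.upper())]


if __name__ == "__main__":
    print(prosite_to_regex("x(2)-G-R-P-x(2)"))
    print(find_matches("x(2)-G-R-P-x(2)", "MKRGRPKKAA"))
```

The example sequence is artificial. Because `re.finditer` reports non-overlapping matches only, overlapping occurrences of a motif would need a lookahead-based search instead.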

pscan
The pscan program is used to search for potential motifs in a given protein sequence using the PRINTS database, which is a database of diagnostic protein signatures, or fingerprints. Fingerprints are groups of conserved motifs or elements that together form a diagnostic signature for a particular protein family. The pscan program finds matches between a query protein sequence and the motifs or elements in the PRINTS database. The athook, alphaamylase and profilin motifs are found in the uncharacterized sample sequence (figure 7).

Conclusion
iNquiry tools for protein composition and motif searches are used for analyzing the potential properties and functions of a given protein sequence. The protein composition tools can be used to analyze the basic properties, the electrical properties and the hydropathy of a protein. The protein motif tools are used for predicting potential properties and functions by searching for cleavage sites and specific patterns in a given protein sequence.
References
Attwood, T.K., Flower, D.R., Lewis, A.P., Mabey, J.E., Morgan, S.R., Scordis, P., Selley, J. and Wright, W. (1999) PRINTS prepares for the new millennium. Nucleic Acids Research, 27(1), 220-225.
Kolaskar, A.S. and Tongaonkar, P.C. (1990) A semi-empirical method for prediction of antigenic determinants on protein antigens. FEBS Letters, 276, 172-174.
Kyte, J. and Doolittle, R.F. (1982) A simple method for displaying the hydropathic character of a protein. Journal of Molecular Biology, 157(1), 105-132.
White, S.H. and Wimley, W.C. (1999) Membrane protein folding and stability: physical principles. Annual Review of Biophysics and Biomolecular Structure, 28, 319-365.
EMBOSS TOOLS http://www.hgmp.mrc.ac.uk/Software/EMBOSS/Apps/index.htm
PRINTS database http://www.bioinf.man.ac.uk/dbbrowser/PRINTS/