Multiple Sequence Alignments and Consensus Sequences Using Bioinformatic Tools on EGG
What is a multiple sequence alignment and why is it useful?
Simply put, a multiple sequence alignment is an arrangement of the residues (amino acids or nucleotides) of several sequences in columns, with gaps introduced so that similar residues line up. Multiple sequence alignments are useful because they make regions of conservation among the sequences easy to identify. These conserved regions can be used for primer design, hybridization probe design, database searching, and phylogenetic analysis (11).
Steps in a Multiple Sequence Alignment
Progressive alignment programs such as Clustalw perform a multiple sequence alignment in three steps. First, every pair of sequences is aligned to calculate a distance matrix. This can be done in one of two ways, fast or slow. A fast pairwise alignment produces scores for the distance matrix that correspond to the number of k-tuple matches, that is, runs of identical residues (1-2 residues long for amino acids or 2-4 for nucleotides), minus a fixed penalty for every gap (13). A slow pairwise alignment uses dynamic programming with two gap penalties, one for opening a gap and one for extending it. The slow method takes longer but is more accurate than k-tuple matching.
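The slow (dynamic programming) method can be illustrated with a minimal Python sketch. It computes only the best global alignment score under separate gap-opening and gap-extension penalties. The function name `affine_global_score` and the scoring values (match, mismatch, and both gap penalties) are assumptions chosen for illustration, not Clustalw's defaults; a real program would score residues with a substitution matrix and convert the result into a distance.

```python
NEG_INF = float("-inf")


def affine_global_score(a, b, match=1, mismatch=-1, gap_open=-4, gap_extend=-1):
    """Best global alignment score of a and b with affine gap penalties.

    A gap of length k costs gap_open + (k - 1) * gap_extend.
    All scoring values are illustrative assumptions.
    """
    n, m = len(a), len(b)
    # M: a[i-1] is aligned to b[j-1]
    # X: a[i-1] is aligned to a gap (gap in b)
    # Y: b[j-1] is aligned to a gap (gap in a)
    M = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    X = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    Y = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    M[0][0] = 0
    for i in range(1, n + 1):
        X[i][0] = gap_open + (i - 1) * gap_extend
    for j in range(1, m + 1):
        Y[0][j] = gap_open + (j - 1) * gap_extend

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            M[i][j] = s + max(M[i - 1][j - 1], X[i - 1][j - 1], Y[i - 1][j - 1])
            # Opening a new gap pays gap_open; continuing one pays gap_extend.
            X[i][j] = max(M[i - 1][j] + gap_open,
                          Y[i - 1][j] + gap_open,
                          X[i - 1][j] + gap_extend)
            Y[i][j] = max(M[i][j - 1] + gap_open,
                          X[i][j - 1] + gap_open,
                          Y[i][j - 1] + gap_extend)
    return max(M[n][m], X[n][m], Y[n][m])
```

With these assumed values, `affine_global_score("ACGT", "AGT")` returns -1: three matches (+3) and one newly opened gap (-4). Because extending a gap is cheaper than opening one, a single long gap is preferred over several short ones.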
The second step is generation of a guide tree from the distance matrix calculated during the pairwise alignments. The guide tree is built using the neighbor-joining method (10). The root of the tree is placed by a midpoint method, at the position where the mean branch lengths on either side are equal (12). The guide tree determines the order in which the sequences are combined into the final multiple sequence alignment.
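To make the guide-tree step concrete, here is a minimal Python sketch of neighbor-joining on a small distance matrix. The function name `neighbor_joining`, the nested-parenthesis node labels, and the five-sequence example matrix are assumptions made for illustration. The sketch produces an unrooted tree only; placing the root, as described above, is not shown.

```python
def neighbor_joining(names, dist):
    """Build an unrooted tree by neighbor-joining (needs at least 2 names).

    names: list of sequence labels
    dist:  symmetric distance matrix as a list of lists
    Returns (joins, last_edge), where joins is a list of
    (node_a, node_b, branch_a, branch_b) in the order the pairs were joined,
    and last_edge is (node_a, node_b, length) for the final two nodes.
    """
    # Copy into a dict of dicts so the caller's matrix is left untouched.
    d = {a: {b: float(dist[i][j]) for j, b in enumerate(names)}
         for i, a in enumerate(names)}
    nodes = list(names)
    joins = []
    while len(nodes) > 2:
        n = len(nodes)
        total = {a: sum(d[a][b] for b in nodes) for a in nodes}
        # Choose the pair that minimises Q = (n - 2) * d(a, b) - total(a) - total(b).
        best = None
        for i in range(n):
            for j in range(i + 1, n):
                a, b = nodes[i], nodes[j]
                q = (n - 2) * d[a][b] - total[a] - total[b]
                if best is None or q < best[0]:
                    best = (q, a, b)
        _, a, b = best
        # Branch lengths from each joined node to the new internal node.
        branch_a = 0.5 * d[a][b] + (total[a] - total[b]) / (2 * (n - 2))
        branch_b = d[a][b] - branch_a
        joins.append((a, b, branch_a, branch_b))
        # Distances from the new internal node to every remaining node.
        new = f"({a},{b})"
        d[new] = {new: 0.0}
        for k in nodes:
            if k != a and k != b:
                d[new][k] = d[k][new] = 0.5 * (d[a][k] + d[b][k] - d[a][b])
        nodes = [k for k in nodes if k != a and k != b] + [new]
    a, b = nodes
    return joins, (a, b, d[a][b])


# Hypothetical distances among five sequences.
example_names = ["a", "b", "c", "d", "e"]
example_dist = [
    [0, 5, 9, 9, 8],
    [5, 0, 10, 10, 9],
    [9, 10, 0, 8, 7],
    [9, 10, 8, 0, 3],
    [8, 9, 7, 3, 0],
]
example_joins, example_last_edge = neighbor_joining(example_names, example_dist)
```

For this example matrix the first pair joined is `a` and `b`, with branch lengths 2 and 3 to their new internal node.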
The third step is progressive alignment of all the sequences according to the branching order of the guide tree. The most closely related sequences are aligned first, and the alignment then progresses to the most distantly related sequences (11).
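The branching order can be read off a guide tree with a simple traversal. The Python sketch below assumes a rooted guide tree written as nested two-element tuples of sequence names (a hypothetical representation, not Clustalw's tree file format) and lists which groups would be aligned at each step. It does not perform the alignments themselves.

```python
def alignment_order(tree):
    """List the merge steps implied by a guide tree, from tips toward the root.

    tree: a sequence name (str) or a 2-tuple (left_subtree, right_subtree).
    Returns a list of (left_group, right_group) pairs, where each group is the
    list of sequence names already aligned together at that point.
    """
    steps = []

    def visit(node):
        if isinstance(node, str):
            return [node]
        left, right = node
        left_group = visit(left)
        right_group = visit(right)
        # Both subtrees are fully aligned before their parent is merged.
        steps.append((left_group, right_group))
        return left_group + right_group

    visit(tree)
    return steps


# Hypothetical guide tree for five sequences.
example_tree = ((("a", "b"), "c"), ("d", "e"))
example_steps = alignment_order(example_tree)
```

For the example tree, `a` and `b` are aligned first, `c` is then added to that pair, `d` and `e` are aligned with each other, and the two groups are merged last.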
Multiple Sequence Alignment Tools Available on EGG
Several tools are available on EGG to perform a multiple sequence alignment (MSA) and to display the results graphically.
Three tools perform an MSA: Clustalw-MPI, Clustalw, and Emma. There is very little difference among them. Clustalw-MPI uses parallel computing to perform the MSA: it splits the job among different computers, and when the results come back it compiles them into an outfile for the user (8). Clustalw uses a single computer to perform the MSA, so it is generally slower.
Sample Session with Clustalw-MPI
The input for the example is the amino acid and nucleotide sequences of the acyl carrier protein from six different bacterial species (1-6).
Emma
Emma is an Emboss tool that interfaces with Clustalw to perform the MSA. The user interface is nearly the same, and the same parameters can be adjusted to suit the needs of the user. Emma's advantage over the other MSA tools is that its outfile can be used directly in other Emboss applications, such as plotcon, which graphically displays the quality of sequence conservation, and prettyplot, which displays the MSA with colors indicating the areas of conservation and similarity (7,9).
Consensus sequences
Cons creates a consensus sequence from a multiple sequence alignment. When no consensus is found at a particular position, an 'N' or an 'X' is placed in that position. Three parameters can be adjusted. The first is plurality, a cutoff for the number of positive matches below which there is no consensus. The default plurality value is half the total sequence weighting. The second is identity, the number of identical residues required at a position for it to give a consensus. The third is setcase, a threshold for positive matches above which the consensus residue is written in uppercase, indicating high confidence in the identity at that position, and below which it is written in lowercase, indicating less confidence. This parameter is valuable when performing searches because the lowercase positions can be filtered out (7).
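The way the three parameters interact can be sketched in Python. This is a simplified illustration, not the Cons implementation: it assumes every sequence has a weight of 1 and counts only identical residues as positive matches, whereas Cons scores matches with a comparison matrix and sequence weights. The function name `simple_consensus`, the default for `setcase`, and the example alignment are assumptions.

```python
from collections import Counter


def simple_consensus(aligned, plurality=None, identity=0, setcase=None, unknown="N"):
    """Consensus of equal-length aligned sequences ('-' marks a gap).

    Simplifying assumptions: each sequence has weight 1, and only identical
    residues count as positive matches.
    """
    if not aligned:
        raise ValueError("need at least one aligned sequence")
    if len({len(s) for s in aligned}) != 1:
        raise ValueError("aligned sequences must have equal length")

    total_weight = len(aligned)
    if plurality is None:
        plurality = total_weight / 2  # half the total sequence weighting
    if setcase is None:
        setcase = total_weight / 2    # assumed default for this sketch

    consensus = []
    for column in zip(*aligned):
        counts = Counter(c.upper() for c in column if c != "-")
        if not counts:
            consensus.append(unknown)  # column contains only gaps
            continue
        residue, count = counts.most_common(1)[0]
        if count < plurality or count < identity:
            consensus.append(unknown)          # no consensus at this position
        elif count > setcase:
            consensus.append(residue)          # high confidence: uppercase
        else:
            consensus.append(residue.lower())  # lower confidence: lowercase
    return "".join(consensus)


# Hypothetical four-sequence nucleotide alignment.
example_alignment = ["ACGTA", "ACGAC", "ACCTG", "AGTCT"]
example_consensus = simple_consensus(example_alignment, setcase=3)
```

For the example alignment with `setcase=3`, the result is `AcgtN`: only the first column exceeds the uppercase threshold, and the last column has no residue that reaches the plurality cutoff. Passing `unknown="X"` gives the corresponding placeholder for protein alignments.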
References
1. Clostridium perfringens str. 13 DNA, complete genome, section 7/10. GenBank Accession Number: AP003191
2. Haemophilus influenzae Rd KW20, complete genome. GenBank Accession Number: NC_000907
3. Lactobacillus plantarum WCFS1, complete genome. GenBank Accession Number: NC_004567
4. Leptospira interrogans serovar Copenhageni str. Fiocruz L1-130. GenBank Accession Number: NC_005824
5. Mycoplasma penetrans genomic DNA, complete sequence, section 2/5. GenBank Accession Number: AP004171
6. Vibrio parahaemolyticus DNA, chromosome 1, complete sequence, 8/11. GenBank Accession Number: AP005080
7. Carver,T. (tcarver © rfcgr.mrc.ac.uk) MRC Rosalind Franklin Centre for Genomics Research Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SB, UK
8. Li, Kuo-Bin. 2003. ClustalW-MPI: ClustalW analysis using distributed and parallel computing. Bioinformatics. 19: 1585-1586
9. Longden, I. (il © sanger.ac.uk) Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SA, UK
10. Saitou, N., M. Nei. 1987. The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol. Biol. Evol. 4: 406-425
11. Thompson, J.D., D.G. Higgins, T.J. Gibson. 1994. CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res. 22: 4673-4680
12. Thompson, J.D., D.G. Higgins, T.J. Gibson. 1994. Improved sensitivity of profile searches through the use of sequence weights and gap excision. Comput. Appl. Biosci. 10: 19-29
13. Wilbur, W.J., D.J. Lipman. 1983. Rapid similarity searches of nucleic acid and protein data banks. Proc. Natl. Acad. Sci. USA 80: 726-730