

Phylogenetic Tree Reconstruction - Distance Matrix Method and Tree Drawing

I. INTRODUCTION
Reconstructing a phylogenetic tree from molecular data is complex and often controversial (Nei & Kumar, 2000). Three main classes of methods are used for tree reconstruction: the distance matrix method, the maximum parsimony (MP) method, and the maximum likelihood (ML) method. Currently, no method is perfect (Nei & Kumar, 2000). Whichever method is applied, however, the ultimate goal is the same: to obtain the best estimate of the true tree.

This report summarizes the tools available in iNquiry that use the distance matrix method for tree reconstruction and those that draw trees in graphical form.

II. DISTANCE MATRIX METHOD
a) Introduction
The distance matrix method uses a matrix of estimated distances between all pairs of species (or genes) in a data set to reconstruct a phylogenetic tree. This method is computationally fast; however, because the tree is built from the distances rather than from the original data set, some information in the data set may be lost, and therefore the method may not be as powerful as the MP or ML methods (Opperdoes, 1997).

b) Preparation to Use Distance Matrix Methods
Before a distance matrix can be analyzed, any data set of nucleic acid or amino acid sequences must first be converted into estimated distances in a matrix form. The following tools in iNquiry can do this conversion.

Program             Data Type
Dnadist (PHYLIP)    nucleic acid sequences
Protdist (PHYLIP)   amino acid sequences
NucML (MOLPHY)      nucleic acid sequences
ProtML (MOLPHY)     amino acid sequences

Note that NucML and ProtML are put together in the same interface under Molphy in iNquiry.
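
The conversion from aligned sequences to a distance matrix can be illustrated with a small sketch. It uses the Jukes-Cantor correction purely as an example of a distance estimate; this is an assumption for illustration, not necessarily the model that Dnadist or NucML applies by default, and the function names are hypothetical.

```python
import math

def p_distance(seq_a, seq_b):
    """Proportion of differing sites between two aligned sequences (gap sites skipped)."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    pairs = [(a, b) for a, b in zip(seq_a.upper(), seq_b.upper())
             if a != "-" and b != "-"]
    if not pairs:
        raise ValueError("no comparable sites")
    return sum(a != b for a, b in pairs) / len(pairs)

def jukes_cantor(p):
    """Jukes-Cantor distance for an observed proportion of differences p."""
    if p == 0:
        return 0.0
    if p >= 0.75:
        raise ValueError("distance is undefined for p >= 0.75")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)

def distance_matrix(seqs):
    """Square distance matrix for a dict mapping species name -> aligned sequence."""
    names = list(seqs)
    matrix = [[jukes_cantor(p_distance(seqs[a], seqs[b])) for b in names]
              for a in names]
    return names, matrix
```

For example, distance_matrix({"A": "ACGT", "B": "ACGA"}) gives a symmetric 2 x 2 matrix with zeros on the diagonal.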

c) Methods Using Distance Matrices (Nei & Kumar, 2000)
The distance matrix method has several variants. The four used by the iNquiry tools are listed below.

  1. UPGMA (Unweighted Pair-Group Method Using Arithmetic Averages)
    UPGMA clusters the most closely related species (those separated by the smallest distances) step by step. At each stage of clustering, a new branch is added to the tree and the branch lengths are calculated. UPGMA assumes a constant evolutionary rate, so the two species (or clusters) joined at a node are given the same branch length from that node. It is a simple and fast method; however, it often produces an incorrect topology when the constant-rate assumption is not met.
  2. Least Squares (LS) Method
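    The LS method measures the squared differences between the observed pairwise distances and the distances implied by the tree (the sums of the estimated branch lengths on the path between two species). After it evaluates all possible topologies, it chooses the topology with the smallest sum of squared differences. Branch lengths can be estimated in two ways: with unweighted (ordinary) least squares or with the weighted least squares of Fitch and Margoliash.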
  3. Minimum Evolution (ME) Method
    The ME method estimates the total branch length (the sum of all branch lengths) of each topology. After it evaluates all possible topologies, it chooses the topology with the smallest total branch length. This method is computationally intensive and therefore slow. When only a small number of species is compared, the NJ method usually gives the same result as the ME method in less time.
  4. Neighbor-Joining (NJ) Method
    The NJ method clusters neighbors, that is, pairs of species (or clusters) that are joined by a single node. It does not evaluate all possible tree topologies; instead, the minimum evolution criterion is applied at each stage of clustering to choose the pair to join. Thus, the NJ method is considered a simplified version of the ME method.
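
As an illustration of the clustering idea, here is a minimal UPGMA sketch. It is not the code used by Neighbor; the function name and the output layout are assumptions made for this example. At each step it joins the two closest clusters, places the new node at half their distance, and averages the distances to the remaining clusters, weighted by cluster size.

```python
def upgma(names, dist):
    """Build a rooted tree string from a square distance matrix by UPGMA."""
    # Each cluster is stored as (tree text, number of species, node height).
    clusters = {i: (name, 1, 0.0) for i, name in enumerate(names)}
    d = {(i, j): dist[i][j]
         for i in range(len(names)) for j in range(i + 1, len(names))}
    next_id = len(names)
    while len(clusters) > 1:
        # Pick the pair of clusters separated by the smallest distance.
        (a, b), d_ab = min(d.items(), key=lambda item: item[1])
        text_a, size_a, height_a = clusters.pop(a)
        text_b, size_b, height_b = clusters.pop(b)
        # Constant-rate assumption: both sides end at the same height.
        height = d_ab / 2.0
        text = "(%s:%.4f,%s:%.4f)" % (text_a, height - height_a,
                                      text_b, height - height_b)
        # Distance from the new cluster to each remaining one is the
        # size-weighted average of the two old distances.
        new_d = {}
        for c in clusters:
            d_ac = d[(min(a, c), max(a, c))]
            d_bc = d[(min(b, c), max(b, c))]
            new_d[(c, next_id)] = (size_a * d_ac + size_b * d_bc) / (size_a + size_b)
        d = {pair: v for pair, v in d.items() if a not in pair and b not in pair}
        d.update(new_d)
        clusters[next_id] = (text, size_a + size_b, height)
        next_id += 1
    (text, _, _), = clusters.values()
    return text + ";"
```

With three species where A and B are 2 apart and both are 6 from C, the sketch joins A and B first at height 1 and then joins C at height 3.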

d) Using Distance Matrix Method
The table below shows the available programs in iNquiry that use the distance matrix method.

Program             Default Method           Output
Fitch (PHYLIP)      LS (Fitch-Margoliash)    unrooted tree
Kitsch (PHYLIP)     LS (Fitch-Margoliash)    rooted tree
Neighbor (PHYLIP)   NJ                       unrooted tree
NJDist (MOLPHY)     NJ                       unrooted tree
Weighbor            weighted NJ              unrooted tree

1) Fitch, Kitsch, Neighbor in PHYLIP (Phylogeny Inference Package) (Felsenstein, 1993, unless otherwise noted)

+ Input Format
The input data must be in the PHYLIP format. The first line must contain the number of species and nothing else. Each species name can be at most 10 characters long and must be followed immediately by the distance values. If a name is shorter than 10 characters, you must pad it with spaces to fill the 10-character field. The matrix can be square, upper-triangular, or lower-triangular.

Sample PHYLIP Input

    7
CHICKEN     0.0000  4.9633  4.0284  4.4747  3.8216  3.6020  0.3728
MOUSE       4.9633  0.0000  0.1959  0.2923  0.2641  0.2729  4.1461
RAT         4.0284  0.1959  0.0000  0.2820  0.2766  0.2857  3.6916
DOG         4.4747  0.2923  0.2820  0.0000  0.2497  0.2728  4.2250
SHEEP       3.8216  0.2641  0.2766  0.2497  0.0000  0.1836  3.7778
COW         3.6020  0.2729  0.2857  0.2728  0.1836  0.0000  4.2541
HUMAN       0.3728  4.1461  3.6916  4.2250  3.7778  4.2541  0.0000
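
The layout rules above can be sketched as a small writer for a square matrix. The function name is hypothetical, and the field widths (8 characters per value, 4 decimal places) are chosen only to reproduce the sample shown here.

```python
def format_phylip_matrix(names, dist):
    """Return a square distance matrix as text in the PHYLIP layout."""
    lines = ["%5d" % len(names)]           # first line: number of species only
    for name, row in zip(names, dist):
        if len(name) > 10:
            raise ValueError("name longer than 10 characters: %r" % name)
        # Pad the name with spaces to exactly 10 characters, then the values.
        lines.append(name.ljust(10) + "".join("%8.4f" % value for value in row))
    return "\n".join(lines)
```

Calling format_phylip_matrix(["CHICKEN", "MOUSE"], [[0.0, 4.9633], [4.9633, 0.0]]) produces the first two columns of the sample in the same alignment.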

+ Options
The three tools offer many options, most of which are shared. Some important options are listed below.
Method Option (D) (Fitch, Kitsch, Neighbor)
In Fitch and Kitsch, you can choose the Fitch-Margoliash (default) or the ME method. In Neighbor, UPGMA or NJ (default) can be selected.
Jumble Option (J) (Fitch, Kitsch, Neighbor)
This option is recommended. It randomizes the order of the species in your input file and so reduces the error caused by the input order. You are asked for a random number seed, which must be an odd integer (1 through 32767), and for the number of times to jumble. If you enter 3 for the number of jumbles, the program will generate three different trees with random species orders and will output the best tree among them. It is recommended to enter at least 10 for the number of jumbles; a higher value further decreases the chance of order-dependent errors.
Bootstrap Option (M) (Fitch, Kitsch, Neighbor)
This option is also recommended. It randomly resamples your original data to create pseudo-replicates, which are used to test how reliably the data support the resulting tree (Opperdoes, 1997).
User Tree Option (U) (Fitch, Kitsch, Neighbor)
If you want specific tree topologies to be evaluated, enter the trees here. A tree topology must be written with nested parentheses, with commas separating the members of each group, and must end with “;” (e.g. (A,(B,C));).
Outgroup Option (O) (Fitch, Neighbor)
Specify which species is to be used as the outgroup. The default is 1, which is the first species in your input file.
Global Rearrangements (G) (Fitch)
This is another recommended option. With it, each subtree is removed from the tree and put back in all possible positions to find a better tree. In Kitsch this behavior is the default.
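
The resampling behind the bootstrap option can be sketched as follows. The sketch assumes the standard bootstrap, in which the columns of the sequence alignment are drawn at random with replacement; it is not the iNquiry or PHYLIP implementation, and the function names are hypothetical.

```python
import random

def bootstrap_replicate(alignment, rng):
    """One pseudo-replicate: alignment columns drawn at random with replacement."""
    length = len(next(iter(alignment.values())))
    columns = [rng.randrange(length) for _ in range(length)]
    # The same column choice is applied to every species, so sites stay aligned.
    return {name: "".join(seq[c] for c in columns)
            for name, seq in alignment.items()}

def bootstrap(alignment, n_replicates, seed=1):
    """Return n_replicates pseudo-replicates of a dict name -> aligned sequence."""
    rng = random.Random(seed)
    return [bootstrap_replicate(alignment, rng) for _ in range(n_replicates)]
```

Each pseudo-replicate would then be converted to a distance matrix and analyzed like the original data, and the resulting trees compared.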

+ Output
Two files are produced, outfile and outtree. Outfile shows the topology of the best tree found, with the branch lengths and heights. Outtree contains the best tree in tree file format, which can be used as input to the tree drawing tools.

Sample Output of Neighbor Outfile


7 Populations

Neighbor-Joining/UPGMA method version 3.6b

Data set # 1:

 Neighbor-joining method

 Negative branch lengths allowed


                                              +SHEEP     
  +-------------------------------------------2 
  !                                           ! +COW       
  !                                           +-3 
  !                                             ! +RAT       
  !                                             +-5 
  !                                               ! +-DOG       
  !                                               +-4 
  !                                                 +-MOUSE     
  ! 
  1HUMAN     
  ! 
  +--CHICKEN   


remember: this is an unrooted tree!

Between        And            Length
-------        ---            ------
   1             2            3.67827
   2          SHEEP          -0.06497
   2             3            0.11908
   3          COW             0.03690
   3             5            0.09797
   5          RAT             0.05112
   5             4            0.04168
   4          DOG             0.15396
   4          MOUSE           0.13834
   1          HUMAN           0.10686
   1          CHICKEN         0.26594
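
The branch table can be checked against the input distances by summing the branch lengths along the path between two species. The sketch below is only an illustration with a hypothetical function name; the branch list is copied from the table above.

```python
BRANCHES = [
    ("1", "2", 3.67827), ("2", "SHEEP", -0.06497), ("2", "3", 0.11908),
    ("3", "COW", 0.03690), ("3", "5", 0.09797), ("5", "RAT", 0.05112),
    ("5", "4", 0.04168), ("4", "DOG", 0.15396), ("4", "MOUSE", 0.13834),
    ("1", "HUMAN", 0.10686), ("1", "CHICKEN", 0.26594),
]

def path_length(branches, start, end):
    """Sum of branch lengths on the unique path between two nodes of the tree."""
    graph = {}
    for a, b, length in branches:
        graph.setdefault(a, []).append((b, length))
        graph.setdefault(b, []).append((a, length))
    # Depth-first walk; remembering the parent is enough because a tree has no cycles.
    stack = [(start, None, 0.0)]
    while stack:
        node, parent, total = stack.pop()
        if node == end:
            return total
        for neighbor, length in graph[node]:
            if neighbor != parent:
                stack.append((neighbor, node, total + length))
    raise ValueError("no path between %s and %s" % (start, end))

print(round(path_length(BRANCHES, "DOG", "MOUSE"), 4))  # 0.2923
```

The DOG to MOUSE path (0.15396 + 0.13834) agrees with the 0.2923 entry in the sample input matrix.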

2) NJdist in MOLPHY (Molecular Phylogenetics) (Adachi & Hasegawa, 1996)
NJdist is also available on the same page as ProtML and NucML under Molphy.

+ Input Format
The input data must be in the MOLPHY or PHYLIP format. In the MOLPHY format, the first line must contain the number of species, the sequence length, and any other information you would like to include. Species names can be longer than 10 characters, but the distance values must be separated from the species names by a space, a tab, or a new line.

+ Options
The options are self-explanatory. The default outgroup number is 1.

+ Output
A picture of the best tree is produced in PDF format. If you have selected the Output Tree and Output Topology options, the tree file and the topology are produced as well.

3) Weighbor (Weighted Neighbor Joining)
Weighbor uses a weighted version of the neighbor-joining method and gives less weight to the longer distance values in a matrix (UBiC, not dated).
+ Input Format (Bruno et al., 2001)
The format is similar to the PHYLIP format, except that the species names can be as long as 128 characters and must be separated from the distance values by a space, a tab, or a new line.

+ Options (Bruno et al., 2000)
Length of the sequences (-L)
The entered number should be the length of your sequence minus the portions of the sequence that do not change over time. The default value is 500.
Size of the alphabet (-b)
The default value is 4.0. If your original sequences contain biased nucleotide usage or partial selection pressure, the value should be lower than 4.0 but higher than 2.0.

+ Output
The output is a tree file with branch lengths.

III. TREE DRAWING (Felsenstein, 1993)
a) Introduction
Once you have obtained the best tree from any of the tree reconstruction tools, the tree can be plotted in a more polished graphical format. PHYLIP offers programs for this, as well as a program that computes a consensus tree from multiple trees.

b) Using Tree Drawing Tools
The table below shows the available programs in iNquiry.

Program             Description
Drawtree (PHYLIP)   plots an unrooted tree
Drawgram (PHYLIP)   plots a rooted tree
Consense (PHYLIP)   computes a consensus tree

1) Drawtree and Drawgram in PHYLIP
+ Input Format
The input must be in the tree file format. Any tree files resulting from the above-mentioned tools as well as other PHYLIP or MOLPHY tree building tools can be used directly.

Sample Drawgram Input

((RAT:0.010961,(HUMAN:0.159112,CHICKEN:0.213688):3.581156):0.088967,
MOUSE:0.095972,(DOG:0.135146,(COW:0.098777,
SHEEP:0.084823):0.032173):0.054063);
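
Before a tree file is passed to a drawing tool, it can be sanity-checked with a short sketch like the one below. It only tests that the parentheses balance and that the tree ends with ";", and then lists the species names; the function name is hypothetical, and this is not a full parser of the tree file format.

```python
import re

def check_tree_string(text):
    """Check basic tree-file syntax and return the species names in order."""
    tree = text.strip()
    if not tree.endswith(";"):
        raise ValueError("tree must end with ';'")
    depth = 0
    for ch in tree:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                raise ValueError("')' without a matching '('")
    if depth != 0:
        raise ValueError("unbalanced parentheses")
    # A species name directly follows "(" or ","; branch lengths follow ":".
    return re.findall(r"[(,]\s*([^():,;\s]+)", tree)
```

Applied to the sample above, it returns RAT, HUMAN, CHICKEN, MOUSE, DOG, COW, and SHEEP.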

+ Options
There are many options, such as tree type, margins, and output format. I recommend using the default values at first and then adjusting the options until you have a visually satisfactory tree.

+ Output
The output is a picture of the tree in the format you have specified.

Sample Drawgram Output (figure not reproduced)
Drawn with the PostScript printer file format option and the depth/breadth of tree option

2) Consense in PHYLIP
Consense is different from Drawtree and Drawgram. It computes a consensus tree from many trees, which can be obtained from bootstrap analysis.

+ Input Format
The input is a tree file that contains many trees, with each tree starting on a new line.

+ Options
Consensus type (C)
This option lets you choose the consensus method used by the program. The default is Majority Rule (extended), in which any set of species that occurs in more than 50% of the input trees is included in the consensus tree, and further compatible sets are then added in order of frequency until the tree is fully resolved. In Strict, only the sets that occur in 100% of the input trees are included. The two other methods are Ml and Majority Rule.

There is also an outgroup species option, as well as an option to choose whether the tree is treated as rooted or unrooted.
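
The counting behind the Majority Rule and Strict methods can be sketched as below. Each input tree is represented simply as a collection of species sets, which is an assumption made for this example; the function names are hypothetical, and the extra step of the extended method is not included.

```python
from collections import Counter

def set_frequencies(trees):
    """Fraction of input trees in which each species set occurs."""
    counts = Counter()
    for groups in trees:
        counts.update(set(groups))       # count each set once per tree
    return {group: n / len(trees) for group, n in counts.items()}

def majority_rule_sets(trees):
    """Sets that occur in more than 50% of the input trees."""
    return {g for g, f in set_frequencies(trees).items() if f > 0.5}

def strict_sets(trees):
    """Sets that occur in every input tree."""
    return {g for g, f in set_frequencies(trees).items() if f == 1.0}

# Three toy trees on the species A, B, C, D, E.
trees = [
    [frozenset("AB"), frozenset("ABC")],
    [frozenset("AB"), frozenset("ABC")],
    [frozenset("AB"), frozenset("DE")],
]
print(sorted("".join(sorted(g)) for g in majority_rule_sets(trees)))  # ['AB', 'ABC']
print(sorted("".join(sorted(g)) for g in strict_sets(trees)))         # ['AB']
```

Here the set {A, B} occurs in all three trees, {A, B, C} in two of three, and {D, E} in only one, so only the first two pass the majority rule.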

+ Output
Two files are produced, outfile and outtree. Outfile contains four things: the species order, the sets included in the consensus tree, the sets not included in the consensus tree, and the consensus tree itself. Each set is written with one character per species in the listed order, where “*” marks a species that belongs to the set and “.” marks one that does not. Outtree contains the consensus tree in tree file format.

Sample Output of Consense Outfile

Consensus tree program, version 3.6b

Species in order: 

  1. XENOPUS
  2. CHICKEN
  3. OPOSSUM
  4. FINBACK WH
  5. BLUE WHALE
  6. COW
  7. RAT
  8. MOUSE


Sets included in the consensus tree

Set (species in order)     How many times out of  100.00

.*******                   100.00
...**...                   100.00
..******                   100.00
......**                   99.50
...*****                   99.50
...***..                   99.50


Sets NOT included in consensus tree:

Set (species in order)     How many times out of  100.00

...****.                    0.50
..*...**                    0.50
...**.**                    0.50


Extended majority rule consensus tree

CONSENSUS TREE:
the numbers forks indicate the number
of times the group consisting of the species
which are to the right of that fork occurred
among the trees, out of 100.00 trees

                                     +------MOUSE
                       +--------99.5-|
                       |             +------RAT
                +-99.5-|
                |      |             +------BLUE WHALE
                |      |      +100.0-|
         +100.0-|      +-99.5-|      +------FINBACK WH
         |      |             |
         |      |             +-------------COW
  +100.0-|      |
  |      |      +---------------------------OPOSSUM
  |      |
  |      +----------------------------------CHICKEN
  |
  +-----------------------------------------XENOPUS
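
The set notation in the outfile can be decoded with a few lines. This is only a reading aid with a hypothetical function name; the species list is copied from the sample output above.

```python
SPECIES = ["XENOPUS", "CHICKEN", "OPOSSUM", "FINBACK WH",
           "BLUE WHALE", "COW", "RAT", "MOUSE"]

def decode_set(pattern, species):
    """Return the species marked with '*' in a set line such as '......**'."""
    if len(pattern) != len(species):
        raise ValueError("pattern length must equal the number of species")
    return [name for mark, name in zip(pattern, species) if mark == "*"]

print(decode_set("......**", SPECIES))  # ['RAT', 'MOUSE']
print(decode_set("...**...", SPECIES))  # ['FINBACK WH', 'BLUE WHALE']
```

The first set is the RAT and MOUSE group that appears at the top of the consensus tree with a value of 99.5.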

IV. REMARKS
As mentioned in the introduction, there is no perfect method for phylogenetic tree reconstruction. We should always keep in mind that the best tree produced by any program may not be the true tree. Opperdoes (1997) suggests the following steps to test the reliability of the result.

  • Try different methods
  • Change the parameters/options
  • Add/remove one or more species from your data
  • Use bootstrapping
  • Add an outgroup species to your data

REFERENCES

Adachi, J. and M. Hasegawa. (1996) Molphy: A Computer Program Package for Molecular Phylogenetics.
http://www.is.titech.ac.jp/~shimo/class/doc/csm96.pdf Retrieved Apr. 19, 2004.

Bruno, W. J., et al. (2000) Weighted Neighbor Joining: A Fast Approximation to Maximum-Likelihood Phylogeny Reconstruction, Molecular Biology and Evolution, 17(1): 189-197.

Bruno, W. J., et al. (2001) Weighbor - Weighted Neighbor Joining.
http://egg.isu.edu/weighbor/weighbor.txt Retrieved Apr. 19, 2004.

Felsenstein, J. (1993) Phylip
http://www.cmbi.kun.nl/bioinf/PHYLIP/main.html#toc1 Retrieved Apr. 19, 2004.

Nei, M. and S. Kumar. (2000) Molecular Evolution and Phylogenetics.
Oxford University Press, New York. pp. 73-113.

Opperdoes, F. R. (1997) Methods for the Inference of Protein Phylogeny.
http://www.icp.ucl.ac.be/~opperd/private/phylogeny.html Retrieved Apr. 19, 2004.

UBiC Bioinformatics Centre. (date not shown) Weighbor.
http://bioinformatics.ubc.ca/resources/tools/index.php?name=weighbor Retrieved Apr. 19, 2004.