
29+ Evidences for Macroevolution

Some Statistics of Incongruent Phylogenetic Trees

Copyright 1999-2004 by Douglas Theobald, Ph.D.


The table below and the JavaScript calculator that follows it give the statistical significance of a match between two incongruent phylogenetic trees, reported as P-values. Each P-value is the probability that two rooted, bifurcating trees would match by chance with the given number of mismatching branches or fewer.

The number of incongruent branches is determined relative to the maximum agreement subtree (MAST) between two trees. A MAST is the "core" subtree that is common between two trees. The number of incongruent branches is equal to the minimum number of branches that must be pruned from one of the real trees to get the MAST. An example from John Harshman's analysis of crocodile species is given in the figure below (Harshman et al. 2003).

[Two incongruent phylogenetic trees of crocodile species]
Two incongruent crocodile phylogenies. The tree at left is based upon morphological data; the tree at right on the molecular sequence of the c-myc proto-oncogene (Harshman et al. 2003). The common MAST is shown in black. According to the distance metric described above, the distance between the two trees is one branch, due to the misplaced Gavialis branch indicated in magenta. The significance of the match between these two incongruent phylogenies is P ≤ 0.00077. Additionally, Harshman et al. performed an independent phylogenetic analysis with mitochondrial genes, which gave exactly the same tree as the c-myc proto-oncogene data. The overall significance for these three independent trees is P ≤ 7.4 × 10^-8.

In the table below, each row gives values for a comparison of two trees with a given number of taxa, and each column gives the significance for a given number of differences between the two trees. An incongruency of "1 adjacent" refers to the case where a branch is misplaced by only one adjacent node (i.e., two branches next to each other are swapped relative to the other tree). The remaining columns, labelled 1 through 10, refer to the case where that number of branches or fewer are misplaced anywhere in the tree. High statistical significance (P < 0.01, or greater than 99% confidence) is indicated by light blue. Statistical significance (P < 0.05, or greater than 95% confidence) is indicated by pink. Equivocal values (0.05 < P < 0.50) are indicated by white. Highly insignificant values (P > 0.50) are indicated by red, and impossible values are colored black.


Statistical Significance of Two Incongruent Phylogenetic Trees

Maximum P-value for two trees incongruent by the given number of branches:

| Number of taxa | exactly congruent | 1 adjacent | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 4 | 0.067 | 0.20 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 |
| 5 | 0.0095 | 0.038 | 0.28 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 |
| 6 | 0.0011 | 0.0052 | 0.050 | 0.97 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 |
| 7 | 9.6 × 10^-5 | 5.8 × 10^-4 | 0.0067 | 0.20 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 |
| 8 | 7.4 × 10^-6 | 5.2 × 10^-5 | 6.8 × 10^-4 | 0.030 | 0.53 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 |
| 9 | 4.9 × 10^-7 | 3.9 × 10^-6 | 6.2 × 10^-5 | 0.0035 | 0.089 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 |
| 10 | 2.9 × 10^-8 | 2.6 × 10^-7 | 4.6 × 10^-6 | 3.3 × 10^-4 | 0.012 | 0.22 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 |
| 11 | 1.5 × 10^-9 | 1.5 × 10^-8 | 3.0 × 10^-7 | 2.7 × 10^-5 | 0.0012 | 0.032 | 0.49 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 |
| 12 | 7.2 × 10^-11 | 8.0 × 10^-10 | 1.8 × 10^-8 | 1.9 × 10^-6 | 1.1 × 10^-4 | 0.0037 | 0.076 | 0.98 | 1.00 | 1.00 | 1.00 | 1.00 |
| 13 | 3.1 × 10^-12 | 3.8 × 10^-11 | 9.1 × 10^-10 | 1.2 × 10^-7 | 8.3 × 10^-6 | 3.5 × 10^-4 | 0.0095 | 0.17 | 1.00 | 1.00 | 1.00 | 1.00 |
| 14 | 1.2 × 10^-13 | 1.6 × 10^-12 | 4.3 × 10^-11 | 6.6 × 10^-9 | 5.6 × 10^-7 | 2.9 × 10^-5 | 9.9 × 10^-4 | 0.022 | 0.33 | 1.00 | 1.00 | 1.00 |
| 15 | 4.6 × 10^-15 | 6.6 × 10^-14 | 1.8 × 10^-12 | 3.3 × 10^-10 | 3.3 × 10^-8 | 2.1 × 10^-6 | 8.7 × 10^-5 | 0.0025 | 0.048 | 0.62 | 1.00 | 1.00 |
| 16 | 1.6 × 10^-16 | 2.4 × 10^-15 | 5.6 × 10^-14 | 1.5 × 10^-11 | 1.8 × 10^-9 | 1.3 × 10^-7 | 6.7 × 10^-6 | 2.3 × 10^-4 | 0.0056 | 0.095 | 1.00 | 1.00 |
| 17 | 5.2 × 10^-18 | 8.3 × 10^-17 | 2.1 × 10^-15 | 6.4 × 10^-13 | 8.6 × 10^-11 | 7.5 × 10^-9 | 4.5 × 10^-7 | 1.9 × 10^-5 | 5.6 × 10^-4 | 0.012 | 0.18 | 1.00 |
| 18 | 1.5 × 10^-19 | 2.7 × 10^-18 | 7.4 × 10^-17 | 2.5 × 10^-14 | 3.8 × 10^-12 | 3.9 × 10^-10 | 2.7 × 10^-8 | 1.4 × 10^-6 | 4.9 × 10^-5 | 0.0013 | 0.024 | 0.32 |
| 19 | 4.5 × 10^-21 | 8.1 × 10^-20 | 2.3 × 10^-18 | 8.9 × 10^-16 | 1.6 × 10^-13 | 1.8 × 10^-11 | 1.5 × 10^-9 | 8.6 × 10^-8 | 3.7 × 10^-6 | 1.2 × 10^-4 | 0.0027 | 0.046 |
| 20 | 1.2 × 10^-22 | 2.3 × 10^-21 | 7.3 × 10^-20 | 3.0 × 10^-17 | 5.9 × 10^-15 | 7.8 × 10^-13 | 7.3 × 10^-11 | 4.9 × 10^-9 | 2.5 × 10^-7 | 9.2 × 10^-6 | 2.5 × 10^-4 | 0.0054 |

P-value Calculator

This calculator finds an upper bound on the probability that two or more trees would, by chance, mismatch by the given number of branches or fewer. It remains reliable for very large inputs because it uses the logarithmic form of an improved Stirling's approximation for large factorials (for example, try 100,000 for the number of taxa and 99,389 for the number of incongruent branches).

[Interactive calculator. Inputs: tree type (rooted or unrooted), number of trees, number of taxa, number of incongruent branches. Output: an upper bound on the P-value.]
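
The log-space idea behind the calculator can be sketched in Python. This is a minimal illustration, not the page's JavaScript: it evaluates the multi-tree bound given under Mathematical Details for rooted trees only, and it substitutes the standard library's `math.lgamma` for the improved Stirling's approximation mentioned above. The function names, and the choice to cap the bound at 1, are assumptions made for this sketch.

```python
import math


def log_rooted_tree_count(n_taxa: int) -> float:
    """Natural log of (2n-3)!!, the number of rooted bifurcating trees on n taxa."""
    if n_taxa < 2:
        return 0.0  # a single taxon admits exactly one (trivial) tree
    # (2n-3)!! = (2n-3)! / (2^(n-2) * (n-2)!), and lgamma(m + 1) = log(m!)
    return (math.lgamma(2 * n_taxa - 2)
            - (n_taxa - 2) * math.log(2)
            - math.lgamma(n_taxa - 1))


def log_binomial(n: int, k: int) -> float:
    """Natural log of the binomial coefficient C(n, k)."""
    return math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)


def log10_p_bound(n_taxa: int, n_incongruent: int, n_trees: int = 2) -> float:
    """log10 of the upper bound C(N, I) / ((2[N-I]-3)!!)^(k-1), capped at P = 1."""
    if not 0 <= n_incongruent <= n_taxa:
        raise ValueError("need 0 <= n_incongruent <= n_taxa")
    if n_trees < 2:
        raise ValueError("need at least two trees")
    log_p = (log_binomial(n_taxa, n_incongruent)
             - (n_trees - 1) * log_rooted_tree_count(n_taxa - n_incongruent))
    return min(log_p, 0.0) / math.log(10)  # a probability cannot exceed 1


# The large example suggested in the text; plain factorials would overflow here.
print(log10_p_bound(100_000, 99_389))
```

Working with logarithms keeps every intermediate value small, so the result can be reported as a power of ten even when the factorials themselves are far too large to represent.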


Mathematical Details

For an exact match between two trees (no incongruence):

P = 2^(N-2) (N-2)! / (2N-3)!

or

P = 1 / (2N-3)!!

where "!!" is double factorial notation and N = # of taxa. For an incongruency of "1 adjacent" branch:

P = 2^(N-2) (N-1)! / (2N-3)!
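
The two closed forms above can be checked with exact rational arithmetic. The sketch below reads the leading factor as 2 raised to the power N-2, which is the reading under which the factorial form equals 1 / (2N-3)!! and under which N = 4 gives the 0.067 and 0.20 shown in the first row of the table. The function names are illustrative only.

```python
from fractions import Fraction
from math import factorial


def double_factorial(n: int) -> int:
    """n!! for odd n: the product n * (n-2) * ... * 3 * 1 (1 when n <= 1)."""
    result = 1
    while n > 1:
        result *= n
        n -= 2
    return result


def p_exact_match(n_taxa: int) -> Fraction:
    """P = 1 / (2N-3)!!, one tree out of all rooted bifurcating trees."""
    return Fraction(1, double_factorial(2 * n_taxa - 3))


def p_exact_match_factorial_form(n_taxa: int) -> Fraction:
    """The same probability written as 2^(N-2) (N-2)! / (2N-3)!."""
    return Fraction(2 ** (n_taxa - 2) * factorial(n_taxa - 2),
                    factorial(2 * n_taxa - 3))


def p_one_adjacent(n_taxa: int) -> Fraction:
    """P = 2^(N-2) (N-1)! / (2N-3)! for an incongruency of "1 adjacent"."""
    return Fraction(2 ** (n_taxa - 2) * factorial(n_taxa - 1),
                    factorial(2 * n_taxa - 3))


for n in (4, 5, 8):
    print(n, p_exact_match(n), p_one_adjacent(n))
```

For N = 4 this prints 1/15 and 1/5. The "1 adjacent" value is always N-1 times the exact-match value, since the two formulas differ only in (N-1)! versus (N-2)!.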

For an incongruency of I branches, misplaced anywhere between two trees:

P ≤ 2^(N-I-2) (N-I-2)! N! / ((2[N-I]-3)! (N-I)! I!)

or

P ≤ (N!/(N-I)!I!) / (2[N-I]-3)!!

where N = # of taxa and I = # of incongruent branches.

This last P-value calculation is an upper bound. That is, this P-value is an overestimation, since the actual P-value is very likely to be lower (better). P is the ratio of the maximum number of possible incongruent trees over the total number of possible trees. However, in the final equation the calculated maximum number of incongruent trees includes nonunique trees (i.e., some of the incongruent trees have the same topology and thus are counted more than once). For example, for N = 4 and I = 1, this calculation gives P ≤ 1.3333, while the exact P = 0.73333. At large N and I, P converges on the exact value.
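
The N = 4, I = 1 example is small enough to check by brute force. The sketch below enumerates all 15 rooted bifurcating trees on four taxa and counts those that share a MAST of at least three taxa with a reference tree. For four taxa that is the same as sharing at least one rooted triple (a statement that two taxa are closer to each other than to a third). The reference tree is assumed here to be the ladder-shaped tree (((a,b),c),d); the text does not say which shape was used, and other shapes may give a different count. All names are illustrative.

```python
from fractions import Fraction
from itertools import combinations
from math import comb


def rooted_trees(leaves):
    """Yield every rooted binary tree on `leaves`.

    A tree is either a leaf name (str) or a frozenset of two subtrees.
    """
    leaves = list(leaves)
    if len(leaves) == 1:
        yield leaves[0]
        return
    first, rest = leaves[0], leaves[1:]
    # Split the leaves into two clades, always keeping `first` on the left
    # so that each unordered split is generated exactly once.
    for k in range(len(rest)):
        for extra in combinations(rest, k):
            left = [first, *extra]
            right = [x for x in rest if x not in extra]
            for left_tree in rooted_trees(left):
                for right_tree in rooted_trees(right):
                    yield frozenset((left_tree, right_tree))


def leaf_names(tree):
    """Set of leaf names below `tree`."""
    if isinstance(tree, str):
        return {tree}
    return set().union(*(leaf_names(child) for child in tree))


def rooted_triples(tree):
    """All triples (frozenset({x, y}), z): x and y are closer to each other than to z."""
    if isinstance(tree, str):
        return set()
    a, b = tree
    found = rooted_triples(a) | rooted_triples(b)
    for inside, outside in ((a, b), (b, a)):
        for pair in combinations(sorted(leaf_names(inside)), 2):
            for z in leaf_names(outside):
                found.add((frozenset(pair), z))
    return found


trees = list(rooted_trees("abcd"))
# Assumed reference shape: the ladder (((a,b),c),d).
reference = frozenset((frozenset((frozenset(("a", "b")), "c")), "d"))
reference_triples = rooted_triples(reference)

matching = sum(1 for t in trees if rooted_triples(t) & reference_triples)
exact_p = Fraction(matching, len(trees))  # 11/15, about 0.7333
bound = Fraction(comb(4, 1), 3)           # C(4,1) / (2[4-1]-3)!! = 4/3

print(len(trees), exact_p, bound)
```

The bound counts 4 ways to prune one taxon and divides by the 3 trees on the remaining taxa, which is how it can exceed 1 when the same tree is reached by more than one route.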

These equations can be extended easily to the case of discrepancies between more than two trees, each of the same number of taxa. The probability that k rooted, binary, N-taxa trees have at most I incongruent branches is:

P ≤ (N!/(N-I)!I!) / ((2[N-I]-3)!!)^(k-1)

Equivalently, this is the probability that two or more N-taxa trees will share the same MAST of size N - I or greater. The JavaScript calculator above uses this equation to determine its P-values.
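
As a worked check, the multi-tree bound (with the double factorial raised to the power k - 1) reproduces the two values quoted in the crocodile figure caption if one assumes N = 8 taxa and I = 1 incongruent branch. These inputs are an assumption; the text does not state them.

```latex
% Assumed inputs (not stated in the text): N = 8 taxa, I = 1 incongruent branch.
% Two trees (k = 2):
\[
P \le \frac{\binom{8}{1}}{(2[8-1]-3)!!}
  = \frac{8}{11!!}
  = \frac{8}{10395}
  \approx 7.7 \times 10^{-4}
\]
% Three trees (k = 3), with the two congruent trees counted at I = 1 as an upper bound:
\[
P \le \frac{\binom{8}{1}}{(11!!)^{2}}
  = \frac{8}{10395^{2}}
  = \frac{8}{108056025}
  \approx 7.4 \times 10^{-8}
\]
```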

I would appreciate hearing from anyone who has any ideas on how to correct for nonunique trees. I independently derived most of these equations in the summer of 2002. Later I discovered via personal correspondence that Mike Steel had also derived these equations and was soon to publish all but the last in an upcoming book (Bryant et al. 2002). It appears that the final equation was independently derived by both me and Mike Steel, and to my knowledge it remains unpublished.


References

Li, W.-H. (1997). Molecular Evolution. Sunderland, MA, Sinauer Associates. p. 102.

Bryant, D., MacKenzie, A., and Steel, M. (2002). "The size of a maximum agreement subtree for random binary trees." In: Bioconsensus II, ed. M. F. Janowitz. DIMACS Series in Discrete Mathematics and Theoretical Computer Science (American Mathematical Society).

Harshman, J., Huddleston, C. J., Bollback, J. P., Parsons, T. J., and Braun, M. J. (2003). "True and false gharials: a nuclear gene phylogeny of Crocodylia." Syst Biol. 52: 386-402.