The Nitty Gritty Bit

By Thomas D. Schneider, Ph.D.

Posted October 12, 2005

Introduction

Too much of this "debate" in the media about the origin of living creatures is fluff - no chewy intellectual concepts. Let's get right down to the nitty gritty. As reported in all the newspapers, the main claim by advocates of "intelligent design" is that living things are too complex to have evolved. Is this right? What is 'complexity'? A practical, widely accepted measure is the one used in communications systems: information theory, developed by Claude Shannon in the 1940s. If you use a phone or the internet then you rely on Shannon's theory.

Genetic Control Systems

Now let's think about genetic control systems. An exciting one, recently discovered, is the set of controls responsible for maintaining stem cell states. As reported by scientists at MIT, there are three proteins, Oct4, Sox2 and Nanog, which bind to DNA and keep the cell a stem cell (Boyer et al. 2005). If the proteins are removed, the cell differentiates into specific cell types. How do these proteins work? They bind to specific spots on the DNA to control other genes.

Finding Spots on a Genome

So we have a problem: how do the proteins find those spots? If they miss the right spots, then the controls will be thrown off and the cell will go in the wrong direction, making, perhaps, nerve tissue where skin should be, eyeballs instead of teeth ... legs growing out of your head. Such an organism would not survive. If a genetic control protein binds to the wrong spots, then the organism will waste its energy and grow more slowly than one that doesn't waste its protein manufacturing ability. In some cases binding in the wrong place could turn the wrong gene on or off. In other words, there is strong selection for a functional genetic control system.

Now suppose (for simplicity's sake) that an organism had only 32 positions on its genome. That is, there are only 32 places where a protein could bind. This is orders of magnitude smaller than genomes in nature, but let's use it as an example.

How much "effort" is required to find one of those positions? The way to find out is to divide the genome in half a series of times. (Think about using a knife to divide a cake into pieces.) Each division requires a bit of information, the measure introduced by Shannon, to specify the choice made. So:

32/2 = 16 (1 bit)
16/2 =  8 (1 bit)
 8/2 =  4 (1 bit)
 4/2 =  2 (1 bit)
 2/2 =  1 (1 bit)

So 5 divisions or 5 bits is enough information for the protein to locate one site. Mathematically this is log2(32) = 5. (See the appendix of my Information Theory Primer for a lesson on logs from the ground up. Don't worry, I won't tell anyone that you read it.)
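
If you want to check this on a computer, here is a minimal sketch in Python (the function name count_halvings is invented here, just for illustration) that counts the halvings and compares the answer with the logarithm:

    import math

    def count_halvings(genome_size):
        """Count how many times genome_size can be cut in half to reach 1."""
        bits = 0
        while genome_size > 1:
            genome_size //= 2   # one division of the genome in half
            bits += 1           # each halving costs one bit of information
        return bits

    print(count_halvings(32))   # 5
    print(math.log2(32))        # 5.0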

Now suppose that, instead of there being only one binding site among the 32 positions, there were two. If the protein binds to either one, it is doing its job, since it doesn't matter which one it binds to and the other can be found by another copy of the protein. So one of the divisions doesn't matter and only log2(32/2) = 4 bits are needed. Likewise, if there were 4 binding sites, only log2(32/4) = 3 bits would be needed.

This number is called Rfrequency because it is based on the frequency of sites and the size of the genome. The R follows Shannon's notation: rate of information transmission. Shannon worked with bits per second in communications. In molecular information theory, we work with bits per binding site. If a protein binds at a certain rate in binding sites per second, then the two measures would be equivalent.
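
Here is the same idea as a short sketch in Python, using the toy 32-position genome from above (the function name rfrequency is invented here, not part of any published code):

    import math

    def rfrequency(genome_size, number_of_sites):
        """Bits needed to find a site: log2(genome size / number of sites)."""
        return math.log2(genome_size / number_of_sites)

    print(rfrequency(32, 1))   # 5.0 bits
    print(rfrequency(32, 2))   # 4.0 bits
    print(rfrequency(32, 4))   # 3.0 bits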

Patterns in the Genome at Binding Sites

Now let's look at this same problem from a radically different angle. Let's collect the binding sites together and align them. Sometimes there is enough pattern in the DNA sequences to do this by eye. Suppose that one position in the sequences is always a T. This means that when the protein is searching for its binding site, it will pick T out of the four genetic bases (A, C, G and T). So it makes a 1-in-4 choice, which is log2(4/1) = 2 bits.

Now consider a position in the binding sites that is A half of the time and G half of the time. How many bits is that? Well, the protein picks 2 out of 4 or log2(4/2) = 1 bit.

Finally, suppose that there is a position that isn't contacted by the protein. Then we will observe all four bases and the protein picks 4 out of 4 or log2(4/4) = 0 bits.

So we can look at the binding sites and find out the information at different positions in the sites. Shannon picked logarithms so that information adds, and this turns out to be an excellent first-order computation for binding sites. That means we can sum the information across the different positions of a binding site to get a total. This is called Rsequence because it is information computed from DNA sequence data.

In nature things are a little more complicated because the frequencies of bases aren't always 100%, 50% or 25% as in the examples above, but fortunately Shannon's method lets us measure the information in other cases, and the results are consistent with what we found above. If you want to know the details, you can read the Information Theory Primer.
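
For the curious, here is a sketch in Python of the Rsequence calculation on a toy alignment. The four example sites below are made up to match the cases just described (an always-T position, an A-or-G position, and an uncontacted position), and the small-sample correction used in the published calculation is left out for simplicity:

    import math

    def rsequence(sites):
        """Sum of (2 - H) over the positions of the aligned binding sites,
        where H is the Shannon uncertainty of the bases at one position."""
        total = 0.0
        for column in zip(*sites):                    # one position at a time
            h = 0.0
            for base in "ACGT":
                f = column.count(base) / len(column)  # frequency of this base
                if f > 0:
                    h -= f * math.log2(f)             # Shannon uncertainty
            total += 2 - h                            # bits at this position
        return total

    # always T (2 bits) + A or G (1 bit) + uncontacted (0 bits) = 3 bits
    sites = ["TAA", "TAC", "TGG", "TGT"]
    print(rsequence(sites))   # 3.0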

Finding Spots on a Genome versus Patterns at Binding Sites

So we have two measures, Rfrequency and Rsequence. How are they related? In a variety of genetic control systems in nature, they have almost the same value (Schneider et al. 1986). The result makes intuitive sense: the amount of information in the binding sites (Rsequence) is just enough to find the binding sites in the genome (Rfrequency). As in any scientific theory, there are interesting exceptions which are teaching us fascinating new biology, but I'll let you read those stories on my web site.

So now we have an evolutionary problem. How did this situation of Rsequence being approximately equal to Rfrequency come about? Clearly the pattern at a binding site can change more rapidly than the total genome size. Also, the number of genes to be controlled is pretty much fixed by the organism's current circumstances, and that won't change rapidly compared to single mutations. So Rfrequency is more or less fixed over time. Even if the genome were to double in size, Rfrequency would only go up by 1 bit, since log2(2 × genome size / sites) = log2(genome size / sites) + 1, so it is a pretty insensitive measure. This means that Rsequence should evolve towards Rfrequency.

Can we model this process? That question led me to write the Ev computer program. You now have enough background to explore that.

If you are eager, just launch Evj, the new Java version of Ev, recently written by Paul C. Anagnostopoulos:

Go to: http://www.ccrnp.ncifcrf.gov/~toms/paper/ev/evj/ and follow the instructions there to start the Evj model.

Click the 'Run' button in the window that opens up. Look closely and you will see the values of Rfrequency and Rsequence as the evolution proceeds. Be sure to watch until at least generation 700.

There is a Guide to Evj:

http://www.ccrnp.ncifcrf.gov/~toms/paper/ev/evj/evj-guide.html

which explains all that odd flickering and jumping around that happens, and you can read the original scientific paper at

http://www.ccrnp.ncifcrf.gov/~toms/paper/ev/

There are lots more resources on my web site to learn about information theory, bits and how to apply these to molecular biology.

Have fun!

Understanding "Intelligent Design"

Now let's see if the core idea of "intelligent design" holds up. ID proponents frequently compute the probability of a pattern in nature and then claim that it couldn't happen 'by chance'. Of course the error here is that there is replication and selection going on all the time, but they sweep that under the rug.

Let's take the standard Evj run that you will get if you click Run on the Evj window and don't do anything else (except, maybe, crank up the speed). The results are shown in this screenshot:

http://www.ccrnp.ncifcrf.gov/~toms/paper/ev/evj/icons/evj-screenshot.jpg

The genome has 256 positions to which the recognizer can bind and there are 16 sites (marked by green bars), so Rfrequency is log2(256/16) = 4.00 bits. At 10,000 generations Rsequence is 3.19 bits and at 20,000 generations it is 4.18 bits. Just as we observe in nature, Rsequence is close to Rfrequency! This result shows that the Ev model of evolution is reasonable because it gives the observed result. Rsequence fluctuates around 4 bits; let's use that. So the total information that has evolved in the 16 binding sites is 16 × 4 = 64 bits. If I flip a coin 64 times, what are the chances it comes up heads every time? Once every 2^64 = 18446744073709551616 ≈ 1.84 × 10^19 times. If a "reproduction" (a set of 64 coin flips) occurred every second, how long would this take? 1.84 × 10^19 / (60 × 60 × 24 × 365.25) = 5.85 × 10^11 years. That's 585 billion years! By comparison, the universe is known to be only 13.7 billion years old and the earth has only been around 4 to 5 billion years. So, according to the "intelligent design" advocates, there isn't enough time to evolve 16 sites with 4 bits each.
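
You can redo that arithmetic in a couple of lines of Python:

    tries = 2 ** 64                          # one chance in 2^64
    years = tries / (60 * 60 * 24 * 365.25)  # at one try per second
    print(tries)    # 18446744073709551616
    print(years)    # about 5.85e+11, i.e. 585 billion years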

But all you have to do is try the Evj program several times with different initial random seeds to see that 64 bits (or more!) can evolve rather quickly in several hundred generations.

So what's wrong with that often-repeated "intelligent design" argument?

Several things:

  1. It neglects natural selection, which works small step by small step instead of all at once. In other words, it is not legal to multiply those probabilities because, as everyone should recall from high school probability, probabilities only multiply if the events are independent. I call this The AND-Multiplication Error. The next step of evolution (the children) depends on variations (mutations) of the previous step (the parents), so the steps of the evolutionary process are not independent.
  2. It neglects populations. A larger population will produce useful mutations faster. You can try this with the Ev program, or with the sketch below.
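
To see how much difference selection and population make, here is a minimal sketch of cumulative selection in Python. It is not the Ev algorithm itself: the target, population size, mutation rate and selection scheme are all invented for illustration. Even this tiny population finds a 64-bit target in a modest number of generations, not in 2^64 tries:

    import random

    TARGET_BITS = 64
    POP_SIZE = 64          # try changing this: bigger populations finish sooner
    MUTATION_RATE = 0.01   # chance that any one bit flips in a child

    target = [random.randint(0, 1) for _ in range(TARGET_BITS)]
    population = [[random.randint(0, 1) for _ in range(TARGET_BITS)]
                  for _ in range(POP_SIZE)]

    def fitness(genome):
        """Number of bits that match the target pattern."""
        return sum(g == t for g, t in zip(genome, target))

    generation = 0
    while max(fitness(g) for g in population) < TARGET_BITS:
        generation += 1
        # selection: the better half survives and replicates
        population.sort(key=fitness, reverse=True)
        survivors = population[:POP_SIZE // 2]
        # mutation: each child is a copy of a survivor with occasional bit flips
        children = [[1 - b if random.random() < MUTATION_RATE else b for b in g]
                    for g in survivors]
        population = survivors + children

    print("all 64 bits matched at generation", generation)

Because each generation's children start from the previous generation's survivors, the steps are not independent, which is exactly why the AND-multiplication of probabilities fails.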

"Intelligent design" advocates have never, to my knowledge, admitted to this error.

But that error demolishes their claims.

Conclusion

This article does not give examples in which the parameters (genome size, number of sites and mutation rate) match those in nature, but this could be done. However, several factors imply that the appearance of the observed 'complexity' (information) in nature was indeed achieved by evolution.

First, in nature the mutation rate is under the organism's control, so the rate can increase or decrease to gain an advantage. One case that demonstrates this is bacteriophage T4, which has DNA polymerase mutants that decrease the mutation rate. The other relevant case is HIV, which gains an advantage by keeping a high mutation rate and so evades the immune system. Of course the fear that there may be a bird flu pandemic (CNN, 2005 October 11) is based on our understanding of the rapid evolution of influenza. Without understanding Darwinian evolution, we could be hit by a major disaster. By understanding evolution we have a chance to avert the disaster.

Second, as discussed above, there is plenty of time for the basic housekeeping genes to have evolved. With horizontal transfer of genetic material, innovations in one species can end up in other species.

Third, microorganisms are abundant. The Evj model evolves faster when there is a larger population because more genetic material is exposed to mutations.

Still, the details of the time of evolution of various features of our complex heritage need to be worked out. In real science there is always more to do.


Acknowledgments

Thanks to Mark Perakh of The Panda's Thumb for useful suggestions and for encouraging me to write this article, and to Pete Dunkelberg, Russell Durbin, Richard Hoppe, Burt C. Humburg, Erik "12345", and Danielle Needle for useful comments.

References

Boyer, L. A. et al. (2005) Core transcriptional regulatory circuitry in human embryonic stem cells. Cell 122: 947-956.

Schneider, T. D., Stormo, G. D., Gold, L. and Ehrenfeucht, A. (1986) Information content of binding sites on nucleotide sequences. J. Mol. Biol. 188: 415-431.

* * *


Location of this article: http://www.talkreason.org/articles/Nitty.cfm