Posted February 18, 1998
Abstract. This paper examines whether the
significance test described in the paper Equidistant Letter Sequences in
the Book of Genesis by Witztum, Rips and Rosenberg was carried
out in
accordance with accepted procedure, and whether the distance
metric used accomplishes its stated purpose. In the light of that
examination, it appears that the conclusion of the paper is unfounded.
The paper Equidistant Letter Sequences in the Book of Genesis by Witztum, Rips and Rosenberg [7] is probably the mathematical statistics paper which has enjoyed the widest dissemination ever, having been reproduced in extenso in Drosnin's book [2], of which hundreds of thousands of copies have been sold. Wild claims have been made about it. For example, Drosnin writes: "The original experiment that proved the existence of the Bible code (emphasis added) was published in a U.S. scholarly journal, Statistical Science, a review journal of the Institute of Mathematical Statistics... In the nearly three years since the Rips-Witztum-Rosenberg paper was published, no one has submitted a rebuttal to the math journal." (p. 428). He also writes: "The three referees at the math journal... started out skeptics and ended up believers." In the remainder of this paper we shall refer to it as WRR. Equidistant letter sequences will be referred to as ELS's.
The present paper examines whether the conclusion of the WRR paper is justified.
Tests of significance are also known as tests of hypotheses because they are based on setting up a "null hypothesis" and testing its significance [1, p. 64].
The following steps must be followed when setting up a test of hypothesis. Full details and references are given in Appendix A.
We now examine to what extent WRR have carried out the accepted procedure.
In previous work published on the ELS's (e.g. Michelson [6]) the statistical calculations were based on the following hypothesis H0: each letter in the text is an independent discrete random variable taking one of 22 values, each corresponding to a letter of the Hebrew alphabet. The distribution of each discrete random variable is multinomial, with the probability of each letter being equal to the relative frequency of the letter in the whole text. Since the book of Genesis is 78,064 letters long, the sample space contains 22^78,064 texts, almost all of them gibberish.
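The sampling model behind H0 can be sketched in a few lines of Python. This is purely illustrative: a toy alphabet and text stand in for the 22-letter Hebrew alphabet and the 78,064-letter book of Genesis.

```python
import random
from collections import Counter

def sample_null_text(text):
    """Sample one text from H0: each position is an independent draw from
    the alphabet, with probabilities equal to the empirical letter
    frequencies of the given text."""
    counts = Counter(text)
    letters = sorted(counts)
    weights = [counts[ch] for ch in letters]
    return "".join(random.choices(letters, weights=weights, k=len(text)))

toy = "ABAACABBACAB"          # toy stand-in for the actual text
simulated = sample_null_text(toy)
print(len(simulated))
```

Almost every draw from this model is, of course, gibberish, which is what the quoted remark about the sample space conveys.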
In the final version of WRR, there are still some vestiges of the earlier null hypothesis, e.g. in the calculation of the expected number of ELS's for a word and the "expected distance" between two words (p. 435). However, WRR declare their null hypothesis H'0 to be as follows. They define four overall measures of proximity P1, P2, P3 and P4 between the 32 personality-date pairs considered. For each of the 32! (approximately 2.63 x 10^35) permutations of the personalities they define the corresponding statistic P1. Thus the sample space is defined as the set of 32! permutations of the personalities. The 32! values of P1 are put in ascending order. The null hypothesis is that the P1 of the correct matching is just as likely to occupy any one of the 32! places in this order as any other, and similarly for P2, P3 and P4 (p. 431). They provide the following motivation for the hypothesis: "If the phenomenon under study were due to chance, it would be just as likely that (the correct order of the names) occupies any one of the 32! places... as any other."
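The logic of such a permutation test can be sketched as follows. The `score` function and the data here are toy stand-ins (WRR's actual P statistics are far more elaborate); the point is only the ranking of the identity matching among permuted matchings.

```python
import random

def permutation_rank(score, names, dates, n_perms=999, seed=0):
    """Estimate the rank of the identity matching's score among the scores
    of randomly permuted matchings (smaller score = closer proximity).
    Under the permutation null hypothesis every rank is equally likely."""
    rng = random.Random(seed)
    observed = score(names, dates)
    below = sum(
        score(rng.sample(names, len(names)), dates) < observed
        for _ in range(n_perms)
    )
    return (below + 1) / (n_perms + 1)

# Hypothetical proximity score: total mismatch in string lengths.
score = lambda ns, ds: sum(abs(len(n) - len(d)) for n, d in zip(ns, ds))
names = ["aa", "bbbb", "c"]
dates = ["11", "2222", "3"]
rank = permutation_rank(score, names, dates)
```

In this toy case the identity matching scores 0, so no permutation can beat it and the estimated rank is the smallest possible, 1/(n_perms + 1).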
It must be emphasized at this point that this equiprobability assumption has no frequency interpretation because we are faced here with a unique object, namely the text of Genesis. The word "chance" used by the authors is meaningless in this context. Because of this, the conclusion that "the proximity of ELS's with related meanings in the book of Genesis is not due to chance" is equally meaningless.
It must also be pointed out that equiprobability in a permutation test usually follows by a symmetry argument from the assumption that the underlying variables are independent and identically distributed [1, p. 182] as the original null hypothesis assumed. But unfortunately here the symmetry argument fails for the following reasons:
Thus the proposed null hypothesis H'0 does not have any a priori justification even on the basis of the original hypothesis H0.
The problem of lack of frequency interpretation has been forcefully highlighted by Matheron [5, p. 23] in the following words: "When we deal with a unique phenomenon and a probabilistic model, that is a space (Ω, A, P) which is put in correspondence with this unique reality,... (an anthropomorphic) illusion incites us to say that everything happens, after all, as if the realized event had been 'drawn at random' according to law P in the sample space Ω. But this is a misleadingly clear statement, and the underlying considerations supporting it are particularly inadequate. What is the mechanism of this 'random choice' that we evoke, which celestial croupier, which cosmic dice player is shaking the iron dice box of necessity? This 'random draw' myth, for it is one, (in the pejorative sense), is both useless and gratuitous. It is gratuitous, for even if we assume that a unique random draw had been performed, once and for all, and an element ω0 had been selected, we would in any case have no hope of ever reconstructing either the space Ω or the probability P. For since we are dealing with a unique event, the only source of information we possess is the unique element ω0, which was chosen at first: all the rest, the whole space Ω and the (almost) infinite ocean of possibilities it contained, will have disappeared, erased forever by this unique draw. It is useless, that is, it has no explanatory value, basically for the same reason: for the properties that we can observe in our universe are contained in this unique element ω0, and no longer depend on anything else. Thus we can confidently ignore all the riches which slumbered in the other elements ω, those which have not been chosen. In any case, this element ω0, ours, had doubtlessly (an almost) zero probability of being drawn, and thus our universe is 'almost impossible': nevertheless, it is the only universe given to us, and the only one we can study."
Thus, the null hypothesis advanced by WRR must be considered as "entirely hypothetical" [1, p. 5].
In contrast with the situation with the null hypothesis, WRR carefully avoid making in the paper (or, for that matter, apparently anywhere else) any statement at all about the alternative hypothesis or hypotheses. This is in itself a major flaw of the whole investigation, as is abundantly clear from Appendix A, because in the absence of an alternative hypothesis it is impossible to calculate the probability of a Type II error, and there can therefore be no grounds for rejecting the null hypothesis, no matter how unlikely it might be in the light of the observations.
The WRR paper has however been widely publicized, and those who have made use of it have been far less reticent than the WRR authors. For example, Drosnin [2] concludes: "We do have the scientific proof that some intelligence outside our own does exist, or at least did exist at the time the Bible was written." (p. 50). In "public statements" attributed to two of the WRR authors (E. Rips and D. Witztum) and circulated on the Internet, commenting on Drosnin's book, the above conclusion was not refuted and the expression "Bible codes" was freely used.
It is therefore appropriate to consider as a possible alternative that the book of Genesis was written by an intelligent being who could predict the future and encoded this information in the text. It must of course be noted that this hypothesis does not entail either that this being is benevolent or that it is still in existence.
Some motivation for the research is given in the Introduction of WRR. The approach is illustrated by the example of determining whether a text written in a foreign language is meaningful or not. Pairs of conceptually related words are considered and WRR write: "A strong tendency of such pairs to appear in close proximity indicates that the text might be meaningful." They then declare that the purpose of the research is "to test whether the ELS's in a given text may contain 'hidden information'."
It is difficult to avoid the impression that the above reasoning is just a naive anthropomorphism. On what grounds can we make any reasonable assumptions about the thought processes of a being that can predict the future? Why should a text produced by such a being have properties similar to those of an ordinary text in a foreign language written by a human?
But let us suppose that we agree to test the alternative hypothesis that the encoder has actually put the appropriate dates nearest to each of the names according to some distance measure P*. The alternative hypothesis would then attribute a probability of one to the ordering where the correct match had the smallest distance measure and zero to all others. The optimal critical region would be just that one ordering, the size of the test would be 1/32! and its power would be unity. Unfortunately, the data are not consistent with this hypothesis, since none of the proposed measures falls in the proposed critical region.
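The size of that test is a one-line computation:

```python
import math

# Critical region = the single ordering in which the correct matching has
# the smallest distance measure; its probability under the permutation
# null hypothesis is 1/32!.
size = 1 / math.factorial(32)
print(f"{size:.3e}")
```

The size is about 3.8 x 10^-36, which is why a hit in this region would have been so striking; but, as noted, none of the proposed measures falls in it.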
WRR estimate the rankings of their proposed measures Pi, i = 1,...,4 by a Monte Carlo method, using one million permutations chosen at random. This is perfectly acceptable, but what they do not explicitly state is that even if we take the "best" of their measures of distance, namely P4, there are still an estimated 32! x 3/10^6, i.e. about 7.89 x 10^29, permutations
Book       P1     P2
Genesis    718    2

Table 1: Rank order of P1 and P2 among one million Pi.
where at least two of the correct dates are not the nearest ones to the appropriate names, but whose ranks are smaller than the rank of the correct match! Some motivation for the encoder to choose such a bizarre encoding must be provided by the authors, for otherwise we must conclude that either their measures of distance are wildly off the mark, or else the alternative hypothesis that there is encoded information about the 32 personalities in Genesis obeying their "close proximity" criterion is false. The distance measure used by WRR will be examined more closely in the next section.
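The figure quoted above is simple arithmetic:

```python
import math

# An estimated fraction 3/10^6 of the sampled permutations ranked below the
# correct matching on the P4 measure; scaled up to all 32! permutations:
n_better = math.factorial(32) * 3 / 10**6
print(f"{n_better:.2e}")   # prints 7.89e+29
```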
Another problem is the fact that the experiment was conducted solely on the book of Genesis, while in all the public statements attributed to Rips and Witztum reference is made to "codes in the Torah". But the word "Torah" refers to all five books of the Pentateuch. In fact it is not known when the Torah was divided into five books and by whom. There are also various opinions as to where each book begins and ends. For example, the Gaon of Vilna held that Deuteronomy actually starts from the fifth verse of our present version [6, p. 10]. The authors must explain why they chose to conduct the experiment solely on Genesis. Moreover, the argument provided must be an a priori one, in view of the unfortunate fact that when conducted on the other four books of the Pentateuch, their experiment (carried out on the second list of personalities) failed to show any "significant" effect, at least as far as the P1 and P2 statistics are concerned, as shown in Table 1 [4]. The results given for Genesis in the table are not exactly the same as those appearing in the WRR paper because the experiment was carried out with an updated version of the software, provided by WRR.
In WRR, a "corrected distance" between ELS's is defined on p.435. No mathematical justification for the procedure is given. The "corrected distance" between two words w and w' is supposed to be small when w is "unusually close" to w', and 1 or almost 1 when w is "unusually far" from w'. There are some problems in the exposition of the procedure:
But the really serious problem is the following.
According to WRR, the "uncorrected proximity" Ω(w, w') "very roughly measures the maximum closeness of the more noteworthy appearances of w and w' as ELS's in Genesis - the closer they are, the larger is Ω(w, w')" (p. 435). It can be said in its favour that when applied to pairs of words which occur only once and are all of the same skip and length, it reduces to a monotonically decreasing function of the minimal distance along the text between the letters of the word pairs. This is quite reasonable.
On the other hand, the "corrected distance" has nothing to do with the actual position of the words whose distance is supposed to be measured. It is based on the rank of the "uncorrected distance" between the two original words when compared to the distances of pairs of "perturbed ELS's". But the sets of "perturbed ELS's" are different for different pairs of words, and so do not provide any unified standard of comparison of distance.
For our counterexample we focus on just two pairs of four-letter words, each having just one ELS, so as to keep the calculations simple and transparent. They are denoted by w1, w'1, w2 and w'2. The minimum distance along the text between the letters for the first pair is 90,000 letters and for the second pair 30,000. The "corrected distance" for the first pair turns out to be zero and for the second pair one.
Thus, according to the "corrected distance", w1 and w'1 are "very close", while w2 and w'2 are "very distant". This is totally at variance with any commonsense definition of distance between ELS's and contradicts all the examples given in the Introduction of WRR to motivate the work. In addition, it is clear from the details of the counterexample given in Appendix B that by translating the set of perturbed ELS's without changing the position of the original ELS's we can vary the "corrected distance" from 0 to 1. The only limitations are due to boundary effects: if the two words are very close there may not be enough space to insert perturbed ELS's. Conversely, if the two words are near the beginning and the end of the text there may not be enough space to translate the perturbed ELS's away from the two words.
That the phenomenon described in the counterexample is not artificial can be illustrated by the data of the experiments themselves, as follows: all (matching) pairs of words from the two personality lists that did yield an Ω and a c were collected (source: McKay [4]). There were 320 pairs. As an index of "corrected proximity" we used 1 - c. The range of Ω was approximately 77 to 60,365. Since we are interested in the ranking of proximities, a natural measure of the concordance of the two indices is the rank correlation coefficient [3, p. 494]. The overall value turns out to be 0.603. However, an examination of the scattergram of the ranks indicates that the correlation is mainly due to small and large values of Ω. Indeed, if we select 100 values of Ω from the middle of the range (specifically ranks 101 to 200, corresponding to 1450 < Ω < 3623), the rank correlation between them and the corresponding values of 1 - c falls to 0.088. If the pairs were a random sample, the hypothesis of a zero rank correlation would be accepted at the 20% significance level. The analysis thus supports the conjecture that any association between Ω and 1 - c is not inherent but entirely due to boundary effects.
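The rank correlation used here can be computed with a short routine. This is a minimal sketch assuming no tied values (the published analysis may have handled ties differently); the McKay data themselves are not reproduced here.

```python
def spearman_rho(x, y):
    """Spearman rank correlation via 1 - 6*sum(d_i^2)/(n*(n^2 - 1)),
    valid when there are no ties."""
    n = len(x)
    rank_x = {v: r for r, v in enumerate(sorted(x), start=1)}
    rank_y = {v: r for r, v in enumerate(sorted(y), start=1)}
    d2 = sum((rank_x[a] - rank_y[b]) ** 2 for a, b in zip(x, y))
    return 1 - 6 * d2 / (n * (n * n - 1))

print(spearman_rho([1, 5, 2, 8], [10, 50, 20, 80]))   # concordant: 1.0
print(spearman_rho([1, 5, 2, 8], [80, 20, 50, 10]))   # discordant: -1.0
```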
The following flaws in the WRR paper have been documented:
Until these flaws are remedied, the claims made in the paper must be considered as statistically unfounded.
The following quotations from Kendall and Stuart's Advanced Theory of Statistics [3] set out the accepted procedure for testing hypotheses.
The next step in the testing procedure is the setting up of the critical region. We quote again from Kendall and Stuart [3, p. 171]: "To test any hypothesis on the basis of a (random) sample of observations, we must divide the sample space (i.e. all possible sets of observations) into two regions. If the observed sample point x falls into one of these regions, say w, we shall reject the hypothesis; if x falls into the complementary region, W - w, we shall accept the hypothesis. w is known as the critical region of the test, and W - w is called the acceptance region.
It is necessary to make it clear at the outset that the rather peremptory terms 'reject' and 'accept' which we have used of a hypothesis under test are now conventional usage, to which we shall adhere, and are not intended to imply that any hypothesis is ever finally accepted or rejected in science. If the reader cannot overcome his philosophical dislike of these admittedly inapposite expressions, he will perhaps agree to regard them as code words, 'reject' standing for 'decide that the observations are unfavorable to' and 'accept' for the opposite. We are concerned to investigate procedures which make such decisions with calculable probabilities of error, in a sense to be explained.
Now if we know the probability distribution of the observations under the hypothesis being tested, which we shall call H0, we can determine w so that, given H0, the probability of rejecting H0 (i.e. the probability that x falls in w) is equal to a preassigned α, i.e.

Pr{x ∈ w | H0} = α.    (1)

...The value α is called the size of the test."
We continue the quotation: "Evidently, we can in general find many, and often even an infinity, of sub-regions w of the sample space, all obeying (1). Which of them should we prefer to the others? This is the problem of the theory of testing hypotheses. To put it in everyday terms, which sets of observations are we to regard as favoring, and which as disfavoring, a given hypothesis?
Once the question is put in this way, we are directed to the heart of the problem. For it is of no use whatever to know merely what properties a critical region will have when H0 holds. What happens when some other hypothesis holds? In other words, we cannot say whether a given body of observations favors a given hypothesis unless we know to what alternative(s) this hypothesis is compared. It is perfectly possible for a sample of observations to be a rather 'unlikely' one if the original hypothesis were true; but it may be much more 'unlikely' on another hypothesis. If the situation is such that we are forced to choose one hypothesis or the other, we shall obviously choose the first, notwithstanding the 'unlikeliness' of the observations. The problem of testing a hypothesis is essentially one of choice between it and some other or others. It follows immediately that whether or not we accept the original hypothesis depends crucially upon the alternatives against which it is being tested."
We continue the quotation: "The (above) discussion... leads us to the recognition that a critical region (or, synonymously, a test) must be judged by its properties both when the hypothesis tested is true and when it is false. Thus we may say that the errors made in testing a statistical hypothesis are of two types:

(1) we may wrongly reject H0 when it is true;
(2) we may wrongly accept H0 when it is false.
These are known as Type I and Type II errors respectively. The probability of a Type I error is equal to the size of the critical region used, α. The probability of a Type II error is, of course, a function of the alternative hypothesis (say, H1) considered, and is usually denoted by β. Thus

Pr{x ∈ W - w | H1} = β,    Pr{x ∈ w | H1} = 1 - β.    (3)

This complementary probability, 1 - β, is called the power of the test of the hypothesis H0 against the alternative hypothesis H1. The specification of H1 in the last sentence is essential, since power is a function of H1.
We seek a critical region w such that its power, defined at (3), is as large as possible. Then, in addition to having controlled the probability of a Type I error at α, we shall have minimized the probability of a Type II error, β. This is the fundamental idea, first expressed explicitly by J. Neyman and E.S. Pearson, which underlies the theory.
A critical region, whose power is no smaller than that of any other region of the same size for testing a hypothesis H0 against the alternative H1, is called a best critical region (abbreviated BCR) and a test based on a BCR is called a most powerful... test."
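The size/power machinery in the quotation can be illustrated with a toy binomial test (all numbers here are illustrative and have nothing to do with WRR):

```python
from math import comb

def binom_tail(n, p, k):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# Test H0: p = 0.5 against H1: p = 0.8 with n = 10 coin tosses,
# critical region w = {8 or more heads}.
alpha = binom_tail(10, 0.5, 8)   # size:     P(reject H0 | H0 true) ~ 0.055
power = binom_tail(10, 0.8, 8)   # 1 - beta: P(reject H0 | H1 true) ~ 0.678
print(alpha, power)
```

Note that both the size and the power are computable only because both H0 and H1 are fully specified, which is precisely what is missing in WRR.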
We construct a string of length 110,000 as follows. We use the first 22 letters of the Latin alphabet (written in capitals) to represent the letters of the Hebrew alphabet. We will focus on four "words": w1: ABCD, w'1: EFGH, w2: IJKL, and w'2: MNOP. We use the same notation as WRR for an ELS, namely (n, d, k) for the start, the skip and the length.
We set w1 at (11992, 3, 4), w'1 at (102001, 3, 4), w2 at (41992, 3, 4) and w'2 at (72001, 3, 4).
These will be the only ELS's for these two words. We denote the distance along the text between the last letter of w1 and the first letter of w'1 by U1; here U1 = 90,000. Similarly, we denote the distance along the text between the last letter of w2 and the first letter of w'2 by U2; here U2 = 30,000.
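The distances U1 and U2 can be checked mechanically from the (n, d, k) data above:

```python
def els_positions(n, d, k):
    """Letter positions occupied by an ELS with start n, skip d, length k."""
    return [n + i * d for i in range(k)]

w1  = els_positions(11992, 3, 4)    # [11992, 11995, 11998, 12001]
w1p = els_positions(102001, 3, 4)   # w'1
w2  = els_positions(41992, 3, 4)
w2p = els_positions(72001, 3, 4)    # w'2

U1 = w1p[0] - w1[-1]   # distance along the text: 90000
U2 = w2p[0] - w2[-1]   # 30000
```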
Pert. No.     x     y     z
1            -1     0     1

Table 2: Perturbation triples.
Pert. No.    w1       w'1       U1       w2       w'2      U2
1            11962    102031    90060    42022    71971    29940
2            11932    102061    90120    42052    71941    29880
3            11902    102091    90180    42082    71911    29820
4            11872    102121    90240    42112    71881    29760
5            11842    102151    90300    42142    71851    29700
6            11812    102181    90360    42172    71821    29640
7            11782    102211    90420    42202    71791    29580
8            11752    102241    90480    42232    71761    29520
9            11722    102271    90540    42262    71731    29460

Table 3: Starting points of the perturbed ELS's and the distances U1, U2 along the text.
We now introduce 9 "perturbed" ELS's for each of the four words, using the triples (x, y, z) given in Table 2.
As we are using a skip of 3, some of the perturbation triples used by WRR will not work, namely those for which x + y = -3 and/or x + y + z = -3. We avoid them in our list of triples.

We note that, for all the triples we use, x + y + z = 0. Thus the perturbation only affects the positions of the second and third letters of the words, but not the positions of the first and last.

Table 3 gives the starting points of the perturbed ELS's for the four words and the distances along the text.
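Consistent with the remark that x + y + z = 0 fixes the first and last letters, the perturbed letter positions can be sketched as follows. This assumes one plausible reading of WRR's (x, y, z)-perturbation for a 4-letter word, in which the three consecutive skips become d + x, d + y and d + z.

```python
def perturbed_positions(n, d, x, y, z):
    """Letter positions of the (x, y, z)-perturbed ELS of a 4-letter word:
    consecutive skips are d + x, d + y and d + z (assumed reading of WRR)."""
    return [n, n + d + x, n + 2 * d + x + y, n + 3 * d + x + y + z]

# Perturbation triple No. 1 from Table 2, applied at start 11962:
pos = perturbed_positions(11962, 3, -1, 0, 1)
print(pos)   # first and last letters remain 9 positions apart
```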
The rest of the string is filled in arbitrarily with the 6 remaining letters.
We now calculate the minimal distance L1 between a letter of w1 and a letter of w'1. We first note that since the skip is 3, the only hi's (the nearest integers to |d|/i, with 1/2 rounded up) giving a non-zero contribution to Ω(w1, w'1) are h1 = 3, h2 = 2, h3 = 1 and h4 = 1. Here h'i = hi.

We note that the last letter of w1, (D), has the rank 12001, so that for all non-zero h's it is the first letter of its row. Similarly, the first letter of w'1, (E), has rank 102001, so that it will also appear as the first letter of its row for all non-zero h's. The same considerations apply to w2 and w'2, since L will have rank 42001 and M will have rank 72001. The minimal cylindrical distances between the letters of w1 and w'1, denoted by L1(hi), and those between the letters of w2 and w'2, denoted by L2(hi), are given in Table 4.

i    L1(hi)    L2(hi)    f (= f')
1    30000     10000     1
2    45000     15000     1
3    90000     30000     1
4    90000     30000     1

Table 4: Minimal cylindrical distances between the word pairs for the various hi's.
It can be seen that when the skip is 3, if the distance U along the text between the last letter of the first word w and the first letter of the second word w' is a multiple of 6 (as is the case in our construction), there are four finite cylindrical distances, namely U/3, U/2, U and U.
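The hi's used above can be computed directly (halves rounded up, as stated; the number of indices i that WRR consider is not restated here, so it is left as a parameter):

```python
import math

def cylinder_circumferences(d, i_max=10):
    """h_i = nearest integer to |d|/i for i = 1..i_max, halves rounded up."""
    return [math.floor(abs(d) / i + 0.5) for i in range(1, i_max + 1)]

print(cylinder_circumferences(3)[:4])   # [3, 2, 1, 1] for skip d = 3
```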
Using the WRR definitions, we find for the proximity measures σ(w1, w'1) = 1.5556 x 10^-4 and σ(w2, w'2) = 4.6667 x 10^-4. This is as it should be, since w2 and w'2 are much closer than w1 and w'1 by any reasonable criterion. Moreover, since in our case there will be only one ELS (and one perturbed ELS for each perturbation triple) for every word considered, we will have

Ω(x,y,z)(w, w') = σ(x,y,z)(w, w')    (4)

for every perturbation triple.
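The quoted values of σ can be reproduced from the distances in Table 4, on the simplifying assumption (consistent with f = f' = 1 there) that each cylindrical distance L contributes 1/L once for f and once for f'; this is a reading deduced from the numbers, not WRR's full definition.

```python
# Distances L1(h_i) and L2(h_i) from Table 4.
L1 = [30000, 45000, 90000, 90000]
L2 = [10000, 15000, 30000, 30000]

sigma1 = 2 * sum(1 / L for L in L1)   # sigma(w1, w'1)
sigma2 = 2 * sum(1 / L for L in L2)   # sigma(w2, w'2)
print(f"{sigma1:.4e} {sigma2:.4e}")   # 1.5556e-04 4.6667e-04
```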
We now calculate the proximity measures of the perturbed ELS's. They are given in Table 5. From Table 5 we conclude:
c(w1, w'1) = 0    (5)

c(w2, w'2) = 1    (6)
As pointed out in Section 5 this is totally counterintuitive.
Pert. No.    10^4 x Ω(x,y,z)(w1, w'1)    10^4 x Ω(x,y,z)(w2, w'2)
1            1.5545                      4.6760

Table 5: Proximity measures between perturbed ELS's.
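Equations (5) and (6) then follow from the rank of the unperturbed proximity among the perturbed ones. The sketch below uses a simplified reading of the "corrected distance" that is consistent with (5) and (6), and uses the row-1 values of Table 5 as stand-ins for all nine perturbed triples (the remaining rows are not reproduced here).

```python
def corrected_distance(omega, perturbed):
    """Fraction of perturbed proximities at least as large as the
    unperturbed one (simplified reading of WRR's c)."""
    return sum(p >= omega for p in perturbed) / len(perturbed)

c1 = corrected_distance(1.5556e-4, [1.5545e-4] * 9)   # 0.0: "very close"
c2 = corrected_distance(4.6667e-4, [4.6760e-4] * 9)   # 1.0: "very distant"
print(c1, c2)
```

The pair that is three times closer along the text thus comes out as maximally "distant", which is the counterintuitive outcome the counterexample was designed to exhibit.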
[1] Cox, D.R. and D. V. Hinkley. (1974) Theoretical Statistics. Chapman and Hall.
[2] Drosnin, M. (1997) The Bible Code. Weidenfeld and Nicolson.
[3] Kendall, M.G. and A. Stuart. (1978) The Advanced Theory of Statistics. Vol II. Charles Griffin & Co. Third Edition.
[4] McKay, B. (1997) Private communication.
[5] Matheron, G. (1989) Estimating and Choosing. Springer-Verlag.
[6] Michelson, D. (1987) "Codes in the Torah." B'Or Ha'Torah, No. 6E, pp. 7-39.
[7] Witztum, D., E. Rips, and Y. Rosenberg. (1994) "Equidistant Letter Sequences in the Book of Genesis." Statistical Science, 9, 3, 429-438.