Corrections to Nelson (2023):

DP(norm) and KLD(norm) are not wrong on pi at all

Affiliations

UC Santa Barbara

JLU Giessen

Published

14 Feb 2024

1 Introduction

In February 2023, the Journal of Quantitative Linguistics published Nelson (2023), a “review article” discussing dispersion measures, with an emphasis on my recent work involving, in particular, the measures of

  • Deviation of Proportions DP/DPnorm (raw and normalized), as discussed, among other places, in Gries (2008, 2020) and Lijffijt & Gries (2012);
  • the Kullback-Leibler divergence DKLnorm, as discussed in Gries (2020) and in more detail in my forthcoming book Gries (to appear).

While there is a lot to be discussed there, this small paper is intended as a correction of several aspects of Nelson’s paper, focusing for the most part on two false computations which produce results that make DPnorm and DKLnorm seem unable to correctly quantify the distribution/dispersion of the first one million digits of pi. Here is the relevant quote from Nelson (2023:158):


consider a corpus made from the first 1,000,000 digits of pi. Intuition suggests that if we treat this as a corpus of the digits 0 through 9, every digit should get a score near 0 as there are (famously) no predictable biases in the distribution of the digits of pi. In fact, when this corpus is partitioned into 1,000 bins of 1,000 digits, every digit receives a DPnorm score of 0.495 – nearly the midpoint of the range of the score – while receiving a DKLnorm score of 0.9. Figure 2 shows why. In the left frame of the figure, the heights of the bars show the number of times ‘7’ occurs in a part while the numbers along the bottom show the position (in pi) of the part. The heights of the bars resemble a random walk between 80 and 130, an impression reinforced by the symmetry of their distribution in the right frame. This random noise alone is enough to hold these scores far from any value that might intuitively (and probably empirically) be interpreted as unbiased and even.


As Nelson states correctly, both DP/DPnorm and DKLnorm are scaled such that they fall into the interval [0, 1] and are oriented such that,

  • if words in a corpus are distributed evenly/regularly, their values should be at the bottom end of that scale;
  • if words in a corpus are distributed unevenly/clumpily, their values should be at the top end of that scale.

Therefore, it indeed follows that, when presented with 1 million ‘words’/digits of pi that are distributed with “no predictable biases” over 1000 ‘corpus files’/bins, both measures should return values that are close to their theoretical minima, which also means that the results Nelson reports would, if true, probably lead anyone to never consider DPnorm and DKLnorm as measures of dispersion again. The main purpose of this correction is to demonstrate what the real results are and how that makes at least this part of Nelson’s argumentation collapse. To make my points as transparent and replicable as possible, this paper already contains the R code with which everyone can replicate the results for themselves – all one needs is an installation of R, the package magrittr, and internet access; readers can access a rendered Quarto document with all content, code, and my session info at https://www.stgries.info/research/2024_STG_OnRNN_JQL.html.

rm(list=ls(all=TRUE)); library(magrittr)
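Before turning to the data, here is a minimal self-contained sketch (my own toy example, not Nelson’s data) of how the two measures are computed, with DP/DPnorm as in Gries (2008) and Lijffijt & Gries (2012) and the normalization 1−e^−DKL as in Gries (2020); the log base (here 2) is my assumption:

```r
# toy example (mine, not Nelson's data): a 'word' occurring 10 times,
# spread perfectly evenly over 5 equally large corpus parts
obs.prop <- c(2, 2, 2, 2, 2) / 10 # proportions of the word's uses per part
exp.prop <- rep(1/5, 5)           # expected proportions (= relative part sizes)
DP       <- sum(abs(obs.prop - exp.prop)) / 2 # Deviation of Proportions
DP.norm  <- DP / (1 - min(exp.prop))          # normalization of Lijffijt & Gries (2012)
DKL      <- sum(obs.prop * log2(obs.prop / exp.prop)) # KL divergence (base 2 assumed)
DKL.norm <- 1 - exp(-DKL)                     # normalization as in Gries (2020)
c(DP.norm=DP.norm, DKL.norm=DKL.norm) # both 0 for a perfectly even distribution
```

If, conversely, all 10 occurrences fell into a single part, both values would move toward the top of the scale, which is the behavior the bullet points above describe.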

2 The overall distribution of the digits of pi

To demonstrate the falsity of both of Nelson’s results (and some additional small problems), let’s first download the first one million digits of pi, here from a website from the University of Exeter:

str(pi.1m <-
   "http://newton.ex.ac.uk/research/qsystems/collabs/pi/pi6.txt" %>% 
   scan(what="", quiet=TRUE))
 chr [1:100101] "3." "\f" "1415926535" "8979323846" "2643383279" ...

Given the way pi is provided there, we need to prepare the data a bit to get a vector of 1 million digits:

pi.1m <- pi.1m        %>% # make pi.1m the result of
   grep(                  # retaining only the elements
      "^\\d+$", .,        # that contain only digits
      perl=TRUE,          # using PCRE
      value=TRUE)     %>% # returning the strings w/ digits, then
   paste(collapse="") %>% # pasting it all into 1 string
   strsplit(split="") %>% # splitting up at every digit
   unlist                 # making that a vector
head(pi.1m, 25) # check the result
 [1] "1" "4" "1" "5" "9" "2" "6" "5" "3" "5" "8" "9" "7" "9" "3" "2" "3" "8" "4"
[20] "6" "2" "6" "4" "3" "3"
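As a quick additional sanity check (my addition to the work-flow), one can confirm that the vector really contains one million digits:

```r
length(pi.1m) # should be 1000000
```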

Just to make absolutely sure that these are the right numbers, let’s check these digits against those from another website:

str(pi.1m.check <- scan("https://assets.angio.net/pi1000000.txt", what="", quiet=TRUE))
 chr "3.1415926535897932384626433832795028841971693993751058209749445923078164062862089986280348253421170679821480865"| __truncated__

There, too, we need a little bit of prep to get all the digits into a single vector:

pi.1m.check <-            # make pi.1m.check the result of
   sub(                   # replacing
      "^..", "",          # the 1st two characters by nothing
      pi.1m.check,        # in pi.1m.check
      perl=TRUE)      %>% # using PCRE, then
   strsplit(split="") %>% # splitting up at every digit
   unlist                 # making that a vector
head(pi.1m.check, 25) # check the result
 [1] "1" "4" "1" "5" "9" "2" "6" "5" "3" "5" "8" "9" "7" "9" "3" "2" "3" "8" "4"
[20] "6" "2" "6" "4" "3" "3"

Crucially, both vectors contain the same digits, so we can proceed with just pi.1m for the rest of the paper; for one sanity check later, we also compute the frequencies of the digits:

all(pi.1m==pi.1m.check) # both vectors are identical
[1] TRUE
rm(pi.1m.check); gc() # memory management
          used (Mb) gc trigger (Mb) max used (Mb)
Ncells  659182 35.3    1410772 75.4  1410772 75.4
Vcells 2211269 16.9    8388608 64.0  6846256 52.3
table(pi.1m) # frequencies of digits
pi.1m
     0      1      2      3      4      5      6      7      8      9 
 99959  99758 100026 100229 100230 100359  99548  99800  99985 100106 
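As an additional sanity check that is not part of Nelson’s or my original argument, one can verify that these ten frequencies are compatible with a uniform distribution; the counts below are copied from the table above, so this snippet is self-contained:

```r
# digit frequencies of the first 1m digits of pi (from table(pi.1m) above)
digit.freqs <- c(`0`= 99959, `1`= 99758, `2`=100026, `3`=100229, `4`=100230,
                 `5`=100359, `6`= 99548, `7`= 99800, `8`= 99985, `9`=100106)
chisq.test(digit.freqs) # H0: all 10 digits equally likely; X-squared ≈ 5.51, df=9
```

The test statistic is far from any conventional significance threshold, which is consistent with the “no predictable biases” premise.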

Nelson then “partitioned [this corpus] into 1,000 bins of 1,000 digits” but unfortunately does not say how exactly this was done, so, for our replication, we will have to guess and explore the two most straightforward options. This is the first, which repeats the sequence of 1000 bin names 1000 times, …

bins.times <- formatC( # make bins.times a character vector:
   # repeat the sequence 1:1000 a thousand times
   rep(1:1000, times=1000),
   width=4,            # make the elements 4 chars wide
   flag="0")           # with leading 0s
bins.times[c(1:3, 999998:1000000)]
[1] "0001" "0002" "0003" "0998" "0999" "1000"

… and this is the second, which repeats the name of each of the 1000 bins 1000 times:

bins.each <- formatC( # make bins.each a character vector:
   # repeat each of the numbers 1:1000 a thousand times
   rep(1:1000, each=1000),
   width=4,           # make the elements 4 chars wide
   flag="0")          # with leading 0s
bins.each[c(1:3, 999998:1000000)]
[1] "0001" "0001" "0001" "1000" "1000" "1000"
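Whichever option Nelson used, both define 1000 bins of 1000 digits each and differ only in the order of the bin labels, as a quick self-contained check (my addition) confirms:

```r
bins.times <- formatC(rep(1:1000, times=1000), width=4, flag="0")
bins.each  <- formatC(rep(1:1000, each =1000), width=4, flag="0")
all(table(bins.times) == 1000) # TRUE: every bin name occurs 1000 times
all(table(bins.each)  == 1000) # TRUE: same here
# the two options contain the same bin labels, just in a different order:
identical(sort(bins.times), sort(bins.each)) # TRUE
```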

Many dispersion measures are straightforwardly computable from what, with regular corpora, would be term-document matrices and what, here, are digit-bin matrices. For each way of defining the bins, we compute one version with observed absolute frequencies and one with row proportions (the latter because those proportions represent how much of each word, as a proportion of all its uses, is observed in each corpus bin):

# compute the term-document / digit-bin matrices
PI.tdm.times.abs <- # make PI.tdm.times.abs the result of
   table(                    # cross-tabulating
      PIsDIGITS =pi.1m,      # pi.1m with
      CORPUSBINS=bins.times) # the bins
PI.tdm.times.abs[,1:4] # check result
         CORPUSBINS
PIsDIGITS 0001 0002 0003 0004
        0  104   83  107   96
        1  102  100  106  100
        2   93   88  102   97
        3  102  113   84   87
        4  106  110   97   92
        5  103  106   97  113
        6  100   95  110  103
        7  115  103  109  101
        8   93  104   96  102
        9   82   98   92  109
PI.tdm.times.rel <- # make PI.tdm.times.rel the result of
   PI.tdm.times.abs %>% # taking PI.tdm.times.abs and
   prop.table(margin=1) # convert to row percentages
PI.tdm.times.rel[,1:4] # check result
         CORPUSBINS
PIsDIGITS         0001         0002         0003         0004
        0 0.0010404266 0.0008303404 0.0010704389 0.0009603938
        1 0.0010224744 0.0010024259 0.0010625714 0.0010024259
        2 0.0009297583 0.0008797713 0.0010197349 0.0009697479
        3 0.0010176695 0.0011274182 0.0008380808 0.0008680123
        4 0.0010575676 0.0010974758 0.0009677741 0.0009178889
        5 0.0010263155 0.0010562082 0.0009665302 0.0011259578
        6 0.0010045405 0.0009543135 0.0011049946 0.0010346767
        7 0.0011523046 0.0010320641 0.0010921844 0.0010120240
        8 0.0009301395 0.0010401560 0.0009601440 0.0010201530
        9 0.0008191317 0.0009789623 0.0009190258 0.0010888458
# compute the term-document / digit-bin matrices
PI.tdm.each.abs <- # make PI.tdm.each.abs the result of
   table(                   # cross-tabulating
      PIsDIGITS =pi.1m,     # pi.1m with
      CORPUSBINS=bins.each) # the bins
PI.tdm.each.abs[,1:4] # check result
         CORPUSBINS
PIsDIGITS 0001 0002 0003 0004
        0   93   89   77  103
        1  116   96   97  120
        2  103  104   96  105
        3  102   86   77  103
        4   93  102  123   87
        5   97  108  110  102
        6   94  106  102   96
        7   95  102   90   90
        8  101  101  108   95
        9  106  106  120   99
PI.tdm.each.rel <- # make PI.tdm.each.rel the result of
   PI.tdm.each.abs %>%  # taking PI.tdm.each.abs and
   prop.table(margin=1) # convert to row percentages
PI.tdm.each.rel[,1:4] # check result
         CORPUSBINS
PIsDIGITS         0001         0002         0003         0004
        0 0.0009303815 0.0008903650 0.0007703158 0.0010304225
        1 0.0011628140 0.0009623288 0.0009723531 0.0012029110
        2 0.0010297323 0.0010397297 0.0009597505 0.0010497271
        3 0.0010176695 0.0008580351 0.0007682407 0.0010276467
        4 0.0009278659 0.0010176594 0.0012271775 0.0008680036
        5 0.0009665302 0.0010761367 0.0010960651 0.0010163513
        6 0.0009442681 0.0010648130 0.0010246313 0.0009643589
        7 0.0009519038 0.0010220441 0.0009018036 0.0009018036
        8 0.0010101515 0.0010101515 0.0010801620 0.0009501425
        9 0.0010588776 0.0010588776 0.0011987293 0.0009889517

For example, the 93 / 0.0009303815 in PI.tdm.each.abs’s top left cell reflects that the 93 instances of ‘0’ in bin “0001” correspond to 0.0009303815 of the 99959 instances of ‘0’ in the 1m digits of pi.
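The relation between the absolute and the relative matrices can also be seen in a self-contained miniature (my own toy matrix, not the pi data):

```r
m <- rbind(A=c(2, 3, 5),  # a toy 'term-document matrix':
           B=c(1, 1, 2))  # 2 'words' in 3 'corpus parts'
prop.table(m, margin=1)   # divide each row by its row sum
rowSums(prop.table(m, margin=1)) # each 'word's' proportions sum to 1
```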

For both dispersion measures, we then also need the expected, or prior, distribution. In this example, where we created the corpus files/bins as above, these are of course just one thousand instances of 0.001:

exp.prior <-      # make exp.prior the result of
   bins.times %>% # taking the bins,
   table      %>% # tabulating them, &
   prop.table     # making them proportions
# same as rep(0.001, 1000)
exp.prior %>% table
.
0.001 
 1000 

Nelson then shows the “noisy but unbiased” dispersion of ‘7’ in pi in his Figure 2 (also p. 158). The right panel of his Figure 2 is fine and is replicated here as Figure 1:

hist(                          # plot a histogram of the
   PI.tdm.times.rel["7",]*1E5, # freqs of 7 in the bins of pi
   breaks=50,           # approximating N's bin width
   main="",             # suppress a main heading
   xlab="Count of '7'", # replicate his ...
   ylab="N Bins")       # axis labels

Figure 1: Nelson’s Figure 2 (right)

However, the left panel of his Figure 2 also seems wrong. Its x-axis ranges from 1 to 1 million, presumably because the x-axis represents the one million digits of pi, even though – check the quote at the beginning of this paper – Nelson says that “the numbers along the bottom show the position (in pi) of the part” (my emphasis), meaning the x-axis should actually go from 1 to 1000. This x-axis labeling error seems confirmed by the fact that Nelson’s y-axis is labeled “Count of ‘7’” and ranges from 0 to 140, suggesting that those are the frequencies of ‘7’ in every x-axis bin (as he described in the quote), and the vertical lines shown in the plot seem to be the frequencies of ‘7’ summarized in the histogram. That is why the current plot is wrong: one cannot have it both ways – only one of the two following plots would be correct:

  • plot 1, where the x-axis goes from 1 to 1m (for each word): this plot would be the kind of dispersion plot many apps provide, where each slot on the x-axis is for one word and each vertical line indicates whether the word in that corpus slot is the word in question (like, here, ‘7’). However, in such a plot, the y-axis would go from 0 (for FALSE, the digit is not ‘7’) to 1 (for TRUE, the digit is ‘7’), not from 0 to 140. Or,
  • plot 2, where the x-axis goes from 1 to 1000 (for each bin) and which, from his description, seems to be the intended one: this plot would indicate the frequency of ‘7’ in each bin, and then, yes, the y-axis needs to go from 0 to 140. In Figure 2, I am showing the plot he meant to show (but with a main heading I already changed to the averages of the results we will obtain below).
plot(type="h",
   PI.tdm.times.abs["7",],
   main=expression(paste("'7' in ", pi, ": ", italic(DP)[norm], "≈0.0367, ", italic(D)[KLnorm], "≈0.0042")),
   xlab="[This time with the correct x-axis]",
   ylab="Count of '7'", ylim=c(0, 140))
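For contrast, ‘plot 1’ from the list above – the 0/1-per-token dispersion plot that Nelson’s 1-to-1m x-axis would imply – can be sketched with a small toy corpus (my own illustration, not Nelson’s plot):

```r
set.seed(1)                          # for replicability
toy.corpus <- sample(0:9, 200, TRUE) # 200 random 'digits'
plot(type="h",                       # vertical lines:
   as.numeric(toy.corpus==7),        # 1 where the token is a '7', 0 elsewhere
   ylim=c(0, 1),                     # note the 0-to-1 y-axis
   xlab="Position in toy corpus",
   ylab="Token is '7' (TRUE/FALSE)")
```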