# Corrections to Nelson (2023): DP(norm) and KLD(norm) are not wrong on pi at all

# 1 Introduction

In February 2023, the *Journal of Quantitative Linguistics* published Nelson (2023), a “review article” discussing dispersion measures with an emphasis on my recent work involving in particular the measures of

- the Deviation of Proportions *DP*/*DP*_{norm} (raw and normalized), as discussed, among other places, in Gries (2008, 2020) and Lijffijt & Gries (2012);
- the Kullback-Leibler divergence *D*_{KLnorm}, as discussed in Gries (2020) and in more detail in my forthcoming book Gries (to appear).

While there is a lot that could be discussed there, this small paper is intended as a correction of several aspects of Nelson’s paper, focusing for the most part on two incorrect computations, which produce results that make *DP*_{norm} and *D*_{KLnorm} seem unable to correctly quantify the distribution/dispersion of the first one million digits of pi. Here is the relevant quote from Nelson (2023:158):

> consider a corpus made from the first 1,000,000 digits of pi. Intuition suggests that if we treat this as a corpus of the digits 0 through 9, every digit should get a score near 0 as there are (famously) no predictable biases in the distribution of the digits of pi. In fact, when this corpus is partitioned into 1,000 bins of 1,000 digits, every digit receives a DP_{norm} score of 0.495 – nearly the midpoint of the range of the score – while receiving a D_{KLnorm} score of 0.9. Figure 2 shows why. In the left frame of the figure, the heights of the bars show the number of times ‘7’ occurs in a part while the numbers along the bottom show the position (in pi) of the part. The heights of the bars resemble a random walk between 80 and 130, an impression reinforced by the symmetry of their distribution in the right frame. This random noise alone is enough to hold these scores far from any value that might intuitively (and probably empirically) be interpreted as unbiased and even.

As Nelson states correctly, both *DP*/*DP*_{norm} and *D*_{KLnorm} are scaled such that they fall into the interval [0, 1] and are oriented such that,

- if words in a corpus are distributed evenly/regularly, their values should be at the bottom end of that scale;
- if words in a corpus are distributed unevenly/clumpily, their values should be at the top end of that scale.

Therefore, it indeed follows that, when presented with 1 million ‘words’/digits of pi that are distributed with “no predictable biases” over 1000 ‘corpus files’/bins, both measures should return values that are close to their theoretical minima, which also means that the results Nelson reports would, if true, probably lead anyone to never consider *DP*_{norm} and *D*_{KLnorm} as measures of dispersion again. The main purpose of this correction is to demonstrate what the real results are and how that makes at least this part of Nelson’s argumentation collapse. To make my points as transparent and replicable as possible, this paper already contains the R code with which everyone can replicate the results for themselves – all one needs is an installation of R, the package `magrittr`, and internet access; readers can access a rendered Quarto document with all content, code, and my session info at <https://www.stgries.info/research/2024_STG_OnRNN_JQL.html>.
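For orientation, the computations below rest on the following definitions (following Gries 2008, Lijffijt & Gries 2012, and Gries 2020; here, *s_i* is the expected proportion of the corpus in file/bin *i*, *v_i* is the observed proportion of all of a word’s occurrences that fall into bin *i*, and *n* is the number of bins):

$$DP = \frac{1}{2}\sum_{i=1}^{n} \left| v_i - s_i \right| \qquad DP_{norm} = \frac{DP}{1 - \min_i s_i}$$

$$D_{KL} = \sum_{i=1}^{n} v_i \log_2 \frac{v_i}{s_i} \qquad D_{KLnorm} = 1 - e^{-D_{KL}}$$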

# 2 The overall distribution of the digits of pi

To demonstrate the falsity of both of Nelson’s results (and some additional small problems), let’s first download the first one million digits of pi, here from a website from the University of Exeter:

```
library(magrittr) # load the package providing the %>% pipe
str(pi.1m <-
   "http://newton.ex.ac.uk/research/qsystems/collabs/pi/pi6.txt" %>%
   scan(what="", quiet=TRUE))
```

` chr [1:100101] "3." "\f" "1415926535" "8979323846" "2643383279" ...`

Given the way pi is provided there, we need to prepare it a bit to get a vector of 1 million digits:

```
pi.1m <- pi.1m %>%        # make pi.1m the result of
   grep(                  # retaining only the elements
      "^\\d+$", .,        # that contain only digits
      perl=TRUE,          # using PCRE
      value=TRUE) %>%     # returning the strings w/ digits, then
   paste(collapse="") %>% # pasting it all into 1 string
   strsplit(split="") %>% # splitting up at every digit
   unlist                 # making that a vector
head(pi.1m, 25)           # check the result
```

```
[1] "1" "4" "1" "5" "9" "2" "6" "5" "3" "5" "8" "9" "7" "9" "3" "2" "3" "8" "4"
[20] "6" "2" "6" "4" "3" "3"
```

Just to make *absolutely* sure that these are the right numbers, let’s check these digits against those from another website:

` chr "3.1415926535897932384626433832795028841971693993751058209749445923078164062862089986280348253421170679821480865"| __truncated__`

There, too, we need a little bit of prep to get all the digits into a single vector:

```
pi.1m.check <-            # make pi.1m.check the result of
   sub(                   # replacing
      "^..", "",          # the 1st two characters by nothing
      pi.1m.check,        # in pi.1m.check
      perl=TRUE) %>%      # using PCRE, then
   strsplit(split="") %>% # splitting up at every digit
   unlist                 # making that a vector
head(pi.1m.check, 25)     # check the result
```

```
[1] "1" "4" "1" "5" "9" "2" "6" "5" "3" "5" "8" "9" "7" "9" "3" "2" "3" "8" "4"
[20] "6" "2" "6" "4" "3" "3"
```

Crucially, both vectors contain the same digits, so we can proceed with just `pi.1m` for the rest of the paper; for one sanity check later, we also compute the frequencies of the digits:

`[1] TRUE`


```
pi.1m
0 1 2 3 4 5 6 7 8 9
99959 99758 100026 100229 100230 100359 99548 99800 99985 100106
```
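The `[1] TRUE` and the frequency table above are produced by code not rendered here; a minimal sketch of the two steps (assuming the two vectors from above; the variable name `pi.1m.freqs` is my own) would be:

```
all(pi.1m == pi.1m.check)   # are the 2 vectors identical? -> [1] TRUE
pi.1m.freqs <-              # make pi.1m.freqs the result of
   table(pi.1m)             # tabulating the frequencies of the 10 digits
pi.1m.freqs                 # check the result
```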

Nelson then “partitioned [this corpus] into 1,000 bins of 1,000 digits” but unfortunately does not say how exactly this was done, so, for our replication, we will have to guess and explore the two most straightforward options. This is the first, which repeats the sequence of 1000 bin names 1000 times, …

```
bins.times <- formatC( # make bins.times a character vector:
   # repeat the sequence 1:1000 a thousand times
   rep(1:1000, times=1000),
   width=4,  # make the elements 4 chars wide
   flag="0") # with leading 0s
bins.times[c(1:3, 999998:1000000)]
```

`[1] "0001" "0002" "0003" "0998" "0999" "1000"`

… and this is the second, which repeats the name of each of the 1000 bins 1000 times:

```
bins.each <- formatC( # make bins.each a character vector:
   # repeat each of the numbers 1:1000 a thousand times
   rep(1:1000, each=1000),
   width=4,  # make the elements 4 chars wide
   flag="0") # with leading 0s
bins.each[c(1:3, 999998:1000000)]
```

`[1] "0001" "0001" "0001" "1000" "1000" "1000"`

Many dispersion measures are straightforwardly computable from what, with regular corpora, would be term-document matrices and what, here, are digit-bin matrices. For each way of defining the bins, we compute one version with observed absolute frequencies and one with row proportions (because the latter represent, for each word, the proportion of all its uses that is observed in each corpus bin):

```
# compute the term-document / digit-bin matrices
PI.tdm.times.abs <- # make PI.tdm.times.abs the result of
table( # cross-tabulating
PIsDIGITS =pi.1m, # pi.1m with
CORPUSBINS=bins.times) # the bins
PI.tdm.times.abs[,1:4] # check result
```

```
CORPUSBINS
PIsDIGITS 0001 0002 0003 0004
0 104 83 107 96
1 102 100 106 100
2 93 88 102 97
3 102 113 84 87
4 106 110 97 92
5 103 106 97 113
6 100 95 110 103
7 115 103 109 101
8 93 104 96 102
9 82 98 92 109
```

```
PI.tdm.times.rel <- # make PI.tdm.times.rel the result of
PI.tdm.times.abs %>% # taking PI.tdm.times.abs and
prop.table(margin=1) # converting into row proportions
PI.tdm.times.rel[,1:4] # check result
```

```
CORPUSBINS
PIsDIGITS 0001 0002 0003 0004
0 0.0010404266 0.0008303404 0.0010704389 0.0009603938
1 0.0010224744 0.0010024259 0.0010625714 0.0010024259
2 0.0009297583 0.0008797713 0.0010197349 0.0009697479
3 0.0010176695 0.0011274182 0.0008380808 0.0008680123
4 0.0010575676 0.0010974758 0.0009677741 0.0009178889
5 0.0010263155 0.0010562082 0.0009665302 0.0011259578
6 0.0010045405 0.0009543135 0.0011049946 0.0010346767
7 0.0011523046 0.0010320641 0.0010921844 0.0010120240
8 0.0009301395 0.0010401560 0.0009601440 0.0010201530
9 0.0008191317 0.0009789623 0.0009190258 0.0010888458
```

```
# compute the term-document / digit-bin matrices
PI.tdm.each.abs <- # make PI.tdm.each.abs the result of
table( # cross-tabulating
PIsDIGITS =pi.1m, # pi.1m with
CORPUSBINS=bins.each) # the bins
PI.tdm.each.abs[,1:4] # check result
```

```
CORPUSBINS
PIsDIGITS 0001 0002 0003 0004
0 93 89 77 103
1 116 96 97 120
2 103 104 96 105
3 102 86 77 103
4 93 102 123 87
5 97 108 110 102
6 94 106 102 96
7 95 102 90 90
8 101 101 108 95
9 106 106 120 99
```

```
PI.tdm.each.rel <- # make PI.tdm.each.rel the result of
PI.tdm.each.abs %>% # taking PI.tdm.each.abs and
prop.table(margin=1) # converting into row proportions
PI.tdm.each.rel[,1:4] # check result
```

```
CORPUSBINS
PIsDIGITS 0001 0002 0003 0004
0 0.0009303815 0.0008903650 0.0007703158 0.0010304225
1 0.0011628140 0.0009623288 0.0009723531 0.0012029110
2 0.0010297323 0.0010397297 0.0009597505 0.0010497271
3 0.0010176695 0.0008580351 0.0007682407 0.0010276467
4 0.0009278659 0.0010176594 0.0012271775 0.0008680036
5 0.0009665302 0.0010761367 0.0010960651 0.0010163513
6 0.0009442681 0.0010648130 0.0010246313 0.0009643589
7 0.0009519038 0.0010220441 0.0009018036 0.0009018036
8 0.0010101515 0.0010101515 0.0010801620 0.0009501425
9 0.0010588776 0.0010588776 0.0011987293 0.0009889517
```

For example, the 93 / 0.0009303815 in the top left cells of `PI.tdm.each.abs` / `PI.tdm.each.rel` reflect that the 93 instances of ‘0’ in bin “0001” correspond to 0.0009303815 of the 99959 instances of ‘0’ in the 1m digits of pi.

For both dispersion measures, we then also need the expected, or prior, distribution. In this example, where we created the corpus files/bins as above, these are of course just one thousand instances of 0.001:

```
exp.prior <-      # make exp.prior the result of
   bins.times %>% # taking the bins,
   table %>%      # tabulating them, &
   prop.table     # making them proportions
# same as rep(0.001, 1000)
exp.prior %>% table
```

```
.
0.001
1000
```
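With the digit-bin proportions and the expected distribution in place, both measures can be computed along the following lines – a sketch based on the definitions in Gries (2008), Lijffijt & Gries (2012), and Gries (2020), not necessarily the exact code used for the results reported below (it assumes the objects created above and that no digit has a zero frequency in any bin, which holds here):

```
DP <- apply(                 # make DP the result of applying
   PI.tdm.times.rel, 1,      # to each row of observed proportions v
   function(v)               # the DP formula:
      sum(abs(v - exp.prior))/2)
DP.norm <- DP /              # normalize as in Lijffijt & Gries (2012)
   (1 - min(exp.prior))
KLD <- apply(                # make KLD the result of applying
   PI.tdm.times.rel, 1,      # to each row of observed proportions v
   function(v)               # the KL-divergence formula:
      sum(v * log2(v / exp.prior)))
KLD.norm <- 1 - exp(-KLD)    # normalize into [0,1] as in Gries (2020)
round(rbind(DP.norm, KLD.norm), 4) # 1 value per digit, all close to 0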

Nelson then shows the “noisy but unbiased” dispersion of ‘7’ in pi in his Figure 2 (also p. 158). The right panel of his Figure 2 is fine and is replicated here as Figure 1:

```
hist(                          # plot a histogram of the
   PI.tdm.times.rel["7",]*1E5, # freqs of 7 in the bins of pi
   breaks=50,                  # approximating N's bin width
   main="",                    # suppress a main heading
   xlab="Count of '7'",        # replicate his ...
   ylab="N Bins")              # axis labels
```

However, the left panel of his Figure 2 also seems wrong already. Its *x*-axis ranges from 1 to 1 million, presumably because the *x*-axis represents the one million digits of pi, even though – check the quote at the beginning of this paper – Nelson says “the numbers along the bottom show the position (in pi) *of the part*” (my emphasis), meaning the *x*-axis should actually go from 1 to 1000. This *x*-axis labeling error seems confirmed by the fact that Nelson’s *y*-axis is labeled “Count of ‘7’” and ranges from 0 to 140, suggesting that those are the frequencies of ‘7’ in every *x*-axis bin (as he described in the quote), and the vertical lines shown in the plot seem to be the frequencies of ‘7’ summarized in the histogram. But that is exactly why the plot is wrong: one cannot have it both ways – only one of the two following plots would be correct:

- plot 1, where the *x*-axis goes from 1 to 1m (for each *word*): This plot would be the kind of dispersion plot many apps provide, where each slot on the *x*-axis is for one word and each vertical line indicates whether the word in that corpus slot is the word in question (like, here, ‘7’). However, in such a plot, the *y*-axis would go from 0 (for `FALSE`, the digit is not ‘7’) to 1 (for `TRUE`, the digit is ‘7’), not from 0 to 140. Or,
- plot 2, where the *x*-axis goes from 1 to 1000 (for each *bin*) and which, from his description, seems to be the intended one: There, the plot would indicate the frequency of ‘7’ in each bin and then, yes, the *y*-axis needs to go from 0 to 140.

In Figure 1, I am showing the plot he meant to show (but with a main heading I already changed to the averages of the results we will obtain below).
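For concreteness, plot 2 could be generated along the following lines (a sketch assuming the digit-bin matrix created above; axis labels are my guesses at Nelson’s):

```
plot(                                 # plot
   as.numeric(                        # as a numeric vector
      PI.tdm.times.abs["7",]),        # the counts of '7' per bin
   type="h",                          # as vertical lines, 1 per bin
   xlab="Part of pi",                 # x-axis: bin number (1-1000)
   ylab="Count of '7'",               # y-axis: freq of '7' in that bin
   ylim=c(0, 140))                    # the range shown in Nelson's plot
```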