Corrections to Nelson (2023):
DP(norm) and KLD(norm) are not wrong on pi at all
1 Introduction
In February 2023, the Journal of Quantitative Linguistics published Nelson (2023), a “review article” discussing dispersion measures with an emphasis on my recent work, involving in particular the measures of
- Deviation of Proportions DP/DPnorm (raw and normalized), as discussed, among other places, in Gries (2008, 2020) and Lijffijt & Gries (2012);
- the Kullback-Leibler divergence DKLnorm, as discussed in Gries (2020) and in more detail in my forthcoming book Gries (to appear).
While there is a lot to be discussed there, this small paper is intended as a correction of several aspects of Nelson’s paper, focusing for the most part on two false computations which produce results that make DPnorm and DKLnorm seem unable to correctly quantify the distribution/dispersion of the first one million digits of pi. Here is the relevant quote from Nelson (2023:158):
consider a corpus made from the first 1,000,000 digits of pi. Intuition suggests that if we treat this as a corpus of the digits 0 through 9, every digit should get a score near 0 as there are (famously) no predictable biases in the distribution of the digits of pi. In fact, when this corpus is partitioned into 1,000 bins of 1,000 digits, every digit receives a DPnorm score of 0.495 – nearly the midpoint of the range of the score – while receiving a DKLnorm score of 0.9. Figure 2 shows why. In the left frame of the figure, the heights of the bars show the number of times ‘7’ occurs in a part while the numbers along the bottom show the position (in pi) of the part. The heights of the bars resemble a random walk between 80 and 130, an impression reinforced by the symmetry of their distribution in the right frame. This random noise alone is enough to hold these scores far from any value that might intuitively (and probably empirically) be interpreted as unbiased and even.
As Nelson states correctly, both DP/DPnorm and DKLnorm are scaled such that they fall into the interval [0, 1] and are oriented such that,
- if words in a corpus are distributed evenly/regularly, their values should be at the bottom end of that scale;
- if words in a corpus are distributed unevenly/clumpily, their values should be at the top end of that scale.
Therefore, it indeed follows that, when presented with 1 million ‘words’/digits of pi that are distributed with “no predictable biases” over 1000 ‘corpus files’/bins, both measures should return values that are close to their theoretical minima, which also means that the results Nelson reports would, if true, probably lead anyone never to consider DPnorm and DKLnorm as measures of dispersion again. The main purpose of this correction is to demonstrate what the real results are and how that makes at least this part of Nelson’s argumentation collapse. To make my points as transparent and replicable as possible, this paper already contains the R code with which everyone can replicate the results for themselves – all one needs is an installation of R, the package magrittr, and internet access; readers can access a rendered Quarto document with all content, code, and my session info at https://www.stgries.info/research/2024_STG_OnRNN_JQL.html.
2 The overall distribution of the digits of pi
To demonstrate the falsity of both of Nelson’s results (and some additional small problems), let’s first download the first one million digits of pi, here from a website from the University of Exeter:
str(pi.1m <-
"http://newton.ex.ac.uk/research/qsystems/collabs/pi/pi6.txt" %>%
scan(what="", quiet=TRUE))
chr [1:100101] "3." "\f" "1415926535" "8979323846" "2643383279" ...
Given the way pi is provided there, we need to prepare it a bit to get a vector of 1 million digits:
pi.1m <- pi.1m %>% # make pi.1m the result of
grep( # retaining only the elements
"^\\d+$", ., # that contain only digits
perl=TRUE, # using PCRE
value=TRUE) %>% # returning the strings w/ digits, then
paste(collapse="") %>% # pasting it all into 1 string
strsplit(split="") %>% # splitting up at every digit
unlist # making that a vector
head(pi.1m, 25) # check the result
[1] "1" "4" "1" "5" "9" "2" "6" "5" "3" "5" "8" "9" "7" "9" "3" "2" "3" "8" "4"
[20] "6" "2" "6" "4" "3" "3"
Just to make absolutely sure that these are the right numbers, let’s check these digits against those from another website:
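The download code for this second source is not included in this excerpt, and the website is not named here; a minimal sketch of the kind of call that yields the output below, with an explicitly hypothetical placeholder URL:
pi.1m.check <- readLines( # read the digits as one string from a
"https://example.org/pi-1m.txt") # hypothetical placeholder URL
str(pi.1m.check) # check the result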
chr "3.1415926535897932384626433832795028841971693993751058209749445923078164062862089986280348253421170679821480865"| __truncated__
There, too, we need a little bit of prep to get all the digits into a single vector:
pi.1m.check <- # make pi.1m.check the result of
sub( # replacing
"^..", "", # the 1st two characters by nothing
pi.1m.check, # in pi.1m.check
perl=TRUE) %>% # using PCRE, then
strsplit(split="") %>% # splitting up at every digit
unlist # making that a vector
head(pi.1m.check, 25) # check the result
[1] "1" "4" "1" "5" "9" "2" "6" "5" "3" "5" "8" "9" "7" "9" "3" "2" "3" "8" "4"
[20] "6" "2" "6" "4" "3" "3"
Crucially, both vectors contain the same digits so we can proceed with just pi.1m for the rest of the paper; for one sanity check later, we also compute the frequencies of the digits:
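The code for these two checks is not reproduced in this excerpt; a minimal sketch of what yields the two outputs below (the name pi.1m.freqs is an assumption of mine):
all(pi.1m==pi.1m.check) # same digits in both vectors?
pi.1m.freqs <- table(pi.1m) # frequencies of the 10 digits
pi.1m.freqs # check the result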
[1] TRUE
pi.1m
0 1 2 3 4 5 6 7 8 9
99959 99758 100026 100229 100230 100359 99548 99800 99985 100106
Nelson then “partitioned [this corpus] into 1,000 bins of 1,000 digits” but unfortunately does not say how exactly this was done, so, for our replication, we will have to guess and explore the two most straightforward options. This is the first, which repeats the sequence of 1000 bin names 1000 times, …
bins.times <- formatC( # make bins.times a character vector:
# repeat the sequence 1:1000 a thousand times
rep(1:1000, times=1000),
width=4, # make the elements 4 chars wide
flag="0") # with leading 0s
bins.times[c(1:3, 999998:1000000)]
[1] "0001" "0002" "0003" "0998" "0999" "1000"
… and this is the second, which repeats the name of each of the 1000 bins 1000 times:
bins.each <- formatC( # make bins.each a character vector:
# repeat each of the numbers 1:1000 a thousand times
rep(1:1000, each=1000),
width=4, # make the elements 4 chars wide
flag="0") # with leading 0s
bins.each[c(1:3, 999998:1000000)]
[1] "0001" "0001" "0001" "1000" "1000" "1000"
Many dispersion measures are straightforwardly computable from what, with regular corpora, would be term-document matrices and what, here, are digit-bin matrices. For each way of defining the bins, we compute one version with observed absolute frequencies and one with row proportions (because the latter represent how much of each word, as a proportion of all its uses, is observed in each corpus bin):
# compute the term-document / digit-bin matrices
PI.tdm.times.abs <- # make PI.tdm.times.abs the result of
table( # cross-tabulating
PIsDIGITS =pi.1m, # pi.1m with
CORPUSBINS=bins.times) # the bins
PI.tdm.times.abs[,1:4] # check result
CORPUSBINS
PIsDIGITS 0001 0002 0003 0004
0 104 83 107 96
1 102 100 106 100
2 93 88 102 97
3 102 113 84 87
4 106 110 97 92
5 103 106 97 113
6 100 95 110 103
7 115 103 109 101
8 93 104 96 102
9 82 98 92 109
PI.tdm.times.rel <- # make PI.tdm.times.rel the result of
PI.tdm.times.abs %>% # taking PI.tdm.times.abs and
prop.table(margin=1) # convert to row proportions
PI.tdm.times.rel[,1:4] # check result
CORPUSBINS
PIsDIGITS 0001 0002 0003 0004
0 0.0010404266 0.0008303404 0.0010704389 0.0009603938
1 0.0010224744 0.0010024259 0.0010625714 0.0010024259
2 0.0009297583 0.0008797713 0.0010197349 0.0009697479
3 0.0010176695 0.0011274182 0.0008380808 0.0008680123
4 0.0010575676 0.0010974758 0.0009677741 0.0009178889
5 0.0010263155 0.0010562082 0.0009665302 0.0011259578
6 0.0010045405 0.0009543135 0.0011049946 0.0010346767
7 0.0011523046 0.0010320641 0.0010921844 0.0010120240
8 0.0009301395 0.0010401560 0.0009601440 0.0010201530
9 0.0008191317 0.0009789623 0.0009190258 0.0010888458
# compute the term-document / digit-bin matrices
PI.tdm.each.abs <- # make PI.tdm.each.abs the result of
table( # cross-tabulating
PIsDIGITS =pi.1m, # pi.1m with
CORPUSBINS=bins.each) # the bins
PI.tdm.each.abs[,1:4] # check result
CORPUSBINS
PIsDIGITS 0001 0002 0003 0004
0 93 89 77 103
1 116 96 97 120
2 103 104 96 105
3 102 86 77 103
4 93 102 123 87
5 97 108 110 102
6 94 106 102 96
7 95 102 90 90
8 101 101 108 95
9 106 106 120 99
PI.tdm.each.rel <- # make PI.tdm.each.rel the result of
PI.tdm.each.abs %>% # taking PI.tdm.each.abs and
prop.table(margin=1) # convert to row proportions
PI.tdm.each.rel[,1:4] # check result
CORPUSBINS
PIsDIGITS 0001 0002 0003 0004
0 0.0009303815 0.0008903650 0.0007703158 0.0010304225
1 0.0011628140 0.0009623288 0.0009723531 0.0012029110
2 0.0010297323 0.0010397297 0.0009597505 0.0010497271
3 0.0010176695 0.0008580351 0.0007682407 0.0010276467
4 0.0009278659 0.0010176594 0.0012271775 0.0008680036
5 0.0009665302 0.0010761367 0.0010960651 0.0010163513
6 0.0009442681 0.0010648130 0.0010246313 0.0009643589
7 0.0009519038 0.0010220441 0.0009018036 0.0009018036
8 0.0010101515 0.0010101515 0.0010801620 0.0009501425
9 0.0010588776 0.0010588776 0.0011987293 0.0009889517
For example, the 93 / 0.0009303815 in the top left cells of PI.tdm.each.abs and PI.tdm.each.rel reflect that the 93 instances of ‘0’ in bin “0001” correspond to 0.0009303815 of the 99959 instances of ‘0’ in the 1m digits of pi.
For both dispersion measures, we then also need the expected, or prior, distribution. In this example, where we created the corpus files/bins as above, these are of course just one thousand instances of 0.001:
exp.prior <- # make exp.prior the result of
bins.times %>% # taking the bins,
table %>% # tabulating them, &
prop.table # making them proportions
# same as rep(0.001, 1000)
exp.prior %>% table
.
0.001
1000
Nelson then shows the “noisy but unbiased” dispersion of ‘7’ in pi in his Figure 2 (also p. 158). The right panel of his Figure 2 is fine and replicated here as Figure 1:
hist( # plot a histogram of the
PI.tdm.times.rel["7",]*1E5, # freqs of 7 in the bins of pi
breaks=50, # approximating N's bin width
main="", # suppress a main heading
xlab="Count of '7'", # replicate his ...
ylab="N Bins") # axis labels
However, the left panel of his Figure 2 also seems wrong. Its x-axis ranges from 1 to 1 million, presumably because the x-axis represents the one million digits of pi, although – check the quote at the beginning of this paper – Nelson says “the numbers along the bottom show the position (in pi) of the part” (my emphasis), meaning the x-axis should actually go from 1 to 1000. His x-axis labeling error seems confirmed by the fact that Nelson’s y-axis is labeled “Count of ‘7’” and ranges from 0 to 140, suggesting that those are the frequencies of ‘7’ in every x-axis bin (as he described in the quote), and that the vertical lines shown in the plot are the frequencies of ‘7’ summarized in the histogram. But one cannot have it both ways: only one of the two following plots would be correct:
- plot 1, where the x-axis goes from 1 to 1m (for each word): This plot would be the kind of dispersion plot many apps provide, where each slot on the x-axis is for one word and each vertical line indicates whether the word in that corpus slot is the word in question (like, here, ‘7’). However, in such a plot, the y-axis would go from 0 (for FALSE, the digit is not ‘7’) to 1 (for TRUE, the digit is ‘7’), not from 0 to 140. Or,
- plot 2, where the x-axis goes from 1 to 1000 (for each bin) and which, from his description, seems to be the intended one: This plot would indicate the frequency of ‘7’ in each bin and then, yes, the y-axis needs to go from 0 to 140. In Figure 2, I am showing the plot he meant to show (but with a main heading I already changed to the averages of the results we will obtain below).
plot(type="h",
PI.tdm.times.abs["7",],
main=expression(paste("'7' in ", pi, ": ", italic(DP)[norm], "≈0.0367, ", italic(D)[KLnorm], "≈0.0042")),
xlab="[This time with the correct x-axis]",
ylab="Clount of '7'", ylim=c(0, 140))
But his plot mixes up the axes: it has the x-axis from plot 1 and the y-axis from plot 2. Thus, it pretends that there are 1 million counts of ‘7’, each between 80 and 140, i.e. around 100 million data points rather than the 1 million digits of pi he had us consider.
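For concreteness, this is roughly what plot 1 would have to look like – a sketch of mine, not Nelson’s code, restricted to the first 2000 digits so that individual lines remain distinguishable:
plot(type="h", # one vertical line per digit slot:
(pi.1m=="7")[1:2000] %>% as.numeric, # 1 if the digit is '7', else 0
xlab="Position in pi (first 2000 digits)",
ylab="Digit is '7'", ylim=c(0, 1))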
3 Computing DPnorm correctly
From the digit-bin matrices, we can compute the DP-values in a very R way: by applying to PI.tdm.times.rel (and PI.tdm.each.rel), namely to each row/digit (MARGIN=1), an anonymous function, which makes the row an argument internally called obs.post (for ‘observed/posterior’) and then computes half of the sum of all absolute pairwise differences of obs.post and exp.prior, which is the formula for DP that Nelson also provided:
DP.times <- apply( # make DP.times the result of applying
PI.tdm.times.rel, MARGIN=1, # to PI.tdm.times.rel, namely each row
FUN=\(obs.post) # an anonymous function that computes DP:
sum(abs(obs.post-exp.prior))/2) # sum(|O-E|) / 2 (see Nelson, eq. (1))
DP.each <- apply( # make DP.each the result of applying
PI.tdm.each.rel, MARGIN=1, # to PI.tdm.each.rel, namely each row
FUN=\(obs.post) # an anonymous function that computes DP:
sum(abs(obs.post-exp.prior))/2) # sum(|O-E|) / 2 (see Nelson, eq. (1))
These two sets of DP-values can then be normalized to the final DPnorm-values:¹
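The normalization code is not reproduced in this excerpt; a minimal sketch, assuming the normalization by the theoretical maximum 1 - min(exp.prior) from Lijffijt & Gries (2012) mentioned in note 1 (the names DPnorm.times/DPnorm.each are mine):
DPnorm.times <- # make DPnorm.times the result of
DP.times / (1-min(exp.prior)) # dividing DP by its theoretical max
DPnorm.each <- # make DPnorm.each the result of
DP.each / (1-min(exp.prior)) # dividing DP by its theoretical max
DPnorm.times %>% round(4) # check (1st output below)
DPnorm.each %>% round(4) # check (2nd output below)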
0 1 2 3 4 5 6 7 8 9
0.0381 0.0373 0.0380 0.0384 0.0380 0.0388 0.0371 0.0376 0.0373 0.0394
0 1 2 3 4 5 6 7 8 9
0.0364 0.0383 0.0385 0.0370 0.0377 0.0375 0.0387 0.0358 0.0388 0.0373
This is a result that makes a lot of sense given that, as per Nelson himself, “there are (famously) no predictable biases in the distribution of the digits of pi” – the very low DPnorm values are perfectly compatible with that. However, it certainly is not the case that “every digit receives a DPnorm score of 0.495 – nearly the midpoint of the range of the score”; that is not even close to the truth. The DPnorm results he discusses on p. 158 and in Table 1 on p. 161 are wrong: the histogram in Figure 1 and the plot in Figure 2 are in fact perfectly compatible with a “value that might intuitively (and probably empirically) be interpreted as unbiased and even.”
4 Computing DKLnorm correctly
First and for the record, Nelson makes another small mistake in his definition of DKLnorm. On p. 155 of his article, he defines the Kullback-Leibler Divergence with his equation (4), which is shown here:
\[D_{KLnorm} = D(O||E) = \sum_{i=1}^{N} O_i Log_n \frac{O_i}{E_i} \tag{1}\]
To that, he adds “Where (as above) O is a vector of observed proportions and E is the vector of expected proportions” (also p. 155). That is of course not correct because his equation (4), shown here as Equation 1, does not compute the normalized DKL but a regular, unnormalized DKL, something which Nelson implicitly admits in his very next sentence, where he says “InGries [sic] (2022), a normalization of this measure’s score, \(D_{KLnorm} = 1 - e^{-D_{KL}}\), is proposed to fit the outputs to the unit interval.” In other words, his equation (4) is not already the normalized version; we obtain the normalized one only after also applying \(D_{KLnorm} = 1 - e^{-D_{KL}}\).²
Let us now compute the DKLnorm-values from the digit-bin matrices again. For that, we first define a function that computes DKL, which I here call KLDs; to try and replicate Nelson’s procedure, I am using the natural log in the function, like he did.
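The original definition of KLDs is not reproduced in this excerpt; a minimal sketch that is consistent with its use below (taking a row of observed proportions as its first argument and the expected proportions as its second) would be:
KLDs <- function(obs.post, exp.prior) { # KLD of observed vs. expected:
sum(ifelse(obs.post==0, 0, # treat 0*log(0) as 0, then
obs.post*log(obs.post/exp.prior))) # sum(O * ln(O/E))
}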
Then, we again apply this function to every row of PI.tdm.times.rel and PI.tdm.each.rel:
KLD.times <- apply( # make KLD.times the result of applying
PI.tdm.times.rel, MARGIN=1, # to PI.tdm.times.rel, namely each row
KLDs, exp.prior) # the function computing the KLD
KLD.times %>% round(4) # check the result
0 1 2 3 4 5 6 7 8 9
0.0045 0.0044 0.0045 0.0046 0.0045 0.0048 0.0043 0.0044 0.0044 0.0049
KLD.each <- apply( # make KLD.each the result of applying
PI.tdm.each.rel, MARGIN=1, # to PI.tdm.each.rel, namely each row
KLDs, exp.prior) # the function computing the KLD
KLD.each %>% round(4) # check the result
0 1 2 3 4 5 6 7 8 9
0.0043 0.0044 0.0046 0.0042 0.0045 0.0043 0.0047 0.0040 0.0047 0.0044
Any reader who does not trust my function can double-check the result for, say, ‘7’, with the following code (for PI.tdm.each.rel):
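The verification code itself is not reproduced in this excerpt; one sketch that arrives at the same numbers independently uses entropy::KL.plugin() from the entropy package (loaded according to the session info below):
sapply(rownames(PI.tdm.each.rel), # for each digit 0-9, compute
\(digit) entropy::KL.plugin( # the KLD (natural log by default)
PI.tdm.each.rel[digit,], # of its observed proportions vs.
exp.prior)) %>% # the expected proportions, then
round(4) # round for comparison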
0 1 2 3 4 5 6 7 8 9
0.0043 0.0044 0.0046 0.0042 0.0045 0.0043 0.0047 0.0040 0.0047 0.0044
And then we normalize using the above-mentioned strategy:
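The normalization code is again not shown in this excerpt; a minimal sketch applying the formula \(D_{KLnorm} = 1 - e^{-D_{KL}}\) quoted above (the names KLDnorm.times/KLDnorm.each are mine; note that, for values this close to 0, \(1 - e^{-x} \approx x\), which is why the normalized values below match the raw KLDs to four decimals):
KLDnorm.times <- 1-exp(-KLD.times) # normalize: 1 - e^(-KLD)
KLDnorm.each <- 1-exp(-KLD.each) # normalize: 1 - e^(-KLD)
KLDnorm.times %>% round(4) # check (1st output below)
KLDnorm.each %>% round(4) # check (2nd output below)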
0 1 2 3 4 5 6 7 8 9
0.0045 0.0044 0.0045 0.0046 0.0045 0.0048 0.0043 0.0044 0.0044 0.0049
0 1 2 3 4 5 6 7 8 9
0.0043 0.0044 0.0046 0.0042 0.0045 0.0043 0.0047 0.0040 0.0047 0.0044
As before with DPnorm, this result is exactly what one would expect DKL or DKLnorm to return for the digits of pi – what Nelson (2023:158) himself formulated as an expectation, namely values that are close to 0, because that is the kind of “value that might intuitively (and probably empirically) be interpreted as unbiased and even”. But here the result is, in a sense, not even ‘only’ half off: this time, Nelson’s results are at the exact opposite end of the interval of values that DKLnorm can assume. He claims results of 0.9, but the real values are very close to 0.
5 Summary
In sum, Nelson (2023) commits five mistakes:
- the left panel of Figure 2 misrepresents the data studied;
- equation (4) labels DKL as DKLnorm;
- the results reported for measuring the dispersion of the first 1m digits of pi are false for
  - DPnorm (p. 158);
  - DKLnorm (p. 158);
- the DPnorm results in Table 1 are false.
To reiterate: given the unpatterned first million digits of pi and contra Nelson, both DPnorm and DKLnorm return exactly the results the measures are sensibly claimed to return: values close to 0, indicating even distributions/dispersions. This of course does not prove that either measure works perfectly for all realistic combinations of corpus sizes (in words and in parts) and word frequencies: we definitely need more studies that (i) explore how the dispersion measures we already have ‘react’ to different scenarios, (ii) develop new measures and, importantly, (iii) validate them against other, external data (see, e.g., Biber et al. 2016, Burch et al. 2017, Gries 2022). Given the importance of dispersion as a corpus-linguistic quantity, it would certainly be great if corpus linguistics as a field devoted as much energy to dispersion measures as it has to association measures, but this of course requires that the computations are done correctly and replicably, which was the main point of this short note.
R version 4.3.2 (2023-10-31)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Pop!_OS 22.04 LTS
Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so; LAPACK version 3.10.0
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
time zone: America/Los_Angeles
tzcode source: system (glibc)
attached base packages:
[1] stats graphics grDevices utils datasets compiler methods
[8] base
other attached packages:
[1] STGmisc_1.0 Rcpp_1.0.12 magrittr_2.0.3
loaded via a namespace (and not attached):
[1] digest_0.6.34 fastmap_1.1.1 xfun_0.42 knitr_1.45
[5] htmltools_0.5.7 rmarkdown_2.25 cli_3.6.2 rstudioapi_0.15.0
[9] tools_4.3.2 entropy_1.3.1 evaluate_0.23 yaml_2.3.8
[13] rlang_1.1.3 jsonlite_1.8.8 htmlwidgets_1.6.4
6 References
- Biber, Douglas, Randi Reppen, Erin Schnur, & Romy Ghanem. 2016. On the (non)utility of Juilland’s D to measure lexical dispersion in large corpora. International Journal of Corpus Linguistics 21(4). 439-464.
- Burch, Brent, Jesse Egbert, & Douglas Biber. 2017. Measuring and interpreting lexical dispersion in corpus linguistics. Journal of Research Design and Statistics in Linguistics and Communication Science 3(2). 189-216.
- Gries, Stefan Th. 2008. Dispersions and adjusted frequencies in corpora. International Journal of Corpus Linguistics 13(4). 403-437.
- Gries, Stefan Th. 2020. Analyzing dispersion. In Magali Paquot & Stefan Th. Gries (eds.). A practical handbook of corpus linguistics, 99-118. Berlin & New York: Springer.
- Gries, Stefan Th. 2022. What do (most of) our dispersion measures measure (most)? Dispersion? Journal of Second Language Studies 5(2). 171-205.
- Lijffijt, Jefrey & Stefan Th. Gries. 2012. Correction to “Dispersions and adjusted frequencies in corpora”. International Journal of Corpus Linguistics 17(1). 147-149.
- Nelson, Robert N. 2023. Too Noisy at the Bottom: Why Gries’ (2008, 2020) Dispersion Measures Cannot Identify Unbiased Distributions of Words. Journal of Quantitative Linguistics 30(2). 153-166.
Footnotes
1. In this very specific case, where all digit bins are equally large – 1/1000 – the normalization formulae from Gries (2008) and Lijffijt & Gries (2012) amount to the same results; the R code I give here is the one that would work in the more general case where not all corpus files are equally large.↩︎
2. Also, the paper of mine he cites for DKLnorm as a dispersion measure used the binary log, not the natural log Nelson’s equation uses.↩︎