1 Introduction

For decades, much work in cognitive linguistics, psycholinguistics, and corpus linguistics has placed great emphasis on the frequencies with which words and other linguistic elements occur in language; such work has argued, among other things, that

  • (log) word frequency is correlated with cognitive entrenchment and ‘widespreadedness’ of those words in speakers’ minds and the speech community;
  • therefore, words with different frequencies would behave differently with regard to their
    • acquisition;
    • use;
    • processing;
    • change over time; …

More recently, interest has been revived in the notion of dispersion, i.e. in measures that are probabilistically related to frequency but quantify the evenness/clumpiness/burstiness with which a word is distributed in a corpus. Here we will compute a measure of dispersion for a handful of words in the British Component of the International Corpus of English (ICE-GB). This measure is called DP (for Deviation of Proportions); to compute it, you need the following information:

  • for each corpus part, its relative size: \(expected=\frac{corpus~part~size~in~words}{overall~corpus~size}\);
  • for each corpus part, the relative frequency of the word: \(observed=\frac{freq~of~word~in~part}{freq~of~word~in~corpus}\).

For example, if you have a corpus that consists of three parts A, B, and C, and the word in question occurs in it 6 times such that

  • the corpus parts have the following sizes: 4000, 3000, and 3000 words;
  • the corpus parts contain the word in question 1, 2, and 3 times respectively;

then:

  • \(expected_A=\frac{4000}{10000}=0.4\), \(expected_B=\frac{3000}{10000}=0.3\), \(expected_C=\frac{3000}{10000}=0.3\),
  • \(observed_A=\frac{1}{6}=0.1667\), \(observed_B=\frac{2}{6}=0.333\), and \(observed_C=\frac{3}{6}=0.5\).
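In other words, DP is half the sum of the absolute differences between the observed and the expected proportions (this is exactly what the code below computes); for the toy example:

\(DP=\frac{\sum_{i}|observed_i-expected_i|}{2}=\frac{|0.1667-0.4|+|0.3333-0.3|+|0.5-0.3|}{2}=\frac{0.4667}{2}\approx0.2333\)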

In R:

corpus.part.sizes <- c(4000, 3000, 3000); (expected <- corpus.part.sizes/sum(corpus.part.sizes))
## [1] 0.4 0.3 0.3
(observed <- 1:3/6)
## [1] 0.1666667 0.3333333 0.5000000

The measure DP is then computed as follows:

sum(abs(observed-expected))/2
## [1] 0.2333333

DP falls into the range of [0,1]:

  • values closer to 0 mean that a word is very evenly distributed in a corpus (i.e. very much as one would expect from the corpus part sizes);
  • values closer to 1 mean that a word is very unevenly distributed in a corpus (i.e. often clumped into a very small number of corpus parts).

(Some might find it desirable to actually use 1-DP because the majority of other dispersion measures have that kind of orientation, i.e. one where low and high values mean ‘uneven’ and ‘even distribution’ respectively.)
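To get a feel for these two endpoints, here is a minimal sketch that reuses the expected vector of the three-part toy corpus from above, once with a (made-up) word that is distributed exactly proportionally to the part sizes and once with one that is confined to a single part:

observed.even    <- c(4, 3, 3)/10 # a word occurring proportionally to the part sizes
observed.clumped <- c(0, 0, 6)/6  # a word occurring only in the last (small) part
sum(abs(observed.even   -expected))/2 # 0: as even as possible
sum(abs(observed.clumped-expected))/2 # 0.7: as uneven as these three part sizes allow
1-sum(abs(observed.clumped-expected))/2 # 0.3: the 1-DP orientation mentioned above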

1.1 Step 1: things we need to do

If we want to determine the dispersion of several words in a corpus, we need to

  1. define the words of interest;
  2. for each corpus file, we
    1. load it;
    2. homogenize upper-/lower-case spelling;
    3. extract the words from each corpus file and store them somehow/somewhere;
  3. from that somehow/somewhere we compute the size of each file;
  4. for each word of interest, we retrieve its frequencies in all files and compute DP for it.

1.2 Step 2: functions we will need for that

The above tasks require the following functions:

  1. define the words of interest: c;
  2. for each corpus file: dir or some file-choosing function to define the corpus files and then a for-loop where we
    1. load it: scan
    2. convert it to lower case tolower;
    3. extract the words from each corpus file: gsub or gregexpr/regmatches or exact.matches.2
    4. store the words somehow/somewhere (<-);
  3. from that somehow/somewhere we compute expected, i.e. the size of each file: length;
  4. for each word of interest (with a for-loop)
    1. we compute observed by retrieving its frequencies in all files from the somehow/somewhere: [...]
    2. we compute DP: math functions.

That means, before we enter

  • the 1st for-loop, we need a collector that will collect the corpus;
  • the 2nd for-loop, we need a collector that will collect the DP-values.

1.3 Step 3: pseudocode

Let’s break this down. As you will see, there are multiple ways of doing this; I am outlining only the first one here, and we might develop the others later:

  1. we define a character vector words.of.interest with some words of interest
  2. we define the corpus files, define a collector for the corpus (a list), and start a for-loop over each file;
    1. we load each corpus file;
    2. convert it to lower case;
    3. we extract the words:
      1. either delete everything before and including the { and everything after and including the };
      2. or we pick out the word from between the { and };
    4. we store the words per file in the collector;
  3. from the collector, we compute each file’s relative size
  4. for each word of interest (with a for-loop)
    1. we compute observed by counting its frequencies in all files from the collector;
    2. we compute DP: math functions.

2 Implementation 1

rm(list=ls(all=TRUE))

2.1 Preparation

We define the words of interest:

words.of.interest <- c("the", "on", "with", "of", "for", "at", "forward", "pigmeat", "ozzie",  "accidie")

We unzip and define the corpus files from the International Corpus of English, the British Component:

unzip("files/ICEGB_sampled.zip", # unzip this zip archive
      exdir="files")             # into the files folder/directory

head(corpus.files <- dir( # make corpus.files the content of
   "files/ICEGB_sampled", # this directory
   recursive=TRUE,        # browse all sub-folders
   full.names=TRUE)       # return full names
)                         # end of head
## [1] "files/ICEGB_sampled/S1A-001.COR" "files/ICEGB_sampled/S1A-002.COR"
## [3] "files/ICEGB_sampled/S1A-003.COR" "files/ICEGB_sampled/S1A-004.COR"
## [5] "files/ICEGB_sampled/S1A-005.COR" "files/ICEGB_sampled/S1A-006.COR"

We define a collector structure corpus to collect ‘the whole corpus’; this will be a list that will contain as many character vectors as the corpus has files:

corpus <- vector(               # make corpus a
   mode="list",                 # list
   length=length(corpus.files)) # with 1 part per corpus file
# name each part by the corpus file
names(corpus) <- sub(".COR$", "", basename(corpus.files), perl=TRUE)
# check the structure
head(corpus)
## $`S1A-001`
## NULL
## 
## $`S1A-002`
## NULL
## 
## $`S1A-003`
## NULL
## 
## $`S1A-004`
## NULL
## 
## $`S1A-005`
## NULL
## 
## $`S1A-006`
## NULL

2.2 Processing the corpus files

Then, we loop over each file name and

  • load each file:
system.time({ # measure the time of everything that follows
for(i in seq(corpus.files)) { # access each corpus file
   # we read in each corpus file
   current.corpus.file <- tolower(scan( # make current.corpus.file the lower case version of
      corpus.files[i],  # the i-th/current corpus file
      what=character(), # as a character vector
      sep="\n",         # with linebreaks as separators between vector elements
      quiet=TRUE))      # no feedback about the number of elements read
  • use gregexpr to find everything after a { till before the next }:
   # we find where the words are ...
   locations.of.words <- gregexpr( # make the locations of the words the result of gregexpring
      "(?<={)[^}]+",               # for not-} after a {
      current.corpus.file,         # in current.corpus.file
      perl=TRUE)                   # using Perl-compatible regular expressions
  • use regmatches to retrieve those words:
   # ... and get them
   current.words <- regmatches( # make current.words the matches
      current.corpus.file,      # in current.corpus.file
      locations.of.words)       # that the gregexpr just found
  • unlist this into a (lower-case) vector and save the words of this corpus file into the corresponding list element:
   # we save the words in the file in the list collecting the whole corpus
   corpus[[i]] <-       tolower(unlist(current.words))
  • print a progress report:
   cat("\f", i/length(corpus.files)) # output to the screen the % of files dealt w/ now
} # end of for
}) # approximately 11 seconds
##  0.002 0.004 0.006 0.008 0.01 … 0.994 0.996 0.998 1
##    user  system elapsed 
##  12.164   0.000  12.164
  • and then we check the results:
object.size(corpus) # 28,278,272
## 28278272 bytes
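To see what the regular expression in the loop does, here is a minimal illustration on a single example line (the line is invented purely for illustration; in the corpus files, the words sit between { and }):

example.line <- "<sent> {the} {cat} {sat} </sent>"    # an invented line, for illustration only
regmatches(example.line,                              # retrieve from example.line
   gregexpr("(?<={)[^}]+", example.line, perl=TRUE))  # everything after a { up to the next }
## [[1]]
## [1] "the" "cat" "sat"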

We’re not doing any further cleaning here (for things such as numbers, pauses (“<,>”), unclear words, etc.).
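Purely as a sketch of what such cleaning might look like (this assumes that pause markers end up in the word vectors as “<,>” and that numbers are digit-only strings, which you would want to verify first):

example.words <- corpus[["S1A-001"]]                 # the word vector of one file
example.words <- example.words[example.words!="<,>"] # drop pause markers (if stored like this)
example.words <- example.words[!grepl("^[0-9]+$", example.words)] # drop purely numeric 'words'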

2.3 Computing expected

From this we can now compute expected for all words of interest, i.e. the file sizes: they will be the lengths of each list element:

expected <- sapply( # make expected the result of applying
   corpus,          # to each list element within corpus
   length)          # the function length
# and then convert those into relative lengths/frequencies
head(expected <- expected/sum(expected))
##     S1A-001     S1A-002     S1A-003     S1A-004     S1A-005     S1A-006 
## 0.002047279 0.002018379 0.002133981 0.002135846 0.001980156 0.001919558
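As a quick optional sanity check, the relative sizes should sum to 1 and there should be one value per corpus file:

sum(expected)                          # should be (essentially) 1
length(expected)==length(corpus.files) # should be TRUE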

2.4 Computing observed and DP

Now we can turn to observed for each word of interest and, then, DP. First, we create a collector for the DP-values:

DP.values <- rep(NA, length(words.of.interest)) # create a collector for the DP-values
   names(DP.values) <- words.of.interest        # use the words for its names

Then, we loop over each word of interest (calling it i) and

  • use sapply to look into each list element and use an anonymous function to check how many (sum) words in the list element are (==) i;
for (i in words.of.interest) { # for each word of interest
   observed <- sapply(         # make observed the result of applying to
      corpus,                  # to each list element within corpus
      # an anonymous function that takes the list element & counts how many of its words are the current word i
      function (currlistelem) sum(currlistelem==i))
  • change those frequencies into relative frequencies (by dividing by i’s frequency);
   observed <- observed/sum(observed) # change that into relative lengths/frequencies, too
  • compute and store DP:
   DP.values[i] <- sum(abs(observed-expected))/2 # compute and store DP
}

Let’s check the result:

DP.values # check result
##       the        on      with        of       for        at   forward   pigmeat 
## 0.1557128 0.1704746 0.1745771 0.1978785 0.1988233 0.2018111 0.7386465 0.9966447 
##     ozzie   accidie 
## 0.9976535 0.9977551

3 Implementation 2

3.1 Preparation

In some sense, that was the easiest way to program this, but it is terrible in terms of R programming and memory management. Why do we store the whole corpus the way we did?! Why do we store multiple occurrences of the same word when we can store it once together with its frequency? This is what we change now.
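A small made-up illustration of that point: a vector that repeats the same few types over and over takes up far more space than a frequency table that stores each type only once:

many.tokens <- rep(c("the", "of", "pigmeat"), c(5000, 3000, 2)) # 8002 tokens, but only 3 types
object.size(many.tokens)        # the full token vector ...
object.size(table(many.tokens)) # ... vs a table with 3 frequencies: much smaller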

The collector structure corpus is actually set up in the same way as before, an empty list with an element for each corpus file:

corpus <- vector(               # make corpus a
   mode="list",                 # list
   length=length(corpus.files)) # with 1 part per corpus file
# name each part by the corpus file
names(corpus) <- sub(".COR$", "", basename(corpus.files), perl=TRUE)
# check the structure
head(corpus)
## $`S1A-001`
## NULL
## 
## $`S1A-002`
## NULL
## 
## $`S1A-003`
## NULL
## 
## $`S1A-004`
## NULL
## 
## $`S1A-005`
## NULL
## 
## $`S1A-006`
## NULL

3.2 Processing the corpus files

And the loop etc. is also the same, with one tiny exception: now we don’t save all the words themselves; instead, we save their frequency table:

system.time({ # measure the time of everything that follows
for(i in seq(corpus.files)) { # access each corpus file
   # we read in each corpus file
   current.corpus.file <- tolower(scan( # make current.corpus.file the lower case version of
      corpus.files[i],  # the i-th/current corpus file
      what=character(), # as a character vector
      sep="\n",         # with linebreaks as separators between vector elements
      quiet=TRUE))      # no feedback about the number of elements read

   # we find where the words are ...
   locations.of.words <- gregexpr( # make the locations of the words the result of gregexpring
      "(?<={)[^}]+",               # for not-} after a {
      current.corpus.file,         # in current.corpus.file
      perl=TRUE)                   # using Perl-compatible regular expressions
   # ... and get them
   current.words <- regmatches( # make current.words the matches
      current.corpus.file,      # in current.corpus.file
      locations.of.words)       # that the gregexpr just found

   # we save the frequency list of the words in the file in the list collecting the whole corpus # <---
   corpus[[i]] <- table(tolower(unlist(current.words)))                                          # <---
   # this was                                                                                    # <---
   # corpus[[i]] <-       tolower(unlist(current.words))                                         # <---

   cat("\f", i/length(corpus.files)) # output to the screen the % of files dealt w/ now
} # end of for
}) # approximately 13 seconds
##  0.002 0.004 0.006 0.008 0.01 … 0.994 0.996 0.998 1
##    user  system elapsed 
##  12.863   0.021  12.884
object.size(corpus) # 24,074,424
## 24074424 bytes

This took slightly more time (but only very little), but needs ≈15% less RAM – with bigger data, you have a trade-off decision to make.

3.3 Computing expected

Before we had all the words of a corpus file in each list element, so we used length to get the size of a file. Now we have frequencies in the list elements, so we sum:

expected <- sapply( # make expected the result of applying
   corpus,          # to each frequency table within corpus
   sum)             # the function sum (adding up each file's frequencies = its size in words)
# and then convert those into relative lengths/frequencies
head((expected <- expected/sum(expected)))
##     S1A-001     S1A-002     S1A-003     S1A-004     S1A-005     S1A-006 
## 0.002047279 0.002018379 0.002133981 0.002135846 0.001980156 0.001919558

3.4 Computing observed and DP

Before, we had all the words of a corpus file in each list element, so we used sum(...==i) (within sapply) to count how often each word occurred in each corpus file. Now we have frequencies in the list elements, so we use the subsetting function [ (also within sapply) to retrieve the already computed frequency of each word in each corpus file. However, since the absence of a word returns NA, we then change NA to 0; the rest is as before.
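A tiny made-up example of the NA behavior that motivates that extra step:

tiny.table <- table(c("the", "the", "of")) # a mini frequency table
tiny.table["the"]   # a word that is in the table: returns its frequency (2)
tiny.table["ozzie"] # a word that is not: returns NA, which we then recode to 0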

DP.values <- rep(NA, length(words.of.interest)) # create a collector for the DP-values
   names(DP.values) <- words.of.interest        # use the words for its names
for (i in words.of.interest) { # for each word of interest
   observed <- sapply(         # make observed the result of applying to
      corpus,                  # to each list element within corpus
      "[",                     # the subsetting function                        # <---
      i)                       # namely subsetting for exactly the current word # <---
   observed[is.na(observed)] <- 0 # change NA for unattested words into 0       # <---
   observed <- observed/sum(observed) # change that into relative lengths/frequencies, too
   DP.values[i] <- sum(abs(observed-expected))/2 # compute and store DP
}
DP.values # check result
##       the        on      with        of       for        at   forward   pigmeat 
## 0.1557128 0.1704746 0.1745771 0.1978785 0.1988233 0.2018111 0.7386465 0.9966447 
##     ozzie   accidie 
## 0.9976535 0.9977551

4 Excursus: What if you want dispersions for all words?

This is how you could compute dispersions for all words using the above logic. First, you’d have to make sure that R knows you now want results for all words, i.e. words.of.interest must contain each word type in the corpus. This is how that can be done quickly:

words.of.interest <- sort( # make words.of.interest
   unique(                 # the unique types
      unlist(              # of the unlisted
         sapply(           # list when you apply
            corpus,        # to each frequency table in corpus
            names))))      # the function names

Or we do this with the pipe:

library(magrittr)
words.of.interest <- # make words.of.interest by
   corpus        %>% # taking the list corpus
   sapply(names) %>% # retrieving the names from the tables
   unlist        %>% # unlisting the whole thing
   unique        %>% # get only unique word types
   sort              # & sort them alphabetically

We create our collector again with the new size:

DP.values <- rep(NA, length(words.of.interest)) # create a collector for the DP-values
   names(DP.values) <- words.of.interest        # use the words for its names

Then, we run the above loop, which takes around 6 minutes on a pretty fast AMD Ryzen with 64GB of fast RAM:

system.time({ # measure how long this takes
for (i in words.of.interest) { # for each word of interest
   observed <- sapply(         # make observed the result of applying to
      corpus,                  # to each list element within corpus
      "[",                     # the subsetting function
      i)                       # namely subsetting for exactly the current word
   observed[is.na(observed)] <- 0 # change NA for unattested words into 0
   observed <- observed/sum(observed) # change that into relative lengths/frequencies, too
   DP.values[i] <- sum(abs(observed-expected))/2 # compute and store DP
}}) # approximately 6 minutes
##    user  system elapsed 
## 365.169   0.133 365.311
summary(DP.values) # check result
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1193  0.9954  0.9975  0.9890  0.9979  0.9988
DP.values[c("the", "snort")]
##       the     snort 
## 0.1290733 0.9974214

And now, how would one really do this? One would first create a term-document matrix tdm.

words <- unname(unlist(sapply(corpus, function(af) rep(names(af), af), USE.NAMES=FALSE)))
files <- rep(
   names(corpus),
   sapply(corpus, sum))
tdm <- table(words, files) # tdm[53400:53409, 1:10]
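The only non-obvious step here is rep(names(af), af), which turns a frequency table back into a vector of tokens; a tiny made-up example:

af <- table(c("the", "the", "the", "of")) # a mini frequency table: of=1, the=3
rep(names(af), af) # "of" "the" "the" "the": each type repeated as often as it occurred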

But how big is that thing?!

object.size(tdm) # 124,463,744
## 124463744 bytes

Whoa, this is more than 5 times as big as our smaller version of corpus before … So how long does handling this take now?!

system.time({ # measure how long this takes
DP.values <- apply(
   tdm,
   1,
   function (af) sum(abs((af/sum(af))-expected))/2 )
}) # approximately 1 second
##    user  system elapsed 
##   0.942   0.040   0.982
summary(DP.values) # check result
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1193  0.9954  0.9975  0.9890  0.9979  0.9988
DP.values[c("the", "snort")]
##       the     snort 
## 0.1290733 0.9974214
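If one also wanted to do away with the anonymous function, the same computation could be sketched in fully vectorized form; this is an alternative, not what was done above, and it relies on the columns of tdm (the sorted file names) lining up with the order of expected:

DP.values.vectorized <- rowSums(abs( # for each word (row), sum up the absolute values of
   sweep(                            # the result of taking
      prop.table(tdm, 1),            # each word's relative frequencies across the files and
      2, expected)))/2               # subtracting expected from each column; then halve
all.equal(DP.values, DP.values.vectorized) # should be TRUE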

What a trade-off: using a representation of the data that is roughly five times as large as the smallest one we built, we sped up processing by a factor of several hundred … Which also means that, when people tell you R is slow, it is often just their code that is not optimal …

Housekeeping:

unlink("files/ICEGB_sampled", recursive=TRUE)
sessionInfo()
## R version 4.2.2 Patched (2022-11-10 r83330)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Pop!_OS 22.04 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   compiler 
## [8] base     
## 
## other attached packages:
## [1] magrittr_2.0.3
## 
## loaded via a namespace (and not attached):
##  [1] digest_0.6.31   R6_2.5.1        jsonlite_1.8.4  evaluate_0.20  
##  [5] cachem_1.0.7    rlang_1.0.6     cli_3.6.0       rstudioapi_0.14
##  [9] jquerylib_0.1.4 bslib_0.4.2     rmarkdown_2.20  tools_4.2.2    
## [13] xfun_0.37       yaml_2.3.7      fastmap_1.1.1   htmltools_0.5.4
## [17] knitr_1.42      sass_0.4.5