Not just frequency:

Keyness should integrate frequency, association, and dispersion

Author
Affiliations

UC Santa Barbara

JLU Giessen

Published

09 May 2024 08-18-09

rm(list=ls(all=TRUE))
sapply(c("data.table", "magrittr", "Rcpp"),
   library, character.only=TRUE, logical.return=TRUE, quietly=TRUE)

1 Introduction

1.1 General introduction

This is how one would compute G2 for the word type faith in lower-cased Brown D (T) vs. the rest of Brown (R):

(Table1.obs <- matrix(c(43, 34547, 68, 978821), nrow=2,
   dimnames=list(FAITH=c("yes", "no"), CORPUS=c("TARGET", "REFERENCE"))))
     CORPUS
FAITH TARGET REFERENCE
  yes     43        68
  no   34547    978821
temp <- chisq.test(Table1.obs, correct=FALSE) # for expecteds & residuals
c("G-squared"="*"(                                   # multiply
   2 * sign(temp$residuals[1,1]),                    # 2 times the sign of the top-left residual by ...
   sum(Table1.obs * log(Table1.obs/temp$expected)))) # ... this sum
G-squared
 147.0412 
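
The same value can be reproduced without chisq.test, which may make the formula more transparent: the expected frequencies are the row totals times the column totals divided by the grand total, and G2 is twice the sum of the observed frequencies times the natural logs of the observed/expected ratios. The following is a minimal sketch under that reading; the object names obs, expd, and g2 are illustrative and not part of the original code.

```r
# Observed 2x2 table from above: faith (yes/no) by corpus (T/R)
obs <- matrix(c(43, 34547, 68, 978821), nrow=2,
   dimnames=list(FAITH=c("yes", "no"), CORPUS=c("TARGET", "REFERENCE")))
# Expected frequencies under independence: row total * column total / N
expd <- outer(rowSums(obs), colSums(obs)) / sum(obs)
# Signed G2: positive if faith is more frequent in T than expected
g2 <- 2 * sign(obs[1,1] - expd[1,1]) * sum(obs * log(obs / expd))
round(g2, 4)
```

Cells with an observed frequency of 0 would yield NaN in this sum and need separate handling; the G2 function in the case study uses na.rm=TRUE for that.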

1.2 Overview of the present paper

2 Methods

2.1 Data

Let’s create the small example corpus:

x.tar <- data.frame(
   WORD=rep(c("a","b","a","x","z","i","a","c","d","f","g","h","x","a","c",
              "b","f","g","i","z","x","a","c","e","g","i","x","z"),
            c(1,4,1,2,1,1,1,1,1,1,1,3,2,1,1,1,1,1,2,1,2,1,1,2,1,1,3,1)),
   PART=rep(c("tar1","tar2","tar3","tar4"), c(9,10,10,11)))
x.ref <- data.frame(
   WORD=rep(c("a","b","c","a","h","y","x","a","b","d","e","f","e","y","x",
              "a","b","d","e","g","i","y","x","a","b","d","e","g","x"),
            c(1,1,3,1,1,1,2,1,1,1,1,3,1,1,1,1,1,1,1,1,2,1,2,1,1,1,1,1,5)),
   PART=rep(c("ref1","ref2","ref3","ref4"), c(11,10,10,9)))
x <- rbind(x.tar, x.ref)

Let’s compute a word-by-corpus matrix with absolute frequencies:

WORD.by.CORPUS.abs <- with(x, table( # make WORD.by.CORPUS.abs a table
   WORD=WORD,                     # of the word types
   CORP=substr(PART, 1,3))[,2:1]) # the first letters of the corpus parts

Show it as it is shown in the paper:

WORD.by.CORPUS.abs %>% t %>% addmargins
     WORD
CORP   a  b  c  d  e  f  g  h  i  x  y  z Sum
  tar  5  5  3  1  2  2  3  3  4  9  0  3  40
  ref  5  4  3  3  4  3  2  1  2 10  3  0  40
  Sum 10  9  6  4  6  5  5  4  6 19  3  3  80

2.2 The three components of keyness

2.2.1 The frequency component

For the frequency component, we

  • take each word type ever attested in T or R;
  • take its frequency in T (which might be 0), add 1, and compute the binary log of that;
  • just for record-keeping,
    • take its frequency in R (which might be 0), add 1, and compute the binary log of that;
    • take its frequency in T and R combined, add 1 (just for homogeneity), and compute the binary log of that;
  • min-max transform the vectors of logged values (with the helper zero2one);
  • begin to store everything in a data frame results:
results <- data.frame(
   WORD=rownames(WORD.by.CORPUS.abs),  # word types
   FREQTAR=WORD.by.CORPUS.abs[,"tar"], # their freqs in T
   FREQREF=WORD.by.CORPUS.abs[,"ref"], # their freqs in R
   KEYFREQTAR=WORD.by.CORPUS.abs[,"tar"]  %>% "+"(1) %>% log2 %>% zero2one,
   KEYFREQALL=rowSums(WORD.by.CORPUS.abs) %>% "+"(1) %>% log2 %>% zero2one)
results # show results
  WORD FREQTAR FREQREF KEYFREQTAR KEYFREQALL
a    a       5       5  0.7781513  0.6285430
b    b       5       4  0.7781513  0.5693234
c    c       3       3  0.6020600  0.3477088
d    d       1       3  0.3010300  0.1386469
e    e       2       4  0.4771213  0.3477088
f    f       2       3  0.4771213  0.2519296
g    g       3       2  0.6020600  0.2519296
h    h       3       1  0.6020600  0.1386469
i    i       4       2  0.6989700  0.3477088
x    x       9      10  1.0000000  1.0000000
y    y       0       3  0.0000000  0.0000000
z    z       3       0  0.6020600  0.0000000
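
The helper zero2one is not defined in this document; judging from the output above, it behaves like a standard min-max transformation. The sketch below uses a stand-in called zero2one.sketch (a hypothetical name, assumed to be equivalent) and reproduces the KEYFREQTAR column from the frequencies in T.

```r
# Assumed stand-in for zero2one: min-max transformation to [0, 1]
zero2one.sketch <- function (x) { (x - min(x)) / (max(x) - min(x)) }
# Frequencies in T from the table above
freq.tar <- c(a=5, b=5, c=3, d=1, e=2, f=2, g=3, h=3, i=4, x=9, y=0, z=3)
# Add 1, take the binary log, min-max transform
key.freq.tar <- zero2one.sketch(log2(freq.tar + 1))
round(key.freq.tar, 7)
```

Since a min-max transformation is invariant to multiplication by a positive constant, the base of the logarithm does not affect the transformed values.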

2.2.2 The association component

Since we will use the KLD to compute associations between word and corpus, we first define a small function to compute the KLD:

KLD <- function (posterior, prior) {
    if (sum(posterior) > 1) { posterior <- posterior / sum(posterior) }
    if (sum(prior) > 1) { prior <- prior / sum(prior) }
    logged.fractions <- log2(posterior / prior)
       logged.fractions[posterior == 0] <- 0
    contributions.to.KLD <- posterior * logged.fractions
    return(sum(contributions.to.KLD))
}

There are two possible directions of association one could compute:

  • one quantifies how much the distribution of a word over T and R diverges from the corpus sizes, i.e., in a sense,
    • how much each word changes the proportional distribution of the two corpora;
    • how much a word type of interest changes the probability that one is looking at T;
    • how much better you can predict whether you’re looking at T when you know the word;
  • one quantifies how much the presence or absence of a word in T diverges from presence or absence of a word in T and R combined, i.e., in a sense,
    • how much the corpus being T changes the proportional distribution of each word, or, typically,
    • how much T increases the probability of occurrence of a word;
    • how much better you can predict the presence or absence of a word when you know you’re looking at T.
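
To make the two directions concrete, here is a minimal sketch for the word type d of the small example corpus (1 occurrence in T, 3 in R, with both corpora containing 40 tokens). The KLD function is repeated from above so that the block runs on its own; the object names d.corpus.given.word and d.word.given.corpus are illustrative, and the second computation reflects one reading of the second bullet point rather than code from the paper.

```r
KLD <- function (posterior, prior) { # as defined above
   if (sum(posterior) > 1) { posterior <- posterior / sum(posterior) }
   if (sum(prior) > 1) { prior <- prior / sum(prior) }
   logged.fractions <- log2(posterior / prior)
   logged.fractions[posterior == 0] <- 0
   return(sum(posterior * logged.fractions))
}
# Direction 1: distribution of d over T and R vs. the corpus sizes
d.corpus.given.word <- KLD(posterior=c(tar=1, ref=3), prior=c(tar=40, ref=40))
# Direction 2: presence/absence of d in T vs. presence/absence of d in T and R combined
d.word.given.corpus <- KLD(posterior=c(d=1, other=39), prior=c(d=4, other=76))
round(c(d.corpus.given.word, d.word.given.corpus), 4)
```

The two directions answer different questions, so their values are not expected to agree.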

The function allows both directions of computation, but in this paper, we will use the former, for which we create a word-by-corpus matrix WORD.by.CORPUS.rel with row proportions, i.e., with the proportions of each word type’s occurrences that are found in T and in R:

(WORD.by.CORPUS.rel <- WORD.by.CORPUS.abs %>% prop.table(1)) %>% round(4)
    CORP
WORD    tar    ref
   a 0.5000 0.5000
   b 0.5556 0.4444
   c 0.5000 0.5000
   d 0.2500 0.7500
   e 0.3333 0.6667
   f 0.4000 0.6000
   g 0.6000 0.4000
   h 0.7500 0.2500
   i 0.6667 0.3333
   x 0.4737 0.5263
   y 0.0000 1.0000
   z 1.0000 0.0000
results$KEYASSOC <- apply( # make results$KEYASSOC the result of applying
   WORD.by.CORPUS.rel,     # to the table of row proportions
   1,                      # row by row
   KLD, colSums(WORD.by.CORPUS.abs)) # the function KLD w/ the col sums as the prior

After that, we

  • normalize the KLD-values with the odds-to-probabilities transformation (KLD/(1+KLD));
  • multiply the normalized KLD with
    • +1 if the KLD represents an attraction to T;
    • -1 if the KLD represents an attraction to R;
  • divide the normalized KLDs by their max to ‘stretch the value range’ of word types key for
    • T such that it exhausts the complete range of (0, 1];
    • R such that it exhausts the range of [corresponding minimum, 0]:
attr.to.T <- sign("-"( # make attr.to.T the sign of the difference of
   WORD.by.CORPUS.rel[,"tar"],                      # the observed proportion in T
   prop.table(colSums(WORD.by.CORPUS.abs))["tar"])) # the proportion of T in T+R
results$KEYASSOC %<>% KLD.norm %>% "*"(attr.to.T) %>% "/"(max(.))
results
  WORD FREQTAR FREQREF KEYFREQTAR KEYFREQALL     KEYASSOC
a    a       5       5  0.7781513  0.6285430  0.000000000
b    b       5       4  0.7781513  0.5693234  0.017690016
c    c       3       3  0.6020600  0.3477088  0.000000000
d    d       1       3  0.3010300  0.1386469 -0.317520657
e    e       2       4  0.4771213  0.3477088 -0.151065640
f    f       2       3  0.4771213  0.2519296 -0.056458719
g    g       3       2  0.6020600  0.2519296  0.056458719
h    h       3       1  0.6020600  0.1386469  0.317520657
i    i       4       2  0.6989700  0.3477088  0.151065640
x    x       9      10  1.0000000  1.0000000 -0.003990255
y    y       0       3  0.0000000  0.0000000 -1.000000000
z    z       3       0  0.6020600  0.0000000  1.000000000
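
KLD.norm is likewise not defined in this document; the description above suggests that it maps a KLD value to KLD/(1+KLD). The sketch below makes that assumption explicit with a stand-in called KLD.norm.sketch (a hypothetical name) and walks through the three steps for four word types, starting from their row proportions; because T and R are equally large here, the prior is (0.5, 0.5).

```r
# Assumed stand-in for KLD.norm: odds-to-probabilities transformation
KLD.norm.sketch <- function (kld) { kld / (1 + kld) }
# Compact KLD for proportions (0 * log2(0) is treated as 0)
kld <- function (post, prior) { sum(ifelse(post == 0, 0, post * log2(post / prior))) }
# Row proportions (tar, ref) of four word types from the table above
props <- rbind(a=c(0.5, 0.5), d=c(0.25, 0.75), y=c(0, 1), z=c(1, 0))
prior <- c(0.5, 0.5)
klds   <- apply(props, 1, kld, prior=prior)   # raw KLDs
normed <- KLD.norm.sketch(klds)               # step 1: normalize
signed <- normed * sign(props[,1] - prior[1]) # step 2: +1 for T, -1 for R
key.assoc <- signed / max(signed)             # step 3: stretch by the max
round(key.assoc, 4)
```

The four word types include z, which has the maximal normalized KLD in the full table, so the stretched values agree with the KEYASSOC column above.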

2.2.3 The dispersion component

For dispersion, we first need to compute each word type’s dispersion in each of the two corpora T and R. Here are the relevant steps for T: we

  • compute two word-by-part matrices for T:
    • one with absolute frequencies (which will mostly be used to get the prior: the corpus part sizes);
    • one with relative frequencies (the posteriors: namely the proportions of word types across the corpus parts);
  • compute the KLD for each word type;
  • normalize the KLD-values with the odds-to-probabilities transformation (KLD/(1+KLD));
  • min-max transform the values;
  • subtract them from 1:
WORD.by.PART.abs <- with(x.tar, table(WORD, PART))  # compute word-by-part matrix (abs. freqs)
WORD.by.PART.rel <- prop.table(WORD.by.PART.abs, 1) # compute word-by-part matrix (row props.)
KEYDISPTAR <- apply(                   # make KEYDISPTAR the result of applying to
   WORD.by.PART.rel,                   # WORD.by.PART.rel (for T)
   1,                                  # row by row
   KLD, colSums(WORD.by.PART.abs)) %>% # the function KLD w/ the col sums as the prior
   KLD.norm %>% zero2one %>% "-"(1,.)

Then we do the same for R:

WORD.by.PART.abs <- with(x.ref, table(WORD, PART))  # compute word-by-part matrix (abs. freqs)
WORD.by.PART.rel <- prop.table(WORD.by.PART.abs, 1) # compute word-by-part matrix (row props.)
KEYDISPREF <- apply(                   # make KEYDISPREF the result of applying to
   WORD.by.PART.rel,                   # WORD.by.PART.rel (for R)
   1,                                  # row by row
   KLD, colSums(WORD.by.PART.abs)) %>% # the function KLD w/ the col sums as the prior
   KLD.norm %>% zero2one %>% "-"(1,.)
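
The following self-contained sketch retraces these steps for T with stand-ins for the two helpers that are not defined in this document: zero2one.sketch for the min-max transformation and KLD.norm.sketch for KLD/(1+KLD). Both names and definitions are assumptions. The sketch rebuilds x.tar from Section 2.1 and uses a compact KLD variant.

```r
zero2one.sketch <- function (x) { (x - min(x)) / (max(x) - min(x)) } # assumed min-max
KLD.norm.sketch <- function (kld) { kld / (1 + kld) }                # assumed normalization
kld <- function (post, prior) { # compact KLD; 0 * log2(0) counts as 0
   prior <- prior / sum(prior)
   sum(ifelse(post == 0, 0, post * log2(post / prior)))
}
x.tar <- data.frame( # the target corpus from Section 2.1
   WORD=rep(c("a","b","a","x","z","i","a","c","d","f","g","h","x","a","c",
              "b","f","g","i","z","x","a","c","e","g","i","x","z"),
            c(1,4,1,2,1,1,1,1,1,1,1,3,2,1,1,1,1,1,2,1,2,1,1,2,1,1,3,1)),
   PART=rep(c("tar1","tar2","tar3","tar4"), c(9,10,10,11)))
abs.freqs <- with(x.tar, table(WORD, PART)) # word-by-part matrix (abs. freqs)
rel.freqs <- prop.table(abs.freqs, 1)       # row proportions (the posteriors)
klds <- apply(rel.freqs, 1, kld, prior=colSums(abs.freqs)) # prior: part sizes
key.disp.tar <- 1 - zero2one.sketch(KLD.norm.sketch(klds)) # high = evenly dispersed
round(key.disp.tar, 4)
```

Word types that occur in only one part of T and whose part is not the largest one (here d and h, both restricted to tar2) receive the lowest value of 0; y is absent from this vector because it is unattested in T.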

To add all these results to results, we first add two placeholder columns KEYDISPTAR and KEYDISPREF, which for now only contain 0s:

results$KEYDISPREF <- results$KEYDISPTAR <- rep(0, nrow(WORD.by.CORPUS.abs))

Then, we insert the computed dispersion values for all word types attested in T and/or R:

results$KEYDISPTAR[match(names(KEYDISPTAR), results$WORD)] <- KEYDISPTAR
results$KEYDISPREF[match(names(KEYDISPREF), results$WORD)] <- KEYDISPREF

After that, we compute for each word type the difference between its dispersion in T and its dispersion in R and again ‘stretch’ these differences by dividing them by their max. As a result,

  • high values will represent word types that are evenly distributed in T but clumpily distributed or unattested in R (i.e. words that are dispersionally key for T);
  • low values will represent word types that are evenly distributed in R but clumpily distributed or unattested in T (i.e. words that are dispersionally key for R).
results$KEYDISP <- (results$KEYDISPTAR - results$KEYDISPREF) %>% "/"(max(.))
results
  WORD FREQTAR FREQREF KEYFREQTAR KEYFREQALL     KEYASSOC KEYDISPTAR KEYDISPREF     KEYDISP
a    a       5       5  0.7781513  0.6285430  0.000000000 1.00000000 0.47246210  0.77953532
b    b       5       4  0.7781513  0.5693234  0.017690016 0.14721358 1.00000000 -1.26015049
c    c       3       3  0.6020600  0.3477088  0.000000000 0.70088274 0.02414894  1.00000000
d    d       1       3  0.3010300  0.1386469 -0.317520657 0.00000000 0.52624922 -0.77763106
e    e       2       4  0.4771213  0.3477088 -0.151065640 0.02826716 0.47788111 -0.66438821
f    f       2       3  0.4771213  0.2519296 -0.056458719 0.29422753 0.00000000  0.43477587
g    g       3       2  0.6020600  0.2519296  0.056458719 0.70088274 0.22375505  0.70504487
h    h       3       1  0.6020600  0.1386469  0.317520657 0.00000000 0.02414894 -0.03568455
i    i       4       2  0.6989700  0.3477088  0.151065640 0.61605922 0.00000000  0.91034203
x    x       9      10  1.0000000  1.0000000 -0.003990255 0.96546219 0.66863773  0.43861333
y    y       0       3  0.0000000  0.0000000 -1.000000000 0.00000000 0.59877181 -0.88479667
z    z       3       0  0.6020600  0.0000000  1.000000000 0.65487307 0.00000000  0.96769672
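
As a quick check of this last step, the sketch below recomputes KEYDISP for four word types from the KEYDISPTAR and KEYDISPREF values printed above (the vector names are illustrative). The divisor has to be the maximum over all word types, which is the difference for c; since only the positive side is stretched to 1, values for word types that are dispersionally key for R can fall below -1, as for b.

```r
# Values copied from the results table above
disp.tar <- c(b=0.14721358, c=0.70088274, y=0.00000000, z=0.65487307)
disp.ref <- c(b=1.00000000, c=0.02414894, y=0.59877181, z=0.00000000)
disp.diff <- disp.tar - disp.ref       # dispersion in T minus dispersion in R
key.disp <- disp.diff / max(disp.diff) # stretch: the max (here: c) becomes 1
round(key.disp, 4)
```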

2.3 What to do with those values

2.3.1 Keeping dimensions separate

We can first add for each word type how many of the two dimensions of association and dispersion represent the word as key for T:

results$KEYONHOWMANY <- "+"(        # add
   pmax(0, sign(results$KEYASSOC)), # 1 if KEYASSOC is >0
   pmax(0, sign(results$KEYDISP)))  # 1 if KEYDISP is >0
results
  WORD FREQTAR FREQREF KEYFREQTAR KEYFREQALL     KEYASSOC KEYDISPTAR KEYDISPREF     KEYDISP KEYONHOWMANY
a    a       5       5  0.7781513  0.6285430  0.000000000 1.00000000 0.47246210  0.77953532            1
b    b       5       4  0.7781513  0.5693234  0.017690016 0.14721358 1.00000000 -1.26015049            1
c    c       3       3  0.6020600  0.3477088  0.000000000 0.70088274 0.02414894  1.00000000            1
d    d       1       3  0.3010300  0.1386469 -0.317520657 0.00000000 0.52624922 -0.77763106            0
e    e       2       4  0.4771213  0.3477088 -0.151065640 0.02826716 0.47788111 -0.66438821            0
f    f       2       3  0.4771213  0.2519296 -0.056458719 0.29422753 0.00000000  0.43477587            1
g    g       3       2  0.6020600  0.2519296  0.056458719 0.70088274 0.22375505  0.70504487            2
h    h       3       1  0.6020600  0.1386469  0.317520657 0.00000000 0.02414894 -0.03568455            1
i    i       4       2  0.6989700  0.3477088  0.151065640 0.61605922 0.00000000  0.91034203            2
x    x       9      10  1.0000000  1.0000000 -0.003990255 0.96546219 0.66863773  0.43861333            1
y    y       0       3  0.0000000  0.0000000 -1.000000000 0.00000000 0.59877181 -0.88479667            0
z    z       3       0  0.6020600  0.0000000  1.000000000 0.65487307 0.00000000  0.96769672            2
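
The pmax(0, sign(...)) idiom maps positive values to 1 and everything else (including exact 0s) to 0. Here is a minimal sketch with the values of four word types from the table above, next to an equivalent formulation with logical comparisons (the object names are illustrative):

```r
key.assoc <- c(a=0, b=0.017690016, d=-0.317520657, z=1)
key.disp  <- c(a=0.77953532, b=-1.26015049, d=-0.77763106, z=0.96769672)
# As in the text: 1 per dimension on which the value is > 0
via.sign <- pmax(0, sign(key.assoc)) + pmax(0, sign(key.disp))
# Equivalent: count the TRUEs of the two comparisons
via.comparison <- (key.assoc > 0) + (key.disp > 0)
rbind(via.sign, via.comparison)
```

A word type such as a, whose association value is exactly 0, therefore counts as key on the dispersion dimension only.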

2.3.2 Amalgamations

The first two amalgamations can be implemented straightforwardly as follows:

  • the former weighs the association component by the frequency of the word in T (by multiplication) and adds it to the dispersion component;
  • the latter weighs both the association and dispersion components by frequency (by multiplication):
results$AMALGAM1 <- with(results, "+"( # add
   "*"(KEYASSOC,    # the product of KEYASSOC
       KEYFREQTAR), # and KEYFREQTAR
   KEYDISP))        # to KEYDISP
results$AMALGAM2 <- with(results, "*"( # multiply
   KEYASSOC + KEYDISP, # the sum of KEYASSOC & KEYDISP
   KEYFREQTAR))        # by KEYFREQTAR
results
  WORD FREQTAR FREQREF KEYFREQTAR KEYFREQALL     KEYASSOC KEYDISPTAR KEYDISPREF     KEYDISP KEYONHOWMANY
a    a       5       5  0.7781513  0.6285430  0.000000000 1.00000000 0.47246210  0.77953532            1
b    b       5       4  0.7781513  0.5693234  0.017690016 0.14721358 1.00000000 -1.26015049            1
c    c       3       3  0.6020600  0.3477088  0.000000000 0.70088274 0.02414894  1.00000000            1
d    d       1       3  0.3010300  0.1386469 -0.317520657 0.00000000 0.52624922 -0.77763106            0
e    e       2       4  0.4771213  0.3477088 -0.151065640 0.02826716 0.47788111 -0.66438821            0
f    f       2       3  0.4771213  0.2519296 -0.056458719 0.29422753 0.00000000  0.43477587            1
g    g       3       2  0.6020600  0.2519296  0.056458719 0.70088274 0.22375505  0.70504487            2
h    h       3       1  0.6020600  0.1386469  0.317520657 0.00000000 0.02414894 -0.03568455            1
i    i       4       2  0.6989700  0.3477088  0.151065640 0.61605922 0.00000000  0.91034203            2
x    x       9      10  1.0000000  1.0000000 -0.003990255 0.96546219 0.66863773  0.43861333            1
y    y       0       3  0.0000000  0.0000000 -1.000000000 0.00000000 0.59877181 -0.88479667            0
z    z       3       0  0.6020600  0.0000000  1.000000000 0.65487307 0.00000000  0.96769672            2
    AMALGAM1   AMALGAM2
a  0.7795353  0.6065964
b -1.2463850 -0.9668222
c  1.0000000  0.6020600
d -0.8732143 -0.3296735
e -0.7364648 -0.3890704
f  0.4078382  0.1805032
g  0.7390364  0.4584708
h  0.1554819  0.1696822
i  1.0159324  0.7418921
x  0.4346231  0.4346231
y -0.8847967  0.0000000
z  1.5697567  1.1846715
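
The two amalgamations differ most visibly for word types with a low frequency component. Here is a small sketch with three word types, using values from the table above (the data frame name toy is illustrative):

```r
toy <- data.frame(row.names=c("h", "y", "z"),
   KEYFREQTAR=c(0.6020600, 0, 0.6020600),
   KEYASSOC  =c(0.317520657, -1, 1),
   KEYDISP   =c(-0.03568455, -0.88479667, 0.96769672))
# Amalgamation 1: frequency weights only the association component
toy$AMALGAM1 <- with(toy, KEYASSOC * KEYFREQTAR + KEYDISP)
# Amalgamation 2: frequency weights both components
toy$AMALGAM2 <- with(toy, (KEYASSOC + KEYDISP) * KEYFREQTAR)
round(toy[,c("AMALGAM1", "AMALGAM2")], 4)
```

For y, which is unattested in T, the second amalgamation is 0 because its frequency component is 0, whereas the first still reflects its negative dispersion value.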

The Euclidean distance of each word type from the origin of the three-dimensional space can be computed as follows:

results$EUCLID <- with(results, sqrt(
   KEYFREQTAR^2 +
   KEYASSOC^2   +
   KEYDISP^2))

Finally, the Mahalanobis distance could be computed with the standard R function as follows:

results$MAHAL <- results[,c("KEYASSOC", "KEYDISP")] %>%
   mahalanobis(x=., center=colMeans(.), cov=cov(.))
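
The function mahalanobis in the stats package returns squared distances, i.e. (x - m)' S^-1 (x - m) for a center m and a covariance matrix S. The following sketch, which uses the KEYASSOC and KEYDISP values printed above, compares the built-in function with that formula computed by hand; the object names are illustrative.

```r
comps <- cbind(
   KEYASSOC=c(a=0, b=0.017690016, c=0, d=-0.317520657, e=-0.151065640,
              f=-0.056458719, g=0.056458719, h=0.317520657, i=0.151065640,
              x=-0.003990255, y=-1, z=1),
   KEYDISP=c(0.77953532, -1.26015049, 1, -0.77763106, -0.66438821, 0.43477587,
             0.70504487, -0.03568455, 0.91034203, 0.43861333, -0.88479667, 0.96769672))
built.in <- mahalanobis(comps, center=colMeans(comps), cov=cov(comps))
# By hand: center the columns, then compute the quadratic form per row
centered <- sweep(comps, 2, colMeans(comps))
by.hand <- rowSums((centered %*% solve(cov(comps))) * centered)
round(cbind(built.in, by.hand), 4)
```

If distances on the original scale are preferred, one can take the square root of these values; the ranking of the word types stays the same either way.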

The most meaningful way to then return the output might be to sort it in decreasing order, first by the number of dimensions on which a word is key for T (because one really only wants those words that are key on both association and dispersion) and then by one amalgamation score (e.g., the Mahalanobis distance):

(results <- results[
   order(results$KEYONHOWMANY, results$MAHAL, decreasing=TRUE),])
  WORD FREQTAR FREQREF KEYFREQTAR KEYFREQALL     KEYASSOC KEYDISPTAR KEYDISPREF     KEYDISP KEYONHOWMANY
z    z       3       0  0.6020600  0.0000000  1.000000000 0.65487307 0.00000000  0.96769672            2
i    i       4       2  0.6989700  0.3477088  0.151065640 0.61605922 0.00000000  0.91034203            2
g    g       3       2  0.6020600  0.2519296  0.056458719 0.70088274 0.22375505  0.70504487            2
b    b       5       4  0.7781513  0.5693234  0.017690016 0.14721358 1.00000000 -1.26015049            1
c    c       3       3  0.6020600  0.3477088  0.000000000 0.70088274 0.02414894  1.00000000            1
h    h       3       1  0.6020600  0.1386469  0.317520657 0.00000000 0.02414894 -0.03568455            1
a    a       5       5  0.7781513  0.6285430  0.000000000 1.00000000 0.47246210  0.77953532            1
f    f       2       3  0.4771213  0.2519296 -0.056458719 0.29422753 0.00000000  0.43477587            1
x    x       9      10  1.0000000  1.0000000 -0.003990255 0.96546219 0.66863773  0.43861333            1
y    y       0       3  0.0000000  0.0000000 -1.000000000 0.00000000 0.59877181 -0.88479667            0
d    d       1       3  0.3010300  0.1386469 -0.317520657 0.00000000 0.52624922 -0.77763106            0
e    e       2       4  0.4771213  0.3477088 -0.151065640 0.02826716 0.47788111 -0.66438821            0
    AMALGAM1   AMALGAM2    EUCLID     MAHAL
z  1.5697567  1.1846715 1.5162167 4.9497286
i  1.0159324  0.7418921 1.1576280 0.9480818
g  0.7390364  0.4584708 0.9288445 0.5871551
b -1.2463850 -0.9668222 1.4811521 4.3208884
c  1.0000000  0.6020600 1.1672516 1.6284844
h  0.1554819  0.1696822 0.6815930 1.0199254
a  0.7795353  0.6065964 1.1014512 0.9053943
f  0.4078382  0.1805032 0.6479679 0.2963551
x  0.4346231  0.4346231 1.0919696 0.2076996
y -0.8847967  0.0000000 1.3352397 4.8916118
d -0.8732143 -0.3296735 0.8922715 1.2367820
e -0.7364648 -0.3890704 0.8317916 1.0078934

Here’s a visual representation of all three dimensions:

  • the association component on the x-axis;
  • the dispersion component on the y-axis;
  • the frequency component in the font size:
plot(type="n",
   xlab="Association component of keyness", x=results$KEYASSOC,
   ylab="Dispersion component of keyness" , y=results$KEYDISP)
grid(); abline(h=0, lty=2); abline(v=0, lty=2)
text(x=results$KEYASSOC, y=results$KEYDISP,
     labels=results$WORD, cex=0.5+results$KEYFREQTAR) # font size reflects the frequency component
Figure 1: Frequency, association, and dispersion keyness for the small example corpus

2.4 Bins

This is how one might create frequency, association, and dispersion bins such that each word type is grouped into one bin in the three-dimensional cube; here, I show for every word type which combination of association and dispersion bins it falls into:

results$KEYASSOCbin <- cut(
   results$KEYASSOC,
   breaks=c(-100, seq(0, 1, 0.1)),
   include.lowest=TRUE,
   labels=-1:9)
results$KEYDISPbin <- cut(
   results$KEYDISP,
   breaks=c(-100, seq(0, 1, 0.1)),
   include.lowest=TRUE,
   labels=-1:9)
print(tapply(results$WORD,
   list(ASSOC=results$KEYASSOCbin, DISP=results$KEYDISPbin),
   paste, collapse=", "), na.print=".")
     DISP
ASSOC -1        0 1 2 3 4      5 6 7   8 9
   -1 "y, d, e" . . . . "f, x" . . "a" . "c"
   0  "b"       . . . . .      . . "g" . .
   1  .         . . . . .      . . .   . "i"
   2  .         . . . . .      . . .   . .
   3  "h"       . . . . .      . . .   . .
   4  .         . . . . .      . . .   . .
   5  .         . . . . .      . . .   . .
   6  .         . . . . .      . . .   . .
   7  .         . . . . .      . . .   . .
   8  .         . . . . .      . . .   . .
   9  .         . . . . .      . . .   . "z"
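
How cut assigns values at the bin boundaries matters here: intervals are closed on the right by default, so a value of exactly 0 ends up in the lowest bin (labeled -1) together with all negative values, and a value of exactly 1 ends up in bin 9. Here is a minimal sketch with a few values from the table above (the object names are illustrative):

```r
vals <- c(b.disp=-1.26015049, a.assoc=0, b.assoc=0.017690016,
          h.assoc=0.317520657, z.assoc=1)
bins <- cut(vals,
   breaks=c(-100, seq(0, 1, 0.1)), # [-100,0], (0,0.1], ..., (0.9,1]
   include.lowest=TRUE,
   labels=-1:9)
setNames(as.character(bins), names(vals))
```

This is why a and c, whose association values are exactly 0, appear in the ASSOC row labeled -1 in the table above.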

3 Case study: ‘learned’ in Brown

We clear memory, source the function Keyness3D, and load the Brown corpus, which here already comes in a format that facilitates its use with Keyness3D:

rm(list=ls(all=TRUE))
source("Keyness3D.r")
head(BROWN.df <- readRDS("input/BROWN.df.RDS"))
    WORD PART
1    the  a01
2 fulton  a01
3 county  a01
4  grand  a01
5   jury  a01
6   said  a01

The learned category is represented in the corpus by part names beginning with “j”, so we make all of those the target corpus tar and everything else the reference corpus ref; after that, we can apply the function to the two corpora:

tar <- droplevels(BROWN.df[substr(BROWN.df$PART, 1, 1)=="j",])
ref <- droplevels(BROWN.df[substr(BROWN.df$PART, 1, 1)!="j",])
results <- Keyness3D(tar, ref) # takes approximately 1 second

The top 50 based on word types’ association to T:

set.seed(1)
(top50.assc <- results$WORD[order(
   results$KEYASSOC,
   sample(nrow(results)),
   decreasing=TRUE)] %>% head(50))
 [1] "brucellosis"        "biopsy"             "respondent's"
 [4] "height-to-diameter" "optics"             "zero-magnitude"
 [7] "unpaired"           "gyro-stabilized"    "ebb"
[10] "classifying"        "synergistic"        "nonequivalent"
[13] "celso"              "butchered"          "iodinate"
[16] "volts"              "jurisprudentially"  "exogamy"
[19] "bereavements"       "argon"              "2.405"
[22] "rumscheidt"         "electrolysis"       "epitomize"
[25] "nakamura"           "poland's"           "agriculture's"
[28] "haupts'"            "dubin"              "proteolytic"
[31] "categorizing"       "nonspecifically"    "misnamed"
[34] "oxygens"            "plastering"         "echelons"
[37] "3,450"              "**zq"               "no-valued"
[40] "cardiomegaly"       "geatish"            "glycerolized"
[43] "interference-like"  "disentangle"        "solvents"
[46] "discolors"          "torsion"            "scalar"
[49] "tangent"            "diffusely"         
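
The second argument to order above, sample(nrow(results)), serves as a random tie-breaker: word types with identical KEYASSOC values are then returned in random order rather than in row order. Under the method described in Section 2.2.2, all word types attested only in T share the maximal association value, so such ties are to be expected at the top of the list. Here is a minimal sketch with made-up scores (the vector scores is purely illustrative):

```r
scores <- c(w1=1, w2=0.4, w3=1, w4=1, w5=0.4) # made-up values with ties
set.seed(1)
ranking <- order(scores, sample(length(scores)), decreasing=TRUE)
names(scores)[ranking] # tied types appear in random order within their tie group
```

Setting the seed beforehand, as in the code above, keeps the random order reproducible.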

The top 50 based on word types’ dispersion in T relative to R:

(top50.disp <- results$WORD[order(
   results$KEYDISP,
   decreasing=TRUE)] %>% head(50))
 [1] "results"      "such"         "may"          "these"        "1"
 [6] "2"            "relatively"   "various"      "possible"     "similar"
[11] "method"       "amount"       "conditions"   "however"      "distribution"
[16] "assumed"      "basis"        "due"          "types"        "essentially"
[21] "therefore"    "appears"      "af"           "whereas"      "differences"
[26] "are"          "methods"      "per"          "has"          "cases"
[31] "thus"         "considerable" "described"    "which"        "extent"
[36] "used"         "ratio"        "addition"     "defined"      "related"
[41] "values"       "permit"       "isolated"     "cannot"       "necessary"
[46] "latter"       "3"            "experimental" "same"         "certain"     

The top 50 based on word types’ first amalgamation score:

(top50.amal1 <- results$WORD[order(
   results$AMALGAM1,
   decreasing=TRUE)] %>% head(50))
 [1] "results"      "af"           "1"            "distribution" "2"
 [6] "such"         "relatively"   "these"        "various"      "may"
[11] "conditions"   "method"       "assumed"      "differences"  "similar"
[16] "experimental" "essentially"  "types"        "whereas"      "defined"
[21] "possible"     "appears"      "values"       "amount"       "methods"
[26] "isolated"     "however"      "described"    "measurements" "basis"
[31] "therefore"    "analysis"     "cases"        "systems"      "calculated"
[36] "data"         "due"          "thus"         "occurring"    "parameters"
[41] "q"            "related"      "sample"       "follows"      "thermal"
[46] "variables"    "detected"     "3"            "extent"       "proportional"

On the relation between the association (x-axis) and the dispersion (y-axis) components of keyness, with point size reflecting the frequency component:

length(intersect(top50.assc, top50.disp)) # no overlap
[1] 0
plot(pch=16, col="#00000010", cex=0.5+1.5*results$KEYFREQTAR,
   xlab="Association component", x=results$KEYASSOC,
   ylab="Dispersion component",  y=results$KEYDISP); grid()
text(0.6, -0.8, paste(
   "Spearman's rho",
   round(cor(results$KEYASSOC, results$KEYDISP, method="spearman"), 4), sep="="))

These results seem quite a bit better than those of G2:

G2 <- function (a.2by2.table) {
   temp <- chisq.test(a.2by2.table, correct=FALSE)
   output <- a.2by2.table * log(a.2by2.table/temp$expected)
   return(2 * sum(output, na.rm=TRUE) * sign(t(temp$residuals)[1,1]))
}
WORD.by.CORPUS.abs <- table(BROWN.df$WORD, substr(BROWN.df$PART, 1, 1)=="j")[,2:1]
G.squareds <- apply(
   WORD.by.CORPUS.abs, 1, \(af) {
      G2(cbind(af, colSums(WORD.by.CORPUS.abs)-af))})
G.squareds %>% sort(TRUE) %>% names %>% head(50)
 [1] "af"            "of"            "is"            "anode"
 [5] "t"             "1"             "data"          "index"
 [9] "the"           "2"             "surface"       "cells"
[13] "system"        "stress"        "function"      "by"
[17] "q"             "dictionary"    "rate"          "reaction"
[21] "temperature"   "in"            "platform"      "sections"
[25] "information"   "analysis"      "results"       "values"
[29] "staining"      "which"         "binomial"      "elections"
[33] "cell"          "are"           "sample"        "be"
[37] "onset"         "c"             "shear"         "systems"
[41] "number"        "these"         "**zg"          "emission"
[45] "wage"          "curve"         "bronchial"     "used"
[49] "questionnaire" "operator"     
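
To see how signed G2 values behave on the small example corpus from Section 2.1, the sketch below computes them from the frequencies in T and R. It uses a variant called G2.sketch (a hypothetical name) that derives the expected frequencies directly instead of calling chisq.test, which avoids warnings about small expected frequencies but is otherwise meant to follow the same formula.

```r
G2.sketch <- function (a.2by2.table) {
   expd <- outer(rowSums(a.2by2.table), colSums(a.2by2.table)) / sum(a.2by2.table)
   output <- a.2by2.table * log(a.2by2.table / expd) # NaN where observed is 0
   2 * sum(output, na.rm=TRUE) * sign(a.2by2.table[1,1] - expd[1,1])
}
freqs <- cbind( # word-by-corpus frequencies of the small example corpus
   tar=c(a=5, b=5, c=3, d=1, e=2, f=2, g=3, h=3, i=4, x=9, y=0, z=3),
   ref=c(5, 4, 3, 3, 4, 3, 2, 1, 2, 10, 3, 0))
g2s <- apply(freqs, 1, \(af) {
   G2.sketch(cbind(af, colSums(freqs) - af))}) # rows: tar/ref; cols: word/other
round(sort(g2s, decreasing=TRUE), 3)
```

Because T and R are equally large in this toy example, z (3 vs. 0) and y (0 vs. 3) receive values of the same size with opposite signs, and a (5 vs. 5) receives 0.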

Here are words scoring relatively high on both the association and the dispersion components:

plot(type="n",                         # plot nothing
   xlab="Association component", xlim=c(0.2, 1), x=results$KEYASSOC,
   ylab="Dispersion component",  ylim=c(0.2, 1), y=results$KEYDISP)
grid() # add a grid and then the words (w/ their sizes reflecting frequency):
text(results$KEYASSOC, results$KEYDISP, results$WORD, cex=0.5+results$KEYFREQTAR)

4 Concluding remarks

sessionInfo()
R version 4.4.0 (2024-04-24)
Platform: x86_64-pc-linux-gnu
Running under: Pop!_OS 22.04 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so;  LAPACK version 3.10.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

time zone: America/Los_Angeles
tzcode source: system (glibc)

attached base packages:
[1] stats     graphics  grDevices utils     datasets  compiler  methods
[8] base

other attached packages:
[1] data.table_1.15.4 STGmisc_1.0       Rcpp_1.0.12       magrittr_2.0.3

loaded via a namespace (and not attached):
 [1] digest_0.6.35     fastmap_1.1.1     xfun_0.43         knitr_1.46
 [5] htmltools_0.5.8.1 rmarkdown_2.26    cli_3.6.2         rstudioapi_0.16.0
 [9] tools_4.4.0       evaluate_0.23     yaml_2.3.8        rlang_1.1.3
[13] jsonlite_1.8.8    htmlwidgets_1.6.4