1 Introduction

In corpus linguistics, the method of key words is often used to find the words that are used disproportionately more often in one (part of a) corpus than in some other (part of the same) corpus. It is often used in applied linguistics contexts to identify words that are characteristic for a particular genre or topic area; the paper by Leech & Fallon is an application of this method to different varieties of English in the hope of identifying words that are characteristic for a particular culture. We will generate two frequency lists – one for BrE, one for AmE (both of the 1990s) – on our way to determining for each word in both corpora which corpus/variety it is more characteristic of.

1.1 Step 1: things we need to do

We need to

  1. load each corpus (file) and homogenize upper-/lower-case spelling;
  2. clean up the corpus files by deleting tags/annotation;
  3. extract the words from each corpus (file);
  4. determine how often each word is used in each corpus (file);
  5. find words that are particularly characteristic of each corpus (file).

1.2 Step 2: functions we will need for that

The above tasks require the following functions:

  1. we load each corpus file and convert it to lower case: scan and tolower;
  2. we clean up the corpus files by deleting tags/annotation: gsub;
  3. extract the words from each corpus file: strsplit and unlist and nzchar;
  4. determine how often each word is used in each corpus (file): table + …;
  5. find words that are particularly characteristic of each corpus (file): just math functions;
  6. we visualize this results: plot and text.

1.3 Step 3: pseudocode

Let’s break this down:

  1. we load each corpus file and convert it to lower case: scan and tolower;
  2. we clean up the corpus files by deleting tags/annotation: gsub: we
    1. find things between angular brackets;
    2. replace them by nothing;
  3. extract the words from each corpus (file): we
    1. need to decide what we consider to be not-words material (either theory- or data-driven);
    2. split the corpus lines up whenever one or more of those things is found;
  4. determine how often each word is used in each corpus (file): counting things is done with table, but we need to count each word in both corpora:
    1. table will need a vector with all words so that it can count those;
    2. table will need an equally long vector saying in which each word form occurred, meaning we need to repeat each corpus name as often as there are words in the corpus;
  5. we compute some sort of coefficient that expresses which disfluency ‘prefers to precede’ which coarse-grained POS tag; this should remind you of the difference coefficient from last week;
  6. we visualize these results: plot and text.

2 Implementation

2.1 Task 1: load each corpus file

We load each corpus file into a separate vector:

head(frown <- tolower(     # show the head of frown, when you set to lower case the result of
   scan("files/FROWN.txt", # loading <files/FROWN.txt>
        what=character(),  # which contains text, not numbers
        sep="\n",          # elements are separated by line breaks
        quote="",          # there are no quotes in there
        comment.char=""))) # and no comment characters
## [1] "<#frown:a01\\><h_><p_>after 35 straight veto victories, intense lobbying fails president with election in offing<p/>"                                                                                                                              
## [2] "<p_>by elaine s. povich<p/>"                                                                                                                                                                                                                       
## [3] "<p_>chicago tribune<p/>"                                                                                                                                                                                                                           
## [4] "<h/> <p_>washington - despite intense white house lobbying, congress has voted to override the veto of a cable television regulation bill, dealing president bush the first veto defeat of his presidency just four weeks before the election.<p/>"
## [5] "<p_>monday night, the senate overrode the veto 74-25. the same margin by which the upper house approved the bill last month and comfortably above the two-thirds majority needed.<p/>"                                                             
## [6] "<p_>not one senator changed sides, a blow to bush's prestige after he had heavily lobbied republican senators, urging them not to embarrass him this close to the election.<p/>"
head(flob <- tolower(      # show the head of flob, when you set to lower case the result of
   scan("files/FLOB.txt",  # loading <files/FLOB.txt>
        what=character(),  # which contains text, not numbers
        sep="\n",          # elements are separated by line breaks
        quote="",          # there are no quotes in there
        comment.char=""))) # and no comment characters
## [1] "<#flob:a01\\><h_><p_>labour pledges reversal of nhs hospital opt-outs<p/>"                                                                                                                                                                                                                             
## [2] "<p_>by stephen castle<p/>"                                                                                                                                                                                                                                                                             
## [3] "<p_>political correspondent<p/>"                                                                                                                                                                                                                                                                       
## [4] "<h/> <p_>robin cook, labour's health spokesman, yesterday repeated party opposition to the internal market in the national health service and said there had been <quote_>\"no secret pacts with health service <}_><-|>manager<+|>managers<}/>\"<quote/> to maintain hospital trusts.<p/>"            
## [5] "<p_>speaking to prospective labour parliamentary candidates in london, mr cook said his party <quote_>\"will bring back into the local nhs all those hospitals that have opted out\"<quote/>. \"if there is an election in november and we win office we will stop any hospital in the pipeline.\"<p/>"
## [6] "<p_>he and his colleagues are concerned that managers have told some ngs staff that a labour government would accept trust status as a <tf_>fait accompli<tf/>. however, mr cook said tory plans for an internal market demonstrated the division between the values of the two parties.<p/>"

2.2 Task 2: clean up corpus files

We remove tags and other kinds of annotation (paragraph marks), which are between angular brackets:

(head(frown <- gsub( # make frown what you get when you replace
   pattern="<.*?>",  # annotation between angular brackets
   replacement="",   # with nothing
   frown,            # in frown
   perl=TRUE)))      # using Perl-compatible regular expressions
## [1] "after 35 straight veto victories, intense lobbying fails president with election in offing"                                                                                                                                            
## [2] "by elaine s. povich"                                                                                                                                                                                                                   
## [3] "chicago tribune"                                                                                                                                                                                                                       
## [4] " washington - despite intense white house lobbying, congress has voted to override the veto of a cable television regulation bill, dealing president bush the first veto defeat of his presidency just four weeks before the election."
## [5] "monday night, the senate overrode the veto 74-25. the same margin by which the upper house approved the bill last month and comfortably above the two-thirds majority needed."                                                         
## [6] "not one senator changed sides, a blow to bush's prestige after he had heavily lobbied republican senators, urging them not to embarrass him this close to the election."
(head(flob <- gsub(  # make flob what you get when you replace
   pattern="<.*?>", # annotation between angular brackets
   replacement="",  # with nothing
   flob,            # in flob
   perl=TRUE)))     # using Perl-compatible regular expressions
## [1] "labour pledges reversal of nhs hospital opt-outs"                                                                                                                                                                                                                              
## [2] "by stephen castle"                                                                                                                                                                                                                                                             
## [3] "political correspondent"                                                                                                                                                                                                                                                       
## [4] " robin cook, labour's health spokesman, yesterday repeated party opposition to the internal market in the national health service and said there had been \"no secret pacts with health service managermanagers\" to maintain hospital trusts."                                
## [5] "speaking to prospective labour parliamentary candidates in london, mr cook said his party \"will bring back into the local nhs all those hospitals that have opted out\". \"if there is an election in november and we win office we will stop any hospital in the pipeline.\""
## [6] "he and his colleagues are concerned that managers have told some ngs staff that a labour government would accept trust status as a fait accompli. however, mr cook said tory plans for an internal market demonstrated the division between the values of the two parties."

2.3 Task 3: extract the words

We split up the elements of each corpus into words at some character strings we don’t want to keep/consider as word material, which different users might define differently. For that, we first identify all characters in each corpus:

noquote(                       # show w/out quotes
   character.table <- sort(    # character.table, the result of sorting
      unique(                  # the unique elements of the
         unlist(               # unlisted
            strsplit(          # result of splitting up
               c(frown, flob), # the combination of both corpora
               ""              # at each character
))))) # close strsplit, unlist, unique, sort, noquote
##  [1] \035   '      -             !      "      #      $      %      &     
## [11] (      )      *      ,      .      /      :      ;      ?      @     
## [21] [      \\     ]      _      {      |      }      ´      £      +     
## [31] =      >      §      °      \u0096 0      ½      1      2      3     
## [41] 4      5      6      7      8      9      a      ä      b      c     
## [51] d      e      é      f      g      h      i      j      k      l     
## [61] m      n      o      ó      ò      ô      ö      õ      p      q     
## [71] r      s      t      u      ü      v      w      x      y      z

A nicer version might use magrittr’s pipe, but one needs to be aware of one small potential hiccup (of how to assign something so that it’s available in the global environment):

c(frown, flob)  %>%
   strsplit("") %>% unlist     %>%
   unique       %>% sort       %>%
   "<<-"("character.table", .) %>% # note the arrow!
   noquote

Let’s decide to discard all non-alphanumeric characters in this vector. We create a character vector split.expression that contains all characters we want to split on:

(split.expression <- paste0(  # make split.expression the result of pasting w/ nothing
   "[^0-9",                   # a character class: NOT the digits &
   paste0(                    # the result of pasting together w/ nothing
      character.table[47:80], # these letters
      collapse=""             # with nothing in between
   ),                         # close inner paste0
   "]+"                       # end of character class, one or more of those
))                            # close paste0 & output
## [1] "[^0-9aäbcdeéfghijklmnoóòôöõpqrstuüvwxyz]+"

We split up each corpus vector separately into its words:

head(frown.words <- unlist( # make frown.words the result of unlisting
   strsplit(                # the result of splitting up
      frown,                # this vector frown
      split.expression,     # at our character class
      perl=TRUE)            # using Perl-compatible regular expressions
))                          # close unlist, close head
## [1] "after"     "35"        "straight"  "veto"      "victories" "intense"
head(flob.words <- unlist( # make flob.words the result of unlisting
   strsplit(               # the result of splitting up
      flob,                # this vector frown
      split.expression,    # at our character class
      perl=TRUE)           # using Perl-compatible regular expressions
))                         # close unlist, close head
## [1] "labour"   "pledges"  "reversal" "of"       "nhs"      "hospital"

Quick check:

table(nchar(frown.words))
## 
##      0      1      2      3      4      5      6      7      8      9     10 
##   5491  52271 168143 206603 163647 112540  87742  81822  58582  42233  28082 
##     11     12     13     14     15     16     17     18     19     20     21 
##  16220   8974   4862   2116    789    267    135     41     18     14      5 
##     22     23     25     30 
##      5      5      2      1
table(nchar(flob.words))
## 
##      0      1      2      3      4      5      6      7      8      9     10 
##   5731  47231 176592 211298 163045 110258  85330  79560  56864  41759  27346 
##     11     12     13     14     15     16     17     18     19     20     21 
##  15868   8227   4494   1758    655    240     85     50     34     14     10 
##     22     23     24     25     26     27     29     30 
##      4      6      4      1      1      3      2      1

We eliminate the thousands of empty character vectors:

frown.words <- frown.words[nzchar(frown.words)] # subset frown.words to those 1+ character long
# and we check:
table(nchar(frown.words))
## 
##      1      2      3      4      5      6      7      8      9     10     11 
##  52271 168143 206603 163647 112540  87742  81822  58582  42233  28082  16220 
##     12     13     14     15     16     17     18     19     20     21     22 
##   8974   4862   2116    789    267    135     41     18     14      5      5 
##     23     25     30 
##      5      2      1
flob.words <- flob.words[nzchar(flob.words)] # subset flob.words to those 1+ character long
# and we check:
table(nchar(flob.words))
## 
##      1      2      3      4      5      6      7      8      9     10     11 
##  47231 176592 211298 163045 110258  85330  79560  56864  41759  27346  15868 
##     12     13     14     15     16     17     18     19     20     21     22 
##   8227   4494   1758    655    240     85     50     34     14     10      4 
##     23     24     25     26     27     29     30 
##      6      4      1      1      3      2      1

2.4 Task 4: cross-tabulate words and corpora

Since we have the corpora ‘handy’, let’s quickly save each corpus’s frequency list into a file. Who knows when we might need those again … (I do, in session 10.)

flob.table <- sort(table(flob.words), decreasing=TRUE)
cat("WORD\tFREQ",
    paste(names(flob.table), flob.table, sep="\t"),
    sep="\n", file="120_06_freq_flob.csv")
frown.table <- sort(table(frown.words), decreasing=TRUE)
cat("WORD\tFREQ",
    paste(names(frown.table), frown.table, sep="\t"),
    sep="\n", file="120_06_freq_frown.csv")

In order to cross-tabulate words and corpora, we need two equally long vectors, one with all words, the other with stating for each word which corpus it’s from. We create the former from frown.words and flob.words and the latter with rep from the lengths of frown.words and flob.words:

# define a vector with all words
length(all.words <- c( # show the length of all.words, the combination
   frown.words,        # of this corpus &
   flob.words))        # of this corpus
## [1] 2065859
# define a vector that says which corpus each word is from
length(all.corps <- c(    # show the length of all.corps, the combination
   rep(                   # of repeating
      "frown",            # "frown" as often as ...
      length(frown.words) # FROWN has words
   ),                     # and
   rep(                   # of repeating
      "flob",             # "flob" as often as ...
      length(flob.words)  # FLOB has words
   )))                    # close rep, c, & length
## [1] 2065859

We create a table – essentially a term-corpus matrix (by analogy to the notion of a term-document matrix used in information retrieval):

dim(tcm <- table( # show the dimensions of the table tcm, from tabulating
   all.words,     # all words from
   all.corps))    # both corpora
## [1] 64626     2
# take a peek
tcm[1:10,1:2]
##            all.corps
## all.words   flob frown
##   0          140   143
##   00          10    16
##   000        334   351
##   0000         1     0
##   00001        0     2
##   00011        0     2
##   0004         0     1
##   000ft        2     3
##   000l         0     1
##   000strong    1     0

I hope you were immediately thinking that we should maybe fix the whole numbering issue, maybe like this:

all.words <- gsub(
   "[0-9]+",
   "#",
   all.words,
   perl=TRUE)
dim(tcm <- table( # show the dimensions of the table tcm, from tabulating
   all.words,     # all words from
   all.corps))    # both corpora
## [1] 62168     2
tcm[1:10,1:2] # take a peek
##          all.corps
## all.words  flob frown
##   #       11248 12459
##   #a         48    41
##   #a#        10    28
##   #aa         1     0
##   #al         0     1
##   #alphai     0     1
##   #am        15     0
##   #as         1     0
##   #b         23    34
##   #b#         7    31

2.5 Task 5: quantify which word prefers which corpus

Since here the two corpora are nearly identically large (they differ by about half a percent only), we could again compute the difference coefficient, which would express which word is preferred how much in which corpus. As before, it would be computed like this and its values falls into the interval [-1,1]:

\[diff. coeff.=\frac{freq_{word~in~FROWN} - freq_{word~in~FLOB}}{freq_{word~in~FROWN} + freq_{word~in~FLOB}}\]

numerator <-   tcm[,2]-tcm[,1] # compute pairwise differences between columns
denominator <- tcm[,2]+tcm[,1] # or rowSums(tcm)
difference.coefficients <- numerator/denominator
# check result:
head(sort(difference.coefficients), 20) # useless
##      #aa      #am      #as      #ba      #bn      #bs      #cc      #cm 
##       -1       -1       -1       -1       -1       -1       -1       -1 
##      #cu      #d#     #deg #degreem      #dn      #du      #e#      #el 
##       -1       -1       -1       -1       -1       -1       -1       -1 
##      #f#      #ff       #g      #gt 
##       -1       -1       -1       -1
tail(sort(difference.coefficients), 20) # useless
##         zodiac         zombie zoroastrianism       zorthian             zs 
##              1              1              1              1              1 
##      zubchenko         zubero          zubro      zucchinis     zuckerbrot 
##              1              1              1              1              1 
##      zuckerman           zude         zuniga          zunis        zurawik 
##              1              1              1              1              1 
##             zw        zwiener       zwilling         zychik         zyklon 
##              1              1              1              1              1
set.seed(sum(utf8ToInt("haribo"))); sort(sample(difference.coefficients, 50))
##         fairyland                kp marshallsmarshals            scuffs 
##       -1.00000000       -1.00000000       -1.00000000       -1.00000000 
##       illuminates            dervla            bikers              wail 
##       -1.00000000       -1.00000000       -1.00000000       -1.00000000 
##              olea        discoverer        gallowgate        obviouosly 
##       -1.00000000       -1.00000000       -1.00000000       -1.00000000 
##             looka     disinvestment            stalls            chimes 
##       -1.00000000       -1.00000000       -1.00000000       -1.00000000 
##          trotting            embryo            jagger             shore 
##       -0.66666667       -0.47368421       -0.33333333       -0.23809524 
##             audit          assuming        approaches            summed 
##       -0.15789474       -0.13513514       -0.12500000       -0.07692308 
##           starkly           abstain     demonstrators              dark 
##        0.00000000        0.00000000        0.00000000        0.05637982 
##         mechanism    experimentally           academy            vanish 
##        0.13846154        0.14285714        0.23809524        0.25000000 
##       adventurous             chops              mono             bangs 
##        0.25000000        0.40000000        0.50000000        0.50000000 
##    psychoanalytic         nonrandom             jazzy             prowl 
##        0.57142857        1.00000000        1.00000000        1.00000000 
##        hypobiosis          deluding           timbres      ingratiating 
##        1.00000000        1.00000000        1.00000000        1.00000000 
##          histones        communique       retransmits        supertonic 
##        1.00000000        1.00000000        1.00000000        1.00000000 
##       prokaryotic             iliad 
##        1.00000000        1.00000000

But we also compute the odds ratios again, but this time we log them as well (to make the numerical ‘preference territory’ identically big for each corpus, namely 0 to ∞), but note also that we must ‘discount’ them to avoid numerical trouble with 0s, which is why we add 0.5 to each cell frequency):

numerator <- (tcm[,1]+0.5)/(tcm[,2]+0.5) # compute pairwise ratios between columns
denominator <- sum(tcm[,1])/sum(tcm[,2])
summary(logged.odds.ratios <- log(numerator/denominator))
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## -6.452530 -1.094373  0.004239 -0.024785  1.102852  6.106798
sort(sample(logged.odds.ratios, 50))
##     prosecuted      yellowish    apparatuses     intestines      overvalue 
##   -2.560709952   -1.941670744   -1.941670744   -1.605198507   -1.605198507 
##    ruttenstein     okefenokee       raimunda    grandstands      squealing 
##   -1.605198507   -1.094372884   -1.094372884   -1.094372884   -1.094372884 
##     sevzapkino         deltaz pennsylvanians       encumber        rimless 
##   -1.094372884   -1.094372884   -1.094372884   -1.094372884   -1.094372884 
##           bakr  teledemocracy       intruded      shameless         lately 
##   -1.094372884   -1.094372884   -0.843058455   -0.506586219   -0.247075023 
##       accuracy       survival         debate    stimulation        burglar 
##   -0.196431290   -0.180189634   -0.165866564   -0.162814680    0.004239405 
##         concur          boned         trough     assumption       replaced 
##    0.004239405    0.004239405    0.129402548    0.132620572    0.136507374 
##        company          worst     resentment           fist         supple 
##    0.154899680    0.188088846    0.255553833    0.283824267    0.456224529 
##     depictions      chemistry     clemmensen   patriotiques      epitomize 
##    0.851537265    0.989092808    1.102851694    1.102851694    1.102851694 
##    bulletother        flecked         weeded         eileen            soy 
##    1.102851694    1.102851694    1.102851694    1.362362889    1.613677318 
##     monarchist      dishcloth           flue    consumerist       realised 
##    1.613677318    1.613677318    1.950149554    1.950149554    4.879436728

2.6 Task 6: we visualize

How about a plot that reflects both

  • the frequency of each word in both corpora;
  • the preferential behavior of each word with regard to the corpora?

2.6.1 Plots using the difference coefficients

We put the (log of the) former on the x-axis and the latter on the y-axis:

plot(type="n",                                 # plot nothing
   xlab="Binary log of word frequency",        # w/ this x-axis label
   xlim=c(0, 17),                              # w/ these x-axis limits
   x=log2(rowSums(tcm)),                       # & these x-axis values
   ylab="Diff. coeffs. (<0: FROWN; >0: FLOB)", # w/ this y-axis label
   ylim=c(-1, 1),                              # w/ this y-axis label
   y=difference.coefficients)                  # w/ this y-axis label
grid() # add a grey grid
text(                       # plot text
   log2(rowSums(tcm)),      # at these x-axis coordinates
   difference.coefficients, # at these y-axis coordinates
   labels=rownames(tcm),    # the disfluencies
   font=3, cex=0.8)         # italicized and 20% smaller
# add a dashed horizontal line at 'neutrality' (the corpus sizes)
abline(h=(colSums(tcm)[1] - colSums(tcm)[2]) / sum(colSums(tcm)), lty=2)