1 Introduction

Much of corpus linguistics is based on the distributional hypothesis, here in the form provided by Harris (1970:785f.):

[i]f we consider words or morphemes A and B to be more different in meaning than A and C, then we will often find that the distributions of A and B are more different than the distributions of A and C. […], difference of meaning correlates with difference of distribution.

In other words, the idea is that distributional similarity reflects functional similarity – semantic, discourse-functional, or other kinds of similarity. This implies that words which are semantically similar tend to occur in similar lexical and grammatical contexts. For instance, the collocates of the word cat – the words you find ‘around’ it – will be more similar to the words you find around the word dog than to the collocates of the word ethereal. The distributional hypothesis has been used particularly often in studies of near synonymy, i.e. of sets of words with extremely similar meanings/functions. Would you be able to explain to a learner of English when to use fast vs. quick vs. rapid vs. swift? When to use fatal vs. lethal vs. deadly vs. mortal? Most likely not … The way a corpus linguist would try to tease these synonyms apart is via their collocates; for adjectives such as these, maybe specifically via the nouns they modify within a noun phrase. That is what we will do here for two pairs of -ic/-ical adjectives, electric(al) and historic(al).

1.1 Step 1: things we need to do

We need to

  1. define the corpus files;
  2. for each corpus (file), we
    1. load it;
    2. homogenize upper-/lower-case spelling;
    3. make sure we search only the lines that contain actual sentence material (being very cautious);
    4. extract matches for electric(al) and historic(al) when followed by a noun from each corpus file and store them somehow/somewhere;
  3. from all the matches, we extract the adjectives and the nouns;
  4. we identify which collocates of each adjective pair are distinctive for which adjective;
  5. we
    1. load a frequency list of a large reference corpus (the 100m-word BNC);
    2. look up the frequencies of all collocates of each -ic and each -ical adjective;
    3. compare the average frequency of the -ic adjective collocates to that of the -ical adjective collocates.

1.2 Step 2: functions we will need for that

The above tasks require the following functions:

  1. define the corpus files (dir, maybe with grep);
  2. for each corpus (file), we
    1. load it (scan);
    2. homogenize upper-/lower-case spelling (tolower);
    3. make sure we search only the lines that contain actual sentence material (being very cautious) (grep);
    4. extract matches for electric(al) and historic(al) when followed by a noun from each corpus file (exact.matches.2) and store them somehow/somewhere (<-);
  3. from all the matches, we extract the adjectives and the nouns (gsub or exact.matches.2);
  4. we identify which collocates of each adjective pair are distinctive for which adjective (math functions);
  5. we
    1. load a frequency list of a large reference corpus (the 100m-word BNC) (read.table);
    2. look up the frequencies of all collocates of each -ic and each -ical adjective (match);
    3. compare the average frequency of the -ic adjective collocates to that of the -ical adjective collocates (tapply and median).

1.3 Step 3: pseudocode

Let’s break this down (a compressed code sketch follows the list):

  1. define the corpus files (dir, maybe with grep);
  2. we define a collector for the matches, and then do a for-loop where we
    1. load it (scan);
    2. homogenize upper-/lower-case spelling (tolower);
    3. make sure we search only the lines that contain actual sentence material (being very cautious) (grep);
    4. extract matches for electric(al) and historic(al) when followed by a noun from each corpus file (exact.matches.2) and store them somehow/somewhere (<-);
  3. from all the matches, we extract the adjectives and the nouns:
    1. either we replace everything we don’t want by nothing (gsub); or
    2. we pick the words out of the concordance match (exact.matches.2);
  4. we identify which collocates of each adjective pair are distinctive for which adjective (using table and math functions; this should remind you of the difference coefficient/(log) odds ratio part from last week);
  5. we
    1. load a frequency list of a large reference corpus (the 100m-word BNC) (read.table);
    2. look up the frequencies of all collocates of each -ic and each -ical adjective (match);
    3. compare the average frequency of the -ic adjective collocates to that of the -ical adjective collocates (tapply and median).
  6. we visualize these results: plot and text.
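Putting the pseudocode together, here is a compressed sketch of the pipeline. It is only a sketch: it assumes that exact.matches.2 has already been sourced, that the corpus files sit in the files/ folder, and it uses a deliberately simplified placeholder regular expression; the fully commented implementation follows in section 2:

all.matches <- character()               # collector for the matches (step 2)
for (current.file in dir("files", pattern="sgml_", full.names=TRUE)) { # step 1: the corpus files
   current.lines <- tolower(             # steps 2.1 & 2.2: load & lower-case
      scan(current.file, what=character(), sep="\n",
           quote="", comment.char="", quiet=TRUE))
   current.lines <- grep("<s n=", current.lines, value=TRUE) # step 2.3: keep sentence lines
   current.hits <- exact.matches.2(      # step 2.4: adjective-noun matches
      "<w aj0>(elect|histo)ric(al)? <w n..>[^<]+", # simplified placeholder regex
      current.lines)[[1]]
   all.matches <- c(all.matches, current.hits)    # step 2.4: store them
}
# steps 3-6: extract adjectives & nouns (gsub), cross-tabulate & compute the
# (log) odds ratios (table & math functions), look up the BNC frequencies
# (read.table, match), and summarize/visualize them (tapply, median, plot, text)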

2 Implementation

2.1 Preparation

# clear memory
rm(list=ls(all=TRUE))
source("https://www.stgries.info/exact.matches.2.r") # get exact.matches.2

We define the corpus files:

corpus.files <- dir(
   "files",
   pattern="sgml_",
   full.names=TRUE)[1:4]

We define a collector structure for the results, an empty character vector:

all.matches <- character()

2.2 Processing the corpus files

Then, we loop over each file name and

  • we load each file with scan and set it to lower case (tolower):
for (i in seq(corpus.files)) { # access each corpus file
   # load each of the corpus files
   current.corpus.file <- tolower( # make current.corpus.file the lower case
      scan(                        # of what you load
         corpus.files[i],          # from the i-th corpus path
         what=character(),         # which is a file with character strings
         sep="\n",                 # separated by line breaks,
         quote="",                 # with no quote characters and
         comment.char="",          # no comment characters
         quiet=TRUE))              # suppress feedback
  • use grep to find only the sentence lines in the files:
   # use only the sentence-tagged lines of the corpus file
   current.sentences <- grep( # find
      "<s n=",                # the sentence number tags
      current.corpus.file,    # in current.corpus.file
      perl=TRUE,              # using Perl-compatible regular expressions
      value=TRUE)             # retrieve the whole line
  • use gsub to delete all tags that are not word or punctuation mark tags:
   # filter out unwanted annotation
   current.sentences <- gsub("(?x) # make current.sentences the result of replacing
      <                       # an opening angular bracket
      (?!                     # after which there is NOT ------------+
      [wc]\\s                 # a w or c followed by a space         |
      (...|...-...|n=.*?)     # some POS or sentence number tag      |
      )                       # end of after which there is NOT -----+
      .*?>",                  # but after which there is anything else
      "",                # (replacing all this) by nothing
      current.sentences, # in current.sentences
      perl=TRUE)         # using Perl-compatible regular expressions
   # alternative search expression: "<(?![wc] (...|...-...)).*?>[^<]*"
  • find the adjective-noun pairs (exact.matches.2):
   # retrieve all matches for each -ic/-ical pair with tags
   current.matches <- exact.matches.2( # look for
      "(?x)                                # set free-spacing
      <w\\s(aj0|aj0-...)>                  # an adjective tag (possibly as a portmanteau tag)
      (elect|histo)ric(al)?                # electric(al)? or historic(al)?
      \\s                                  # a space
      <w\\s(n..|n..-...)>                  # a noun tag (possibly as a portmanteau tag)
      [^<]+",                              #
      current.sentences)[[1]]              # in current.sentences. save only exact matches
  • collect them in the collector vectors:
   # add to previous matches
   all.matches <- c(all.matches, current.matches) # collect
  • print a progress report:
   cat("\f", i/length(corpus.files)) # output to the screen the % of files dealt w/ now
} # end of for: access each corpus file
##  0.25 0.5 0.75 1
  • and then we check the results:
object.size(all.matches)
## 18296 bytes

2.3 Processing the matches

We extract the adjectives from all.matches:

all.adjectives <-          # make all.adjectives the result
   sub("<.*?>([^<]+) <.*", # of replacing the whole string, but memorize stuff between ">" and " <"
       "\\1",              # with the memorized stuff
       all.matches,        # in all.matches
       perl=TRUE)          # using Perl-compatible regular expressions

We extract the nouns from all.matches:

all.nouns <- trimws( # make all.nouns the result of trimming whitespace from
   sub("^.*>",       # what you get when you replace everything till the last ">"
       "",           # with nothing
       all.matches,  # in all.matches
       perl=TRUE))   # using Perl-compatible regular expressions

Alternatively, you could have done this:

qwe <- strsplit(all.matches, "<.*?>", perl=TRUE)
all.adjectives <- trimws(sapply(qwe, "[", 2))
all.nouns <- trimws(sapply(qwe, "[", 3))

Let’s compile and check the results:

results <- data.frame(        # makes results a data frame w/
   ADJECTIVES=all.adjectives, # a column with all the adjectives
   NOUNS     =all.nouns)      # a column with all the nouns
results.split <- split( # makes results.split a list by splitting up
   results,             # results
   substr(results$ADJECTIVES, 1, 3)) # dep. on the 1st 3 chars of the adj.
lapply(           # apply to each element of
   results.split, # the split-up list
   head,          # the function head &
   10)            # show the first 10 rows
## $ele
##    ADJECTIVES   NOUNS
## 1    electric    fire
## 3    electric windows
## 4    electric   motor
## 5    electric cookers
## 8    electric blanket
## 9    electric heating
## 11   electric company
## 12   electric   shock
## 14   electric   board
## 16   electric   house
## 
## $his
##    ADJECTIVES        NOUNS
## 2    historic         town
## 6    historic        towns
## 7    historic         core
## 10   historic         city
## 13   historic organization
## 15   historic       status
## 24   historic        truth
## 29   historic     wildlife
## 33   historic       trends
## 40 historical    awareness

2.4 Computing distinctive collocates

Let’s compute a term-by-adjective matrix for each adjective pair:

# for electric(al)
dim(tam.ele <- table( # show the dimensions of the table tam.ele, from tabulating
   results.split$ele$NOUNS,       # all nouns after electric(al)
   results.split$ele$ADJECTIVES)) # both electric(al) adjectives
## [1] 89  2
# for historic(al)
dim(tam.his <- table( # show the dimensions of the table tam.his, from tabulating
   results.split$his$NOUNS,       # all nouns after historic(al)
   results.split$his$ADJECTIVES)) # both historic(al) adjectives
## [1] 56  2
# take a peek
head(tam.ele <- tam.ele[order(tam.ele[,1]/rowSums(tam.ele), -rowSums(tam.ele)),])
##            
##             electric electrical
##   energy           0          5
##   fault            0          5
##   apparatus        0          4
##   desk             0          4
##   goods            0          4
##   supplier         0          4
head(tam.his <- tam.his[order(tam.his[,1]/rowSums(tam.his), -rowSums(tam.his)),])
##               
##                historic historical
##   account             0          4
##   context             0          4
##   data                0          4
##   event               0          4
##   figure              0          4
##   significance        0          3

We compute what we know well by now, the logged odds ratios:

# for electric(al)
numer.ele <- (tam.ele[,1]+0.5)/(tam.ele[,2]+0.5) # row-wise ratios of the (0.5-smoothed) counts in the 2 columns
denom.ele <- sum(tam.ele[,1])/sum(tam.ele[,2])
summary(logged.ors.ele <- log(numer.ele/denom.ele))
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -2.56413 -1.77567  0.93238  0.03467  0.93238  2.54181
# for historic(al)
numer.his <- (tam.his[,1]+0.5)/(tam.his[,2]+0.5) # row-wise ratios of the (0.5-smoothed) counts in the 2 columns
denom.his <- sum(tam.his[,1])/sum(tam.his[,2])
summary(logged.ors.his <- log(numer.his/denom.his))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -1.5294 -0.9416 -0.9416  0.1782  1.7664  3.5010
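To make the arithmetic concrete, here is a toy illustration with made-up counts (not taken from our data): imagine a noun that occurs 3 times with the -ic form and once with the -ical form, in a sample where the -ic forms account for 40 and the -ical forms for 60 of all adjective tokens:

toy.numer <- (3+0.5)/(1+0.5) # smoothed ratio of the noun's -ic to -ical counts
toy.denom <- 40/60           # overall ratio of -ic to -ical tokens
log(toy.numer/toy.denom)     # >0: the noun 'prefers' the -ic form
## [1] 1.252763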

But because our data set is so small, many collocates end up with identical log odds ratios, which means that the usual plot we might consider suffers from so much overplotting that it’s useless (a sketch of such a plot nevertheless follows the sorted values below). For now, we just sort by the association, i.e. the log odds values:

sort(logged.ors.ele)
##      energy       fault   apparatus        desk       goods    supplier 
## -2.56413069 -2.56413069 -2.36346000 -2.36346000 -2.36346000 -2.36346000 
##     systems   appliance      charge     charges    circuits  contractor 
## -2.36346000 -2.11214557 -1.77567333 -1.77567333 -1.77567333 -1.77567333 
##    engineer engineering     goodies    impulses  insulators      panels 
## -1.77567333 -1.77567333 -1.77567333 -1.77567333 -1.77567333 -1.77567333 
##        pole   potential    products    register       shops       staff 
## -1.77567333 -1.77567333 -1.77567333 -1.77567333 -1.77567333 -1.77567333 
##      theory      things     tracing       wires      checks        gear 
## -1.77567333 -1.77567333 -1.77567333 -1.77567333 -1.26484771 -1.26484771 
##       trade  appliances   equipment       meter       power        cars 
## -1.26484771 -1.26484771 -1.26484771 -0.67706104 -0.67706104  0.03443528 
##       shock        fire       cable      cables       chair     circuit 
##  0.68106244  0.93237687  0.93237687  0.93237687  0.93237687  0.93237687 
##       clock     cookers      cutter      fibres       field       flymo 
##  0.93237687  0.93237687  0.93237687  0.93237687  0.93237687  0.93237687 
##      gadget      garage     heating         hob       house   immersion 
##  0.93237687  0.93237687  0.93237687  0.93237687  0.93237687  0.93237687 
##      lights         man      motors       pumps      razors     rollers 
##  0.93237687  0.93237687  0.93237687  0.93237687  0.93237687  0.93237687 
##     running      shears      shower      supply    toasters       track 
##  0.93237687  0.93237687  0.93237687  0.93237687  0.93237687  0.93237687 
##      trains       valve         van       water  wheelchair       bills 
##  0.93237687  0.93237687  0.93237687  0.93237687  0.93237687  1.44320249 
##     company     current     ecology       fires        lamp     toaster 
##  1.44320249  1.44320249  1.44320249  1.44320249  1.44320249  1.44320249 
##       board         car       light      shaver    blankets     windows 
##  1.77967473  1.77967473  1.77967473  1.77967473  2.03098916  2.03098916 
##      cooker       drill       motor     blanket        bill 
##  2.23165985  2.23165985  2.23165985  2.39871394  2.54181478
sort(logged.ors.his)
##        account        context           data          event         figure 
##     -1.5293952     -1.5293952     -1.5293952     -1.5293952     -1.5293952 
##   significance       accident   appreciation         budget           cord 
##     -1.2780808     -0.9416085     -0.9416085     -0.9416085     -0.9416085 
##         issues    linguistics          links      monuments        objects 
##     -0.9416085     -0.9416085     -0.9416085     -0.9416085     -0.9416085 
##           ones    portraiture    progression     pythagoras          rates 
##     -0.9416085     -0.9416085     -0.9416085     -0.9416085     -0.9416085 
##         record        records     references       research        romance 
##     -0.9416085     -0.9416085     -0.9416085     -0.9416085     -0.9416085 
##       sequence       sidestep      structure         theory          visit 
##     -0.9416085     -0.9416085     -0.9416085     -0.9416085     -0.9416085 
##      awareness       question reconstruction          truth      buildings 
##     -0.4307829     -0.4307829     -0.4307829      0.1570037      0.6678294 
##         centre          chair            day     enterprise       landmark 
##      1.7664417      1.7664417      1.7664417      1.7664417      1.7664417 
##         leases   organization        origins        setting          sites 
##      1.7664417      1.7664417      1.7664417      1.7664417      1.7664417 
##         status          title          trend         trends       wildlife 
##      1.7664417      1.7664417      1.7664417      1.7664417      1.7664417 
##           york      character           town          towns           core 
##      1.7664417      2.2772673      2.2772673      2.2772673      3.3758796 
##           city 
##      3.5010427
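Still, since step 6 of the pseudocode mentioned plot and text, here is a minimal sketch of what such a visualization could look like for electric(al) – again, with this small data set, expect the labels of collocates with identical log odds ratios to be plotted on top of each other:

ors.ele.sorted <- sort(logged.ors.ele)   # the sorted log odds ratios
plot(ors.ele.sorted,                     # set up the coordinate system
   type="n",                             # but plot no points
   xlab="Noun collocates (sorted)",      # x-axis label
   ylab="Log odds ratio: electric vs. electrical") # y-axis label
abline(h=0, lty=2)                       # dashed reference line at log OR = 0
text(seq(ors.ele.sorted),                # at x = 1, 2, ...
   ors.ele.sorted,                       # and y = the log odds ratios,
   names(ors.ele.sorted),                # plot the collocate nouns
   srt=90, cex=0.6)                      # rotated & in a small font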

2.5 Is -ical in wider/more common use?

Now let’s test Marchand’s claim that -ical forms are in wider common use. We load the complete frequency list of the whole British National Corpus:

summary(freq.list.all <- read.table( # make freq.list.all the result of reading
   "files/corp_bnc_freql.txt",       # this file
   header=TRUE,                      # which has a header in the 1st row
   sep=" ",                          # uses spaces as column separators
   quote="",                         # uses no quotes
   comment.char=""))                 # & no comments
##    FREQUENCY           WORD               POS                FILES       
##  Min.   :      1   Length:938971      Length:938971      Min.   :   1.0  
##  1st Qu.:      1   Class :character   Class :character   1st Qu.:   1.0  
##  Median :      1   Mode  :character   Mode  :character   Median :   1.0  
##  Mean   :    107                                         Mean   :  18.4  
##  3rd Qu.:      4                                         3rd Qu.:   3.0  
##  Max.   :6187267                                         Max.   :4120.0
head(freq.list.all)        # check the input
##   FREQUENCY   WORD POS FILES
## 1         1   !*?* unc     1
## 2       602      % nn0   113
## 3         1  %/100 unc     1
## 4         3  %/day unc     1
## 5         1   %295 unc     1
## 6         1 %5,000 unc     1
object.size(freq.list.all) # 64,280,352
## 64280352 bytes

That’s a big object to keep in memory, so let’s reduce the memory footprint to what we actually need, namely only the nouns:

where.are.the.nouns <- grep( # find
   "^nn\\d$",                # noun tags
   freq.list.all$POS,        # in the POS tag column
   perl=TRUE)                # using Perl-compatible regular expressions
summary(freq.list.n <- freq.list.all[ # make freq.list.n freq.list.all, but only
   where.are.the.nouns,               # the rows with nouns
   -3]                                # and not the POS column anymore
)
##    FREQUENCY             WORD               FILES        
##  Min.   :     1.00   Length:229954      Min.   :   1.00  
##  1st Qu.:     1.00   Class :character   1st Qu.:   1.00  
##  Median :     1.00   Mode  :character   Median :   1.00  
##  Mean   :    86.57                      Mean   :  23.65  
##  3rd Qu.:     5.00                      3rd Qu.:   3.00  
##  Max.   :153679.00                      Max.   :3931.00
object.size(freq.list.n) # 18551280 # less than 30% of original
## 18551280 bytes
rm(freq.list.all)

Let’s find out how frequent the nouns after electric(al) are:

collo.type.ele <- unique( # find the unique adjective-noun
   results.split$ele)     # combinations (types) for electric(al)
tapply(                       # apply to 
   freq.list.n$FREQUENCY[     # the frequencies from the BNC a subsetting, namely
      match(                   # retrieve the frequencies
         collo.type.ele$NOUNS, # of the noun collocates of electric(al)
         freq.list.n$WORD)],   # via cross-referencing the BNC nouns
   collo.type.ele$ADJECTIVES, # group the frequencies by the 2 adjectives
   median,     # compute the median while
   na.rm=TRUE) # omitting missing data/frequencies
##   electric electrical 
##     1069.0     3229.5

Let’s find out how frequent the nouns after historic(al) are:

collo.type.his <- unique( # find the unique adjective-noun
   results.split$his)     # combinations (types) for historic(al)
tapply(                       # apply to 
   freq.list.n$FREQUENCY[     # the frequencies from the BNC a subsetting, namely
      match(                   # retrieve the frequencies
         collo.type.his$NOUNS, # of the noun collocates of historic(al)
         freq.list.n$WORD)],   # via cross-referencing the BNC nouns
   collo.type.his$ADJECTIVES, # group the frequencies by the 2 adjectives
   median,     # compute the median while
   na.rm=TRUE) # omitting missing data/frequencies
##   historic historical 
##       4090       4292

This suggests that Marchand’s claim

  • is strongly supported for electric(al): 1069 (for electric) is less than 3229.5 (for electrical);
  • is only weakly supported for historic(al): 4090 (for historic) is just slightly less than 4292 (for historical).

But maybe “-ical forms are in wider common use” pertains to dispersion (the topic of the next session), so let’s check this by looking not at frequency but at the number of files in which the noun collocates are observed:

tapply(                       # apply to 
   freq.list.n$FILES[         # the file counts (ranges) from the BNC a subsetting, namely
      match(                   # retrieve the file counts
         collo.type.ele$NOUNS, # of the noun collocates of electric(al)
         freq.list.n$WORD)],   # via cross-referencing the BNC nouns
   collo.type.ele$ADJECTIVES, # group the file counts by the 2 adjectives
   median,     # compute the median while
   na.rm=TRUE) # omitting missing data/frequencies
##   electric electrical 
##      485.5     1054.0
tapply(                       # apply to 
   freq.list.n$FILES[         # the file counts (ranges) from the BNC a subsetting, namely
      match(                   # retrieve the file counts
         collo.type.his$NOUNS, # of the noun collocates of historic(al)
         freq.list.n$WORD)],   # via cross-referencing the BNC nouns
   collo.type.his$ADJECTIVES, # group the file counts by the 2 adjectives
   median,     # compute the median while
   na.rm=TRUE) # omitting missing data/frequencies
##   historic historical 
##     1029.5     1002.0

This suggests that Marchand’s claim

  • is supported for electric(al): 485.5 (for electric) is less than 1054 (for electrical);
  • is not supported for historic(al): 1029.5 (for historic) is more than 1002 (for historical).

This also means that distinguishing between frequency and dispersion is important.
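As a final, purely illustrative example (with made-up numbers, not BNC counts): two nouns can have the same overall frequency and still differ drastically in how widely they are dispersed across files:

toy <- data.frame(                  # made-up counts, mirroring freq.list.n's columns
   WORD     =c("noun.a", "noun.b"), # two hypothetical nouns
   FREQUENCY=c(1000, 1000),         # with the same overall frequency
   FILES    =c(5, 800))             # but very different numbers of files
toy
##     WORD FREQUENCY FILES
## 1 noun.a      1000     5
## 2 noun.b      1000   800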

Housekeeping:

sessionInfo()
## R version 4.2.2 (2022-10-31 ucrt)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 22621)
## 
## Matrix products: default
## 
## locale:
## [1] LC_COLLATE=English_United States.utf8 
## [2] LC_CTYPE=English_United States.utf8   
## [3] LC_MONETARY=English_United States.utf8
## [4] LC_NUMERIC=C                          
## [5] LC_TIME=English_United States.utf8    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  compiler  methods  
## [8] base     
## 
## other attached packages:
## [1] magrittr_2.0.3
## 
## loaded via a namespace (and not attached):
##  [1] digest_0.6.31   R6_2.5.1        jsonlite_1.8.4  evaluate_0.20  
##  [5] cachem_1.0.6    rlang_1.0.6     cli_3.6.0       rstudioapi_0.14
##  [9] jquerylib_0.1.4 bslib_0.4.2     rmarkdown_2.20  tools_4.2.2    
## [13] xfun_0.37       yaml_2.3.7      fastmap_1.1.0   htmltools_0.5.4
## [17] knitr_1.42      sass_0.4.5