1 Introduction

Many ordering phenomena in English involve a tendency for things that are supposedly easier to process to precede things that are supposedly harder to process; two ways in which processing ease is often reflected are the following:

  • shorter things are supposed to be easier to process than longer things;
  • more frequent things are supposed to be easier to process than less frequent things.

In this session, we will determine whether the ordering of two prenominal adjectives (as in nice green car) is compatible with the short-before-long and the frequent-before-rare(r) preferences; we will use the same BNC SGML files and the same overall BNC frequency list from session 8.

1.1 Step 1: things we need to do

We need to

  1. define the corpus files;
  2. for each corpus (file), we
    1. load it;
    2. homogenize upper-/lower-case spelling;
    3. make sure we search only the lines that contain actual sentence material (being very cautious);
    4. extract matches for adjective-adjective-noun sequences and store them somehow/somewhere;
  3. from all the matches, we extract the first and the second adjectives;
  4. for each adjective in each slot,
    1. we determine its length in characters;
    2. make a pairwise comparison of all adjective pairs;
  5. we
    1. load a frequency list of a large reference corpus (the 100m-word BNC);
    2. look up the frequencies of each first and each second adjective;
    3. make a pairwise comparison of all adjective pairs.

1.2 Step 2: functions we will need for that

The above tasks require the following functions:

  1. define the corpus files (dir, maybe with grep);
  2. for each corpus (file), we
    1. load it (scan);
    2. homogenize upper-/lower-case spelling (tolower);
    3. make sure we search only the lines that contain actual sentence material (being very cautious) (grep);
    4. extract matches for adjective-adjective sequences before nouns from each corpus file (exact.matches.2) and store them somehow/somewhere (<-);
  3. from all the matches, we extract the first and the second adjectives (gsub or exact.matches.2 or strsplit);
  4. for each adjective in each slot,
    1. we determine its length in characters (nchar);
    2. make a pairwise comparison of all adjective pairs (math functions);
  5. we
    1. load a frequency list of a large reference corpus (the 100m-word BNC) (read.table);
    2. look up the frequencies of each first and each second adjective (match);
    3. make a pairwise comparison of all adjective pairs (math functions).

1.3 Step 3: pseudocode

Let’s break this down some more:

  1. define the corpus files (dir, maybe with grep);
  2. we define a collector for the matches and then do a for-loop over the corpus files, in which we
    1. load it (scan);
    2. homogenize upper-/lower-case spelling (tolower);
    3. make sure we search only the lines that contain actual sentence material (being very cautious) (grep);
    4. extract matches for adjective-adjective sequences before nouns from each corpus file (exact.matches.2) and store them somehow/somewhere (<-);
  3. from all the matches, we extract the first and the second adjectives:
    1. either we replace everything we don’t want by nothing (gsub); or
    2. we pick the words out of the concordance match (exact.matches.2); or
    3. we use the tags for splitting up the matches (strsplit);
  4. for each adjective in each slot,
    1. we determine its length in characters (nchar);
    2. make a pairwise comparison of all adjective pairs (math functions);
  5. we
    1. load a frequency list of a large reference corpus (the 100m-word BNC) (read.table);
    2. look up the frequencies of each first and each second adjective (match);
    3. make a pairwise comparison of all adjective pairs (math functions).

2 Implementation

2.1 Preparation

# clear memory
rm(list=ls(all=TRUE))
source("https://www.stgries.info/exact.matches.2.r") # get exact.matches.2

We define the corpus files:

corpus.files <- dir(
   "files",
   pattern="sgml_",
   full.names=TRUE)[1:4]

We define a collector structure called all.AdjAdj.be4N, in which we will collect the matches (one per adjective-adjective-noun triple):

all.AdjAdj.be4N <- character()

2.2 Processing the corpus files

Then, we loop over each file name and

  • we load each file with scan and set it to lower case (tolower):
for (i in seq(corpus.files)) { # access each corpus file
   # load each of the corpus files
   current.corpus.file <- tolower( # make current.corpus.file the lower case
      scan(                        # of what you load
         corpus.files[i],          # from the i-th corpus path
         what=character(),         # which is a file with character strings
         sep="\n",                 # separated by line breaks,
         quote="",                 # with no quote characters and
         comment.char="",          # no comment characters
         quiet=TRUE))              # suppress feedback
  • use grep to find only the sentence lines in the files:
   # use only the sentence-tagged lines of the corpus file
   current.sentences <- grep( # find
      "<s n=",                # the sentence number tags
      current.corpus.file,    # in current.corpus.file
      perl=TRUE,              # using Perl-compatible regular expressions
      value=TRUE)             # retrieve the whole line
  • use gsub to delete all tags that are not word or punctuation mark tags:
   # filter out unwanted annotation
   current.sentences <- gsub( # make current.sentences the result of replacing
      "(?x)                   # set free-spacing
      <                       # an opening angular bracket
      (?!                     # after which there is NOT ------------+
      [wc]\\s                 # a w or c followed by a space         |
      (...|...-...|n=.*?)     # some POS or sentence number tag      |
      )                       # end of after which there is NOT -----+
      .*?>",                  # but after which there is anything else
      "",                # (replacing all this) by nothing
      current.sentences, # in current.sentences
      perl=TRUE)         # using Perl-compatible regular expressions
  • find the adjective-adjective-noun triples (exact.matches.2) and …
   # retrieve all matches for adjective-adjective pairs before nouns, with their tags
   current.all.AdjAdj.be4N <- exact.matches.2( # look for
      "(?x)                   # set free-spacing
      <w\\saj.>               # a 1st adjective tag
      [^<]*?                  # the 1st adjective
      <w\\saj.>               # a 2nd adjective tag
      [^<]*                   # the 2nd adjective
      (?=<w\\sn..>)",         # before the noun tag
      current.sentences)[[1]] # in current.sentences; [[1]] keeps only the exact matches
   # a very efficient alternative regex (lower case, because the files were lowercased):
   # (?<=<w aj0>)[^<]+<w aj0>[^<]+?(?= ?<w n)
  • collect them in the collector vector all.AdjAdj.be4N:
   # add to previous matches of adjective doublets
   all.AdjAdj.be4N <- c(all.AdjAdj.be4N, current.all.AdjAdj.be4N)
  • print a progress report:
   cat("\f", i/length(corpus.files)) # output to the screen the % of files dealt w/ now
} # end of for: access each corpus file
##  0.25 0.5 0.75 1

Let’s check the results:

object.size(all.AdjAdj.be4N)
## 54208 bytes
head(all.AdjAdj.be4N)
## [1] "<w aj0>one-time <w aj0>polish "  "<w aj0>poor <w aj0>old "        
## [3] "<w aj0>small <w aj0>unworthy "   "<w aj0>northern <w aj0>anglian "
## [5] "<w aj0>little <w aj0>old "       "<w aj0>thick <w aj0>woollen "
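
As a quick extra check, we can also see how many matches we retrieved overall:

length(all.AdjAdj.be4N) # the number of adjective-adjective matches retrieved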

2.3 Processing the matches

From this output, we now need to extract the two adjectives in parallel/aligned vectors. We can do this with searching, splitting, or replacing, depending on which regex we find easiest to write.

2.3.1 Adjectives from searching

From all.AdjAdj.be4N, we extract the adjectives; this is how to do it w/ exact.matches.2:

# pick out the adjective when 1 adjective tag is on the left & another is on the right
adj.1.e <- trimws(exact.matches.2(
   "(?<=<w aj.>)[^<]+(?= <w aj)",
   all.AdjAdj.be4N)[[1]])
# pick out the adjective when 1 adjective tag is on the left & a noun tag is on the right
adj.2.e <- trimws(exact.matches.2(
   "(?<= <w aj.>).*",
   all.AdjAdj.be4N)[[1]])

2.3.2 Adjectives from splitting

From all.AdjAdj.be4N, we extract the adjectives; this is how to do it w/ strsplit:

splitty <- strsplit( # make splitty the result of splitting
   all.AdjAdj.be4N,  # the triples
   " ?<w ...>",      # at the tags
   perl=TRUE)        # using Perl-compatible regular expressions
adj.1.s <- sapply( # make adj.1.s the result of applying
   splitty,        # to splitty
   "[",            # the extraction/subsetting function
   2)              # always picking the 2nd thing
adj.1.s <- trimws(adj.1.s) # trim remaining whitespace
adj.2.s <- sapply( # make adj.2.s the result of applying
   splitty,        # to splitty
   "[",            # the extraction/subsetting function
   3)              # always picking the 3rd thing
adj.2.s <- trimws(adj.2.s) # trim remaining whitespace

2.3.3 Adjectives from replacing

From all.AdjAdj.be4N, we extract the adjectives; this is how to do it w/ gsub:

adj.1.g <- trimws(gsub(    # make adj.1 the result of replacing
   "^<.*?>",               # the tag at the beginning
   "",                     # with nothing
   gsub(                   # in what you get when replacing
      " <w a.*",           # the first adj tag that has a space in front of it till the end
      "",                  # with nothing
      all.AdjAdj.be4N, # in all.AdjAdj.be4N
      perl=TRUE),          # using Perl-compatible regular expressions
   perl=TRUE))             # using Perl-compatible regular expressions
adj.2.g <- trimws(gsub(    # make adj.2 the result of replacing
   " <w .*$",              # a word tag till the end
   "",                     # with nothing
   gsub(                   # in what you get when replacing
      "^.*aj.>",           # from the beginning till the last adj tag
      "",                  # with nothing
      all.AdjAdj.be4N, # in all.AdjAdj.be4N
      perl=TRUE),          # using Perl-compatible regular expressions
   perl=TRUE))             # using Perl-compatible regular expressions
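
All three approaches should return exactly the same adjectives, so as a quick sanity check (assuming we ran all three versions above) we can compare the vectors; this should evaluate to TRUE:

all(adj.1.e==adj.1.s, adj.1.s==adj.1.g, # same 1st adjectives from all 3 methods &
    adj.2.e==adj.2.s, adj.2.s==adj.2.g) # same 2nd adjectives from all 3 methods?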

Let’s check the results:

head(data.frame(
   ORIG=all.AdjAdj.be4N,
   ADJ1=adj.1.e,
   ADJ2=adj.2.e),
10)
##                                ORIG     ADJ1     ADJ2
## 1    <w aj0>one-time <w aj0>polish  one-time   polish
## 2           <w aj0>poor <w aj0>old      poor      old
## 3     <w aj0>small <w aj0>unworthy     small unworthy
## 4   <w aj0>northern <w aj0>anglian  northern  anglian
## 5         <w aj0>little <w aj0>old    little      old
## 6      <w aj0>thick <w aj0>woollen     thick  woollen
## 7    <w aj0>vertical <w aj0>boring  vertical   boring
## 8     <w aj0>gross <w aj0>domestic     gross domestic
## 9        <w aj0>new <w ajc>younger       new  younger
## 10 <w ajs>clearest <w aj0>possible  clearest possible

2.4 Is adj. 1 longer than adj. 2?

Let’s compute the lengths of the adjectives:

summary(adj.1.lengths <- nchar(adj.1.e))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   5.000   6.000   6.792   8.000  18.000
summary(adj.2.lengths <- nchar(adj.2.e))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   6.000   8.000   7.901   9.000  17.000
summary(length.diffs <- adj.1.lengths-adj.2.lengths)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -10.000  -3.000  -1.000  -1.109   1.000   9.000
hist(length.diffs)                     # histogram of the length differences
   abline(v=mean(length.diffs), lty=2) # dashed vertical line at the mean

A better plot:

plot(main="Adjective lengths dep. on position", type="n", axes=FALSE, # plot nothing with this heading
     xlab="Adjective position", xlim=c(1,2), x=0, # this x-axis
     ylab="Adjective length", ylim=c(0, 20), y=0) # this y-axis
   grid(); axis(1, at=1:2); axis(2) # add a grid and some axes
   segments(1, adj.1.lengths, # plot segments from the left (1) at the lengths of adj 1
            2, adj.2.lengths, # to the right (2) at the lengths of adj 2
            col=ifelse(adj.1.lengths<adj.2.lengths, # if 1 < 2
                     "#0000FF30",                   # make the segment blue (semi-transparent)
                     "#FF000030"))                  # otherwise red (semi-transparent)
   segments(1, mean(adj.1.lengths), # plot the overall mean length of adj 1
            2, mean(adj.2.lengths), # and the overall mean length of adj 2
            lwd=4)                  # with a heavy/bold line
   text(1.5, 20, # add at the top center of the plot
      wilcox.test(adj.1.lengths, adj.2.lengths, # the p-value of a 1-tailed
         alternative="less")$p.value,           # Wilcoxon test (is adj 1 shorter?)
      pos=1, col="grey")

There is a weak short-before-long preference: on average, the first adjective is about one character shorter than the second.
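
To quantify this preference a bit more, we might also tabulate how often the first adjective is shorter than, as long as, or longer than the second (a small optional addition based on the length.diffs vector from above):

table(sign(length.diffs))             # -1: adj 1 shorter, 0: equally long, 1: adj 1 longer
prop.table(table(sign(length.diffs))) # the same, but as proportions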

2.5 Is adj. 1 more ‘common’ than adj. 2?

We load the complete frequency list of the whole British National Corpus:

summary(freq.list.all <- read.table( # make freq.list.all the result of reading
   "files/corp_bnc_freql.txt",       # this file
   header=TRUE,                      # which has a header in the 1st row
   sep=" ",                          # uses spaces as column separators
   quote="",                         # uses no quotes
   comment.char=""))                 # & no comments
##    FREQUENCY           WORD               POS                FILES       
##  Min.   :      1   Length:938971      Length:938971      Min.   :   1.0  
##  1st Qu.:      1   Class :character   Class :character   1st Qu.:   1.0  
##  Median :      1   Mode  :character   Mode  :character   Median :   1.0  
##  Mean   :    107                                         Mean   :  18.4  
##  3rd Qu.:      4                                         3rd Qu.:   3.0  
##  Max.   :6187267                                         Max.   :4120.0
head(freq.list.all)        # check the input
##   FREQUENCY   WORD POS FILES
## 1         1   !*?* unc     1
## 2       602      % nn0   113
## 3         1  %/100 unc     1
## 4         3  %/day unc     1
## 5         1   %295 unc     1
## 6         1 %5,000 unc     1
object.size(freq.list.all) # 64,280,352
## 64280352 bytes

That’s a big object to process, so let’s reduce the memory footprint to what we need, namely only the adjectives (which will also make things faster):

where.are.the.adjs <- grep( # find
   "^aj.$",                 # adjective tags (aj0, ajc, ajs)
   freq.list.all$POS,       # in the tag column
   perl=TRUE)               # using Perl-compatible regular expressions
summary(freq.list.a <- freq.list.all[ # make freq.list.a freq.list.all, but only
   where.are.the.adjs,                # the rows with adjectives
   -3]                                # and not the POS column anymore
)
##    FREQUENCY             WORD               FILES        
##  Min.   :     1.00   Length:107657      Min.   :   1.00  
##  1st Qu.:     1.00   Class :character   1st Qu.:   1.00  
##  Median :     1.00   Mode  :character   Median :   1.00  
##  Mean   :    58.19                      Mean   :  19.19  
##  3rd Qu.:     4.00                      3rd Qu.:   3.00  
##  Max.   :129451.00                      Max.   :3910.00
object.size(freq.list.a) # 9,138,440 # less than 15% of original
## 9138440 bytes
rm(freq.list.all) # the complete list is no longer needed

Let’s look up how frequent the adjectives are and, at the same time, how dispersed they are (i.e. in how many corpus files they occur):

# make adj.1.X the X from the BNC, but only those where the 1st adjective matches the word from the BNC
adj.1.freqs <- freq.list.a$FREQUENCY[match(adj.1.e,  freq.list.a$WORD)]
adj.2.freqs <- freq.list.a$FREQUENCY[match(adj.2.e,  freq.list.a$WORD)]
adj.1.disps <- freq.list.a$FILES[    match(adj.1.e,  freq.list.a$WORD)]
adj.2.disps <- freq.list.a$FILES[    match(adj.2.e,  freq.list.a$WORD)]
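
Note that adjectives not found in freq.list.a (e.g. forms that are listed in the reference list under a different tag, or not at all) come back from match as NA; it is worth checking how many look-ups failed:

sum(is.na(adj.1.freqs)) # how many 1st adjectives were not found in the frequency list?
sum(is.na(adj.2.freqs)) # how many 2nd adjectives were not found in the frequency list?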

Let’s see what the results are for the (logged) frequencies:

summary(freq.diffs <- log2(adj.1.freqs)-log2(adj.2.freqs))
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.     NA's 
## -15.1022  -1.3586   0.7904   0.9499   3.2553  14.4230       45
hist(freq.diffs, xlim=c(-17, 17))
   abline(v=mean(freq.diffs, na.rm=TRUE), lty=2) # dashed line at the mean
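
As with the lengths, we could again add a one-tailed significance test, here of whether the first adjectives’ logged frequencies tend to be greater than those of the second adjectives (a sketch analogous to the Wilcoxon test used for the lengths above; wilcox.test disregards the NA cases):

wilcox.test(          # 1-tailed test of whether
   log2(adj.1.freqs), # the logged frequencies of the 1st adjectives
   log2(adj.2.freqs), # are greater than those of the 2nd adjectives
   alternative="greater")$p.value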