1 Introduction

In the past weeks, we have been dealing with the four basic corpus-linguistic methods – frequency lists, dispersion, association/keywords, and concordancing – and we have been doing that from the perspective of base R. In this final script, we are going to explore a few things that go beyond that and that synthesize several things we have done already with functionality offered by other packages; that of course means that I cannot discuss these other packages much – the point here is to give you a flavor of a more comprehensive application. What we will do is explore the question of which of several factors determine the order of two adjectives used prenominally in the writing of learners of English in the (untagged) Swedish component of ICLE. In addition, we will also aim to make the code a bit more readable using the pipe operator %>% from the package dplyr.
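
To see what that will look like, here is a minimal toy example (not part of the analysis itself): the pipe 'sends' its left-hand side on as the first argument of the function to its right, so that nested function calls can be read left to right:

library(dplyr)  # provides the pipe operator %>%
c(1, 4, 9) %>%  # take this vector, then
   sqrt() %>%   # take its square roots, then
   sum()        # add those up: same as sum(sqrt(c(1, 4, 9))), i.e. 6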

The factors we will study are the two adjectives' lengths, their frequencies, and their sentiment values.

Finally, we will quickly check whether the presence of any such effects might be related to the proficiency of the learners who wrote the essays in the corpus; that proficiency will be operationalized via each essay's readability score and lexical diversity score.

We will proceed as follows:

  1. Since the learner corpus is not tagged, if we want to find Adj-Adj-N sequences, we will first POS-tag the learner data (using the libraries NLP and openNLP);
  2. we will then use the tags to retrieve Adj-Adj-N sequences (w/ exact.matches.2), plus we will do a bit of post-processing to get the data into shape for the subsequent analyses;
  3. then, we will study the above variables:
    • length (which we will get from our data w/ nchar);
    • frequency (which we will get from frequency lists compiled in Session 5 of the course);
    • sentiment values (which we will get from the package syuzhet).
  4. finally, we will compute for each learner essay the readability score and the lexical diversity score (which we will get from the package quanteda) to see whether those are related to the above predictors.
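
As a preview of step 4, here is a minimal sketch; note that, in recent versions of quanteda (≥3), the textstat_* functions live in the companion package quanteda.textstats, and the measures chosen here (Flesch, TTR) are merely illustrative:

library(quanteda)            # core corpus-handling functionality
library(quanteda.textstats)  # home of the textstat_* functions in quanteda >= 3
toy <- "This is a short example essay. It has two sentences."
textstat_readability(toy, measure="Flesch") # one possible readability score
textstat_lexdiv(tokens(toy), measure="TTR") # one possible lexical diversity score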

2 Execution/methods

2.1 Preparation

We clear memory and load the first things we need:

rm(list=ls(all=TRUE))
library(dplyr)
source("https://www.stgries.info/exact.matches.2.r") # get exact.matches.2

We unzip and define the corpus files from the learner corpus:

unzip("files/ICLE2-SW.zip", # unzip this zip archive
      exdir="files")        # into the files folder/directory

corpus.files <- dir( # make corpus.files the content of the directory
   "files/ICLE2-SW", # <files/ICLE2-SW>
   recursive=TRUE,   # browse all sub-folders
   full.names=TRUE)  # keep full path information

2.2 Tag each corpus file and retrieve Adj-Adj-N sequences

To tag the corpus texts and immediately also do the concordancing, we first set up a cluster of cores/threads:

library(doParallel)       # load the required library
registerDoParallel(       # register a cluster that R knows about
   cl <-                  # under the name cl
      makePSOCKcluster(   # which is a cluster that
         detectCores()-1, # will utilize all cores/threads you have available but 1
         outfile=""))     # but still returns printed progress reports to the main console

Then, we employ the following multi-step strategy: we

  • load the packages openNLP and NLP;
  • load each file in the usual way (w/ scan) and convert to a string (in the NLP sense);
  • generate the required annotators and find the tags for each word;
  • generate tag-word pairs with angular brackets (w/ sprintf) and merge them into a new tagged corpus file (w/ paste);
  • use exact.matches.2 to find adjective-adjective-noun sequences with context;
  • if there are matches, we add the file name to the output and collect the results in all.matches:
library(NLP); library(openNLP) # load the required libraries
system.time({ # time the following process
all.matches <- # make all.matches the result of doing w/ a cluster
   foreach(i=seq(corpus.files),
           .combine=c,
           .packages=c("dplyr", "NLP", "openNLP")
           ) %dopar% { cat(".")

      curr.string <-                                       # make curr.string the result of
         scan(corpus.files[i], what=character(), sep="\n", # loading the current file
              quote="", comment.char="", quiet=TRUE) %>%   # w/ the usual corpus file loading settings
         "["(-1)                                     %>%   # removing the first element
         as.String                                         # making it a 'string'

      # generate annotators that openNLP requires
      sent_token_annotator <- Maxent_Sent_Token_Annotator()
      word_token_annotator <- Maxent_Word_Token_Annotator()
      pos_tag_annotator <- Maxent_POS_Tag_Annotator()

      # assign part-of-speech annotation
      pos.annotation <- annotate(
         curr.string,
         pos_tag_annotator,
         annotate(
            curr.string,
            list(sent_token_annotator,
                 word_token_annotator)))

      # generate words & tags
      curr.words.pointer <- subset(pos.annotation, type=="word")
      curr.tags <- sapply(curr.words.pointer$features, "[[", "POS")
      curr.words <- curr.string[curr.words.pointer]

      # generate tagged corpus file:
      sprintf("<%s>%s",          # put angular brackets 'thing'1 immediately before 'thing'2:
                 curr.tags,      # ;thing; 1
                 curr.words) %>% # 'thing; 2
      paste(collapse=" ")    ->  # paste together all word-tag pairs and dump the string into
      current.tagged.file        # current.tagged.file
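      # a quick illustration (made-up words) of the format this generates:
      # sprintf("<%s>%s", c("DT","JJ"), c("a","good")) %>% paste(collapse=" ")
      # yields "<DT>a <JJ>good"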

      # retrieve Adj-Adj-N sequences, ...
      curr.matches <- exact.matches.2(
         "(?x)             # free-spacing
         <JJ[RS]?>         # something tagged as an adjective
         [^<]+             # that adjective
         <JJ[RS]?>         # something tagged as an adjective
         [^<]+             # that adjective
         <N[^>]+?>         # something tagged as a noun
         [^<]+             # that noun
         ",
         current.tagged.file,        # in current.tagged.file
         characters.around=250)[[4]] # return up to 250 characters around the match

      if (length(curr.matches)>0) {          # if there are matches,
         curr.matches <-                     # make curr.matches
            paste(basename(corpus.files[i]), # the result of prefixing the file name
                  curr.matches,              # in front of curr.matches
                  sep="\t")                  # with a tabstop in-between
      }

      curr.matches # 'return' to all.matches
   }
})
##    user  system elapsed
##   0.266   0.031  39.161
stopCluster(cl)
detach(package:doParallel); detach(package:NLP); detach(package:openNLP)
invisible(gc())

2.3 Postprocessing for subsequent analysis

We first output the results into a spreadsheet-like tab-delimited .csv file with the usual kind of cat application:

cat("CASE\tFILE\tPRECEDING\tMATCH\tSUBSEQUENT",
    paste(seq(all.matches),
          all.matches,
          sep="\t"),
    sep="\n",
    file="120_10_conc1.csv")

But then, we load it again into a data frame x for some post-processing (w/ read.table):

summary(x <- read.table(
   "120_10_conc1.csv",
   header=TRUE, sep="\t",
   quote="", comment.char="", stringsAsFactors=FALSE))
##       CASE           FILE            PRECEDING            MATCH
##  Min.   :  1.0   Length:639         Length:639         Length:639
##  1st Qu.:160.5   Class :character   Class :character   Class :character
##  Median :320.0   Mode  :character   Mode  :character   Mode  :character
##  Mean   :320.0
##  3rd Qu.:479.5
##  Max.   :639.0
##   SUBSEQUENT
##  Length:639
##  Class :character
##  Mode  :character
##
##
## 

Specifically, we want to get rid of the part-of-speech tags that we now don’t need anymore, plus it would be nice to have easy access to the adjectives. Thus, we first delete the tags in the preceding and subsequent context, where this is straightforward:

x$PRECEDING <- gsub( # make x$PRECEDING the result of replacing,
   "(?x)             # with free-spacing,
   <                 # an opening angular bracket
   [^>]*?            # 0 or more characters that are not closing angular brackets, till
   >",               # the closing angular bracket
   "",               # with nothing
   x$PRECEDING,      # in x$PRECEDING
   perl=TRUE)        # using Perl-compatible regular expressions
x$SUBSEQUENT <- gsub("<[^>]*?>", "", x$SUBSEQUENT, perl=TRUE) # and the same for the subsequent context
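
A quick toy check of what this replacement does (made-up input string):

gsub("<[^>]*?>", "", "<DT>a <JJ>fine <NN>day", perl=TRUE) # yields "a fine day"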

Now we do the same for the column x$MATCH, but with a slight twist:

  • we delete the tag of adjective 1 (i.e. replace it with nothing) because that one already has a tab in front of it;
  • we replace the other two tags with tabstops so we get separate columns for adjective 1, adjective 2, and the noun:
x$MATCH <- gsub("^<[^>]*?>", "", x$MATCH ,perl=TRUE)
x$MATCH <- gsub(" ?<[^>]*?>", "\t", x$MATCH ,perl=TRUE)
   names(x)[4] <- "ADJ1\tADJ2\tN"
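
Again a toy check with a made-up match (remember that, in the real data, a tab already precedes adjective 1's tag):

m <- "<JJ>great <JJR>many <NNS>people"
m <- gsub("^<[^>]*?>", "", m, perl=TRUE) # "great <JJR>many <NNS>people"
gsub(" ?<[^>]*?>", "\t", m, perl=TRUE)   # "great\tmany\tpeople"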

After that, we save the file again so that now we have a neatly organized spreadsheet for further processing:

write.table(x, file="120_10_conc2.csv",
            quote=FALSE, row.names=FALSE, sep="\t")

2.4 Analysis of several predictors

Now we explore the effects of the above-mentioned predictors. We load the file again so we have the right columns lined up:

str(x <- read.table(
   "120_10_conc2.csv",
   header=TRUE, sep="\t",
   quote="", comment.char=""))
## 'data.frame':    639 obs. of  7 variables:
##  $ CASE      : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ FILE      : chr  "SWUG2001.txt" "SWUG2002.txt" "SWUG2003.txt" "SWUG2003.txt" ...
##  $ PRECEDING : chr  "will be seen as a \"problem \" or not . The question is then , should we fit them in on an integrating or an as"| __truncated__ "BS>most polluted capital in the world . But who cares when Thailand 's economic growth is on the increase and i"| __truncated__ "y should share their jobs , their social security benefits etc . with foreigners . Many Swedes ask themselves w"| __truncated__ "So between a choice of integration and assimilation I would say that integration is the best way to go , even t"| __truncated__ ...
##  $ ADJ1      : chr  "lazy" "great" "swedish" "different" ...
##  $ ADJ2      : chr  "uninterested" "many" "cultural" "cultural" ...
##  $ N         : chr  "Swede" "people" "heritage" "groups" ...
##  $ SUBSEQUENT: chr  "would probably go for the latter . Because would it not be lovely if everybody behaved like us ? We could go on"| __truncated__ "in Bangkok can consume more than they could earlier , eventhough they need to wear muzzles to protect themselve"| __truncated__ "if they keep on integrating foreigners into Swedish society ? Why should native Swede have to adapt to a Turk '"| __truncated__ "in society . One should also keep in mind that Sweden is not an isolated island and that it has always thrived "| __truncated__ ...

2.4.1 Is there a length effect?

Obviously, we begin by computing the lengths of the adjectives in characters:

x$ADJ1LENGTH <- nchar(as.character(x$ADJ1))
x$ADJ2LENGTH <- nchar(as.character(x$ADJ2))

A quick monofactorial plot suggests that there is a significant length effect: there seems to be a greater number of cases where adjective 2 is longer than adjective 1:

plot(main="Adjective lengths before the N", pch=16, col="#00000030",
   xlab="Length of adjective 1 in chars", xlim=c(1, 17), x=jitter(x$ADJ1LENGTH, 1.5),
   ylab="Length of adjective 2 in chars", ylim=c(1, 17), y=jitter(x$ADJ2LENGTH, 1.5))
   grid(); abline(0, 1)

But is it significant? How do we even check this? Back to 104: a one-tailed Wilcoxon signed-rank test on the pairwise differences or the equivalent paired Wilcoxon test – do you even remember those?

summary(x$ADJ1LENGTH-x$ADJ2LENGTH)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
## -11.000  -4.000  -2.000  -1.759   1.000  10.000
wilcox.test(x$ADJ1LENGTH-x$ADJ2LENGTH, mu=0, correct=FALSE, alternative="less") # same as
##
##  Wilcoxon signed rank test
##
## data:  x$ADJ1LENGTH - x$ADJ2LENGTH
## V = 36245, p-value < 2.2e-16
## alternative hypothesis: true location is less than 0
(qwe <- wilcox.test(x$ADJ1LENGTH, x$ADJ2LENGTH, paired=TRUE, correct=FALSE, alternative="less"))
##
##  Wilcoxon signed rank test
##
## data:  x$ADJ1LENGTH and x$ADJ2LENGTH
## V = 36245, p-value < 2.2e-16
## alternative hypothesis: true location shift is less than 0

There is a significant difference such that adjective 2 is longer (V=36245, p=3.88e-37).

2.4.2 Is there a frequency effect?

We load two frequency lists that we created before in the course: one from the Frown corpus and one from the FLOB corpus:

freq.frown <- read.delim("120_06_freq_frown.csv", stringsAsFactors=TRUE)
freq.flob  <- read.delim("120_06_freq_flob.csv" , stringsAsFactors=TRUE)

We merge the two data frames, each of which contains frequencies per 1 million words, into one (w/ rbind) and then compute the combined frequencies per million words by averaging:

freqs <- rbind(freq.frown, freq.flob) # stack the two frequency lists on top of each other
freqs.pmw <- tapply( # make freqs.pmw the result of computing
   freqs$FREQ,       # the sums of the frequencies pmw
   freqs$WORD,       # for each word
   sum) / 2          # & divide by 2 to average over the two corpora

Then, we do OCD housekeeping and delete the original frequency lists, but add the frequencies we just generated to the data frame for each of the two adjectives:

rm(freq.frown, freq.flob)
x$ADJ1FREQ <- freqs.pmw[x$ADJ1]
x$ADJ2FREQ <- freqs.pmw[x$ADJ2]
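
Note how this lookup works: indexing a named vector with a character vector returns NA for names that are absent, which is why adjectives attested in neither Frown nor FLOB end up with NA frequencies (a toy illustration with made-up values):

toy <- c(good=100, bad=50)   # a made-up named frequency vector
toy[c("good", "new", "bad")] # "new" is not among the names, so it returns NA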

A quick monofactorial plot seems to suggest that there is no significant frequency effect:

plot(main="Adjective frequencies before the N", pch=16, col="#00000030",
   xlab="Freq pmw (log2ed) of adjective 1", xlim=c(-1.5, 15.5), x=jitter(log2(x$ADJ1FREQ)),
   ylab="Freq pmw (log2ed) of adjective 2", ylim=c(-1.5, 15.5), y=jitter(log2(x$ADJ2FREQ)))
   grid(); abline(0, 1)

This is unexpected, given that usually, length and frequency are inversely correlated. However, in our current data, the correlations between logged frequency pmw and length are far from perfect: -0.619 for adjective 1 and -0.487 for adjective 2. Is there a significant difference as expected?

summary(x$ADJ1FREQ-x$ADJ2FREQ)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
## -1416.0   -55.0   287.5   430.1   836.5  2350.5     165
wilcox.test(x$ADJ1FREQ-x$ADJ2FREQ, mu=0, correct=FALSE, alternative="greater") # same as
##
##  Wilcoxon signed rank test
##
## data:  x$ADJ1FREQ - x$ADJ2FREQ
## V = 89502, p-value < 2.2e-16
## alternative hypothesis: true location is greater than 0
(qwe <- wilcox.test(x$ADJ1FREQ, x$ADJ2FREQ, paired=TRUE, correct=FALSE, alternative="greater"))
##
##  Wilcoxon signed rank test
##
## data:  x$ADJ1FREQ and x$ADJ2FREQ
## V = 89502, p-value < 2.2e-16
## alternative hypothesis: true location shift is greater than 0

A one-tailed Wilcoxon signed-rank test on the pairwise differences and the equivalent paired Wilcoxon test do show a significant difference in the expected direction of adjective 1 being more frequent (V=89501.5, p=4.4e-29).

2.4.3 Is there a Pollyanna effect?

We load the package syuzhet, which provides easy access to sentiment values:

library(syuzhet)

Then, we retrieve sentiment values for our adjectives and add them to the data frame for each of the two adjectives:

x$ADJ1SENT <- sapply(x$ADJ1, get_sent_values)
x$ADJ2SENT <- sapply(x$ADJ2, get_sent_values)

A quick monofactorial plot suggests that there is no significant Pollyanna effect:

plot(main="Adjective sentiment values", pch=16, col="#00000030",
   xlab="Sentiment score of adjective 1", xlim=c(-1.25, 1.25), x=jitter(x$ADJ1SENT, 3),
   ylab="Sentiment score of adjective 2", ylim=c(-1.25, 1.25), y=jitter(x$ADJ2SENT, 3))
   grid(); abline(0, 1)
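
For symmetry with the previous two subsections, one could again follow up the plot with a one-tailed Wilcoxon test; a sketch only (the alternative shown assumes the Pollyanna hypothesis that adjective 1 has the higher sentiment value):

wilcox.test(x$ADJ1SENT, x$ADJ2SENT, # compare the two sentiment columns
   paired=TRUE, correct=FALSE,      # pairwise, w/o continuity correction
   alternative="greater")           # one-tailed: adjective 1's values are higher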