In the past weeks, we have been dealing with the four basic
corpus-linguistic methods – frequency lists, dispersion,
association/keywords, and concordancing – and we have been doing that
from the perspective of base R. In this final script, we are going to
explore a few things that go beyond that and that synthesize several
things we have done already with functionality that other packages
offer; that of course means that I cannot discuss these other packages
much – the point here is to give you a flavor of a more comprehensive
application. What we will do is a explore the question of which of
several factors determine the order of two adjectives used prenominally
in the writing of learners of English in the (untagged) Swedish
component of ICLE.
In addition, we will also aim to make the code a bit more readable using
the pipe operator %>%
from the package
dplyr
.
The factors we will study are
Finally, we will quickly check whether the presence of any such effects might be related to the proficiency of the learners who wrote the essays in the corpus; that proficiency will be operationalized via
We will proceed as follows:
NLP
and openNLP
);exact.matches.2
) plus we will do a bit of post-processing
the data to get them into shape for the subsequent analyses;nchar
);syuzhet
).quanteda
) to see whether those are related to the
above predictors.We clear memory and load the first things we need:
rm(list=ls(all=TRUE))
library(dplyr)
source("https://www.stgries.info/exact.matches.2.r") # get exact.matches.2
We unzip and define the corpus files from the learner corpus:
unzip("files/ICLE2-SW.zip", # unzip this zip archive
exdir="files") # into the files folder/directory
corpus.files <- dir( # make tag.sources the content of the directory
"files/ICLE2-SW", # <files/ICLE2-SW>
recursive=TRUE, # browse all sub-folders
full.names=TRUE) # keep full path information
To tag the corpus texts and immediately also do the concordancing, we first set up a cluster of cores/threads:
library(doParallel) # load the required library
registerDoParallel( # register a cluster that R knows about
cl <- # under the name cl
makePSOCKcluster( # which is a cluster that
detectCores()-1, # will utilize all cores/threads you have available but 1
outfile="")) # but still returns printed progress reports to the main console
Then, we employ the following multi-step strategy: we
openNLP
and NLP
;scan
) and convert
to a string (in the NLP
sense);sprintf
) and merge them into a new tagged corpus file (w/
paste
);exact.matches.2
to find adjective-adjective-noun
sequences with context;all.matches
:library(NLP); library(openNLP) # load the required libraries
system.time({ # time the following process
all.matches <- # make tagged.corpus the result of doing w/ a cluster
foreach(i=seq(corpus.files),
.combine=c,
.packages=c("dplyr", "NLP", "openNLP")
) %dopar% { cat(".")
curr.string <- # make curr.string the result of
scan(corpus.files[i], what=character(), sep="\n", # loading the current file
quote="", comment.char="", quiet=TRUE) %>% # w/ the usual corpus file loading settings
"["(-1) %>% # removing the first element
as.String # making it a 'string'
# generate annotators that openNLP requires
sent_token_annotator <- Maxent_Sent_Token_Annotator()
word_token_annotator <- Maxent_Word_Token_Annotator()
pos_tag_annotator <- Maxent_POS_Tag_Annotator()
# assign part-of-speech annotation
pos.annotation <- annotate(
curr.string,
pos_tag_annotator,
annotate(
curr.string,
list(sent_token_annotator,
word_token_annotator)))
# generate words & tags
curr.words.pointer <- subset(pos.annotation, type=="word")
curr.tags <- sapply(curr.words.pointer$features, "[[", "POS")
curr.words <- curr.string[curr.words.pointer]
# generate tagged corpus file:
sprintf("<%s>%s", # put angular brackets 'thing'1 immediately before 'thing'2:
curr.tags, # ;thing; 1
curr.words) %>% # 'thing; 2
paste(collapse=" ") -> # paste together all word-tag pairs and dump the string into
current.tagged.file # current.tagged.file
# retrieve Adj-Adj-N sequences, ...
curr.matches <- exact.matches.2(
"(?x) # free-spacing
<JJ[RS]?> # something tagged as an adjective
[^<]+ # that adjective
<JJ[RS]?> # something tagged as an adjective
[^<]+ # that adjective
<N[^>]+?> # something tagged as a noun
[^<]+ # that noun
",
current.tagged.file, # in current.tagged.file
characters.around=250)[[4]] # return up to 250 characters around the match
if (length(curr.matches)>0) { # if there are matches,
curr.matches <- # make curr.matches
paste(basename(corpus.files[i]), # the result of prefixing the file name
curr.matches, # in front of curr.matches
sep="\t") # with a tabstop in-between
}
curr.matches # 'return' to all.matches
}
})
## user system elapsed
## 0.266 0.031 39.161
stopCluster(cl)
detach(package:doParallel); detach(package:NLP); detach(package:openNLP)
invisible(gc())
We first output the results into a spreadsheet-like tab-delimited
.csv file with the usual kind of cat
application:
cat("CASE\tFILE\tPRECEDING\tMATCH\tSUBSEQUENT",
paste(seq(all.matches),
all.matches,
sep="\t"),
sep="\n",
file="120_10_conc1.csv")
But then, we load it again into a data frame x
for some
post-processing (w/ read.table
):
summary(x <- read.table(
"120_10_conc1.csv",
header=TRUE, sep="\t",
quote="", comment.char="", stringsAsFactors=FALSE))
## CASE FILE PRECEDING MATCH
## Min. : 1.0 Length:639 Length:639 Length:639
## 1st Qu.:160.5 Class :character Class :character Class :character
## Median :320.0 Mode :character Mode :character Mode :character
## Mean :320.0
## 3rd Qu.:479.5
## Max. :639.0
## SUBSEQUENT
## Length:639
## Class :character
## Mode :character
##
##
##
Specifically, we want to get rid of the part-of-speech tags that we now don’t need anymore, plus it would be nice to have easy access to the adjectives. Thus, we first delete the tags in the preceding and subsequent context, where this is straightforward:
x$PRECEDING <- gsub( # make x$PRECEDING the result of replacing,
"(?x) # with free-spacing,
< # an opening angular bracket
[^>]*? # 0 or more characters that are not closing angular brackets, till
>", # the closing angular bracket
"", # with nothing
x$PRECEDING, # in x$PRECEDING
perl=TRUE) # using Perl-compatible regular expressions
x$SUBSEQUENT <- gsub("<[^>]*?>", "", x$SUBSEQUENT, perl=TRUE) # and the same for the subsequent context
Now we do the same for the column x$MATCH
, but with a
slight twist:
x$MATCH <- gsub("^<[^>]*?>", "", x$MATCH ,perl=TRUE)
x$MATCH <- gsub(" ?<[^>]*?>", "\t", x$MATCH ,perl=TRUE)
names(x)[4] <- "ADJ1\tADJ2\tN"
After that, we save the file again so that now we have a neatly organized spreadsheet for further processing:
write.table(x, file="120_10_conc2.csv",
quote=FALSE, row.names=FALSE, sep="\t")
Now we explore the effects of the above-mentioned predictors. We load the file again so we have the right columns lined up:
str(x <- read.table(
"120_10_conc2.csv",
header=TRUE, sep="\t",
quote="", comment.char=""))
## 'data.frame': 639 obs. of 7 variables:
## $ CASE : int 1 2 3 4 5 6 7 8 9 10 ...
## $ FILE : chr "SWUG2001.txt" "SWUG2002.txt" "SWUG2003.txt" "SWUG2003.txt" ...
## $ PRECEDING : chr "will be seen as a \"problem \" or not . The question is then , should we fit them in on an integrating or an as"| __truncated__ "BS>most polluted capital in the world . But who cares when Thailand 's economic growth is on the increase and i"| __truncated__ "y should share their jobs , their social security benefits etc . with foreigners . Many Swedes ask themselves w"| __truncated__ "So between a choice of integration and assimilation I would say that integration is the best way to go , even t"| __truncated__ ...
## $ ADJ1 : chr "lazy" "great" "swedish" "different" ...
## $ ADJ2 : chr "uninterested" "many" "cultural" "cultural" ...
## $ N : chr "Swede" "people" "heritage" "groups" ...
## $ SUBSEQUENT: chr "would probably go for the latter . Because would it not be lovely if everybody behaved like us ? We could go on"| __truncated__ "in Bangkok can consume more than they could earlier , eventhough they need to wear muzzles to protect themselve"| __truncated__ "if they keep on integrating foreigners into Swedish society ? Why should native Swede have to adapt to a Turk '"| __truncated__ "in society . One should also keep in mind that Sweden is not an isolated island and that it has always thrived "| __truncated__ ...
Obviously, we begin by computing the lengths of the adjectives in characters:
x$ADJ1LENGTH <- nchar(as.character(x$ADJ1))
x$ADJ2LENGTH <- nchar(as.character(x$ADJ2))
A quick monofactorial plot suggests that there is a significant length effect: there seems to be a greater number of cases where adjective 2 is longer than adjective 1:
plot(main="Adjective lengths before the N", pch=16, col="#00000030",
xlab="Length of adjective 1 in chars", xlim=c(1, 17), x=jitter(x$ADJ1LENGTH, 1.5),
ylab="Length of adjective 2 in chars", ylim=c(1, 17), y=jitter(x$ADJ2LENGTH, 1.5))
grid(); abline(0, 1)
But is it significant? How do we even check this? Back to 104: a one-tailed Wilcoxon-test or the equivalent one-sample signed rank test – do you even remember those?
summary(x$ADJ1LENGTH-x$ADJ2LENGTH)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -11.000 -4.000 -2.000 -1.759 1.000 10.000
wilcox.test(x$ADJ1LENGTH-x$ADJ2LENGTH, mu=0, correct=FALSE, alternative="less") # same as
##
## Wilcoxon signed rank test
##
## data: x$ADJ1LENGTH - x$ADJ2LENGTH
## V = 36245, p-value < 2.2e-16
## alternative hypothesis: true location is less than 0
(qwe <- wilcox.test(x$ADJ1LENGTH, x$ADJ2LENGTH, paired=TRUE, correct=FALSE, alternative="less"))
##
## Wilcoxon signed rank test
##
## data: x$ADJ1LENGTH and x$ADJ2LENGTH
## V = 36245, p-value < 2.2e-16
## alternative hypothesis: true location shift is less than 0
There is a significant difference such that adjective 2 is longer (V=3.6245^{4}, p=3.8800048^{-37}).
We load two frequency lists that we created before in the course: one from the Frown corpus and one from the FLOB corpus:
freq.frown <- read.delim("120_06_freq_frown.csv", stringsAsFactors=TRUE)
freq.flob <- read.delim("120_06_freq_flob.csv" , stringsAsFactors=TRUE)
We merge the two data frames, each of which represents a frequency
per 1 million words, into one (w/ rbind
) and then compute
the combined frequencies per million words by average:
freqs <- rbind(freq.frown, freq.flob)
freqs.pmw <- tapply(
freqs$FREQ,
freqs$WORD,
sum) / 2
Then, we do OCD housekeeping and delete the original frequency lists, but add the frequencies we just generated to the data frame for each of the two adjectives:
rm(freq.frown, freq.flob)
x$ADJ1FREQ <- freqs.pmw[x$ADJ1]
x$ADJ2FREQ <- freqs.pmw[x$ADJ2]
A quick monofactorial plot seems to suggest that there is no significant frequency effect:
plot(main="Adjective frequencies before the N", pch=16, col="#00000030",
xlab="Freq pmw (log2ed) of adjective 1", xlim=c(-1.5, 15.5), x=jitter(log2(x$ADJ1FREQ)),
ylab="Freq pmw (log2ed) of adjective 2", ylim=c(-1.5, 15.5), y=jitter(log2(x$ADJ2FREQ)))
grid(); abline(0, 1)
This is unexpected, given that usually, length and frequency are inversely correlated. However, in our current data, the correlations between logged frequency pmw and length are surprisingly small: -0.619 for adjective 1 and -0.487 for adjective 1. Is there a significant difference as expected?
summary(x$ADJ1FREQ-x$ADJ2FREQ)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## -1416.0 -55.0 287.5 430.1 836.5 2350.5 165
wilcox.test(x$ADJ1FREQ-x$ADJ2FREQ, mu=0, correct=FALSE, alternative="greater") # same as
##
## Wilcoxon signed rank test
##
## data: x$ADJ1FREQ - x$ADJ2FREQ
## V = 89502, p-value < 2.2e-16
## alternative hypothesis: true location is greater than 0
(qwe <- wilcox.test(x$ADJ1FREQ, x$ADJ2FREQ, paired=TRUE, correct=FALSE, alternative="greater"))
##
## Wilcoxon signed rank test
##
## data: x$ADJ1FREQ and x$ADJ2FREQ
## V = 89502, p-value < 2.2e-16
## alternative hypothesis: true location shift is greater than 0
A one-tailed Wilcoxon-test or the equivalent one-sample signed rank test show no significant difference in the expected direction of adjective 1 being more frequent (V=8.95015^{4}, p=4.3954769^{-29}).
We load the package syuzhet, which provides easy access to sentiment values:
library(syuzhet)
Then, we retrieve sentiment values for our adjectives and add them to the data frame for each of the two adjectives:
x$ADJ1SENT <- sapply(x$ADJ1, get_sent_values)
x$ADJ2SENT <- sapply(x$ADJ2, get_sent_values)
A quick monofactorial plot suggests that there is no significant Polyanna effect:
plot(main="Adjective sentiment values", pch=16, col="#00000030",
xlab="Sentiment score of adjective 1", xlim=c(-1.25, 1.25), x=jitter(x$ADJ1SENT, 3),
ylab="Sentiment score of adjective 2", ylim=c(-1.25, 1.25), y=jitter(x$ADJ2SENT, 3))
grid(); abline(0, 1)