Many ordering phenomena in English involve a tendency to have things that are supposedly easier to process precede things that are supposedly harder to process; two ways in which processing ease is often reflected are length (short things are considered easier to process than long ones) and frequency (frequent things are considered easier to process than rare(r) ones).
In this session, we will determine whether the ordering of two prenominal adjectives (as in nice green car) is compatible with the short-before-long and the frequent-before-rare(r) preferences; we will use the same BNC SGML files and the same overall BNC frequency list from session 8.
We need to

- define the corpus files,
- load each corpus file and set it to lower case,
- find all adjective-adjective-noun sequences and store them,
- extract the two adjectives from each match,
- compute the adjectives' lengths, and
- look up the adjectives' frequencies in the BNC frequency list.

The above tasks require the following functions:

- dir, maybe with grep;
- scan;
- tolower;
- grep;
- exact.matches.2 (and store the matches somehow/somewhere (<-));
- gsub or exact.matches.2 or strsplit;
- nchar;
- read.table;
- match.
Let's break this down some more:

- with dir (maybe with grep) or some file-choosing function we define the corpus files,
- we define a collector for the matches, and then
- we do a for-loop in which we
  - load each corpus file (scan),
  - set it to lower case (tolower),
  - keep only the sentence lines (grep),
  - retrieve all matches (exact.matches.2) and store them somehow/somewhere (<-);
- then we extract the adjectives (gsub or exact.matches.2 or strsplit),
- compute their lengths (nchar), and
- look up their frequencies (match).

First, we clear memory and load exact.matches.2:
);# clear memory
rm(list=ls(all=TRUE))
source("https://www.stgries.info/exact.matches.2.r") # get exact.matches.2
We define the corpus files:
corpus.files <- dir(
"files",
pattern="sgml_",
full.names=TRUE)[1:4]
We define a collector structure called all.AdjAdj.be4N for the results, i.e. one entry for each adjective-adjective-noun triple:
all.AdjAdj.be4N <- character()
Then, we loop over each file name, load the file (scan), and set it to lower case (tolower):
for (i in seq(corpus.files)) { # access each corpus file
# load each of the corpus files
current.corpus.file <- tolower( # make current.corpus.file the lower case
scan( # of what you load
corpus.files[i], # from the i-th corpus path
what=character(), # which is a file with character strings
sep="\n", # separated by line breaks,
quote="", # with no quote characters and
comment.char="", # no comment characters
quiet=TRUE)) # suppress feedback
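As an aside (to be run outside the loop), scan with sep="\n" behaves much like readLines; a minimal sketch to see this, with the variable names x and y invented here (note that scan skips blank lines by default, blank.lines.skip=TRUE):
x <- scan(corpus.files[1], what=character(), sep="\n", quote="", comment.char="", quiet=TRUE)
y <- readLines(corpus.files[1])
identical(x, y[nzchar(y)]) # TRUE if blank lines were the only difference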
Within the loop, we use grep to find only the sentence lines in the files:
# use only the sentence-tagged lines of the corpus file
current.sentences <- grep( # find
"<s n=", # the sentence number tags
current.corpus.file, # in current.corpus.file
perl=TRUE, # using Perl-compatible regular expressions
value=TRUE) # retrieve the whole line
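To see what value=TRUE buys us, here is a tiny invented example (the vector toy is not part of the session data):
toy <- c("<teiheader>", "<s n=1><w aj0>nice", "<s n=2><w aj0>old")
grep("<s n=", toy) # without value=TRUE: the positions of the matches
## [1] 2 3
grep("<s n=", toy, value=TRUE) # with value=TRUE: the matching lines themselves
## [1] "<s n=1><w aj0>nice" "<s n=2><w aj0>old"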
We use gsub to delete all tags that are not word or punctuation mark tags:
# filter out unwanted annotation
current.sentences <- gsub( # make current.sentences the result of replacing
"(?x) # set free-spacing
< # an opening angular bracket
(?! # after which there is NOT ------------+
[wc]\\s # a w or c followed by a space |
(...|...-...|n=.*?) # some POS or sentence number tag |
) # end of after which there is NOT -----+
.*?>", # but after which there is anything else
"", # (replacing all this) by nothing
current.sentences, # in current.sentences
perl=TRUE) # using Perl-compatible regular expressions
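To see what this regex does, consider an invented mini-sentence (not from the corpus): the sentence-number tag is deleted while the word and punctuation tags survive:
toy <- "<s n=00001><w aj0>nice <w aj0>green <w nn1>car<c pun>."
gsub("(?x) < (?! [wc]\\s (...|...-...|n=.*?) ) .*?>", "", toy, perl=TRUE)
## [1] "<w aj0>nice <w aj0>green <w nn1>car<c pun>."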
Then, we retrieve all matches (exact.matches.2) and …
# retrieve all adjective-adjective-before-noun matches with tags
current.all.AdjAdj.be4N <- exact.matches.2( # look for
"(?x) # set free-spacing
<w\\saj.> # a 1st adjective tag
[^<]*? # the 1st adjective
<w\\saj.> # a 2nd adjective tag
[^<]* # the 2nd adjective
(?=<w\\sn..>)", # before the noun tag
current.sentences)[[1]] # in current.sentences; keep only the exact matches
# very efficient regex:
# (?<=<w aj0>)[^<]+<w aj0>[^<]+?(?= ?<w n)
… and add them to all.AdjAdj.be4N:
# add to previous matches of adjective doublets
all.AdjAdj.be4N <- c(all.AdjAdj.be4N, current.all.AdjAdj.be4N)
cat("\f", i/length(corpus.files)) # output to the screen the % of files dealt w/ now
} # end of for: access each corpus file
## 0.25 0.5 0.75 1
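A design note: growing all.AdjAdj.be4N with c() inside the loop copies the vector on every iteration, which is fine for four files but slow for thousands. An alternative is to collect per-file results in a pre-sized list and flatten once at the end; a minimal sketch of that pattern with invented toy data:
parts <- vector("list", 3) # one slot per 'file'
for (i in 1:3) {
   parts[[i]] <- letters[1:i] # stand-in for the per-file matches
}
unlist(parts, use.names=FALSE) # flatten once, after the loop
## [1] "a" "a" "b" "a" "b" "c"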
Let’s check the results:
object.size(all.AdjAdj.be4N)
## 54208 bytes
head(all.AdjAdj.be4N)
## [1] "<w aj0>one-time <w aj0>polish " "<w aj0>poor <w aj0>old "
## [3] "<w aj0>small <w aj0>unworthy " "<w aj0>northern <w aj0>anglian "
## [5] "<w aj0>little <w aj0>old " "<w aj0>thick <w aj0>woollen "
From this output, we now need to extract the two adjectives in parallel/aligned vectors. We can do this with searching, splitting, or replacing, depending on which regex we find easiest to write.
From all.AdjAdj.be4N, we extract the adjectives; this is how to do this w/ exact.matches.2:
# pick out the adjective when 1 adjective tag is on the left & another is on the right
adj.1.e <- trimws(exact.matches.2(
"(?<=<w aj.>)[^<]+(?= <w aj)",
all.AdjAdj.be4N)[[1]])
# pick out the adjective when 1 adjective tag is on the left & a noun tag is on the right
adj.2.e <- trimws(exact.matches.2(
"(?<= <w aj.>).*",
all.AdjAdj.be4N)[[1]])
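The lookbehind and lookahead only check the context without consuming it, so the match is just the adjective; a quick check on one invented triple:
# toy example (invented string); should return just "poor"
exact.matches.2("(?<=<w aj.>)[^<]+(?= <w aj)", "<w aj0>poor <w aj0>old ")[[1]]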
From all.AdjAdj.be4N, we extract the adjectives; this is how to do this w/ strsplit:
splitty <- strsplit( # make splitty the result of splitting
all.AdjAdj.be4N, # the triples
" ?<w ...>", # at the tags
perl=TRUE) # using Perl-compatible regular expressions
adj.1.s <- sapply( # make adj.1.s the result of applying
splitty, # to splitty
"[", # the extraction/subsetting function
2) # always picking the 2nd thing
adj.1.s <- trimws(adj.1.s) # trim remaining whitespace
adj.2.s <- sapply( # make adj.2.s the result of applying
splitty, # to splitty
"[", # the extraction/subsetting function
3) # always picking the 3rd thing
adj.2.s <- trimws(adj.2.s) # trim remaining whitespace
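To see what sapply(..., "[", k) does, here is a tiny invented example: "[" is the subsetting function, so this picks the k-th element of each list component:
sapply(strsplit(c("a b c", "d e f"), " "), "[", 2)
## [1] "b" "e"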
From all.AdjAdj.be4N, we extract the adjectives; this is how to do this w/ gsub:
adj.1.g <- trimws(gsub( # make adj.1 the result of replacing
"^<.*?>", # the tag at the beginning
"", # with nothing
gsub( # in what you get when replacing
" <w a.*", # the first adj tag that has a space in front of it till the end
"", # with nothing
all.AdjAdj.be4N, # in all.AdjAdj.be4N
perl=TRUE), # using Perl-compatible regular expressions
perl=TRUE)) # using Perl-compatible regular expressions
adj.2.g <- trimws(gsub( # make adj.2 the result of replacing
" <w .*$", # a word tag till the end
"", # with nothing
gsub( # in what you get when replacing
"^.*aj.>", # from the beginning till the last adj tag
"", # with nothing
all.AdjAdj.be4N, # in all.AdjAdj.be4N
perl=TRUE), # using Perl-compatible regular expressions
perl=TRUE)) # using Perl-compatible regular expressions
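All three strategies target the same adjectives, so a quick sanity check that they agree:
# all three extraction strategies should yield identical vectors
all(adj.1.e==adj.1.s, adj.1.s==adj.1.g,
    adj.2.e==adj.2.s, adj.2.s==adj.2.g) # should be TRUE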
Let’s check the results:
head(data.frame(
ORIG=all.AdjAdj.be4N,
ADJ1=adj.1.e,
ADJ2=adj.2.e),
10)
## ORIG ADJ1 ADJ2
## 1 <w aj0>one-time <w aj0>polish one-time polish
## 2 <w aj0>poor <w aj0>old poor old
## 3 <w aj0>small <w aj0>unworthy small unworthy
## 4 <w aj0>northern <w aj0>anglian northern anglian
## 5 <w aj0>little <w aj0>old little old
## 6 <w aj0>thick <w aj0>woollen thick woollen
## 7 <w aj0>vertical <w aj0>boring vertical boring
## 8 <w aj0>gross <w aj0>domestic gross domestic
## 9 <w aj0>new <w ajc>younger new younger
## 10 <w ajs>clearest <w aj0>possible clearest possible
Let’s compute the lengths of the adjectives:
summary(adj.1.lengths <- nchar(adj.1.e))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.000 5.000 6.000 6.792 8.000 18.000
summary(adj.2.lengths <- nchar(adj.2.e))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.000 6.000 8.000 7.901 9.000 17.000
summary(length.diffs <- adj.1.lengths-adj.2.lengths)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -10.000 -3.000 -1.000 -1.109 1.000 9.000
hist(length.diffs)
abline(v=mean(length.diffs), lty=2)
A better plot:
plot(main="Adjective lengths dep. on position", type="n", axes=FALSE, # plot nothing with this heading
xlab="Adjective position", xlim=c(1,2), x=0, # this x-axis
ylab="Adjective length", ylim=c(0, 20), y=0) # this y-axis
grid(); axis(1, at=1:2); axis(2) # add a grid and some axes
segments(1, adj.1.lengths, # plot line segments from the left (1) and adjective length 1
2, adj.2.lengths, # to the right (2) and adjective length 2
col=ifelse(adj.1.lengths<adj.2.lengths, # if 1 < 2
"#0000FF30", # make the arrow blue
"#FF000030")) # otherwise red
segments(1, mean(adj.1.lengths), # plot the overall mean length of adj 1
2, mean(adj.2.lengths), # and the overall mean length of adj 2
lwd=4) # with a heavy/bold line
text(1.5, 20, wilcox.test(adj.1.lengths, adj.2.lengths, alternative="less")$p.value, pos=1, col="grey")
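For reference, the p-value printed into the plot comes from this one-sided test (asking whether the first adjectives tend to be shorter than the second ones):
wilcox.test(adj.1.lengths, adj.2.lengths, alternative="less")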
There is a weak short-before-long preference.
We load the complete frequency list of the whole British National Corpus:
summary(freq.list.all <- read.table( # make freq.list.all the result of reading
"files/corp_bnc_freql.txt", # this file
header=TRUE, # which has a header in the 1st row
sep=" ", # uses spaces as column separators
quote="", # uses no quotes
comment.char="")) # & no comments
## FREQUENCY WORD POS FILES
## Min. : 1 Length:938971 Length:938971 Min. : 1.0
## 1st Qu.: 1 Class :character Class :character 1st Qu.: 1.0
## Median : 1 Mode :character Mode :character Median : 1.0
## Mean : 107 Mean : 18.4
## 3rd Qu.: 4 3rd Qu.: 3.0
## Max. :6187267 Max. :4120.0
head(freq.list.all) # check the input
## FREQUENCY WORD POS FILES
## 1 1 !*?* unc 1
## 2 602 % nn0 113
## 3 1 %/100 unc 1
## 4 3 %/day unc 1
## 5 1 %295 unc 1
## 6 1 %5,000 unc 1
object.size(freq.list.all) # 64,280,352
## 64280352 bytes
That's a big object to keep in memory, so let's reduce the memory footprint to what we need, namely only the adjectives (which will also make things faster):
where.are.the.adjs <- grep( # find
"^aj\\d$", # adjective tags
freq.list.all$POS, # in the POS column
perl=TRUE) # using Perl-compatible regular expressions
summary(freq.list.a <- freq.list.all[ # make freq.list.a freq.list.all, but only
where.are.the.adjs, # the rows with adjectives
-3] # and not the POS column anymore
)
## FREQUENCY WORD FILES
## Min. : 1.00 Length:107657 Min. : 1.00
## 1st Qu.: 1.00 Class :character 1st Qu.: 1.00
## Median : 1.00 Mode :character Median : 1.00
## Mean : 58.19 Mean : 19.19
## 3rd Qu.: 4.00 3rd Qu.: 3.00
## Max. :129451.00 Max. :3910.00
object.size(freq.list.a) # 9,138,440 # less than 15% of original
## 9138440 bytes
rm(freq.list.all)
Let’s find out how frequent the adjectives are and also right away how dispersed they are:
# make adj.N.freqs/disps the FREQUENCY/FILES values from the BNC list for the rows where the N-th adjective matches the BNC word
adj.1.freqs <- freq.list.a$FREQUENCY[match(adj.1.e, freq.list.a$WORD)]
adj.2.freqs <- freq.list.a$FREQUENCY[match(adj.2.e, freq.list.a$WORD)]
adj.1.disps <- freq.list.a$FILES[ match(adj.1.e, freq.list.a$WORD)]
adj.2.disps <- freq.list.a$FILES[ match(adj.2.e, freq.list.a$WORD)]
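Recall how match works: for each element of its first argument, it returns the position of that element's first occurrence in the second argument, or NA if there is none; a tiny invented example:
match(c("old", "nice", "unattested"), c("nice", "green", "old"))
## [1] 3 1 NA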
Let’s see what the results are for the (logged) frequencies:
summary(freq.diffs <- log2(adj.1.freqs)-log2(adj.2.freqs))
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## -15.1022 -1.3586 0.7904 0.9499 3.2553 14.4230 45
hist(freq.diffs, xlim=c(-17, 17))
abline(v=mean(freq.diffs, na.rm=TRUE), lty=2) # dashed line at the mean
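Analogously to the length analysis above, one could follow this up with a one-sample signed-rank test of whether the logged frequency differences exceed 0, i.e. whether the first adjective tends to be the more frequent one; a sketch (wilcox.test discards the NA cases automatically):
wilcox.test(freq.diffs, mu=0, alternative="greater")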