Much of corpus linguistics is based on the distributional hypothesis, here in the form provided by Harris (1970:785f.):
[i]f we consider words or morphemes A and B to be more different in meaning than A and C, then we will often find that the distributions of A and B are more different than the distributions of A and C. […] difference of meaning correlates with difference of distribution.
In other words, the idea behind this is that distributional similarity reflects functional similarity – semantic, discourse-functional, or other kinds of similarity. That implies that words which are semantically similar tend to occur in similar lexical and grammatical contexts. For instance, the collocates of the word cat – the words you find ‘around’ it – will be more similar to the words you find around the word dog than to the collocates of the word ethereal. This distributional hypothesis has been applied particularly often in studies of near synonymy, i.e. to sets of words with extremely similar meanings/functions. Would you be able to explain to a learner of English when to use fast vs. quick vs. rapid vs. swift? When to use fatal vs. lethal vs. deadly vs. mortal? Most likely not … The way a corpus linguist would try to tease these synonyms apart is via their collocates; with adjectives like these, maybe specifically by looking at the nouns they modify within a noun phrase. That is what we will do here for two pairs of -ic/-ical adjectives, electric(al) and historic(al).
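Here is a toy sketch of that idea in R – the collocate sets are invented, purely for illustration:
# invented collocate sets, purely to illustrate the distributional hypothesis
collocates.cat      <- c("pet", "fur", "food", "owner", "tail")
collocates.dog      <- c("pet", "fur", "food", "owner", "bark")
collocates.ethereal <- c("beauty", "glow", "music", "presence", "quality")
length(intersect(collocates.cat, collocates.dog))      # 4 shared collocates
length(intersect(collocates.cat, collocates.ethereal)) # 0 shared collocates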
For our case study, this means we need to

1. define the corpus files to be searched (dir, maybe with grep);
2. process each file in a for-loop, where we load the file (scan), set it to lower case (tolower), keep only the sentence lines (grep), retrieve all matches (exact.matches.2), and store them somehow/somewhere (<-);
3. extract the adjectives and their noun collocates from the matches (gsub or exact.matches.2);
4. cross-tabulate adjectives and nouns and quantify the adjectives’ preferences (table and math functions; this should remind you of the difference coefficient/(log) odds ratio part from last week);
5. load the frequency list of the BNC (read.table), cross-reference the noun collocates (match), and compute grouped medians (tapply and median);
6. maybe plot the results (plot and text).

# clear memory
rm(list=ls(all=TRUE))
source("https://www.stgries.info/exact.matches.2.r") # get exact.matches.2
We define the corpus files:
corpus.files <- dir( # make corpus.files the relevant content
"files", # of the directory "files",
pattern="sgml_", # namely the file names containing "sgml_",
full.names=TRUE)[1:4] # with their paths; use only the first 4 files
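A quick sanity check of what was picked up (the exact names depend on your local files directory):
basename(corpus.files) # show just the four file names, without their paths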
We define a collector structure for the results, an empty character vector:
all.matches <- character()
Then, we loop over each file name, load the file (scan), and set it to lower case (tolower):
for (i in seq(corpus.files)) { # access each corpus file
# load each of the corpus files
current.corpus.file <- tolower( # make current.corpus.file the lower case
scan( # of what you load
corpus.files[i], # from the i-th corpus path
what=character(), # which is a file with character strings
sep="\n", # separated by line breaks,
quote="", # with no quote characters and
comment.char="", # no comment characters
quiet=TRUE)) # suppress feedback
Then, we use grep to find only the sentence lines in the files:
# use only the sentence-tagged lines of the corpus file
current.sentences <- grep( # find
"<s n=", # the sentence number tags
current.corpus.file, # in current.corpus.file
perl=TRUE, # using Perl-compatible regular expressions
value=TRUE) # retrieve the whole line
Next, we use gsub to delete all tags that are not word or punctuation-mark tags:
# filter out unwanted annotation
current.sentences <- gsub("(?x) # make current.sentences the result of replacing
< # an opening angular bracket
(?! # after which there is NOT ------------+
[wc]\\s # a w or c followed by a space |
(...|...-...|n=.*?) # some POS or sentence number tag |
) # end of after which there is NOT -----+
.*?>", # but after which there is anything else
"", # (replacing all this) by nothing
current.sentences, # in current.sentences
perl=TRUE) # using Perl-compatible regular expressions
# alternative search expression: "<(?![wc] (...|...-...)).*?>[^<]*"
Then, we retrieve all matches with exact.matches.2:
# retrieve all matches for each -ic/-ical pair with tags
current.matches <- exact.matches.2( # look for
"(?x) # set free-spacing
<w\\s(aj0|aj0-...)> # an adjective tag (possibly as a portmanteau tag)
(elect|histo)ric(al)? # electric(al)? or historic(al)?
\\s # a space
<w\\s(n..|n..-...)> # a noun tag (possibly as a portmanteau tag)
[^<]+", #
current.sentences)[[1]] # in current.sentences. save only exact matches
# add to previous matches
all.matches <- c(all.matches, current.matches) # collect
cat("\f", i/length(corpus.files)) # output to the screen the % of files dealt w/ now
} # end of for: access each corpus file
## 0.25 0.5 0.75 1
object.size(all.matches)
## 18296 bytes
We extract the adjectives from all.matches:
all.adjectives <- # make all.adjectives the result
sub("<.*?>([^<]+) <.*", # of replacing the whole string, but memorize stuff between ">" and " <"
"\\1", # with the memorized stuff
all.matches, # in all.matches
perl=TRUE) # using Perl-compatible regular expressions
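If the regular expression seems opaque, here is what the sub call does to a single made-up match (a hypothetical example string following the tag format above):
sub("<.*?>([^<]+) <.*", # in a made-up tagged match,
"\\1", # keep only the memorized part
"<w aj0>electric <w nn1>fire", # (a hypothetical example string)
perl=TRUE) # this returns "electric"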
We extract the nouns from all.matches:
all.nouns <- trimws( # make all.nouns the result of trimming whitespace from
sub("^.*>", # what you get when you replace everything till the last ">"
"", # with nothing
all.matches, # in all.matches
perl=TRUE)) # using Perl-compatible regular expressions
Alternatively, you could have done this:
qwe <- strsplit(all.matches, "<.*?>", perl=TRUE) # split up each match at the tags
all.adjectives <- trimws(sapply(qwe, "[", 2)) # part 2 of each split: the adjective
all.nouns <- trimws(sapply(qwe, "[", 3)) # part 3 of each split: the noun
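To see why parts 2 and 3 are the ones we want, consider what strsplit returns for a single made-up match (again a hypothetical example string):
strsplit("<w aj0>electric <w nn1>fire", "<.*?>", perl=TRUE)
# returns list(c("", "electric ", "fire")): part 1 is the empty string before
# the first tag, part 2 the adjective (note the trailing space), part 3 the noun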
Let’s compile and check the results:
results <- data.frame( # makes results a data frame w/
ADJECTIVES=all.adjectives, # a column with all the adjectives
NOUNS =all.nouns) # a column with all the nouns
results.split <- split( # makes results.split a list by splitting up
results, # results
substr(results$ADJECTIVES, 1, 3)) # dep. on the 1st 3 chars of the adj.
lapply( # apply to each element of
results.split, # the split-up list
head, # the function head &
10) # show the first 10 rows
## $ele
## ADJECTIVES NOUNS
## 1 electric fire
## 3 electric windows
## 4 electric motor
## 5 electric cookers
## 8 electric blanket
## 9 electric heating
## 11 electric company
## 12 electric shock
## 14 electric board
## 16 electric house
##
## $his
## ADJECTIVES NOUNS
## 2 historic town
## 6 historic towns
## 7 historic core
## 10 historic city
## 13 historic organization
## 15 historic status
## 24 historic truth
## 29 historic wildlife
## 33 historic trends
## 40 historical awareness
Let’s compute a term-by-adjective matrix for each adjective pair:
# for electric(al)
dim(tam.ele <- table( # show the dimensions of the table tam.ele, from tabulating
results.split$ele$NOUNS, # all nouns after electric(al)
results.split$ele$ADJECTIVES)) # both electric(al) adjectives
## [1] 89 2
# for historic(al)
dim(tam.his <- table( # show the dimensions of the table tam.his, from tabulating
results.split$his$NOUNS, # all nouns after historic(al)
results.split$his$ADJECTIVES)) # both historic(al) adjectives
## [1] 56 2
# sort the nouns by their preference for electric, then by frequency; take a peek
head(tam.ele <- tam.ele[order(tam.ele[,1]/rowSums(tam.ele), -rowSums(tam.ele)),])
##
## electric electrical
## energy 0 5
## fault 0 5
## apparatus 0 4
## desk 0 4
## goods 0 4
## supplier 0 4
# same, for historic(al)
head(tam.his <- tam.his[order(tam.his[,1]/rowSums(tam.his), -rowSums(tam.his)),])
##
## historic historical
## account 0 4
## context 0 4
## data 0 4
## event 0 4
## figure 0 4
## significance 0 3
We compute what we know well by now, the logged odds ratios:
# for electric(al)
numer.ele <- (tam.ele[,1]+0.5)/(tam.ele[,2]+0.5) # compute pairwise ratios between columns
denom.ele <- sum(tam.ele[,1])/sum(tam.ele[,2])
summary(logged.ors.ele <- log(numer.ele/denom.ele))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -2.56413 -1.77567 0.93238 0.03467 0.93238 2.54181
# for historic(al)
numer.his <- (tam.his[,1]+0.5)/(tam.his[,2]+0.5) # compute pairwise ratios between columns
denom.his <- sum(tam.his[,1])/sum(tam.his[,2])
summary(logged.ors.his <- log(numer.his/denom.his))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -1.5294 -0.9416 -0.9416 0.1782 1.7664 3.5010
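To make sure we see what one of these values means, here is a minimal sketch recomputing the logged odds ratio for a single noun by hand (assuming tam.ele and logged.ors.ele from above are still in memory):
log( # the logged odds ratio for the noun "energy":
((tam.ele["energy",1]+0.5) / (tam.ele["energy",2]+0.5)) / # its smoothed electric:electrical ratio
(sum(tam.ele[,1]) / sum(tam.ele[,2]))) # divided by the overall electric:electrical ratio
# this returns the same value as logged.ors.ele["energy"], i.e. the minimum
# of the summary above (-2.5641)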
Because our data set is so small, the usual plot we might consider suffers from so much overplotting that it would be useless. For now, we just sort by the association, i.e. the logged odds ratio values:
sort(logged.ors.ele)
## energy fault apparatus desk goods supplier
## -2.56413069 -2.56413069 -2.36346000 -2.36346000 -2.36346000 -2.36346000
## systems appliance charge charges circuits contractor
## -2.36346000 -2.11214557 -1.77567333 -1.77567333 -1.77567333 -1.77567333
## engineer engineering goodies impulses insulators panels
## -1.77567333 -1.77567333 -1.77567333 -1.77567333 -1.77567333 -1.77567333
## pole potential products register shops staff
## -1.77567333 -1.77567333 -1.77567333 -1.77567333 -1.77567333 -1.77567333
## theory things tracing wires checks gear
## -1.77567333 -1.77567333 -1.77567333 -1.77567333 -1.26484771 -1.26484771
## trade appliances equipment meter power cars
## -1.26484771 -1.26484771 -1.26484771 -0.67706104 -0.67706104 0.03443528
## shock fire cable cables chair circuit
## 0.68106244 0.93237687 0.93237687 0.93237687 0.93237687 0.93237687
## clock cookers cutter fibres field flymo
## 0.93237687 0.93237687 0.93237687 0.93237687 0.93237687 0.93237687
## gadget garage heating hob house immersion
## 0.93237687 0.93237687 0.93237687 0.93237687 0.93237687 0.93237687
## lights man motors pumps razors rollers
## 0.93237687 0.93237687 0.93237687 0.93237687 0.93237687 0.93237687
## running shears shower supply toasters track
## 0.93237687 0.93237687 0.93237687 0.93237687 0.93237687 0.93237687
## trains valve van water wheelchair bills
## 0.93237687 0.93237687 0.93237687 0.93237687 0.93237687 1.44320249
## company current ecology fires lamp toaster
## 1.44320249 1.44320249 1.44320249 1.44320249 1.44320249 1.44320249
## board car light shaver blankets windows
## 1.77967473 1.77967473 1.77967473 1.77967473 2.03098916 2.03098916
## cooker drill motor blanket bill
## 2.23165985 2.23165985 2.23165985 2.39871394 2.54181478
sort(logged.ors.his)
## account context data event figure
## -1.5293952 -1.5293952 -1.5293952 -1.5293952 -1.5293952
## significance accident appreciation budget cord
## -1.2780808 -0.9416085 -0.9416085 -0.9416085 -0.9416085
## issues linguistics links monuments objects
## -0.9416085 -0.9416085 -0.9416085 -0.9416085 -0.9416085
## ones portraiture progression pythagoras rates
## -0.9416085 -0.9416085 -0.9416085 -0.9416085 -0.9416085
## record records references research romance
## -0.9416085 -0.9416085 -0.9416085 -0.9416085 -0.9416085
## sequence sidestep structure theory visit
## -0.9416085 -0.9416085 -0.9416085 -0.9416085 -0.9416085
## awareness question reconstruction truth buildings
## -0.4307829 -0.4307829 -0.4307829 0.1570037 0.6678294
## centre chair day enterprise landmark
## 1.7664417 1.7664417 1.7664417 1.7664417 1.7664417
## leases organization origins setting sites
## 1.7664417 1.7664417 1.7664417 1.7664417 1.7664417
## status title trend trends wildlife
## 1.7664417 1.7664417 1.7664417 1.7664417 1.7664417
## york character town towns core
## 1.7664417 2.2772673 2.2772673 2.2772673 3.3758796
## city
## 3.5010427
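For completeness, the kind of plot alluded to above could be sketched like this (a minimal sketch using plot and text; on data this small it will overplot heavily, as mentioned):
plot(sort(logged.ors.ele), type="n", # set up an empty coordinate system
xlab="nouns (ranked)", # whose x-axis is the nouns' ranks
ylab="logged odds ratio") # & whose y-axis is the association values
text(seq(logged.ors.ele), # then, at each rank
sort(logged.ors.ele), # & its logged odds ratio,
names(sort(logged.ors.ele)), cex=0.6) # plot the noun (in a small font)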
Now let’s test Marchand’s claim that -ical forms are in wider common use. We load the complete frequency list of the whole British National Corpus:
summary(freq.list.all <- read.table( # make freq.list.all the result of reading
"files/corp_bnc_freql.txt", # this file
header=TRUE, # which has a header in the 1st row
sep=" ", # uses spaces as column separators
quote="", # uses no quotes
comment.char="")) # & no comments
## FREQUENCY WORD POS FILES
## Min. : 1 Length:938971 Length:938971 Min. : 1.0
## 1st Qu.: 1 Class :character Class :character 1st Qu.: 1.0
## Median : 1 Mode :character Mode :character Median : 1.0
## Mean : 107 Mean : 18.4
## 3rd Qu.: 4 3rd Qu.: 3.0
## Max. :6187267 Max. :4120.0
head(freq.list.all) # check the input
## FREQUENCY WORD POS FILES
## 1 1 !*?* unc 1
## 2 602 % nn0 113
## 3 1 %/100 unc 1
## 4 3 %/day unc 1
## 5 1 %295 unc 1
## 6 1 %5,000 unc 1
object.size(freq.list.all) # 64,280,352
## 64280352 bytes
That’s a lot of data to keep in memory, so let’s reduce the memory footprint to what we need, namely only the nouns:
where.are.the.nouns <- grep( # find
"^nn\\d$", # noun tags
freq.list.all$POS, # in the POS column
perl=TRUE) # using Perl-compatible regular expressions
summary(freq.list.n <- freq.list.all[ # make freq.list.n freq.list.all, but only
where.are.the.nouns, # the rows with nouns
-3] # and not the POS column anymore
)
## FREQUENCY WORD FILES
## Min. : 1.00 Length:229954 Min. : 1.00
## 1st Qu.: 1.00 Class :character 1st Qu.: 1.00
## Median : 1.00 Mode :character Median : 1.00
## Mean : 86.57 Mean : 23.65
## 3rd Qu.: 5.00 3rd Qu.: 3.00
## Max. :153679.00 Max. :3931.00
object.size(freq.list.n) # 18551280 # less than 30% of original
## 18551280 bytes
rm(freq.list.all)
Let’s find out how frequent the nouns after electric(al) are:
collo.type.ele <- unique( # find the unique combinations
results.split$ele) # of electric(al) & their nouns
tapply( # apply to
freq.list.n$FREQUENCY[ # the frequencies from the BNC a subsetting, namely
match( # retrieve the frequencies
collo.type.ele$NOUNS, # of the noun collocates of electric(al)
freq.list.n$WORD)], # via cross-referencing the BNC nouns
collo.type.ele$ADJECTIVES, # group the frequencies by the 2 adjectives
median, # compute the median while
na.rm=TRUE) # omitting missing data/frequencies
## electric electrical
## 1069.0 3229.5
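In case the match cross-referencing is hard to parse, here is a toy example (invented mini-vectors) of what match does:
match( # find the first position(s)
c("b", "d", "a"), # of these values
c("a", "b", "c")) # in this vector: returns 2 NA 1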
Let’s find out how frequent the nouns after historic(al) are:
collo.type.his <- unique( # find the unique combinations
results.split$his) # of historic(al) & their nouns
tapply( # apply to
freq.list.n$FREQUENCY[ # the frequencies from the BNC a subsetting, namely
match( # retrieve the frequencies
collo.type.his$NOUNS, # of the noun collocates of historic(al)
freq.list.n$WORD)], # via cross-referencing the BNC nouns
collo.type.his$ADJECTIVES, # group the frequencies by the 2 adjectives
median, # compute the median while
na.rm=TRUE) # omitting missing data/frequencies
## historic historical
## 4090 4292
This suggests that Marchand’s claim is borne out for electric(al) – the noun collocates of electrical are about three times as frequent as those of electric – but hardly for historic(al), where the two medians are nearly identical.
But maybe “-ical forms are in wider common use” pertains to dispersion (the topic of the next session), so let’s check this by looking not at frequency but at the number of files in which the noun collocates are observed:
tapply( # apply to
freq.list.n$FILES[ # the ranges (file counts) from the BNC a subsetting, namely
match( # retrieve the file counts
collo.type.ele$NOUNS, # of the noun collocates of electric(al)
freq.list.n$WORD)], # via cross-referencing the BNC nouns
collo.type.ele$ADJECTIVES, # group the frequencies by the 2 adjectives
median, # compute the median while
na.rm=TRUE) # omitting missing data/frequencies
## electric electrical
## 485.5 1054.0
tapply( # apply to
freq.list.n$FILES[ # the ranges (file counts) from the BNC a subsetting, namely
match( # retrieve the file counts
collo.type.his$NOUNS, # of the noun collocates of historic(al)
freq.list.n$WORD)], # via cross-referencing the BNC nouns
collo.type.his$ADJECTIVES, # group the frequencies by the 2 adjectives
median, # compute the median while
na.rm=TRUE) # omitting missing data/frequencies
## historic historical
## 1029.5 1002.0
Again, this suggests that Marchand’s claim holds for electric(al) – the noun collocates of electrical occur in roughly twice as many files – but not for historic(al), where the two values are virtually identical (historic is even minimally ahead).
This also means that distinguishing between frequency and dispersion is important.
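As a quick illustration of how frequency and dispersion can come apart, one might inspect the nouns whose frequency is packed into the fewest files (a sketch assuming freq.list.n is still in memory; output not shown):
head(freq.list.n[ # show the top rows of freq.list.n
order(-freq.list.n$FREQUENCY / # when sorted by frequency per file,
freq.list.n$FILES), ]) # in descending order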
Housekeeping:
sessionInfo()
## R version 4.2.2 (2022-10-31 ucrt)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 22621)
##
## Matrix products: default
##
## locale:
## [1] LC_COLLATE=English_United States.utf8
## [2] LC_CTYPE=English_United States.utf8
## [3] LC_MONETARY=English_United States.utf8
## [4] LC_NUMERIC=C
## [5] LC_TIME=English_United States.utf8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets compiler methods
## [8] base
##
## other attached packages:
## [1] magrittr_2.0.3
##
## loaded via a namespace (and not attached):
## [1] digest_0.6.31 R6_2.5.1 jsonlite_1.8.4 evaluate_0.20
## [5] cachem_1.0.6 rlang_1.0.6 cli_3.6.0 rstudioapi_0.14
## [9] jquerylib_0.1.4 bslib_0.4.2 rmarkdown_2.20 tools_4.2.2
## [13] xfun_0.37 yaml_2.3.7 fastmap_1.1.0 htmltools_0.5.4
## [17] knitr_1.42 sass_0.4.5