# 1 Introduction

In corpus linguistics, the method of key words is often used to find the words that are used disproportionately more often in one (part of a) corpus than in some other (part of the same) corpus. In applied linguistics contexts it is often used to identify words that are characteristic of a particular genre or topic area; the paper by Leech & Fallon (1992) applies this method to different varieties of English in the hope of identifying words that are characteristic of a particular culture. We will generate two frequency lists – one for BrE, one for AmE (both of the 1990s) – on our way to determining for each word in both corpora which corpus/variety it is more characteristic of.

## 1.1 Step 1: things we need to do

We need to

1. load each corpus (file) and homogenize upper-/lower-case spelling;
2. clean up the corpus files by deleting tags/annotation;
3. extract the words from each corpus (file);
4. determine how often each word is used in each corpus (file);
5. find words that are particularly characteristic of each corpus (file).

## 1.2 Step 2: functions we will need for that

The above tasks require the following functions:

1. load each corpus file and convert it to lower case: scan and tolower;
2. clean up the corpus files by deleting tags/annotation: gsub;
3. extract the words from each corpus file: strsplit and unlist and nzchar;
4. determine how often each word is used in each corpus (file): table + …;
5. find words that are particularly characteristic of each corpus (file): just math functions;
6. visualize these results: plot and text.

## 1.3 Step 3: pseudocode

Let’s break this down:

1. we load each corpus file and convert it to lower case: scan and tolower;
2. we clean up the corpus files by deleting tags/annotation: gsub: we
   1. find things between angular brackets;
   2. replace them by nothing;
3. we extract the words from each corpus (file): we
   1. need to decide what we consider to be non-word material (either theory- or data-driven);
   2. split the corpus lines up whenever one or more of those things is found;
4. we determine how often each word is used in each corpus (file): counting things is done with table, but we need to count each word in both corpora:
   1. table will need a vector with all words so that it can count those;
   2. table will need an equally long vector saying in which corpus each word form occurred, meaning we need to repeat each corpus name as often as there are words in the corpus;
5. we compute some sort of coefficient that expresses which word ‘prefers’ which corpus; this should remind you of the difference coefficient from last week;
6. we visualize these results: plot and text.
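The pseudocode can be sketched end-to-end on toy data (two made-up snippets standing in for the corpus files; the real implementation below uses scan to load the actual files):

```r
# toy stand-ins for the two corpus files
frown.toy <- c("<p_>After 35 straight veto victories.<p/>",
               "<p_>By Elaine S. Povich<p/>")
flob.toy  <- "<p_>Labour pledges reversal of NHS hospital opt-outs<p/>"
corpus <- tolower(c(frown.toy, flob.toy))              # load & lower-case
corpus <- gsub("<.*?>", "", corpus, perl=TRUE)         # delete <...> annotation
words  <- unlist(strsplit(corpus,                      # split into words at
   "[^0-9a-z]+", perl=TRUE))                           # non-alphanumeric runs
words  <- words[nzchar(words)]                         # drop empty strings
sort(table(words), decreasing=TRUE)                    # a mini frequency list
```

Every step of this sketch recurs below, just applied to the full corpora.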

# 2 Implementation

## 2.1 Task 1: load the corpus files

We load each corpus file into a separate vector:

head(frown <- tolower(       # show the head of frown, when you set to lower case the result of
   scan(                     # loading
      "frown.txt",           # this file (a placeholder path: point it at your copy of Frown),
      what=character(),      # which contains text, not numbers
      sep="\n",              # elements are separated by line breaks
      quote="",              # there are no quotes in there
      comment.char="")))     # and no comment characters; close scan, tolower, & head
## [1] "<#frown:a01\\><h_><p_>after 35 straight veto victories, intense lobbying fails president with election in offing<p/>"
## [2] "<p_>by elaine s. povich<p/>"
## [3] "<p_>chicago tribune<p/>"
## [4] "<h/> <p_>washington - despite intense white house lobbying, congress has voted to override the veto of a cable television regulation bill, dealing president bush the first veto defeat of his presidency just four weeks before the election.<p/>"
## [5] "<p_>monday night, the senate overrode the veto 74-25. the same margin by which the upper house approved the bill last month and comfortably above the two-thirds majority needed.<p/>"
## [6] "<p_>not one senator changed sides, a blow to bush's prestige after he had heavily lobbied republican senators, urging them not to embarrass him this close to the election.<p/>"
head(flob <- tolower(        # show the head of flob, when you set to lower case the result of
   scan(                     # loading
      "flob.txt",            # this file (a placeholder path: point it at your copy of FLOB),
      what=character(),      # which contains text, not numbers
      sep="\n",              # elements are separated by line breaks
      quote="",              # there are no quotes in there
      comment.char="")))     # and no comment characters; close scan, tolower, & head
## [1] "<#flob:a01\\><h_><p_>labour pledges reversal of nhs hospital opt-outs<p/>"
## [2] "<p_>by stephen castle<p/>"
## [3] "<p_>political correspondent<p/>"
## [4] "<h/> <p_>robin cook, labour's health spokesman, yesterday repeated party opposition to the internal market in the national health service and said there had been <quote_>\"no secret pacts with health service <}_><-|>manager<+|>managers<}/>\"<quote/> to maintain hospital trusts.<p/>"
## [5] "<p_>speaking to prospective labour parliamentary candidates in london, mr cook said his party <quote_>\"will bring back into the local nhs all those hospitals that have opted out\"<quote/>. \"if there is an election in november and we win office we will stop any hospital in the pipeline.\"<p/>"
## [6] "<p_>he and his colleagues are concerned that managers have told some ngs staff that a labour government would accept trust status as a <tf_>fait accompli<tf/>. however, mr cook said tory plans for an internal market demonstrated the division between the values of the two parties.<p/>"

## 2.2 Task 2: clean up corpus files

We remove tags and other kinds of annotation (paragraph marks), which are between angular brackets:

(head(frown <- gsub( # make frown what you get when you replace
pattern="<.*?>",  # annotation between angular brackets
replacement="",   # with nothing
frown,            # in frown
perl=TRUE)))      # using Perl-compatible regular expressions
## [1] "after 35 straight veto victories, intense lobbying fails president with election in offing"
## [2] "by elaine s. povich"
## [3] "chicago tribune"
## [4] " washington - despite intense white house lobbying, congress has voted to override the veto of a cable television regulation bill, dealing president bush the first veto defeat of his presidency just four weeks before the election."
## [5] "monday night, the senate overrode the veto 74-25. the same margin by which the upper house approved the bill last month and comfortably above the two-thirds majority needed."
## [6] "not one senator changed sides, a blow to bush's prestige after he had heavily lobbied republican senators, urging them not to embarrass him this close to the election."
(head(flob <- gsub(  # make flob what you get when you replace
pattern="<.*?>", # annotation between angular brackets
replacement="",  # with nothing
flob,            # in flob
perl=TRUE)))     # using Perl-compatible regular expressions
## [1] "labour pledges reversal of nhs hospital opt-outs"
## [2] "by stephen castle"
## [3] "political correspondent"
## [4] " robin cook, labour's health spokesman, yesterday repeated party opposition to the internal market in the national health service and said there had been \"no secret pacts with health service managermanagers\" to maintain hospital trusts."
## [5] "speaking to prospective labour parliamentary candidates in london, mr cook said his party \"will bring back into the local nhs all those hospitals that have opted out\". \"if there is an election in november and we win office we will stop any hospital in the pipeline.\""
## [6] "he and his colleagues are concerned that managers have told some ngs staff that a labour government would accept trust status as a fait accompli. however, mr cook said tory plans for an internal market demonstrated the division between the values of the two parties."

## 2.3 Task 3: extract the words

We split up the elements of each corpus into words at some character strings we don’t want to keep/consider as word material, which different users might define differently. For that, we first identify all characters in each corpus:

noquote(                       # show w/out quotes
character.table <- sort(    # character.table, the result of sorting
unique(                  # the unique elements of the
unlist(               # unlisted
strsplit(          # result of splitting up
c(frown, flob), # the combination of both corpora
""              # at each character
))))) # close strsplit, unlist, unique, sort, noquote
##  [1] \035   '      -             !      "      #      $      %      &
## [11] (      )      *      ,      .      /      :      ;      ?      @
## [21] [      \\     ]      _      {      |      }      ´      £      +
## [31] =      >      §      °      \u0096 0      ½      1      2      3
## [41] 4      5      6      7      8      9      a      ä      b      c
## [51] d      e      é      f      g      h      i      j      k      l
## [61] m      n      o      ó      ò      ô      ö      õ      p      q
## [71] r      s      t      u      ü      v      w      x      y      z

A nicer version might use magrittr’s pipe, but one needs to be aware of one small potential hiccup, namely how to assign something so that it’s available in the global environment:

library(magrittr)               # needed for the %>% pipe
c(frown, flob)  %>%
   strsplit("") %>% unlist     %>%
   unique       %>% sort       %>%
   "<<-"("character.table", .) %>% # note the arrow!
   noquote
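The "<<-" trick is only needed because the assignment comes last in the pipe; assigning the pipe’s result up front avoids it entirely. A minimal sketch on made-up toy vectors, assuming magrittr is installed:

```r
library(magrittr)                 # for %>%
frown.toy <- "veto override"      # toy stand-ins for the corpora
flob.toy  <- "labour pledges"
character.table <- c(frown.toy, flob.toy) %>% # ordinary assignment up front
   strsplit("") %>% unlist %>% unique %>% sort
noquote(character.table)          # the sorted unique characters
```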

Let’s decide to discard all non-alphanumeric characters in this vector. We create a character string split.expression, a regular expression matching all (sequences of) characters we want to split on:

(split.expression <- paste0(  # make split.expression the result of pasting w/ nothing
"[^0-9",                   # a character class: NOT the digits &
paste0(                    # the result of pasting together w/ nothing
character.table[47:80], # these letters
collapse=""             # with nothing in between
),                         # close inner paste0
"]+"                       # end of character class, one or more of those
))                            # close paste0 & output
## [1] "[^0-9aäbcdeéfghijklmnoóòôöõpqrstuüvwxyz]+"
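A quick check on a made-up sentence (using a simplified, ASCII-only version of the character class) shows what this expression does: digits survive as ‘words’, while hyphens, periods, commas, and spaces all trigger splits:

```r
toy.split <- "[^0-9a-z]+"  # simplified, ASCII-only version of split.expression
unlist(strsplit(
   "the senate overrode the veto 74-25, o.k.?", # a made-up test string
   toy.split, perl=TRUE))
# -> "the" "senate" "overrode" "the" "veto" "74" "25" "o" "k"
```

Note how "74-25" comes out as two ‘words’ and "o.k." as two single letters, which explains some of the frequency-list entries we will see below.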

We split up each corpus vector separately into its words:

head(frown.words <- unlist( # make frown.words the result of unlisting
strsplit(                # the result of splitting up
frown,                # this vector frown
split.expression,     # at our character class
perl=TRUE)            # using Perl-compatible regular expressions
))                          # close unlist, close head
## [1] "after"     "35"        "straight"  "veto"      "victories" "intense"
head(flob.words <- unlist( # make flob.words the result of unlisting
strsplit(               # the result of splitting up
flob,                # this vector flob
split.expression,    # at our character class
perl=TRUE)           # using Perl-compatible regular expressions
))                         # close unlist, close head
## [1] "labour"   "pledges"  "reversal" "of"       "nhs"      "hospital"

Quick check:

table(nchar(frown.words))
##
##      0      1      2      3      4      5      6      7      8      9     10
##   5491  52271 168143 206603 163647 112540  87742  81822  58582  42233  28082
##     11     12     13     14     15     16     17     18     19     20     21
##  16220   8974   4862   2116    789    267    135     41     18     14      5
##     22     23     25     30
##      5      5      2      1
table(nchar(flob.words))
##
##      0      1      2      3      4      5      6      7      8      9     10
##   5731  47231 176592 211298 163045 110258  85330  79560  56864  41759  27346
##     11     12     13     14     15     16     17     18     19     20     21
##  15868   8227   4494   1758    655    240     85     50     34     14     10
##     22     23     24     25     26     27     29     30
##      4      6      4      1      1      3      2      1

We eliminate the thousands of empty character vectors:

frown.words <- frown.words[nzchar(frown.words)] # subset frown.words to those 1+ character long
# and we check:
table(nchar(frown.words))
##
##      1      2      3      4      5      6      7      8      9     10     11
##  52271 168143 206603 163647 112540  87742  81822  58582  42233  28082  16220
##     12     13     14     15     16     17     18     19     20     21     22
##   8974   4862   2116    789    267    135     41     18     14      5      5
##     23     25     30
##      5      2      1
flob.words <- flob.words[nzchar(flob.words)] # subset flob.words to those 1+ character long
# and we check:
table(nchar(flob.words))
##
##      1      2      3      4      5      6      7      8      9     10     11
##  47231 176592 211298 163045 110258  85330  79560  56864  41759  27346  15868
##     12     13     14     15     16     17     18     19     20     21     22
##   8227   4494   1758    655    240     85     50     34     14     10      4
##     23     24     25     26     27     29     30
##      6      4      1      1      3      2      1

## 2.4 Task 4: cross-tabulate words and corpora

Since we have the corpora ‘handy’, let’s quickly save each corpus’s frequency list into a file. Who knows when we might need those again … (I do, in session 10.)

flob.table <- sort(table(flob.words), decreasing=TRUE)
cat("WORD\tFREQ",
paste(names(flob.table), flob.table, sep="\t"),
sep="\n", file="120_06_freq_flob.csv")
frown.table <- sort(table(frown.words), decreasing=TRUE)
cat("WORD\tFREQ",
paste(names(frown.table), frown.table, sep="\t"),
sep="\n", file="120_06_freq_frown.csv")

In order to cross-tabulate words and corpora, we need two equally long vectors: one with all words, the other stating for each word which corpus it’s from. We create the former from frown.words and flob.words and the latter with rep from the lengths of frown.words and flob.words:

# define a vector with all words
length(all.words <- c( # show the length of all.words, the combination
frown.words,        # of this corpus &
flob.words))        # of this corpus
## [1] 2065859
# define a vector that says which corpus each word is from
length(all.corps <- c(    # show the length of all.corps, the combination
rep(                   # of repeating
"frown",            # "frown" as often as ...
length(frown.words) # FROWN has words
),                     # and
rep(                   # of repeating
"flob",             # "flob" as often as ...
length(flob.words)  # FLOB has words
)))                    # close rep, c, & length
## [1] 2065859

We create a table – essentially a term-corpus matrix (by analogy to the notion of a term-document matrix used in information retrieval):

dim(tcm <- table( # show the dimensions of the table tcm, from tabulating
all.words,     # all words from
all.corps))    # both corpora
## [1] 64626     2
# take a peek
tcm[1:10,1:2]
##            all.corps
## all.words   flob frown
##   0          140   143
##   00          10    16
##   000        334   351
##   0000         1     0
##   00001        0     2
##   00011        0     2
##   0004         0     1
##   000ft        2     3
##   000l         0     1
##   000strong    1     0

I hope you were immediately thinking that we should fix the whole numbers issue, maybe like this:

all.words <- gsub(  # make all.words the result of replacing
   "[0-9]+",        # (sequences of) digits
   "#",             # with a single #
   all.words,       # in all.words
   perl=TRUE)       # using Perl-compatible regular expressions
dim(tcm <- table( # show the dimensions of the table tcm, from tabulating
all.words,     # all words from
all.corps))    # both corpora
## [1] 62168     2
tcm[1:10,1:2] # take a peek
##          all.corps
## all.words  flob frown
##   #       11248 12459
##   #a         48    41
##   #a#        10    28
##   #aa         1     0
##   #al         0     1
##   #alphai     0     1
##   #am        15     0
##   #as         1     0
##   #b         23    34
##   #b#         7    31

## 2.5 Task 5: quantify which word prefers which corpus

Since here the two corpora are nearly identical in size (they differ by only about half a percent), we could again compute the difference coefficient, which expresses which word is preferred how much in which corpus. As before, it is computed like this and its values fall into the interval [-1,1]:

$diff. coeff.=\frac{freq_{word~in~FROWN} - freq_{word~in~FLOB}}{freq_{word~in~FROWN} + freq_{word~in~FLOB}}$
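With made-up frequencies, say a word occurring 30 times in Frown and 10 times in FLOB, this works out as follows:

```r
freq.frown <- 30; freq.flob <- 10                    # made-up frequencies
(freq.frown - freq.flob) / (freq.frown + freq.flob)  # 0.5: leans towards Frown
(30 - 0) / (30 + 0)                                  #  1: attested in Frown only
(0 - 10) / (0 + 10)                                  # -1: attested in FLOB only
```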

numerator <-   tcm[,2]-tcm[,1] # compute pairwise differences between columns
denominator <- tcm[,2]+tcm[,1] # or rowSums(tcm)
difference.coefficients <- numerator/denominator
# check result:
head(sort(difference.coefficients), 20) # useless
##      #aa      #am      #as      #ba      #bn      #bs      #cc      #cm
##       -1       -1       -1       -1       -1       -1       -1       -1
##      #cu      #d#     #deg #degreem      #dn      #du      #e#      #el
##       -1       -1       -1       -1       -1       -1       -1       -1
##      #f#      #ff       #g      #gt
##       -1       -1       -1       -1
tail(sort(difference.coefficients), 20) # useless
##         zodiac         zombie zoroastrianism       zorthian             zs
##              1              1              1              1              1
##      zubchenko         zubero          zubro      zucchinis     zuckerbrot
##              1              1              1              1              1
##      zuckerman           zude         zuniga          zunis        zurawik
##              1              1              1              1              1
##             zw        zwiener       zwilling         zychik         zyklon
##              1              1              1              1              1
set.seed(sum(utf8ToInt("haribo"))); sort(sample(difference.coefficients, 50))
##         fairyland                kp marshallsmarshals            scuffs
##       -1.00000000       -1.00000000       -1.00000000       -1.00000000
##       illuminates            dervla            bikers              wail
##       -1.00000000       -1.00000000       -1.00000000       -1.00000000
##              olea        discoverer        gallowgate        obviouosly
##       -1.00000000       -1.00000000       -1.00000000       -1.00000000
##             looka     disinvestment            stalls            chimes
##       -1.00000000       -1.00000000       -1.00000000       -1.00000000
##          trotting            embryo            jagger             shore
##       -0.66666667       -0.47368421       -0.33333333       -0.23809524
##             audit          assuming        approaches            summed
##       -0.15789474       -0.13513514       -0.12500000       -0.07692308
##           starkly           abstain     demonstrators              dark
##        0.00000000        0.00000000        0.00000000        0.05637982
##        0.13846154        0.14285714        0.23809524        0.25000000
##        0.25000000        0.40000000        0.50000000        0.50000000
##    psychoanalytic         nonrandom             jazzy             prowl
##        0.57142857        1.00000000        1.00000000        1.00000000
##        hypobiosis          deluding           timbres      ingratiating
##        1.00000000        1.00000000        1.00000000        1.00000000
##          histones        communique       retransmits        supertonic
##        1.00000000        1.00000000        1.00000000        1.00000000
##        1.00000000        1.00000000

But we also compute the odds ratios again, and this time we log them as well, to make the numerical ‘preference territory’ identically big for each corpus (namely 0 to ∞ in either direction). Note also that we must ‘discount’ the frequencies to avoid numerical trouble with 0s, which is why we add 0.5 to each cell frequency:

numerator <- (tcm[,1]+0.5)/(tcm[,2]+0.5) # compute pairwise ratios between columns
denominator <- sum(tcm[,1])/sum(tcm[,2]) # the ratio of the two corpus sizes
summary(logged.odds.ratios <- log(numerator/denominator))
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max.
## -6.452530 -1.094373  0.004239 -0.024785  1.102852  6.106798
sort(sample(logged.odds.ratios, 50))
##     prosecuted      yellowish    apparatuses     intestines      overvalue
##   -2.560709952   -1.941670744   -1.941670744   -1.605198507   -1.605198507
##    ruttenstein     okefenokee       raimunda    grandstands      squealing
##   -1.605198507   -1.094372884   -1.094372884   -1.094372884   -1.094372884
##     sevzapkino         deltaz pennsylvanians       encumber        rimless
##   -1.094372884   -1.094372884   -1.094372884   -1.094372884   -1.094372884
##           bakr  teledemocracy       intruded      shameless         lately
##   -1.094372884   -1.094372884   -0.843058455   -0.506586219   -0.247075023
##       accuracy       survival         debate    stimulation        burglar
##   -0.196431290   -0.180189634   -0.165866564   -0.162814680    0.004239405
##         concur          boned         trough     assumption       replaced
##    0.004239405    0.004239405    0.129402548    0.132620572    0.136507374
##        company          worst     resentment           fist         supple
##    0.154899680    0.188088846    0.255553833    0.283824267    0.456224529
##     depictions      chemistry     clemmensen   patriotiques      epitomize
##    0.851537265    0.989092808    1.102851694    1.102851694    1.102851694
##    bulletother        flecked         weeded         eileen            soy
##    1.102851694    1.102851694    1.102851694    1.362362889    1.613677318
##     monarchist      dishcloth           flue    consumerist       realised
##    1.613677318    1.613677318    1.950149554    1.950149554    4.879436728
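To see what the 0.5 discount buys us, consider a made-up row like #am above (15 occurrences in FLOB, none in Frown): without the discount, the 0 makes the logged ratio blow up:

```r
flob.freq <- 15; frown.freq <- 0             # made-up row: attested in FLOB only
log(flob.freq / frown.freq)                  # Inf: the 0 ruins the ratio
log((flob.freq + 0.5) / (frown.freq + 0.5))  # ~3.43: a finite log odds ratio
```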

## 2.6 Task 6: we visualize

How about a plot that reflects both

• the frequency of each word in both corpora;
• the preferential behavior of each word with regard to the corpora?

### 2.6.1 Plots using the difference coefficients

We put the (log of the) former on the x-axis and the latter on the y-axis:

plot(type="n",                                 # plot nothing
   xlab="Binary log of word frequency",        # w/ this x-axis label
   xlim=c(0, 17),                              # w/ these x-axis limits
   x=log2(rowSums(tcm)),                       # & these x-axis values
   ylab="Diff. coeffs. (<0: FLOB; >0: FROWN)", # w/ this y-axis label
   ylim=c(-1, 1),                              # w/ these y-axis limits
   y=difference.coefficients)                  # & these y-axis values
grid() # add a grey grid
text(                       # plot text
   log2(rowSums(tcm)),      # at these x-axis coordinates
   difference.coefficients, # at these y-axis coordinates
   labels=rownames(tcm),    # namely the words
   font=3, cex=0.8)         # italicized and 20% smaller
# add a dashed horizontal line at 'neutrality' (reflecting the slightly different corpus sizes)
abline(h=(colSums(tcm)[2] - colSums(tcm)[1]) / sum(tcm), lty=2)