Disfluencies in language production are often correlated with problems in retrieving and planning upcoming elements and structures. In this short exercise, we investigate, in a single corpus file, which part-of-speech (POS) tags are often preceded by disfluencies and what this might tell us about the psycholinguistic processes of retrieval and planning. Specifically, we use the SGML-annotated version of the BNC as our first practice corpus.
"<w unc>erm <w cjs" "<w unc>er <w prp" "<w unc>erm <pause> <w vhb"
"<w unc>erm <w vvn" "<w unc>er <w at0" "<w unc>erm <w dt0"
"<w unc>er<c pun>, <w cjs" "<w unc>erm <pause> <w pnp" "<w unc>erm <w at0"
"<w unc>er <pause> <w aj0" "<w unc>er <w pnp" "<w unc>er <w vvg"
We need to
The above tasks require the following functions: c; scan; exact.matches.2 or gregexpr/regmatches; gsub or exact.matches.2; gsub, exact.matches.2, or substr; match; table; plot and text.
Given the simplicity of this script, this is already what step 3 ‘The overall structure and pseudocode of the script’ would be, so we can now go ahead.
We load the corpus file <corp_bnc_sgml_1.txt> into a vector called corpus.file:
corpus.file <- tolower(scan( # make corpus.file what you get when you load & tolower
"files/corp_bnc_sgml_1.txt", # load <corp_bnc_sgml_1.txt>
what=character(), # which contains text, not numbers
sep="\n", # elements are separated by line breaks
quote="", # there are no quotes in there
comment.char="")) # and no comment characters
# check import:
corpus.file[4:9]
## [1] "<u who=xx1ps000>"
## [2] "<s n=\"1\"><w av0>carefully <w pnp>they <w vvd>crossed <w at0>the <w nn1>river <w cjc>and <w pnp>they <w vvd>made <w dps>their <w nn1>way <w prp>into <w at0>the <w nn1>city<c pun>."
## [3] "<s n=\"2\"><w pnp>i <w vvb>believe <w at0>the <w nn2>children <w vm0>should <w vhi>have <w dps>their <w nn1>way <w av0>once <w prp>in <w at0>a <w nn1>while"
## [4] "<s n=\"3\"><pause> <w pnp>i <w vvd>seemed <w to0>to <w vhi>have <w vvn>worked <w dps>my <w nn1>way <w avp>down <w prp>to <w at0>the <w nn1>bottom <w prf>of <w dps>my <w nn1>list<c pun>."
## [5] "<s n=\"4\"><w vdd>did <w pnp>you <w vvi>want <w to0>to <trunc> <w unc>s </trunc> <w pnp>you <w vvd>started <unclear> <w dps>your <w nn1>way <w prp-avp>through <w av0>there<c pun>?"
## [6] "<s n=\"5\"><w cjc>and <w av0>then <w pnp>they<w vm0>'d <w vvi>wind <w dps>their <w nn1>way <w av0>home <w prp>with <w at0>the <w aj0>old <w nn1>port <w cjc>and <w pni>everything<c pun>."
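The steps between loading the file and the cross-tabulation (finding the disfluencies, extracting them and the following tags, and mapping the tags onto coarser categories) can be sketched on toy data as follows. This is only an illustration under stated assumptions: toy.lines is made up, hit.regex is one possible search expression and not necessarily the one behind the results reported here, and the fine-to-coarse lookup is a hypothetical three-tag excerpt; all object names starting with toy. are invented for this sketch.

```r
# three made-up lines in the same SGML format as the corpus file (not real BNC data)
toy.lines <- c(
   "<s n=\"1\"><w unc>erm <w cjs>because <w pnp>i <w unc>er <pause> <w aj0>old",
   "<s n=\"2\"><w pnp>they <w vvd>made <w unc>er<c pun>, <w at0>the <w nn1>way",
   "<s n=\"3\"><w av0>carefully <w pnp>they <w vvd>crossed")

# assumed search expression: er/erm tagged as unc, then optional non-word
# material (spaces, <pause>, <c pun>, ...), then the tag of the next word
hit.regex <- "<w unc>erm?[^a-z<]*(?:<[^w][^>]*>[^<]*)*<w [^>]+"
hits <- unlist(regmatches(toy.lines,
   gregexpr(hit.regex, toy.lines, perl=TRUE)))

# keep only the disfluency at the start of each hit ...
toy.disfluencies <- sub("^<w unc>(erm?).*$", "\\1", hits, perl=TRUE)
# ... and only the tag at the end of each hit
toy.tags <- sub("^.*<w ([^>]+)$", "\\1", hits, perl=TRUE)

# hypothetical lookup from fine-grained tags to coarse-grained categories
fine.tags   <- c("cjs",         "aj0",       "at0")
coarse.tags <- c("conjunction", "adjective", "determiner")
toy.coarse <- coarse.tags[match(toy.tags, fine.tags)]

table(toy.coarse, toy.disfluencies) # cross-tabulate as in the next step
```

The gregexpr/regmatches pair is used here because both are base R; exact.matches.2 is not shown, since its interface is not described in this section.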
We cross-tabulate the disfluencies with those coarse-grained POS tags:
result <- table(sum.tags.after.disfluencies, all.disfluencies)
(result <- result[order(result[,1]/result[,2]),])
##                             all.disfluencies
## sum.tags.after.disfluencies  er erm
##   adverb_particle             1   2
##   of                          9  11
##   have                       13  15
##   pronoun                   179 201
##   conjunction               142 145
##   determiner                249 235
##   do                          8   7
##   existential                19  15
##   other                      96  75
##   modal                      13  10
##   numeral                    16  12
##   adverb                    115  86
##   be                         35  24
##   interjection                9   6
##   lexical_verb              131  80
##   noun                      270 162
##   preposition               161  94
##   adjective                 154  84
##   infinitive                 18   9
##   not                         7   3
We compute the so-called difference coefficient, which expresses how strongly each tag prefers one disfluency over the other. It is computed like this:
\[\text{diff. coeff.}=\frac{freq_{POS~with~er} - freq_{POS~with~erm}}{freq_{POS~with~er} + freq_{POS~with~erm}}\]
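These values fall into the interval [-1, 1]; the following shows how they behave if the two disfluencies were equally frequent. Imagine you have a word that occurs 10 times in a corpus. Then, the word could occur 0 times after er and 10 times after erm, 1 time after er and 9 times after erm, and so on, up to 10 times after er and 0 times after erm.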
This is how these 11 situations are reflected in the difference coefficients:
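As a minimal sketch, the 11 situations can be enumerated by splitting 10 occurrences between er and erm in every possible way and applying the formula above. The object names freq.er, freq.erm, and diff.coeffs are made up for this illustration.

```r
# all 11 ways of splitting 10 occurrences between the two disfluencies
freq.er  <- 0:10          # occurrences after 'er'
freq.erm <- 10 - freq.er  # the remaining occurrences after 'erm'

# the difference coefficient for each split
diff.coeffs <- (freq.er - freq.erm) / (freq.er + freq.erm)
names(diff.coeffs) <- paste(freq.er, freq.erm, sep=":") # label as er:erm
diff.coeffs
```

The values run from -1 (all 10 occurrences after erm) through 0 (an even 5:5 split) to 1 (all 10 occurrences after er), in steps of 0.2.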
Let’s compute this for our disfluencies:
numerator <- result[,1]-result[,2] # compute pairwise differences between columns
denominator <- result[,1]+result[,2] # or rowSums(result)
sort(difference.coefficients <- numerator/denominator)
## adverb_particle              of            have         pronoun     conjunction
##     -0.33333333     -0.10000000     -0.07142857     -0.05789474     -0.01045296
##      determiner              do     existential           other           modal
##      0.02892562      0.06666667      0.11764706      0.12280702      0.13043478
##         numeral          adverb              be    interjection    lexical_verb
##      0.14285714      0.14427861      0.18644068      0.20000000      0.24170616
##            noun     preposition       adjective      infinitive             not
##      0.25000000      0.26274510      0.29411765      0.33333333      0.40000000
However, we actually need to correct for the fact that the two disfluencies are not equally frequent. We can do so by (i) computing the odds of er (versus erm) for each part of speech and then (ii) dividing those odds by the overall odds of er in the data as a whole:
numerator <- result[,1]/result[,2] # compute pairwise ratios between columns
denominator <- sum(result[,1])/sum(result[,2])
sort(odds.ratios <- numerator/denominator)
## adverb_particle              of            have         pronoun     conjunction
##       0.3878419       0.6346505       0.6722594       0.6907832       0.7596353
##      determiner              do     existential           other           modal
##       0.8218948       0.8864959       0.9825329       0.9928754       1.0083891
##         numeral          adverb              be    interjection    lexical_verb
##       1.0342452       1.0372517       1.1312057       1.1635258       1.2701824
##            noun     preposition       adjective      infinitive             not
##       1.2928065       1.3285650       1.4220871       1.5513678       1.8099291
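Both measures can be re-checked without the corpus file by retyping the counts from the printed cross-tabulation. The following is a sketch for verification only; the object names result.check, diff.check, overall.odds, and odds.check are made up here and are not part of the original script.

```r
# counts copied from the printed cross-tabulation (same row order)
tag.names <- c("adverb_particle", "of", "have", "pronoun", "conjunction",
   "determiner", "do", "existential", "other", "modal", "numeral", "adverb",
   "be", "interjection", "lexical_verb", "noun", "preposition", "adjective",
   "infinitive", "not")
result.check <- cbind(
   er =c(1,  9, 13, 179, 142, 249, 8, 19, 96, 13, 16, 115, 35, 9, 131, 270, 161, 154, 18, 7),
   erm=c(2, 11, 15, 201, 145, 235, 7, 15, 75, 10, 12,  86, 24, 6,  80, 162,  94,  84,  9, 3))
rownames(result.check) <- tag.names

# difference coefficients: (er - erm) / (er + erm)
diff.check <- (result.check[,"er"] - result.check[,"erm"]) / rowSums(result.check)

# odds ratios: each tag's odds of er divided by the overall odds of er
overall.odds <- sum(result.check[,"er"]) / sum(result.check[,"erm"])
odds.check <- (result.check[,"er"] / result.check[,"erm"]) / overall.odds

round(cbind(diff.coeff=diff.check, odds.ratio=odds.check), 4)
```

With these counts, er occurs 1645 times and erm 1276 times overall, so the overall odds of er are about 1.29. This is why a tag such as determiner has a positive difference coefficient but an odds ratio below 1.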
How about a plot that reflects both the overall frequency of each tag after the disfluencies and its odds ratio?
We put the (binary log of the) former on the x-axis and the latter on the y-axis:
plot(type="n", # plot nothing
xlab="Binary log of tag frequency", # w/ this x-axis label
xlim=c(0, 10), # w/ these x-axis limits
x=log2(rowSums(result)), # & these x-axis values
ylab="Odds ratios (>1: er; <1: erm)", # w/ this y-axis label
ylim=c(0, 2), # w/ these y-axis limits
y=odds.ratios) # w/ these y-axis values
grid() # add a grey grid
text( # plot text
log2(rowSums(result)), # at these x-axis coordinates
odds.ratios, # at these y-axis coordinates
labels=rownames(result), # the coarse-grained tags
font=3, cex=0.75) # italicized, 25% smaller
# add a dashed horizontal line at 'neutrality' (odds ratio of 1, i.e. the overall odds of er vs. erm)
abline(h=1, lty=2)
Housekeeping:
sessionInfo()
## R version 4.2.2 Patched (2022-11-10 r83330)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Pop!_OS 22.04 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods compiler
## [8] base
##
## other attached packages:
## [1] magrittr_2.0.3
##
## loaded via a namespace (and not attached):
## [1] digest_0.6.31 R6_2.5.1 jsonlite_1.8.4 evaluate_0.20
## [5] highr_0.10 cachem_1.0.6 rlang_1.0.6 cli_3.6.0
## [9] rstudioapi_0.14 jquerylib_0.1.4 bslib_0.4.2 rmarkdown_2.20
## [13] tools_4.2.2 xfun_0.37 yaml_2.3.7 fastmap_1.1.0
## [17] htmltools_0.5.4 knitr_1.42 sass_0.4.5