1 Introduction

Disfluencies in language production are often correlated with problems in retrieving and planning upcoming elements and structures. In this little exercise, we will investigate, in one corpus file, which part-of-speech (POS) tags are often preceded by disfluencies and what this might tell us about the psycholinguistic processes of retrieval and planning. Specifically, we will use the SGML-annotated version of the BNC as our first practice corpus. The kind of matches we will ultimately be retrieving look like this:

"<w unc>erm <w cjs"         "<w unc>er <w prp"          "<w unc>erm <pause> <w vhb"
"<w unc>erm <w vvn"         "<w unc>er <w at0"          "<w unc>erm <w dt0"
"<w unc>er<c pun>, <w cjs"  "<w unc>erm <pause> <w pnp" "<w unc>erm <w at0"
"<w unc>er <pause> <w aj0"  "<w unc>er <w pnp"          "<w unc>er <w vvg"

1.1 Step 1: things we need to do

We need to

  1. define vectors with the fine-grained and the coarse-grained POS tags;
  2. load a corpus file;
  3. search the corpus file for disfluencies;
  4. get the disfluencies and the beginning of the fine-grained tags out of there;
  5. change the fine-grained tags to the coarse-grained POS tag;
  6. determine how often each disfluency occurs with each coarse-grained POS tag.

1.2 Step 2: functions we will need for that

The above tasks require the following functions:

  1. Preparation
    1. we define a vector with all the fine-grained tags used in the corpus: c;
    2. we define a corresponding vector with the coarse-grained counterparts of those fine-grained tags: c;
  2. we load the corpus file: scan;
  3. we retrieve all disfluencies and the next word’s POS tag: exact.matches.2 or gregexpr/regmatches;
  4. from those retrieved instances,
    1. we isolate the disfluency per se: gsub or exact.matches.2;
    2. we isolate the beginning of the POS tag of the next word: gsub or exact.matches.2 or substr;
  5. we create a vector that contains for each of the (fine-grained) POS tags the corresponding coarse-grained POS tag: match;
  6. we
    1. cross-tabulate the disfluencies with those coarse-grained POS tags: table;
    2. compute some sort of coefficient that expresses which disfluency ‘prefers to precede’ which coarse-grained POS tag: just ‘math functions’;
    3. visualize these results: plot and text.

Given the simplicity of this script, this is already what step 3, ‘The overall structure and pseudocode of the script’, would be, so we can now go ahead.

2 Implementation

2.1 Task 1: define vectors w/ fine-/coarse-grained POS tags

rm(list=ls(all.names=TRUE))
bnc.tags<-tolower(c("AJ0", "AJC", "AJS", "AT0", "AV0", "AVP", "AVQ", "CJC", "CJS", "CJT", "CRD", "DPS", "DT0", "DTQ", "EX0", "ITJ", "NN0", "NN1", "NN2", "NNN", "NNN", "NNS", "NP0", "NUL", "ORD", "PNI", "PNP", "PNQ", "PNX", "POS", "PRF", "PRP", "PUL", "PUN", "PUQ", "PUR", "TO0", "UNC", "VBB", "VBD", "VBG", "VBI", "VBN", "VBZ", "VDB", "VDD", "VDG", "VDI", "VDN", "VDZ", "VHB", "VHD", "VHG", "VHI", "VHN", "VHZ", "VM0", "VVB", "VVD", "VVG", "VVI", "VVN", "VVZ", "XX0", "ZZ0"))
sum.tags<-c("adjective", "adjective", "adjective", "determiner", "adverb", "adverb_particle", "adverb", "conjunction", "conjunction", "conjunction", "numeral", "determiner", "determiner", "determiner", "existential", "interjection", "noun", "noun", "noun", "other", "other", "other", "noun", "other", "numeral", "pronoun", "pronoun", "pronoun", "pronoun", "pos", "of", "preposition", "punctuation", "punctuation", "punctuation", "punctuation", "infinitive", "other", "be", "be", "be", "be", "be", "be", "do", "do", "do", "do", "do", "do", "have", "have", "have", "have", "have", "have", "modal", "lexical_verb", "lexical_verb", "lexical_verb", "lexical_verb", "lexical_verb", "lexical_verb", "not", "other")
# see how they are aligned:
head(
   data.frame(bnc.tags, sum.tags),
   15)
##    bnc.tags        sum.tags
## 1       aj0       adjective
## 2       ajc       adjective
## 3       ajs       adjective
## 4       at0      determiner
## 5       av0          adverb
## 6       avp adverb_particle
## 7       avq          adverb
## 8       cjc     conjunction
## 9       cjs     conjunction
## 10      cjt     conjunction
## 11      crd         numeral
## 12      dps      determiner
## 13      dt0      determiner
## 14      dtq      determiner
## 15      ex0     existential

2.2 Task 2: load corpus file

We load the corpus file <corp_bnc_sgml_1.txt> into a vector called corpus.file:

corpus.file <- tolower(scan(    # make corpus.file what you get when you load & tolower
   "files/corp_bnc_sgml_1.txt", # load <corp_bnc_sgml_1.txt>
   what=character(),            # which contains text, not numbers
   sep="\n",                    # elements are separated by line breaks
   quote="",                    # there are no quotes in there
   comment.char=""))            # and no comment characters
# check import:
corpus.file[4:9]
## [1] "<u who=xx1ps000>"                                                                                                                                                                         
## [2] "<s n=\"1\"><w av0>carefully <w pnp>they <w vvd>crossed <w at0>the <w nn1>river <w cjc>and <w pnp>they <w vvd>made <w dps>their <w nn1>way <w prp>into <w at0>the <w nn1>city<c pun>."     
## [3] "<s n=\"2\"><w pnp>i <w vvb>believe <w at0>the <w nn2>children <w vm0>should <w vhi>have <w dps>their <w nn1>way <w av0>once <w prp>in <w at0>a <w nn1>while"                              
## [4] "<s n=\"3\"><pause> <w pnp>i <w vvd>seemed <w to0>to <w vhi>have <w vvn>worked <w dps>my <w nn1>way <w avp>down <w prp>to <w at0>the <w nn1>bottom <w prf>of <w dps>my <w nn1>list<c pun>."
## [5] "<s n=\"4\"><w vdd>did <w pnp>you <w vvi>want <w to0>to <trunc> <w unc>s </trunc> <w pnp>you <w vvd>started <unclear> <w dps>your <w nn1>way <w prp-avp>through <w av0>there<c pun>?"      
## [6] "<s n=\"5\"><w cjc>and <w av0>then <w pnp>they<w vm0>'d <w vvi>wind <w dps>their <w nn1>way <w av0>home <w prp>with <w at0>the <w aj0>old <w nn1>port <w cjc>and <w pni>everything<c pun>."

2.3 Task 3: retrieve disfluencies & next word tags

We retrieve all occurrences of the two disfluencies er or erm, which are tagged as “<w unc>” (remember that the corpus was lowercased on loading), together with everything up to the beginning of the next word’s POS tag, and put the result into a vector called disfl.and.tags:

disfl.and.tags <- exact.matches.2(
   "(?x)              # set free-spacing
   <w\\sunc>          # the POS tags of the disfluencies
   erm?               # the disfluencies (m is optional)
   .*?                # the shortest amount of 'stuff' till
   <w\\s...           # the beginning of the next word tag
   ", corpus.file)[[  # from the corpus file
   1]]                # retain only the exact matches, i.e. part 1
                      # of the output of exact.matches.2
# check result:
str(disfl.and.tags)
##  chr [1:2921] "<w unc>erm <w cjs" "<w unc>er <w prp" ...
head(disfl.and.tags)
## [1] "<w unc>erm <w cjs"         "<w unc>er <w prp"         
## [3] "<w unc>erm <pause> <w vhb" "<w unc>erm <w vvn"        
## [5] "<w unc>er <w at0"          "<w unc>erm <w dt0"
tail(disfl.and.tags)
## [1] "<w unc>er<c pun>, <w cjs"  "<w unc>erm <pause> <w pnp"
## [3] "<w unc>erm <w at0"         "<w unc>er <pause> <w aj0" 
## [5] "<w unc>er <w pnp"          "<w unc>er <w vvg"

2.4 Task 4: isolate disfluencies & POS tags

From disfl.and.tags, we retrieve the disfluencies, which we put into a vector called all.disfluencies either with gsub or with exact.matches.2:

# using gsub to get the disfluencies
all.disfluencies <- gsub(     # make all.disfluencies the result of replacing
   pattern="<w unc>(erm?).*", # this, which matches the whole result, by
   replacement="\\1",         # only the disfluency bit in it
   x=disfl.and.tags,          # in disfl.and.tags
   perl=TRUE)                 # use Perl-compatible regular expressions
# using exact.matches.2 to get the disfluencies
all.disfluencies <- exact.matches.2(
   "(?x)                    # set free-spacing
   (?<=                     # look to the left & see
   <w\\sunc>)               # the disfluency tag
   erm?                     # & then capture the disfluency
   ", disfl.and.tags)[[     # from disfl.and.tags
   1]]                      # retain only the exact matches, i.e. part 1
                            # of the output of exact.matches.2

From disfl.and.tags, we retrieve the POS tags of the following words (either with gsub or with substr), which we put into a vector called all.tags.after.disfluencies.

# using gsub to get the tags
all.tags.after.disfluencies <- gsub( # make all.tags.after.disfluencies the result of replacing
   pattern="^.* ",                   # everything from the beginning to the last space
   replacement="",                   # by nothing
   x=disfl.and.tags,                 # in disfl.and.tags
   perl=TRUE)                        # use Perl-compatible regular expressions
# using substr to get the tags
all.tags.after.disfluencies <- substr( # make all.tags.after.disfluencies the substring of
   x=disfl.and.tags,                   # disfl.and.tags
   start=nchar(disfl.and.tags)-2,      # 3 characters from the end
   stop=nchar(disfl.and.tags))         # to the end
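
A quick sanity check might be worthwhile here (just a sketch, output not shown): both new vectors should have exactly one element per match in disfl.and.tags, and the disfluencies should only ever be er or erm:

# check that nothing was lost or added along the way
length(disfl.and.tags); length(all.disfluencies); length(all.tags.after.disfluencies)
# check that only er and erm were isolated as disfluencies
table(all.disfluencies)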

2.5 Task 5: get the coarse-grained tags

We create a vector sum.tags.after.disfluencies that contains for each of the (fine-grained) POS tags the corresponding coarse-grained POS tag:

where.are.the.tags.in.our.vector <- # make this vector
   match(                           # the positions where
      all.tags.after.disfluencies,  # the tags in our data
      bnc.tags)                     # occur in the list of the fine-grained tags

sum.tags.after.disfluencies <-                # make this vector
   sum.tags[where.are.the.tags.in.our.vector] # the summary tags in those positions
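
To see what match is doing here, consider this minimal sketch with three fine-grained tags from bnc.tags: each tag is looked up in bnc.tags, and the coarse-grained tag in the same position of sum.tags is returned (tags not listed in bnc.tags would come back as NA):

# look up three fine-grained tags & return their coarse-grained equivalents
sum.tags[match(c("vvd", "at0", "nn1"), bnc.tags)]
## [1] "lexical_verb" "determiner"   "noun"
# how many tags in our data did not get a coarse-grained equivalent? (output not shown)
sum(is.na(sum.tags.after.disfluencies))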

2.6 Task 6: cross-tabulation

We cross-tabulate the disfluencies with those coarse-grained POS tags:

# cross-tabulate the coarse-grained POS tags (rows) with the disfluencies (columns)
result <- table(sum.tags.after.disfluencies, all.disfluencies)
# re-order the rows by the ratio of er to erm frequencies & print the result
(result <- result[order(result[,1]/result[,2]),])
##                            all.disfluencies
## sum.tags.after.disfluencies  er erm
##             adverb_particle   1   2
##             of                9  11
##             have             13  15
##             pronoun         179 201
##             conjunction     142 145
##             determiner      249 235
##             do                8   7
##             existential      19  15
##             other            96  75
##             modal            13  10
##             numeral          16  12
##             adverb          115  86
##             be               35  24
##             interjection      9   6
##             lexical_verb    131  80
##             noun            270 162
##             preposition     161  94
##             adjective       154  84
##             infinitive       18   9
##             not               7   3

2.7 Task 7: quantify which tag goes w/ which disfluency

We compute the so-called difference coefficient, which expresses which disfluency each tag prefers and how strongly; it is computed like this:

\[\text{diff. coeff.}=\frac{freq_{POS~with~er} - freq_{POS~with~erm}}{freq_{POS~with~er} + freq_{POS~with~erm}}\]

These values fall into the interval [-1,1]; here is how they would behave if the two disfluencies were equally frequent. Imagine you have a word that occurs 10 times in a corpus. Then, the word could occur

  • 0 times with er, 10 times with erm;
  • 1 time with er, 9 times with erm;
  • 2 times with er, 8 times with erm; …
  • 9 times with er, 1 time with erm;
  • 10 times with er, 0 times with erm.

This is how these 11 situations are reflected in the difference coefficients:
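
A quick sketch in R (not part of the main script, just an illustration with hypothetical frequencies that always sum to 10) makes this concrete; the coefficients run from -1 to 1 in steps of 0.2:

er  <- 0:10                 # hypothetical er frequencies: 0, 1, ..., 10
erm <- 10:0                 # hypothetical erm frequencies: 10, 9, ..., 0
(er-erm) / (er+erm)         # the corresponding difference coefficients
## [1] -1.0 -0.8 -0.6 -0.4 -0.2  0.0  0.2  0.4  0.6  0.8  1.0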

Let’s compute this for our disfluencies:

numerator <- result[,1]-result[,2]   # compute pairwise differences between columns
denominator <- result[,1]+result[,2] # or rowSums(result)
sort(difference.coefficients <- numerator/denominator)
## adverb_particle              of            have         pronoun     conjunction 
##     -0.33333333     -0.10000000     -0.07142857     -0.05789474     -0.01045296 
##      determiner              do     existential           other           modal 
##      0.02892562      0.06666667      0.11764706      0.12280702      0.13043478 
##         numeral          adverb              be    interjection    lexical_verb 
##      0.14285714      0.14427861      0.18644068      0.20000000      0.24170616 
##            noun     preposition       adjective      infinitive             not 
##      0.25000000      0.26274510      0.29411765      0.33333333      0.40000000

However, we actually need to correct for the fact that the two disfluencies are not equally frequent overall. We can do so by (i) computing, for each part of speech, the odds of er (i.e. its frequency with er divided by its frequency with erm) and then (ii) dividing those odds by the overall odds of er in the data as a whole:

numerator <- result[,1]/result[,2]   # compute pairwise ratios between columns
denominator <- sum(result[,1])/sum(result[,2])
sort(odds.ratios <- numerator/denominator)
## adverb_particle              of            have         pronoun     conjunction 
##       0.3878419       0.6346505       0.6722594       0.6907832       0.7596353 
##      determiner              do     existential           other           modal 
##       0.8218948       0.8864959       0.9825329       0.9928754       1.0083891 
##         numeral          adverb              be    interjection    lexical_verb 
##       1.0342452       1.0372517       1.1312057       1.1635258       1.2701824 
##            noun     preposition       adjective      infinitive             not 
##       1.2928065       1.3285650       1.4220871       1.5513678       1.8099291

2.8 Task 8: visualize the results

How about a plot that reflects both

  • the frequency of each tag with both disfluencies;
  • the preferential behavior of each tag with regard to the disfluencies?

We put the (binary log of the) former on the x-axis and the latter on the y-axis:

plot(type="n",                            # plot nothing
   xlab="Binary log of tag frequency",    # w/ this x-axis label
   xlim=c(0, 10),                         # w/ these x-axis limits
   x=log2(rowSums(result)),               # & these x-axis values
   ylab="Odds ratios  (>1: er; <1: erm)", # w/ this y-axis label
   ylim=c(0, 2),                          # w/ these y-axis limits
   y=odds.ratios)                         # w/ these y-axis values
grid() # add a grey grid
text(                       # plot text
   log2(rowSums(result)),   # at these x-axis coordinates
   odds.ratios,             # at these y-axis coordinates
   labels=rownames(result), # namely the coarse-grained POS tags
   font=3, cex=0.75)        # italicized, 25% smaller
# add a dashed horizontal line at 'neutrality' (the disfluencies' frequencies)
abline(h=1, lty=2)

Housekeeping:

sessionInfo()
## R version 4.2.2 Patched (2022-11-10 r83330)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Pop!_OS 22.04 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   compiler 
## [8] base     
## 
## other attached packages:
## [1] magrittr_2.0.3
## 
## loaded via a namespace (and not attached):
##  [1] digest_0.6.31   R6_2.5.1        jsonlite_1.8.4  evaluate_0.20  
##  [5] highr_0.10      cachem_1.0.6    rlang_1.0.6     cli_3.6.0      
##  [9] rstudioapi_0.14 jquerylib_0.1.4 bslib_0.4.2     rmarkdown_2.20 
## [13] tools_4.2.2     xfun_0.37       yaml_2.3.7      fastmap_1.1.0  
## [17] htmltools_0.5.4 knitr_1.42      sass_0.4.5