1 Preparation

Let’s load some packages we’ll need as well as the data for the as-predicative constructions (based on the ICE-GB) studied in, among other places, Gries, Hampe, & Schönefeld (2005, 2010):

rm(list=ls(all.names=TRUE))
library(doParallel); library(lme4); library(mgcv); library(rgl); library(scatterplot3d)
log.neg <- function (qwe, base=2) { # signed log: log of |qwe| to the given base, with the sign of qwe retained (0 stays 0)
   return(ifelse(qwe==0, 0, log(abs(qwe), base=base) * sign(qwe)))
}

head(x <- read.delim("2020_STG_CollAsGlmer.csv", row.names=1))
##          VERBINCORPFREQ VERBINCXFREQ
## regard               99           80
## describe            259           88
## see                1988          111
## know               2120           79
## treat                92           21
## define               83           18
summary(x)
##  VERBINCORPFREQ    VERBINCXFREQ
##  Min.   :   1.0   Min.   :  1.000
##  1st Qu.:  13.5   1st Qu.:  1.000
##  Median :  62.0   Median :  2.000
##  Mean   : 261.6   Mean   :  6.421
##  3rd Qu.: 202.0   3rd Qu.:  3.000
##  Max.   :4287.0   Max.   :111.000

This data frame is in the ‘traditional input format’ for collostructional analyses; we need to change it into the long / case-by-variable format that is required for a collexeme analysis in the form of a glmer, as suggested by Baayen (2011):

freq.of.cx <- sum(x$VERBINCXFREQ) # 687
corpus.size <- 138664 # total number of verb tokens in the corpus, as given in previous work

# prepare for generalized linear mixed-effects model
WORD <- factor(c(
   rep(rownames(x), x$VERBINCXFREQ),
   rep(rownames(x), x$VERBINCORPFREQ-x$VERBINCXFREQ),
   rep("all.other", corpus.size-sum(x$VERBINCORPFREQ))))
CX <- factor(c(
   rep("aspred", sum(x$VERBINCXFREQ)),
   rep("all.other", corpus.size-sum(x$VERBINCXFREQ))))

summary(ASPREDICATIVE <-
           data.frame(WORD=relevel(WORD, "all.other"),
                      CX=relevel(CX, "all.other")))
##         WORD                CX
##  all.other:110674   all.other:137977
##  have     :  4287   aspred   :   687
##  know     :  2120
##  see      :  1988
##  make     :  1951
##  take     :  1653
##  (Other)  : 15991
rm(WORD); rm(CX)

This way, we have a case-by-variable data frame with one row for each of the 138,664 verb tokens in the corpus, recording for each token the verb (WORD) and whether it occurs in the as-predicative construction (CX).
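
A quick sanity check (my addition, not part of the original script): cross-tabulating CX and WORD for a single verb should recover the frequencies of the original input format, e.g. for regard 80 tokens inside the construction and 99 − 80 = 19 outside of it:

table(ASPREDICATIVE$CX, ASPREDICATIVE$WORD)[, "regard"] # should yield all.other = 19, aspred = 80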

2 Computing a traditional collexeme analysis (LLR, from 2x2 tables)

Let’s first use the data frame ASPREDICATIVE to compute a collexeme analysis using a traditional measure such as the log-likelihood ratio (LLR);1 I will use a signed version such that positive and negative values reflect attraction and repulsion respectively.

registerDoParallel(cl <- makePSOCKcluster(detectCores()-1)) # set up a parallel backend
g.squareds <- foreach(i=levels(ASPREDICATIVE$WORD), .combine=c) %dopar% {
   # for each verb, fit a binomial glm predicting the construction from 'this verb vs. all others'
   curr.model <- glm(CX ~ WORD==i, data=ASPREDICATIVE, family=binomial)
   # LLR/G-squared = reduction in deviance, signed by the direction of the verb's effect
   (curr.model$null.deviance - curr.model$deviance) * sign(coef(curr.model)[2])
}; stopCluster(cl)
names(g.squareds) <- levels(ASPREDICATIVE$WORD)
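
As the footnote below notes, these values can also be computed (faster) directly from the data frame x; the following sketch (the helper signed.g2 and its argument names are mine, not part of the original script) does this for a single verb from the implied 2x2 table and should reproduce the glm-based value up to rounding:

signed.g2 <- function(verb, dat=x, cx.total=sum(dat$VERBINCXFREQ), corp.total=corpus.size) {
   o11 <- dat[verb, "VERBINCXFREQ"]                       # verb in the as-predicative
   o12 <- dat[verb, "VERBINCORPFREQ"] - o11               # verb elsewhere in the corpus
   o21 <- cx.total - o11                                  # other verbs in the as-predicative
   o22 <- corp.total - dat[verb, "VERBINCORPFREQ"] - o21  # other verbs elsewhere
   obs <- matrix(c(o11, o12, o21, o22), ncol=2, byrow=TRUE)
   expd <- outer(rowSums(obs), colSums(obs)) / sum(obs)   # expected frequencies under independence
   llr <- 2 * sum(obs * log(obs/expd), na.rm=TRUE)        # G-squared (natural log, as in the glm deviances)
   llr * sign(o11 - expd[1,1])                            # positive = attraction, negative = repulsion
}
signed.g2("regard") # should be close to the glm-based value for 'regard'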

Let’s look at the sorted LLR values and then plot the relationship between LLR and frequency:

sort(round(g.squareds, 2))
##      all.other           have           make            put           call           find           read           give        look_at
##       -2212.28         -18.47         -12.94          -5.15          -2.12          -1.96          -1.84          -1.61          -0.92
##        provide       remember            run          write          allow           hold          agree        develop         report
##          -0.50          -0.48          -0.48          -0.33          -0.29          -0.21          -0.12          -0.02           0.10
##           base        measure        suggest        publish          leave       identify        express        propose        prepare
##           0.11           0.27           0.34           0.35           0.39           0.41           0.63           0.63           0.78
##          offer          build           date         attack        observe        declare          paint           keep          count
##           0.82           0.88           0.92           0.94           1.16           1.18           1.24           1.48           1.53
##      represent         choose          elect       preserve      establish          judge      construct           list          serve
##           1.75           1.78           1.92           1.92           1.97           2.02           2.14           2.14           2.21
##           term          fancy         intend           show characteris|ze          label       classify      translate          adopt
##           2.77           3.06           3.34           3.49           3.54           3.54           3.68           4.01           4.16
##          claim         credit          group        show_up         praise      designate          gloss    conceive_of    symbolis|ze
##           4.19           4.20           4.20           4.20           4.66           4.93           5.26           5.65           5.65
##        present       register        misread    acknowledge       construe      mark_down    visualis|ze       refer_to  prescribe_for
##           5.72           5.82           6.15           6.69           6.82           6.82           6.82           7.45           7.85
##           tout        display           take       conceive       catapult       re-elect           cast          train       think_of
##           7.85           8.42           8.93           9.87          10.62          10.62          10.77          10.91          11.46
##           cite           rate         depict       diagnose        portray      advert_to           name         accept       consider
##          11.77          11.77          12.30          13.64          15.44          15.71          16.63          17.03          19.65
##        dismiss       denounce        appoint      interpret          class           hail       perceive   categoris|ze    recognis|ze
##          20.71          22.34          24.70          24.70          25.15          27.37          34.86          50.32          51.86
##            map           view            use         define          treat           know            see       describe         regard
##          55.44          78.30          93.46         105.37         125.40         191.28         356.67         615.35         762.20
par(mfrow=c(1,2))
plot(main="Freq & LLR of verbs in the as-predicative", type="n",
   xlab="Frequency (logged to the base of 2)", ylab="Log-likelihood ratio",
   xlim=range(log2(table(ASPREDICATIVE$WORD))), ylim=range(g.squareds),
   x=0, y=0); grid()
text(x=log2(table(ASPREDICATIVE$WORD)), y=g.squareds, labels=names(log2(table(ASPREDICATIVE$WORD))), font=3)
plot(main="Freq & LLR of verbs in the as-predicative", type="n",
   xlab="Frequency (logged to the base of 2)", ylab="Log-likelihood ratio",
   xlim=range(log2(table(ASPREDICATIVE$WORD))), ylim=c(-20, max(g.squareds)),
   x=0, y=0); grid()
text(x=log2(table(ASPREDICATIVE$WORD)), y=g.squareds, labels=names(log2(table(ASPREDICATIVE$WORD))),
     font=3, cex=0.75, srt=45, col=ifelse(g.squareds>0, "blue", "red"))

par(mfrow=c(1,1))

In terms of LLR association strength, we find results that confirm previously published work: regard scores highest, followed by verbs such as describe, see, know, treat, define, use, view, map, recognis|ze, categoris|ze, perceive, and hail.

3 Computing a collexeme analysis using varying intercepts of a glmer

Now let’s compute the glmer-based version: a binomial model that predicts the construction from only an overall intercept, which is adjusted for each verb; these verb-specific intercept adjustments then serve as the association measure.

summary(m.null <- glmer( # fit a glmer
   CX ~ 1 +  # predicting the construction from an overall intercept
   (1|WORD), # which is adjusted for each verb
   family=binomial, data=ASPREDICATIVE))
## Generalized linear mixed model fit by maximum likelihood (Laplace Approximation) ['glmerMod']
##  Family: binomial  ( logit )
## Formula: CX ~ 1 + (1 | WORD)
##    Data: ASPREDICATIVE
##
##      AIC      BIC   logLik deviance df.resid
##   4862.0   4881.7  -2429.0   4858.0   138662
##
## Scaled residuals:
##    Min     1Q Median     3Q    Max
## -1.975 -0.004 -0.004 -0.004 32.116
##
## Random effects:
##  Groups Name        Variance Std.Dev.
##  WORD   (Intercept) 3.954    1.988
## Number of obs: 138664, groups:  WORD, 108
##
## Fixed effects:
##             Estimate Std. Error z value Pr(>|z|)
## (Intercept)  -3.4214     0.1891  -18.09   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# retrieve the varying intercept adjustments for each verb and ...
varying.intercepts <- ranef(m.null)$WORD[,1]
# ... name them with their verb:
names(varying.intercepts) <- rownames(ranef(m.null)$WORD)
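
To see what such an intercept adjustment means on the probability scale, here is a minimal sketch (my addition, not part of the original script): the fixed intercept plus a verb’s adjustment is that verb’s predicted log odds of the as-predicative, which plogis() converts into a probability.

plogis(fixef(m.null)["(Intercept)"] + varying.intercepts["regard"]) # e.g. for 'regard'
# slightly shrunk towards the overall mean relative to the observed 80/99 from the input data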

In terms of varying-intercepts association strength, we find results that are largely compatible with previously published work (at least in terms of the semantics of the verbs scoring high): regard scores highest, followed by verbs such as hail, categoris|ze, class, describe, denounce, advert_to, re-elect, catapult, map, view, diagnose, and tout.

sort(varying.intercepts)
##      all.other           make           have            put           call           read           find           give        look_at
##    -7.54650493    -3.51736379    -3.18193024    -2.84003289    -2.55462836    -2.49805711    -2.45887286    -2.35217396    -2.25511212
##        provide       remember            run          write          allow           hold          agree        develop          leave
##    -2.09034243    -2.07664101    -2.07664101    -2.03101942    -1.97162102    -1.91258247    -1.82504806    -1.67101914    -1.46443739
##        suggest         report           base           keep           take        measure           show          offer          build
##    -1.28293920    -1.26217926    -1.25035384    -1.13366754    -1.07253211    -1.06508994    -1.04415713    -1.03882175    -1.01085688
##        publish       identify        express        propose      represent        prepare         choose      establish           date
##    -0.99593088    -0.94651975    -0.77845950    -0.77845950    -0.68400829    -0.68014590    -0.67719966    -0.61350631    -0.59497285
##         attack          serve        observe        declare          paint          claim          count         intend       refer_to
##    -0.58207792    -0.53705603    -0.45625917    -0.44105342    -0.40979741    -0.37542183    -0.25279667    -0.20324216    -0.10547476
##       think_of        present          elect       preserve          judge          adopt      construct           list       consider
##    -0.08164890    -0.07840152    -0.05969422    -0.05969422    -0.01000756     0.01783862     0.04281294     0.04281294     0.07514220
##            use           know         accept           term        display       register          fancy            see    acknowledge
##     0.08019755     0.16906002     0.21740809     0.33334837     0.40588618     0.44292995     0.45818813     0.59204011     0.65504902
## characteris|ze          label       classify           cast          train      translate         credit          group        show_up
##     0.66076883     0.66076883     0.72027990     0.80661220     0.83078287     0.85404651     0.93012392     0.93012392     0.93012392
##         praise           name      designate    recognis|ze          gloss       conceive    conceive_of    symbolis|ze        appoint
##     1.10753445     1.11596675     1.21320890     1.25147269     1.33471882     1.40482473     1.47771942     1.47771942     1.53586598
##      interpret        portray        dismiss        misread           cite           rate       construe      mark_down    visualis|ze
##     1.53586598     1.57978710     1.63454760     1.65156394     1.83946912     1.83946912     1.87342510     1.87342510     1.87342510
##         depict       perceive         define          treat  prescribe_for           tout       diagnose           view            map
##     1.95877205     2.01066397     2.09927605     2.16904274     2.18035251     2.18035251     2.25840632     2.46437341     2.66098631
##       catapult       re-elect      advert_to       denounce       describe          class   categoris|ze           hail         regard
##     2.67880362     2.67880362     2.70768281     2.71526502     2.74506958     3.17725878     3.47028060     3.52889945     4.78200774
par(mfrow=c(1,2))
plot(main="Freq & intercept adj of verbs in the as-predicative", type="n",
   xlab="Frequency (logged to the base of 2)", ylab="Intercept adjustments",
   xlim=range(log2(table(ASPREDICATIVE$WORD))), ylim=range(varying.intercepts),
   x=0, y=0); grid()
text(x=log2(table(ASPREDICATIVE$WORD)), y=varying.intercepts, labels=names(log2(table(ASPREDICATIVE$WORD))), font=3)
abline(h=0, lty=2)
plot(main="Freq & interc. adj. of verbs in the as-predicative", type="n",
   xlab="Frequency (logged to the base of 2)", ylab="Intercept adjustments",
   xlim=range(log2(table(ASPREDICATIVE$WORD))), ylim=c(-4, max(varying.intercepts)),
   x=0, y=0); grid()
text(x=log2(table(ASPREDICATIVE$WORD)), y=varying.intercepts, labels=names(log2(table(ASPREDICATIVE$WORD))),
     font=3, cex=0.75, srt=45, col=ifelse(varying.intercepts>0, "blue", "red"))
abline(h=0, lty=2)

par(mfrow=c(1,1))

4 Comparing the LLR-scores to the varying intercepts of a glmer

Let’s correlate the LLR-scores (raw and logged) with the varying intercepts:

par(mfrow=c(1,2))
plot(g.squareds, varying.intercepts, type="n"); grid()
   text(g.squareds, varying.intercepts, names(g.squareds))
   lines(lowess(varying.intercepts ~ g.squareds), col="red", lwd=2)
   cor.test(g.squareds, varying.intercepts, method="spearman")
##
##  Spearman's rank correlation rho
##
## data:  g.squareds and varying.intercepts
## S = 25131, p-value < 2.2e-16
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##       rho
## 0.8802919
plot(log.neg(g.squareds), varying.intercepts, type="n"); grid()
   text(log.neg(g.squareds), varying.intercepts, names(g.squareds))
   lines(lowess(varying.intercepts ~ log.neg(g.squareds)), col="red", lwd=2)

   cor.test(log.neg(g.squareds), varying.intercepts, method="spearman")
##
##  Spearman's rank correlation rho
##
## data:  log.neg(g.squareds) and varying.intercepts
## S = 45554, p-value < 2.2e-16
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##       rho
## 0.7830068
par(mfrow=c(1,1))
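
One additional quick look (my addition, not in the original): ranking the verbs by both measures and checking which ones diverge most.

rank.diffs <- rank(g.squareds) - rank(varying.intercepts) # both vectors share the same verb order
head(sort(abs(rank.diffs), decreasing=TRUE), 10)          # verbs whose ranks differ most between the measures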

There’s definitely the (expected) positive correlation; the modeling that follows does not include the ‘all.other’ cases:

summary(xx <- data.frame(
   FREQ=as.numeric(log2(table(ASPREDICATIVE$WORD))[-1]),
   G2LOG=log.neg(g.squareds)[-1],
   VARINT=varying.intercepts[-1]))
##       FREQ            G2LOG             VARINT
##  Min.   : 0.000   Min.   :-4.2072   Min.   :-3.5174
##  1st Qu.: 3.754   1st Qu.: 0.7144   1st Qu.:-0.9712
##  Median : 5.954   Median : 2.0718   Median : 0.0802
##  Mean   : 5.822   Mean   : 2.1948   Mean   : 0.2580
##  3rd Qu.: 7.658   3rd Qu.: 3.5571   3rd Qu.: 1.5359
##  Max.   :12.066   Max.   : 9.5740   Max.   : 4.7820
summary(m1 <- gam(G2LOG ~ s(FREQ) + s(VARINT), data=xx))
##
## Family: gaussian
## Link function: identity
##
## Formula:
## G2LOG ~ s(FREQ) + s(VARINT)
##
## Parametric coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  2.19475    0.08414   26.09   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
##             edf Ref.df     F p-value
## s(FREQ)   8.818  8.987 27.27  <2e-16 ***
## s(VARINT) 8.196  8.823 81.68  <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) =   0.89   Deviance explained = 90.8%
## GCV = 0.91081  Scale est. = 0.75748   n = 107
summary(m2 <- gam(G2LOG ~ s(FREQ, VARINT), data=xx))
##
## Family: gaussian
## Link function: identity
##
## Formula:
## G2LOG ~ s(FREQ, VARINT)
##
## Parametric coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  2.19475    0.07123   30.81   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
##                  edf Ref.df     F p-value
## s(FREQ,VARINT) 24.85  27.96 44.62  <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) =  0.921   Deviance explained =   94%
## GCV = 0.71589  Scale est. = 0.54293   n = 107
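
Before deciding between the two fits, a minimal check (my addition, not in the original) of how similar their predictions for the observed verbs actually are:

cor(predict(m1), predict(m2))                 # correlation of the two models' fitted values
plot(predict(m1), predict(m2)); abline(0, 1)  # points should fall close to the diagonal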

Given the similarity of the two models’ adjusted R-sq. values (and the similarity of their predictions), let’s go with the simpler model (m1) and see how the (logged) LLR-values ‘follow’ from the (logged) frequencies and the varying intercepts:

nd <- expand.grid(
   FREQ=seq(min(xx$FREQ), max(xx$FREQ), length.out=20),
   VARINT=seq(min(xx$VARINT), max(xx$VARINT), length.out=20))
nd$PREDS <- predict(m1, newdata=nd)
scatterplot3d(pch=16, highlight.3d=TRUE,
   xlab="Frequency (logged to based of 2)", x=nd$FREQ,
   ylab="Intercept adjustments", y=nd$VARINT,
   zlab="Predicted LLR", z=nd$PREDS)

The rotatable 3-D version can be created like this:

plot3d(
   xlab="Frequency (logged to based of 2)", x=nd$FREQ,
   ylab="Intercept adjustments", y=nd$VARINT,
   zlab="Predicted LLR", z=nd$PREDS)
text3d(
   x=xx$FREQ,
   y=xx$VARINT,
   z=xx$G2LOG,
   texts=rownames(xx), col="blue")

  1. Of course, we could compute this (faster) from the data frame x; I’m just showing all the computations based on the case-by-variable format.↩︎