Let’s load some packages we’ll need as well as the data for the as-predicative constructions (based on the ICE-GB) studied in, among other places, Gries, Hampe, & Schönefeld (2005, 2010):
rm(list=ls(all.names=TRUE))
library(doParallel); library(lme4); library(mgcv); library(rgl); library(scatterplot3d)
log.neg <- function (qwe, base=2) {
return(ifelse(qwe==0, 0, log(abs(qwe), base=base) * sign(qwe)))
}
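The helper function log.neg, which we will need later, is a signed log: it logs the absolute value of its input and restores the sign, with 0 mapped to 0. A quick illustration (values chosen arbitrarily):
log.neg(c(-8, -1, 0, 1, 8)) # signed log to the base of 2: returns -3 0 0 0 3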
head(x <- read.delim("2020_STG_CollAsGlmer.csv", row.names=1))
## VERBINCORPFREQ VERBINCXFREQ
## regard 99 80
## describe 259 88
## see 1988 111
## know 2120 79
## treat 92 21
## define 83 18
summary(x)
## VERBINCORPFREQ VERBINCXFREQ
## Min. : 1.0 Min. : 1.000
## 1st Qu.: 13.5 1st Qu.: 1.000
## Median : 62.0 Median : 2.000
## Mean : 261.6 Mean : 6.421
## 3rd Qu.: 202.0 3rd Qu.: 3.000
## Max. :4287.0 Max. :111.000
This data frame is the ‘traditional input format’ for collostructional analyses; we need to change this into the long format / case-by-variable format that is required for a collexeme analysis in the form of a glmer, as suggested by Baayen (2011):
freq.of.cx <- sum(x$VERBINCXFREQ) # 1131
corpus.size <- 138664 # number of verbs quoted in previous work
# prepare for generalized linear mixed-effects model
WORD <- factor(c(
rep(rownames(x), x$VERBINCXFREQ),
rep(rownames(x), x$VERBINCORPFREQ-x$VERBINCXFREQ),
rep("all.other", corpus.size-sum(x$VERBINCORPFREQ))))
CX <- factor(c(
rep("aspred", sum(x$VERBINCXFREQ)),
rep("all.other", corpus.size-sum(x$VERBINCXFREQ))))
summary(ASPREDICATIVE <-
data.frame(WORD=relevel(WORD, "all.other"),
CX=relevel(CX, "all.other")))
## WORD CX
## all.other:110674 all.other:137977
## have : 4287 aspred : 687
## know : 2120
## see : 1988
## make : 1951
## take : 1653
## (Other) : 15991
rm(WORD); rm(CX)
This way, we have a factor WORD that states, for each of the 138,664 cases, which verb it is (or all.other), and a factor CX that states whether that case is an instance of the as-predicative or not (all.other).
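As a quick sanity check (not strictly necessary), the long format should reproduce the frequencies shown in the head(x) output above, e.g. for regard:
sum(ASPREDICATIVE$WORD=="regard" & ASPREDICATIVE$CX=="aspred") # should equal x["regard", "VERBINCXFREQ"]
sum(ASPREDICATIVE$WORD=="regard")                              # should equal x["regard", "VERBINCORPFREQ"]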
Let’s first use the data frame ASPREDICATIVE to compute a collexeme analysis using a traditional measure such as the log-likelihood ratio (LLR);¹ I will use a signed version such that positive and negative values reflect attraction and repulsion, respectively.
registerDoParallel(cl <- makePSOCKcluster(detectCores()-1)) # set up a cluster on all but one core
g.squareds <- foreach(i=levels(ASPREDICATIVE$WORD), .combine=c) %dopar% { # for each verb, ...
   # ... fit a binomial glm predicting the cx from whether the word is the current verb
   curr.model <- glm(CX ~ WORD==i, data=ASPREDICATIVE, family=binomial)
   # the reduction in deviance is that verb's G-squared/LLR, signed by the direction of its effect
   (curr.model$null.deviance - curr.model$deviance) * sign(coef(curr.model)[2])
}; stopCluster(cl) # shut the cluster down again
names(g.squareds) <- levels(ASPREDICATIVE$WORD)
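As a sanity check (with regard picked just for illustration), the deviance-based score for any one verb should match the G-squared one would compute directly from that verb's 2-by-2 co-occurrence table:
v <- "regard" # any verb would do
obs.freqs <- table(ASPREDICATIVE$WORD==v, ASPREDICATIVE$CX)                 # observed 2x2 table
exp.freqs <- outer(rowSums(obs.freqs), colSums(obs.freqs)) / sum(obs.freqs) # expected frequencies
2*sum(obs.freqs*log(obs.freqs/exp.freqs), na.rm=TRUE) # should match abs(g.squareds["regard"])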
Let’s look at the sorted LLR values and then plot the relationship between LLR and frequency:
sort(round(g.squareds, 2))
## all.other have make put call find read give look_at
## -2212.28 -18.47 -12.94 -5.15 -2.12 -1.96 -1.84 -1.61 -0.92
## provide remember run write allow hold agree develop report
## -0.50 -0.48 -0.48 -0.33 -0.29 -0.21 -0.12 -0.02 0.10
## base measure suggest publish leave identify express propose prepare
## 0.11 0.27 0.34 0.35 0.39 0.41 0.63 0.63 0.78
## offer build date attack observe declare paint keep count
## 0.82 0.88 0.92 0.94 1.16 1.18 1.24 1.48 1.53
## represent choose elect preserve establish judge construct list serve
## 1.75 1.78 1.92 1.92 1.97 2.02 2.14 2.14 2.21
## term fancy intend show characteris|ze label classify translate adopt
## 2.77 3.06 3.34 3.49 3.54 3.54 3.68 4.01 4.16
## claim credit group show_up praise designate gloss conceive_of symbolis|ze
## 4.19 4.20 4.20 4.20 4.66 4.93 5.26 5.65 5.65
## present register misread acknowledge construe mark_down visualis|ze refer_to prescribe_for
## 5.72 5.82 6.15 6.69 6.82 6.82 6.82 7.45 7.85
## tout display take conceive catapult re-elect cast train think_of
## 7.85 8.42 8.93 9.87 10.62 10.62 10.77 10.91 11.46
## cite rate depict diagnose portray advert_to name accept consider
## 11.77 11.77 12.30 13.64 15.44 15.71 16.63 17.03 19.65
## dismiss denounce appoint interpret class hail perceive categoris|ze recognis|ze
## 20.71 22.34 24.70 24.70 25.15 27.37 34.86 50.32 51.86
## map view use define treat know see describe regard
## 55.44 78.30 93.46 105.37 125.40 191.28 356.67 615.35 762.20
par(mfrow=c(1,2))
plot(main="Freq & LLR of verbs in the as-predicative", type="n",
xlab="Frequency (logged to the base of 2)", ylab="Log-likelihood ratio",
xlim=range(log2(table(ASPREDICATIVE$WORD))), ylim=range(g.squareds),
x=0, y=0); grid()
text(x=log2(table(ASPREDICATIVE$WORD)), y=g.squareds, labels=names(log2(table(ASPREDICATIVE$WORD))), font=3)
plot(main="Freq & LLR of verbs in the as-predicative", type="n",
xlab="Frequency (logged to the base of 2)", ylab="Log-likelihood ratio",
xlim=range(log2(table(ASPREDICATIVE$WORD))), ylim=c(-20, max(g.squareds)),
x=0, y=0); grid()
text(x=log2(table(ASPREDICATIVE$WORD)), y=g.squareds, labels=names(log2(table(ASPREDICATIVE$WORD))),
font=3, cex=0.75, srt=45, col=ifelse(g.squareds>0, "blue", "red"))
par(mfrow=c(1,1))
In terms of LLR association strength, we find results that confirm previously published work: regard scores highest, followed by verbs such as describe, see, know, treat, define, use, view, map, recognis|ze, categoris|ze, perceive, and hail.
Now let’s do the same kind of analysis with a glmer, namely fit a null model that predicts the construction from an overall intercept that is adjusted for each verb; these by-verb intercept adjustments then serve as the measure of association:
summary(m.null <- glmer( # fit a glmer
CX ~ 1 + # predicting the construction from an overall intercept
(1|WORD), # which is adjusted for each verb
family=binomial, data=ASPREDICATIVE))
## Generalized linear mixed model fit by maximum likelihood (Laplace Approximation) ['glmerMod']
## Family: binomial ( logit )
## Formula: CX ~ 1 + (1 | WORD)
## Data: ASPREDICATIVE
##
## AIC BIC logLik deviance df.resid
## 4862.0 4881.7 -2429.0 4858.0 138662
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -1.975 -0.004 -0.004 -0.004 32.116
##
## Random effects:
## Groups Name Variance Std.Dev.
## WORD (Intercept) 3.954 1.988
## Number of obs: 138664, groups: WORD, 108
##
## Fixed effects:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.4214 0.1891 -18.09 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# retrieve the varying intercept adjustments for each verb and ...
varying.intercepts <- ranef(m.null)$WORD[,1]
# ... name them with their verb:
names(varying.intercepts) <- rownames(ranef(m.null)$WORD)
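Just to make these adjustments a bit more tangible (a small optional step), one can add each verb's adjustment to the fixed-effect intercept and back-transform to the probability scale:
# optional: per-verb predicted probability of the as-predicative
# = inverse logit of (fixed intercept + that verb's intercept adjustment)
pred.probs <- plogis(fixef(m.null)[1] + varying.intercepts)
round(sort(pred.probs, decreasing=TRUE)[1:5], 4) # the verbs most strongly attracted to the cx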
In terms of varying-intercepts association strength, we find results that are largely compatible with previously published work (at least in terms of the semantics of the verbs scoring high): regard scores highest, followed by verbs such as hail, categoris|ze, class, describe, denounce, advert_to, re-elect, catapult, map, view, diagnose, and tout.
sort(varying.intercepts)
## all.other make have put call read find give look_at
## -7.54650493 -3.51736379 -3.18193024 -2.84003289 -2.55462836 -2.49805711 -2.45887286 -2.35217396 -2.25511212
## provide remember run write allow hold agree develop leave
## -2.09034243 -2.07664101 -2.07664101 -2.03101942 -1.97162102 -1.91258247 -1.82504806 -1.67101914 -1.46443739
## suggest report base keep take measure show offer build
## -1.28293920 -1.26217926 -1.25035384 -1.13366754 -1.07253211 -1.06508994 -1.04415713 -1.03882175 -1.01085688
## publish identify express propose represent prepare choose establish date
## -0.99593088 -0.94651975 -0.77845950 -0.77845950 -0.68400829 -0.68014590 -0.67719966 -0.61350631 -0.59497285
## attack serve observe declare paint claim count intend refer_to
## -0.58207792 -0.53705603 -0.45625917 -0.44105342 -0.40979741 -0.37542183 -0.25279667 -0.20324216 -0.10547476
## think_of present elect preserve judge adopt construct list consider
## -0.08164890 -0.07840152 -0.05969422 -0.05969422 -0.01000756 0.01783862 0.04281294 0.04281294 0.07514220
## use know accept term display register fancy see acknowledge
## 0.08019755 0.16906002 0.21740809 0.33334837 0.40588618 0.44292995 0.45818813 0.59204011 0.65504902
## characteris|ze label classify cast train translate credit group show_up
## 0.66076883 0.66076883 0.72027990 0.80661220 0.83078287 0.85404651 0.93012392 0.93012392 0.93012392
## praise name designate recognis|ze gloss conceive conceive_of symbolis|ze appoint
## 1.10753445 1.11596675 1.21320890 1.25147269 1.33471882 1.40482473 1.47771942 1.47771942 1.53586598
## interpret portray dismiss misread cite rate construe mark_down visualis|ze
## 1.53586598 1.57978710 1.63454760 1.65156394 1.83946912 1.83946912 1.87342510 1.87342510 1.87342510
## depict perceive define treat prescribe_for tout diagnose view map
## 1.95877205 2.01066397 2.09927605 2.16904274 2.18035251 2.18035251 2.25840632 2.46437341 2.66098631
## catapult re-elect advert_to denounce describe class categoris|ze hail regard
## 2.67880362 2.67880362 2.70768281 2.71526502 2.74506958 3.17725878 3.47028060 3.52889945 4.78200774
par(mfrow=c(1,2))
plot(main="Freq & intercept adj of verbs in the as-predicative", type="n",
xlab="Frequency (logged to the base of 2)", ylab="Intercept adjustments",
xlim=range(log2(table(ASPREDICATIVE$WORD))), ylim=range(varying.intercepts),
x=0, y=0); grid()
text(x=log2(table(ASPREDICATIVE$WORD)), y=varying.intercepts, labels=names(log2(table(ASPREDICATIVE$WORD))), font=3)
abline(h=0, lty=2)
plot(main="Freq & interc. adj. of verbs in the as-predicative", type="n",
xlab="Frequency (logged to the base of 2)", ylab="Intercept adjustments",
xlim=range(log2(table(ASPREDICATIVE$WORD))), ylim=c(-4, max(varying.intercepts)),
x=0, y=0); grid()
text(x=log2(table(ASPREDICATIVE$WORD)), y=varying.intercepts, labels=names(log2(table(ASPREDICATIVE$WORD))),
font=3, cex=0.75, srt=45, col=ifelse(varying.intercepts>0, "blue", "red"))
abline(h=0, lty=2)
par(mfrow=c(1,1))
Let’s correlate the LLR-scores (raw and logged) with the varying intercepts from the glmer:
par(mfrow=c(1,2))
plot(g.squareds, varying.intercepts, type="n"); grid()
text(g.squareds, varying.intercepts, names(g.squareds))
lines(lowess(varying.intercepts ~ g.squareds), col="red", lwd=2)
cor.test(g.squareds, varying.intercepts, method="spearman")
##
## Spearman's rank correlation rho
##
## data: g.squareds and varying.intercepts
## S = 25131, p-value < 2.2e-16
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## 0.8802919
plot(log.neg(g.squareds), varying.intercepts, type="n"); grid()
text(log.neg(g.squareds), varying.intercepts, names(g.squareds))
lines(lowess(varying.intercepts ~ log.neg(g.squareds)), col="red", lwd=2)
cor.test(log.neg(g.squareds), varying.intercepts, method="spearman")
##
## Spearman's rank correlation rho
##
## data: log.neg(g.squareds) and varying.intercepts
## S = 45554, p-value < 2.2e-16
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## 0.7830068
par(mfrow=c(1,1))
There’s definitely the (expected) positive correlation; the modeling that follows does not include the ‘all.other’ cases:
summary(xx <- data.frame(
FREQ=as.numeric(log2(table(ASPREDICATIVE$WORD))[-1]),
G2LOG=log.neg(g.squareds)[-1],
VARINT=varying.intercepts[-1]))
## FREQ G2LOG VARINT
## Min. : 0.000 Min. :-4.2072 Min. :-3.5174
## 1st Qu.: 3.754 1st Qu.: 0.7144 1st Qu.:-0.9712
## Median : 5.954 Median : 2.0718 Median : 0.0802
## Mean : 5.822 Mean : 2.1948 Mean : 0.2580
## 3rd Qu.: 7.658 3rd Qu.: 3.5571 3rd Qu.: 1.5359
## Max. :12.066 Max. : 9.5740 Max. : 4.7820
summary(m1 <- gam(G2LOG ~ s(FREQ) + s(VARINT), data=xx))
##
## Family: gaussian
## Link function: identity
##
## Formula:
## G2LOG ~ s(FREQ) + s(VARINT)
##
## Parametric coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.19475 0.08414 26.09 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df F p-value
## s(FREQ) 8.818 8.987 27.27 <2e-16 ***
## s(VARINT) 8.196 8.823 81.68 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.89 Deviance explained = 90.8%
## GCV = 0.91081 Scale est. = 0.75748 n = 107
summary(m2 <- gam(G2LOG ~ s(FREQ, VARINT), data=xx))
##
## Family: gaussian
## Link function: identity
##
## Formula:
## G2LOG ~ s(FREQ, VARINT)
##
## Parametric coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.19475 0.07123 30.81 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df F p-value
## s(FREQ,VARINT) 24.85 27.96 44.62 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.921 Deviance explained = 94%
## GCV = 0.71589 Scale est. = 0.54293 n = 107
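Before deciding between the two models, here is a quick optional check of how similar their predictions actually are (just comparing the fitted values):
cor(fitted(m1), fitted(m2))                # correlation of the two models' fitted values
plot(fitted(m1), fitted(m2)); abline(0, 1) # points close to the diagonal = very similar predictions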
Given the similarity of the R²s (and the similarity of the predictions), let’s go with the simpler model and see how the (logged) LLR-values ‘follow’ from the (logged) frequencies and the varying intercepts:
nd <- expand.grid( # create a 20x20 grid covering the observed ranges of ...
   FREQ=seq(min(xx$FREQ), max(xx$FREQ), length.out=20),        # ... the logged frequencies and
   VARINT=seq(min(xx$VARINT), max(xx$VARINT), length.out=20))  # ... the intercept adjustments
nd$PREDS <- predict(m1, newdata=nd) # add m1's predictions for every grid point
scatterplot3d(pch=16, highlight.3d=TRUE,
   xlab="Frequency (logged to the base of 2)", x=nd$FREQ,
   ylab="Intercept adjustments", y=nd$VARINT,
   zlab="Predicted LLR", z=nd$PREDS)
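Alternatively (an optional shortcut), mgcv's built-in perspective plot of the fitted surface provides a similar static view:
vis.gam(m1, view=c("FREQ", "VARINT"), theta=40, phi=25, ticktype="detailed") # perspective plot of m1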
The rotatable 3-D version can be created like this:
plot3d(
xlab="Frequency (logged to the base of 2)", x=nd$FREQ,
ylab="Intercept adjustments", y=nd$VARINT,
zlab="Predicted LLR", z=nd$PREDS)
text3d(
x=xx$FREQ,
y=xx$VARINT,
z=xx$G2LOG,
texts=rownames(xx), col="blue")
Of course, we could compute this (faster) from the data frame x; I’m just showing all the computations based on the case-by-variable format.↩︎
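For the record, here is a sketch of what that faster direct computation could look like; it relies only on x, freq.of.cx, and corpus.size from above and should return the same signed G-squared values as the glm-based loop:
g2.direct <- sapply(rownames(x), function(v) { # for every verb in x ...
   verb.in.cx      <- x[v, "VERBINCXFREQ"]                             # verb in the as-predicative
   verb.elsewhere  <- x[v, "VERBINCORPFREQ"] - verb.in.cx              # verb elsewhere in the corpus
   other.in.cx     <- freq.of.cx - verb.in.cx                          # other verbs in the as-predicative
   other.elsewhere <- corpus.size - x[v, "VERBINCORPFREQ"] - other.in.cx # other verbs elsewhere
   obs  <- rbind(c(verb.in.cx, verb.elsewhere), c(other.in.cx, other.elsewhere)) # observed 2x2 table
   exps <- outer(rowSums(obs), colSums(obs)) / sum(obs)                # expected frequencies
   2*sum(obs*log(obs/exps), na.rm=TRUE) * sign(verb.in.cx-exps[1,1])   # signed G-squared
})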