1 Intro
Let’s load a data set for a few binary logistic regression modeling examples; the data are in _input/genitives.csv and you can find information about the variables/columns in _input/genitives.r.
rm(list=ls(all.names=TRUE))
library(car); library(effects) # & now load small helper functions:
source("_helpers/R2s.r"); source("_helpers/C.score.r"); source("_helpers/cohens.kappa.r")
summary(d <- read.delim(         # summarize d, the result of loading
   file="_input/genitives.csv",  # this file
   stringsAsFactors=TRUE))       # change categorical variables into factors
CASE GENITIVE SPEAKER MODALITY POR_LENGTH
Min. : 2 of:2720 nns:2666 spoken :1685 Min. : 1.00
1st Qu.:1006 s : 880 ns : 934 written:1915 1st Qu.: 8.00
Median :2018 Median : 11.00
Mean :2012 Mean : 14.58
3rd Qu.:3017 3rd Qu.: 17.00
Max. :4040 Max. :204.00
PUM_LENGTH POR_ANIMACY POR_FINAL_SIB POR_DEF
Min. : 2.00 animate : 920 absent :2721 definite :2349
1st Qu.: 6.00 collective: 607 present: 879 indefinite:1251
Median : 9.00 inanimate :1671
Mean : 10.35 locative : 243
3rd Qu.: 13.00 temporal : 159
Max. :109.00
As in the book, we will first discuss some didactically motivated monofactorial models before we go over a ‘proper’ multifactorial modeling approach.
2 Deviance & baseline(s)
Let’s already compute the baselines for what will be the response variable in all case studies, GENITIVE:
(baselines <- c(
   "baseline 1"=max(              # make baselines[1] the highest
      prop.table(                 # proportion in the
         table(d$GENITIVE))),     # frequency table of the response
   "baseline 2"=sum(              # make baselines[2] the sum of the
      prop.table(                 # proportions in the
         table(d$GENITIVE))^2)))  # frequency table of the response squared
baseline 1 baseline 2
0.7555556 0.6306173
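Spelled out by hand with the frequencies shown above (2720 of-genitives vs. 880 s-genitives out of 3600 cases), baseline 1 is just the proportion of the more frequent genitive and baseline 2 is the sum of the two squared proportions:
2720/3600                     # baseline 1: always guess the more frequent genitive (of)
(2720/3600)^2 + (880/3600)^2  # baseline 2: guess the genitives proportionally to their frequencies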
As for deviance, in the case of lm, we defined deviance as the sum of squared residuals and we said we want to reduce the deviance of our model relative to a null model by adding (hopefully good) predictors. What is deviance here in the case of a glm? To understand that, we first compute a null model:
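summary(m_00 <- glm(       # make/summarize the null model m_00:
   GENITIVE ~ 1,           # GENITIVE ~ only an overall intercept (1)
   family=binomial,        # resp = binary
   data=d,                 # those vars are in d
   na.action=na.exclude))  # skip cases with NA/missing data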
Call:
glm(formula = GENITIVE ~ 1, family = binomial, data = d, na.action = na.exclude)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.12847 0.03878 -29.1 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 4004.3 on 3599 degrees of freedom
Residual deviance: 4004.3 on 3599 degrees of freedom
AIC: 4006.3
Number of Fisher Scoring iterations: 4
deviance(m_00)
[1] 4004.273
sum(residuals(m_00)^2)
[1] 4004.273
We see that the deviance of m_00 is still the sum of this model’s squared residuals and the model’s “null deviance” (and, if a model is a null model, also what is called “residual deviance”). But what are squared residuals here? We will discuss that in more detail after we have fitted our first ‘real model’ …
3 A binary predictor
Does the choice of a genitive construction vary as a function of the modality in which the construction was produced (spoken vs. written)?
3.1 Exploration & preparation
Some exploration of the relevant variables:
# the predictor(s)/response on its/their own
table(d$GENITIVE)
of s
2720 880
table(d$MODALITY)
spoken written
1685 1915
# the predictor(s) w/ the response
(mod_x_gen <- table(d$MODALITY, d$GENITIVE))
of s
spoken 1232 453
written 1488 427
How could one approach this ‘descriptively’? We could say what the odds are for s-genitives in the spoken data …
453/1232 # odds for s-genitives in the spoken data
[1] 0.3676948
… and we could say what the odds are for s-genitives in the written data:
427/1488 # odds for s-genitives in the written data
[1] 0.2869624
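These odds are just the probability of one outcome divided by the probability of the other; for the spoken data, for instance:
(453/1685) / (1232/1685) # odds = p(s) / p(of), the same value as 453/1232 above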
But in some sense, odds are annoying because their value range is asymmetric: ‘I like s-genitives’ can go from 1 to infinity, whereas ‘I don’t like s-genitives’ is squeezed into the range between 0 and 1. Here is this shown for the spoken part:
1675/10  # I like s-genitives a lot
[1] 167.5
1000/685 # I like s-genitives somewhat
[1] 1.459854
843/842  # I like s-genitives pretty much as much as of-genitives
[1] 1.001188
685/1000 # I dislike s-genitives somewhat
[1] 0.685
10/1675  # I don't like s-genitives at all
[1] 0.005970149
Let’s make this more symmetric by logging:
log(1675/10) # I like s-genitives a lot
[1] 5.120983
log(1000/685) # I like s-genitives somewhat
[1] 0.3783364
log( 843/842) # I like s-genitives pretty much as much as of-genitives
[1] 0.001186944
log( 685/1000) # I dislike s-genitives somewhat
[1] -0.3783364
log( 10/1675) # I don't like s-genitives at all
[1] -5.120983
Much better, so let’s do that for our real data:
mod_x_gen # reminder what the real data were
of s
spoken 1232 453
written 1488 427
log(453/1232) # LOGGED odds for s-genitives in the spoken data
[1] -1.000502
log(427/1488) # LOGGED odds for s-genitives in the written data
[1] -1.248404
Keep those numbers in mind. But most people would probably express this in probabilities:
addmargins(mod_x_gen) # reminder what the real data were
of s Sum
spoken 1232 453 1685
written 1488 427 1915
Sum 2720 880 3600
453/1685 # probability of s-genitives in the spoken data
[1] 0.2688427
427/1915 # probability of s-genitives in the written data
[1] 0.2229765
We’ll revisit this in a moment …
3.2 Modeling & numerical interpretation
How would we have dealt with this in Ling 201/104? First, we would have computed a chi-squared test like this, where the chi-squared value is computed involving differences of observed and expected values:
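chisq.test(mod_x_gen, correct=FALSE) # chi-squared test w/o continuity correction
Second, we might have computed the odds ratio and the logged odds ratio for the s-genitives:
(odds.ratio <- (mod_x_gen[2,2] / mod_x_gen[2,1]) / (mod_x_gen[1,2] / mod_x_gen[1,1]))
log(odds.ratio)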
How do we do this now? With a binary logistic regression, which returns the G2-statistic, which is computed based on ratios of observed and expected values (not differences like chi-squared) and which also returns the (logged) odds ratio. Let’s fit our regression model:
summary(m_01 <- glm(       # make/summarize the gen. linear model m_01:
   GENITIVE ~ 1 +          # GENITIVE ~ an overall intercept (1)
   MODALITY,               # & the predictor MODALITY
   family=binomial,        # resp = binary
   data=d,                 # those vars are in d
   na.action=na.exclude))  # skip cases with NA/missing data
Call:
glm(formula = GENITIVE ~ 1 + MODALITY, family = binomial, data = d,
na.action = na.exclude)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.00050 0.05495 -18.208 < 2e-16 ***
MODALITYwritten -0.24790 0.07767 -3.192 0.00141 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 4004.3 on 3599 degrees of freedom
Residual deviance: 3994.1 on 3598 degrees of freedom
AIC: 3998.1
Number of Fisher Scoring iterations: 4
pchisq(               # compute the area under the chi-squared curve
   q=4004.3-3994.1,   # for this chi-squared value: null - residual deviance
   df=3599-3598,      # at this df: null - residual df
   lower.tail=FALSE)  # only using the right/upper tail/side
[1] 0.001404407
drop1(m_01,        # drop each droppable predictor at a time from m_01 &
   test="Chisq")   # test its significance w/ a LRT
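The confidence intervals for the coefficients (used for the 95%-CI column in the write-up below) can be obtained with car’s Confint:
Confint(m_01) # confidence intervals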
(Note how similar the chi-squared value and the G2-statistic and their p-values are.)
But what are the coefficients?
coef(m_01)
(Intercept) MODALITYwritten
-1.0005020 -0.2479022
Recognize the intercept?
log(453/1232) # LOGGED odds for s-genitives in the spoken data
[1] -1.000502
Recognize what the MODALITY coefficient is?
log(427/1488) -log(453/1232) # difference between LOGGED odds for s-genitives
[1] -0.2479022
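As a quick sanity check, the inverse logit function plogis maps these (summed) coefficients back onto the probabilities we computed by hand above:
plogis(-1.0005020)              # = 453/1685, the prob. of s-genitives in the spoken data
plogis(-1.0005020 + -0.2479022) # = 427/1915, the prob. of s-genitives in the written data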
Let’s make ‘predictions’, which we add to the data:
d$PRED_PP_S <- predict(   # make d$PRED_PP_S the result of predicting
   m_01,                  # from m_01
   type="response")       # predicted probabilities of "s"
# if you don't say type="response", predict gives you log odds:
unique(predict(m_01), 6)
[1] -1.000502 -1.248404
d$PRED_CAT <- factor(          # make d$PRED_CAT the result of checking
   ifelse(d$PRED_PP_S>=0.5,    # if the predicted prob. of "s" is >=0.5
      levels(d$GENITIVE)[2],   # if yes, predict "s"
      levels(d$GENITIVE)[1]),  # otherwise, predict "of"
   levels=levels(d$GENITIVE))  # to ensure 2-by-2 confusion matrices!
Actually, the much shorter though less intuitive version of generating d$PRED_CAT is this:
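d$PRED_CAT <- factor(levels(d$GENITIVE)[1+(d$PRED_PP_S>=0.5)], levels=levels(d$GENITIVE))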
On top of that, let’s also add something else, namely a column called PRED_PP_OBS that contains, for each case, the predicted probability of the level that was observed in the data, i.e. what the model should have predicted. How does that work? Consider the head of d:
head(d)
CASE GENITIVE SPEAKER MODALITY POR_LENGTH PUM_LENGTH POR_ANIMACY
1 2 of nns spoken 13 7 collective
2 3 of nns spoken 22 7 animate
3 4 of nns spoken 11 8 animate
4 5 of nns spoken 26 4 collective
5 6 s nns spoken 8 4 animate
6 7 s nns spoken 7 3 animate
POR_FINAL_SIB POR_DEF PRED_PP_S PRED_CAT
1 absent definite 0.2688427 of
2 absent definite 0.2688427 of
3 present definite 0.2688427 of
4 absent definite 0.2688427 of
5 absent definite 0.2688427 of
6 absent definite 0.2688427 of
In the first case, the model predicted s with a probability of 0.2688427 and, thus, of with a probability of 1-0.2688427=0.7311573. The actual observation, what the speaker really said, is of, so PRED_PP_OBS[1] should be 0.7311573. In the fifth case, the model again predicted s with a probability of 0.2688427 and, thus, of with a probability of 1-0.2688427=0.7311573. The actual observation, what the speaker really said, is s, so PRED_PP_OBS[5] should be 0.2688427. How do we do that efficiently for all cases (and in the spirit of Section 3.5.2, efficiently means ‘without a loop’)?
d$PRED_PP_OBS <- ifelse(  # d$PRED_PP_OBS is determined by ifelse
   d$GENITIVE=="s",       # if the obs genitive is the 2nd level of the response
   d$PRED_PP_S,           # take its predicted probability
   1-d$PRED_PP_S)         # otherwise take 1 minus its predicted probability
head(d)
CASE GENITIVE SPEAKER MODALITY POR_LENGTH PUM_LENGTH POR_ANIMACY
1 2 of nns spoken 13 7 collective
2 3 of nns spoken 22 7 animate
3 4 of nns spoken 11 8 animate
4 5 of nns spoken 26 4 collective
5 6 s nns spoken 8 4 animate
6 7 s nns spoken 7 3 animate
POR_FINAL_SIB POR_DEF PRED_PP_S PRED_CAT PRED_PP_OBS
1 absent definite 0.2688427 of 0.7311573
2 absent definite 0.2688427 of 0.7311573
3 present definite 0.2688427 of 0.7311573
4 absent definite 0.2688427 of 0.7311573
5 absent definite 0.2688427 of 0.2688427
6 absent definite 0.2688427 of 0.2688427
Here, too, a shorter but much less intuitive version is available; if you have time to waste, try and find out how/why it works, but be warned: for some weird reason it’s a bit of a brain teaser.
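d$PRED_PP_OBS <- abs(d$PRED_PP_S - as.numeric(d$GENITIVE!="s"))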
We’ll get to why this is useful in a moment … But let’s now evaluate the predictive capacity of the model (I’ll explain McFadden’s R2 below):
(c_m <- table(          # confusion matrix: cross-tabulate
   "OBS"  =d$GENITIVE,  # observed genitives in the rows
   "PREDS"=d$PRED_CAT)) # predicted genitives in the columns
PREDS
OBS of s
of 2720 0
s 880 0
# evaluating the confusion matrix here in the usual way doesn't make much sense,
# which is why we discuss that in the next section
c(R2s(m_01), "C"=C.score(d$GENITIVE, d$PRED_PP_S))
McFadden R-squared Nagelkerke R-squared C
0.002545694 0.004212717 0.530915775
3.3 Visual interpretation
Let’s visualize the nature of the effect of MODALITY based on the predictions:
(mo_d <- data.frame(  # make mo_d a data frame of
   mo <- effect(      # the effect
   "MODALITY",        # of MODALITY
   m_01)))            # in m_01
MODALITY fit se lower upper
1 spoken 0.2688427 0.010800758 0.2482073 0.2905308
2 written 0.2229765 0.009511784 0.2048903 0.2421729
plot(mo,             # plot the effect of MODALITY
   type="response",  # but plot predicted probabilities
   ylim=c(0, 1),     # ylim: the whole range of possible probabilities
   grid=TRUE)        # & w/ a grid
3.4 Excursus: deviance derived from logloss
We saw above that the deviance of a glm is again the sum of squared residuals, which means we again want to reduce the deviance of our model relative to a null model by adding (hopefully good) predictors. But what is deviance here in the case of a glm exactly, i.e. how is it computed? To understand, let’s look at the predictions of m_01 again:
head(d[,c(2,4,10:12)])
GENITIVE MODALITY PRED_PP_S PRED_CAT PRED_PP_OBS
1 of spoken 0.2688427 of 0.7311573
2 of spoken 0.2688427 of 0.7311573
3 of spoken 0.2688427 of 0.7311573
4 of spoken 0.2688427 of 0.7311573
5 s spoken 0.2688427 of 0.2688427
6 s spoken 0.2688427 of 0.2688427
What would the best model ever have returned for these 6 cases? Something like this, where the model
- predicts all instances of s correctly by returning a predicted probability of s of essentially 1;
- predicts all instances of of correctly by returning a predicted probability of s of essentially 0:
GENITIVE PRED_PP_S PRED_CAT PRED_PP_OBS
1 of 0.00001 of 0.99999
2 of 0.00001 of 0.99999
3 of 0.00001 of 0.99999
4 of 0.00001 of 0.99999
5 s 0.99999 s 0.99999
6 s 0.99999 s 0.99999
That means we could use 1-PRED_PP_OBS to quantify how far off from a perfect prediction the model is, but there is an alternative: we could take the negative logs of PRED_PP_OBS because then
- if the predicted probability of what was actually observed, which should be close to 1, is in fact close to 1 – i.e. the model is good – then we end up -logging something that’s very close to 1, which makes the negative logs really close to 0, meaning ‘we’re not far off’:
-log(0.8)
[1] 0.2231436
-log(0.9)
[1] 0.1053605
-log(0.95)
[1] 0.05129329
- if the predicted probability of what was actually observed, which should be close to 1, is in fact close to 0 – i.e. the model is bad – then we end up -logging something that’s very close to 0, which makes the negative logs grow and grow, meaning ‘we’re quite far off’:
-log(0.2)
[1] 1.609438
-log(0.1)
[1] 2.302585
-log(0.05)
[1] 2.995732
I am calling these values contributions to logloss (because the mean of all those values is an evaluation metric called logloss). But what does this have to do with deviance? This:
contributions2logloss <- -log(d$PRED_PP_OBS)
# the sum of all contributions to logloss times 2
sum(contributions2logloss) * 2
[1] 3994.079
deviance(m_01) # is this!
[1] 3994.079
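And the logloss itself is simply the mean of these contributions (so the deviance is 2*n times the logloss):
mean(contributions2logloss) # the logloss of m_01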
That means, the deviance of such a glm, what the summary calls “residual deviance”, expresses how far all the predicted probabilities of the events that should have been predicted are from their theoretical ideals of (very near to) 0 and (very near to) 1! That actually means we can also again express how much the deviance of the null model, the null deviance, is reduced by the predictors in percent:
(m_00$deviance - m_01$deviance) / m_00$deviance # same as
# (m_01$null.deviance - m_01$deviance) / m_01$null.deviance
[1] 0.002545694
This is McFadden’s R2-value – not the one that is mostly used, which is Nagelkerke’s R2, but still very useful because of its parallelism to the R2 of linear models.
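For completeness, here is a sketch of how Nagelkerke’s R2 can be derived from the same two deviances, assuming the standard Cox & Snell-based definition (it should match the value returned by the R2s() helper above):
n <- nrow(d)                                             # number of cases
R2.cs <- 1 - exp((deviance(m_01) - m_01$null.deviance)/n) # Cox & Snell R-squared
R2.cs / (1 - exp(-m_01$null.deviance/n))                  # Nagelkerke's rescaled R-squared (~0.004)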
3.5 Write-up
To determine whether the choice of a genitive varies as a function of the modality (spoken vs. written), a generalized linear model was fit with GENITIVE as the response variable and MODALITY as a predictor. The model was highly significant (LR=10.19, df=1, p<0.002) but explains very little of the response variable (Nagelkerke R2=0.004, McFadden’s R2=0.003, C=0.53) and has no predictive power: the model only ever predicts of-genitives. The coefficients indicate that s-genitives are more likely in spoken data than in written data, as shown in the summary table here. [Show a plot]
|                              | Estimate | 95%-CI         | se   | z      | p (2-tailed) |
|:-----------------------------|:---------|:---------------|:-----|:-------|:-------------|
| Intercept (MODALITY=spoken)  | -1       | [-1.11, -0.89] | 0.05 | -18.21 | <0.001       |
| MODALITY spoken→written      | -0.25    | [-0.4, -0.1]   | 0.08 | -3.19  | <0.002       |
4 A categorical predictor
Does the choice of a genitive construction vary as a function of the degree/kind of animacy of the possessor (animate vs. collective vs. inanimate vs. locative vs. temporal)?
4.1 Exploration & preparation
Some exploration of the relevant variables:
# the predictor(s)/response on its/their own
table(d$GENITIVE)
# the predictor(s) w/ the response
table(d$POR_ANIMACY, d$GENITIVE)
of s
animate 370 550
collective 408 199
inanimate 1638 33
locative 199 44
temporal 105 54
4.2 Modeling & numerical interpretation
Let’s fit our regression model:
summary(m_01 <- glm(       # make/summarize the gen. linear model m_01:
   GENITIVE ~ 1 +          # GENITIVE ~ an overall intercept (1)
   POR_ANIMACY,            # & the predictor POR_ANIMACY
   family=binomial,        # resp = binary
   data=d,                 # those vars are in d
   na.action=na.exclude))  # skip cases with NA/missing data
Call:
glm(formula = GENITIVE ~ 1 + POR_ANIMACY, family = binomial,
data = d, na.action = na.exclude)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.39642 0.06724 5.896 3.73e-09 ***
POR_ANIMACYcollective -1.11438 0.10953 -10.174 < 2e-16 ***
POR_ANIMACYinanimate -4.30114 0.18823 -22.851 < 2e-16 ***
POR_ANIMACYlocative -1.90553 0.17965 -10.607 < 2e-16 ***
POR_ANIMACYtemporal -1.06139 0.18045 -5.882 4.06e-09 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 4004.3 on 3599 degrees of freedom
Residual deviance: 2766.0 on 3595 degrees of freedom
AIC: 2776
Number of Fisher Scoring iterations: 6
pchisq(               # compute the area under the chi-squared curve
   q=4004.3-2766,     # for this chi-squared value: null - residual deviance
   df=3599-3595,      # at this df: null - residual df
   lower.tail=FALSE)  # only using the right/upper tail/side
[1] 7.926259e-267
drop1(m_01,        # drop each droppable predictor at a time from m_01 &
   test="Chisq")   # test its significance w/ a LRT
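As before, the confidence intervals for the coefficients (used in the write-up below) can be obtained with car’s Confint:
Confint(m_01) # confidence intervals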
Let’s make (numerical and categorical) ‘predictions’ and evaluate them:
d$PRED_PP_S <- predict(       # make d$PRED_PP_S the result of predicting
   m_01,                      # from m_01
   type="response")           # predicted probabilities of "s"
# if you don't say type="response", predict gives you log odds
d$PRED_CAT <- factor(         # make d$PRED_CAT the result of checking
   ifelse(d$PRED_PP_S>=0.5,   # if the predicted prob. of "s" is >=0.5
      levels(d$GENITIVE)[2],  # if yes, predict "s"
      levels(d$GENITIVE)[1])) # otherwise, predict "of"
(c_m <- table(                # confusion matrix: cross-tabulate
   "OBS"  =d$GENITIVE,        # observed genitives in the rows
   "PREDS"=d$PRED_CAT))       # predicted genitives in the columns
PREDS
OBS of s
of 2350 370
s 330 550
c(                            # evaluate the confusion matrix
   "Prec. for s"     =c_m[ "s", "s"] / sum(c_m[    , "s"]),
   "Acc./rec. for s" =c_m[ "s", "s"] / sum(c_m[ "s",    ]),
   "Prec. for of"    =c_m["of","of"] / sum(c_m[    ,"of"]),
   "Acc./rec. for of"=c_m["of","of"] / sum(c_m["of",    ]),
   "Acc. (overall)"  =mean(d$GENITIVE==d$PRED_CAT))
Prec. for s Acc./rec. for s Prec. for of Acc./rec. for of
0.5978261 0.6250000 0.8768657 0.8639706
Acc. (overall)
0.8055556
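The R-squared values, Cohen’s κ, and the C-score reported in the write-up below come from the helper functions loaded at the beginning:
c(R2s(m_01),
   "Cohen's kappa"=cohens.kappa(c_m)[[1]],
   "C"=C.score(d$GENITIVE, d$PRED_PP_S))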
Again, we also add our column PRED_PP_OBS with, for each case, the predicted probability of the level that was observed in the data:
d$PRED_PP_OBS <- ifelse(  # d$PRED_PP_OBS is determined by ifelse
   d$GENITIVE=="s",       # if the obs genitive is the 2nd level of the response
   d$PRED_PP_S,           # take its predicted probability
   1-d$PRED_PP_S)         # otherwise take 1 minus its predicted probability
head(d)
CASE GENITIVE SPEAKER MODALITY POR_LENGTH PUM_LENGTH POR_ANIMACY
1 2 of nns spoken 13 7 collective
2 3 of nns spoken 22 7 animate
3 4 of nns spoken 11 8 animate
4 5 of nns spoken 26 4 collective
5 6 s nns spoken 8 4 animate
6 7 s nns spoken 7 3 animate
POR_FINAL_SIB POR_DEF PRED_PP_S PRED_CAT PRED_PP_OBS
1 absent definite 0.3278418 of 0.6721582
2 absent definite 0.5978261 s 0.4021739
3 present definite 0.5978261 s 0.4021739
4 absent definite 0.3278418 of 0.6721582
5 absent definite 0.5978261 s 0.5978261
6 absent definite 0.5978261 s 0.5978261
4.3 Visual interpretation
Let’s visualize the nature of the effect of POR_ANIMACY based on the predictions:
(an_d <- data.frame(  # make an_d a data frame of
   an <- effect(      # the effect
   "POR_ANIMACY",     # of POR_ANIMACY
   m_01)))            # in m_01
plot(an,             # plot the effect of POR_ANIMACY
   type="response",  # but plot predicted probabilities
   ylim=c(0, 1),     # ylim: the whole range of possible probabilities
   grid=TRUE)        # & w/ a grid
4.4 Excursus: deviance derived from logloss
The way the contributions to logloss and the deviance are computed does not change just because we have (a) different predictor(s) now: the deviance is still the sum of all contributions to logloss times 2:
deviance(m_01)
[1] 2765.995
sum(-log(d$PRED_PP_OBS)) *2
[1] 2765.995
m_01$deviance
[1] 2765.995
4.5 Excursus: model comparison
Given a lot of previous literature on the genitive alternation, you might actually have hypothesized that the only thing one needs is a distinction between animate (and maybe collective) and all inanimate possessors. Let’s use model comparison to check this. First, we create a version of POR_ANIMACY that conflates levels as desired:
d$POR_ANIMACY_confl <- d$POR_ANIMACY # create a copy of POR_ANIMACY
levels(d$POR_ANIMACY_confl) <-       # overwrite the copy's levels such that
   c("anim/coll", "anim/coll",       # "animate" and "collective" become "anim/coll"
     "inanim/loc/temp", "inanim/loc/temp", "inanim/loc/temp") # all others become "inanim/loc/temp"
table(POR_ANIMACY=d$POR_ANIMACY, POR_ANIMACY_confl=d$POR_ANIMACY_confl) # check you did it right
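head(d <- d[,c(1:7,13,8:12)]) # reorder the columns so POR_ANIMACY_confl sits next to POR_ANIMACY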
CASE GENITIVE SPEAKER MODALITY POR_LENGTH PUM_LENGTH POR_ANIMACY
1 2 of nns spoken 13 7 collective
2 3 of nns spoken 22 7 animate
3 4 of nns spoken 11 8 animate
4 5 of nns spoken 26 4 collective
5 6 s nns spoken 8 4 animate
6 7 s nns spoken 7 3 animate
POR_ANIMACY_confl POR_FINAL_SIB POR_DEF PRED_PP_S PRED_CAT PRED_PP_OBS
1 anim/coll absent definite 0.3278418 of 0.6721582
2 anim/coll absent definite 0.5978261 s 0.4021739
3 anim/coll present definite 0.5978261 s 0.4021739
4 anim/coll absent definite 0.3278418 of 0.6721582
5 anim/coll absent definite 0.5978261 s 0.5978261
6 anim/coll absent definite 0.5978261 s 0.5978261
We already have the model with POR_ANIMACY as the only predictor so we only need to fit a new model with POR_ANIMACY_confl as the only predictor:
summary(m_01_animconfl <- glm(        # make/summarize the gen. linear model m_01_animconfl:
   GENITIVE ~ 1 + POR_ANIMACY_confl,  # GENITIVE ~ an overall intercept (1) & POR_ANIMACY_confl
   family=binomial, data=d, na.action=na.exclude)) # the other arguments
Call:
glm(formula = GENITIVE ~ 1 + POR_ANIMACY_confl, family = binomial,
data = d, na.action = na.exclude)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.03799 0.05119 -0.742 0.458
POR_ANIMACY_conflinanim/loc/temp -2.65829 0.10377 -25.616 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 4004.3 on 3599 degrees of freedom
Residual deviance: 3093.4 on 3598 degrees of freedom
AIC: 3097.4
Number of Fisher Scoring iterations: 5
… and then we do a comparison with anova just like above, the only difference being the argument to test (because we’re now fitting a generalized linear model).
anova(m_01,          # compare the model w/ the 5-level predictor POR_ANIMACY
   m_01_animconfl,   # & the model w/ the 2-level predictor POR_ANIMACY_confl
   test="Chisq")     # using a LRT
The difference between the two models is hugely significant: we cannot reduce POR_ANIMACY to just the two levels we were considering.
4.6 Write-up
To determine whether the choice of a genitive varies as a function of the degree/kind of animacy of the possessor (animate vs. collective vs. inanimate vs. locative vs. temporal), a generalized linear model was fit with GENITIVE as the response variable and POR_ANIMACY as a predictor. The model was highly significant (LR=1238.3, df=4, p<0.001) and explains a decent amount of the response variable (Nagelkerke R2=0.434, McFadden’s R2=0.309, Cohen’s κ=0.482, C=0.85) with a good amount of predictive power: [list precision & accuracy/recall scores for both genitives]. The summary table of the model is shown here.
|                                  | Estimate | 95%-CI         | se   | z      | p (2-tailed) |
|:---------------------------------|:---------|:---------------|:-----|:-------|:-------------|
| Intercept (POR_ANIMACY=animate)  | 0.40     | [0.27, 0.53]   | 0.07 | 5.9    | <0.001       |
| POR_ANIMACY animate→collective   | -1.11    | [-1.33, -0.9]  | 0.11 | -10.17 | <0.002       |
| POR_ANIMACY animate→inanimate    | -4.30    | [-4.69, -3.95] | 0.19 | -22.85 | <0.002       |
| POR_ANIMACY animate→locative     | -1.91    | [-2.27, -1.56] | 0.18 | -10.61 | <0.002       |
| POR_ANIMACY animate→temporal     | -1.06    | [-1.42, -0.71] | 0.18 | -5.88  | <0.002       |
The model predicts s-genitives for animate possessors but of-genitives for collective/temporal possessors, then locative possessors, and then nearly exclusively for inanimate possessors. [Show a plot] A post-hoc comparison of the above model with one in which POR_ANIMACY was reduced to 2 levels (animate/collective vs. inanimate/locative/temporal) showed that this reduced predictor is highly significantly worse (LR=327.39, p<0.0001).
5 A numeric predictor
Does the choice of a genitive construction vary as a function of the length of the possessor (in characters)?
5.1 Exploration & preparation
# the predictor(s)/response on its/their own
table(d$GENITIVE)
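# histograms of the predictor (they motivate the log transformation below)
hist(d$POR_LENGTH); hist(d$POR_LENGTH, breaks="FD")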
# the predictor(s) w/ the response
spineplot(d$GENITIVE ~ d$POR_LENGTH)
The predictor POR_LENGTH looks like it’s going to cause problems if we use it like this, given its very skewed distribution, but we’ll try for now and fit our regression model.
5.2 Modeling & numerical interpretation
Here’s our initial model:
summary(m_01 <- glm(       # make/summarize the gen. linear model m_01:
   GENITIVE ~ 1 +          # GENITIVE ~ an overall intercept (1)
   POR_LENGTH,             # & the predictor POR_LENGTH
   family=binomial,        # resp = binary
   data=d,                 # those vars are in d
   na.action=na.exclude))  # skip cases with NA/missing data
Call:
glm(formula = GENITIVE ~ 1 + POR_LENGTH, family = binomial, data = d,
na.action = na.exclude)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.337551 0.090941 3.712 0.000206 ***
POR_LENGTH -0.126693 0.008189 -15.471 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 4004.3 on 3599 degrees of freedom
Residual deviance: 3598.6 on 3598 degrees of freedom
AIC: 3602.6
Number of Fisher Scoring iterations: 6
pchisq(               # compute the area under the chi-squared curve
   q=4004.3-3598.6,   # for this chi-squared value: null - residual deviance
   df=3599-3598,      # at this df: null - residual df
   lower.tail=FALSE)  # only using the right/upper tail/side
[1] 3.163282e-90
drop1(m_01,        # drop each droppable predictor at a time from m_01 &
   test="Chisq")   # test its significance w/ a LRT
Let’s make ‘predictions’ and evaluate them:
d$PRED_PP_S <- predict(       # make d$PRED_PP_S the result of predicting
   m_01,                      # from m_01
   type="response")           # predicted probabilities of "s"
# if you don't say type="response", predict gives you log odds
d$PRED_CAT <- factor(         # make d$PRED_CAT the result of checking
   ifelse(d$PRED_PP_S>=0.5,   # if the predicted prob. of "s" is >=0.5
      levels(d$GENITIVE)[2],  # if yes, predict "s"
      levels(d$GENITIVE)[1])) # otherwise, predict "of"
(c_m <- table(                # confusion matrix: cross-tabulate
   "OBS"  =d$GENITIVE,        # observed genitives in the rows
   "PREDS"=d$PRED_CAT))       # predicted genitives in the columns
PREDS
OBS of s
of 2712 8
s 868 12
c(                            # evaluate the confusion matrix
   "Prec. for s"     =c_m[ "s", "s"] / sum(c_m[    , "s"]),
   "Acc./rec. for s" =c_m[ "s", "s"] / sum(c_m[ "s",    ]),
   "Prec. for of"    =c_m["of","of"] / sum(c_m[    ,"of"]),
   "Acc./rec. for of"=c_m["of","of"] / sum(c_m["of",    ]),
   "Acc. (overall)"  =mean(d$GENITIVE==d$PRED_CAT))
Prec. for s Acc./rec. for s Prec. for of Acc./rec. for of
0.60000000 0.01363636 0.75754190 0.99705882
Acc. (overall)
0.75666667
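5.3 Visual interpretation
Let’s visualize the nature of the effect of POR_LENGTH based on the predictions:
(le_d <- data.frame(  # make le_d a data frame of
   le <- effect(      # the effect
   "POR_LENGTH",      # of POR_LENGTH
   m_01)))            # in m_01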
plot(le,             # plot the effect of POR_LENGTH
   type="response",  # but plot predicted probabilities
   ylim=c(0, 1),     # ylim: the whole range of possible probabilities
   grid=TRUE)        # & w/ a grid
5.4 What if we log-transform POR_LENGTH?
Ok, those results were pretty bad – let’s do this again, because maybe it’s really smarter to transform the possessor lengths; given the shape of this variable’s histogram and because no length is 0, we try the log transformation (to the base of 2):
# the predictor(s)/response on its/their own
table(d$GENITIVE)
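d$POR_LENGTH_LOG <- log2(d$POR_LENGTH) # log (to the base of 2) the possessor lengths
hist(d$POR_LENGTH_LOG); hist(d$POR_LENGTH_LOG, breaks="FD")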
# the predictor(s) w/ the response
spineplot(d$GENITIVE ~ d$POR_LENGTH_LOG)
Let’s fit our 2nd regression model:
summary(m_02 <- glm(       # make/summarize the gen. linear model m_02:
   GENITIVE ~ 1 +          # GENITIVE ~ an overall intercept (1)
   POR_LENGTH_LOG,         # & the predictor POR_LENGTH_LOG
   family=binomial,        # resp = binary
   data=d,                 # those vars are in d
   na.action=na.exclude))  # skip cases with NA/missing data
Call:
glm(formula = GENITIVE ~ 1 + POR_LENGTH_LOG, family = binomial,
data = d, na.action = na.exclude)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.90960 0.17609 10.85 <2e-16 ***
POR_LENGTH_LOG -0.90493 0.05325 -16.99 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 4004.3 on 3599 degrees of freedom
Residual deviance: 3643.1 on 3598 degrees of freedom
AIC: 3647.1
Number of Fisher Scoring iterations: 4
pchisq(               # compute the area under the chi-squared curve
   q=4004.3-3643.1,   # for this chi-squared value: null - residual deviance
   df=3599-3598,      # at this df: null - residual df
   lower.tail=FALSE)  # only using the right/upper tail/side
[1] 1.542725e-80
drop1(m_02,        # drop each droppable predictor at a time from m_02 &
   test="Chisq")   # test its significance w/ a LRT
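And again the confidence intervals for the coefficients (used in the write-up below) can be obtained with car’s Confint:
Confint(m_02) # confidence intervals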
Let’s make ‘predictions’ and evaluate them:
d$PRED_PP_S <- predict(       # make d$PRED_PP_S the result of predicting
   m_02,                      # from m_02
   type="response")           # predicted probabilities of "s"
# if you don't say type="response", predict gives you log odds
d$PRED_CAT <- factor(         # make d$PRED_CAT the result of checking
   ifelse(d$PRED_PP_S>=0.5,   # if the predicted prob. of "s" is >=0.5
      levels(d$GENITIVE)[2],  # if yes, predict "s"
      levels(d$GENITIVE)[1])) # otherwise, predict "of"
(c_m <- table(                # confusion matrix: cross-tabulate
   "OBS"  =d$GENITIVE,        # observed genitives in the rows
   "PREDS"=d$PRED_CAT))       # predicted genitives in the columns
PREDS
OBS of s
of 2614 106
s 809 71
c(                            # evaluate the confusion matrix
   "Prec. for s"     =c_m[ "s", "s"] / sum(c_m[    , "s"]),
   "Acc./rec. for s" =c_m[ "s", "s"] / sum(c_m[ "s",    ]),
   "Prec. for of"    =c_m["of","of"] / sum(c_m[    ,"of"]),
   "Acc./rec. for of"=c_m["of","of"] / sum(c_m["of",    ]),
   "Acc. (overall)"  =mean(d$GENITIVE==d$PRED_CAT))
Prec. for s Acc./rec. for s Prec. for of Acc./rec. for of
0.40112994 0.08068182 0.76365761 0.96102941
Acc. (overall)
0.74583333
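The R-squared values, Cohen’s κ, and the C-score reported in the write-up below:
c(R2s(m_02),
   "Cohen's kappa"=cohens.kappa(c_m)[[1]],
   "C"=C.score(d$GENITIVE, d$PRED_PP_S))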
Again, we also add our column PRED_PP_OBS with, for each case, the predicted probability of the level that was observed in the data:
d$PRED_PP_OBS <- ifelse(  # d$PRED_PP_OBS is determined by ifelse
   d$GENITIVE=="s",       # if the obs genitive is the 2nd level of the response
   d$PRED_PP_S,           # take its predicted probability
   1-d$PRED_PP_S)         # otherwise take 1 minus its predicted probability
head(d)
CASE GENITIVE SPEAKER MODALITY POR_LENGTH PUM_LENGTH POR_ANIMACY
1 2 of nns spoken 13 7 collective
2 3 of nns spoken 22 7 animate
3 4 of nns spoken 11 8 animate
4 5 of nns spoken 26 4 collective
5 6 s nns spoken 8 4 animate
6 7 s nns spoken 7 3 animate
POR_ANIMACY_confl POR_FINAL_SIB POR_DEF PRED_PP_S PRED_CAT PRED_PP_OBS
1 anim/coll absent definite 0.19169521 of 0.8083048
2 anim/coll absent definite 0.10660805 of 0.8933919
3 anim/coll present definite 0.22777155 of 0.7722285
4 anim/coll absent definite 0.08754694 of 0.9124531
5 anim/coll absent definite 0.30891765 of 0.3089176
6 anim/coll absent definite 0.34731689 of 0.3473169
POR_LENGTH_LOG
1 3.700440
2 4.459432
3 3.459432
4 4.700440
5 3.000000
6 2.807355
Let’s visualize the nature of the effect of POR_LENGTH_LOG based on the predictions:
(ll_d <- data.frame(  # make ll_d a data frame of
   ll <- effect(      # the effect
   "POR_LENGTH_LOG",  # of POR_LENGTH_LOG
   m_02)))            # in m_02
plot(ll,             # plot the effect of POR_LENGTH_LOG
   type="response",  # but plot predicted probabilities
   ylim=c(0, 1),     # ylim: the whole range of possible probabilities
   grid=TRUE)        # & w/ a grid
5.5 Excursus: deviance derived from logloss
The way the contributions to logloss and the deviance are computed does not change just because we have (a) different predictor(s) now: the deviance is still the sum of all contributions to logloss times 2:
deviance(m_02)
[1] 3643.097
sum(-log(d$PRED_PP_OBS)) *2
[1] 3643.097
m_02$deviance
[1] 3643.097
5.6 Write-up
To determine whether the choice of a genitive varies as a function of the length of the possessor, a generalized linear model was fit with GENITIVE as the response variable and POR_LENGTH as a predictor; however, the initial model resulted in an extremely poor fit, probably in part because of the very long right tail of the predictor, which was therefore logged to the base of 2. A new model with this predictor was highly significant (LR=361.18, df=1, p<0.001) but explains only a small amount of the response variable (Nagelkerke R2=0.142, McFadden’s R2=0.101, Cohen’s κ=0.057, C=0.7088) and has fairly poor predictive power: [list precision & accuracy/recall scores for both genitives]. The summary table of the model is shown here.
|                               | Estimate | 95%-CI        | se   | z      | p (2-tailed) |
|:------------------------------|:---------|:--------------|:-----|:-------|:-------------|
| Intercept (POR_LENGTH_LOG=0)  | 1.91     | [1.56, 2.25]  | 0.18 | 10.85  | <0.001       |
| POR_LENGTH_LOG 0→1            | -0.9     | [-1.01, -0.8] | 0.05 | -16.99 | <0.001       |
The model indicates a negative correlation between s-genitives and possessor length: the longer the possessor, the less likely s-genitives become and the more likely of-genitives become. [Show a plot]
6 Homework
To prepare for next session, read SFLWR3, Sections 5.3.4-5.3.5 on multifactorial binary logistic regression modeling.