1 Intro
Let’s load a data set for a few binary logistic regression modeling examples; the data are in _input/genitives.csv and you can find information about the variables/columns in _input/genitives.r.
rm(list=ls(all.names=TRUE))
library(car); library(effects) # & now load small helper functions:
source("_helpers/R2s.r"); source("_helpers/C.score.r"); source("_helpers/cohens.kappa.r")
summary(d <- read.delim(         # summarize d, the result of loading
   file="_input/genitives.csv",  # this file
   stringsAsFactors=TRUE))       # change categorical variables into factors
CASE GENITIVE SPEAKER MODALITY POR_LENGTH
Min. : 2 of:2720 nns:2666 spoken :1685 Min. : 1.00
1st Qu.:1006 s : 880 ns : 934 written:1915 1st Qu.: 8.00
Median :2018 Median : 11.00
Mean :2012 Mean : 14.58
3rd Qu.:3017 3rd Qu.: 17.00
Max. :4040 Max. :204.00
PUM_LENGTH POR_ANIMACY POR_FINAL_SIB POR_DEF
Min. : 2.00 animate : 920 absent :2721 definite :2349
1st Qu.: 6.00 collective: 607 present: 879 indefinite:1251
Median : 9.00 inanimate :1671
Mean : 10.35 locative : 243
3rd Qu.: 13.00 temporal : 159
Max. :109.00
As in the book, we will first discuss some didactically motivated monofactorial models before we go over a ‘proper’ multifactorial modeling approach.
2 Deviance & baseline(s)
Let’s already compute the baselines for what will be the response variable in all case studies, GENITIVE:
(baselines <- c(
   "baseline 1"=max(              # make baselines[1] the highest
      prop.table(                 # proportion in the
         table(d$GENITIVE))),     # frequency table of the response
   "baseline 2"=sum(              # make baselines[2] the sum of the
      prop.table(                 # proportions in the
         table(d$GENITIVE))^2)))  # frequency table of the response squared
baseline 1 baseline 2
0.7555556 0.6306173
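Spelled out by hand with the frequencies shown above (2720 of-genitives vs. 880 s-genitives out of 3600 cases), baseline 1 is just the proportion of the more frequent genitive and baseline 2 is the sum of the two squared proportions:
2720/3600                     # baseline 1: always guess the more frequent genitive (of)
(2720/3600)^2 + (880/3600)^2  # baseline 2: guess the genitives proportionally to their frequencies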
As for deviance, in the case of lm, we defined deviance as the sum of squared residuals and we said we want to reduce the deviance of our model relative to a null model by adding (hopefully good) predictors. What is deviance here in the case of a glm? To understand that, we first compute a null model:
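summary(m_00 <- glm(       # make/summarize the null model m_00:
   GENITIVE ~ 1,           # GENITIVE ~ only an overall intercept (1)
   family=binomial,        # resp = binary
   data=d,                 # those vars are in d
   na.action=na.exclude))  # skip cases with NA/missing data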
Call:
glm(formula = GENITIVE ~ 1, family = binomial, data = d, na.action = na.exclude)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.12847 0.03878 -29.1 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 4004.3 on 3599 degrees of freedom
Residual deviance: 4004.3 on 3599 degrees of freedom
AIC: 4006.3
Number of Fisher Scoring iterations: 4
deviance(m_00)
[1] 4004.273
sum(residuals(m_00)^2)
[1] 4004.273
We see that the deviance of m_00 is still the sum of this model’s squared residuals and the model’s “null deviance” (and, if a model is a null model, also what is called “residual deviance”). But what are squared residuals here? We will discuss that in more detail after we have fitted our first ‘real model’ …
3 A binary predictor
Does the choice of a genitive construction vary as a function of the modality in which the construction was produced (spoken vs. written)?
3.1 Exploration & preparation
Some exploration of the relevant variables:
# the predictor(s)/response on its/their own
table(d$GENITIVE)
of s
2720 880
table(d$MODALITY)
spoken written
1685 1915
# the predictor(s) w/ the response
(mod_x_gen <- table(d$MODALITY, d$GENITIVE))
of s
spoken 1232 453
written 1488 427
How could one approach this ‘descriptively’? We could say what the odds are for s-genitives in the spoken data …
453/1232 # odds for s-genitives in the spoken data
[1] 0.3676948
… and we could say what the odds are for s-genitives in the written data:
427/1488 # odds for s-genitives in the written data
[1] 0.2869624
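These odds are just the probability of one outcome divided by the probability of the other; for the spoken data, for instance:
(453/1685) / (1232/1685) # odds = p(s) / p(of), the same value as 453/1232 above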
But in some sense, odds are annoying because their value range is asymmetric: ‘I like s-genitives’ can go from 1 to infinity, whereas ‘I don’t like s-genitives’ is squeezed into the range between 0 and 1. Here is this shown for the spoken part:
1675/10  # I like s-genitives a lot
[1] 167.5
1000/685 # I like s-genitives somewhat
[1] 1.459854
843/842  # I like s-genitives pretty much as much as of-genitives
[1] 1.001188
685/1000 # I dislike s-genitives somewhat
[1] 0.685
10/1675  # I don't like s-genitives at all
[1] 0.005970149
Let’s make this more symmetric by logging:
log(1675/10) # I like s-genitives a lot
[1] 5.120983
log(1000/685) # I like s-genitives somewhat
[1] 0.3783364
log( 843/842) # I like s-genitives pretty much as much as of-genitives
[1] 0.001186944
log( 685/1000) # I dislike s-genitives somewhat
[1] -0.3783364
log( 10/1675) # I don't like s-genitives at all
[1] -5.120983
Much better, so let’s do that for our real data:
mod_x_gen # reminder what the real data were
of s
spoken 1232 453
written 1488 427
log(453/1232) # LOGGED odds for s-genitives in the spoken data
[1] -1.000502
log(427/1488) # LOGGED odds for s-genitives in the written data
[1] -1.248404
Keep those numbers in mind. But most people would probably express this in probabilities:
addmargins(mod_x_gen) # reminder what the real data were
of s Sum
spoken 1232 453 1685
written 1488 427 1915
Sum 2720 880 3600
453/1685 # probability of s-genitives in the spoken data
[1] 0.2688427
427/1915 # probability of s-genitives in the written data
[1] 0.2229765
We’ll revisit this in a moment …
3.2 Modeling & numerical interpretation
How would we have dealt with this in Ling 201/104? First, we would have computed a chi-squared test like this, where the chi-squared value is computed involving differences of observed and expected values:
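chisq.test(mod_x_gen, correct=FALSE) # chi-squared test w/o continuity correction
Second, we might have computed the odds ratio and the logged odds ratio for the s-genitives:
(odds.ratio <- (mod_x_gen[2,2] / mod_x_gen[2,1]) / (mod_x_gen[1,2] / mod_x_gen[1,1]))
log(odds.ratio)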
How do we do this now? With a binary logistic regression, which returns the G2-statistic, which is computed based on ratios of observed and expected values (not differences like chi-squared) and which also returns the (logged) odds ratio. Let’s fit our regression model:
summary(m_01 <- glm(       # make/summarize the gen. linear model m_01:
   GENITIVE ~ 1 +          # GENITIVE ~ an overall intercept (1)
   MODALITY,               # & the predictor MODALITY
   family=binomial,        # resp = binary
   data=d,                 # those vars are in d
   na.action=na.exclude))  # skip cases with NA/missing data
Call:
glm(formula = GENITIVE ~ 1 + MODALITY, family = binomial, data = d,
na.action = na.exclude)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.00050 0.05495 -18.208 < 2e-16 ***
MODALITYwritten -0.24790 0.07767 -3.192 0.00141 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 4004.3 on 3599 degrees of freedom
Residual deviance: 3994.1 on 3598 degrees of freedom
AIC: 3998.1
Number of Fisher Scoring iterations: 4
pchisq(               # compute the area under the chi-squared curve
   q=4004.3-3994.1,   # for this chi-squared value: null - residual deviance
   df=3599-3598,      # at this df: null - residual df
   lower.tail=FALSE)  # only using the right/upper tail/side
[1] 0.001404407
drop1(m_01,        # drop each droppable predictor at a time from m_01 &
   test="Chisq")   # test its significance w/ a LRT
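The confidence intervals for the coefficients (used for the 95%-CI column in the write-up below) can be obtained with car’s Confint:
Confint(m_01) # confidence intervals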
(Note how similar the chi-squared value and the G2-statistic and their p-values are.)
But what are the coefficients?
coef(m_01)
(Intercept) MODALITYwritten
-1.0005020 -0.2479022
Recognize the intercept?
log(453/1232) # LOGGED odds for s-genitives in the spoken data
[1] -1.000502
Recognize what the MODALITY coefficient is?
log(427/1488) -log(453/1232) # difference between LOGGED odds for s-genitives
[1] -0.2479022
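As a quick sanity check, the inverse logit function plogis maps these (summed) coefficients back onto the probabilities we computed by hand above:
plogis(-1.0005020)              # = 453/1685, the prob. of s-genitives in the spoken data
plogis(-1.0005020 + -0.2479022) # = 427/1915, the prob. of s-genitives in the written data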
Let’s make ‘predictions’, which we add to the data:
d$PRED_PP_S <- predict(   # make d$PRED_PP_S the result of predicting
   m_01,                  # from m_01
   type="response")       # predicted probabilities of "s"
# if you don't say type="response", predict gives you log odds:
unique(predict(m_01), 6)
[1] -1.000502 -1.248404
d$PRED_CAT <- factor(          # make d$PRED_CAT the result of checking
   ifelse(d$PRED_PP_S>=0.5,    # if the predicted prob. of "s" is >=0.5
      levels(d$GENITIVE)[2],   # if yes, predict "s"
      levels(d$GENITIVE)[1]),  # otherwise, predict "of"
   levels=levels(d$GENITIVE))  # to ensure 2-by-2 confusion matrices!
Actually, the much shorter though less intuitive version of generating d$PRED_CAT is this:
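d$PRED_CAT <- factor(levels(d$GENITIVE)[1+(d$PRED_PP_S>=0.5)], levels=levels(d$GENITIVE))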
On top of that, let’s also add something else, namely a column called PRED_PP_OBS that contains, for each case, the predicted probability of the level that was observed in the data, i.e. what the model should have predicted. How does that work? Consider the head of d:
head(d)
CASE GENITIVE SPEAKER MODALITY POR_LENGTH PUM_LENGTH POR_ANIMACY
1 2 of nns spoken 13 7 collective
2 3 of nns spoken 22 7 animate
3 4 of nns spoken 11 8 animate
4 5 of nns spoken 26 4 collective
5 6 s nns spoken 8 4 animate
6 7 s nns spoken 7 3 animate
POR_FINAL_SIB POR_DEF PRED_PP_S PRED_CAT
1 absent definite 0.2688427 of
2 absent definite 0.2688427 of
3 present definite 0.2688427 of
4 absent definite 0.2688427 of
5 absent definite 0.2688427 of
6 absent definite 0.2688427 of
In the first case, the model predicted s with a probability of 0.2688427 and, thus, of with a probability of 1-0.2688427=0.7311573. The actual observation, what the speaker really said, is of, so PRED_PP_OBS[1] should be 0.7311573. In the fifth case, the model again predicted s with a probability of 0.2688427 and, thus, of with a probability of 1-0.2688427=0.7311573. The actual observation, what the speaker really said, is s, so PRED_PP_OBS[5] should be 0.2688427. How do we do that efficiently for all cases (and in the spirit of Section 3.5.2, efficiently means ‘without a loop’)?
d$PRED_PP_OBS <- ifelse(  # d$PRED_PP_OBS is determined by ifelse
   d$GENITIVE=="s",       # if the obs genitive is the 2nd level of the response
   d$PRED_PP_S,           # take its predicted probability
   1-d$PRED_PP_S)         # otherwise take 1 minus its predicted probability
head(d)
CASE GENITIVE SPEAKER MODALITY POR_LENGTH PUM_LENGTH POR_ANIMACY
1 2 of nns spoken 13 7 collective
2 3 of nns spoken 22 7 animate
3 4 of nns spoken 11 8 animate
4 5 of nns spoken 26 4 collective
5 6 s nns spoken 8 4 animate
6 7 s nns spoken 7 3 animate
POR_FINAL_SIB POR_DEF PRED_PP_S PRED_CAT PRED_PP_OBS
1 absent definite 0.2688427 of 0.7311573
2 absent definite 0.2688427 of 0.7311573
3 present definite 0.2688427 of 0.7311573
4 absent definite 0.2688427 of 0.7311573
5 absent definite 0.2688427 of 0.2688427
6 absent definite 0.2688427 of 0.2688427
Here, too, a shorter but much less intuitive version is available; if you have time to waste, try and find out how/why it works, but be warned: for some weird reason it’s a bit of a brain teaser.
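d$PRED_PP_OBS <- abs(d$PRED_PP_S - as.numeric(d$GENITIVE!="s"))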
We’ll get to why this is useful in a moment … But let’s now evaluate the predictive capacity of the model (I’ll explain McFadden’s R2 below):
(c_m <- table(          # confusion matrix: cross-tabulate
   "OBS"  =d$GENITIVE,  # observed genitives in the rows
   "PREDS"=d$PRED_CAT)) # predicted genitives in the columns
PREDS
OBS of s
of 2720 0
s 880 0
# evaluating the confusion matrix here in the usual way doesn't make much sense,
# which is why we discuss that in the next section
c(R2s(m_01), "C"=C.score(d$GENITIVE, d$PRED_PP_S))
McFadden R-squared Nagelkerke R-squared C
0.002545694 0.004212717 0.530915775
3.3 Visual interpretation
Let’s visualize the nature of the effect of MODALITY based on the predictions:
(mo_d <- data.frame(  # make mo_d a data frame of
   mo <- effect(      # the effect
   "MODALITY",        # of MODALITY
   m_01)))            # in m_01
MODALITY fit se lower upper
1 spoken 0.2688427 0.010800758 0.2482073 0.2905308
2 written 0.2229765 0.009511784 0.2048903 0.2421729
plot(mo,             # plot the effect of MODALITY
   type="response",  # but plot predicted probabilities
   ylim=c(0, 1),     # ylim: the whole range of possible probabilities
   grid=TRUE)        # & w/ a grid
3.4 Excursus: deviance derived from logloss
We saw above that the deviance of a glm is again the sum of squared residuals, which means we again want to reduce the deviance of our model relative to a null model by adding (hopefully good) predictors. But what is deviance here in the case of a glm exactly, i.e. how is it computed? To understand, let’s look at the predictions of m_01 again:
head(d[,c(2,4,10:12)])
GENITIVE MODALITY PRED_PP_S PRED_CAT PRED_PP_OBS
1 of spoken 0.2688427 of 0.7311573
2 of spoken 0.2688427 of 0.7311573
3 of spoken 0.2688427 of 0.7311573
4 of spoken 0.2688427 of 0.7311573
5 s spoken 0.2688427 of 0.2688427
6 s spoken 0.2688427 of 0.2688427
What would the best model ever have returned for these 6 cases? Something like this, where the model
- predicts all instances of s correctly by returning a predicted probability of s of essentially 1;
- predicts all instances of of correctly by returning a predicted probability of s of essentially 0:
GENITIVE PRED_PP_S PRED_CAT PRED_PP_OBS
1 of 0.00001 of 0.99999
2 of 0.00001 of 0.99999
3 of 0.00001 of 0.99999
4 of 0.00001 of 0.99999
5 s 0.99999 s 0.99999
6 s 0.99999 s 0.99999
That means we could use 1-PRED_PP_OBS to quantify how far off from a perfect prediction the model is, but there is an alternative: we could take the negative logs of PRED_PP_OBS because then
- if the predicted probability of what was actually observed, which should be close to 1, is in fact close to 1 – i.e. the model is good – then we end up -logging something that’s very close to 1, which makes the negative logs really close to 0, meaning ‘we’re not far off’:
-log(0.8)
[1] 0.2231436
-log(0.9)
[1] 0.1053605
-log(0.95)
[1] 0.05129329
- if the predicted probability of what was actually observed, which should be close to 1, is in fact close to 0 – i.e. the model is bad – then we end up -logging something that’s very close to 0, which makes the negative logs grow and grow, meaning ‘we’re quite far off’:
-log(0.2)
[1] 1.609438
-log(0.1)
[1] 2.302585
-log(0.05)
[1] 2.995732
I am calling these values contributions to logloss (because the mean of all those values is an evaluation metric called logloss). But what does this have to do with deviance? This:
contributions2logloss <- -log(d$PRED_PP_OBS)
# the sum of all contributions to logloss times 2
sum(contributions2logloss) * 2
[1] 3994.079
deviance(m_01) # is this!
[1] 3994.079
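And the logloss itself is simply the mean of these contributions (so the deviance is 2*n times the logloss):
mean(contributions2logloss) # the logloss of m_01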
That means, the deviance of such a glm, what the summary calls “residual deviance”, expresses how far all the predicted probabilities of the events that should have been predicted are from their theoretical ideals of (very near to) 0 and (very near to) 1! That actually means we can also again express how much the deviance of the null model, the null deviance, is reduced by the predictors in percent:
(m_00$deviance - m_01$deviance) / m_00$deviance # same as
# (m_01$null.deviance - m_01$deviance) / m_01$null.deviance
[1] 0.002545694
This is McFadden’s R2-value – not the one that is mostly used, which is Nagelkerke’s R2, but still very useful because of its parallelism to the R2 of linear models.
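For completeness, here is a sketch of how Nagelkerke’s R2 can be derived from the same two deviances, assuming the standard Cox & Snell-based definition (it should match the value returned by the R2s() helper above):
n <- nrow(d)                                             # number of cases
R2.cs <- 1 - exp((deviance(m_01) - m_01$null.deviance)/n) # Cox & Snell R-squared
R2.cs / (1 - exp(-m_01$null.deviance/n))                  # Nagelkerke's rescaled R-squared (~0.004)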
3.5 Write-up
To determine whether the choice of a genitive varies as a function of the modality (spoken vs. written), a generalized linear model was fit with GENITIVE as the response variable and MODALITY as a predictor. The model was highly significant (LR=10.19, df=1, p<0.002) but explains very little of the response variable (Nagelkerke R2=0.004, McFadden’s R2=0.003, C=0.53) and has no predictive power: the model only ever predicts of-genitives. The coefficients indicate that s-genitives are more likely in spoken data than in written data, as shown in the summary table here. [Show a plot]
|                              | Estimate | 95%-CI         | se   | z      | p (2-tailed) |
|:-----------------------------|:---------|:---------------|:-----|:-------|:-------------|
| Intercept (MODALITY=spoken)  | -1       | [-1.11, -0.89] | 0.05 | -18.21 | <0.001       |
| MODALITY spoken→written      | -0.25    | [-0.4, -0.1]   | 0.08 | -3.19  | <0.002       |
4 A categorical predictor
Does the choice of a genitive construction vary as a function of the degree/kind of animacy of the possessor (animate vs. collective vs. inanimate vs. locative vs. temporal)?
4.1 Exploration & preparation
Some exploration of the relevant variables:
# the predictor(s)/response on its/their own
table(d$GENITIVE)
# the predictor(s) w/ the response
table(d$POR_ANIMACY, d$GENITIVE)
of s
animate 370 550
collective 408 199
inanimate 1638 33
locative 199 44
temporal 105 54
4.2 Modeling & numerical interpretation
Let’s fit our regression model:
summary(m_01 <- glm(       # make/summarize the gen. linear model m_01:
   GENITIVE ~ 1 +          # GENITIVE ~ an overall intercept (1)
   POR_ANIMACY,            # & the predictor POR_ANIMACY
   family=binomial,        # resp = binary
   data=d,                 # those vars are in d
   na.action=na.exclude))  # skip cases with NA/missing data
Call:
glm(formula = GENITIVE ~ 1 + POR_ANIMACY, family = binomial,
data = d, na.action = na.exclude)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.39642 0.06724 5.896 3.73e-09 ***
POR_ANIMACYcollective -1.11438 0.10953 -10.174 < 2e-16 ***
POR_ANIMACYinanimate -4.30114 0.18823 -22.851 < 2e-16 ***
POR_ANIMACYlocative -1.90553 0.17965 -10.607 < 2e-16 ***
POR_ANIMACYtemporal -1.06139 0.18045 -5.882 4.06e-09 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 4004.3 on 3599 degrees of freedom
Residual deviance: 2766.0 on 3595 degrees of freedom
AIC: 2776
Number of Fisher Scoring iterations: 6
pchisq(               # compute the area under the chi-squared curve
   q=4004.3-2766,     # for this chi-squared value: null - residual deviance
   df=3599-3595,      # at this df: null - residual df
   lower.tail=FALSE)  # only using the right/upper tail/side
[1] 7.926259e-267
drop1(m_01,        # drop each droppable predictor at a time from m_01 &
   test="Chisq")   # test its significance w/ a LRT
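As before, the confidence intervals for the coefficients (used in the write-up below) can be obtained with car’s Confint:
Confint(m_01) # confidence intervals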
Let’s make (numerical and categorical) ‘predictions’ and evaluate them:
d$PRED_PP_S <- predict(       # make d$PRED_PP_S the result of predicting
   m_01,                      # from m_01
   type="response")           # predicted probabilities of "s"
# if you don't say type="response", predict gives you log odds
d$PRED_CAT <- factor(         # make d$PRED_CAT the result of checking
   ifelse(d$PRED_PP_S>=0.5,   # if the predicted prob. of "s" is >=0.5
      levels(d$GENITIVE)[2],  # if yes, predict "s"
      levels(d$GENITIVE)[1])) # otherwise, predict "of"
(c_m <- table(                # confusion matrix: cross-tabulate
   "OBS"  =d$GENITIVE,        # observed genitives in the rows
   "PREDS"=d$PRED_CAT))       # predicted genitives in the columns
PREDS
OBS of s
of 2350 370
s 330 550
c(                            # evaluate the confusion matrix
   "Prec. for s"     =c_m[ "s", "s"] / sum(c_m[    , "s"]),
   "Acc./rec. for s" =c_m[ "s", "s"] / sum(c_m[ "s",    ]),
   "Prec. for of"    =c_m["of","of"] / sum(c_m[    ,"of"]),
   "Acc./rec. for of"=c_m["of","of"] / sum(c_m["of",    ]),
   "Acc. (overall)"  =mean(d$GENITIVE==d$PRED_CAT))
Prec. for s Acc./rec. for s Prec. for of Acc./rec. for of
0.5978261 0.6250000 0.8768657 0.8639706
Acc. (overall)
0.8055556
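The R-squared values, Cohen’s κ, and the C-score reported in the write-up below come from the helper functions loaded at the beginning:
c(R2s(m_01),
   "Cohen's kappa"=cohens.kappa(c_m)[[1]],
   "C"=C.score(d$GENITIVE, d$PRED_PP_S))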
Again, we also add our column PRED_PP_OBS with, for each case, the predicted probability of the level that was observed in the data:
d$PRED_PP_OBS <- ifelse(  # d$PRED_PP_OBS is determined by ifelse
   d$GENITIVE=="s",       # if the obs genitive is the 2nd level of the response
   d$PRED_PP_S,           # take its predicted probability
   1-d$PRED_PP_S)         # otherwise take 1 minus its predicted probability
head(d)
CASE GENITIVE SPEAKER MODALITY POR_LENGTH PUM_LENGTH POR_ANIMACY
1 2 of nns spoken 13 7 collective
2 3 of nns spoken 22 7 animate
3 4 of nns spoken 11 8 animate
4 5 of nns spoken 26 4 collective
5 6 s nns spoken 8 4 animate
6 7 s nns spoken 7 3 animate
POR_FINAL_SIB POR_DEF PRED_PP_S PRED_CAT PRED_PP_OBS
1 absent definite 0.3278418 of 0.6721582
2 absent definite 0.5978261 s 0.4021739
3 present definite 0.5978261 s 0.4021739
4 absent definite 0.3278418 of 0.6721582
5 absent definite 0.5978261 s 0.5978261
6 absent definite 0.5978261 s 0.5978261
4.3 Visual interpretation
Let’s visualize the nature of the effect of POR_ANIMACY based on the predictions:
(an_d <- data.frame(  # make an_d a data frame of
   an <- effect(      # the effect
   "POR_ANIMACY",     # of POR_ANIMACY
   m_01)))            # in m_01
plot(an,             # plot the effect of POR_ANIMACY
   type="response",  # but plot predicted probabilities
   ylim=c(0, 1),     # ylim: the whole range of possible probabilities
   grid=TRUE)        # & w/ a grid
4.4 Excursus: deviance derived from logloss
The way the contributions to logloss and the deviance are computed does not change just because we have (a) different predictor(s) now: the deviance is still the sum of all contributions to logloss times 2:
deviance(m_01)
[1] 2765.995
sum(-log(d$PRED_PP_OBS)) *2
[1] 2765.995
m_01$deviance
[1] 2765.995
4.5 Excursus: model comparison
Given a lot of previous literature on the genitive alternation, you might actually have hypothesized that the only thing one needs is a distinction between animate (and maybe collective) and all inanimate possessors. Let’s use model comparison to check this. First, we create a version of POR_ANIMACY that conflates levels as desired:
d$POR_ANIMACY_confl <- d$POR_ANIMACY # create a copy of POR_ANIMACY
levels(d$POR_ANIMACY_confl) <-       # overwrite the copy's levels such that
   c("anim/coll", "anim/coll",       # "animate" and "collective" become "anim/coll"
     "inanim/loc/temp", "inanim/loc/temp", "inanim/loc/temp") # all others become "inanim/loc/temp"
table(POR_ANIMACY=d$POR_ANIMACY, POR_ANIMACY_confl=d$POR_ANIMACY_confl) # check you did it right
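head(d <- d[,c(1:7,13,8:12)]) # reorder the columns so POR_ANIMACY_confl sits next to POR_ANIMACY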
CASE GENITIVE SPEAKER MODALITY POR_LENGTH PUM_LENGTH POR_ANIMACY
1 2 of nns spoken 13 7 collective
2 3 of nns spoken 22 7 animate
3 4 of nns spoken 11 8 animate
4 5 of nns spoken 26 4 collective
5 6 s nns spoken 8 4 animate
6 7 s nns spoken 7 3 animate
POR_ANIMACY_confl POR_FINAL_SIB POR_DEF PRED_PP_S PRED_CAT PRED_PP_OBS
1 anim/coll absent definite 0.3278418 of 0.6721582
2 anim/coll absent definite 0.5978261 s 0.4021739
3 anim/coll present definite 0.5978261 s 0.4021739
4 anim/coll absent definite 0.3278418 of 0.6721582
5 anim/coll absent definite 0.5978261 s 0.5978261
6 anim/coll absent definite 0.5978261 s 0.5978261
We already have the model with POR_ANIMACY as the only predictor so we only need to fit a new model with POR_ANIMACY_confl as the only predictor:
summary(m_01_animconfl <- glm(        # make/summarize the gen. linear model m_01_animconfl:
   GENITIVE ~ 1 + POR_ANIMACY_confl,  # GENITIVE ~ an overall intercept (1) & POR_ANIMACY_confl
   family=binomial, data=d, na.action=na.exclude)) # the other arguments
Call:
glm(formula = GENITIVE ~ 1 + POR_ANIMACY_confl, family = binomial,
data = d, na.action = na.exclude)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.03799 0.05119 -0.742 0.458
POR_ANIMACY_conflinanim/loc/temp -2.65829 0.10377 -25.616 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 4004.3 on 3599 degrees of freedom
Residual deviance: 3093.4 on 3598 degrees of freedom
AIC: 3097.4
Number of Fisher Scoring iterations: 5
… and then we do a comparison with anova just like above, the only difference being the argument to test (because we’re now fitting a generalized linear model).
anova(m_01,          # compare the model w/ the 5-level predictor POR_ANIMACY
   m_01_animconfl,   # & the model w/ the 2-level predictor POR_ANIMACY_confl
   test="Chisq")     # using a LRT
The difference between the two models is hugely significant: we cannot reduce POR_ANIMACY to just the two levels we were considering.
4.6 Write-up
To determine whether the choice of a genitive varies as a function of the degree/kind of animacy of the possessor (animate vs. collective vs. inanimate vs. locative vs. temporal), a generalized linear model was fit with GENITIVE as the response variable and POR_ANIMACY as a predictor. The model was highly significant (LR=1238.3, df=4, p<0.001) and explains a decent amount of the response variable (Nagelkerke R2=0.434, McFadden’s R2=0.309, Cohen’s κ=0.482, C=0.85) with a good amount of predictive power: [list precision & accuracy/recall scores for both genitives]. The summary table of the model is shown here.
|                                  | Estimate | 95%-CI         | se   | z      | p (2-tailed) |
|:---------------------------------|:---------|:---------------|:-----|:-------|:-------------|
| Intercept (POR_ANIMACY=animate)  | 0.40     | [0.27, 0.53]   | 0.07 | 5.9    | <0.001       |
| POR_ANIMACY animate→collective   | -1.11    | [-1.33, -0.9]  | 0.11 | -10.17 | <0.002       |
| POR_ANIMACY animate→inanimate    | -4.30    | [-4.69, -3.95] | 0.19 | -22.85 | <0.002       |
| POR_ANIMACY animate→locative     | -1.91    | [-2.27, -1.56] | 0.18 | -10.61 | <0.002       |
| POR_ANIMACY animate→temporal     | -1.06    | [-1.42, -0.71] | 0.18 | -5.88  | <0.002       |
The model predicts s-genitives for animate possessors but of-genitives for collective/temporal possessors, then locative possessors, and then nearly exclusively for inanimate possessors. [Show a plot] A post-hoc comparison of the above model with one in which POR_ANIMACY was reduced to 2 levels (animate/collective vs. inanimate/locative/temporal) showed that this reduced predictor is highly significantly worse (LR=327.39, p<0.0001).
5 A numeric predictor
Does the choice of a genitive construction vary as a function of the length of the possessor (in characters)?
5.1 Exploration & preparation
# the predictor(s)/response on its/their own
table(d$GENITIVE)
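# histograms of the predictor (they motivate the log transformation below)
hist(d$POR_LENGTH); hist(d$POR_LENGTH, breaks="FD")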
# the predictor(s) w/ the response
spineplot(d$GENITIVE ~ d$POR_LENGTH)
The predictor POR_LENGTH looks like it’s going to cause problems if we use it like this, given its very skewed distribution, but we’ll try for now and fit our regression model.
5.2 Modeling & numerical interpretation
Here’s our initial model:
summary(m_01 <- glm(       # make/summarize the gen. linear model m_01:
   GENITIVE ~ 1 +          # GENITIVE ~ an overall intercept (1)
   POR_LENGTH,             # & the predictor POR_LENGTH
   family=binomial,        # resp = binary
   data=d,                 # those vars are in d
   na.action=na.exclude))  # skip cases with NA/missing data
Call:
glm(formula = GENITIVE ~ 1 + POR_LENGTH, family = binomial, data = d,
na.action = na.exclude)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.337551 0.090941 3.712 0.000206 ***
POR_LENGTH -0.126693 0.008189 -15.471 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 4004.3 on 3599 degrees of freedom
Residual deviance: 3598.6 on 3598 degrees of freedom
AIC: 3602.6
Number of Fisher Scoring iterations: 6
pchisq(               # compute the area under the chi-squared curve
   q=4004.3-3598.6,   # for this chi-squared value: null - residual deviance
   df=3599-3598,      # at this df: null - residual df
   lower.tail=FALSE)  # only using the right/upper tail/side
[1] 3.163282e-90
drop1(m_01,        # drop each droppable predictor at a time from m_01 &
   test="Chisq")   # test its significance w/ a LRT
Let’s make ‘predictions’ and evaluate them:
d$PRED_PP_S <- predict(       # make d$PRED_PP_S the result of predicting
   m_01,                      # from m_01
   type="response")           # predicted probabilities of "s"
# if you don't say type="response", predict gives you log odds
d$PRED_CAT <- factor(         # make d$PRED_CAT the result of checking
   ifelse(d$PRED_PP_S>=0.5,   # if the predicted prob. of "s" is >=0.5
      levels(d$GENITIVE)[2],  # if yes, predict "s"
      levels(d$GENITIVE)[1])) # otherwise, predict "of"
(c_m <- table(                # confusion matrix: cross-tabulate
   "OBS"  =d$GENITIVE,        # observed genitives in the rows
   "PREDS"=d$PRED_CAT))       # predicted genitives in the columns
PREDS
OBS of s
of 2712 8
s 868 12
c(                            # evaluate the confusion matrix
   "Prec. for s"     =c_m[ "s", "s"] / sum(c_m[    , "s"]),
   "Acc./rec. for s" =c_m[ "s", "s"] / sum(c_m[ "s",    ]),
   "Prec. for of"    =c_m["of","of"] / sum(c_m[    ,"of"]),
   "Acc./rec. for of"=c_m["of","of"] / sum(c_m["of",    ]),
   "Acc. (overall)"  =mean(d$GENITIVE==d$PRED_CAT))
Prec. for s Acc./rec. for s Prec. for of Acc./rec. for of
0.60000000 0.01363636 0.75754190 0.99705882
Acc. (overall)
0.75666667
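5.3 Visual interpretation
Let’s visualize the nature of the effect of POR_LENGTH based on the predictions:
(le_d <- data.frame(  # make le_d a data frame of
   le <- effect(      # the effect
   "POR_LENGTH",      # of POR_LENGTH
   m_01)))            # in m_01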
plot(le,             # plot the effect of POR_LENGTH
   type="response",  # but plot predicted probabilities
   ylim=c(0, 1),     # ylim: the whole range of possible probabilities
   grid=TRUE)        # & w/ a grid
5.4 What if we log-transform POR_LENGTH?
Ok, those results were pretty bad – let’s do this again, because maybe it’s really smarter to transform the possessor lengths; given the shape of this variable’s histogram and because no length is 0, we try the log transformation (to the base of 2):
# the predictor(s)/response on its/their own
table(d$GENITIVE)
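d$POR_LENGTH_LOG <- log2(d$POR_LENGTH) # log (to the base of 2) the possessor lengths
hist(d$POR_LENGTH_LOG); hist(d$POR_LENGTH_LOG, breaks="FD")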
# the predictor(s) w/ the response
spineplot(d$GENITIVE ~ d$POR_LENGTH_LOG)
Let’s fit our 2nd regression model:
summary(m_02 <- glm(       # make/summarize the gen. linear model m_02:
   GENITIVE ~ 1 +          # GENITIVE ~ an overall intercept (1)
   POR_LENGTH_LOG,         # & the predictor POR_LENGTH_LOG
   family=binomial,        # resp = binary
   data=d,                 # those vars are in d
   na.action=na.exclude))  # skip cases with NA/missing data
Call:
glm(formula = GENITIVE ~ 1 + POR_LENGTH_LOG, family = binomial,
data = d, na.action = na.exclude)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.90960 0.17609 10.85 <2e-16 ***
POR_LENGTH_LOG -0.90493 0.05325 -16.99 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 4004.3 on 3599 degrees of freedom
Residual deviance: 3643.1 on 3598 degrees of freedom
AIC: 3647.1
Number of Fisher Scoring iterations: 4
pchisq(               # compute the area under the chi-squared curve
   q=4004.3-3643.1,   # for this chi-squared value: null - residual deviance
   df=3599-3598,      # at this df: null - residual df
   lower.tail=FALSE)  # only using the right/upper tail/side
[1] 1.542725e-80
drop1(m_02,        # drop each droppable predictor at a time from m_02 &
   test="Chisq")   # test its significance w/ a LRT
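And again the confidence intervals for the coefficients (used in the write-up below) can be obtained with car’s Confint:
Confint(m_02) # confidence intervals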
Let’s make ‘predictions’ and evaluate them:
d$PRED_PP_S <- predict(       # make d$PRED_PP_S the result of predicting
   m_02,                      # from m_02
   type="response")           # predicted probabilities of "s"
# if you don't say type="response", predict gives you log odds
d$PRED_CAT <- factor(         # make d$PRED_CAT the result of checking
   ifelse(d$PRED_PP_S>=0.5,   # if the predicted prob. of "s" is >=0.5
      levels(d$GENITIVE)[2],  # if yes, predict "s"
      levels(d$GENITIVE)[1])) # otherwise, predict "of"
(c_m <- table(                # confusion matrix: cross-tabulate
   "OBS"  =d$GENITIVE,        # observed genitives in the rows
   "PREDS"=d$PRED_CAT))       # predicted genitives in the columns
PREDS
OBS of s
of 2614 106
s 809 71
c(                            # evaluate the confusion matrix
   "Prec. for s"     =c_m[ "s", "s"] / sum(c_m[    , "s"]),
   "Acc./rec. for s" =c_m[ "s", "s"] / sum(c_m[ "s",    ]),
   "Prec. for of"    =c_m["of","of"] / sum(c_m[    ,"of"]),
   "Acc./rec. for of"=c_m["of","of"] / sum(c_m["of",    ]),
   "Acc. (overall)"  =mean(d$GENITIVE==d$PRED_CAT))
Prec. for s Acc./rec. for s Prec. for of Acc./rec. for of
0.40112994 0.08068182 0.76365761 0.96102941
Acc. (overall)
0.74583333
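The R-squared values, Cohen’s κ, and the C-score reported in the write-up below:
c(R2s(m_02),
   "Cohen's kappa"=cohens.kappa(c_m)[[1]],
   "C"=C.score(d$GENITIVE, d$PRED_PP_S))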
Again, we also add our column PRED_PP_OBS with, for each case, the predicted probability of the level that was observed in the data:
d$PRED_PP_OBS <- ifelse(  # d$PRED_PP_OBS is determined by ifelse
   d$GENITIVE=="s",       # if the obs genitive is the 2nd level of the response
   d$PRED_PP_S,           # take its predicted probability
   1-d$PRED_PP_S)         # otherwise take 1 minus its predicted probability
head(d)
CASE GENITIVE SPEAKER MODALITY POR_LENGTH PUM_LENGTH POR_ANIMACY
1 2 of nns spoken 13 7 collective
2 3 of nns spoken 22 7 animate
3 4 of nns spoken 11 8 animate
4 5 of nns spoken 26 4 collective
5 6 s nns spoken 8 4 animate
6 7 s nns spoken 7 3 animate
POR_ANIMACY_confl POR_FINAL_SIB POR_DEF PRED_PP_S PRED_CAT PRED_PP_OBS
1 anim/coll absent definite 0.19169521 of 0.8083048
2 anim/coll absent definite 0.10660805 of 0.8933919
3 anim/coll present definite 0.22777155 of 0.7722285
4 anim/coll absent definite 0.08754694 of 0.9124531
5 anim/coll absent definite 0.30891765 of 0.3089176
6 anim/coll absent definite 0.34731689 of 0.3473169
POR_LENGTH_LOG
1 3.700440
2 4.459432
3 3.459432
4 4.700440
5 3.000000
6 2.807355
Let’s visualize the nature of the effect of POR_LENGTH_LOG based on the predictions:
(ll_d <- data.frame(  # make ll_d a data frame of
   ll <- effect(      # the effect
   "POR_LENGTH_LOG",  # of POR_LENGTH_LOG
   m_02)))            # in m_02
plot(ll,             # plot the effect of POR_LENGTH_LOG
   type="response",  # but plot predicted probabilities
   ylim=c(0, 1),     # ylim: the whole range of possible probabilities
   grid=TRUE)        # & w/ a grid
5.5 Excursus: deviance derived from logloss
The way the contributions to logloss and the deviance are computed does not change just because we have (a) different predictor(s) now: the deviance is still the sum of all contributions to logloss times 2:
deviance(m_02)
[1] 3643.097
sum(-log(d$PRED_PP_OBS)) *2
[1] 3643.097
m_02$deviance
[1] 3643.097
5.6 Write-up
To determine whether the choice of a genitive varies as a function of the length of the possessor, a generalized linear model was fit with GENITIVE as the response variable and POR_LENGTH as a predictor; however, the initial model resulted in an extremely poor fit, probably in part because of the very long right tail of the predictor, which was therefore logged to the base of 2. A new model with this predictor was highly significant (LR=361.18, df=1, p<0.001) but explains only a small amount of the response variable (Nagelkerke R2=0.142, McFadden’s R2=0.101, Cohen’s κ=0.057, C=0.7088) and has fairly poor predictive power: [list precision & accuracy/recall scores for both genitives]. The summary table of the model is shown here.
|                               | Estimate | 95%-CI        | se   | z      | p (2-tailed) |
|:------------------------------|:---------|:--------------|:-----|:-------|:-------------|
| Intercept (POR_LENGTH_LOG=0)  | 1.91     | [1.56, 2.25]  | 0.18 | 10.85  | <0.001       |
| POR_LENGTH_LOG 0→1            | -0.9     | [-1.01, -0.8] | 0.05 | -16.99 | <0.001       |
The model indicates a negative correlation between s-genitives and possessor length: the longer the possessor, the less likely s-genitives become and the more likely of-genitives become. [Show a plot]
6 Homework
To prepare for next session, read SFLWR3, Sections 5.3.4-5.3.5 on multifactorial binary logistic regression modeling.