1 Intro

Let’s load a data set for a few binary logistic regression modeling examples:

rm(list=ls(all.names=TRUE))
library(car); library(effects)
source("helpers/202_R2.r"); source("helpers/202_C.score.r") # load small helper functions
summary(d <- read.delim(  # summarize d, the result of loading
   file="inputfiles/202_04-05_GENs.csv", # this file
   stringsAsFactors=TRUE)) # change categorical variables into factors
##       CASE      GENITIVE  SPEAKER       MODALITY      POR_LENGTH
##  Min.   :   2   of:2720   nns:2666   spoken :1685   Min.   :  1.00
##  1st Qu.:1006   s : 880   ns : 934   written:1915   1st Qu.:  8.00
##  Median :2018                                       Median : 11.00
##  Mean   :2012                                       Mean   : 14.58
##  3rd Qu.:3017                                       3rd Qu.: 17.00
##  Max.   :4040                                       Max.   :204.00
##    PUM_LENGTH         POR_ANIMACY   POR_FINAL_SIB        POR_DEF
##  Min.   :  2.00   animate   : 920   absent :2721   definite  :2349
##  1st Qu.:  6.00   collective: 607   present: 879   indefinite:1251
##  Median :  9.00   inanimate :1671
##  Mean   : 10.35   locative  : 243
##  3rd Qu.: 13.00   temporal  : 159
##  Max.   :109.00

Here’s what the columns are/contain:

  • the column CASE is a simple case/row number just so that each case has a unique ID;
  • the column GENITIVE is the binary response variable: it’s whether a speaker used an of-genitive (e.g., the carpossessum of my fatherpossessor) or an s-genitive (my fatherpossessor’s carpossessum);
  • the column SPEAKER is a binary predictor: it’s whether the speaker who produced a genitive was a native speaker of English (ns) or a learner of English (nns);
  • the column MODALITY is a binary predictor: it’s whether the speaker produced a genitive in speaking (spoken) or in writing (written);
  • the column POR_LENGTH is a numeric predictor: it’s the length of the possessor in the genitive in characters (e.g., my father has 9 characters (incl. the space));
  • the column PUM_LENGTH is a numeric predictor: it’s the length of the possessum in the genitive in characters (e.g., the car has 7 characters (incl. the space));
  • the column POR_ANIMACY is a categorical predictor: it’s the semantic category of the possessor: animate (e.g., my father’s car) vs. collective (e.g., the government’s initiative) vs. inanimate (e.g., the table’s surface) vs. locative (e.g., Boston’s population) vs. temporal (e.g., this year’s summer);
  • the column POR_FINAL_SIB is a binary predictor: it’s whether the possessor ends in a sibilant (e.g., the fish’s eyes) or not (e.g., the pig’s eyes);
  • the column POR_DEF is a binary predictor: it’s whether the possessor is definite (e.g., the President’s speech) or indefinite (e.g., a bystander’s reaction);

As in the book, we will first discuss some didactically motivated monofactorial models before we go over a ‘proper’ multifactorial modeling approach.

2 Deviance & baseline(s)

Let’s already compute the baselines for what will be the response variable in all case studies, GENITIVE:

(baselines <- c(
   "baseline 1"=max(          # make baselines[1] the highest
      prop.table(             # proportion in the
         table(d$GENITIVE))), # frequency table of the response
   "baseline 2"=sum(             # make baselines[2] the sum of the
      prop.table(                # proportions in the
         table(d$GENITIVE))^2))) # frequency table of the response squared
## baseline 1 baseline 2
##  0.7555556  0.6306173

As for deviance, in the case of lm, we defined deviance as the sum of squared residuals and we said we want to reduce the deviance of our model relative to a null model by adding (hopefully good) predictors. What is deviance here in the case of a glm? To understand that, we first compute a null model:

summary(m.00 <- glm(GENITIVE ~ 1, family=binomial, data=d, na.action=na.exclude))
##
## Call:
## glm(formula = GENITIVE ~ 1, family = binomial, data = d, na.action = na.exclude)
##
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.12847    0.03878   -29.1   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
##     Null deviance: 4004.3  on 3599  degrees of freedom
## Residual deviance: 4004.3  on 3599  degrees of freedom
## AIC: 4006.3
##
## Number of Fisher Scoring iterations: 4
deviance(m.00)
## [1] 4004.273
sum(residuals(m.00)^2)
## [1] 4004.273

We see that the deviance of m.00 is still the sum of this model’s squared residuals and the model’s “null deviance” (and, if a model is a null model, also what is called “residual deviance”). But what are squared residuals here? We will discuss that in more detail after we have fitted our first ‘real model’ …

3 A binary predictor

Does the choice of a genitive construction vary as a function of the modality in which the construction was produced (spoken vs. written)?

3.1 Exploration & preparation

Some exploration of the relevant variables:

# the predictor(s)/response on its/their own
table(d$GENITIVE)
##
##   of    s
## 2720  880
table(d$MODALITY)
##
##  spoken written
##    1685    1915
# the predictor(s) w/ the response
(mod_x_gen <- table(d$MODALITY, d$GENITIVE))
##
##             of    s
##   spoken  1232  453
##   written 1488  427

3.2 Modeling & numerical interpretation

How would we have dealt with this in Ling 201? First, we would have computed a chi-squared test like this, where the chi-squared value is computed involving differences of observed and expected values:

chisq.test(mod_x_gen, correct=FALSE)
##
##  Pearson's Chi-squared test
##
## data:  mod_x_gen
## X-squared = 10.21, df = 1, p-value = 0.001397

Second, we might have computed the odds ratio and the logged odds ratio for the s-genitives:

(odds.ratio <-
   (mod_x_gen[2,2] / mod_x_gen[2,1]) /
   (mod_x_gen[1,2] / mod_x_gen[1,1]))
## [1] 0.7804363
log(odds.ratio)
## [1] -0.2479022

How do we do this now? With a binary logistic regression, which returns the G2-statistic, which is computed based on ratios of observed and expected values (not differences like chi-squared) and which also returns the (logged) odds ratio. Let’s fit our regression model:

summary(m.01 <- glm(      # make/summarize the gen. linear model m.01:
   GENITIVE ~ 1 +         # GENITIVE ~ an overall intercept (1)
   MODALITY,              # & the predictor MODALITY
   family=binomial,       # resp = binary
   data=d,                # those vars are in d
   na.action=na.exclude)) # skip cases with NA/missing data
##
## Call:
## glm(formula = GENITIVE ~ 1 + MODALITY, family = binomial, data = d,
##     na.action = na.exclude)
##
## Coefficients:
##                 Estimate Std. Error z value Pr(>|z|)
## (Intercept)     -1.00050    0.05495 -18.208  < 2e-16 ***
## MODALITYwritten -0.24790    0.07767  -3.192  0.00141 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
##     Null deviance: 4004.3  on 3599  degrees of freedom
## Residual deviance: 3994.1  on 3598  degrees of freedom
## AIC: 3998.1
##
## Number of Fisher Scoring iterations: 4
pchisq(              # compute the area under the chi-squared curve
   q=4004.3-3994.1,  # for this chi-squared value: null - residual deviance
   df=3599-3598,     # at this df: null - residual df
   lower.tail=FALSE) # only using the right/upper tail/side
## [1] 0.001404407
drop1(m.01,      # drop each droppable predictor at a time from m.01 &
   test="Chisq") # test its significance w/ a LRT
## Single term deletions
##
## Model:
## GENITIVE ~ 1 + MODALITY
##          Df Deviance    AIC    LRT Pr(>Chi)
## <none>        3994.1 3998.1
## MODALITY  1   4004.3 4006.3 10.194 0.001409 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Confint(m.01) # confidence intervals
##                   Estimate      2.5 %      97.5 %
## (Intercept)     -1.0005020 -1.1091198 -0.89367488
## MODALITYwritten -0.2479022 -0.4002591 -0.09571937

(Note how similar the chi-squared value and the G2-statistic and their p-values are.) Let’s make ‘predictions’:

d$PREDS.NUM.2 <- predict( # make d$PREDS.NUM.2 the result of predicting
   m.01,                  # from m.01
   type="response")       # predicted probabilities of "s"
# if you don't say type="response", predict gives you log odds:
unique(predict(m.01), 6)
## [1] -1.000502 -1.248404
d$PREDS.CAT <- factor(        # make d$PREDS.CAT the result of checking
   ifelse(d$PREDS.NUM.2>=0.5, # if the predicted prob. of "s" is >=0.5
      levels(d$GENITIVE)[2],  # if yes, predict "s"
      levels(d$GENITIVE)[1])) # otherwise, predict "of"

Actually, the much shorter though less intuitive version of generating d$PREDS.CAT is this:

d$PREDS.CAT <- factor(levels(d$GENITIVE)[1+(d$PREDS.NUM.2>=0.5)])

On top of that, let’s also add something else, namely a column called PREDS.NUM.obs that contains, for each case, the predicted probability of the level that was observed in the data, i.e. what the model should have predicted. How does that work? Consider the head of x:

head(d)
##   CASE GENITIVE SPEAKER MODALITY POR_LENGTH PUM_LENGTH POR_ANIMACY
## 1    2       of     nns   spoken         13          7  collective
## 2    3       of     nns   spoken         22          7     animate
## 3    4       of     nns   spoken         11          8     animate
## 4    5       of     nns   spoken         26          4  collective
## 5    6        s     nns   spoken          8          4     animate
## 6    7        s     nns   spoken          7          3     animate
##   POR_FINAL_SIB  POR_DEF PREDS.NUM.2 PREDS.CAT
## 1        absent definite   0.2688427        of
## 2        absent definite   0.2688427        of
## 3       present definite   0.2688427        of
## 4        absent definite   0.2688427        of
## 5        absent definite   0.2688427        of
## 6        absent definite   0.2688427        of

In the first case, the model predicted s with a probability of 0.2688427 and, thus of with a probability of 1-0.2688427=0.7311573. The actual observation, what the speaker really said, is of so PREDS.NUM.obs[1] should be 0.7311573. In the fifth case, the model again predicted s with a probability of 0.2688427 and, thus of with a probability of 1-0.2688427=0.7311573. The actual observation, what the speaker really said, is s so PREDS.NUM.obs[2] should be 0.2688427. How do we do that efficiently for all cases (and in the spirit of Section 3.5.2, efficiently means ‘without a loop’)?

d$PREDS.NUM.obs <- ifelse( # d$PREDS.NUM.obs is determined by ifelse
   d$GENITIVE=="s",        # if the obs genitive is the 2nd level of the response
     d$PREDS.NUM.2,        # take its predicted probability
   1-d$PREDS.NUM.2)        # otherwise take 1 minus its predicted probability
head(d)
##   CASE GENITIVE SPEAKER MODALITY POR_LENGTH PUM_LENGTH POR_ANIMACY
## 1    2       of     nns   spoken         13          7  collective
## 2    3       of     nns   spoken         22          7     animate
## 3    4       of     nns   spoken         11          8     animate
## 4    5       of     nns   spoken         26          4  collective
## 5    6        s     nns   spoken          8          4     animate
## 6    7        s     nns   spoken          7          3     animate
##   POR_FINAL_SIB  POR_DEF PREDS.NUM.2 PREDS.CAT PREDS.NUM.obs
## 1        absent definite   0.2688427        of     0.7311573
## 2        absent definite   0.2688427        of     0.7311573
## 3       present definite   0.2688427        of     0.7311573
## 4        absent definite   0.2688427        of     0.7311573
## 5        absent definite   0.2688427        of     0.2688427
## 6        absent definite   0.2688427        of     0.2688427

Here, too, a shorter but much less intuitive version is available; if you have time to waste, try and find it, but for some weird reason it’s a brain teaser, consider yourself warned.

d$PREDS.NUM.obs <- abs(d$PREDS.NUM.2 - as.numeric(d$GENITIVE!="s"))

We’ll get to why this is useful in a moment … But let’s now evaluate the predictive capacity of the model:

(c.m <- table(           # confusion matrix: cross-tabulate
   "OBS"  =d$GENITIVE,   # observed genitives in the rows
   "PREDS"=d$PREDS.CAT)) # predicted orders in the columns
##     PREDS
## OBS    of
##   of 2720
##   s   880
# evaluating the confusion matrix here in the usual way doesn't make much sense,
# which is why we discuss that in the next section
c("Nagelkerke's R2"=R2(m.01),      # R-squared &
  "C"              =C.score(m.01)) # C
## Nagelkerke's R2               C
##     0.004212717     0.530915775

3.3 Visual interpretation

Let’s visualize the nature of the effect of MODALITY based on the predictions:

(mo.d <- data.frame( # make mo.d a data frame of
   mo <- effect(     # the effect
      "MODALITY",    # of MODALITY
      m.01)))        # in m.01
##   MODALITY       fit          se     lower     upper
## 1   spoken 0.2688427 0.010800758 0.2482073 0.2905308
## 2  written 0.2229765 0.009511784 0.2048903 0.2421729
plot(mo,            # plot the effect of MODALITY
   type="response", # but plot predicted probabilities
   ylim=c(0, 1),    # ylim: the whole range of possible probabilities
   grid=TRUE)       # & w/ a grid

3.4 Excursus: deviance derived from logloss

We saw above that the deviance of a glm is again the sum of squared residuals, which means we again want to reduce the deviance of our model relative to a null model by adding (hopefully good) predictors. But what is deviance here in the case of a glm exactly, how it is computed? To understand, let’s look at the predictions of m.01 again:

head(d[,c(4,10:12)])
##   MODALITY PREDS.NUM.2 PREDS.CAT PREDS.NUM.obs
## 1   spoken   0.2688427        of     0.7311573
## 2   spoken   0.2688427        of     0.7311573
## 3   spoken   0.2688427        of     0.7311573
## 4   spoken   0.2688427        of     0.7311573
## 5   spoken   0.2688427        of     0.2688427
## 6   spoken   0.2688427        of     0.2688427

What would the best model ever have returned for these 6 cases? Something like this, where the model

  • predicts all instances of s correctly because it returns a predicted probability of s of essentially 1;
  • predicts all instances of of correctly because it returns a predicted probability of s of essentially 0:
##   GENITIVE PREDS.NUM.2 PREDS.CAT PREDS.NUM.obs
## 1       of     0.00001        of       0.99999
## 2       of     0.00001        of       0.99999
## 3       of     0.00001        of       0.99999
## 4       of     0.00001        of       0.99999
## 5        s     0.99999         s       0.99999
## 6        s     0.99999         s       0.99999

That means we could use 1-PREDS.NUM.obs to quantify how much off from a perfect prediction the model is, but there is an alternative: We could taking the negative logs of PREDS.NUM.obs because then

  • if the predicted probability of what was actually observed, which should be close to 1, is in fact close to 1 – i.e. the model is good – then we end up -logging something that’s very close to 1, which makes the negative logs really close to 0, meaning ‘we’re not far off’:
-log(0.8)
## [1] 0.2231436
-log(0.9)
## [1] 0.1053605
-log(0.95)
## [1] 0.05129329
  • if the predicted probability of what was actually observed, which should be close to 1, is in fact close to 0 – i.e. the model is bad – then we end up -logging something that’s very close to 0, which makes the negative logs grow and grow, meaning ‘we’re quite far off’:
-log(0.2)
## [1] 1.609438
-log(0.1)
## [1] 2.302585
-log(0.05)
## [1] 2.995732

I am calling these values contributions to logloss (because the mean of all those values is a evaluation metric called logloss). But what does this have to do with deviance? This:

deviance(m.01) # same as 
## [1] 3994.079
# the sum of all contributions to logloss times 2
sum(-log(d$PREDS.NUM.obs)) * 2
## [1] 3994.079

That means, the deviance of such a glm, what the summary calls “residual deviance”, expresses how far all the predicted probabilities of the events that should have been predicted are from their theoretical ideals of (very near to) 0 and (very near to) 1! That actually means we can also again express how much the deviance of the null model, the null deviance, is reduced by the predictors in percent:

  (m.00$deviance     -m.01$deviance) / m.00$deviance # same as
# (m.01$null.deviance-m.01$deviance) / m.01$null.deviance
## [1] 0.002545694

This is also an R2-value – not the one that is mostly used and that we computed above, Nagelkerke’s R2, but one that is called McFadden’s R2.

3.5 Write-up

To determine whether the choice of a genitive varies as a function of the modality (spoken vs. written), a generalized linear model was fit with GENITIVE as the response variable and MODALITY as a predictor. The model was highly significant (LR=10.19, df=1, p<0.002) but explains very little of the response variable (Nagelkerke R2=0.004, McFadden’s R2=0.003, C=0.53) and has no predictive power: the model only ever predicts of-genitives. The coefficients indicate that s-genitives are more likely in spoken data than in written data, as shown in the summary table shown here. [Show a plot]

Estimate 95%-CI se z p2-tailed
Intercept (MODALITY=spoken) -1 [-1.11, -0.89] 0.05 -18.21 <0.001
MODALITYspokenwritten -0.25 [-0.4, -0.1] 0.08 -3.19 <0.002

4 A categorical predictor

Does the choice of a genitive construction vary as a function on the degree/kind of animacy of the possessor (animate vs. collective vs. inanimate vs. locative vs. temporal)?

4.1 Exploration & preparation

Some exploration of the relevant variables:

# the predictor(s)/response on its/their own
table(d$GENITIVE)
##
##   of    s
## 2720  880
table(d$POR_ANIMACY)
##
##    animate collective  inanimate   locative   temporal
##        920        607       1671        243        159
# the predictor(s) w/ the response
table(d$POR_ANIMACY, d$GENITIVE)
##
##                of    s
##   animate     370  550
##   collective  408  199
##   inanimate  1638   33
##   locative    199   44
##   temporal    105   54

4.2 Modeling & numerical interpretation

Let’s fit our regression model:

summary(m.01 <- glm(      # make/summarize the gen. linear model m.01:
   GENITIVE ~ 1 +         # GENITIVE ~ an overall intercept (1)
   POR_ANIMACY,           # & the predictor POR_ANIMACY
   family=binomial,       # resp = binary
   data=d,                # those vars are in d
   na.action=na.exclude)) # skip cases with NA/missing data
##
## Call:
## glm(formula = GENITIVE ~ 1 + POR_ANIMACY, family = binomial,
##     data = d, na.action = na.exclude)
##
## Coefficients:
##                       Estimate Std. Error z value Pr(>|z|)
## (Intercept)            0.39642    0.06724   5.896 3.73e-09 ***
## POR_ANIMACYcollective -1.11438    0.10953 -10.174  < 2e-16 ***
## POR_ANIMACYinanimate  -4.30114    0.18823 -22.851  < 2e-16 ***
## POR_ANIMACYlocative   -1.90553    0.17965 -10.607  < 2e-16 ***
## POR_ANIMACYtemporal   -1.06139    0.18045  -5.882 4.06e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
##     Null deviance: 4004.3  on 3599  degrees of freedom
## Residual deviance: 2766.0  on 3595  degrees of freedom
## AIC: 2776
##
## Number of Fisher Scoring iterations: 6
pchisq(              # compute the area under the chi-squared curve
   q=4004.3-2766,    # for this chi-squared value: null - residual deviance
   df=3599-3595,     # at this df: null - residual df
   lower.tail=FALSE) # only using the right/upper tail/side
## [1] 7.926259e-267
drop1(m.01,      # drop each droppable predictor at a time from m.01 &
   test="Chisq") # test its significance w/ a LRT
## Single term deletions
##
## Model:
## GENITIVE ~ 1 + POR_ANIMACY
##             Df Deviance    AIC    LRT  Pr(>Chi)
## <none>           2766.0 2776.0
## POR_ANIMACY  4   4004.3 4006.3 1238.3 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Confint(m.01) # confidence intervals
##                         Estimate      2.5 %     97.5 %
## (Intercept)            0.3964153  0.2651511  0.5288144
## POR_ANIMACYcollective -1.1143776 -1.3303631 -0.9008787
## POR_ANIMACYinanimate  -4.3011390 -4.6879154 -3.9477968
## POR_ANIMACYlocative   -1.9055305 -2.2682451 -1.5625579
## POR_ANIMACYtemporal   -1.0613916 -1.4207695 -0.7121414

Let’s make (numerical and categorical) ‘predictions’ and evaluate them:

d$PREDS.NUM.2 <- predict( # make d$PREDS.NUM.2 the result of predicting
   m.01,                  # from m.01
   type="response")       # predicted probabilities of "s"
# if you don't say type="response", predict gives you log odds
d$PREDS.CAT <- factor(        # make d$PREDS.CAT the result of checking
   ifelse(d$PREDS.NUM.2>=0.5, # if the predicted prob. of "s" is >=0.5
      levels(d$GENITIVE)[2],  # if yes, predict "s"
      levels(d$GENITIVE)[1])) # otherwise, predict "of"
(c.m <- table(           # confusion matrix: cross-tabulate
   "OBS"  =d$GENITIVE,   # observed genitives in the rows
   "PREDS"=d$PREDS.CAT)) # predicted orders in the columns
##     PREDS
## OBS    of    s
##   of 2350  370
##   s   330  550
c( # evaluate the confusion matrix
   "Prec. for s"     =c.m[ "s", "s"] / sum(c.m[    , "s"]),
   "Acc./rec. for s" =c.m[ "s", "s"] / sum(c.m[ "s",    ]),
   "Prec. for of"    =c.m["of","of"] / sum(c.m[    ,"of"]),
   "Acc./rec. for of"=c.m["of","of"] / sum(c.m["of",    ]),
   "Acc. (overall)"  =mean(d$GENITIVE==d$PREDS.CAT))
##      Prec. for s  Acc./rec. for s     Prec. for of Acc./rec. for of
##        0.5978261        0.6250000        0.8768657        0.8639706
##   Acc. (overall)
##        0.8055556
c("Nagelkerke's R2"=R2(m.01),
  "McFadden's R2"  =(m.00$deviance-m.01$deviance) / m.00$deviance,
  "C"              =C.score(m.01)) # R-squared values & C
## Nagelkerke's R2   McFadden's R2               C
##       0.4336234       0.3092390       0.8472389

Again, we also add our column PREDS.NUM.obs with, for each case, the predicted probability of the level that was observed in the data:

d$PREDS.NUM.obs <- ifelse( # d$PREDS.NUM.obs is determined by ifelse
   d$GENITIVE=="s",        # if the obs genitive is the 2nd level of the response
     d$PREDS.NUM.2,        # take its predicted probability
   1-d$PREDS.NUM.2)        # otherwise take 1 minus its predicted probability
head(d)
##   CASE GENITIVE SPEAKER MODALITY POR_LENGTH PUM_LENGTH POR_ANIMACY
## 1    2       of     nns   spoken         13          7  collective
## 2    3       of     nns   spoken         22          7     animate
## 3    4       of     nns   spoken         11          8     animate
## 4    5       of     nns   spoken         26          4  collective
## 5    6        s     nns   spoken          8          4     animate
## 6    7        s     nns   spoken          7          3     animate
##   POR_FINAL_SIB  POR_DEF PREDS.NUM.2 PREDS.CAT PREDS.NUM.obs
## 1        absent definite   0.3278418        of     0.6721582
## 2        absent definite   0.5978261         s     0.4021739
## 3       present definite   0.5978261         s     0.4021739
## 4        absent definite   0.3278418        of     0.6721582
## 5        absent definite   0.5978261         s     0.5978261
## 6        absent definite   0.5978261         s     0.5978261

4.3 Visual interpretation

Let’s visualize the nature of the effect of POR_ANIMACY based on the predictions:

(an.d <- data.frame( # make an.d a data frame of
   an <- effect(      # the effect
      "POR_ANIMACY",  # of POR_ANIMACY
      m.01)))         # in m.01
##   POR_ANIMACY        fit          se      lower      upper
## 1     animate 0.59782609 0.016165922 0.56577463 0.62906282
## 2  collective 0.32784185 0.019053448 0.29164055 0.36621363
## 3   inanimate 0.01974865 0.003403457 0.01407325 0.02764863
## 4    locative 0.18106996 0.024702646 0.13756935 0.23458435
## 5    temporal 0.33962264 0.037557428 0.27028269 0.41659580
plot(an,            # plot the effect of POR_ANIMACY
   type="response", # but plot predicted probabilities
   ylim=c(0, 1),    # ylim: the whole range of possible probabilities
   grid=TRUE)       # & w/ a grid

4.4 Excursus: deviance derived from logloss

The way the contributions to logloss and the deviance are computed do not change just because we have (a) different predictor(s) now: The deviance is still the sum of all contributions to logloss times 2:

deviance(m.01)
## [1] 2765.995
sum(-log(d$PREDS.NUM.obs)) * 2
## [1] 2765.995
m.01$deviance
## [1] 2765.995

4.5 Excursus: model comparison

Given a lot of previous literature on the genitive alternation, you might actually have hypothesized that the only thing one needs is a distinction between animate (and maybe collective) and all inanimate possessors. Let’s use model comparison to check this. First, we create a version of POR_ANIMACY that conflates levels as desired:

d$POR_ANIMACY.confl <- d$POR_ANIMACY # create a copy of POR_ANIMACY
levels(d$POR_ANIMACY.confl) <-       # overwrite the copy's levels such that
   c("anim/coll", "anim/coll",       # "animate" and "collective" become "anim/coll"
      "inanim/loc/temp", "inanim/loc/temp", "inanim/loc/temp") # all others become "inanim/loc/temp"
table(POR_ANIMACY=d$POR_ANIMACY, POR_ANIMACY.confl=d$POR_ANIMACY.confl) # check you did it right
##             POR_ANIMACY.confl
## POR_ANIMACY  anim/coll inanim/loc/temp
##   animate          920               0
##   collective       607               0
##   inanimate          0            1671
##   locative           0             243
##   temporal           0             159
head(d <- d[,c(1:7,13,8:12)])
##   CASE GENITIVE SPEAKER MODALITY POR_LENGTH PUM_LENGTH POR_ANIMACY
## 1    2       of     nns   spoken         13          7  collective
## 2    3       of     nns   spoken         22          7     animate
## 3    4       of     nns   spoken         11          8     animate
## 4    5       of     nns   spoken         26          4  collective
## 5    6        s     nns   spoken          8          4     animate
## 6    7        s     nns   spoken          7          3     animate
##   POR_ANIMACY.confl POR_FINAL_SIB  POR_DEF PREDS.NUM.2 PREDS.CAT PREDS.NUM.obs
## 1         anim/coll        absent definite   0.3278418        of     0.6721582
## 2         anim/coll        absent definite   0.5978261         s     0.4021739
## 3         anim/coll       present definite   0.5978261         s     0.4021739
## 4         anim/coll        absent definite   0.3278418        of     0.6721582
## 5         anim/coll        absent definite   0.5978261         s     0.5978261
## 6         anim/coll        absent definite   0.5978261         s     0.5978261

We already have the model with POR_ANIMACY as the only predictor so we only need to fit a new model with POR_ANIMACY.confl as the only predictor:

summary(m.01.animconfl <- glm( # make/summarize the gen. linear model m.01.animconfl:
   GENITIVE ~ 1 + POR_ANIMACY.confl,    # ORDER ~ an overall intercept (1) & POR_ANIMACY.confl
   family=binomial, data=d, na.action=na.exclude)) # the other arguments
##
## Call:
## glm(formula = GENITIVE ~ 1 + POR_ANIMACY.confl, family = binomial,
##     data = d, na.action = na.exclude)
##
## Coefficients:
##                                  Estimate Std. Error z value Pr(>|z|)
## (Intercept)                      -0.03799    0.05119  -0.742    0.458
## POR_ANIMACY.conflinanim/loc/temp -2.65829    0.10377 -25.616   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
##     Null deviance: 4004.3  on 3599  degrees of freedom
## Residual deviance: 3093.4  on 3598  degrees of freedom
## AIC: 3097.4
##
## Number of Fisher Scoring iterations: 5

… and then we do a comparison with anova just like above, the only difference being the argument to test (because we’re now fitting a generalized linear model).

anova(m.01,        # compare the model w/ the 5-level predictor POR_ANIMACY
   m.01.animconfl, # & the model w/ the 2-level predictor POR_ANIMACY.confl
   test="Chisq")   # using a LRT
## Analysis of Deviance Table
##
## Model 1: GENITIVE ~ 1 + POR_ANIMACY
## Model 2: GENITIVE ~ 1 + POR_ANIMACY.confl
##   Resid. Df Resid. Dev Df Deviance  Pr(>Chi)
## 1      3595     2766.0
## 2      3598     3093.4 -3  -327.39 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The difference between the two models is hugely significant.

4.6 Write-up

To determine whether the choice of a genitive varies as a function of the degree/kind of animacy of the possessor (animate vs. collective vs. inanimate vs. locative vs. temporal), a generalized linear model was fit with GENITIVE as the response variable and POR_ANIMACY as a predictor. The model was highly significant (LR=1238.3, df=4, p<0.001) and explains a decent amount of the response variable (Nagelkerke R2=0.434, McFadden’s R2=0.309, C=0.85) with a good amount of predictive power: [list precision & accuracy/recall scores for both genitives]. The summary table of the model is shown here.

Estimate 95%-CI se z p2-tailed
Intercept (POR_ANIMACY=animate) 0.40 [0.27, 0.53] 0.07 5.9 <0.001
POR_ANIMACYanimatecollective -1.11 [-1.33, -0.9] 0.11 -10.17 <0.002
POR_ANIMACYanimateinanimate -4.30 [-4.69, -3.95] 0.19 -22.85 <0.002
POR_ANIMACYanimatelocative -1.91 [-2.27, -1.56] 0.18 -10.61 <0.002
POR_ANIMACYanimatetemporal -1.06 [-1.42, -0.71] 0.18 -5.88 <0.002

The model predicts s-genitives for animate possessors but of-genitives for collective/temporal possessors, then locative possessors, and then nearly exclusively for inanimate possessors. [Show a plot] A post-hoc comparison of the above model with one in which POS_ANIMACY was reduced to 2 levels (animate/collective vs. inanimate/locative/temporal) showed that this reduced predictor is highly significantly worse (LR=327.39, p<0.0001).

5 A numeric predictor

Does the choice of a genitive construction vary as a function on the length of the possessor (in characters)?

5.1 Exploration & preparation

# the predictor(s)/response on its/their own
table(d$GENITIVE)
##
##   of    s
## 2720  880
hist(d$POR_LENGTH); hist(d$POR_LENGTH, breaks="FD")