Let’s load a data set for a few binary logistic regression modeling examples:
rm(list=ls(all.names=TRUE))
library(car); library(effects)
source("helpers/202_R2.r"); source("helpers/202_C.score.r") # load small helper functions
summary(d <- read.delim( # summarize d, the result of loading
file="inputfiles/202_04-05_GENs.csv", # this file
stringsAsFactors=TRUE)) # change categorical variables into factors
## CASE GENITIVE SPEAKER MODALITY POR_LENGTH
## Min. : 2 of:2720 nns:2666 spoken :1685 Min. : 1.00
## 1st Qu.:1006 s : 880 ns : 934 written:1915 1st Qu.: 8.00
## Median :2018 Median : 11.00
## Mean :2012 Mean : 14.58
## 3rd Qu.:3017 3rd Qu.: 17.00
## Max. :4040 Max. :204.00
## PUM_LENGTH POR_ANIMACY POR_FINAL_SIB POR_DEF
## Min. : 2.00 animate : 920 absent :2721 definite :2349
## 1st Qu.: 6.00 collective: 607 present: 879 indefinite:1251
## Median : 9.00 inanimate :1671
## Mean : 10.35 locative : 243
## 3rd Qu.: 13.00 temporal : 159
## Max. :109.00
Here’s what the columns are/contain:

- CASE is a simple case/row number just so that each case has a unique ID;
- GENITIVE is the binary response variable: it’s whether a speaker used an of-genitive (e.g., the car (possessum) of my father (possessor)) or an s-genitive (my father (possessor)’s car (possessum));
- SPEAKER is a binary predictor: it’s whether the speaker who produced a genitive was a native speaker of English (ns) or a learner of English (nns);
- MODALITY is a binary predictor: it’s whether the speaker produced a genitive in speaking (spoken) or in writing (written);
- POR_LENGTH is a numeric predictor: it’s the length of the possessor in the genitive in characters (e.g., my father has 9 characters (incl. the space));
- PUM_LENGTH is a numeric predictor: it’s the length of the possessum in the genitive in characters (e.g., the car has 7 characters (incl. the space));
- POR_ANIMACY is a categorical predictor: it’s the semantic category of the possessor: animate (e.g., my father’s car) vs. collective (e.g., the government’s initiative) vs. inanimate (e.g., the table’s surface) vs. locative (e.g., Boston’s population) vs. temporal (e.g., this year’s summer);
- POR_FINAL_SIB is a binary predictor: it’s whether the possessor ends in a sibilant (e.g., the fish’s eyes) or not (e.g., the pig’s eyes);
- POR_DEF is a binary predictor: it’s whether the possessor is definite (e.g., the President’s speech) or indefinite (e.g., a bystander’s reaction).

As in the book, we will first discuss some didactically motivated monofactorial models before we go over a ‘proper’ multifactorial modeling approach.
Let’s already compute the baselines for what will be the response variable in all case studies, GENITIVE:
(baselines <- c(
"baseline 1"=max( # make baselines[1] the highest
prop.table( # proportion in the
table(d$GENITIVE))), # frequency table of the response
"baseline 2"=sum( # make baselines[2] the sum of the
prop.table( # proportions in the
table(d$GENITIVE))^2))) # frequency table of the response squared
## baseline 1 baseline 2
## 0.7555556 0.6306173
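As a quick optional check of what these two baselines correspond to, consider the following small sketch (the interpretation is a gloss, not something computed by the helper files): baseline 1 is the accuracy of always predicting the more frequent genitive (of), and baseline 2 is the accuracy one can expect from randomly predicting each genitive in proportion to its observed frequency.
mean(d$GENITIVE=="of") # always predict "of": same value as baseline 1 (0.7555556)
set.seed(1) # any seed will do, just for reproducibility
random.guesses <- sample( # random predictions
   levels(d$GENITIVE), # of the two genitives,
   size=nrow(d), replace=TRUE, # one per case,
   prob=prop.table(table(d$GENITIVE))) # proportional to their frequencies
mean(random.guesses==d$GENITIVE) # will be close to baseline 2 (0.6306173)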
As for deviance: in the case of lm, we defined deviance as the sum of squared residuals, and we said we want to reduce the deviance of our model relative to a null model by adding (hopefully good) predictors. What is deviance here, in the case of a glm? To understand that, we first compute a null model:
summary(m.00 <- glm(GENITIVE ~ 1, family=binomial, data=d, na.action=na.exclude))
##
## Call:
## glm(formula = GENITIVE ~ 1, family = binomial, data = d, na.action = na.exclude)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.12847 0.03878 -29.1 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 4004.3 on 3599 degrees of freedom
## Residual deviance: 4004.3 on 3599 degrees of freedom
## AIC: 4006.3
##
## Number of Fisher Scoring iterations: 4
deviance(m.00)
## [1] 4004.273
sum(residuals(m.00)^2)
## [1] 4004.273
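A small technical aside: for a glm, residuals() returns so-called deviance residuals by default, so the line above is short for the following (which returns the same value):
sum(residuals(m.00, type="deviance")^2) # deviance residuals are the default for glms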
We see that the deviance of m.00 is still the sum of this model’s squared residuals; it is also the model’s “null deviance” (and, because this model is a null model, also what is called its “residual deviance”). But what are squared residuals here? We will discuss that in more detail after we have fitted our first ‘real’ model …
Does the choice of a genitive construction vary as a function of the modality in which the construction was produced (spoken vs. written)?
Some exploration of the relevant variables:
# the predictor(s)/response on its/their own
table(d$GENITIVE)
##
## of s
## 2720 880
table(d$MODALITY)
##
## spoken written
## 1685 1915
# the predictor(s) w/ the response
(mod_x_gen <- table(d$MODALITY, d$GENITIVE))
##
## of s
## spoken 1232 453
## written 1488 427
How would we have dealt with this in Ling 201? First, we would have computed a chi-squared test like this, where the chi-squared value is computed from the differences between observed and expected values:
chisq.test(mod_x_gen, correct=FALSE)
##
## Pearson's Chi-squared test
##
## data: mod_x_gen
## X-squared = 10.21, df = 1, p-value = 0.001397
Second, we might have computed the odds ratio and the logged odds ratio for the s-genitives:
(odds.ratio <-
(mod_x_gen[2,2] / mod_x_gen[2,1]) /
(mod_x_gen[1,2] / mod_x_gen[1,1]))
## [1] 0.7804363
log(odds.ratio)
## [1] -0.2479022
How do we do this now? With a binary logistic regression, which returns the G²-statistic, which is computed from ratios of observed and expected values (not from differences, as the chi-squared value is), and which also returns the (logged) odds ratio. Let’s fit our regression model:
summary(m.01 <- glm( # make/summarize the gen. linear model m.01:
GENITIVE ~ 1 + # GENITIVE ~ an overall intercept (1)
MODALITY, # & the predictor MODALITY
family=binomial, # resp = binary
data=d, # those vars are in d
na.action=na.exclude)) # skip cases with NA/missing data
##
## Call:
## glm(formula = GENITIVE ~ 1 + MODALITY, family = binomial, data = d,
## na.action = na.exclude)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.00050 0.05495 -18.208 < 2e-16 ***
## MODALITYwritten -0.24790 0.07767 -3.192 0.00141 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 4004.3 on 3599 degrees of freedom
## Residual deviance: 3994.1 on 3598 degrees of freedom
## AIC: 3998.1
##
## Number of Fisher Scoring iterations: 4
pchisq( # compute the area under the chi-squared curve
q=4004.3-3994.1, # for this chi-squared value: null - residual deviance
df=3599-3598, # at this df: null - residual df
lower.tail=FALSE) # only using the right/upper tail/side
## [1] 0.001404407
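Instead of typing in the rounded deviance and df values by hand, we could also pull the exact values from the fitted model object; the following is just a convenience sketch and returns (essentially) the same p-value as the drop1 call below:
pchisq( # same test, but with the exact values stored in m.01:
   q=m.01$null.deviance-m.01$deviance, # null - residual deviance
   df=m.01$df.null-m.01$df.residual, # null - residual df
   lower.tail=FALSE) # only using the right/upper tail/side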
drop1(m.01, # drop each droppable predictor at a time from m.01 &
test="Chisq") # test its significance w/ a LRT
## Single term deletions
##
## Model:
## GENITIVE ~ 1 + MODALITY
## Df Deviance AIC LRT Pr(>Chi)
## <none> 3994.1 3998.1
## MODALITY 1 4004.3 4006.3 10.194 0.001409 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Confint(m.01) # confidence intervals
## Estimate 2.5 % 97.5 %
## (Intercept) -1.0005020 -1.1091198 -0.89367488
## MODALITYwritten -0.2479022 -0.4002591 -0.09571937
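Note that the coefficient for MODALITYwritten (-0.2479022) is exactly the logged odds ratio we computed above. And if you already want to see how the coefficients translate into predicted probabilities (anticipating the predictions and the effects display below), here is a minimal sketch using plogis, the inverse logit function:
plogis(coef(m.01)[1]) # pred. prob. of "s" for spoken (the intercept): 0.2688427
plogis(coef(m.01)[1]+coef(m.01)[2]) # pred. prob. of "s" for written: 0.2229765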
(Note how similar the chi-squared value and the G²-statistic and their p-values are.) Let’s make ‘predictions’:
d$PREDS.NUM.2 <- predict( # make d$PREDS.NUM.2 the result of predicting
m.01, # from m.01
type="response") # predicted probabilities of "s"
# if you don't say type="response", predict gives you log odds:
unique(predict(m.01))
## [1] -1.000502 -1.248404
d$PREDS.CAT <- factor( # make d$PREDS.CAT the result of checking
ifelse(d$PREDS.NUM.2>=0.5, # if the predicted prob. of "s" is >=0.5
levels(d$GENITIVE)[2], # if yes, predict "s"
levels(d$GENITIVE)[1])) # otherwise, predict "of"
Actually, the much shorter though less intuitive version of
generating d$PREDS.CAT
is this:
d$PREDS.CAT <- factor(levels(d$GENITIVE)[1+(d$PREDS.NUM.2>=0.5)])
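(Why does this work? In arithmetic, TRUE and FALSE are coerced to 1 and 0, so 1+(d$PREDS.NUM.2>=0.5) evaluates to 1, pointing to the first level, of, or 2, pointing to the second level, s; a tiny illustration with two made-up predicted probabilities:)
1 + (c(0.27, 0.63)>=0.5) # returns 1 2
levels(d$GENITIVE)[1 + (c(0.27, 0.63)>=0.5)] # returns "of" "s"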
On top of that, let’s also add something else, namely a column called PREDS.NUM.obs that contains, for each case, the predicted probability of the level that was observed in the data, i.e. what the model should have predicted. How does that work? Consider the head of d:
head(d)
## CASE GENITIVE SPEAKER MODALITY POR_LENGTH PUM_LENGTH POR_ANIMACY
## 1 2 of nns spoken 13 7 collective
## 2 3 of nns spoken 22 7 animate
## 3 4 of nns spoken 11 8 animate
## 4 5 of nns spoken 26 4 collective
## 5 6 s nns spoken 8 4 animate
## 6 7 s nns spoken 7 3 animate
## POR_FINAL_SIB POR_DEF PREDS.NUM.2 PREDS.CAT
## 1 absent definite 0.2688427 of
## 2 absent definite 0.2688427 of
## 3 present definite 0.2688427 of
## 4 absent definite 0.2688427 of
## 5 absent definite 0.2688427 of
## 6 absent definite 0.2688427 of
In the first case, the model predicted s with a probability of 0.2688427 and, thus, of with a probability of 1-0.2688427=0.7311573. The actual observation, what the speaker really said, is of, so PREDS.NUM.obs[1] should be 0.7311573. In the fifth case, the model again predicted s with a probability of 0.2688427 and, thus, of with a probability of 1-0.2688427=0.7311573. But the actual observation, what the speaker really said, is s, so PREDS.NUM.obs[5] should be 0.2688427. How do we do that efficiently for all cases (and, in the spirit of Section 3.5.2, efficiently means ‘without a loop’)?
d$PREDS.NUM.obs <- ifelse( # d$PREDS.NUM.obs is determined by ifelse
d$GENITIVE=="s", # if the obs genitive is the 2nd level of the response
d$PREDS.NUM.2, # take its predicted probability
1-d$PREDS.NUM.2) # otherwise take 1 minus its predicted probability
head(d)
## CASE GENITIVE SPEAKER MODALITY POR_LENGTH PUM_LENGTH POR_ANIMACY
## 1 2 of nns spoken 13 7 collective
## 2 3 of nns spoken 22 7 animate
## 3 4 of nns spoken 11 8 animate
## 4 5 of nns spoken 26 4 collective
## 5 6 s nns spoken 8 4 animate
## 6 7 s nns spoken 7 3 animate
## POR_FINAL_SIB POR_DEF PREDS.NUM.2 PREDS.CAT PREDS.NUM.obs
## 1 absent definite 0.2688427 of 0.7311573
## 2 absent definite 0.2688427 of 0.7311573
## 3 present definite 0.2688427 of 0.7311573
## 4 absent definite 0.2688427 of 0.7311573
## 5 absent definite 0.2688427 of 0.2688427
## 6 absent definite 0.2688427 of 0.2688427
Here, too, a shorter but much less intuitive version is available; if you have time to waste, try to find it yourself first, but consider yourself warned: for some weird reason, it’s a bit of a brain teaser.
d$PREDS.NUM.obs <- abs(d$PREDS.NUM.2 - as.numeric(d$GENITIVE!="s"))
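(In case you gave up: if the observed genitive is s, then d$GENITIVE!="s" is FALSE, i.e. 0, and abs(p-0) is just p; if it is of, the comparison is TRUE, i.e. 1, and abs(p-1) is 1-p. With a made-up predicted probability of 0.27:)
abs(0.27-0) # observed "s": we keep the predicted probability of "s" (0.27)
abs(0.27-1) # observed "of": we get 1 minus the predicted probability of "s" (0.73)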
We’ll get to why this is useful in a moment … But let’s now evaluate the predictive capacity of the model:
(c.m <- table( # confusion matrix: cross-tabulate
"OBS" =d$GENITIVE, # observed genitives in the rows
"PREDS"=d$PREDS.CAT)) # predicted orders in the columns
## PREDS
## OBS of
## of 2720
## s 880
# evaluating the confusion matrix here in the usual way doesn't make much sense,
# which is why we discuss that in the next section
c("Nagelkerke's R2"=R2(m.01), # R-squared &
"C" =C.score(m.01)) # C
## Nagelkerke's R2 C
## 0.004212717 0.530915775
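The functions R2 and C.score come from the two helper files we sourced at the beginning, whose code is not shown here. As a rough sketch of what they presumably compute (the actual helper code may differ), Nagelkerke’s R² can be derived from the null and residual deviance, and C is the proportion of pairs of one s case and one of case in which the s case received the higher predicted probability of s (ties counting one half):
N <- nobs(m.01) # the number of cases the model was fit to
r2.coxsnell <- 1-exp((m.01$deviance-m.01$null.deviance)/N) # Cox & Snell's R-squared
r2.coxsnell / (1-exp(-m.01$null.deviance/N)) # Nagelkerke's R-squared: ~0.0042
probs.s <- d$PREDS.NUM.2[d$GENITIVE=="s"] # predicted probs of "s" for the s cases
probs.of <- d$PREDS.NUM.2[d$GENITIVE=="of"] # predicted probs of "s" for the of cases
cmp <- outer(probs.s, probs.of, "-") # all pairwise differences
mean((cmp>0) + 0.5*(cmp==0)) # concordant pairs plus half the ties: C ~0.531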
Let’s visualize the nature of the effect of MODALITY
based on the predictions:
(mo.d <- data.frame( # make mo.d a data frame of
mo <- effect( # the effect
"MODALITY", # of MODALITY
m.01))) # in m.01
## MODALITY fit se lower upper
## 1 spoken 0.2688427 0.010800758 0.2482073 0.2905308
## 2 written 0.2229765 0.009511784 0.2048903 0.2421729
plot(mo, # plot the effect of MODALITY
type="response", # but plot predicted probabilities
ylim=c(0, 1), # ylim: the whole range of possible probabilities
grid=TRUE) # & w/ a grid
We saw above that the deviance of a glm is again the sum of squared residuals, which means we again want to reduce the deviance of our model relative to a null model by adding (hopefully good) predictors. But what exactly is deviance here, in the case of a glm, and how is it computed? To understand that, let’s look at the predictions of m.01 again:
head(d[,c(4,10:12)])
## MODALITY PREDS.NUM.2 PREDS.CAT PREDS.NUM.obs
## 1 spoken 0.2688427 of 0.7311573
## 2 spoken 0.2688427 of 0.7311573
## 3 spoken 0.2688427 of 0.7311573
## 4 spoken 0.2688427 of 0.7311573
## 5 spoken 0.2688427 of 0.2688427
## 6 spoken 0.2688427 of 0.2688427
What would the best model ever have returned for these 6 cases? Something like this, where the model predicts the observed outcome with near-certainty in every case:
## GENITIVE PREDS.NUM.2 PREDS.CAT PREDS.NUM.obs
## 1 of 0.00001 of 0.99999
## 2 of 0.00001 of 0.99999
## 3 of 0.00001 of 0.99999
## 4 of 0.00001 of 0.99999
## 5 s 0.99999 s 0.99999
## 6 s 0.99999 s 0.99999
That means we could use 1-PREDS.NUM.obs to quantify how far off from a perfect prediction the model is, but there is an alternative: we could take the negative logs of PREDS.NUM.obs, because then:
-log(0.8)
## [1] 0.2231436
-log(0.9)
## [1] 0.1053605
-log(0.95)
## [1] 0.05129329
-log(0.2)
## [1] 1.609438
-log(0.1)
## [1] 2.302585
-log(0.05)
## [1] 2.995732
I am calling these values contributions to logloss (because the mean of all those values is an evaluation metric called logloss): the closer the predicted probability of the observed outcome is to 1, the closer its contribution to logloss is to 0. But what does this have to do with deviance? This:
deviance(m.01) # same as
## [1] 3994.079
# the sum of all contributions to logloss times 2
sum(-log(d$PREDS.NUM.obs)) * 2
## [1] 3994.079
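Equivalently, and this is a general property of binary logistic regression models fitted to individual 0/1 observations rather than something specific to this data set, the residual deviance is -2 times the model’s log-likelihood:
-2*as.numeric(logLik(m.01)) # again 3994.079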
That means that the deviance of such a glm, what the summary calls “residual deviance”, expresses how far all the predicted probabilities of the events that should have been predicted (i.e. PREDS.NUM.obs) are from their theoretical ideal of (very near to) 1! That in turn means we can again express, in percent, how much the deviance of the null model, the null deviance, is reduced by the predictors:
(m.00$deviance -m.01$deviance) / m.00$deviance # same as
# (m.01$null.deviance-m.01$deviance) / m.01$null.deviance
## [1] 0.002545694
This is also an R²-value: not the one that is used most often and that we computed above, Nagelkerke’s R², but one that is called McFadden’s R².
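For completeness, here is a small sketch of McFadden’s R² computed from the two models’ log-likelihoods, which is equivalent to the deviance-based version above because, as we just saw, deviance is -2 times the log-likelihood here:
1 - as.numeric(logLik(m.01))/as.numeric(logLik(m.00)) # McFadden's R-squared: ~0.00255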
To determine whether the choice of a genitive varies as a function of the modality (spoken vs. written), a generalized linear model was fit with GENITIVE as the response variable and MODALITY as a predictor. The model was highly significant (LR=10.19, df=1, p<0.002) but explains very little of the response variable (Nagelkerke R²=0.004, McFadden’s R²=0.003, C=0.53) and has no predictive power: the model only ever predicts of-genitives. The coefficients indicate that s-genitives are more likely in spoken data than in written data, as shown in the summary table here. [Show a plot]
| | Estimate | 95%-CI | se | z | p (2-tailed) |
|---|---|---|---|---|---|
| Intercept (MODALITY=spoken) | -1.00 | [-1.11, -0.89] | 0.05 | -18.21 | <0.001 |
| MODALITYspoken→written | -0.25 | [-0.40, -0.10] | 0.08 | -3.19 | <0.002 |
Does the choice of a genitive construction vary as a function of the degree/kind of animacy of the possessor (animate vs. collective vs. inanimate vs. locative vs. temporal)?
Some exploration of the relevant variables:
# the predictor(s)/response on its/their own
table(d$GENITIVE)
##
## of s
## 2720 880
table(d$POR_ANIMACY)
##
## animate collective inanimate locative temporal
## 920 607 1671 243 159
# the predictor(s) w/ the response
table(d$POR_ANIMACY, d$GENITIVE)
##
## of s
## animate 370 550
## collective 408 199
## inanimate 1638 33
## locative 199 44
## temporal 105 54
Let’s fit our regression model:
summary(m.01 <- glm( # make/summarize the gen. linear model m.01:
GENITIVE ~ 1 + # GENITIVE ~ an overall intercept (1)
POR_ANIMACY, # & the predictor POR_ANIMACY
family=binomial, # resp = binary
data=d, # those vars are in d
na.action=na.exclude)) # skip cases with NA/missing data
##
## Call:
## glm(formula = GENITIVE ~ 1 + POR_ANIMACY, family = binomial,
## data = d, na.action = na.exclude)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.39642 0.06724 5.896 3.73e-09 ***
## POR_ANIMACYcollective -1.11438 0.10953 -10.174 < 2e-16 ***
## POR_ANIMACYinanimate -4.30114 0.18823 -22.851 < 2e-16 ***
## POR_ANIMACYlocative -1.90553 0.17965 -10.607 < 2e-16 ***
## POR_ANIMACYtemporal -1.06139 0.18045 -5.882 4.06e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 4004.3 on 3599 degrees of freedom
## Residual deviance: 2766.0 on 3595 degrees of freedom
## AIC: 2776
##
## Number of Fisher Scoring iterations: 6
pchisq( # compute the area under the chi-squared curve
q=4004.3-2766, # for this chi-squared value: null - residual deviance
df=3599-3595, # at this df: null - residual df
lower.tail=FALSE) # only using the right/upper tail/side
## [1] 7.926259e-267
drop1(m.01, # drop each droppable predictor at a time from m.01 &
test="Chisq") # test its significance w/ a LRT
## Single term deletions
##
## Model:
## GENITIVE ~ 1 + POR_ANIMACY
## Df Deviance AIC LRT Pr(>Chi)
## <none> 2766.0 2776.0
## POR_ANIMACY 4 4004.3 4006.3 1238.3 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Confint(m.01) # confidence intervals
## Estimate 2.5 % 97.5 %
## (Intercept) 0.3964153 0.2651511 0.5288144
## POR_ANIMACYcollective -1.1143776 -1.3303631 -0.9008787
## POR_ANIMACYinanimate -4.3011390 -4.6879154 -3.9477968
## POR_ANIMACYlocative -1.9055305 -2.2682451 -1.5625579
## POR_ANIMACYtemporal -1.0613916 -1.4207695 -0.7121414
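As a quick sanity check of how these treatment-contrast coefficients translate into predicted probabilities (anticipating the effects display below), we can add the relevant coefficient to the intercept and apply the inverse logit function plogis:
plogis(coef(m.01)["(Intercept)"]) # pred. prob. of "s" for animate possessors: ~0.598
plogis(coef(m.01)["(Intercept)"]+ # pred. prob. of "s" for inanimate possessors:
   coef(m.01)["POR_ANIMACYinanimate"]) # ~0.02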
Let’s make (numerical and categorical) ‘predictions’ and evaluate them:
d$PREDS.NUM.2 <- predict( # make d$PREDS.NUM.2 the result of predicting
m.01, # from m.01
type="response") # predicted probabilities of "s"
# if you don't say type="response", predict gives you log odds
d$PREDS.CAT <- factor( # make d$PREDS.CAT the result of checking
ifelse(d$PREDS.NUM.2>=0.5, # if the predicted prob. of "s" is >=0.5
levels(d$GENITIVE)[2], # if yes, predict "s"
levels(d$GENITIVE)[1])) # otherwise, predict "of"
(c.m <- table( # confusion matrix: cross-tabulate
"OBS" =d$GENITIVE, # observed genitives in the rows
"PREDS"=d$PREDS.CAT)) # predicted orders in the columns
## PREDS
## OBS of s
## of 2350 370
## s 330 550
c( # evaluate the confusion matrix
"Prec. for s" =c.m[ "s", "s"] / sum(c.m[ , "s"]),
"Acc./rec. for s" =c.m[ "s", "s"] / sum(c.m[ "s", ]),
"Prec. for of" =c.m["of","of"] / sum(c.m[ ,"of"]),
"Acc./rec. for of"=c.m["of","of"] / sum(c.m["of", ]),
"Acc. (overall)" =mean(d$GENITIVE==d$PREDS.CAT))
## Prec. for s Acc./rec. for s Prec. for of Acc./rec. for of
## 0.5978261 0.6250000 0.8768657 0.8639706
## Acc. (overall)
## 0.8055556
c("Nagelkerke's R2"=R2(m.01),
"McFadden's R2" =(m.00$deviance-m.01$deviance) / m.00$deviance,
"C" =C.score(m.01)) # R-squared values & C
## Nagelkerke's R2 McFadden's R2 C
## 0.4336234 0.3092390 0.8472389
Again, we also add our column PREDS.NUM.obs
with, for
each case, the predicted probability of the level that was observed in
the data:
d$PREDS.NUM.obs <- ifelse( # d$PREDS.NUM.obs is determined by ifelse
d$GENITIVE=="s", # if the obs genitive is the 2nd level of the response
d$PREDS.NUM.2, # take its predicted probability
1-d$PREDS.NUM.2) # otherwise take 1 minus its predicted probability
head(d)
## CASE GENITIVE SPEAKER MODALITY POR_LENGTH PUM_LENGTH POR_ANIMACY
## 1 2 of nns spoken 13 7 collective
## 2 3 of nns spoken 22 7 animate
## 3 4 of nns spoken 11 8 animate
## 4 5 of nns spoken 26 4 collective
## 5 6 s nns spoken 8 4 animate
## 6 7 s nns spoken 7 3 animate
## POR_FINAL_SIB POR_DEF PREDS.NUM.2 PREDS.CAT PREDS.NUM.obs
## 1 absent definite 0.3278418 of 0.6721582
## 2 absent definite 0.5978261 s 0.4021739
## 3 present definite 0.5978261 s 0.4021739
## 4 absent definite 0.3278418 of 0.6721582
## 5 absent definite 0.5978261 s 0.5978261
## 6 absent definite 0.5978261 s 0.5978261
Let’s visualize the nature of the effect of POR_ANIMACY
based on the predictions:
(an.d <- data.frame( # make an.d a data frame of
an <- effect( # the effect
"POR_ANIMACY", # of POR_ANIMACY
m.01))) # in m.01
## POR_ANIMACY fit se lower upper
## 1 animate 0.59782609 0.016165922 0.56577463 0.62906282
## 2 collective 0.32784185 0.019053448 0.29164055 0.36621363
## 3 inanimate 0.01974865 0.003403457 0.01407325 0.02764863
## 4 locative 0.18106996 0.024702646 0.13756935 0.23458435
## 5 temporal 0.33962264 0.037557428 0.27028269 0.41659580
plot(an, # plot the effect of POR_ANIMACY
type="response", # but plot predicted probabilities
ylim=c(0, 1), # ylim: the whole range of possible probabilities
grid=TRUE) # & w/ a grid
The way the contributions to logloss and the deviance are computed does not change just because we now have (a) different predictor(s): the deviance is still the sum of all contributions to logloss times 2:
deviance(m.01)
## [1] 2765.995
sum(-log(d$PREDS.NUM.obs)) * 2
## [1] 2765.995
m.01$deviance
## [1] 2765.995
Given a lot of previous literature on the genitive alternation, you
might actually have hypothesized that the only thing one needs is a
distinction between animate (and maybe collective) and all inanimate
possessors. Let’s use model comparison to check this. First, we create a
version of POR_ANIMACY
that conflates levels as
desired:
d$POR_ANIMACY.confl <- d$POR_ANIMACY # create a copy of POR_ANIMACY
levels(d$POR_ANIMACY.confl) <- # overwrite the copy's levels such that
c("anim/coll", "anim/coll", # "animate" and "collective" become "anim/coll"
"inanim/loc/temp", "inanim/loc/temp", "inanim/loc/temp") # all others become "inanim/loc/temp"
table(POR_ANIMACY=d$POR_ANIMACY, POR_ANIMACY.confl=d$POR_ANIMACY.confl) # check you did it right
## POR_ANIMACY.confl
## POR_ANIMACY anim/coll inanim/loc/temp
## animate 920 0
## collective 607 0
## inanimate 0 1671
## locative 0 243
## temporal 0 159
head(d <- d[,c(1:7,13,8:12)])
## CASE GENITIVE SPEAKER MODALITY POR_LENGTH PUM_LENGTH POR_ANIMACY
## 1 2 of nns spoken 13 7 collective
## 2 3 of nns spoken 22 7 animate
## 3 4 of nns spoken 11 8 animate
## 4 5 of nns spoken 26 4 collective
## 5 6 s nns spoken 8 4 animate
## 6 7 s nns spoken 7 3 animate
## POR_ANIMACY.confl POR_FINAL_SIB POR_DEF PREDS.NUM.2 PREDS.CAT PREDS.NUM.obs
## 1 anim/coll absent definite 0.3278418 of 0.6721582
## 2 anim/coll absent definite 0.5978261 s 0.4021739
## 3 anim/coll present definite 0.5978261 s 0.4021739
## 4 anim/coll absent definite 0.3278418 of 0.6721582
## 5 anim/coll absent definite 0.5978261 s 0.5978261
## 6 anim/coll absent definite 0.5978261 s 0.5978261
We already have the model with POR_ANIMACY
as the only
predictor so we only need to fit a new model with
POR_ANIMACY.confl
as the only predictor:
summary(m.01.animconfl <- glm( # make/summarize the gen. linear model m.01.animconfl:
GENITIVE ~ 1 + POR_ANIMACY.confl, # GENITIVE ~ an overall intercept (1) & POR_ANIMACY.confl
family=binomial, data=d, na.action=na.exclude)) # the other arguments
##
## Call:
## glm(formula = GENITIVE ~ 1 + POR_ANIMACY.confl, family = binomial,
## data = d, na.action = na.exclude)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.03799 0.05119 -0.742 0.458
## POR_ANIMACY.conflinanim/loc/temp -2.65829 0.10377 -25.616 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 4004.3 on 3599 degrees of freedom
## Residual deviance: 3093.4 on 3598 degrees of freedom
## AIC: 3097.4
##
## Number of Fisher Scoring iterations: 5
… and then we do a comparison with anova
just like
above, the only difference being the argument to test
(because we’re now fitting a generalized linear model).
anova(m.01, # compare the model w/ the 5-level predictor POR_ANIMACY
m.01.animconfl, # & the model w/ the 2-level predictor POR_ANIMACY.confl
test="Chisq") # using a LRT
## Analysis of Deviance Table
##
## Model 1: GENITIVE ~ 1 + POR_ANIMACY
## Model 2: GENITIVE ~ 1 + POR_ANIMACY.confl
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 3595 2766.0
## 2 3598 3093.4 -3 -327.39 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The difference between the two models is hugely significant.
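If you prefer an information-criterion perspective on the same comparison, you can also just compare the two models’ AIC values (merely a complementary check, not a replacement for the likelihood ratio test):
AIC(m.01, m.01.animconfl) # the 5-level model has the clearly lower AIC (2776 vs. 3097.4)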
To determine whether the choice of a genitive varies as a function of
the degree/kind of animacy of the possessor (animate
vs. collective vs. inanimate vs. locative
vs. temporal), a generalized linear model was fit with
GENITIVE
as the response variable and
POR_ANIMACY
as a predictor. The model was highly
significant (LR=1238.3, df=4, p<0.001) and explains a decent amount of the response variable (Nagelkerke R²=0.434, McFadden’s R²=0.309, C=0.85) with a good amount of predictive power: [list precision
& accuracy/recall scores for both genitives]. The summary table of
the model is shown here.
| | Estimate | 95%-CI | se | z | p (2-tailed) |
|---|---|---|---|---|---|
| Intercept (POR_ANIMACY=animate) | 0.40 | [0.27, 0.53] | 0.07 | 5.90 | <0.001 |
| POR_ANIMACYanimate→collective | -1.11 | [-1.33, -0.90] | 0.11 | -10.17 | <0.002 |
| POR_ANIMACYanimate→inanimate | -4.30 | [-4.69, -3.95] | 0.19 | -22.85 | <0.002 |
| POR_ANIMACYanimate→locative | -1.91 | [-2.27, -1.56] | 0.18 | -10.61 | <0.002 |
| POR_ANIMACYanimate→temporal | -1.06 | [-1.42, -0.71] | 0.18 | -5.88 | <0.002 |
The model predicts s-genitives only for animate possessors and of-genitives for all others, with the predicted probability of the s-genitive decreasing from collective/temporal possessors to locative possessors to inanimate possessors, for which of-genitives are predicted nearly exclusively. [Show
a plot] A post-hoc comparison of the above model with one in which
POR_ANIMACY
was reduced to 2 levels
(animate/collective vs. inanimate/locative/temporal)
showed that this reduced predictor is highly significantly worse
(LR=327.39, p<0.0001).