1 Intro
We are dealing with the same data set as in the sessions on binary logistic regression (_input/genitives.csv, see _input/genitives.r); specifically, we are asking whether the choice of a genitive construction (of vs. s) varies as a function of
all the predictors that are already part of the data frame, i.e.
the categorical predictors SPEAKER, MODALITY, POR_ANIMACY, POR_FINAL_SIB, POR_DEF;
the numeric predictors POR_LENGTH and PUM_LENGTH;
an additional new length-based predictor, namely how the length of the possessor POR_LENGTH compares to the length of the possessum (PUM_LENGTH) (expressed as a difference); since such a comparison variable doesn’t exist yet in our data set, we need to create it first.
rm(list=ls(all.names=TRUE))
library(magrittr); library(tree)
summary(d <- read.delim(        # summarize d, the result of loading
   file="_input/genitives.csv", # this file
   stringsAsFactors=TRUE))      # change categorical variables into factors
d$LEN_PORmPUM_LOG <- log2(d$POR_LENGTH) - log2(d$PUM_LENGTH)
CASE GENITIVE SPEAKER MODALITY POR_LENGTH
Min. : 2 of:2720 nns:2666 spoken :1685 Min. : 1.00
1st Qu.:1006 s : 880 ns : 934 written:1915 1st Qu.: 8.00
Median :2018 Median : 11.00
Mean :2012 Mean : 14.58
3rd Qu.:3017 3rd Qu.: 17.00
Max. :4040 Max. :204.00
PUM_LENGTH POR_ANIMACY POR_FINAL_SIB POR_DEF
Min. : 2.00 animate : 920 absent :2721 definite :2349
1st Qu.: 6.00 collective: 607 present: 879 indefinite:1251
Median : 9.00 inanimate :1671
Mean : 10.35 locative : 243
3rd Qu.: 13.00 temporal : 159
Max. :109.00
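As a quick sanity check of the length-comparison predictor: LEN_PORmPUM_LOG is the possessor's length minus the possessum's length on a base-2 log scale, so for a possessor of 13 characters and a possessum of 7 characters (row 1 of the data, shown further below) it should come out at about 0.893. A minimal check:

```r
# recompute LEN_PORmPUM_LOG for possessor length 13 & possessum length 7:
log2(13) - log2(7) # positive value = possessor longer than possessum
```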
2 Deviance & baseline(s)
Let's begin by computing the baselines for what will be the response variable, GENITIVE:
(baselines <- c(
   "baseline 1"=max(          # make baselines[1] the highest
      prop.table(             # proportion in the
      table(d$GENITIVE))),    # frequency table of the response
   "baseline 2"=sum(          # make baselines[2] the sum of the
      prop.table(             # proportions in the
      table(d$GENITIVE))^2))) # frequency table of the response squared
baseline 1 baseline 2
0.7555556 0.6306173
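These two baselines can be recomputed by hand from the frequencies in the summary above (2720 of vs. 880 s); a minimal sketch:

```r
counts <- c(of=2720, s=880) # frequencies of GENITIVE from the summary above
p <- prop.table(counts)     # proportions: 0.756 and 0.244
max(p)                      # baseline 1: accuracy of always guessing "of"
sum(p^2)                    # baseline 2: accuracy of guessing proportionally
```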
Let's also compute the deviance of a null tree, which is the same as the deviance of the null glm, because both just look at the distribution of the response variable:

deviance(cart.0 <- tree(GENITIVE ~ 1, data=d))
deviance(glm(GENITIVE ~ 1, family=binomial, data=d))

[1] 4004.273
[1] 4004.273

3 Fitting trees & interpretation
3.1 A regular tree
How about we fit a 'regular' classification tree?

summary(cart.1 <- tree( # summarize an object called cart.1
   GENITIVE ~           # a classification tree of GENITIVE
   POR_LENGTH + PUM_LENGTH + LEN_PORmPUM_LOG +                 # numeric predictors
   POR_ANIMACY + SPEAKER + MODALITY + POR_FINAL_SIB + POR_DEF, # categorical predictors
   data=d))             # contains deviance as well

What are the (harder to interpret) exact results?

cart.1
cart.1$frame
var n dev yval splits.cutleft splits.cutright
1 POR_ANIMACY 3600 4004.2730 of :c :abde
2 <leaf> 1671 324.3722 of
3 POR_LENGTH 1929 2645.4618 of <13.5 >13.5
6 POR_ANIMACY 1380 1906.6784 s :bde :a
12 LEN_PORmPUM_LOG 705 944.7437 of <-1.11548 >-1.11548
24 <leaf> 175 212.9484 s
25 <leaf> 530 638.8172 of
13 <leaf> 675 844.7594 s
7 POR_ANIMACY 549 549.9959 of :bde :a
14 <leaf> 304 147.5062 of
15 LEN_PORmPUM_LOG 245 322.1890 of <0.864615 >0.864615
30 <leaf> 87 112.0883 s
31 <leaf> 158 161.9321 of
yprob.of yprob.s
1 0.75555556 0.24444444
2 0.98025135 0.01974865
3 0.56091239 0.43908761
6 0.46594203 0.53405797
12 0.60709220 0.39290780
24 0.29714286 0.70285714
25 0.70943396 0.29056604
13 0.31851852 0.68148148
7 0.79963570 0.20036430
14 0.93421053 0.06578947
15 0.63265306 0.36734694
30 0.34482759 0.65517241
31 0.79113924 0.20886076
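One way to make sense of the dev column: each node's deviance is the binomial deviance -2 * sum(n_k * log(p_k)) computed from that node's class counts. A small check for node 2, the inanimate leaf (n=1671, yprob ≈ 0.9803 vs. 0.0197, i.e. 1638 of vs. 33 s):

```r
n.node2 <- c(of=1638, s=33)  # class counts in node 2 (1671 cases in total)
-2 * sum(n.node2 * log(n.node2/sum(n.node2))) # reproduces node 2's dev, ~324.37
```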
But we want to plot this so we can try and interpret it:
plot(cart.1); grid()             # plot the classification tree
axis(2); mtext("Deviance", 2, 3) # add a useful y-axis
text(cart.1, pretty=4, all=TRUE) # add abbrev. labels to it
How well does the tree do in terms of its predictions?
d$PREDS.PP.2 <-  # make d$PREDS.PP.2 the result of
   predict(      # predicting
   cart.1        # from cart.1
   )[,"s"]       # predicted probabilities of "s"
d$PREDS.CAT <- predict( # make d$PREDS.CAT the result of predicting
   cart.1,              # from cart.1
   type="class")        # categorical predictions
(c.m <- table(           # confusion matrix: cross-tabulate
   "OBS"  =d$GENITIVE,   # observed genitives in the rows
   "PREDS"=d$PREDS.CAT)) # predicted genitives in the columns
PREDS
OBS of s
of 2423 297
s 240 640
c( # evaluate the confusion matrix
   "Prec. for s"     =c.m[ "s", "s"] / sum(c.m[    , "s"]),
   "Acc./rec. for s" =c.m[ "s", "s"] / sum(c.m[ "s",   ]),
   "Prec. for of"    =c.m["of","of"] / sum(c.m[    ,"of"]),
   "Acc./rec. for of"=c.m["of","of"] / sum(c.m["of",   ]),
   "Acc. (overall)"  =mean(d$GENITIVE==d$PREDS.CAT))
Prec. for s Acc./rec. for s Prec. for of Acc./rec. for of
0.6830309 0.7272727 0.9098761 0.8908088
Acc. (overall)
0.8508333
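As a sketch of where these numbers come from, a few of the scores can be recomputed by hand from the four cells of the confusion matrix printed above:

```r
c.m <- matrix(c(2423, 240, 297, 640), nrow=2, # the confusion matrix from above
   dimnames=list(OBS=c("of","s"), PREDS=c("of","s")))
c("Prec. for s"    =c.m["s","s"]/sum(c.m[,"s"]), # 640/937
  "Acc./rec. for s"=c.m["s","s"]/sum(c.m["s",]), # 640/880
  "Acc. (overall)" =sum(diag(c.m))/sum(c.m))     # 3063/3600
```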
Note that this means you could compute R2-values even for trees (even if I don't ever see that being done; I don't know why (not)):

(deviance(cart.0)-deviance(cart.1)) / deviance(cart.0) # McFadden's R2

Of course we can also compute Cohen's κ and C:

c("Cohen's kappa"=cohens.kappa(c.m)[[1]],
  "C-score"=C.score(d$GENITIVE, d$PREDS.PP.2))
We can also add the column for the logloss/deviance calculations:
d$PREDS.PP.obs <- ifelse( # d$PREDS.PP.obs is determined by ifelse
   d$GENITIVE=="s",       # if the obs genitive is the 2nd level of the response
   d$PREDS.PP.2,          # take its predicted probability
   1-d$PREDS.PP.2)        # otherwise take 1 minus its predicted probability
head(d)
CASE GENITIVE SPEAKER MODALITY POR_LENGTH PUM_LENGTH POR_ANIMACY
1 2 of nns spoken 13 7 collective
2 3 of nns spoken 22 7 animate
3 4 of nns spoken 11 8 animate
4 5 of nns spoken 26 4 collective
5 6 s nns spoken 8 4 animate
6 7 s nns spoken 7 3 animate
POR_FINAL_SIB POR_DEF LEN_PORmPUM_LOG PREDS.PP.2 PREDS.CAT PREDS.PP.obs
1 absent definite 0.8930848 0.29056604 of 0.7094340
2 absent definite 1.6520767 0.20886076 of 0.7911392
3 present definite 0.4594316 0.68148148 s 0.3185185
4 absent definite 2.7004397 0.06578947 of 0.9342105
5 absent definite 1.0000000 0.68148148 s 0.6814815
6 absent definite 1.2223924 0.68148148 s 0.6814815
Can we still get to the deviance of the tree from the contributions to logloss?
sum(-log(d$PREDS.PP.obs)) * 2 # yes, that is deviance(cart.1)
[1] 2442.424
mean(-log(d$PREDS.PP.obs)) # thus, this is the logloss for cart.1
[1] 0.3392255
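In other words, logloss is just the deviance rescaled by 2n: with n=3600 cases, the two numbers above are related as follows (a minimal arithmetic check):

```r
dev.cart.1 <- 2442.424 # deviance of cart.1 from above
n <- 3600              # number of cases in d
dev.cart.1 / (2*n)     # mean negative log-likelihood, i.e. the logloss ~0.339
```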
3.2 Is pruning necessary?
Should we, or do we need to, prune this tree?
# because cross-validation involves randomization, we ...
set.seed(1)            # ... set the random number seed
pruning <-             # make pruning
   cv.tree(            # the result of cross-validating
   cart.1,             # the tree in cart.1
   FUN=prune.misclass) # based on the number of misclassifications
plot(pruning$size,     # plot the pruned tree sizes
   pruning$dev,        # against the deviances those tree sizes come w/
   type="b"); grid()   # using points & lines; add a grid
So we go with 5 terminal nodes:
cart.1.pruned <- prune.tree( # make cart.1.pruned the pruning of
   cart.1,                   # a version of cart.1 pruned down to
   best=5)                   # only 5 terminal nodes
plot(cart.1.pruned)                      # plot the classification tree
axis(2); mtext("Deviance", 2, 3); grid() # add a useful y-axis + grid
text(cart.1.pruned, pretty=4, all=TRUE)  # add all full labels to it
So, what the plot and the output from cart.1.pruned and cart.1.pruned$frame say is the following:
predict of-genitives most of the time, especially
when the possessor is inanimate (which is correct more than 98% of the time!); this makes sense because animate/human possessors usually take s-genitives;
when the possessor is >13 characters; this makes sense because the of-genitive allows such long possessors to go to the end (satisfying short-before-long);
predict s-genitives when the length of the possessor is ≤13 characters and
with possessors that are animate, or
with possessors that are
collective, locative, or temporal and
at least two words shorter than the possessum.
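Since LEN_PORmPUM_LOG is a difference of base-2 logs, the split thresholds in cart.1$frame above can be translated back into length ratios, which is what underlies the 'shorter than the possessum' reading; a quick sketch:

```r
2^(-1.11548) # node 12 split: possessor less than ~46% of the possessum's length
2^(0.864615) # node 15 split: possessor more than ~1.82x the possessum's length
```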
How much of a hit do we take when it comes to ‘prediction’ accuracy?
d$PREDS.PP.2.pruned <-  # make d$PREDS.PP.2.pruned the result of
   predict(             # predicting
   cart.1.pruned        # from cart.1.pruned
   )[,"s"]              # predicted probabilities of "s"
d$PREDS.CAT.pruned <- predict( # make d$PREDS.CAT.pruned the result of predicting
   cart.1.pruned,              # from cart.1.pruned
   type="class")               # categorical predictions
(c.m.pr <- table(               # confusion matrix: cross-tabulate
   "OBS"  =d$GENITIVE,          # observed genitives in the rows
   "PREDS"=d$PREDS.CAT.pruned)) # predicted genitives in the columns
PREDS
OBS of s
of 2453 267
s 297 583
c( # evaluate the confusion matrix
   "Prec. for s"     =c.m.pr[ "s", "s"] / sum(c.m.pr[    , "s"]),
   "Acc./rec. for s" =c.m.pr[ "s", "s"] / sum(c.m.pr[ "s",   ]),
   "Prec. for of"    =c.m.pr["of","of"] / sum(c.m.pr[    ,"of"]),
   "Acc./rec. for of"=c.m.pr["of","of"] / sum(c.m.pr["of",   ]),
   "Acc. (overall)"  =mean(d$GENITIVE==d$PREDS.CAT.pruned))
Prec. for s Acc./rec. for s Prec. for of Acc./rec. for of
0.6858824 0.6625000 0.8920000 0.9018382
Acc. (overall)
0.8433333
What effect does that have on logloss and the deviance?
d$PREDS.PP.obs.pruned <- ifelse( # d$PREDS.PP.obs.pruned is determined by ifelse
   d$GENITIVE=="s",       # if the obs genitive is the 2nd level of the response
   d$PREDS.PP.2.pruned,   # take its predicted probability
   1-d$PREDS.PP.2.pruned) # otherwise take 1 minus its predicted probability
head(d)
CASE GENITIVE SPEAKER MODALITY POR_LENGTH PUM_LENGTH POR_ANIMACY
1 2 of nns spoken 13 7 collective
2 3 of nns spoken 22 7 animate
3 4 of nns spoken 11 8 animate
4 5 of nns spoken 26 4 collective
5 6 s nns spoken 8 4 animate
6 7 s nns spoken 7 3 animate
POR_FINAL_SIB POR_DEF LEN_PORmPUM_LOG PREDS.PP.2 PREDS.CAT PREDS.PP.obs
1 absent definite 0.8930848 0.29056604 of 0.7094340
2 absent definite 1.6520767 0.20886076 of 0.7911392
3 present definite 0.4594316 0.68148148 s 0.3185185
4 absent definite 2.7004397 0.06578947 of 0.9342105
5 absent definite 1.0000000 0.68148148 s 0.6814815
6 absent definite 1.2223924 0.68148148 s 0.6814815
PREDS.PP.2.pruned PREDS.CAT.pruned PREDS.PP.obs.pruned
1 0.2905660 of 0.7094340
2 0.2003643 of 0.7996357
3 0.6814815 s 0.3185185
4 0.2003643 of 0.7996357
5 0.6814815 s 0.6814815
6 0.6814815 s 0.6814815
sum(-log(d$PREDS.PP.obs.pruned)) * 2 # yes, that is deviance(cart.1.pruned)
[1] 2570.893
mean(-log(d$PREDS.PP.obs.pruned)) # thus, this is the logloss for cart.1.pruned
[1] 0.3570685
So the pruning made
the deviance/logloss go up by 5.26%;
the R-squared value go down by 8.23% in relative terms (from roughly 0.390 to 0.358),
but in terms of precision and accuracy/recall we really only take a notable hit in the accuracy/recall for s-genitives (from 72.7% down to 66.3%); you need to determine whether your application can live with that decrease in accuracy/deviance and weigh it against the degree to which the pruning lowers the risk of overfitting.
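The percentage changes can be recomputed from the deviances reported above (null deviance 4004.273 = the root node's dev in cart.1$frame, full tree 2442.424, pruned tree 2570.893); a minimal sketch:

```r
dev.null <- 4004.273; dev.full <- 2442.424; dev.pruned <- 2570.893
(dev.pruned - dev.full)/dev.full     # deviance up by ~5.26%
r2.full   <- 1 - dev.full/dev.null   # McFadden's R2 of cart.1, ~0.390
r2.pruned <- 1 - dev.pruned/dev.null # McFadden's R2 of cart.1.pruned, ~0.358
(r2.full - r2.pruned)/r2.full        # R2 down by ~8.2% in relative terms
```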
4 Session info
sessionInfo()
R version 4.4.3 (2025-02-28)
Platform: x86_64-pc-linux-gnu
Running under: Pop!_OS 22.04 LTS
Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so; LAPACK version 3.10.0
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
time zone: America/Los_Angeles
tzcode source: system (glibc)
attached base packages:
[1] stats graphics grDevices utils datasets compiler methods
[8] base
other attached packages:
[1] tree_1.0-44 STGmisc_1.0 Rcpp_1.0.14 magrittr_2.0.3
loaded via a namespace (and not attached):
[1] digest_0.6.37 fastmap_1.2.0 xfun_0.51 knitr_1.50
[5] htmltools_0.5.8.1 rmarkdown_2.29 cli_3.6.4 rstudioapi_0.17.1
[9] tools_4.4.3 evaluate_1.0.3 yaml_2.3.10 rlang_1.1.5
[13] jsonlite_2.0.0 htmlwidgets_1.6.4 MASS_7.3-65
Source: "Ling 202: session 10: trees (key)", Stefan Th. Gries (UC Santa Barbara / JLU Giessen), 04 Jun 2025.