Ling 202: session 10: trees (key)

Affiliations: UC Santa Barbara & JLU Giessen

Published: 04 Jun 2025

1 Intro

We are dealing with the same data set as in the sessions on binary logistic regression (_input/genitives.csv, see _input/genitives.r); specifically, we are asking whether the choice of a genitive construction (of vs. s) varies as a function of

  • all the predictors that are already part of the data frame, i.e.
    • the categorical predictors SPEAKER, MODALITY, POR_ANIMACY, POR_FINAL_SIB, POR_DEF;
    • the numeric predictors POR_LENGTH and PUM_LENGTH;
  • an additional length-based predictor, namely how the length of the possessor (POR_LENGTH) compares to the length of the possessum (PUM_LENGTH), expressed as the difference of their base-2 logarithms (i.e. a logged length ratio); since such a comparison variable doesn’t exist yet in our data set, we need to create it first.
rm(list=ls(all.names=TRUE))
library(magrittr); library(tree); library(STGmisc) # the last presumably provides cohens.kappa() & C.score() used below
summary(d <- read.delim(   # summarize d, the result of loading
   file="_input/genitives.csv", # this file
   stringsAsFactors=TRUE)) # change categorical variables into factors
      CASE      GENITIVE  SPEAKER       MODALITY      POR_LENGTH
 Min.   :   2   of:2720   nns:2666   spoken :1685   Min.   :  1.00
 1st Qu.:1006   s : 880   ns : 934   written:1915   1st Qu.:  8.00
 Median :2018                                       Median : 11.00
 Mean   :2012                                       Mean   : 14.58
 3rd Qu.:3017                                       3rd Qu.: 17.00
 Max.   :4040                                       Max.   :204.00
   PUM_LENGTH         POR_ANIMACY   POR_FINAL_SIB        POR_DEF
 Min.   :  2.00   animate   : 920   absent :2721   definite  :2349
 1st Qu.:  6.00   collective: 607   present: 879   indefinite:1251
 Median :  9.00   inanimate :1671
 Mean   : 10.35   locative  : 243
 3rd Qu.: 13.00   temporal  : 159
 Max.   :109.00                                                     
d$LEN_PORmPUM_LOG <- log2(d$POR_LENGTH)-log2(d$PUM_LENGTH)
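
Just to make the new predictor concrete, here is a minimal sketch (with made-up lengths, not values from d) of what LEN_PORmPUM_LOG encodes:

# a minimal sketch: the new predictor is 0 when possessor & possessum are
# equally long, positive when the possessor is longer, negative when it is
# shorter; each unit of 1 corresponds to a doubling/halving of the length ratio
log2(c(10, 20, 5)) - log2(10) # 0, 1, -1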

2 Deviance & baseline(s)

Let’s go ahead and compute the baselines for what will be the response variable, GENITIVE:

(baselines <- c(
   "baseline 1"=max(          # make baselines[1] the highest
      prop.table(             # proportion in the
         table(d$GENITIVE))), # frequency table of the response
   "baseline 2"=sum(             # make baselines[2] the sum of the
      prop.table(                # proportions in the
         table(d$GENITIVE))^2))) # frequency table of the response squared
baseline 1 baseline 2
 0.7555556  0.6306173 
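
A small sanity-check sketch of what these two numbers mean: baseline 1 is just the accuracy of always predicting the more frequent level ("of"), and baseline 2 is the expected accuracy of guessing "of" vs. "s" at random in proportion to their observed frequencies (you guess each level with its proportion p and are then right with that same probability p, hence the sum of squared proportions):

mean(d$GENITIVE=="of")             # always guessing "of": same as baselines[1]
p <- prop.table(table(d$GENITIVE)) # proportions of "of" & "s"
sum(p * p)                         # proportional random guessing: same as baselines[2]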

Let’s also compute the deviance of a null tree, which is the same as the deviance of the null glm, because both just look at the distribution of the response variable:

deviance(cart.0 <- tree(GENITIVE ~ 1, data=d))
[1] 4004.273
deviance(glm(GENITIVE ~ 1, family=binomial, data=d))
[1] 4004.273
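
To connect this number to the logloss logic used below, here is a minimal by-hand sketch: under the null model, every case is 'predicted' the overall proportion of its observed level, and the deviance is -2 times the sum of the logged predicted probabilities of the observed outcomes:

p.null <- prop.table(table(d$GENITIVE))         # proportions of "of" & "s"
-2 * sum(log(p.null[as.character(d$GENITIVE)])) # should return (approx.) 4004.273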

3 Fitting trees & interpretation

3.1 A regular tree

How about we fit a ‘regular’ classification tree?

summary(cart.1 <- tree(   # summarize an object called cart.1
   GENITIVE ~             # a classification tree of GENITIVE
   POR_LENGTH + PUM_LENGTH + LEN_PORmPUM_LOG + # numeric predictors
   POR_ANIMACY + SPEAKER + MODALITY + POR_FINAL_SIB + POR_DEF, # categorical predictors
   data=d)) # contains deviance as well

Classification tree:
tree(formula = GENITIVE ~ POR_LENGTH + PUM_LENGTH + LEN_PORmPUM_LOG +
    POR_ANIMACY + SPEAKER + MODALITY + POR_FINAL_SIB + POR_DEF,
    data = d)
Variables actually used in tree construction:
[1] "POR_ANIMACY"     "POR_LENGTH"      "LEN_PORmPUM_LOG"
Number of terminal nodes:  7
Residual mean deviance:  0.6798 = 2442 / 3593
Misclassification error rate: 0.1492 = 537 / 3600 

What are the (harder to interpret) exact results?

cart.1
node), split, n, deviance, yval, (yprob)
      * denotes terminal node

 1) root 3600 4004.0 of ( 0.75556 0.24444 )
   2) POR_ANIMACY: inanimate 1671  324.4 of ( 0.98025 0.01975 ) *
   3) POR_ANIMACY: animate,collective,locative,temporal 1929 2645.0 of ( 0.56091 0.43909 )
     6) POR_LENGTH < 13.5 1380 1907.0 s ( 0.46594 0.53406 )
      12) POR_ANIMACY: collective,locative,temporal 705  944.7 of ( 0.60709 0.39291 )
        24) LEN_PORmPUM_LOG < -1.11548 175  212.9 s ( 0.29714 0.70286 ) *
        25) LEN_PORmPUM_LOG > -1.11548 530  638.8 of ( 0.70943 0.29057 ) *
      13) POR_ANIMACY: animate 675  844.8 s ( 0.31852 0.68148 ) *
     7) POR_LENGTH > 13.5 549  550.0 of ( 0.79964 0.20036 )
      14) POR_ANIMACY: collective,locative,temporal 304  147.5 of ( 0.93421 0.06579 ) *
      15) POR_ANIMACY: animate 245  322.2 of ( 0.63265 0.36735 )
        30) LEN_PORmPUM_LOG < 0.864615 87  112.1 s ( 0.34483 0.65517 ) *
        31) LEN_PORmPUM_LOG > 0.864615 158  161.9 of ( 0.79114 0.20886 ) *
cart.1$frame
               var    n       dev yval splits.cutleft splits.cutright
1      POR_ANIMACY 3600 4004.2730   of             :c           :abde
2           <leaf> 1671  324.3722   of
3       POR_LENGTH 1929 2645.4618   of          <13.5           >13.5
6      POR_ANIMACY 1380 1906.6784    s           :bde              :a
12 LEN_PORmPUM_LOG  705  944.7437   of      <-1.11548       >-1.11548
24          <leaf>  175  212.9484    s
25          <leaf>  530  638.8172   of
13          <leaf>  675  844.7594    s
7      POR_ANIMACY  549  549.9959   of           :bde              :a
14          <leaf>  304  147.5062   of
15 LEN_PORmPUM_LOG  245  322.1890   of      <0.864615       >0.864615
30          <leaf>   87  112.0883    s
31          <leaf>  158  161.9321   of
     yprob.of    yprob.s
1  0.75555556 0.24444444
2  0.98025135 0.01974865
3  0.56091239 0.43908761
6  0.46594203 0.53405797
12 0.60709220 0.39290780
24 0.29714286 0.70285714
25 0.70943396 0.29056604
13 0.31851852 0.68148148
7  0.79963570 0.20036430
14 0.93421053 0.06578947
15 0.63265306 0.36734694
30 0.34482759 0.65517241
31 0.79113924 0.20886076
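
If that output is hard to digest, a small sketch of how one case travels through the tree may help: take row 4 of d (a collective possessor of length 26); the root split (node 1) sends all non-inanimate possessors to node 3, the length split there sends cases with POR_LENGTH > 13.5 to node 7, and the animacy split there sends collective/locative/temporal possessors to the terminal node 14, whose class proportions predict() should reproduce:

d[4, c("POR_ANIMACY", "POR_LENGTH")] # a collective possessor of length 26
predict(cart.1, newdata=d[4,])       # should match node 14's yprob values above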

But we want to plot this so we can try and interpret it:

plot(cart.1); grid()             # plot the classification tree
axis(2); mtext("Deviance", 2, 3) # add a useful y-axis
text(cart.1, pretty=4, all=TRUE) # add abbrev. labels to it

How well does the tree do in terms of its predictions?

d$PREDS.PP.2 <- # make d$PREDS.PP.2 the result of
   predict(     # predicting
      cart.1    # from cart.1
   )[,"s"]      # predicted probabilities of "s"
d$PREDS.CAT <- predict( # make d$PREDS.CAT the result of predicting
   cart.1,              # from cart.1
   type="class")        # categorical predictions
(c.m <- table(           # confusion matrix: cross-tabulate
   "OBS"  =d$GENITIVE,   # observed genitives in the rows
   "PREDS"=d$PREDS.CAT)) # predicted genitives in the columns
    PREDS
OBS    of    s
  of 2423  297
  s   240  640
c( # evaluate the confusion matrix
   "Prec. for s"     =c.m[ "s", "s"] / sum(c.m[    , "s"]),
   "Acc./rec. for s" =c.m[ "s", "s"] / sum(c.m[ "s",    ]),
   "Prec. for of"    =c.m["of","of"] / sum(c.m[    ,"of"]),
   "Acc./rec. for of"=c.m["of","of"] / sum(c.m["of",    ]),
   "Acc. (overall)"  =mean(d$GENITIVE==d$PREDS.CAT))
     Prec. for s  Acc./rec. for s     Prec. for of Acc./rec. for of
       0.6830309        0.7272727        0.9098761        0.8908088
  Acc. (overall)
       0.8508333 

Note that this means you could compute R2-values even for trees (even though I never see that being done; I don’t know why not):

(deviance(cart.0)-deviance(cart.1))/deviance(cart.0) # McFadden's R2
[1] 0.3900456

Of course we can also compute Cohen’s κ and C:

c("Cohen's kappa"=cohens.kappa(c.m)[[1]],
  "C-score"      =C.score(d$GENITIVE, d$PREDS.PP.2))
Cohen's kappa       C-score
    0.6048307     0.8858888 
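
(The two functions above come from a helper package; per the session info below, that is STGmisc.) In case you do not have it, here is a base-R sketch of what I take these two statistics to be, namely the usual unweighted Cohen's kappa computed from the confusion matrix, and the C-score as the concordance index (how often does an observed s-genitive get a higher predicted probability of "s" than an observed of-genitive, with ties counting 1/2?):

p.obs <- sum(diag(c.m)) / sum(c.m)                     # observed agreement
p.exp <- sum(rowSums(c.m)*colSums(c.m)) / sum(c.m)^2   # chance agreement
(p.obs - p.exp) / (1 - p.exp)                          # Cohen's kappa
pp.s  <- d$PREDS.PP.2[d$GENITIVE=="s"]                 # preds for observed s
pp.of <- d$PREDS.PP.2[d$GENITIVE=="of"]                # preds for observed of
mean(outer(pp.s, pp.of, ">") +                         # concordant pairs plus
     0.5*outer(pp.s, pp.of, "=="))                     # half of the ties = C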

We can also add the column for the logloss/deviance calculations:

d$PREDS.PP.obs <- ifelse( # d$PREDS.PP.obs is determined by ifelse
   d$GENITIVE=="s",        # if the obs genitive is the 2nd level of the response
     d$PREDS.PP.2,        # take its predicted probability
   1-d$PREDS.PP.2)        # otherwise take 1 minus its predicted probability
head(d)
  CASE GENITIVE SPEAKER MODALITY POR_LENGTH PUM_LENGTH POR_ANIMACY
1    2       of     nns   spoken         13          7  collective
2    3       of     nns   spoken         22          7     animate
3    4       of     nns   spoken         11          8     animate
4    5       of     nns   spoken         26          4  collective
5    6        s     nns   spoken          8          4     animate
6    7        s     nns   spoken          7          3     animate
  POR_FINAL_SIB  POR_DEF LEN_PORmPUM_LOG PREDS.PP.2 PREDS.CAT PREDS.PP.obs
1        absent definite       0.8930848 0.29056604        of    0.7094340
2        absent definite       1.6520767 0.20886076        of    0.7911392
3       present definite       0.4594316 0.68148148         s    0.3185185
4        absent definite       2.7004397 0.06578947        of    0.9342105
5        absent definite       1.0000000 0.68148148         s    0.6814815
6        absent definite       1.2223924 0.68148148         s    0.6814815

Can we still get to the deviance of the tree from the contributions to logloss?

sum(-log(d$PREDS.PP.obs)) * 2 # yes, that is deviance(cart.1)
[1] 2442.424
mean(-log(d$PREDS.PP.obs)) # thus, this is the logloss for cart.1
[1] 0.3392255

3.2 Is pruning necessary?

Should we, or do we need to, prune this tree?

# because cross-validation involves randomization, we ...
set.seed(1)               # ... set the random number seed
pruning <-                # make pruning
   cv.tree(               # the result of cross-validating
      cart.1,             # the tree in cart.1
      FUN=prune.misclass) # based on the number of misclassifications
plot(pruning$size,     # plot the pruned tree sizes
     pruning$dev,      # against their CV misclassification counts (that's what $dev holds here)
     type="b"); grid() # using points & lines; add a grid
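
Since reading sizes off the plot can be imprecise, it cannot hurt to also inspect the numbers (a sketch; note that, because we used prune.misclass, the dev component contains cross-validated misclassification counts, not deviances):

rbind(size=pruning$size, misclass=pruning$dev) # CV misclassifications per tree size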

Based on the cross-validation results, we go with 5 terminal nodes:

cart.1.pruned <- prune.tree( # make cart.1.pruned the pruning of
   cart.1,                   # a version of cart.1 pruned down to
   best=5)                   # only 5 terminal nodes
plot(cart.1.pruned)                      # plot the classification tree
axis(2); mtext("Deviance", 2, 3); grid() # add a useful y-axis + grid
text(cart.1.pruned, pretty=4, all=TRUE)  # add abbrev. labels to all nodes

So, what the plot and the output from cart.1.pruned and cart.1.pruned$frame say is the following:

  • predict of-genitives most of the time, especially
    • when the possessor is inanimate (which is correct more than 98% of the time!); this makes sense because animate/human possessors usually take s-genitives;
    • when the possessor is >13 characters; this makes sense because the of-genitive allows such long possessors to go to the end (satisfying short-before-long);
  • predict s-genitives when the length of the possessor is ≤13 characters and
    • with possessors that are animate, or
    • with possessors that are
      • collective, locative, or temporal and
      • less than about half as long as the possessum (that is what the split LEN_PORmPUM_LOG < -1.11548 amounts to; see the short sketch right after this list).
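
A quick sketch of converting that last threshold back to the original length scale: since LEN_PORmPUM_LOG is log2(POR_LENGTH) - log2(PUM_LENGTH) = log2(POR_LENGTH/PUM_LENGTH), the split value of -1.11548 corresponds to this possessor-to-possessum length ratio:

2^-1.11548 # ~0.46, i.e. the possessor is less than about half as long as the possessum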

How much of a hit do we take when it comes to ‘prediction’ accuracy?

d$PREDS.PP.2.pruned <- # make d$PREDS.PP.2.pruned the result of
   predict(            # predicting
      cart.1.pruned    # from cart.1.pruned
   )[,"s"]             # predicted probabilities of "s"
d$PREDS.CAT.pruned <- predict( # make d$PREDS.CAT.pruned the result of predicting
   cart.1.pruned,              # from cart.1.pruned
   type="class")               # categorical predictions
(c.m.pr <- table(               # confusion matrix: cross-tabulate
   "OBS"  =d$GENITIVE,          # observed genitives in the rows
   "PREDS"=d$PREDS.CAT.pruned)) # predicted genitives in the columns
    PREDS
OBS    of    s
  of 2453  267
  s   297  583
c( # evaluate the confusion matrix
   "Prec. for s"     =c.m.pr[ "s", "s"] / sum(c.m.pr[    , "s"]),
   "Acc./rec. for s" =c.m.pr[ "s", "s"] / sum(c.m.pr[ "s",    ]),
   "Prec. for of"    =c.m.pr["of","of"] / sum(c.m.pr[    ,"of"]),
   "Acc./rec. for of"=c.m.pr["of","of"] / sum(c.m.pr["of",    ]),
   "Acc. (overall)"  =mean(d$GENITIVE==d$PREDS.CAT.pruned))
     Prec. for s  Acc./rec. for s     Prec. for of Acc./rec. for of
       0.6858824        0.6625000        0.8920000        0.9018382
  Acc. (overall)
       0.8433333 
(deviance(cart.0)-deviance(cart.1.pruned))/deviance(cart.0) # McFadden's R2
[1] 0.3579626

What effect does that have on logloss and the deviance?

d$PREDS.PP.obs.pruned <- ifelse( # d$PREDS.PP.obs.pruned is determined by ifelse
   d$GENITIVE=="s",               # if the obs genitive is the 2nd level of the response
     d$PREDS.PP.2.pruned,        # take its predicted probability
   1-d$PREDS.PP.2.pruned)        # otherwise take 1 minus its predicted probability
head(d)
  CASE GENITIVE SPEAKER MODALITY POR_LENGTH PUM_LENGTH POR_ANIMACY
1    2       of     nns   spoken         13          7  collective
2    3       of     nns   spoken         22          7     animate
3    4       of     nns   spoken         11          8     animate
4    5       of     nns   spoken         26          4  collective
5    6        s     nns   spoken          8          4     animate
6    7        s     nns   spoken          7          3     animate
  POR_FINAL_SIB  POR_DEF LEN_PORmPUM_LOG PREDS.PP.2 PREDS.CAT PREDS.PP.obs
1        absent definite       0.8930848 0.29056604        of    0.7094340
2        absent definite       1.6520767 0.20886076        of    0.7911392
3       present definite       0.4594316 0.68148148         s    0.3185185
4        absent definite       2.7004397 0.06578947        of    0.9342105
5        absent definite       1.0000000 0.68148148         s    0.6814815
6        absent definite       1.2223924 0.68148148         s    0.6814815
  PREDS.PP.2.pruned PREDS.CAT.pruned PREDS.PP.obs.pruned
1         0.2905660               of           0.7094340
2         0.2003643               of           0.7996357
3         0.6814815                s           0.3185185
4         0.2003643               of           0.7996357
5         0.6814815                s           0.6814815
6         0.6814815                s           0.6814815
sum(-log(d$PREDS.PP.obs.pruned)) * 2 # yes, that is deviance(cart.1.pruned)
[1] 2570.893
mean(-log(d$PREDS.PP.obs.pruned)) # thus, this is the logloss for cart.1.pruned
[1] 0.3570685

So the pruning made

  • the deviance/logloss go up by 5.26%;
  • the R-squared value go down by 8.23% (from 0.390 to 0.358),

but in terms of precision and accuracy/recall we really only take a notable hit in the accuracy/recall for s-genitives (from 72.7% down to 66.3%); you need to weigh whether your application can live with that decrease in accuracy (and increase in deviance) against the degree to which the pruning lowers the risk of overfitting.
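
In case you want to see where those two percentages come from, here is a short sketch (re-using the deviances computed above; r2.full & r2.pruned are just throwaway names for the two McFadden's R-squared values):

(deviance(cart.1.pruned)-deviance(cart.1)) / deviance(cart.1) # relative increase in deviance: ~0.0526
r2.full   <- (deviance(cart.0)-deviance(cart.1))        / deviance(cart.0)
r2.pruned <- (deviance(cart.0)-deviance(cart.1.pruned)) / deviance(cart.0)
(r2.full-r2.pruned) / r2.full # relative decrease in R-squared: ~0.0823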

4 Session info

sessionInfo()
R version 4.4.3 (2025-02-28)
Platform: x86_64-pc-linux-gnu
Running under: Pop!_OS 22.04 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so;  LAPACK version 3.10.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

time zone: America/Los_Angeles
tzcode source: system (glibc)

attached base packages:
[1] stats     graphics  grDevices utils     datasets  compiler  methods
[8] base

other attached packages:
[1] tree_1.0-44    STGmisc_1.0    Rcpp_1.0.14    magrittr_2.0.3

loaded via a namespace (and not attached):
 [1] digest_0.6.37     fastmap_1.2.0     xfun_0.51         knitr_1.50
 [5] htmltools_0.5.8.1 rmarkdown_2.29    cli_3.6.4         rstudioapi_0.17.1
 [9] tools_4.4.3       evaluate_1.0.3    yaml_2.3.10       rlang_1.1.5
[13] jsonlite_2.0.0    htmlwidgets_1.6.4 MASS_7.3-65