Ling 204: session 10

Author
Affiliations

UC Santa Barbara

JLU Giessen

Published

11 Mar 2026

1 Session 10: Hierarchical cluster analysis

Most methods in this course have been supervised learning methods, i.e. methods where an outcome – a response variable – is known and can be used to train some kind of model to predict that outcome in (ideally unseen) data; such methods are often used in hypothesis-testing contexts. The topic of this session is unsupervised learning, i.e. methods with which we try to discover structure in data, an approach often used in hypothesis-generating contexts. The general approach we will be concerned with is clustering, i.e. approaches that try to find groups – clusters – in data such that the groups have a high degree of within-group/cluster similarity and a low degree of between-group/cluster similarity.1
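As a minimal, self-contained sketch of that logic (with simulated data, not the sorting data discussed below): two well-separated groups of points should show up as two clusters with high within-cluster and low between-cluster similarity.

```r
set.seed(1)                            # for reproducibility
toy <- rbind(
   matrix(rnorm(10, mean=0), ncol=2),  # 5 points scattered around (0, 0)
   matrix(rnorm(10, mean=5), ncol=2))  # 5 points scattered around (5, 5)
toy_clust <- hclust(dist(toy))         # cluster on Euclidean distances
cutree(toy_clust, k=2)                 # 1 1 1 1 1 2 2 2 2 2: groups recovered
```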

The application of hierarchical cluster analysis we’re dealing with in this session is actually somewhat atypical because it is one where the method is used in what is really more of a hypothesis-testing context. In a Construction Grammar study, Bencini & Goldberg (2000) tested whether subjects sorted sentences into groups on the basis of the verbs in these sentences (H0) or on the basis of the syntactic form of these sentences (H1). In a replication, Gries & Wulff (2005) gave 20 subjects 16 cards, each with one sentence on it. The 16 sentences crossed four verbs (V1, V2, V3, V4) and four constructions (C1, C2, C3, C4), and the subjects were asked to sort the 16 sentences into four stacks of four cards each. For one kind of subsequent evaluation, they counted how often each sentence was placed into the same stack as each other sentence: the more similar two sentences are to each other, the more often they will be grouped together. As a result, they obtained a symmetric 16×16 matrix of sentences, which is available in _input/sorting.csv. Load the data and make sure you get the row names because it is the rows that will be clustered.

rm(list=ls(all.names=TRUE))
pkgs2load <- c("amap", "cluster", "fpc", "magrittr", "pvclust")
suppressMessages(sapply(pkgs2load, library,
   character.only=TRUE, logical.return=TRUE, quietly=TRUE)); rm(pkgs2load)
    amap  cluster      fpc magrittr  pvclust
    TRUE     TRUE     TRUE     TRUE     TRUE 
d <- read.delim(           # make d the result of loading
   file="_input/sorting.csv", # this file
   stringsAsFactors=TRUE,  # change categorical variables into factors
   row.names=1)            # make the first column the row names
source("_helpers/entropy.r")

We want to see whether the sortings by the subjects are more compatible with a verb- or a construction-based sorting style. Let’s take a quick look at the symmetric matrix just to see that the value ranges make sense:

d[1:6,1:6]
     V1C1 V1C2 V1C3 V1C4 V2C1 V2C2
V1C1   20    5    5    4   14    1
V1C2    5   20    6    3    3   13
V1C3    5    6   20    5    1    0
V1C4    4    3    5   20    2    0
V2C1   14    3    1    2   20    3
V2C2    1   13    0    0    3   20
apply(        # apply to
   MARGIN=2,  # every column of
   X=d,       # the data frame d
   FUN=range) # the function range
     V1C1 V1C2 V1C3 V1C4 V2C1 V2C2 V2C3 V2C4 V3C1 V3C2 V3C3 V3C4 V4C1 V4C2 V4C3 V4C4
[1,]    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
[2,]   20   20   20   20   20   20   20   20   20   20   20   20   20   20   20   20

1.1 The similarity/distance matrix

Let’s create a distance matrix from these similarities; Euclidean distances are unproblematic here because all row and column values are on the same scale (the interval [0, 20]):

dist_matrix <- Dist(   # make dist_matrix the result of Dist
   d,                  # here applied to the whole data frame
   method="euclidean", # Euclidean distance, and ...
   upper=TRUE)         # make it symmetric (just for human consumption)
dist_matrix %>% as.matrix %>% "["(1:6,1:6) %>% round(3)
       V1C1   V1C2   V1C3   V1C4   V2C1   V2C2
V1C1  0.000 35.749 35.299 35.071 11.314 40.423
V1C2 35.749  0.000 34.612 36.222 36.878 14.283
V1C3 35.299 34.612  0.000 30.822 36.824 39.950
V1C4 35.071 36.222 30.822  0.000 36.524 38.884
V2C1 11.314 36.878 36.824 36.524  0.000 38.884
V2C2 40.423 14.283 39.950 38.884 38.884  0.000
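Just to verify what these numbers are: the Euclidean distance between two rows is the square root of their summed squared differences. A quick sketch with two made-up 3-element rows, using base R’s dist (which, like amap::Dist above, defaults to Euclidean distances):

```r
x <- c(20, 5, 5); y <- c(5, 20, 6)  # two hypothetical rows of co-occurrence counts
sqrt(sum((x - y)^2))                # manual Euclidean distance: 21.237
dist(rbind(x, y))                   # base R's dist returns the same value
```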

1.2 A cluster analysis using base::hclust

Let’s compute the hierarchical cluster analyses:

clustering <- hclust(
   dist_matrix,
   method="ward.D2") # other good option: "complete"

What does it look like?

op <- par(mar=c(1, 4, 1, 1)) # customize the plotting margins
plot(main="", sub="", xlab="",
   clustering)
(whats_where1 <- rect.hclust(
   clustering,
   k=4,
   border=rainbow(4)))
par(op) # restore defaults
[[1]]
V1C2 V2C2 V3C2 V4C2
   2    6   10   14

[[2]]
V1C1 V2C1 V3C1 V4C1
   1    5    9   13

[[3]]
V1C4 V2C4 V3C4 V4C4
   4    8   12   16

[[4]]
V1C3 V2C3 V3C3 V4C3
   3    7   11   15 
Figure 1: A hierarchical agglomerative cluster analysis

The result could hardly be more obvious, but how could one show this really quickly? By tabulating verbs and constructions in each cluster:

lapply( # apply to every element of
   whats_where1,    # whats_where1
   \(af) c(         # an anonymous function creating a vector of the
      names(af) %>% substr(1, 2) %>% table,  # table of verbs per cluster, & the
      names(af) %>% substr(3, 4) %>% table)) # table of constructions per cluster
[[1]]
V1 V2 V3 V4 C2
 1  1  1  1  4

[[2]]
V1 V2 V3 V4 C1
 1  1  1  1  4

[[3]]
V1 V2 V3 V4 C4
 1  1  1  1  4

[[4]]
V1 V2 V3 V4 C3
 1  1  1  1  4 

Clearly, each cluster contains every verb exactly once but only a single construction.

A different way to get the same information uses cutree:

# cut the tree into k=4 clusters
(whats_where2 <- cutree(clustering, 4))
V1C1 V1C2 V1C3 V1C4 V2C1 V2C2 V2C3 V2C4 V3C1 V3C2 V3C3 V3C4 V4C1 V4C2 V4C3 V4C4
   1    2    3    4    1    2    3    4    1    2    3    4    1    2    3    4 
# show for each cluster what's in it:
tapply(                 # apply to
   names(whats_where2), # the elements clustered
   whats_where2,        # a grouping by cluster
   c)                   # and list them
$`1`
[1] "V1C1" "V2C1" "V3C1" "V4C1"

$`2`
[1] "V1C2" "V2C2" "V3C2" "V4C2"

$`3`
[1] "V1C3" "V2C3" "V3C3" "V4C3"

$`4`
[1] "V1C4" "V2C4" "V3C4" "V4C4"
# create tables similar to the above ones:
table(whats_where2, substr(names(whats_where2), 1, 2)) # the verbs

whats_where2 V1 V2 V3 V4
           1  1  1  1  1
           2  1  1  1  1
           3  1  1  1  1
           4  1  1  1  1
table(whats_where2, substr(names(whats_where2), 3, 4)) # the constructions

whats_where2 C1 C2 C3 C4
           1  4  0  0  0
           2  0  4  0  0
           3  0  0  4  0
           4  0  0  0  4

For something we will consider later, I am storing this last table in a data structure called eval_re_constr:

eval_re_constr <- table(whats_where2, substr(names(whats_where2), 3, 4))

… and compute the sum of the entropies of all clusters, to be stored in entropy_of_eval_re_constr. Here, since the clustering is perfectly aligned with the constructions, entropy_of_eval_re_constr is 0:

entropy_of_eval_re_constr <- apply(
   eval_re_constr,
   1,
   entropy) %>% sum

We’ll get back to this.
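In case the entropy helper is a black box to you: assuming it computes the Shannon entropy of a vector of frequencies (the actual code in _helpers/entropy.r may differ in details), a hypothetical stand-in shows why perfectly ‘pure’ rows like the ones in eval_re_constr contribute 0:

```r
entropy_sketch <- function(freqs) {    # hypothetical stand-in for entropy()
   p <- freqs[freqs > 0] / sum(freqs)  # proportions, dropping empty cells
   -sum(p * log2(p))                   # Shannon entropy in bits
}
entropy_sketch(c(4, 0, 0, 0))  # 0: a perfectly 'pure' cluster
entropy_sketch(c(1, 1, 1, 1))  # 2: a maximally mixed cluster
```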

1.3 Average silhouette widths

One interesting way to determine good numbers of clusters is the notion of average silhouette width, a validation metric of sorts that can help to interpret a cluster solution and that is similar in spirit to the logic underlying ANOVA in that it is based on the relation of within-cluster similarity to between-cluster similarity. A cluster solution is good if its clusters have

  • high degrees of within-cluster similarity and
  • low degrees of between-cluster similarity (i.e. of similarity to elements in other clusters).
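To make this concrete: the silhouette width of an element is (b − a) / max(a, b), where a is its mean distance to the other members of its own cluster and b is its mean distance to the elements of the nearest other cluster. A hand computation on three made-up 1-dimensional points, checked against cluster::silhouette (note that singleton clusters are assigned a width of 0):

```r
library(cluster)
pts <- c(0, 1, 10)   # points 1 & 2 form one cluster, point 3 is on its own
a <- 1               # point 1's mean distance within its own cluster
b <- 10              # point 1's mean distance to the other cluster
(b - a) / max(a, b)  # 0.9: point 1 sits comfortably in its cluster
silhouette(x=c(1, 1, 2), dist=dist(pts))[, "sil_width"]  # 0.900 0.889 0.000
```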

The function cluster::silhouette allows us to compute silhouette widths for a solution with a specific number of clusters, which we can then average to get – with some information loss! – a single score for a specific cluster solution, an average silhouette width:

aver_silh_widths <- setNames(rep(NA, 14), 2:15) # other ks make no sense w/ 16 elements
for (i in 2:15) {             # for each possible number of clusters
   aver_silh_widths[i-1] <-   # save into the vector aver_silh_widths the result of
      silhouette(                     # computing silhouette widths for
         x=cutree(clustering, i),     # a solution w/ i clusters
         dist=dist_matrix)        %>% # with this distance matrix
      "["(,"sil_width")           %>% # to, then, take the silhouette widths
      mean                            # and compute their mean
}

What have we just created and what does it suggest? Check out Figure 2:

op <- par(mar=c(4, 4, 1, 1)) # customize the plotting margins
plot(type="b", pch=16,
   xlab="Possible numbers of clusters"  , ylab="Average silhouette width",
   x=as.numeric(names(aver_silh_widths)), y=aver_silh_widths,
   ylim=c(0, max(aver_silh_widths)))
grid()
par(op)
Figure 2: Average silhouette width for our distance matrix

Pretty good evidence we want to stick with 4 clusters …

1.4 Excursus 1: all silhouette widths

At some point, you might want to look into the package fpc, which offers the above statistics but also some more that can be interesting. In particular, since averaging silhouette widths loses some information, it can be interesting to compute and inspect them at a finer level of granularity, namely per cluster. Here’s how we could do this with fpc. (Obviously, we could already do this above with cluster::silhouette, but fpc::cluster.stats is a bit more comprehensive and does the per-cluster averaging for us already.)

First, we again set up a collector structure, but this time it needs to be a list because we will collect differently many silhouette widths for each clustering:

all_silh_widths <- setNames(
   vector(mode="list", length=14),
   2:15)

Then, we loop over all possible numbers of clusters like before, but this time we use fpc::cluster.stats to get the average silhouette width of each individual cluster, not just one overall average:

for (i in 2:15) { # for each possible number of clusters
   all_silh_widths[[i-1]] <- # save into the list all_silh_widths the result of
      cluster.stats(         # applying this function to
         clustering=cutree(clustering, i), # a solution w/ clusters
         d=dist_matrix)$     # and this distance matrix
      clus.avg.silwidths     # to, then, take the avg silhouette width per cluster
}
lapply(all_silh_widths, round, 5) # what we created
$`2`
      1       2
0.20884 0.61849

$`3`
      1       2       3
0.58053 0.61546 0.28536

$`4`
      1       2       3       4
0.56863 0.61094 0.49716 0.57555

$`5`
      1       2       3       4       5
0.54731 0.59350 0.23234 0.56632 0.00000

$`6`
      1       2       3       4       5       6
0.26268 0.59350 0.23234 0.56632 0.00000 0.00000

$`7`
      1       2       3       4       5       6       7
0.26268 0.20178 0.23234 0.56632 0.00000 0.00000 0.00000

$`8`
      1       2       3       4       5       6       7       8
0.26268 0.20178 0.00000 0.55378 0.21298 0.00000 0.00000 0.00000

$`9`
      1       2       3       4       5       6       7       8       9
0.26268 0.00000 0.00000 0.55378 0.37828 0.21298 0.00000 0.00000 0.00000

$`10`
      1       2       3       4       5       6       7       8       9      10
0.26268 0.00000 0.00000 0.05325 0.37828 0.21298 0.16129 0.00000 0.00000 0.00000

$`11`
      1       2       3       4       5       6       7       8       9      10
0.17580 0.00000 0.00000 0.05325 0.37828 0.21298 0.16129 0.00000 0.00000 0.00000
     11
0.00000

$`12`
      1       2       3       4       5       6       7       8       9      10
0.17580 0.00000 0.00000 0.00000 0.37828 0.21298 0.14230 0.00000 0.00000 0.00000
     11      12
0.00000 0.00000

$`13`
      1       2       3       4       5       6       7       8       9      10
0.17580 0.00000 0.00000 0.00000 0.37828 0.21298 0.00000 0.00000 0.00000 0.00000
     11      12      13
0.00000 0.00000 0.00000

$`14`
      1       2       3       4       5       6       7       8       9      10
0.17580 0.00000 0.00000 0.00000 0.37828 0.00000 0.00000 0.00000 0.00000 0.00000
     11      12      13      14
0.00000 0.00000 0.00000 0.00000

$`15`
      1       2       3       4       5       6       7       8       9      10
0.00000 0.00000 0.00000 0.00000 0.00000 0.37828 0.00000 0.00000 0.00000 0.00000
     11      12      13      14      15
0.00000 0.00000 0.00000 0.00000 0.00000 

Now, the average silhouette widths we computed above are the means of these but weighted (hence the weighted.mean) by cluster sizes (hence the w=table(...)):
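For instance, with hypothetical per-cluster widths of 0.2 and 0.6 for clusters of sizes 12 and 4:

```r
weighted.mean(
   c(0.2, 0.6),  # hypothetical per-cluster average silhouette widths
   w=c(12, 4))   # weighted by cluster sizes: (12*0.2 + 4*0.6)/16 = 0.3
```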

aver_silh_widths_from_all <- rep(NA, 14); for (i in 2:15) { 
   aver_silh_widths_from_all[i-1] <-
      all_silh_widths[[i-1]] %>%
      weighted.mean(w=table(cutree(clustering, i)))
}
all.equal(
   aver_silh_widths,
   aver_silh_widths_from_all,
   check.attributes=FALSE)
[1] TRUE

1.5 Excursus 2: more/other cluster diagnostics

The function fpc::cluster.stats also provides a bunch of other measures one could use. Here’s a quick demo that heavily borrows from above but collects more values; I encourage you to look at this in more detail later. We first set up a matrix to collect various results (including, again, average silhouette widths):

more_diagnostics <- matrix( # set up a collector matrix
   rep(NA, 14*4), nrow=14, dimnames=list(
      CLUSTERS=2:15,
      MEASURES=c("AveSilhWidth", "Dunn", "Dunn2", "Entropy")))

And then we loop exactly like above, just collecting more stuff:

for (i in 2:15) {
   curr <-              # save into curr the result of
      cluster.stats(    # applying this function to
         clustering=cutree(clustering, i), # a solution w/ clusters
         d=dist_matrix) # and this distance matrix
   more_diagnostics[i-1,] <- as.numeric( # save into a row of the matrix
      curr[c("avg.silwidth", "dunn",     # these components of curr,
             "dunn2", "entropy")])       # extracted by name to be safe
}
more_diagnostics %>% round(2)
        MEASURES
CLUSTERS AveSilhWidth Dunn Dunn2 Entropy
      2          0.31 0.51  0.81    1.26
      3          0.44 0.78  0.86    1.41
      4          0.56 0.96  1.48    2.00
      5          0.47 0.91  1.01    1.25
      6          0.38 0.86  0.96    1.21
      7          0.27 0.79  0.83    1.10
      8          0.25 0.74  0.90    1.11
      9          0.26 0.68  0.90    1.04
      10         0.15 0.56  0.87    1.06
      11         0.12 0.48  0.86    0.99
      12         0.11 0.44  0.97    1.12
      13         0.10 0.39  0.98    1.01
      14         0.07 0.33  1.08    1.10
      15         0.05 0.24  1.26    1.26

Again, the result is fairly unambiguous: Every column says, 4 clusters it is. (And in fact even the structure of the 4 clusters makes sense, which I am not discussing here.)
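For reference, the Dunn index (fpc’s dunn component) is the smallest between-cluster separation divided by the largest within-cluster diameter, so higher values mean tighter and better-separated clusters; a hand computation on four made-up 1-dimensional points:

```r
pts <- c(0, 1, 10, 11)    # made-up clusters {0, 1} and {10, 11}
min_between <- 10 - 1     # smallest distance across clusters: 9
max_within  <- 1          # largest within-cluster diameter: 1
min_between / max_within  # Dunn index: 9, i.e. very well-separated clusters
```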

One can actually compute p-values for such cluster analyses (even though that doesn’t seem to be widespread). We can use the function pvclust::pvclust for this as follows and plot the resulting object. In that plot, we want to focus on the red AU values, which are 100×(1−p), i.e. values above 95 are ‘significant’:

hca.pear.pvc <- pvclust(    # compute p-value for
   data=d,                  # these data
   method.dist="euclidean", # w/ this similarity/distance metric
   method.hclust="ward.D2", # w/ this amalgamation rule
   nboot=2000,              # w/ 2000 bootstrap replications
   parallel=TRUE,           # w/ multiple threads &
   iseed=1)                 # this random number seed
Creating a temporary cluster...done:
socket cluster with 23 nodes on host 'localhost'
Multiscale bootstrap... Done.
plot(hca.pear.pvc); pvrect(hca.pear.pvc); pvpick(hca.pear.pvc)
$clusters
$clusters[[1]]
[1] "V1C4" "V2C4" "V3C4" "V4C4"

$clusters[[2]]
[1] "V1C2" "V2C2" "V3C2" "V4C2"

$clusters[[3]]
[1] "V1C1" "V2C1" "V3C1" "V4C1"

$clusters[[4]]
[1] "V1C3" "V2C3" "V3C3" "V4C3"


$edges
[1]  7 10 11 12
Figure 3: P-values for the verb-cx sorting data

This approach (and its additional highlighting with pvrect), too, supports 4 clusters as the smallest number of clusters, and of course we also again see (from pvpick) that the clusters are perfectly construction-based.

1.6 Excursus 3: a simulation for this specific result

It might be interesting to see what the chances are to come up with the perfectly construction-based sorting that we got. We could

  • generate a random assignment of each of the 16 stimuli to 1 of 4 clusters of 4 stimuli each;
  • tabulate them in such a way that we see each such cluster’s constructional ‘purity’;
  • quantify that ‘purity’ on the basis of entropy (as before);
  • do this a large number of times (e.g., 20,000) to get a null-hypothesis distribution …
  • which we can compare to our actually observed value:
entropy_collector <- rep(NA, 20000)
set.seed(1); for (i in seq(entropy_collector)) {
   curr_clustering <- table(
      rep(1:4, each=4),
      rownames(d) %>% sample %>% substr(3, 4))
   # and then we compute & collect the entropy just as for the real solution:
   entropy_collector[i] <- apply( curr_clustering, 1, entropy) %>% sum
}

What are the results? As Figure 4 shows, they are pretty clear …

op <- par(mar=c(4, 4, 1, 1)) # customize the plotting margins
hist(entropy_collector, main="", xlim=c(0, 8))         # the simulated entropy sums
abline(v=entropy_of_eval_re_constr, lty=2, col="blue") # the observed entropy sum
text(entropy_of_eval_re_constr, 4000, "obs", col="blue", pos=4) # its label
c("p"=mean(                    # the p-value resulting from this: the proportion of
   entropy_collector<=         # simulated entropy sums at least as extreme as
   entropy_of_eval_re_constr)) # the observed one
par(op) # restore defaults
p
0 
Figure 4: Histogram of summed entropies of 20,000 simulated clusterings

2 Session info

sessionInfo()
R version 4.5.2 (2025-10-31)
Platform: x86_64-pc-linux-gnu
Running under: Pop!_OS 22.04 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so;  LAPACK version 3.10.0

Random number generation:
 RNG:     L'Ecuyer-CMRG
 Normal:  Inversion
 Sample:  Rejection

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

time zone: America/Los_Angeles
tzcode source: system (glibc)

attached base packages:
[1] stats     graphics  grDevices utils     datasets  compiler  methods
[8] base

other attached packages:
[1] pvclust_2.2-0   fpc_2.2-13      cluster_2.1.8.1 amap_0.8-20
[5] STGmisc_1.03    Rcpp_1.1.0      magrittr_2.0.4

loaded via a namespace (and not attached):
 [1] cli_3.6.5         knitr_1.50        rlang_1.1.6       xfun_0.54
 [5] DEoptimR_1.1-4    mclust_6.1.2      diptest_0.77-2    jsonlite_2.0.0
 [9] prabclus_2.3-4    htmltools_0.5.8.1 nnet_7.3-20       stats4_4.5.2
[13] rmarkdown_2.30    modeltools_0.2-24 grid_4.5.2        evaluate_1.0.5
[17] MASS_7.3-65       fastmap_1.2.0     yaml_2.3.10       robustbase_0.99-6
[21] htmlwidgets_1.6.4 rstudioapi_0.17.1 kernlab_0.9-33    flexmix_2.3-20
[25] lattice_0.22-7    digest_0.6.39     class_7.3-23      parallel_4.5.2
[29] tools_4.5.2      

Footnotes

  1. This should sound very familiar especially to people who first learned about linear modeling from an ANOVA perspective, which involves comparing within- and between-group variances.↩︎