Most methods in this course have involved supervised learning methods, i.e. methods where an outcome, a response variable is ‘present’/known and can be used to train some kind of model to predict the outcome in (ideally unseen) data; often such methods are used in hypothesis-testing contexts. The topic of this session is unsupervised learning, i.e. methods where we are trying to discover structure in data, an approach often used in hypothesis-generating contexts. The general approach we will be concerned with is clustering, i.e. approaches that try to find groups – clusters – in data such that the groups have a high degree of within-group/cluster similarity and a low degree of between group/cluster similarity.1
The application of hierarchical cluster analysis we’re dealing with in this session is actually somewhat atypical because it is one where that method is used in what is really more of a hypothesis-testing context. In a Construction Grammar study, Bencini & Goldberg (2000) tested whether subjects sorted sentences into groups on the basis of the verbs in these sentences (H0) or on the basis of the syntactic form of these sentences (H1). In a replication, Gries & Wulff (2005) gave 20 subjects 16 cards, each with one sentence on it. The 16 sentences crossed four verbs (V1, V2, V3, V4) and four constructions (C1, C2, C3, C4), and the subjects were asked to sort the 16 sentences into four stacks of four cards each. For one kind of subsequent evaluation, they counted how often each sentence was grouped into one stack with each other sentence: the more similar two sentences are to each other, the more often they should end up in the same stack. As a result, they obtained a symmetric 16×16 matrix of sentences, which is available in _input/sorting.csv. Load the data and make sure you get the row names because it’s the rows that will be clustered.
```r
rm(list=ls(all.names=TRUE))
pkgs2load <- c("amap", "cluster", "fpc", "magrittr", "pvclust")
suppressMessages(sapply(pkgs2load, library,
   character.only=TRUE, logical.return=TRUE, quietly=TRUE)); rm(pkgs2load)
d <- read.delim(                 # make d the result of loading
   file="_input/sorting.csv",    # this file
   stringsAsFactors=TRUE,        # change categorical variables into factors
   row.names=1)                  # make the first column the row names
source("_helpers/entropy.r")     # load the entropy helper function
```
We want to see whether the sortings by the subjects are more compatible with a verb- or a construction-based sorting style. Let’s take a quick look at the symmetric matrix just to see that the value ranges make sense:
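```r
d[1:6, 1:6]    # the top-left corner of the matrix
apply(         # apply to
   MARGIN=2,   # every column of
   X=d,        # the data frame d
   FUN=range)  # the function range
```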
1.1 The similarity/distance matrix

Let’s convert these co-sorting counts into a distance matrix; Euclidean distances are unproblematic here because all row and column values are in the interval [0, 20]:
```r
dist_matrix <- Dist(     # make dist_matrix the result of Dist
   d,                    # here applied to the whole data frame
   method="euclidean",   # Euclidean distance, and ...
   upper=TRUE)           # make it symmetric (just for human consumption)
dist_matrix %>% as.matrix %>% "["(1:6, 1:6) %>% round(3)
```
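1.2 A cluster analysis using base::hclust

Let’s compute the hierarchical cluster analysis with Ward’s rule as the amalgamation method, then plot it and highlight (and, via rect.hclust, store as whats_where1) a 4-cluster solution:

```r
clustering <- hclust(          # cluster
   dist_matrix,                # this distance matrix
   method="ward.D2")           # other good option: "complete"
op <- par(mar=c(1, 4, 1, 1))   # customize the plotting margins
plot(main="", sub="", xlab="", clustering)
(whats_where1 <- rect.hclust(  # highlight & store
   clustering,                 # this clustering's
   k=4,                        # 4-cluster solution
   border=rainbow(4)))         # w/ 4 colors
par(op)                        # restore defaults
```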
Figure 1: A hierarchical agglomerative cluster analysis
The result could hardly be more obvious, but how could one show this really quickly? By tabulating verbs and constructions in each cluster:
```r
lapply(            # apply to every element of
   whats_where1,   # whats_where1
   \(af) c(        # an anonymous function creating a vector of the
      names(af) %>% substr(1, 2) %>% table,   # table of verbs per cluster, & the
      names(af) %>% substr(3, 4) %>% table))  # table of constructions per cluster
```
Clearly, the elements of each cluster always contain each verb but just one construction. A different way to get the same information uses cutree:

```r
(whats_where2 <- cutree(clustering, 4))  # cut the tree into k=4 clusters
# show for each cluster what's in it:
tapply(                  # apply to
   names(whats_where2),  # the elements clustered
   whats_where2,         # a grouping by cluster
   c)                    # and list them
```
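We can also cross-tabulate the cluster memberships against the verbs and the constructions; since we will need it later, the construction table is stored as eval_re_constr:

```r
table(whats_where2, substr(names(whats_where2), 1, 2))  # the verbs
table(whats_where2, substr(names(whats_where2), 3, 4))  # the constructions
eval_re_constr <-  # store the construction table for later use
   table(whats_where2, substr(names(whats_where2), 3, 4))
```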
… and compute the entropy of each cluster’s construction distribution, summing the four entropies into a single value called entropy_of_eval_re_constr. Here, since the clustering is perfectly aligned with the constructions, entropy_of_eval_re_constr is 0:
```r
entropy_of_eval_re_constr <- apply(
   eval_re_constr, 1, entropy) %>% sum
```
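The helper script _helpers/entropy.r is not shown in this session; a minimal sketch of what such a function plausibly looks like (Shannon entropy over a vector of frequencies; the actual helper may differ) is:

```r
entropy <- function(freqs) {     # hypothetical stand-in for _helpers/entropy.r
   props <- freqs / sum(freqs)   # turn frequencies into proportions
   props <- props[props > 0]     # drop zeros (0 * log2(0) is taken to be 0)
   -sum(props * log2(props))     # Shannon entropy (in bits)
}
entropy(c(4, 0, 0, 0))  # a 'pure' cluster: 0
entropy(c(1, 1, 1, 1))  # a maximally mixed cluster of 4: 2
```

A perfectly construction-based 4-cluster solution thus sums four zeros, which is why entropy_of_eval_re_constr comes out as 0.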
We’ll get back to this.
1.3 Average silhouette widths
One interesting way to determine good numbers of clusters is the notion of average silhouette width, a validation metric of sorts that can help to interpret a cluster solution and that is similar in spirit to the logic underlying ANOVA: it is based on the relation of within-cluster similarity to between-cluster similarity. A cluster solution is good if its clusters have
high degrees of within-cluster similarity and
low degrees of between-cluster similarity (i.e. of similarity to elements in other clusters).
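Concretely, for each element i one computes a, the mean distance to the other members of i’s own cluster, and b, the mean distance to the members of the nearest other cluster; the silhouette width is then s(i) = (b − a) / max(a, b), which falls into [−1, 1]. Here is a self-contained toy illustration in base R (the two made-up clusters below are not our sorting data):

```r
set.seed(1)
xy <- rbind(matrix(rnorm(10, mean=0), ncol=2),  # 5 points near (0, 0)
            matrix(rnorm(10, mean=5), ncol=2))  # 5 points near (5, 5)
cl <- rep(1:2, each=5)                          # their cluster memberships
dm <- as.matrix(dist(xy))                       # Euclidean distance matrix
sil <- sapply(seq_along(cl), function(i) {
   own   <- setdiff(which(cl == cl[i]), i)  # the other members of i's cluster
   other <- which(cl != cl[i])              # w/ only 2 clusters, the 'nearest
                                            # other cluster' is the other one
   a <- mean(dm[i, own])                    # mean within-cluster distance
   b <- mean(dm[i, other])                  # mean distance to the other cluster
   (b - a) / max(a, b)                      # the silhouette width of element i
})
mean(sil)  # the average silhouette width; values near 1 = well-separated
```

With clusters this far apart, every s(i) is close to 1; an element sitting between the two clusters would have an s(i) near 0 or even below it.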
The function cluster::silhouette allows us to compute silhouette widths for a solution with a specific number of clusters, which we can then average to get – with some information loss! – a single score for a specific cluster solution, an average silhouette width:
```r
aver_silh_widths <- setNames(2:15, 2:15)  # other numbers make no sense w/ 16 elements
for (i in aver_silh_widths) {        # for each possible number of clusters
   aver_silh_widths[i-1] <-          # save into the vector aver_silh_widths the result of
      silhouette(                    # computing silhouette widths for
         x=cutree(clustering, i),    # a solution w/ i clusters
         dist=dist_matrix) %>%       # with this distance matrix
      "["(,"sil_width") %>%          # to, then, take the silhouette widths
      mean                           # and compute their mean
}
```
What have we just created and what does it suggest? Check out Figure 2:
```r
op <- par(mar=c(4, 4, 1, 1))  # customize the plotting margins
plot(type="b", pch=16,
   xlab="Possible numbers of clusters", ylab="Average silhouette width",
   x=as.numeric(names(aver_silh_widths)), y=aver_silh_widths,
   ylim=c(0, max(aver_silh_widths)))
grid()
par(op)
```
Figure 2: Average silhouette width for our distance matrix
Pretty good evidence we want to stick with 4 clusters …
1.4 Excursus 1: all silhouette widths
At some point, you might want to look into the package fpc, which has the above statistics but also some more that can be interesting. In particular, since averaging silhouette widths loses some information, it can be interesting to compute and inspect all of them. Here’s how we could do this with fpc. (Obviously, we could already do this above with cluster::silhouette, but fpc::cluster.stats is a bit more comprehensive and does weighted averaging for us already.)
First, we again set up a collector structure, but this time it needs to be a list because we will collect a different number of silhouette widths for each clustering:
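```r
all_silh_widths <- setNames(        # make all_silh_widths
   vector(mode="list", length=14),  # an empty list w/ 14 slots
   2:15)                            # named for k=2, ..., 15
```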
Then, we loop over all possible numbers of clusters like before, but this time we use fpc::cluster.stats to get the average silhouette width of each individual cluster, not just one overall average:
```r
for (i in 2:15) {                          # for each possible number of clusters
   all_silh_widths[[i-1]] <-               # save into the list all_silh_widths the result of
      cluster.stats(                       # applying this function to
         clustering=cutree(clustering, i), # a solution w/ i clusters
         d=dist_matrix)$                   # and this distance matrix,
      clus.avg.silwidths                   # taking each cluster's average silhouette width
}
lapply(all_silh_widths, round, 5)          # what we created
```
Now, the average silhouette widths we computed above are the means of these but weighted (hence the weighted.mean) by cluster sizes (hence the w=table(...)):
```r
aver_silh_widths_from_all <- rep(NA, 14)
for (i in 2:15) {
   aver_silh_widths_from_all[i-1] <- all_silh_widths[[i-1]] %>%
      weighted.mean(w=table(cutree(clustering, i)))
}
all.equal(
   aver_silh_widths, aver_silh_widths_from_all,
   check.attributes=FALSE)
```
[1] TRUE
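As a quick illustration of that arithmetic with made-up numbers (not from our data): a weighted mean multiplies each per-cluster value by its weight, sums these products, and divides by the sum of the weights:

```r
x <- c(0.8, 0.4)     # made-up per-cluster average silhouette widths
w <- c(4, 12)        # made-up cluster sizes (the weights)
weighted.mean(x, w)  # (0.8*4 + 0.4*12) / (4+12) = 0.5
sum(x * w) / sum(w)  # the same computation spelled out
```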
1.5 Excursus 2: more/other cluster diagnostics
The function fpc::cluster.stats also provides a bunch of other measures one could use. Here’s a quick demo that heavily borrows from above but collects more values; I encourage you to look at this in more detail later. We first set up a matrix to collect various results (including, again, average silhouette widths):
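```r
more_diagnostics <- matrix(  # make more_diagnostics a matrix
   rep(NA, 14*4), nrow=14,   # 14 rows (k=2, ..., 15), 4 columns
   dimnames=list(            # w/ these row & column names
      CLUSTERS=2:15,
      MEASURES=c("AveSilhWidth", "Dunn", "Dunn2", "Entropy")))
```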
And then we loop exactly like above, just collecting more stuff:
```r
for (i in 2:15) {
   curr <-                                  # save into curr the result of
      cluster.stats(                        # applying this function to
         clustering=cutree(clustering, i),  # a solution w/ i clusters
         d=dist_matrix)                     # and this distance matrix
   more_diagnostics[i-1,] <-                # save into a row of the matrix
      as.numeric(curr[c(21, 24:26)])        # these parts of curr
}
more_diagnostics %>% round(2)
```
Again, the result is fairly unambiguous: Every column says, 4 clusters it is. (And in fact even the structure of the 4 clusters makes sense, which I am not discussing here.)
One can actually compute p-values for such cluster analyses (even though that doesn’t seem to be widespread). We can use the function pvclust::pvclust for this as follows and plot the resulting object. In that plot, we want to focus on the red values, which are 100×(1−p), i.e. values above 95 are ‘significant’:
```r
hca.pear.pvc <- pvclust(     # compute p-values for
   data=d,                   # these data
   method.dist="euclidean",  # w/ this similarity/distance metric
   method.hclust="ward.D2",  # w/ this amalgamation rule
   nboot=2000,               # w/ 2000 bootstrap replications
   parallel=TRUE,            # w/ multiple threads &
   iseed=1)                  # this random number seed
plot(hca.pear.pvc); pvrect(hca.pear.pvc); pvpick(hca.pear.pvc)
```
Creating a temporary cluster...done:
socket cluster with 23 nodes on host 'localhost'
Multiscale bootstrap... Done.
This approach (and its additional highlighting with pvrect), too, supports 4 clusters as the smallest number of clusters, and of course we also again see (from pvpick) that the clusters are perfectly construction-based.
1.6 Excursus 3: a simulation for this specific result
It might be interesting to see what the chances are to come up with the perfectly construction-based sorting that we got. We could
generate a random assignment of each of the 16 stimuli to 1 of 4 clusters of 4 stimuli each;
tabulate them in such a way that we see each such cluster’s constructional ‘purity’;
quantify that ‘purity’ on the basis of entropy (as before);
do this a large number of times (e.g., 20,000) to get a null-hypothesis distribution …
which we can compare to our actually observed value:
```r
entropy_collector <- rep(NA, 20000)
set.seed(1)
for (i in seq(entropy_collector)) {
   curr_clustering <- table(
      rep(1:4, each=4),
      rownames(d) %>% sample %>% substr(3, 4))
   # and then we compute & collect the entropy just as for the real solution:
   entropy_collector[i] <- apply(
      curr_clustering, 1, entropy) %>% sum
}
```
What are the results? As Figure 4 shows, they are pretty clear …
```r
op <- par(mar=c(4, 4, 1, 1))  # customize the plotting margins
hist(entropy_collector, main="", xlim=c(0, 8))          # the simulated entropy sums
abline(v=entropy_of_eval_re_constr, lty=2, col="blue")  # the observed entropy sum
text(entropy_of_eval_re_constr, 4000, "obs", col="blue", pos=4)  # its label
c("p"=sum(entropy_collector==0))  # the p-value resulting from this
par(op)  # restore defaults
```
p
0
Figure 4: Histogram of summed entropies of 20,000 simulated clusterings
2 Session info
```r
sessionInfo()
```
R version 4.5.2 (2025-10-31)
Platform: x86_64-pc-linux-gnu
Running under: Pop!_OS 22.04 LTS
Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so; LAPACK version 3.10.0
Random number generation:
RNG: L'Ecuyer-CMRG
Normal: Inversion
Sample: Rejection
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
time zone: America/Los_Angeles
tzcode source: system (glibc)
attached base packages:
[1] stats graphics grDevices utils datasets compiler methods
[8] base
other attached packages:
[1] pvclust_2.2-0 fpc_2.2-13 cluster_2.1.8.1 amap_0.8-20
[5] STGmisc_1.03 Rcpp_1.1.0 magrittr_2.0.4
loaded via a namespace (and not attached):
[1] cli_3.6.5 knitr_1.50 rlang_1.1.6 xfun_0.54
[5] DEoptimR_1.1-4 mclust_6.1.2 diptest_0.77-2 jsonlite_2.0.0
[9] prabclus_2.3-4 htmltools_0.5.8.1 nnet_7.3-20 stats4_4.5.2
[13] rmarkdown_2.30 modeltools_0.2-24 grid_4.5.2 evaluate_1.0.5
[17] MASS_7.3-65 fastmap_1.2.0 yaml_2.3.10 robustbase_0.99-6
[21] htmlwidgets_1.6.4 rstudioapi_0.17.1 kernlab_0.9-33 flexmix_2.3-20
[25] lattice_0.22-7 digest_0.6.39 class_7.3-23 parallel_4.5.2
[29] tools_4.5.2
Footnotes
This should sound very familiar especially to people who first learned about linear modeling from an ANOVA perspective, which involves comparing within- and between-group variances.↩︎
Source: “Ling 204: session 10” by Stefan Th. Gries (UC Santa Barbara & JLU Giessen), 11 March 2026.