1 Application to a linear model

In session 03, we fit a linear model to answer the question of whether the logged reaction time to a word (in ms) varies as a function of

* the Zipf frequency of that word (`ZIPFFREQ`);
* the language that word was presented in (`LANGUAGE`: *english* vs. *spanish*);
* the speaker group that word was presented to (`GROUP`: *english* vs. *heritage*);
* any pairwise interaction of these predictors;
* the three-way interaction of these predictors.
Let’s reload those data and apply the same transformation(s) we applied there earlier:
```r
rm(list=ls(all.names=TRUE)); library(cluster)
library(magrittr)              # for the %>% pipe used below
str(d <- read.delim(           # summarize d, the result of loading
   file="_input/rts.csv",      # this file
   stringsAsFactors=TRUE))     # change categorical variables into factors
d <- d[complete.cases(d),]
d$LANGUAGE <- factor(d$LANGUAGE,      # for didactic reasons only, I am
   levels=levels(d$LANGUAGE)[2:1])    # changing the order of the levels
d$RT_log <- log2(d$RT)
```
Let’s refit our initial model from there …
```r
summary(m_01 <- lm(                    # make/summarize the linear model m_01:
   RT ~ 1 + ZIPFFREQ*LANGUAGE*GROUP,   # RT ~ these predictors & their interactions
   data=d, na.action=na.exclude))      # those vars are in d, skip NAs
```
… and let’s add the predictions of this model to the data frame d:
```r
d$PRED_LM <- predict(m_01)
```
Let’s create a distance matrix that compares each case to each other case while only considering the main effects we also put into our regression model:
```r
distances <- daisy(                            # compute a distance matrix
   x=d[,c("ZIPFFREQ", "LANGUAGE", "GROUP")],   # of this
   metric="gower",                             # use the Gower metric
   type=list(                                  # specify the
      "numeric"="ZIPFFREQ",                    # variable
      "symm"=c("LANGUAGE", "GROUP"))) %>%      # types, then
   as.matrix(ncol=nrow(d))                     # make that a symmetric matrix
str(distances)                   # check the ...
summary(as.numeric(distances))   # ... results
```
```
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
 0.0000  0.2615  0.3899  0.4004  0.6246  1.0000
```
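To see what the Gower metric does with these mixed variable types, here is a minimal hand computation of the distance between the first two cases (a sketch of the logic, not of `daisy`'s internals): the numeric variable contributes its range-normalized absolute difference, each factor contributes a 0/1 mismatch, and the distance is the mean of those contributions.

```r
# a minimal sketch: the Gower distance between case 1 and case 2 'by hand'
zipf_range <- diff(range(d$ZIPFFREQ))                # range used to normalize the numeric variable
gower_1_2 <- mean(c(
   abs(d$ZIPFFREQ[1]-d$ZIPFFREQ[2]) / zipf_range,    # numeric: normalized absolute difference
   d$LANGUAGE[1] != d$LANGUAGE[2],                   # factor: 0 if same level, 1 if not
   d$GROUP[1]    != d$GROUP[2]))                     # factor: 0 if same level, 1 if not
gower_1_2   # should match distances[1,2]
```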
Let’s make a prediction for each case that is based on a weighted mean of the observed reaction times of all those cases whose distance from the current case is among the 10 smallest (unique) distances:
```r
d$PRED_NNB <- rep(NA, nrow(d))   # establish the collector column
for (i in seq(nrow(d))) {        # for each case
   curr <- distances[i,]         # get all distance values
   curr[i] <- NA                 # insert NA for this case to itself
   curr <- exp(-curr)            # reverse polarity to get similarity values
   tenth_highest_sim <- curr %>% unique %>%       # find the 10th-highest
      sort(decreasing=TRUE) %>% "["(10)           # unique similarity value
   picky <- which(curr >= tenth_highest_sim)      # establish the cut-off point
   d$PRED_NNB[i] <-              # compute & store the prediction
      weighted.mean(             # with a weighted mean
         x=d$RT[picky],          # of all reaction times
         w=curr[picky],          # weighted by their sim to the current case
         na.rm=TRUE)             # remove the 1 NA point
}
```
Let’s compare this to the predictions of our linear model:
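For instance, we can plot the two prediction columns against each other and compute their squared correlation (a sketch of one way to do this):

```r
plot(x=d$PRED_LM, y=d$PRED_NNB, pch=16, col="#00000030",     # lm predictions against
   xlab="Predictions from lm", ylab="Predictions from nnb")  # nearest-neighbor predictions
abline(a=0, b=1, col="#0000FF40", lwd=3)   # reference line: where identical predictions would fall
cor(d$PRED_LM, d$PRED_NNB)^2               # squared correlation of the two prediction columns
```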
Again, the results are really similar, which is kinda amazing given how low-tech the nearest-neighbor approach actually is …
2 Application to a binary classification task
In session 05, we fit a binary logistic regression model to answer the question of whether the choice of a genitive construction (*of* vs. *s*) varies as a function of

* a length-based short-before-long effect; this time, however, we hypothesize that what matters is not just the length of the possessor but how the length of the possessor compares to the length of the possessum (operationalized below as `LEN_PORmPUM_LOG`);
* the animacy of the possessor (`POR_ANIMACY`: *animate* vs. *collective* vs. *inanimate* vs. *locative* vs. *temporal*);
* whether the speakers are non-native or native speakers of English (`SPEAKER`: *nns* vs. *ns*);
* any pairwise interaction of these predictors;
* the three-way interaction of these predictors.
Let’s reload those data and apply the same transformation(s) we applied there earlier:
```r
rm(list=ls(all.names=TRUE))
str(d <- read.delim(              # summarize d, the result of loading
   file="_input/genitives.csv",   # this file
   stringsAsFactors=TRUE))        # change categorical variables into factors
```
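The model below uses `LEN_PORmPUM_LOG`, the binary log of the possessor's length minus the binary log of the possessum's length; assuming the length columns are called `POR_LENGTH` and `PUM_LENGTH` (as in the genitives data from session 05), it is computed like this:

```r
d$LEN_PORmPUM_LOG <- log2(d$POR_LENGTH) - log2(d$PUM_LENGTH)   # >0: possessor longer, <0: possessum longer
```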
Let’s recompute the baseline and null deviance from there, …
```r
(baselines <- c(
   "baseline 1"=max(              # make baselines[1] the highest
      prop.table(                 # proportion in the
         table(d$GENITIVE))),     # frequency table of the response
   "baseline 2"=sum(              # make baselines[2] the sum of the
      prop.table(                 # proportions in the
         table(d$GENITIVE))^2)))  # frequency table of the response squared
```
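… and the null deviance, which the McFadden R-squareds below require, from an intercept-only model:

```r
(deviance_null <- deviance(m_00 <- glm(    # null deviance from the intercept-only model m_00
   GENITIVE ~ 1, family=binomial, data=d, na.action=na.exclude)))
```

… let’s refit our initial model from there, …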
```r
summary(m_01 <- glm(                                      # make/summarize the gen. linear model m_01:
   GENITIVE ~ 1 + LEN_PORmPUM_LOG*POR_ANIMACY*SPEAKER,    # GENITIVE ~ these predictors
   data=d, family=binomial, na.action=na.exclude))        # vars are in d, resp = binary, skip NAs
```
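… and let’s add this model’s predictions to the data frame d, both as predicted probabilities of the *s*-genitive and as categorical predictions:

```r
d$PRED_GLM_PP_S <- predict(m_01, type="response")        # predicted probability of the s-genitive
d$PRED_GLM_CAT  <- factor(ifelse(d$PRED_GLM_PP_S>=0.5,   # predict s if that probability is >= 0.5,
   levels(d$GENITIVE)[2], levels(d$GENITIVE)[1]))         # otherwise predict of
```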
Let’s create a distance matrix that compares each case to each other case while only considering the main effects we also put into our regression model:
```r
distances <- daisy(                                        # compute a distance matrix
   x=d[,c("LEN_PORmPUM_LOG", "POR_ANIMACY", "SPEAKER")],   # of this
   metric="gower",                                         # use the Gower metric
   type=list(                                              # specify
      "numeric"="LEN_PORmPUM_LOG",                         # variable
      "factor"=c("POR_ANIMACY", "SPEAKER"))) %>%           # types, then
   as.matrix(ncol=nrow(d))                                 # make that a symmetric matrix
str(distances)                   # check the ...
summary(as.numeric(distances))   # ... results
```
```
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
 0.0000  0.3408  0.3820  0.4058  0.6717  0.9914
```
Let’s make a prediction for each case that is based on a weighted mean of the observed outcomes (i.e., the weighted proportion of *s*-genitives) of all those cases whose distance from the current case is among the 10 smallest (unique) distances:
```r
d$PRED_NNB_PP_S <- rep(NA, nrow(d))   # establish the collector column
for (i in seq(nrow(d))) {             # for each case
   curr <- distances[i,]              # get all distance values
   curr[i] <- NA                      # insert NA for this case to itself
   curr <- exp(-curr)                 # reverse polarity to get similarity values
   tenth_highest_sim <- curr %>% unique %>%        # find the 10th-highest
      sort(decreasing=TRUE) %>% "["(10)            # unique similarity value
   picky <- which(curr >= tenth_highest_sim)       # establish the cut-off point
   d$PRED_NNB_PP_S[i] <-              # compute & store the prediction
      weighted.mean(                  # with a weighted mean
         x=d$GENITIVE[picky]=="s",    # of 'is this an s-genitive?' (TRUE/FALSE)
         w=curr[picky],               # weighted by their sim to the current case
         na.rm=TRUE)                  # remove any missing data
}
d$PRED_NNB_CAT <- factor(ifelse(d$PRED_NNB_PP_S>=0.5,   # predict s if that proportion is >= 0.5,
   levels(d$GENITIVE)[2], levels(d$GENITIVE)[1]))        # otherwise predict of
```
Let’s compare this to the predictions of our generalized linear model:
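First, we correlate the two sets of predicted probabilities and cross-tabulate the two sets of categorical predictions; that table, preds_comparer, is what the agreement statistics below are computed from:

```r
cor(d$PRED_GLM_PP_S, d$PRED_NNB_PP_S)^2   # squared correlation of the predicted probabilities
(preds_comparer <- table(                 # cross-tabulate the categorical predictions:
   GLM=d$PRED_GLM_CAT,                    # glm predictions in the rows
   NNB=d$PRED_NNB_CAT))                   # nearest-neighbor predictions in the columns
```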
list("Cohen's kappa"=cohens.kappa(preds_comparer), # 0.875645783"Matthew's cc "=MCC(preds_comparer), # 0.8762758"'Accuracy'"=mean(d$PRED_GLM_CAT==d$PRED_NNB_CAT)) # 0.954444444
```
$`Cohen's kappa`
Cohen's kappa            se         lower         upper  p (2-tailed)
  0.875645783   0.009486673   0.857052246   0.894239320   0.000000000

$`Matthews' cc `
  MCC/phi
0.8762758

$`'Accuracy'`
[1] 0.9544444
```
Let’s compare the deviances of the two kinds of predictions side by side. For that, we first need to compute PRED_GLM_PP_obs (the predicted probability of the outcome that was actually observed) and its nearest-neighbor equivalent:
```r
d$PRED_GLM_PP_obs <-          # d$PRED_GLM_PP_obs is determined by ifelse
   ifelse(d$GENITIVE=="s",    # if the observed genitive is the 2nd level of the response
      d$PRED_GLM_PP_S,        # take its predicted prob.
      1-d$PRED_GLM_PP_S)      # otherwise take 1 minus its predicted prob.
d$PRED_NNB_PP_obs <-          # d$PRED_NNB_PP_obs is determined by ifelse
   ifelse(d$GENITIVE=="s",    # if the observed genitive is the 2nd level of the response
      d$PRED_NNB_PP_S,        # take its 'predicted prob.'
      1-d$PRED_NNB_PP_S)      # otherwise take 1 minus its 'predicted prob.'
```
Then, we compute the contributions to logloss from those columns:
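Each case contributes the negative log of the probability that was assigned to the outcome actually observed:

```r
d$CONTR2LL_GLM <- -log(d$PRED_GLM_PP_obs)   # glm contributions to logloss
d$CONTR2LL_NNB <- -log(d$PRED_NNB_PP_obs)   # nearest-neighbor contributions to logloss
```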
And then we can compare the deviance of the glm approach to that of the nearest-neighbor one:
```r
sum(d$CONTR2LL_GLM)*2   # same as deviance(m_01)
```
```
[1] 2404.267
```
```r
sum(d$CONTR2LL_NNB)*2   # a deviance equivalent
```
```
[1] Inf
```
Why does that happen?
```r
summary(d$PRED_NNB_PP_obs)
```
```
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
 0.0000  0.6248  0.9230  0.7875  1.0000  1.0000
```
Some of the nearest-neighbor ‘predicted probabilities’ for the observed outcome are exactly 0, and -log(0) is infinite, so the summed logloss blows up. How do we address that? We change the 0-values to half the otherwise smallest ‘predicted probability’:
```r
d$PRED_NNB_PP_obs[d$PRED_NNB_PP_obs==0] <-   # change the 0s to ...
   d$PRED_NNB_PP_obs %>%   # ... of all the 'predicted probabilities',
   unique %>%              # the unique ones,
   sort %>%                # sorted in ascending order,
   "["(2) %>%              # the 2nd value (i.e. the next one after 0),
   "/"(2)                  # divided by 2
d$CONTR2LL_NNB <- -log(d$PRED_NNB_PP_obs)   # recompute contribs to logloss
sum(d$CONTR2LL_NNB)*2                       # a deviance equivalent
```
```
[1] 2459.323
```
What R-squareds (McFadden) result from this?
list("McFadden for glm"= (deviance_null -deviance(m_01)) / deviance_null,"McFadden for nnb"= (deviance_null -sum(d$CONTR2LL_NNB)*2) / deviance_null)
```
$`McFadden for glm`
[1] 0.3995747

$`McFadden for nnb`
[1] 0.3858254
```
Very similar. Let’s look at the real confusion matrices (observed vs. predicted):
list("Confusion matrix for the glm"=c_m_GLM <-table( # confusion matrix: cross-tabulate"OBS"=d$GENITIVE, # observed genitives in the rows"PREDS"=d$PRED_GLM_CAT), # predicted genitives in the columns"Its evaluation"=c( # evaluate the confusion matrix"Prec. for s"=c_m_GLM[ "s", "s"] /sum(c_m_GLM[ , "s"]),"Acc./rec. for s"=c_m_GLM[ "s", "s"] /sum(c_m_GLM[ "s", ]),"Prec. for of"=c_m_GLM["of","of"] /sum(c_m_GLM[ ,"of"]),"Acc./rec. for of"=c_m_GLM["of","of"] /sum(c_m_GLM["of", ]),"Acc. (overall)"=mean(d$GENITIVE==d$PRED_GLM_CAT),"p over baseline"=sum(dbinom(3024:3600, 3600, baselines[1]))),"Confusion matrix for the nnb"=c_m_NNB <-table( # confusion matrix: cross-tabulate"OBS"=d$GENITIVE, # observed genitives in the rows"PREDS"=d$PRED_NNB_CAT), # predicted genitives in the columns"Its evaluation"=c( # evaluate the confusion matrix"Prec. for s"=c_m_NNB[ "s", "s"] /sum(c_m_NNB[ , "s"]),"Acc./rec. for s"=c_m_NNB[ "s", "s"] /sum(c_m_NNB[ "s", ]),"Prec. for of"=c_m_NNB["of","of"] /sum(c_m_NNB[ ,"of"]),"Acc./rec. for of"=c_m_NNB["of","of"] /sum(c_m_NNB["of", ]),"Acc. (overall)"=mean(d$GENITIVE==d$PRED_NNB_CAT),"p over baseline"=sum(dbinom(3012:3600, 3600, baselines[1]))))
```
$`Confusion matrix for the glm`
    PREDS
OBS    of    s
  of 2450  270
  s   306  574

$`Its evaluation`
     Prec. for s  Acc./rec. for s     Prec. for of Acc./rec. for of
    6.800948e-01     6.522727e-01     8.889695e-01     9.007353e-01
  Acc. (overall)  p over baseline
    8.400000e-01     3.296731e-35

$`Confusion matrix for the nnb`
    PREDS
OBS    of    s
  of 2419  301
  s   287  593

$`Its evaluation`
     Prec. for s  Acc./rec. for s     Prec. for of Acc./rec. for of
    6.633110e-01     6.738636e-01     8.939394e-01     8.893382e-01
  Acc. (overall)  p over baseline
    8.366667e-01     1.685128e-32
```
Again, the results are really similar, which is kinda amazing given how low-tech the nearest-neighbor approach actually is …
3 Session info
```r
sessionInfo()
```
```
R version 4.5.0 (2025-04-11)
Platform: x86_64-pc-linux-gnu
Running under: Pop!_OS 22.04 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so;  LAPACK version 3.10.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

time zone: America/Los_Angeles
tzcode source: system (glibc)

attached base packages:
[1] stats     graphics  grDevices utils     datasets  compiler  methods
[8] base

other attached packages:
[1] cluster_2.1.8.1 STGmisc_1.02    Rcpp_1.0.14     magrittr_2.0.3

loaded via a namespace (and not attached):
 [1] digest_0.6.37     fastmap_1.2.0     xfun_0.52         knitr_1.50
 [5] htmltools_0.5.8.1 rmarkdown_2.29    cli_3.6.5         rstudioapi_0.17.1
 [9] tools_4.5.0       evaluate_1.0.3    yaml_2.3.10       rlang_1.1.6
[13] jsonlite_2.0.0    htmlwidgets_1.6.4 MASS_7.3-65
```