1 Intro
Let’s load a data set for an example of an ordinal model; the data are in _input/agreementratings.csv, and you can find information about the variables/columns in _input/agreementratings.r. Importantly, we make sure the response variable is an ordered factor:
```r
rm(list=ls(all.names=TRUE))
library(car); library(effects); library(MASS) # & now load small helper functions:
source("_helpers/R2s.r"); source("_helpers/summarize.polr.r")
str(d <- read.delim(                    # summarize d, the result of loading
   file="_input/agreementratings.csv",  # this file
   stringsAsFactors=TRUE))              # change categorical variables into factors
d$JUDGMENT <- factor(d$JUDGMENT, ordered=TRUE) # make the response an ordered factor
summary(d)
```
```
      CASE       JUDGMENT   AGREEMENT        NEGATION    EMPHPRON     SUBJORDER       NUMBERGRAM    PERSONGRAM
 Min.   :   1    -3: 620    no :2077   affirmative:2139   no :2133   gramsem:2173   plural  :1989   first :1381
 1st Qu.:1137    -2: 480    yes:2223   negative   :2161   yes:2167   semgram:2127   singular:2311   second:1561
 Median :2262    -1: 387                                                                            third :1358
 Mean   :2266     0: 329
 3rd Qu.:3394     1: 355
 Max.   :4528     2: 826
                  3:1303
```
We want to determine whether the rating of a Spanish construction involving the semantic role of an experiencer is predictable from

* AGREEMENT, NEGATION, EMPHPRON, SUBJORDER, NUMBERGRAM, and PERSONGRAM;
* any interaction of AGREEMENT with any of the other predictors.
2 Deviance & baseline(s)
Let’s already compute a multinomial kind of baseline for what will be the response variable, JUDGMENT:
(baselines <-c("baseline 1"=max( # make baselines[1] the highestprop.table( # proportion in thetable(d$JUDGMENT))), # frequency table of the response"baseline 2"=sum( # make baselines[2] the sum of theprop.table( # proportions in thetable(d$JUDGMENT))^2))) # frequency table of the response squared
baseline 1 baseline 2
0.3030233 0.1827431
But this is actually not ideal here because it only checks whether a prediction is correct or not, but not how far off it is. If JUDGMENT is 2 and the model predicts 3, it is an incorrect prediction, but it’s also clear that the prediction is much better than if it had been -3, which neither of these baselines can reflect. To address this, let’s compute a baseline that takes the ordinal nature of the response (and the predictions) into consideration. We will use a very simple/crude simulation approach for this: we randomly reshuffle the response variable 1000 times and compute for each reshuffling how far it is from the actual data, where ‘distance’ is operationalized as the sum of the absolute differences between observed and reshuffled ranks. One might visualize that with a small histogram of those reshuffled sums-of-absolute-deviations (a sketch of the simulation and plotting code follows the list); I am adding

* the quantiles one might use as a 95% confidence interval (in orange);
* the sum that corresponds to baseline 1 (in blue);
* the worst possible prediction result, i.e. the score that would result from always guessing as far away as possible from the actual score (in red):
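The chunk that runs this simulation and draws the histogram is not reproduced in this extract (nor is the figure); a minimal sketch along those lines, in which the helper name sim.eval.plot.1, the seed, and the axis settings are illustrative choices, is:

```r
collector <- numeric(1000); set.seed(1)    # 1000 reshuffling results, reproducibly
for (i in seq(collector)) {
   random.judgments <- sample(d$JUDGMENT)  # reshuffle the observed ratings
   collector[i] <- sum(abs(                # sum of absolute rank differences
      as.numeric(random.judgments) - as.numeric(d$JUDGMENT)))
}
sim.eval.plot.1 <- function() {            # helper, re-used below for the model
   hist(collector, main="", xlim=c(0, 22000),
      xlab="Sum of |observed - reshuffled| ranks")
   abline(v=quantile(collector, probs=c(0.025, 0.5, 0.975)), lty=2, col="orange")
   abline(v=sum(abs(as.numeric(d$JUDGMENT)-7)), lty=2, col="blue") # always guess 3 (rank 7)
   abline(v=sum(pmax(7-as.numeric(d$JUDGMENT),                     # always guess the rank
                     as.numeric(d$JUDGMENT)-1)), lty=2, col="red") # farthest from the truth
}
sim.eval.plot.1()
baselines["simulated"] <- median(collector) # record the simulated baseline
```

With this, the simulated baseline (the median of collector, i.e. on average about 2.49 rank steps per case) is available for the comparison with the final model further below.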
3 Exploration & preparation
Some exploration of the relevant variables; here, for instance, PERSONGRAM and AGREEMENT cross-tabulated against the response JUDGMENT:

```
                 -3   -2   -1    0    1    2    3
first  no       173  112   85   50   51   81  105
       yes       64   64   61   52   50  141  292
second no       190  122   86   65   71  139  136
       yes       41   51   47   62   62  173  316
third  no       100   70   54   43   68  124  152
       yes       52   61   54   57   53  168  302
```
4 Modeling & numerical interpretation
Let’s fit our initial regression model:
```r
summary(m.01 <- polr(  # summarize m.01, which models
   JUDGMENT ~ 1 +      # JUDGMENT ~ an overall intercept (1)
   # AGREEMENT interacting w/ all other predictors
   AGREEMENT * (NEGATION+EMPHPRON+SUBJORDER+NUMBERGRAM+PERSONGRAM),
   Hess=TRUE,              # recommended for using summary
   data=d,                 # those vars are in d
   na.action=na.exclude))  # skip cases with NA/missing data
```
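The step-by-step backward selection (dropping one non-significant term at a time and confirming each drop with a likelihood ratio test) is not reproduced in this extract. A minimal sketch of the procedure and of the two models referred to below, m.00 (the intercept-only null model) and m.final (the model that retains AGREEMENT, SUBJORDER, NUMBERGRAM, and the interactions AGREEMENT:SUBJORDER and AGREEMENT:NUMBERGRAM, as the anova output below shows), might look like this:

```r
m.00 <- polr(JUDGMENT ~ 1,   # null model: intercept only, the reference for LRTs & R-squareds
   Hess=TRUE, data=d, na.action=na.exclude)
drop1(m.01,       # drop each droppable predictor at a time from m.01 &
   test="Chisq")  # test its significance w/ a LRT
m.02 <- update(m.01, .~. - AGREEMENT:NEGATION) # remove the first ns interaction ...
anova(m.01, m.02, test="Chisq")                # ... & confirm the removal w/ a LRT
# ... repeat until only significant terms remain; the model that survives is
m.final <- polr(JUDGMENT ~ 1 + AGREEMENT*(SUBJORDER+NUMBERGRAM),
   Hess=TRUE, data=d, na.action=na.exclude)
```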
Let’s also compute the overall p-value for this model:
```r
anova(m.00, m.final,  # compare the null model to m.final
   test="Chisq")      # using a LRT
```
Likelihood ratio tests of ordinal regression models
Response: JUDGMENT
Model
1 1
2 AGREEMENT + SUBJORDER + NUMBERGRAM + AGREEMENT:SUBJORDER + AGREEMENT:NUMBERGRAM
Resid. df Resid. Dev Test Df LR stat. Pr(Chi)
1 4294 15669.11
2 4289 15110.60 1 vs 2 5 558.5191 0
```r
pchisq(              # compute the area under the chi-squared curve
   q=558.5191,       # for this chi-squared value
   df=5,             # at 5 df
   lower.tail=FALSE) # only using the right/upper tail/side
```
[1] 1.848523e-118
On to model quality: We make ‘predictions’ …
```r
d <- cbind(d,       # add to d the columns resulting from
   predict(         # predicting
   m.final,         # from m.final
   type="prob"))    # predicted probabilities
# give those columns better names
names(d)[9:15] <- paste0( # rename those columns to the result of pasting
   "PREDS.PP.",           # "PREDS.PP."
   names(d)[9:15])        # in front of the existing names
d <- cbind(d,             # add to d a column
   PREDS.CAT=predict(     # called PREDS.CAT, the predictions
   m.final,               # from m.final, namely
   type="class"))         # categorical predictions
head(d) # check the result
```
CASE JUDGMENT AGREEMENT NEGATION EMPHPRON SUBJORDER NUMBERGRAM PERSONGRAM
1 1 3 no affirmative yes gramsem plural third
2 2 -3 no affirmative yes gramsem plural third
3 3 3 no affirmative yes gramsem plural third
4 4 3 no affirmative yes gramsem plural third
5 5 -3 no affirmative yes gramsem plural third
6 6 2 no affirmative yes gramsem plural third
PREDS.PP.-3 PREDS.PP.-2 PREDS.PP.-1 PREDS.PP.0 PREDS.PP.1 PREDS.PP.2 PREDS.PP.3 PREDS.CAT
1 0.1827313 0.1410666 0.109845 0.08866082 0.08981659 0.1821742 0.2057054 3
2 0.1827313 0.1410666 0.109845 0.08866082 0.08981659 0.1821742 0.2057054 3
3 0.1827313 0.1410666 0.109845 0.08866082 0.08981659 0.1821742 0.2057054 3
4 0.1827313 0.1410666 0.109845 0.08866082 0.08981659 0.1821742 0.2057054 3
5 0.1827313 0.1410666 0.109845 0.08866082 0.08981659 0.1821742 0.2057054 3
6 0.1827313 0.1410666 0.109845 0.08866082 0.08981659 0.1821742 0.2057054 3
… and then we evaluate them (again, confusion matrices are not as informative here as they were for the previous kinds of models):
```r
(c.m <- table(           # confusion matrix: cross-tabulate
   "OBS"=d$JUDGMENT,     # observed judgments in the rows
   "PREDS"=d$PREDS.CAT)) # predicted judgments in the columns
```
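The per-level and overall accuracies cited below and in the write-up can be computed by comparing observed and predicted categories; a minimal sketch:

```r
c(tapply(as.character(d$JUDGMENT)==d$PREDS.CAT, d$JUDGMENT, mean), # accuracy/recall per level
  "overall"=mean(as.character(d$JUDGMENT)==d$PREDS.CAT))           # & overall accuracy
```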
That means we’re getting 301+1121=1422 out of 4300 right, which corresponds to an accuracy of 0.3306977. But remember, this accuracy is not sensitive to how far off a prediction was: if the actual value is 2, then each of the 666 predictions of 3 is wrong but still much better than each of the 160 predictions of -3. Thus, let’s now see how our final model compares to our simulated baseline:
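The comparison uses the same sum of absolute rank differences as the baseline simulation; a minimal sketch (re-using the illustrative helper sim.eval.plot.1 from above and adding the model’s result to the histogram in green):

```r
sum.of.preddiffs.from.obs <- sum(abs(                   # how far off the model's predictions are,
   as.numeric(d$PREDS.CAT) - as.numeric(d$JUDGMENT)))   # summed over all 4300 cases
sim.eval.plot.1()                                       # the baseline histogram from above
abline(v=sum.of.preddiffs.from.obs, lty=2, col="green") # & the model's result in green
```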
In terms of significance, it does really well: the observed sum of absolute differences is so small it’s never observed in the random reshuffles:
sum(collector <= sum.of.preddiffs.from.obs)
[1] 0
At the same time, the effect size is again not great. Our simulated baseline meant that, on average, we’d be off from the right judgment by 2.49 rank steps on our 7-point scale; now, with all our modeling efforts, we’re still off by an average of 2.16 rank steps, not exactly great:
median(collector) /4300
[1] 2.49
sum.of.preddiffs.from.obs /4300
[1] 2.155116
This also means that our improvement is quite small: compared to the simulated baseline, we reduce the average distance (in rank steps) from the right judgment by only 13.45%:
(2.49-2.155116) /2.49
[1] 0.1344916
Plus, we need the model’s R2-values; we compute both McFadden’s R2 and the more widely-used version of Nagelkerke’s R2:
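In the full version this is done with the R2s() helper sourced at the top of the document; the call is simply:

```r
R2s(m.final) # helper from _helpers/R2s.r: McFadden's & Nagelkerke's R2 for m.final
```

(The companion helper summarize.polr(), also sourced above, bundles much of the preceding evaluation into one call.)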
5 Visual interpretation
Now let’s visualize the results of the two interactions based on the predictions; we begin with AGREEMENT:SUBJORDER:
```r
agsu.d <- data.frame(     # make agsu.d a data frame of
   agsu <- effect(        # the effect
   "AGREEMENT:SUBJORDER", # of AGREEMENT:SUBJORDER
   m.final))              # in m.final
agsu.d[,1:9] # show just the predicted probabilities
plot(agsu,                      # plot the effect of AGREEMENT:SUBJORDER
   type="probability",          # but plot predicted probabilities
   ylim=c(0, 1),                # w/ these y-axis limits
   grid=TRUE,                   # & w/ a grid
   lines=list(multiline=TRUE),  # make all lines be in one panel
   confint=list(style="auto"),  # make the lines come with confidence bands
   lattice=list(key.args=list(  # make the legend ...
      columns=5,                # ... have 5 columns &
      cex.title=0)))            # no title
plot(agsu,                      # plot the effect of AGREEMENT:SUBJORDER
   type="probability",          # but plot predicted probabilities
   ylim=c(0, 1),                # w/ these y-axis limits
   x.var="SUBJORDER",           # w/ this variable on the x-axis
   grid=TRUE,                   # & w/ a grid
   lines=list(multiline=TRUE),  # make all lines be in one panel
   confint=list(style="auto"),  # make the lines come with confidence bands
   lattice=list(key.args=list(  # make the legend ...
      columns=5,                # ... have 5 columns &
      cex.title=0)))            # no title
```
Then, the other interaction, AGREEMENT:NUMBERGRAM:
```r
agnu.d <- data.frame(      # make agnu.d a data frame of
   agnu <- effect(         # the effect
   "AGREEMENT:NUMBERGRAM", # of AGREEMENT:NUMBERGRAM
   m.final))               # in m.final
agnu.d[,1:9] # show just the predicted probabilities
plot(agnu,                      # plot the effect of AGREEMENT:NUMBERGRAM
   type="probability",          # but plot predicted probabilities
   ylim=c(0, 1),                # w/ these y-axis limits
   grid=TRUE,                   # & w/ a grid
   lines=list(multiline=TRUE),  # make all lines be in one panel
   confint=list(style="auto"),  # make the lines come with confidence bands
   lattice=list(key.args=list(  # make the legend ...
      columns=5,                # ... have 5 columns &
      cex.title=0)))            # no title
plot(agnu,                      # plot the effect of AGREEMENT:NUMBERGRAM
   type="probability",          # but plot predicted probabilities
   ylim=c(0, 1),                # w/ these y-axis limits
   x.var="NUMBERGRAM",          # w/ this variable on the x-axis
   grid=TRUE,                   # & w/ a grid
   lines=list(multiline=TRUE),  # make all lines be in one panel
   confint=list(style="auto"),  # make the lines come with confidence bands
   lattice=list(key.args=list(  # make the legend ...
      columns=5,                # ... have 5 columns &
      cex.title=0)))            # no title
```
6 Write-up
To determine whether the rating of a Spanish construction involving the semantic role of an experiencer is predictable from

* AGREEMENT, NEGATION, EMPHPRON, SUBJORDER, NUMBERGRAM, and PERSONGRAM;
* any interaction of AGREEMENT with any of the other predictors,

an ordinal model selection process was undertaken (using backwards stepwise model selection of all predictors and controls based on significance testing). The final model resulting from the elimination of non-significant predictors contained the interactions AGREEMENT:SUBJORDER and AGREEMENT:NUMBERGRAM and was highly significant (LR=558.52, df=5, p<0.0001), but with a very low amount of explanatory/predictive power (McFadden’s R2=0.036, Nagelkerke’s R2=0.125); the overall accuracy of the model (in terms of making the one exactly correct prediction for each case) is 33.07%, but the model really only predicts the highest rating (86% accuracy/recall) and the lowest rating (48.5% accuracy/recall). The summary table for the model is shown below:
|                                               | Estimate | 95%-CI         | se   | z      | p (2-tailed) |
|:----------------------------------------------|:---------|:---------------|:-----|:-------|:-------------|
| Intercept -3→-2                               | -1.50    | [-1.65, -1.34] | 0.08 | -19.06 | <0.0001      |
| Intercept -2→-1                               | -0.74    | [-0.88, -0.59] | 0.07 | -9.90  | <0.0001      |
| Intercept -1→0                                | -0.27    | [-0.41, -0.12] | 0.07 | -3.65  | 0.0003       |
| Intercept 0→1                                 | 0.09     | [-0.05, 0.23]  | 0.07 | 1.22   | 0.2213       |
| Intercept 1→2                                 | 0.46     | [0.31, 0.60]   | 0.07 | 6.23   | <0.0001      |
| Intercept 2→3                                 | 1.35     | [1.20, 1.50]   | 0.08 | 17.78  | <0.0001      |
| AGREEMENT no→yes                              | 0.71     | [0.52, 0.90]   | 0.10 | 7.34   | <0.0001      |
| SUBJORDER gramsem→semgram                     | 0.27     | [0.11, 0.42]   | 0.08 | 3.42   | 0.0006       |
| NUMBERGRAM plural→sing                        | -0.55    | [-0.70, -0.39] | 0.08 | -6.94  | <0.0001      |
| AGREEMENT no→yes : SUBJORDER gramsem→semgram  | 0.40     | [0.18, 0.61]   | 0.11 | 3.61   | 0.0003       |
| AGREEMENT no→yes : NUMBERGRAM plural→sing     | 0.47     | [0.25, 0.68]   | 0.11 | 4.24   | <0.0001      |
The way the two interactions in the model are correlated with JUDGMENT is as follows [show plots]:

* with regard to AGREEMENT:SUBJORDER,
  + when AGREEMENT is no, all levels of JUDGMENT have relatively low predicted probabilities, but
    - when SUBJORDER is gramsem, JUDGMENT is predicted to be -3 (with a predicted probability of 0.231);
    - when SUBJORDER is semgram, JUDGMENT is predicted to be +3 (with a predicted probability of 0.201);
  + when AGREEMENT is yes, most levels of JUDGMENT have relatively low predicted probabilities, but
    - the model predicts 3, followed by 2, regardless of SUBJORDER, but
    - the predicted probability of JUDGMENT being 3 is much higher when SUBJORDER is semgram than when it is gramsem;
  + AGREEMENT being yes strongly boosts the predicted probability of JUDGMENT being 3;
* with regard to AGREEMENT:NUMBERGRAM,
  + when AGREEMENT is no, all levels of JUDGMENT have relatively low predicted probabilities, but
    - when NUMBERGRAM is plural, JUDGMENT is predicted to be 3 (with a predicted probability of 0.228);
    - when NUMBERGRAM is singular, JUDGMENT is predicted to be -3 (with a predicted probability of 0.253);
  + when AGREEMENT is yes, most levels of JUDGMENT have relatively low predicted probabilities, but
    - the model predicts 3, followed by 2, regardless of NUMBERGRAM, but
  + AGREEMENT being yes strongly boosts the predicted probability of JUDGMENT being 3.
7 Excursus: could we have run a linear model?
Many people run linear models in cases where, strictly speaking, one should use ordinal models (because, like here, the response variable is an ordinal ranking), and they do that most likely because (i) like me, they hate ordinal models for their complexity and/or (ii) they don’t know ordinal models (well (enough)). Thus, the question might arise whether we could have done the same here and fit a much more straightforward linear model – let’s find out in two steps: in this session, we go through the model selection process; in the next session, we look at model diagnostics.
Let’s do some linear modeling:
```r
d$JUDGMENT <- as.numeric(as.character(d$JUDGMENT)) # or as.numeric(d$JUDGMENT)-4
summary(m.01.lin <- lm(  # summarize m.01.lin, which models
   JUDGMENT ~ 1 +        # JUDGMENT ~ an overall intercept (1)
   # AGREEMENT interacting w/ all other predictors
   AGREEMENT * (NEGATION+EMPHPRON+SUBJORDER+NUMBERGRAM+PERSONGRAM),
   data=d,                 # those vars are in d
   na.action=na.exclude))  # skip cases with NA/missing data
```
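As with the ordinal model, each deletion below is justified by checking which term contributes least; in the linear case this is done with F-tests, e.g. for the first step:

```r
drop1(m.01.lin, # drop each droppable predictor at a time from m.01.lin &
   test="F")    # test its significance w/ an F-test
```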
Let’s fit the 2nd model with the interaction AGREEMENT:NEGATION deleted:
```r
summary(m.02.lin <- update( # make m.02.lin an update of
   m.01.lin, .~.            # m.01.lin, namely all of it (.~.), but then
   - AGREEMENT:NEGATION))   # remove this interaction
```
Let’s fit the 3rd model with the main effect NEGATION deleted:
```r
summary(m.03.lin <- update( # make m.03.lin an update of
   m.02.lin, .~.            # m.02.lin, namely all of it (.~.), but then
   - NEGATION))             # remove this main effect
```
Let’s fit the 4th model with the interaction AGREEMENT:EMPHPRON deleted:
```r
summary(m.04.lin <- update( # make m.04.lin an update of
   m.03.lin, .~.            # m.03.lin, namely all of it (.~.), but then
   - AGREEMENT:EMPHPRON))   # remove this interaction
```
Let’s fit the 5th model with the main effect EMPHPRON deleted:
```r
summary(m.05.lin <- update( # make m.05.lin an update of
   m.04.lin, .~.            # m.04.lin, namely all of it (.~.), but then
   - EMPHPRON))             # remove this main effect
```
Let’s fit the 6th model with the interaction AGREEMENT:NUMBERGRAM deleted:
```r
summary(m.06.lin <- update( # make m.06.lin an update of
   m.05.lin, .~.            # m.05.lin, namely all of it (.~.), but then
   - AGREEMENT:NUMBERGRAM)) # remove this interaction
```
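At this point no further term can be dropped without a significant loss of fit, so the 6th model becomes the final linear model. To compare it with the ordinal model, we can plot the one interaction both final models share, AGREEMENT:SUBJORDER; a minimal sketch along the lines of the full version:

```r
summary(m.final.lin <- m.06.lin) # the final linear model
plot(effect("AGREEMENT:SUBJORDER", m.final.lin), # the interaction retained by both final models
   ylim=c(-3, 3), grid=TRUE)
plot(effect("AGREEMENT:SUBJORDER", m.final.lin), # the same effect,
   ylim=c(-3, 3), grid=TRUE, x.var="SUBJORDER")  # w/ SUBJORDER on the x-axis
```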
With regard to this interaction at least, the result from the final linear model is quite similar to the one in our final ordinal model:

* when AGREEMENT is yes, then subjects rated the stimuli more highly than when AGREEMENT is no (as expected), but that difference is much larger when SUBJORDER is semgram than when it is gramsem;
* from the reverse perspective, when SUBJORDER is semgram, then subjects rated the stimuli more highly than when SUBJORDER is gramsem, but that difference is much larger when AGREEMENT is yes than when it is no.
However, if we compare the R2s (at least the ones that can, more or less, be compared), then there’s quite some difference:

* McFadden’s R2 for the final ordinal model was 0.0356, i.e. very small;
* (McFadden’s) R2 for the final linear model was 0.1219, i.e. small but considerably better.
So, the results are mixed, but not particularly encouraging in the sense of allowing us to go with the linear model. We’ll reach a more definitive verdict during the next session.
8 Homework
To prepare for the next session, read SFLWR3, Sections 5.6 & 5.7.
---title: "Ling 202: session 07: ordinal regression (key)"author: - name: "[Stefan Th. Gries](https://www.stgries.info)" affiliation: - UC Santa Barbara - JLU Giessen orcid: 0000-0002-6497-3958date: "2025-05-14 12:34:56"date-format: "DD MMM YYYY HH-mm-ss"editor: sourceformat: html: page-layout: full code-fold: false code-link: true code-copy: true code-tools: true code-line-numbers: true code-overflow: scroll number-sections: true smooth-scroll: true toc: true toc-depth: 4 number-depth: 4 toc-location: left monofont: lucida console tbl-cap-location: top fig-cap-location: bottom fig-width: 5 fig-height: 5 fig-format: png fig-dpi: 300 fig-align: center embed-resources: trueexecute: cache: false echo: true eval: true warning: false---# IntroLet's load a data set for an example of an ordinal model; the data are in [_input/agreementratings.csv](_input/agreementratings.csv) and you can find information about the variables/columns in [_input/agreementratings.r](_input/agreementratings.r). Importantly, we make sure the response variable is an ordered factor:```{r}rm(list=ls(all.names=TRUE))library(car); library(effects); library(MASS) # & now load small helper functions:source("_helpers/R2s.r"); source("_helpers/summarize.polr.r")str(d <-read.delim( # summarize d, the result of loadingfile="_input/agreementratings.csv", # this filestringsAsFactors=TRUE)) # change categorical variables into factorsd$JUDGMENT <-factor(d$JUDGMENT, ordered=TRUE)summary(d)```We want to determine whether the rating of a Spanish construction involving the semantic role of an experiencer is predictable from* `AGREEMENT`, `NEGATION`, `EMPHPRON`, `SUBJORDER`, `NUMPERGRAM`, and `PERSONGRAM`;* any interaction of `AGREEMENT` with any of the other predictors.# Deviance & baseline(s)Let's already compute a multinomial kind of baseline for what will be the response variable, `JUDGMENT`:```{r}(baselines <-c("baseline 1"=max( # make baselines[1] the highestprop.table( # proportion in thetable(d$JUDGMENT))), # frequency table of the response"baseline 2"=sum( # make baselines[2] the sum of theprop.table( # proportions in thetable(d$JUDGMENT))^2))) # frequency table of the response squared```But this is actually not ideal here because it only checks whether a prediction is correct or not, but not how far off it is. If `JUDGMENT` is 2 and the model predicts 3, it is an incorrect prediction, but it's also clear that the prediction is much better than if it had been -3, which neither of these baselines can reflect. To address this, let's compute a baseline that takes the ordinal nature of the response (and the predictions) into consideration. We will use a very simple/crude simulation approach for this: We will randomly reshuffle the response variable 1000 times and compute for each reshuffling how far the random reshufflings are away from the actual data, where the 'distance' is operationalized as the sum of the absolute differences between observed and reshuffled ranks. This is how one might visualize that with a small histogram of those reshuffled sums-of-absolute-deviations; I am adding* the quantiles one might use as a 95% confidence interval (in orange);* the sum that corresponds to baseline 1 (in blue);* the worst possible prediction result, i.e. 
the score that would result from always guessing as far away as possible from the actual score, (in red):```{r}op <-par(mar=c(4,4,1,1))collector <-numeric(1000); set.seed(1); for (i inseq(collector)) { random.judgments <-sample(d$JUDGMENT) collector[i] <-"-"(as.numeric(random.judgments),as.numeric(d$JUDGMENT)) %>% abs %>% sum}sim.eval.plot.1<-function () {hist(collector, xlim=c(0, 22000), main="")abline(v=quantile(collector, probs=c(0.025, 0.5, 0.975)), lty=2, col="orange")abline(v=sum(abs(as.numeric(d$JUDGMENT)-7)), lty=2, col="blue")abline(v=sum(pmax(7-as.numeric(d$JUDGMENT), abs(1-as.numeric(d$JUDGMENT)))), lty=2, col="red")}; sim.eval.plot.1()par(op)```With this, we also update our vector `baselines`:```{r}baselines["simulated"] <-median(collector); baselines```Let's compute the deviance of the null model:```{r}deviance(m.00<-polr(JUDGMENT ~1, Hess=TRUE, data=d, na.action=na.exclude))```# Exploration & preparationSome exploration of the relevant variables:```{r}# the predictor(s)/response on its/their own# just check the summary for the main effectstable(d$NEGATION, d$AGREEMENT)table(d$EMPHPRON, d$AGREEMENT)table(d$SUBJORDER, d$AGREEMENT)table(d$NUMBERGRAM, d$AGREEMENT)table(d$PERSONGRAM, d$AGREEMENT)# the predictor(s) w/ the responsetable(d$AGREEMENT, d$JUDGMENT)ftable(d$NEGATION, d$AGREEMENT, d$JUDGMENT)ftable(d$EMPHPRON, d$AGREEMENT, d$JUDGMENT)ftable(d$SUBJORDER, d$AGREEMENT, d$JUDGMENT)ftable(d$NUMBERGRAM, d$AGREEMENT, d$JUDGMENT)ftable(d$PERSONGRAM, d$AGREEMENT, d$JUDGMENT)```# Modeling & numerical interpretationLet's fit our initial regression model:```{r}summary(m.01<-polr( # summarize m.01, which models JUDGMENT ~1+# JUDGMENT ~ an overall intercept (1)# AGREEMENT interacting w/ all other predictors AGREEMENT * (NEGATION+EMPHPRON+SUBJORDER+NUMBERGRAM+PERSONGRAM),Hess=TRUE, # recommended for using summarydata=d, # those vars are in dna.action=na.exclude)) # skip cases with NA/missing data```Let's also compute the *t*-values for these coefficient values (however uninterpretable those are):```{r}(m.01.tscores <-summary(m.01)$coefficients[,"t value"])```And from that we compute the *p*-values:```{r}(m.01.pvalues <-2*pnorm(abs(m.01.tscores), lower.tail=FALSE))```Also, let's quickly check the model's overall significance value and see which predictors may get dropped:```{r}anova(update(m.01, ~1), # compare the null model generated here ad hoc m.01, test="Chisq") # to the model w/ the predictor using a LRTdrop1(m.01, # drop each droppable predictor at a time from m.01 &test="Chisq") # test its significance w/ a LRT```Let's fit the 2nd model with the interaction `AGREEMENT:NEGATION` deleted:```{r}summary(m.02<-update( # make m.02 an update of m.01, .~. # m.01, namely all of it (.~.), but then- AGREEMENT:NEGATION)) # remove this interactionanova(m.01, m.02, # compare m.01 to m.02test="Chisq") # using a LRTdrop1(m.02, # drop each droppable predictor at a time from m.02 &test="Chisq") # test its significance w/ a LRT```Let's fit the 3rd model with the main effect `NEGATION` deleted:```{r}summary(m.03<-update( # make m.03 an update of m.02, .~. # m.02, namely all of it (.~.), but then- NEGATION)) # remove this predictoranova(m.02, m.03, # compare m.02 to m.03test="Chisq") # using a LRTdrop1(m.03, # drop each droppable predictor at a time from m.03 &test="Chisq") # test its significance w/ a LRT```Let's fit the 4th model with the interaction `AGREEMENT:PERSONGRAM` deleted:```{r}summary(m.04<-update( # make m.04 an update of m.03, .~. 
# m.03, namely all of it (.~.), but then- AGREEMENT:PERSONGRAM)) # remove this interactionanova(m.03, m.04, # compare m.03 to m.04test="Chisq") # using a LRTdrop1(m.04, # drop each droppable predictor at a time from m.04 &test="Chisq") # test its significance w/ a LRT```Let's fit the 5th model with the interaction `AGREEMENT:EMPHPRON` deleted:```{r}summary(m.05<-update( # make m.05 an update of m.04, .~. # m.04, namely all of it (.~.), but then- AGREEMENT:EMPHPRON)) # remove this interactionanova(m.04, m.05, # compare m.04 to m.05test="Chisq") # using a LRTdrop1(m.05, # drop each droppable predictor at a time from m.05 &test="Chisq") # test its significance w/ a LRT```Let's fit the 6th model with the main effect `EMPHPRON` deleted:```{r}summary(m.06<-update( # make m.06 an update of m.05, .~. # m.05, namely all of it (.~.), but then- EMPHPRON)) # remove this predictoranova(m.05, m.06, # compare m.05 to m.06test="Chisq") # using a LRTdrop1(m.06, # drop each droppable predictor at a time from m.06 &test="Chisq") # test its significance w/ a LRT```Let's fit the 7th model with the predictor `PERSONGRAM` deleted:```{r}summary(m.07<-update( # make m.07 an update of m.06, .~. # m.06, namely all of it (.~.), but then- PERSONGRAM)) # remove this predictoranova(m.06, m.07, # compare m.06 to m.07test="Chisq") # using a LRTdrop1(m.07, # drop each droppable predictor at a time from m.07 &test="Chisq") # test its significance w/ a LRT```No, we can't delete any more predictors, we're done with model selection:```{r}summary(m.final <- m.07); rm(m.07); invisible(gc())(m.final.tscores <-summary(m.final)$coefficients[,"t value"])(m.final.pvalues <-2*pnorm(abs(m.final.tscores), lower.tail=FALSE))```Let's also compute the overall *p*-value for this model:```{r}anova(m.00, m.final, # compare the null model to m.finaltest="Chisq") # using a LRTpchisq( # compute the area under the chi-squared curveq=558.5191, # for this chi-squared valuedf=5, # at 5 dflower.tail=FALSE) # only using the right/upper tail/side```On to model quality: We make 'predictions' ...```{r}d <-cbind(d, # add to d the columns resulting frompredict( # predicting m.final, # from m.finaltype="prob")) # predicted probabilities# give those columns better namesnames(d)[9:15] <-paste0( # rename those columns to the result of pasting"PREDS.PP.", # "PREDS.PP."names(d)[9:15]) # in front of the existing namesd <-cbind(d, # add to d a columnPREDS.CAT=predict( # called PREDS.CAT, the predictions m.final, # from m.final, namelytype="class")) # categorical predictionshead(d) # check the result```... and then we evaluate them (again, confusion matrices here are not as good as for the previous kinds of models):```{r}(c.m <-table( # confusion matrix: cross-tabulate"OBS"=d$JUDGMENT, # observed judgments in the rows"PREDS"=d$PREDS.CAT)) # predicted judgments in the columnsc(tapply(as.character(d$JUDGMENT)==d$PREDS.CAT, d$JUDGMENT, mean), # accuracies for ..."overall"=mean(as.character(d$JUDGMENT)==d$PREDS.CAT)) # ... each level & overall```That means, we're getting 301+1121=1422 out of 4300 right, which is the accuracy of 0.3306977. But remember, this accuracy is not sensitive to how far off a prediction was: if the actual value is 2, then each of the 666 predictions of 3 is wrong but still much better than each of the 160 predictions of -3. 
Thus, let's now see how our final model compares to our simulated baseline:```{r}sum.of.preddiffs.from.obs <-"-"(as.numeric(d$PREDS.CAT),as.numeric(d$JUDGMENT)) %>% abs %>% sumop <-par(mar=c(4,4,1,1))sim.eval.plot.1()abline(v=sum.of.preddiffs.from.obs, lty=2, col="green")par(op)```In terms of significance, it does really well: the observed sum of absolute differences is so small it's never observed in the random reshuffles:```{r}sum(collector <= sum.of.preddiffs.from.obs)```At the same, the effect size is again not great. Our simulated baseline meant that, on average, we'd misestimate the right judgment by 2.49 points on our 7-point scale and now, with all our modeling efforts, we're still off by an average of 2.16 steps, not exactly great:```{r}median(collector) /4300sum.of.preddiffs.from.obs /4300```Which also means that our improvement is quite small, because it's only a 13.45% improvement in rank steps away from the right judgment:```{r}(2.49-2.155116) /2.49```Plus, we need the model's *R*^2^-values and compute both McFadden's *R*^2^ and the more widely-used version of Nagelkerke's *R*^2^:```{r}R2s(m.final)```Of course, if you paid attention to the book and worked through everything there, much of this you could have done with this:```{r}summarize.polr(m.final)```# Visual interpretationNow let's visualize the results of the two interactions based on the predictions; we begin with `AGREEMENT:SUBJORDER`:```{r}#| fig-width: 8agsu.d <-data.frame( # make agsu.d a data frame of agsu <-effect( # the effect"AGREEMENT:SUBJORDER", # of AGREEMENT:SUBJORDER m.final)) # in m.finalagsu.d[,1:9] # show just the predicted probabilitiesplot(agsu, # plot the effect of AGREEMENT:SUBJORDERtype="probability", # but plot predicted probabilitiesylim=c(0, 1), # w/ these y-axis limitsgrid=TRUE, # & w/ a gridlines=list(multiline=TRUE), # make all lines be in one panelconfint=list(style="auto"), # make the lines come with confidence bandslattice=list(key.args=list( # make the legend ...columns=5, # ... have 5 columns &cex.title=0))) # no titleplot(agsu, # plot the effect of AGREEMENT:SUBJORDERtype="probability", # but plot predicted probabilitiesylim=c(0, 1), # w/ these y-axis limitsx.var="SUBJORDER", # w/ this variable on the x-axisgrid=TRUE, # & w/ a gridlines=list(multiline=TRUE), # make all lines be in one panelconfint=list(style="auto"), # make the lines come with confidence bandslattice=list(key.args=list( # make the legend ...columns=5, # ... have 5 columns &cex.title=0))) # no title```Then, the other interaction, `AGREEMENT:NUMBERGRAM`:```{r}#| fig-width: 8agnu.d <-data.frame( # make agnu.d a data frame of agnu <-effect( # the effect"AGREEMENT:NUMBERGRAM", # of AGREEMENT:NUMBERGRAM m.final)) # in m.finalagnu.d[,1:9] # show just the predicted probabilitiesplot(agnu, # plot the effect of AGREEMENT:NUMBERGRAMtype="probability", # but plot predicted probabilitiesylim=c(0, 1), # w/ these y-axis limitsgrid=TRUE, # & w/ a gridlines=list(multiline=TRUE), # make all lines be in one panelconfint=list(style="auto"), # make the lines come with confidence bandslattice=list(key.args=list( # make the legend ...columns=5, # ... 
have 5 columns &cex.title=0))) # no titleplot(agnu, # plot the effect of AGREEMENT:NUMBERGRAMtype="probability", # but plot predicted probabilitiesylim=c(0, 1), # w/ these y-axis limitsx.var="NUMBERGRAM", # w/ this variable on the x-axisgrid=TRUE, # & w/ a gridlines=list(multiline=TRUE), # make all lines be in one panelconfint=list(style="auto"), # make the lines come with confidence bandslattice=list(key.args=list( # make the legend ...columns=5, # ... have 5 columns &cex.title=0))) # no title```# Write-upTo determine whether the rating of a Spanish construction involving the semantic role of an experiencer is predictable from* `AGREEMENT`, `NEGATION`, `EMPHPRON`, `SUBJORDER`, `NUMBERGRAM`, and `PERSONGRAM`;* any interaction of `AGREEMENT` with any of the other predictors,an ordinal model selection process was undertaken (using backwards stepwise model selection of all predictors and controls based on significance testing). The final model resulting from the elimination of ns predictors contained the interactions `AGREEMENT:SUBJORDER` and `AGREEMENT:NUMBERGRAM` and was highly significant (*LR*=558.52, *df*=5, *p*<0.0001) with a very low amount of explanatory/predictive power (McFadden's *R*^2^=0.036, Nagelkerke's *R*^2^=0.125); the overall accuracy of the model (in terms of making the one exactly correct prediction for each case) is 33.07%, but the model really only predicts the highest rating (86% accuracy/recall) and the lowest rating (48.5% accuracy/recall). The summary table for the model is shown below:| | Estimate | 95%-CI | *se* | *z* | *p*~2-tailed~ ||:-------------------------------------------------------|:---------|:---------------|:-----|:-------|:--------------|| Intercept~-3→-2~ | -1.50 | [-1.65, -1.34] | 0.08 | -19.06 | <0.0001 || Intercept~-2→-1~ | -0.74 | [-0.88, -0.59] | 0.07 | -9.90 | <0.0001 || Intercept~-1→0~ | -0.27 | [-0.41, -0.12] | 0.07 | -3.65 | 0.0003 || Intercept~0→1~ | 0.09 | [-0.05, 0.23] | 0.07 | 1.22 | 0.2213 || Intercept~1→2~ | 0.46 | [0.31, 0.60] | 0.07 | 6.23 | <0.0001 || Intercept~2→3~ | 1.35 | [1.20, 1.50] | 0.08 | 17.78 | <0.0001 || AGREEMENT~no→yes~ | 0.71 | [0.52, 0.90] | 0.10 | 7.34 | <0.0001 || SUBJORDER~gramsem→semgram~ | 0.27 | [0.11, 0.42] | 0.08 | 3.42 | 0.0006 || NUMBERGRAM~plural→sing~ | -0.55 | [-0.70, -0.39] | 0.08 | -6.94 | <0.0001 || AGREEMENT~no→yes~:SUBJORDER~gramsem→semgram~ | 0.40 | [0.18, 0.61] | 0.11 | 3.61 | 0.0003 || AGREEMENT~no→yes~:NUMBERGRAM~plural→sing~ | 0.47 | [0.25, 0.68] | 0.11 | 4.24 | <0.0001 |The way the two interactions in the model are correlated with `JUDGMENT` are as follows [show plots]:* with regard to `AGREEMENT:SUBJORDER`, + when `AGREEMENT` is *no*, all levels of `JUDGMENT` have relatively low predicted probabilities, but - when `SUBJORDER` is *gramsem*, `JUDGMENT` is predicted to be -3 (with a predicted probability of 0.231); - when `SUBJORDER` is *semgram*, `JUDGMENT` is predicted to be +3 (with a predicted probability of 0.201); + when `AGREEMENT` is *yes*, most levels of `JUDGMENT` have relatively low predicted probabilities, but - the model predicts 3, followed by 2 regardless of `SUBJORDER`, but - the predicted probability of `JUDGMENT` being 3 is much higher when `SUBJORDER` is *semgram* than when it is *gramsem*; + `AGREEMENT` being *yes* strongly boosts the predicted probability of `JUDGMENT` being 3;* with regard to `AGREEMENT:NUMBERGRAM`, + when `AGREEMENT` is *no*, all levels of `JUDGMENT` have relatively low predicted probabilities, but - when `NUMBERGRAM` is *plural*, `JUDGMENT` is predicted to 
be 3 (with a predicted probability of 0.228); - when `NUMBERGRAM` is *singular*, `JUDGMENT` is predicted to be -3 (with a predicted probability of 0.253); + when `AGREEMENT` is *yes*, most levels of `JUDGMENT` have relatively low predicted probabilities, but - the model predicts 3, followed by 2 regardless of `NUMBERGRAM`, but + `AGREEMENT` being *yes* strongly boosts the predicted probability of `JUDGMENT` being 3.# Excursus: could we have run a linear model?Many people run linear models in cases where, strictly speaking, one should use ordinal models (because, like here, the response variable is an ordinal ranking), and they do that most likely because (i) like me, they hate ordinal models for their complexity and/or (ii) they don't know ordinal models (well (enough)). Thus, the question might arise whether we could have done the same here and fit a much more straightforward linear model -- let's find out in two steps: in this session, we go through the model selection process; in the next session, we look at model diagnostics.Let's do some linear modeling:```{r}d$JUDGMENT <-as.numeric(as.character(d$JUDGMENT)) # or as.numeric(d$JUDGMENT)-4summary(m.01.lin <-lm( # summarize m.01.lin, which models JUDGMENT ~1+# JUDGMENT ~ an overall intercept (1)# AGREEMENT interacting w/ all other predictors AGREEMENT * (NEGATION+EMPHPRON+SUBJORDER+NUMBERGRAM+PERSONGRAM),data=d, # those vars are in dna.action=na.exclude)) # skip cases with NA/missing datadrop1(m.01.lin, # drop each droppable predictor at a time from m.01.lin &test="F") # test its significance w/ an F-test```Let’s fit the 2nd model with the interaction `AGREEMENT:NEGATION` deleted:```{r}summary(m.02.lin <-update( # make m.02.lin an update of m.01.lin, .~. # m.01.lin, namely all of it (.~.), but then- AGREEMENT:NEGATION)) # remove this interactionanova(m.01.lin, m.02.lin, # compare m.01.lin to m.02.lintest="F") # using an F-testdrop1(m.02.lin, # drop each droppable predictor at a time from m.02.lin &test="F") # test its significance w/ an F-test```Let’s fit the 3rd model with the main effect `NEGATION` deleted:```{r}summary(m.03.lin <-update( # make m.03.lin an update of m.02.lin, .~. # m.02.lin, namely all of it (.~.), but then- NEGATION)) # remove this main effectanova(m.02.lin, m.03.lin, # compare m.02.lin to m.03.lintest="F") # using an F-testdrop1(m.03.lin, # drop each droppable predictor at a time from m.03.lin &test="F") # test its significance w/ an F-test```Let’s fit the 4th model with the interaction `AGREEMENT:EMPHPRON` deleted:```{r}summary(m.04.lin <-update( # make m.04.lin an update of m.03.lin, .~. # m.03.lin, namely all of it (.~.), but then- AGREEMENT:EMPHPRON)) # remove this interactionanova(m.03.lin, m.04.lin, # compare m.03.lin to m.04.lintest="F") # using an F-testdrop1(m.04.lin, # drop each droppable predictor at a time from m.04.lin &test="F") # test its significance w/ an F-test```Let’s fit the 5th model with the main effect `EMPHPRON` deleted:```{r}summary(m.05.lin <-update( # make m.05.lin an update of m.04.lin, .~. # m.04.lin, namely all of it (.~.), but then- EMPHPRON)) # remove this main effectanova(m.04.lin, m.05.lin, # compare m.04.lin to m.05.lintest="F") # using an F-testdrop1(m.05.lin, # drop each droppable predictor at a time from m.05.lin &test="F") # test its significance w/ an F-test```Let’s fit the 6th model with the interaction `AGREEMENT:AGREEMENT:NUMBERGRAM` deleted:```{r}summary(m.06.lin <-update( # make m.06.lin an update of m.05.lin, .~. 
# m.05.lin, namely all of it (.~.), but then- AGREEMENT:NUMBERGRAM)) # remove this interactionanova(m.05.lin, m.06.lin, # compare m.05.lin to m.06.lintest="F") # using an F-testdrop1(m.06.lin, # drop each droppable predictor at a time from m.06.lin &test="F") # test its significance w/ an F-test```Now, we can’t delete any more predictors, we’re done with model selection:```{r}summary(m.final.lin <- m.06.lin); rm(m.06.lin); invisible(gc())```So, how do the models compare?| Predictor | Final ordinal model | Final linear model | Present & directly comparable ||:-----------------------|:--------------------|:-------------------|:------------------------------|| `AGREEMENT` | yes | yes | || `NEGATION` | | | || `EMPHPRON` | | | || `SUBJORDER` | yes | yes | || `NUMBERGRAM` | yes | | || `PERSONGRAM ` | | yes | || `AGREEMENT:NEGATION` | | | || `AGREEMENT:EMPHPRON` | | | || `AGREEMENT:SUBJORDER` | yes | yes | yes || `AGREEMENT:NUMBERGRAM` | yes | | || `AGREEMENT:PERSONGRAM` | | yes | |So let's make at least that one straightforward comparison:```{r}plot(effect("AGREEMENT:SUBJORDER", m.final.lin), ylim=c(-3, 3), grid=TRUE)plot(effect("AGREEMENT:SUBJORDER", m.final.lin), ylim=c(-3, 3), grid=TRUE, x.var="SUBJORDER")```With regard to this interaction at least, the the result from the final linear model is quite similar to the one in our final ordinal model:* when `AGREEMENT` is *yes*, then subjects rated the stimuli more highly than when `AGREEMENT` is *no* (as expected), but that difference is much higher when `SUBJORDER` is *semgram* than when it is *gramsem*;* from the reverse perspective, when `SUBJORDER` is *semgram*, then subjects rated the stimuli more highly than when `SUBJORDER` is *gramsem*, but that difference is much higher when `AGREEMENT` is *yes* than when it is *no*.However, if we compare the *R*^2^s, and we need to compare the ones that can be compared, sort of, then there's quite some difference:* McFadden's *R*^2^ for the final ordinal model was 0.0356, i.e. very small;* (McFadden's) *R*^2^ for the final linear model was 0.1219, i.e. small but considerably better.So, the results are mixed, but not particularly encouraging in the sense of allowing us to go with the linear model. We'll reach a more definitive verdict during the next session.# HomeworkTo prepare for next session, read *SFLWR*^3^, Section 5.6 & 5.7.# Session info```{r}sessionInfo()```