In an earlier session, we fit a linear model to answer the question whether the logged reaction time to a word (in ms) varies as a function of
the Zipf frequency of that word (ZIPFFREQ);
the language that word was presented in (LANGUAGE: english vs. spanish);
the speaker group that word was presented to (GROUP: english vs. heritage);
any pairwise interaction of these predictors;
the three-way interaction of these predictors.
Let’s reload those data and apply the same transformation(s) etc. that we applied earlier:
rm(list=ls(all.names=TRUE)); library(effects)
str(d <- read.delim(        # summarize d, the result of loading
  file="_input/rts.csv",    # this file
  stringsAsFactors=TRUE))   # change categorical variables into factors
d$LANGUAGE <- factor(d$LANGUAGE,   # for didactic reasons only, I am
  levels=levels(d$LANGUAGE)[2:1])  # changing the order of the levels
d$RT_log <- log2(d$RT)
And let’s refit our initial model from there:
summary(m_01 <- lm(                 # make/summarize the linear model m_01:
  RT ~ 1 + ZIPFFREQ*LANGUAGE*GROUP, # RT ~ these predictors & their interactions
  data=d, na.action=na.exclude))    # those vars are in d, skip NAs
Call:
lm(formula = RT ~ 1 + ZIPFFREQ * LANGUAGE * GROUP, data = d,
na.action = na.exclude)
Residuals:
Min 1Q Median 3Q Max
-489.7 -141.2 -54.5 67.4 3388.8
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1145.59 46.97 24.392 < 2e-16 ***
ZIPFFREQ -82.48 10.15 -8.126 5.12e-16 ***
LANGUAGEenglish -432.15 69.57 -6.212 5.50e-10 ***
GROUPheritage -177.07 64.64 -2.739 0.006171 **
ZIPFFREQ:LANGUAGEenglish 51.65 14.97 3.450 0.000563 ***
ZIPFFREQ:GROUPheritage 23.49 14.00 1.677 0.093489 .
LANGUAGEenglish:GROUPheritage 306.90 97.24 3.156 0.001604 **
ZIPFFREQ:LANGUAGEenglish:GROUPheritage -40.88 20.94 -1.953 0.050891 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 250.4 on 7744 degrees of freedom
(248 observations deleted due to missingness)
Multiple R-squared: 0.09514, Adjusted R-squared: 0.09432
F-statistic: 116.3 on 7 and 7744 DF, p-value: < 2.2e-16
Did we have enough data points for what we did there?
How many data points did we model? Careful, it’s not nrow(d)=8000 – it’s nrow(model.frame(m_01))=7752, namely nrow(d)=8000 minus the missing data points! With that, we ask ourselves
what is the limiting sample size? Generally speaking, m is the sample size and, thus, our critical value is between m/10=775.2 and m/20=387.6.
what is the number of predictors/coefficients p we tried to estimate from the data in this case? The most direct way of getting that number is summary(m_01)$fstatistic[2], which amounts to 7, which is of course the length of the model’s coefficient output minus 1: length(coef(m_01))-1=7.
Since p (7) is less than either m/10=775.2 or m/20=387.6, we’re fine.
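By the way, if you want this rule of thumb as a small reusable helper for lm objects, here is a minimal sketch (the function check_m_p and its name are mine, not part of the session materials):
check_m_p <- function(model) {     # rule-of-thumb check for an lm object
  m <- nrow(model.frame(model))    # cases actually modeled (missing data dropped)
  p <- length(coef(model)) - 1     # coefficients estimated (minus the intercept)
  c(p=p, "m/10"=m/10, "m/20"=m/20, # report p & both critical values, plus
    ok=p <= m/20)                  # whether even the stricter bound is respected
}
check_m_p(m_01) # for m_01: p=7, m/10=775.2, m/20=387.6, ok=1 (i.e. TRUE)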
1.1.2 Binary logistic regression
In session 05, we fit a binary logistic regression model to answer the question whether the choice of a genitive construction (of vs. s) varies as a function of
a length-based short-before-long effect – but, this time, if we hypothesize a short-before-long effect, maybe we should not just be looking at the length of the possessor, but at how the length of the possessor compares to the length of the possessum (operationalized below as LEN_PORmPUM_LOG);
the animacy of the possessor (POR_ANIMACY: animate vs. collective vs. inanimate vs. locative vs. temporal);
whether the speakers are non-native speakers or native speakers of English (SPEAKER: nns vs. ns);
any pairwise interaction of these predictors;
the three-way interaction of these predictors.
Let’s reload those data and apply the same transformation(s) etc. that we applied earlier:
rm(list=ls(all.names=TRUE))
str(d <- read.delim(            # summarize d, the result of loading
  file="_input/genitives.csv",  # this file
  stringsAsFactors=TRUE))       # change categorical variables into factors
d$LEN_PORmPUM_LOG <- log2(d$POR_LENGTH) - log2(d$PUM_LENGTH) # log2 length difference
And let’s refit our initial model from there:
summary(m_01 <- glm(                                  # make/summarize the gen. linear model m_01:
  GENITIVE ~ 1 + LEN_PORmPUM_LOG*POR_ANIMACY*SPEAKER, # GENITIVE ~ these predictors
  data=d, family=binomial, na.action=na.exclude))     # vars are in d, resp = binary, skip NAs
Did we have enough data points for what we did there?
How many data points did we model? This time it is nrow(d)=3600 and of course also nrow(model.frame(m_01))=3600. With that, we ask ourselves
what is the limiting sample size? For a binary response, m is computed as min(table(d$GENITIVE)), i.e. the frequency of the rarer of the two outcomes, and, thus, our critical value is between m/10=88 and m/20=44.
what is the number of predictors/coefficients p we tried to estimate from the data in this case? The most direct way of getting that number is m_01$df.null-m_01$df.residual, which amounts to 19, which is of course the length of the model’s coefficient output minus 1: length(coef(m_01))-1=19.
Since p (19) is less than either m/10=88 or m/20=44, we’re fine.
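The same kind of check, restated in three lines for the binary case (a sketch assuming d and m_01 are still in the workspace):
m <- min(table(d$GENITIVE))          # limiting sample size: the rarer genitive (880)
p <- m_01$df.null - m_01$df.residual # coefficients estimated (19)
c(p=p, "m/10"=m/10, "m/20"=m/20)     # 19 < 44 < 88: even m/20 is respected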
1.1.3 Ordinal logistic regression
In session 07, we fit an ordinal logistic regression model to answer the question(s) whether the rating of a Spanish construction involving the semantic role of an experiencer varies as a function of
whether there is the right subject-verb agreement (AGREEMENT: no vs. yes);
whether the clause is negated (NEGATION: affirmative vs. negative);
whether the clause contains an emphatic pronoun (EMPHPRON: no vs. yes);
what the order of the semantic and the grammatical subject is (SUBJORDER: gramsem vs. semgram);
what the number of the grammatical subject is (NUMBERGRAM: plural vs. singular);
what the person of the grammatical subject is (PERSONGRAM: first vs. second vs. third);
any interaction of a predictor with AGREEMENT.
Let’s reload those data and apply the same transformation(s) etc. that we applied earlier:
rm(list=ls(all.names=TRUE))
library(car); library(effects); library(MASS); library(magrittr)
str(d <- read.delim(                  # summarize d, the result of loading
  file="_input/agreementratings.csv", # this file
  stringsAsFactors=TRUE))             # change categorical variables into factors
d$JUDGMENT <- factor(d$JUDGMENT, ordered=TRUE) # the response is an ordered factor
And let’s refit our initial model from there:
summary(m_01 <- polr(    # summarize m_01, which models
  JUDGMENT ~ 1 +         # JUDGMENT ~ an overall intercept (1)
  # AGREEMENT interacting w/ all other predictors
  AGREEMENT * (NEGATION+EMPHPRON+SUBJORDER+NUMBERGRAM+PERSONGRAM),
  Hess=TRUE,             # recommended for using summary
  data=d,                # those vars are in d
  na.action=na.exclude)) # skip cases with NA/missing data
Did we have enough data points for what we did there?
How many data points did we model? It again is nrow(d)=4300 and of course also nrow(model.frame(m_01))=4300; for polr objects, we could also get this from m_01$n, which also returns 4300. With that, we ask ourselves
what is the limiting sample size? For an ordinal response, m is computed as m_01$n - (1/m_01$n^2) * sum(table(d$JUDGMENT)^3)=4123.5 and, thus, our critical value is between m/10=412.4 and m/20=206.2.
what is the number of predictors/coefficients p we tried to estimate from the data in this case? We can get that directly from the model object with m_01$edf (edf for ‘effective degrees of freedom’) and we get 19.
Since p (19) is less than either m/10=412.4 or m/20=206.2, we’re fine.
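And once more as a small sketch, the same computation broken into steps (the variable names n, ni, and m are mine; d and m_01 are assumed to still be in the workspace):
n  <- m_01$n                            # number of cases modeled (4300)
ni <- table(d$JUDGMENT)                 # number of cases per rating category
(m <- n - sum(ni^3) / n^2)              # the effective sample size from above: 4123.5
c(p=m_01$edf, "m/10"=m/10, "m/20"=m/20) # 19 vs. 412.4 & 206.2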
1.2 Residuals (of a linear model)
1.2.1 Example 1
In an earlier session, we fit a linear model to answer the question whether the reaction time to a word (in ms) varies as a function of the frequency of that word (normalized to per million words, FREQPMW). We first re-load the data and refit the model, …
rm(list=ls(all.names=TRUE))
library(car); library(effects); library(magrittr)
str(d <- read.delim(        # summarize d, the result of loading
  file="_input/rts.csv",    # this file
  stringsAsFactors=TRUE))   # change categorical variables into factors
summary(m_01 <- lm(              # make/summarize the linear model m_01:
  RT ~ 1 + FREQPMW,              # RT ~ this predictor
  data=d, na.action=na.exclude)) # vars are in d, skip NAs
Call:
lm(formula = RT ~ 1 + FREQPMW, data = d, na.action = na.exclude)
Residuals:
Min 1Q Median 3Q Max
-396.3 -156.1 -68.9 72.4 3468.1
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 673.83892 3.65267 184.479 < 2e-16 ***
FREQPMW -0.14926 0.02546 -5.863 4.73e-09 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 262.5 on 7750 degrees of freedom
(248 observations deleted due to missingness)
Multiple R-squared: 0.004416, Adjusted R-squared: 0.004287
F-statistic: 34.37 on 1 and 7750 DF, p-value: 4.734e-09
… and now we generate three model-diagnostic plots for this model to see whether this model was a ‘good one’:
par(mfrow=c(1, 3))    # make the plotting window have 1 row & 3 columns
plot(m_01, which=1:2) # plot model-diagnostic plots 1 & 2
hist(residuals(m_01), main="Residuals") # add a histogram of the residuals
par(mfrow=c(1, 1))    # reset to default
Obviously, these look terrible! What can we do? There are several ways to address this, some of which we already discussed; they include, but are not limited to, any one or more of the following:
doing something about the response variable before the model, such as
  transforming the response variable (e.g., logging or some other transformation);
  restricting the response variable (e.g., picking shortest densest regions);
doing something about the predictor before the model, such as
  transforming the predictor (e.g., logging or some other transformation);
  allowing the predictor to be (more) curved;
doing something about the data points that, after modeling, are accounted for worst, which could be done
  by determining all data points’ (standardized) residuals with stdres or residuals and
  excluding those with high absolute (standardized) residuals;
adding more predictors (to further reduce the deviance).
Let’s look at each one in isolation, beginning with transforming the response. As you can see, that is already somewhat useful:
d$RTLOG <- log2(d$RT)            # transforming the response variable
m_01.resptransf <- lm(           # make the linear model m_01.resptransf:
  RTLOG ~ 1 + FREQPMW,           # RTLOG ~ this predictor
  data=d, na.action=na.exclude)  # vars are in d, skip NAs
par(mfrow=c(1, 3))               # make the plotting window have 1 row & 3 columns
plot(m_01.resptransf, which=1:2) # plot model-diagnostic plots 1 & 2
hist(residuals(m_01.resptransf), main="Residuals") # add a histogram of the residuals
par(mfrow=c(1, 1))               # reset to default
Next we try restricting the response, which seems to work very well, even better than the log transformation:
table(keep <- d$RT>=338 & d$RT<=1042, useNA="ifany") # restricting the response
FALSE TRUE <NA>
552 7200 248
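In case you are wondering where bounds like 338 and 1042 might come from, here is a minimal sketch of one way to find the shortest region containing a given proportion of the response; the 90% used here is an assumption of this sketch, and the earlier session may have derived the bounds differently:
rt <- sort(na.omit(d$RT))                # the non-missing RTs, sorted
n_in <- ceiling(0.9 * length(rt))        # how many points the region must contain
widths <- rt[n_in:length(rt)] -          # widths of all windows that
  rt[1:(length(rt) - n_in + 1)]          # contain exactly n_in points
lo <- which.min(widths)                  # where the narrowest such window starts
c(lower=rt[lo], upper=rt[lo + n_in - 1]) # the bounds of that window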
m_01.resprestr <- lm(           # make the linear model m_01.resprestr:
  RT ~ 1 + FREQPMW,             # RT ~ this predictor
  data=d, subset=keep,          # vars are in d, but use only this subset
  na.action=na.exclude)         # skip NAs
par(mfrow=c(1, 3))              # make the plotting window have 1 row & 3 columns
plot(m_01.resprestr, which=1:2) # plot model-diagnostic plots 1 & 2
hist(residuals(m_01.resprestr), main="Residuals") # add a histogram of the residuals
par(mfrow=c(1, 1))              # reset to default
Now what about changing things on the predictor side? For that we could use the predictor ZIPFFREQ, which is a log-transformed version of FREQPMW, but the result is not very promising!
m_01.predtransf <- lm(           # make the linear model m_01.predtransf:
  RT ~ 1 + ZIPFFREQ,             # RT ~ this predictor
  data=d, na.action=na.exclude)  # vars are in d, skip NAs
par(mfrow=c(1, 3))               # make the plotting window have 1 row & 3 columns
plot(m_01.predtransf, which=1:2) # plot model-diagnostic plots 1 & 2
hist(residuals(m_01.predtransf), main="Residuals") # add a histogram of the residuals
par(mfrow=c(1, 1))               # reset to default
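(As an aside, the Zipf scale is commonly defined as the base-10 log of a word’s frequency per billion words; under that assumption, which I have not verified against how ZIPFFREQ in rts.csv was actually computed, the two frequency columns should relate roughly like this:)
zipf_assumed <- log10(d$FREQPMW * 1000)             # per million * 1000 = per billion
head(cbind(given=d$ZIPFFREQ, assumed=zipf_assumed)) # eyeball both side by side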
We could also allow the predictor to be curved (something we’ll deal with in more detail in 204), but that’s again not great:
m_01.predpoly <- lm(             # make the linear model m_01.predpoly:
  RT ~ 1 + poly(FREQPMW, 3),     # RT ~ this predictor (curved)
  data=d, na.action=na.exclude)  # vars are in d, skip NAs
par(mfrow=c(1, 3))               # make the plotting window have 1 row & 3 columns
plot(m_01.predpoly, which=1:2)   # plot model-diagnostic plots 1 & 2
hist(residuals(m_01.predpoly), main="Residuals") # add a histogram of the residuals
par(mfrow=c(1, 1))               # reset to default
Interim summary: working on the response seems more promising than working on the predictor, and of the two ways of dealing with the response, the restriction strategy seems to be better.
Now, what about trying to fix this after the first bad model? Let’s exclude all data points with absolute standardized residuals >2.5. It seems as if some of the previous approaches worked better than this one:
table(keep <- abs(stdres(m_01)) < 2.5, useNA="ifany") # flag the non-extreme residuals
m_01.stdres <- lm(            # make the linear model m_01.stdres:
  RT ~ 1 + FREQPMW,           # RT ~ this predictor
  data=d, subset=keep,        # vars are in d, but use only this subset
  na.action=na.exclude)       # skip NAs
par(mfrow=c(1, 3))            # make the plotting window have 1 row & 3 columns
plot(m_01.stdres, which=1:2)  # plot model-diagnostic plots 1 & 2
hist(residuals(m_01.stdres), main="Residuals") # add a histogram of the residuals
par(mfrow=c(1, 1))            # reset to default
Finally, note also that we’re just using a single predictor here, meaning we’re asking that one frequency predictor to do a lot – if other predictors are included, the situation can often change quite a bit for the better. Thus, if one
restricts the response (as above), and
uses the log-transformed predictor ZIPFFREQ and lets it be curved, and
adds the control LENGTH and lets it be curved, …
… then the diagnostics look fairly good:
table(keep <- d$RT>=338 & d$RT<=1042, useNA="ifany") # restrict the response again
m_01.multiple <- lm(           # make the linear model m_01.multiple:
  RT ~ 1 + poly(ZIPFFREQ, 2) + # RT ~ ZIPFFREQ (curved)
  poly(LENGTH, 2),             # + a control variable (also curved)
  data=d, subset=keep,         # vars are in d, but use only this subset
  na.action=na.exclude)        # skip cases with NA/missing data (good habit)
par(mfrow=c(1, 3))             # make the plotting window have 1 row & 3 columns
plot(m_01.multiple, which=1:2) # plot model-diagnostic plots 1 & 2
hist(residuals(m_01.multiple), main="Residuals") # add a histogram of the residuals
par(mfrow=c(1, 1))             # reset to default
… and the predictor and the control behave as hypothesized:
plot(effect("poly(ZIPFFREQ, 2)", m_01.multiple), ylim=c(338, 1042), grid=TRUE)
plot(effect("poly(LENGTH, 2)", m_01.multiple), ylim=c(338, 1042), grid=TRUE)
As you can see, there may be no simple solution …
1.2.2 Example 2
Remember that the residuals of a linear model are supposed to be normally distributed and, when plotted against the fitted values, to exhibit constant variance and no particular patterning. With that in mind, let’s check this for m_final_lin from last session, where we forced a linear model onto ordinal data. We first re-load the data and refit the model, …
str(d_ord <- read.delim("_input/agreementratings.csv", stringsAsFactors=TRUE))
summary(m_final_lin <- lm(JUDGMENT ~ AGREEMENT * (SUBJORDER + PERSONGRAM) + NUMBERGRAM, data=d_ord))
… and then we plot diagnostic plots:
par(mfrow=c(1, 3)) # make the plotting window have 1 row & 3 columns
plot(main="Residuals vs. fitted values", pch=16, col="#00000010", axes=TRUE,
  xlab="Fitted values", xlim=c(-1, 2), x=jitter(predict(m_final_lin)),
  ylab="Residuals",                    y=residuals(m_final_lin))
axis(1, at=seq(-1, 2, 1)); axis(2); grid()
lines(lowess(residuals(m_final_lin) ~ predict(m_final_lin)), col="red")
qqnorm(residuals(m_final_lin)); qqline(residuals(m_final_lin)); grid()
hist(residuals(m_final_lin), breaks="FD", main="Residuals")
par(mfrow=c(1, 1)) # reset to default
This doesn’t look great: there is structure in the first plot (which of course is in part due to the 7 discrete levels of the response), and neither the middle nor the right plot exactly screams ‘the residuals are normally distributed’. Based on these diagnostic plots, I would advise against adopting the linear model here – yes, it is more straightforward and at least one effect is comparable, but the data don’t really make the linear model seem appropriate.
1.3 Influential observations
The previous part was concerned with a regression model that may not have passed its diagnostic tests. Apart from “removing data points that the regression can only deal with very badly” (i.e. working with residuals), we can also try and identify which points have a perhaps unduly high influence on the regression line or predictions, e.g. by identifying data points with extreme dfbetas (or Cook’s D-values).
rm(list=ls(all.names=TRUE)); library(effects)
str(d <- read.delim(        # summarize d, the result of loading
  file="_input/rts.csv",    # this file
  stringsAsFactors=TRUE))   # change categorical variables into factors
Does that help for the linear model m_01 we fit with FREQPMW above?
summary(m_01 <- lm(              # make/summarize the linear model m_01:
  RT ~ 1 + FREQPMW,              # RT ~ FREQPMW
  data=d, na.action=na.exclude)) # those vars are in d, skip NAs
Call:
lm(formula = RT ~ 1 + FREQPMW, data = d, na.action = na.exclude)
Residuals:
Min 1Q Median 3Q Max
-396.3 -156.1 -68.9 72.4 3468.1
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 673.83892 3.65267 184.479 < 2e-16 ***
FREQPMW -0.14926 0.02546 -5.863 4.73e-09 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 262.5 on 7750 degrees of freedom
(248 observations deleted due to missingness)
Multiple R-squared: 0.004416, Adjusted R-squared: 0.004287
F-statistic: 34.37 on 1 and 7750 DF, p-value: 4.734e-09
# compute dfbetas for each coefficient & each observation
summary(dfbetas_m_01 <- dfbeta(m_01))
(Intercept) FREQPMW
Min. :-4.380e-01 Min. :-1.676e-03
1st Qu.:-2.170e-02 1st Qu.:-5.219e-05
Median :-6.234e-03 Median : 4.671e-06
Mean :-3.470e-06 Mean : 5.230e-08
3rd Qu.: 8.259e-03 3rd Qu.: 6.134e-05
Max. : 4.587e-01 Max. : 6.253e-03
# find the values that cut off 5% on each side:
apply(dfbetas_m_01, 2, quantile, probs=c(0.05, 0.95))
# get ready to keep only the non-extreme values:
keep <- (dfbetas_m_01[,1] >= -0.04028195 & dfbetas_m_01[,1] <= 0.06873367) &
  (dfbetas_m_01[,2] > -0.000296758 & dfbetas_m_01[,2] <= 0.0001671488)
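An alternative (or complement) to dfbetas would be Cook’s D-values; a minimal sketch, where the 4/n cut-off is just one common rule of thumb, not a recommendation from this session:
cooks <- cooks.distance(m_01)                   # one influence value per observation
keep_cooks <- cooks < 4/nrow(model.frame(m_01)) # flag values below the cut-off
table(keep_cooks, useNA="ifany")                # how many would be kept/dropped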
Seems like we could try and get a better fit if we eliminate the roughly 10% of values with the most extreme dfbetas (5% on each side, per coefficient):
summary(m_01_inflnc <- lm( # make/summarize the linear model m_01_inflnc:
  RT ~ 1 + FREQPMW,        # RT ~ this predictor
  data=d, subset=keep,     # vars are in d, but use only this subset
  na.action=na.exclude))   # skip NAs
Call:
lm(formula = RT ~ 1 + FREQPMW, data = d, subset = keep, na.action = na.exclude)
Residuals:
Min 1Q Median 3Q Max
-284.63 -107.34 -28.29 80.79 600.41
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 643.08196 2.32562 276.52 <2e-16 ***
FREQPMW -0.31856 0.02216 -14.37 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 141.5 on 6561 degrees of freedom
(248 observations deleted due to missingness)
Multiple R-squared: 0.03053, Adjusted R-squared: 0.03038
F-statistic: 206.6 on 1 and 6561 DF, p-value: < 2.2e-16
par(mfrow=c(1, 3))           # make the plotting window have 1 row & 3 columns
plot(m_01_inflnc, which=1:2) # plot model-diagnostic plots 1 & 2
hist(residuals(m_01_inflnc), main="Residuals") # add a histogram of the residuals
par(mfrow=c(1, 1))           # reset to default
These plots for m_01_inflnc suggest a somewhat confusing/surprising mix of results:
the left plot for m_01_inflnc is very strongly patterned;
the two other plots have improved considerably.
Again, there simply aren’t always straightforward solutions for data like this …
2 Validation and overfitting
We are dealing with the same data set as in earlier sessions; as a reminder, the data are in _input/rts.csv and you can find information about the variables/columns in _input/rts.r. There, we addressed the question of whether the reaction time to a word (in ms) varies as a function of
the Zipf frequency of that word (ZIPFFREQ);
the language that word was presented in (LANGUAGE: english vs. spanish);
the speaker group that word was presented to (GROUP: english vs. heritage);
any pairwise interaction of these predictors;
the three-way interaction of these predictors.
At the end of our modeling process, which involved logging the response variable RT, we arrived at this final model:
rm(list=ls(all.names=TRUE))
library(car); library(effects)
str(d <- read.delim(        # show the structure of d, the result of loading
  file="_input/rts.csv",    # this file
  stringsAsFactors=TRUE))   # change categorical variables into factors
d$RT_log <- log2(d$RT)
summary(m_final <- lm(RT_log ~ 1 + ZIPFFREQ + LANGUAGE + GROUP + LENGTH +
  ZIPFFREQ:LENGTH + LANGUAGE:GROUP + LANGUAGE:LENGTH,
  data=d, na.action=na.exclude))
Call:
lm(formula = RT_log ~ 1 + ZIPFFREQ + LANGUAGE + GROUP + LENGTH +
ZIPFFREQ:LENGTH + LANGUAGE:GROUP + LANGUAGE:LENGTH, data = d,
na.action = na.exclude)
Residuals:
Min 1Q Median 3Q Max
-1.24964 -0.27871 -0.06224 0.21043 2.55420
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 9.002136 0.207740 43.334 < 2e-16 ***
ZIPFFREQ 0.007370 0.042986 0.171 0.86387
LANGUAGEspanish 0.257516 0.055524 4.638 3.58e-06 ***
GROUPheritage 0.102783 0.013426 7.655 2.16e-14 ***
LENGTH 0.113334 0.038903 2.913 0.00359 **
ZIPFFREQ:LENGTH -0.021360 0.008065 -2.648 0.00811 **
LANGUAGEspanish:GROUPheritage -0.205373 0.019252 -10.668 < 2e-16 ***
LANGUAGEspanish:LENGTH 0.018385 0.010480 1.754 0.07943 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.4233 on 7744 degrees of freedom
(248 observations deleted due to missingness)
Multiple R-squared: 0.1227, Adjusted R-squared: 0.1219
F-statistic: 154.7 on 7 and 7744 DF, p-value: < 2.2e-16
The R-squared value for this is 0.1227 – but how much does this overestimate things, given that it is based on predictions for the very same data set the model was fit to? How could we do this better? We could
1. split our data up into 10 bins;
2. for each of the bins, we
   a. fit a model based on the 90% of the data not in the bin;
   b. make predictions for the held-out remaining 10% of the data in the bin;
   c. collect the resulting R2 in a data structure called collector;
3. compare the R2 from m_final to the distribution of collector:
# step 1:
set.seed(123); d$RAND <- sample(1:10, size=nrow(d), replace=TRUE)
# you could also do this to get a perfectly even distribution:
# d$RAND <- sample(rep(1:10, length.out=nrow(d)))
collector <- rep(-1, 10)
# step 2:
for (i in 1:10) {
  # step 2a: fit the training model
  d_train <- d[d$RAND!=i,]
  m_train <- lm(RT_log ~ 1 + ZIPFFREQ + LANGUAGE + GROUP + LENGTH +
    ZIPFFREQ:LENGTH + LANGUAGE:GROUP + LANGUAGE:LENGTH,
    data=d_train, na.action=na.exclude)
  # step 2b: make predictions for the test data
  d_test <- d[d$RAND==i,]
  d_test$PRED <- predict(m_train, newdata=d_test)
  # step 2c: compute the multiple R-squared
  collector[i] <- cor(d_test$RT_log, d_test$PRED, use="complete.obs")^2
}; rm(d_test, d_train, i, m_train); invisible(gc())
# step 3:
collector %>% summary
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.08413 0.10627 0.11137 0.12269 0.14717 0.16249
The held-out R2 values are centered right around the full-data value of 0.1227, so our final model does not seem to overfit at all!