Ling 202: session 08: diagnostics/validation (key)

UC Santa Barbara & JLU Giessen

Published: 21 May 2025

1 Model assumptions & diagnostics

1.1 Amount of data

1.1.1 Linear regression

In an earlier session, we fit a linear model to answer the question whether the reaction time to a word (in ms) varies as a function of

  • the Zipf frequency of that word (ZIPFFREQ);
  • the language that word was presented in (LANGUAGE: english vs. spanish);
  • the speaker group that word was presented to (GROUP: english vs. heritage);
  • any pairwise interaction of these predictors;
  • the three-way interaction of these predictors.

Let’s reload those data and apply the same transformation(s) etc. that we applied earlier:

rm(list=ls(all.names=TRUE)); library(effects)
str(d <- read.delim(       # summarize d, the result of loading
   file="_input/rts.csv",  # this file
   stringsAsFactors=TRUE)) # change categorical variables into factors
'data.frame':   8000 obs. of  9 variables:
 $ CASE    : int  1 2 3 4 5 6 7 8 9 10 ...
 $ RT      : int  748 408 483 1266 636 395 765 1297 849 315 ...
 $ LENGTH  : int  5 5 6 4 6 6 8 4 5 6 ...
 $ LANGUAGE: Factor w/ 2 levels "english","spanish": 2 2 2 2 2 2 2 2 2 2 ...
 $ GROUP   : Factor w/ 2 levels "english","heritage": 2 2 2 2 2 2 2 2 2 2 ...
 $ CORRECT : Factor w/ 2 levels "correct","incorrect": 1 1 1 1 1 1 1 1 1 1 ...
 $ FREQPMW : int  19 78 233 133 14 67 34 114 2 20 ...
 $ ZIPFFREQ: num  4.28 4.89 5.37 5.12 4.15 ...
 $ SITE    : Factor w/ 3 levels "a","b","c": 2 2 2 2 2 2 3 2 2 1 ...
d$LANGUAGE <- factor(d$LANGUAGE,   # for didactic reasons only, I am
   levels=levels(d$LANGUAGE)[2:1]) # changing the order of the levels
d$RT_log <- log2(d$RT)

And let’s refit our initial model from there:

summary(m_01 <- lm(          # make/summarize the linear model m_01:
   RT ~ 1 + ZIPFFREQ*LANGUAGE*GROUP, # RT ~ these predictors & their interactions
   data=d, na.action=na.exclude))    # those vars are in d, skip NAs

Call:
lm(formula = RT ~ 1 + ZIPFFREQ * LANGUAGE * GROUP, data = d,
    na.action = na.exclude)

Residuals:
   Min     1Q Median     3Q    Max
-489.7 -141.2  -54.5   67.4 3388.8

Coefficients:
                                       Estimate Std. Error t value Pr(>|t|)
(Intercept)                             1145.59      46.97  24.392  < 2e-16 ***
ZIPFFREQ                                 -82.48      10.15  -8.126 5.12e-16 ***
LANGUAGEenglish                         -432.15      69.57  -6.212 5.50e-10 ***
GROUPheritage                           -177.07      64.64  -2.739 0.006171 **
ZIPFFREQ:LANGUAGEenglish                  51.65      14.97   3.450 0.000563 ***
ZIPFFREQ:GROUPheritage                    23.49      14.00   1.677 0.093489 .
LANGUAGEenglish:GROUPheritage            306.90      97.24   3.156 0.001604 **
ZIPFFREQ:LANGUAGEenglish:GROUPheritage   -40.88      20.94  -1.953 0.050891 .
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 250.4 on 7744 degrees of freedom
  (248 observations deleted due to missingness)
Multiple R-squared:  0.09514,   Adjusted R-squared:  0.09432
F-statistic: 116.3 on 7 and 7744 DF,  p-value: < 2.2e-16

Did we have enough data points for what we did there?

How many data points did we model? Careful, it’s not nrow(d)=8000 – it’s nrow(model.frame(m_01))=7752, namely nrow(d)=8000 minus the missing data points! With that, we ask ourselves

  • what is the limiting sample size m? For a linear model, m is simply the sample size, so our critical values are m/10=775.2 and m/20=387.6.
  • what is the number of parameters p that we tried to estimate from the data? The most direct way of getting that number is summary(m_01)$fstatistic[2], which amounts to 7, which is of course the number of the model’s coefficients minus 1 (for the intercept): length(coef(m_01))-1=7.

Since p (7) is less than both m/20=387.6 and m/10=775.2, we’re fine.
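Expressed in code, that check boils down to a few lines (a minimal sketch using only quantities discussed above):

m <- nrow(model.frame(m_01)) # the limiting sample size: 7752
p <- length(coef(m_01))-1    # the number of estimated parameters w/o the intercept: 7
p <= m/20                    # TRUE: we pass even the stricter m/20 threshold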

1.1.2 Binary logistic regression

In session 05, we fit a binary logistic regression model to answer the question whether the choice of a genitive construction (of vs. s) varies as a function of

  • a length-based short-before-long effect – but this time, if we hypothesize a short-before-long effect, we should maybe not just be looking at the length of the possessor, but at how the length of the possessor compares to the length of the possessum;
  • the animacy of the possessor (POR_ANIMACY: animate vs. collective vs. inanimate vs. locative vs. temporal);
  • whether the speakers are non-native speakers or native speakers of English (SPEAKER: nns vs. ns);
  • any pairwise interaction of these predictors;
  • the three-way interaction of these predictors.

Let’s reload those data and apply the same transformation(s) etc. that we applied earlier:

rm(list=ls(all.names=TRUE))
str(d <- read.delim(       # summarize d, the result of loading
   file="_input/genitives.csv", # this file
   stringsAsFactors=TRUE)) # change categorical variables into factors
'data.frame':   3600 obs. of  9 variables:
 $ CASE         : int  2 3 4 5 6 7 8 9 10 11 ...
 $ GENITIVE     : Factor w/ 2 levels "of","s": 1 1 1 1 2 2 2 2 2 2 ...
 $ SPEAKER      : Factor w/ 2 levels "nns","ns": 1 1 1 1 1 1 1 1 1 1 ...
 $ MODALITY     : Factor w/ 2 levels "spoken","written": 1 1 1 1 1 1 1 1 1 1 ...
 $ POR_LENGTH   : int  13 22 11 26 8 7 6 5 5 5 ...
 $ PUM_LENGTH   : int  7 7 8 4 4 3 36 3 5 5 ...
 $ POR_ANIMACY  : Factor w/ 5 levels "animate","collective",..: 2 1 1 2 1 1 1 1 1 1 ...
 $ POR_FINAL_SIB: Factor w/ 2 levels "absent","present": 1 1 2 1 1 1 1 1 1 1 ...
 $ POR_DEF      : Factor w/ 2 levels "definite","indefinite": 1 1 1 1 1 1 1 1 1 1 ...
d$LEN_PORmPUM_LOG <- log2(d$POR_LENGTH)-log2(d$PUM_LENGTH)

And let’s refit our initial model from there:

summary(m_01 <- glm( # make/summarize the gen. linear model m_01:
   GENITIVE ~ 1 + LEN_PORmPUM_LOG*POR_ANIMACY*SPEAKER, # GENITIVE ~ these predictors
   data=d, family=binomial, na.action=na.exclude)) # vars are in d, resp = binary, skip NAs

Call:
glm(formula = GENITIVE ~ 1 + LEN_PORmPUM_LOG * POR_ANIMACY *
    SPEAKER, family = binomial, data = d, na.action = na.exclude)

Coefficients:
                                                Estimate Std. Error z value
(Intercept)                                      0.65632    0.09135   7.184
LEN_PORmPUM_LOG                                 -0.70601    0.08154  -8.659
POR_ANIMACYcollective                           -1.77888    0.16150 -11.015
POR_ANIMACYinanimate                            -4.32698    0.21686 -19.953
POR_ANIMACYlocative                             -2.30616    0.27227  -8.470
POR_ANIMACYtemporal                             -1.32138    0.22536  -5.863
SPEAKERns                                       -0.08278    0.17204  -0.481
LEN_PORmPUM_LOG:POR_ANIMACYcollective           -0.46680    0.15054  -3.101
LEN_PORmPUM_LOG:POR_ANIMACYinanimate             0.32099    0.17729   1.811
LEN_PORmPUM_LOG:POR_ANIMACYlocative             -0.29249    0.25885  -1.130
LEN_PORmPUM_LOG:POR_ANIMACYtemporal              0.44522    0.18375   2.423
LEN_PORmPUM_LOG:SPEAKERns                        0.14245    0.16001   0.890
POR_ANIMACYcollective:SPEAKERns                  0.79493    0.29353   2.708
POR_ANIMACYinanimate:SPEAKERns                  -0.28854    0.51213  -0.563
POR_ANIMACYlocative:SPEAKERns                   -0.50011    0.49785  -1.005
POR_ANIMACYtemporal:SPEAKERns                    0.16161    0.41275   0.392
LEN_PORmPUM_LOG:POR_ANIMACYcollective:SPEAKERns -0.14283    0.28943  -0.493
LEN_PORmPUM_LOG:POR_ANIMACYinanimate:SPEAKERns  -0.65372    0.43153  -1.515
LEN_PORmPUM_LOG:POR_ANIMACYlocative:SPEAKERns   -0.19485    0.50679  -0.384
LEN_PORmPUM_LOG:POR_ANIMACYtemporal:SPEAKERns   -0.53061    0.38091  -1.393
                                                Pr(>|z|)
(Intercept)                                     6.75e-13 ***
LEN_PORmPUM_LOG                                  < 2e-16 ***
POR_ANIMACYcollective                            < 2e-16 ***
POR_ANIMACYinanimate                             < 2e-16 ***
POR_ANIMACYlocative                              < 2e-16 ***
POR_ANIMACYtemporal                             4.54e-09 ***
SPEAKERns                                        0.63039
LEN_PORmPUM_LOG:POR_ANIMACYcollective            0.00193 **
LEN_PORmPUM_LOG:POR_ANIMACYinanimate             0.07022 .
LEN_PORmPUM_LOG:POR_ANIMACYlocative              0.25849
LEN_PORmPUM_LOG:POR_ANIMACYtemporal              0.01539 *
LEN_PORmPUM_LOG:SPEAKERns                        0.37331
POR_ANIMACYcollective:SPEAKERns                  0.00677 **
POR_ANIMACYinanimate:SPEAKERns                   0.57315
POR_ANIMACYlocative:SPEAKERns                    0.31511
POR_ANIMACYtemporal:SPEAKERns                    0.69538
LEN_PORmPUM_LOG:POR_ANIMACYcollective:SPEAKERns  0.62166
LEN_PORmPUM_LOG:POR_ANIMACYinanimate:SPEAKERns   0.12980
LEN_PORmPUM_LOG:POR_ANIMACYlocative:SPEAKERns    0.70062
LEN_PORmPUM_LOG:POR_ANIMACYtemporal:SPEAKERns    0.16362
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 4004.3  on 3599  degrees of freedom
Residual deviance: 2404.3  on 3580  degrees of freedom
AIC: 2444.3

Number of Fisher Scoring iterations: 7

Did we have enough data points for what we did there?

How many data points did we model? This time it is nrow(d)=3600 and of course also nrow(model.frame(m_01))=3600. With that, we ask ourselves

  • what is the limiting sample size m? For a binary response, m is the frequency of the less frequent level of the response, i.e. min(table(d$GENITIVE)), so our critical values are m/10=88 and m/20=44.
  • what is the number of parameters p that we tried to estimate from the data? The most direct way of getting that number is m_01$df.null-m_01$df.residual, which amounts to 19, which is of course the number of the model’s coefficients minus 1 (for the intercept): length(coef(m_01))-1=19.

Since p (19) is less than both m/20=44 and m/10=88, we’re fine.
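The same check in code (a minimal sketch using only quantities discussed above):

m <- min(table(d$GENITIVE))        # the frequency of the rarer genitive: 880
p <- m_01$df.null-m_01$df.residual # the number of estimated parameters w/o the intercept: 19
p <= m/20                          # TRUE: we pass even the stricter m/20 threshold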

1.1.3 Ordinal logistic regression

In session 07, we fit an ordinal logistic regression model to answer the question(s) whether the rating of a Spanish construction involving the semantic role of an experiencer varies as a function of

  • whether there is the right subject-verb agreement (AGREEMENT: no vs. yes);
  • whether the clause is negated (NEGATION: affirmative vs. negative);
  • whether the clause contains an emphatic pronoun (EMPHPRON: no vs. yes);
  • what the order of the semantic and the grammatical subject is (SUBJORDER: gramsem vs. semgram);
  • what the number of the grammatical subject is (NUMPERGRAM: plural vs. singular);
  • what the person of the grammatical subject is (PERSONGRAM: first vs. second vs. third);
  • any interaction of a predictor with AGREEMENT.

Let’s reload those data and apply the same transformation(s) etc. there as we did earlier:

rm(list=ls(all.names=TRUE))
library(car); library(effects); library(MASS); library(magrittr)
str(d <- read.delim(       # summarize d, the result of loading
   file="_input/agreementratings.csv", # this file
   stringsAsFactors=TRUE)) # change categorical variables into factors
'data.frame':   4300 obs. of  8 variables:
 $ CASE      : int  1 2 3 4 5 6 7 8 9 10 ...
 $ JUDGMENT  : int  3 -3 3 3 -3 2 -2 2 -3 3 ...
 $ AGREEMENT : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 2 2 ...
 $ NEGATION  : Factor w/ 2 levels "affirmative",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ EMPHPRON  : Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 2 2 2 2 ...
 $ SUBJORDER : Factor w/ 2 levels "gramsem","semgram": 1 1 1 1 1 1 1 1 2 2 ...
 $ NUMBERGRAM: Factor w/ 2 levels "plural","singular": 1 1 1 1 1 1 1 1 1 1 ...
 $ PERSONGRAM: Factor w/ 3 levels "first","second",..: 3 3 3 3 3 3 3 3 3 3 ...
d$JUDGMENT <- factor(d$JUDGMENT, ordered=TRUE)

And let’s refit our initial model from there:

summary(m_01 <- polr(     # summarize m_01, which models
   JUDGMENT ~ 1 +         # JUDGMENT ~ an overall intercept (1)
   # AGREEMENT interacting w/ all other predictors
   AGREEMENT * (NEGATION+EMPHPRON+SUBJORDER+NUMBERGRAM+PERSONGRAM),
   Hess=TRUE,             # recommended for using summary
   data=d,                # those vars are in d
   na.action=na.exclude)) # skip cases with NA/missing data
Call:
polr(formula = JUDGMENT ~ 1 + AGREEMENT * (NEGATION + EMPHPRON +
    SUBJORDER + NUMBERGRAM + PERSONGRAM), data = d, na.action = na.exclude,
    Hess = TRUE)

Coefficients:
                                   Value Std. Error t value
AGREEMENTyes                     1.00762    0.23047  4.3720
NEGATIONnegative                -0.07963    0.07825 -1.0177
EMPHPRONyes                      0.01820    0.07869  0.2313
SUBJORDERsemgram                 0.26155    0.07833  3.3390
NUMBERGRAMsingular              -0.28824    0.12844 -2.2441
PERSONGRAMsecond                 0.12173    0.10676  1.1402
PERSONGRAMthird                  0.41046    0.16241  2.5273
AGREEMENTyes:NEGATIONnegative    0.06444    0.10980  0.5869
AGREEMENTyes:EMPHPRONyes        -0.14410    0.11020 -1.3077
AGREEMENTyes:SUBJORDERsemgram    0.39130    0.11081  3.5311
AGREEMENTyes:NUMBERGRAMsingular  0.24572    0.18523  1.3266
AGREEMENTyes:PERSONGRAMsecond   -0.05795    0.15453 -0.3750
AGREEMENTyes:PERSONGRAMthird    -0.36910    0.23059 -1.6007

Intercepts:
      Value   Std. Error t value
-3|-2 -1.2230  0.1653    -7.4002
-2|-1 -0.4605  0.1635    -2.8162
-1|0   0.0101  0.1632     0.0617
0|1    0.3675  0.1634     2.2493
1|2    0.7353  0.1637     4.4917
2|3    1.6316  0.1652     9.8743

Residual Deviance: 15099.88
AIC: 15137.88 

Did we have enough data points for what we did there?

How many data points did we model? It again is nrow(d)=4300 and of course also nrow(model.frame(m_01))=4300; for polr objects, we could also get this from m_01$n, which also returns 4300. With that, we ask ourselves

  • what is the limiting sample size m? For an ordinal response, m is computed as m_01$n - (1/m_01$n^2) * sum(table(d$JUDGMENT)^3)=4123.5, so our critical values are m/10=412.4 and m/20=206.2.
  • what is the number of parameters p that we tried to estimate from the data? We can get that directly from the model object with m_01$edf (edf for ‘effective degrees of freedom’), which returns 19.

Since p (19) is less than both m/20=206.2 and m/10=412.4, we’re fine.
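And once more in code (a minimal sketch using only quantities discussed above):

m <- m_01$n - (1/m_01$n^2)*sum(table(d$JUDGMENT)^3) # the limiting sample size: 4123.5
p <- m_01$edf # the number of estimated parameters (incl. the 6 intercepts/thresholds): 19
p <= m/20     # TRUE: we pass even the stricter m/20 threshold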

1.2 Residuals (of a linear model)

1.2.1 Example 1

In an earlier session, we fit a linear model to answer the question whether the reaction time to a word (in ms) varies as a function of the frequency of that word (normalized to per million words, FREQPMW). We first re-load the data and refit the model, …

rm(list=ls(all.names=TRUE))
library(car); library(effects); library(magrittr); library(MASS) # MASS for stdres below
str(d <- read.delim(       # summarize d, the result of loading
   file="_input/rts.csv",  # this file
   stringsAsFactors=TRUE)) # change categorical variables into factors
'data.frame':   8000 obs. of  9 variables:
 $ CASE    : int  1 2 3 4 5 6 7 8 9 10 ...
 $ RT      : int  748 408 483 1266 636 395 765 1297 849 315 ...
 $ LENGTH  : int  5 5 6 4 6 6 8 4 5 6 ...
 $ LANGUAGE: Factor w/ 2 levels "english","spanish": 2 2 2 2 2 2 2 2 2 2 ...
 $ GROUP   : Factor w/ 2 levels "english","heritage": 2 2 2 2 2 2 2 2 2 2 ...
 $ CORRECT : Factor w/ 2 levels "correct","incorrect": 1 1 1 1 1 1 1 1 1 1 ...
 $ FREQPMW : int  19 78 233 133 14 67 34 114 2 20 ...
 $ ZIPFFREQ: num  4.28 4.89 5.37 5.12 4.15 ...
 $ SITE    : Factor w/ 3 levels "a","b","c": 2 2 2 2 2 2 3 2 2 1 ...
summary(m_01 <- lm(               # make/summarize the linear model m_01:
   RT ~ 1 + FREQPMW,              # RT ~ this predictor
   data=d, na.action=na.exclude)) # vars are in d, skip NAs

Call:
lm(formula = RT ~ 1 + FREQPMW, data = d, na.action = na.exclude)

Residuals:
   Min     1Q Median     3Q    Max
-396.3 -156.1  -68.9   72.4 3468.1

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 673.83892    3.65267 184.479  < 2e-16 ***
FREQPMW      -0.14926    0.02546  -5.863 4.73e-09 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 262.5 on 7750 degrees of freedom
  (248 observations deleted due to missingness)
Multiple R-squared:  0.004416,  Adjusted R-squared:  0.004287
F-statistic: 34.37 on 1 and 7750 DF,  p-value: 4.734e-09

… and now we generate three model-diagnostic plots to see whether this model was a ‘good one’:

par(mfrow=c(1, 3))    # make the plotting window have 1 row & 3 columns
plot(m_01, which=1:2) # plot model-diagnostic plots 1 & 2
hist(residuals(m_01), main="Residuals") # add a histogram of the residuals
par(mfrow=c(1, 1))    # reset to default

Obviously, these look terrible! What can we do? There are several ways to address this, some of which we have already discussed; they include, but are not limited to, one or more of the following:

  • doing something about the response variable before the model such as
    • transforming the response variable (e.g., logging or some other transformation);
    • restricting the response variable (e.g., picking shortest densest regions);
  • doing something about the predictor before the model such as
    • transforming the predictor (e.g., logging or some other transformation);
    • allowing the predictor to be (more) curved;
  • doing something about the data points that, after modeling, are accounted for worst, which could be done
    • by determining all data points’ (standardized) residuals with stdres or residuals and
    • excluding those with high absolute (standardized) residuals;
  • adding more predictors (to further reduce the deviance).

Let’s look at each one in isolation, beginning with transforming the response. As you can see, that is already somewhat useful:

# transforming the response variable
d$RTLOG <- log2(d$RT)
m_01.resptransf <- lm(   # make the linear model m_01.resptransf:
   RTLOG ~ 1 + FREQPMW,  # RTLOG ~ this predictor
   data=d, na.action=na.exclude) # vars are in d, skip NAs
par(mfrow=c(1, 3)) # make the plotting window have 1 row & 3 columns
plot(m_01.resptransf, which=1:2) # plot model-diagnostic plots 1 & 2
hist(residuals(m_01.resptransf), main="Residuals") # add a histogram of the residuals
par(mfrow=c(1, 1)) # reset to default

Next we try restricting the response, which seems to work very well, even better than the log transformation:

table(keep <- d$RT>=338 & d$RT<=1042, useNA="ifany") # restricting the response

FALSE  TRUE  <NA>
  552  7200   248 
m_01.resprestr <- lm(    # make the linear model m_01.resprestr:
   RT ~ 1 + FREQPMW,     # RT ~ this predictor
   data=d, subset=keep,  # vars are in d, but use only this subset
   na.action=na.exclude) # skip NAs
par(mfrow=c(1, 3)) # make the plotting window have 1 row & 3 columns
plot(m_01.resprestr, which=1:2) # plot model-diagnostic plots 1 & 2
hist(residuals(m_01.resprestr), main="Residuals") # add a histogram of the residuals
par(mfrow=c(1, 1)) # reset to default

Now what about changing things on the predictor side? For that, we could use the predictor ZIPFFREQ, which is a log-transformed version of FREQPMW – but the result is not very promising!

m_01.predtransf <- lm(           # make the linear model m_01.predtransf:
   RT ~ 1 + ZIPFFREQ,            # RT ~ this predictor
   data=d, na.action=na.exclude) # vars are in d, skip NAs
par(mfrow=c(1, 3)) # make the plotting window have 1 row & 3 columns
plot(m_01.predtransf, which=1:2) # plot model-diagnostic plots 1 & 2
hist(residuals(m_01.predtransf), main="Residuals") # add a histogram of the residuals
par(mfrow=c(1, 1)) # reset to default
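As a quick aside: ZIPFFREQ really is just a log-transformed version of FREQPMW, which you can confirm with a minimal sketch like this (assuming the usual definition of the Zipf scale as log10 of a word’s frequency per billion words):

head(round(log10(d$FREQPMW)+3, 2)) # 4.28 4.89 5.37 5.12 4.15 ...
head(round(d$ZIPFFREQ, 2))         # should return the same values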

We could also allow the predictor to be curved (something we’ll deal with in more detail in 204), but that’s again not great:

m_01.predpoly <- lm(             # make the linear model m_01.predpoly:
   RT ~ 1 + poly(FREQPMW, 3),    # RT ~ this predictor
   data=d, na.action=na.exclude) # vars are in d, skip NAs
par(mfrow=c(1, 3)) # make the plotting window have 1 row & 3 columns
plot(m_01.predpoly, which=1:2) # plot model-diagnostic plots 1 & 2
hist(residuals(m_01.predpoly), main="Residuals") # add a histogram of the residuals
par(mfrow=c(1, 1)) # reset to default

Interim summary: working on the response seems more promising than working on the predictor, and of the two ways of dealing with the response, the restriction strategy seems to be better.

Now, what about trying to fix this after the first bad model? Let’s exclude all data points with absolute standardized residuals >2.5. It seems as if some of the previous approaches worked better than this one:

table(keep <- abs(stdres(m_01)) < 2.5, useNA="ifany")

FALSE  TRUE  <NA>
  229  7523   248 
m_01.stdres <- lm(       # make the linear model m_01.stdres:
   RT ~ 1 + FREQPMW,     # RT ~ this predictor
   data=d, subset=keep,  # vars are in d, but use only this subset
   na.action=na.exclude) # skip NAs
par(mfrow=c(1, 3)) # make the plotting window have 1 row & 3 columns
plot(m_01.stdres, which=1:2) # plot model-diagnostic plots 1 & 2
hist(residuals(m_01.stdres), main="Residuals") # add a histogram of the residuals
par(mfrow=c(1, 1)) # reset to default

Finally, note also that we’re just using a single predictor here, meaning we’re asking that one frequency predictor FREQPMW to do a lot – if other predictors are included, the situation can often change quite a bit for the better. Thus, if one

  • restricts the response (as above), and
  • uses the log-transformed predictor ZIPFFREQ and lets it be curved, and
  • adds the log-transformed control LENGTH and lets it be curved, …

then the diagnostics look fairly good …

table(keep <- d$RT>=338 & d$RT<=1042, useNA="ifany")

FALSE  TRUE  <NA>
  552  7200   248 
m_01.multiple <- lm(            # make the linear m_01.multiple:
   RT ~ 1 + poly(ZIPFFREQ, 2) + # RT ~ ZIPFFREQ (curved)
   poly(LENGTH, 2),             # + a control variable (also curved)
   data=d, subset=keep,         # vars are in d, but use only this subset
   na.action=na.exclude)        # skip cases with NA/missing data (good habit)
par(mfrow=c(1, 3)) # make the plotting window have 1 row & 3 columns
plot(m_01.multiple, which=1:2) # plot model-diagnostic plots 1 & 2
hist(residuals(m_01.multiple), main="Residuals") # add a histogram of the residuals
par(mfrow=c(1, 1)) # reset to default

… and the predictor and the control behave as hypothesized:

plot(effect("poly(ZIPFFREQ, 2)", m_01.multiple), ylim=c(338, 1042), grid=TRUE)

plot(effect("poly(LENGTH, 2)", m_01.multiple), ylim=c(338, 1042), grid=TRUE)

As you can see, there may be no simple solution …

1.2.2 Example 2

Remember that the residuals of a linear model are supposed to be normally distributed and, when plotted against the fitted values, to exhibit constant variance and no particular patterning. With that in mind, let’s check this for m_final_lin from last session, where we forced a linear model onto ordinal data. We first re-load the data and refit the model, …

str(d_ord <- read.delim("_input/agreementratings.csv", stringsAsFactors=TRUE))
'data.frame':   4300 obs. of  8 variables:
 $ CASE      : int  1 2 3 4 5 6 7 8 9 10 ...
 $ JUDGMENT  : int  3 -3 3 3 -3 2 -2 2 -3 3 ...
 $ AGREEMENT : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 2 2 ...
 $ NEGATION  : Factor w/ 2 levels "affirmative",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ EMPHPRON  : Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 2 2 2 2 ...
 $ SUBJORDER : Factor w/ 2 levels "gramsem","semgram": 1 1 1 1 1 1 1 1 2 2 ...
 $ NUMBERGRAM: Factor w/ 2 levels "plural","singular": 1 1 1 1 1 1 1 1 1 1 ...
 $ PERSONGRAM: Factor w/ 3 levels "first","second",..: 3 3 3 3 3 3 3 3 3 3 ...
summary(m_final_lin <- lm(JUDGMENT ~ AGREEMENT * (SUBJORDER + PERSONGRAM) + NUMBERGRAM, data=d_ord))

Call:
lm(formula = JUDGMENT ~ AGREEMENT * (SUBJORDER + PERSONGRAM) +
    NUMBERGRAM, data = d_ord)

Residuals:
    Min      1Q  Median      3Q     Max
-4.8413 -1.8413  0.4113  1.4952  3.6185

Coefficients:
                              Estimate Std. Error t value Pr(>|t|)
(Intercept)                   -0.37752    0.14480  -2.607 0.009160 **
AGREEMENTyes                   1.47360    0.12867  11.453  < 2e-16 ***
SUBJORDERsemgram               0.32528    0.09289   3.502 0.000467 ***
PERSONGRAMsecond               0.20719    0.11839   1.750 0.080177 .
PERSONGRAMthird                0.64459    0.16053   4.015 6.03e-05 ***
NUMBERGRAMsingular            -0.24100    0.10872  -2.217 0.026693 *
AGREEMENTyes:SUBJORDERsemgram  0.32447    0.12975   2.501 0.012427 *
AGREEMENTyes:PERSONGRAMsecond -0.11173    0.15619  -0.715 0.474410
AGREEMENTyes:PERSONGRAMthird  -0.73507    0.16125  -4.559 5.29e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.101 on 4291 degrees of freedom
Multiple R-squared:  0.1219,    Adjusted R-squared:  0.1202
F-statistic: 74.43 on 8 and 4291 DF,  p-value: < 2.2e-16

… and then we plot diagnostic plots:

par(mfrow=c(1, 3)) # make the plotting window have 1 row & 3 columns
plot(main="Residuals vs. fitted values", pch=16, col="#00000010", axes=FALSE,
   xlab="Fitted values", xlim=c(-1, 2), x=jitter(predict(m_final_lin)),
   ylab="Residuals"                   , y=residuals(m_final_lin))
   axis(1, at=seq(-1, 2, 1)); axis(2); grid()
   lines(lowess(residuals(m_final_lin) ~ predict(m_final_lin)), col="red")
qqnorm(residuals(m_final_lin)); qqline(residuals(m_final_lin)); grid()
hist(residuals(m_final_lin), breaks="FD", main="Residuals")
par(mfrow=c(1, 1)) # reset to default

This doesn’t look great: there is structure in the first plot (which of course is in part due to the 7 discrete levels of the response) and neither the middle nor the right plot exactly screams ‘the residuals are normally distributed’. Based on these diagnostic plots, I would advise against adopting the linear model here – yes, it is more straightforward and at least one effect is comparable, but the data don’t really make the linear model seem appropriate.
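If you want a numeric companion to such visual impressions, one quick (if blunt) sketch is a normality test of the residuals – though note that, with 4300 data points, even mild deviations from normality will come out significant:

shapiro.test(residuals(m_final_lin)) # tests H0: the residuals are normally distributed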

1.3 Influential observations

The previous part was concerned with a regression model that may not have passed its diagnostic tests. Apart from “removing data points that the regression can only deal with very badly” (i.e. working with residuals), we can also try and identify which points have a perhaps unduly high influence on the regression line or the predictions, e.g. by identifying data points with extreme dfbeta values (or Cook’s D-values).

rm(list=ls(all.names=TRUE)); library(effects)
str(d <- read.delim(       # summarize d, the result of loading
   file="_input/rts.csv",  # this file
   stringsAsFactors=TRUE)) # change categorical variables into factors
'data.frame':   8000 obs. of  9 variables:
 $ CASE    : int  1 2 3 4 5 6 7 8 9 10 ...
 $ RT      : int  748 408 483 1266 636 395 765 1297 849 315 ...
 $ LENGTH  : int  5 5 6 4 6 6 8 4 5 6 ...
 $ LANGUAGE: Factor w/ 2 levels "english","spanish": 2 2 2 2 2 2 2 2 2 2 ...
 $ GROUP   : Factor w/ 2 levels "english","heritage": 2 2 2 2 2 2 2 2 2 2 ...
 $ CORRECT : Factor w/ 2 levels "correct","incorrect": 1 1 1 1 1 1 1 1 1 1 ...
 $ FREQPMW : int  19 78 233 133 14 67 34 114 2 20 ...
 $ ZIPFFREQ: num  4.28 4.89 5.37 5.12 4.15 ...
 $ SITE    : Factor w/ 3 levels "a","b","c": 2 2 2 2 2 2 3 2 2 1 ...

Does that help for the linear model m_01 we fit with FREQPMW above?

summary(m_01 <- lm(               # make/summarize the linear model m_01:
   RT ~ 1 + FREQPMW,              # RT ~ FREQPMW
   data=d, na.action=na.exclude)) # those vars are in d, skip NAs

Call:
lm(formula = RT ~ 1 + FREQPMW, data = d, na.action = na.exclude)

Residuals:
   Min     1Q Median     3Q    Max
-396.3 -156.1  -68.9   72.4 3468.1

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 673.83892    3.65267 184.479  < 2e-16 ***
FREQPMW      -0.14926    0.02546  -5.863 4.73e-09 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 262.5 on 7750 degrees of freedom
  (248 observations deleted due to missingness)
Multiple R-squared:  0.004416,  Adjusted R-squared:  0.004287
F-statistic: 34.37 on 1 and 7750 DF,  p-value: 4.734e-09
# compute dfbeta values for each coefficient & each observation
summary(dfbetas_m_01 <- dfbeta(m_01))
  (Intercept)            FREQPMW
 Min.   :-4.380e-01   Min.   :-1.676e-03
 1st Qu.:-2.170e-02   1st Qu.:-5.219e-05
 Median :-6.234e-03   Median : 4.671e-06
 Mean   :-3.470e-06   Mean   : 5.230e-08
 3rd Qu.: 8.259e-03   3rd Qu.: 6.134e-05
 Max.   : 4.587e-01   Max.   : 6.253e-03  
# find the values that cut off 5% on each side:
apply(dfbetas_m_01, 2, quantile, probs=c(0.05, 0.95))
    (Intercept)       FREQPMW
5%  -0.04028195 -0.0002967588
95%  0.06873367  0.0001671488
# get ready to keep only the non-extreme values:
keep <- (dfbetas_m_01[,1]>=-0.04028195  & dfbetas_m_01[,1] <= 0.06873367)   &
        (dfbetas_m_01[,2]> -0.000296758 & dfbetas_m_01[,2] <= 0.0001671488)

Seems like we could try and get a better fit if we eliminate the 10% of data points with the most extreme dfbeta values (5% on each side):

summary(m_01_inflnc <- lm( # make/summarize the linear model m_01_inflnc:
   RT ~ 1 + FREQPMW,       # RT ~ this predictor
   data=d, subset=keep,    # vars are in d, but use only this subset
   na.action=na.exclude))  # skip NAs

Call:
lm(formula = RT ~ 1 + FREQPMW, data = d, subset = keep, na.action = na.exclude)

Residuals:
    Min      1Q  Median      3Q     Max
-284.63 -107.34  -28.29   80.79  600.41

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 643.08196    2.32562  276.52   <2e-16 ***
FREQPMW      -0.31856    0.02216  -14.37   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 141.5 on 6561 degrees of freedom
  (248 observations deleted due to missingness)
Multiple R-squared:  0.03053,   Adjusted R-squared:  0.03038
F-statistic: 206.6 on 1 and 6561 DF,  p-value: < 2.2e-16
par(mfrow=c(1, 3)) # make the plotting window have 1 row & 3 columns
plot(m_01_inflnc, which=1:2) # plot model-diagnostic plots 1 & 2
hist(residuals(m_01_inflnc), main="Residuals") # add a histogram of the residuals
par(mfrow=c(1, 1)) # reset to default

These plots for m_01_inflnc paint a slightly confusing/surprising picture:

  • the left plot for m_01_inflnc is very strongly patterned;
  • the two other plots have improved considerably.

Again, there simply aren’t always straightforward solutions for data like this …
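Since Cook’s D-values were mentioned above but not demonstrated, here is a minimal sketch of the same logic using cooks.distance instead of dfbeta; the 4/n cut-off is just one common rule of thumb, and m_01_cooks is of course only an illustrative name:

cooks_m_01 <- cooks.distance(m_01) # one value per data point (NA for missing data)
table(keep <- cooks_m_01 < 4/nrow(model.frame(m_01)), useNA="ifany")
m_01_cooks <- lm(        # make the linear model m_01_cooks:
   RT ~ 1 + FREQPMW,     # RT ~ this predictor
   data=d, subset=keep,  # vars are in d, but use only this subset
   na.action=na.exclude) # skip NAs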

2 Validation and overfitting

We are dealing with the same data set as in earlier sessions; as a reminder, the data are in _input/rts.csv and you can find information about the variables/columns in _input/rts.r. There, we addressed the question of whether the reaction time to a word (in ms) varies as a function of

  • the Zipf frequency of that word (ZIPFFREQ);
  • the language that word was presented in (LANGUAGE: english vs. spanish);
  • the speaker group that word was presented to (GROUP: english vs. heritage);
  • any pairwise interaction of these predictors;
  • the three-way interaction of these predictors.

At the end of our modeling process, which involved logging the response variable RT, we arrived at this final model:

rm(list=ls(all.names=TRUE))
library(car); library(effects)
str(d <- read.delim( # show the structure of d, the result of loading
   file="_input/rts.csv",  # this file
   stringsAsFactors=TRUE)) # change categorical variables into factors
'data.frame':   8000 obs. of  9 variables:
 $ CASE    : int  1 2 3 4 5 6 7 8 9 10 ...
 $ RT      : int  748 408 483 1266 636 395 765 1297 849 315 ...
 $ LENGTH  : int  5 5 6 4 6 6 8 4 5 6 ...
 $ LANGUAGE: Factor w/ 2 levels "english","spanish": 2 2 2 2 2 2 2 2 2 2 ...
 $ GROUP   : Factor w/ 2 levels "english","heritage": 2 2 2 2 2 2 2 2 2 2 ...
 $ CORRECT : Factor w/ 2 levels "correct","incorrect": 1 1 1 1 1 1 1 1 1 1 ...
 $ FREQPMW : int  19 78 233 133 14 67 34 114 2 20 ...
 $ ZIPFFREQ: num  4.28 4.89 5.37 5.12 4.15 ...
 $ SITE    : Factor w/ 3 levels "a","b","c": 2 2 2 2 2 2 3 2 2 1 ...
d$RT_log <- log2(d$RT)
summary(m_final <- lm(RT_log ~ 1 +
   ZIPFFREQ + LANGUAGE + GROUP + LENGTH + ZIPFFREQ:LENGTH +
   LANGUAGE:GROUP + LANGUAGE:LENGTH,
   data=d, na.action=na.exclude))

Call:
lm(formula = RT_log ~ 1 + ZIPFFREQ + LANGUAGE + GROUP + LENGTH +
    ZIPFFREQ:LENGTH + LANGUAGE:GROUP + LANGUAGE:LENGTH, data = d,
    na.action = na.exclude)

Residuals:
     Min       1Q   Median       3Q      Max
-1.24964 -0.27871 -0.06224  0.21043  2.55420

Coefficients:
                               Estimate Std. Error t value Pr(>|t|)
(Intercept)                    9.002136   0.207740  43.334  < 2e-16 ***
ZIPFFREQ                       0.007370   0.042986   0.171  0.86387
LANGUAGEspanish                0.257516   0.055524   4.638 3.58e-06 ***
GROUPheritage                  0.102783   0.013426   7.655 2.16e-14 ***
LENGTH                         0.113334   0.038903   2.913  0.00359 **
ZIPFFREQ:LENGTH               -0.021360   0.008065  -2.648  0.00811 **
LANGUAGEspanish:GROUPheritage -0.205373   0.019252 -10.668  < 2e-16 ***
LANGUAGEspanish:LENGTH         0.018385   0.010480   1.754  0.07943 .
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.4233 on 7744 degrees of freedom
  (248 observations deleted due to missingness)
Multiple R-squared:  0.1227,    Adjusted R-squared:  0.1219
F-statistic: 154.7 on 7 and 7744 DF,  p-value: < 2.2e-16

The R-squared value for this is 0.1227 – but how much does this overestimate the model’s performance, given that it is computed from predictions for the same data set that the model was fit to? How could we do this better? We could

  1. split our data up into 10 bins;
  2. for each of the bins,
    1. fit a model to the (approximately) 90% of the data that are not in the bin;
    2. make predictions for the held-out (approximately) 10% of the data that are in the bin;
    3. collect the resulting R2 in a data structure called collector;
  3. compare the R2 from m_final to the distribution of values in collector:
# step 1:
set.seed(123); d$RAND <- sample(
   1:10,
   size=nrow(d),
   replace=TRUE)
# you could also do this to get a perfectly even distribution:
# d$RAND <- sample(rep(1:10, length.out=nrow(d)))
collector <- rep(-1, 10)
# step 2:
for (i in 1:10) {
   # step 2a: fit the training model
   d_train <- d[d$RAND!=i,]
   m_train <- lm(RT_log ~ 1 +
      ZIPFFREQ + LANGUAGE + GROUP + LENGTH +
      ZIPFFREQ:LENGTH + LANGUAGE:GROUP + LANGUAGE:LENGTH,
      data=d_train, na.action=na.exclude)
   # step 2b: make predictions for the test data
   d_test  <- d[d$RAND==i,]
   d_test$PRED <- predict(m_train, newdata=d_test)
   # step 2c: compute the multiple R-squared
   collector[i] <- cor(d_test$RT_log, d_test$PRED, use="complete.obs")^2
}; rm(d_test, d_train, i, m_train); invisible(gc())
# step 3:
collector %>% summary
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
0.08413 0.10627 0.11137 0.12269 0.14717 0.16249 
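Here, step 3’s comparison is made explicit (both numbers appear in outputs above):

summary(m_final)$r.squared # 0.1227: the apparent R-squared of m_final
mean(collector)            # 0.12269: its cross-validated counterpart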

Our final model does not seem to overfit at all: the mean of the cross-validated R-squared values (0.12269) is virtually identical to the R-squared of m_final (0.1227)!

3 Session info

sessionInfo()
R version 4.5.0 (2025-04-11)
Platform: x86_64-pc-linux-gnu
Running under: Pop!_OS 22.04 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so;  LAPACK version 3.10.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

time zone: America/Los_Angeles
tzcode source: system (glibc)

attached base packages:
[1] stats     graphics  grDevices utils     datasets  compiler  methods
[8] base

other attached packages:
[1] MASS_7.3-65    car_3.1-3      effects_4.2-2  carData_3.0-5  STGmisc_1.02
[6] Rcpp_1.0.14    magrittr_2.0.3

loaded via a namespace (and not attached):
 [1] Matrix_1.7-3       jsonlite_2.0.0     survey_4.4-2       splines_4.5.0
 [5] boot_1.3-31        yaml_2.3.10        fastmap_1.2.0      lattice_0.22-7
 [9] Formula_1.2-5      knitr_1.50         rbibutils_2.3      htmlwidgets_1.6.4
[13] nloptr_2.2.1       insight_1.3.0      nnet_7.3-20        minqa_1.2.8
[17] DBI_1.2.3          rlang_1.1.6        xfun_0.52          estimability_1.5.1
[21] cli_3.6.5          Rdpack_2.6.4       digest_0.6.37      grid_4.5.0
[25] rstudioapi_0.17.1  lme4_1.1-37        nlme_3.1-168       reformulas_0.4.1
[29] evaluate_1.0.3     mitools_2.4        survival_3.8-3     abind_1.4-8
[33] colorspace_2.1-1   rmarkdown_2.29     tools_4.5.0        htmltools_0.5.8.1